Archive

Archive for the ‘Programming’ Category

The $2,000 one line bug fix

August 4th, 2009

Today is the last of 4 days I spent (mostly) on a bug. The actual bug was fixed by removing one line of code, which should have never been there in the first place!

What was happening? Sometimes and randomly, our application was crashing on exit. The crash was always some dodgy memory access violation. This happened after calling a SendMessage() to HWND_BROADCAST… I did not quite understand what was wrong with it, but calling that SendMessage was stupid enough in the first place. I removed it, replaced it by something much better, and claimed to have fixed the bug. Not surprisingly, the next day the QA had another similar crash in DestroyWindow(), albeit much harder to reproduce.

This time was spent doing some proper investigation. A few days later, after unrolling tons and tons of code, I found miles away from the code that was crashing this little jewel:

FrameWidget::~FrameWidget()
{
  m_hWnd = 0;
  if (IsWindow())
    DestroyWindow();
  // etc...
}

Suddenly the whole picture gets clearer. The first line resets the window handle. The next one checks if the handle represents a window. Of course not! So no need to destroy the window object associated to the handle, right?! Then, by the time you get out of the destructor, your message map/callback functions have been destroyed and point to garbage memory… All it takes to crash it now, is one message routed to this window. And thanks to Murphy’s law, it did happen!

What SendMessage() to HWND_BROADCAST and DestroyWindow() have in common, is that they actually did send a message to this phantom window. In case you did not know, DestroyWindow() does purge the message loop…

Why the handle was set to null in the first place is beyond me, but it sure was a bitch to find!

Reference:

Programming, Windows, win32

Don’t use standard library/CRT functions in static initializers/DllMain!

July 30th, 2009

Today, my colleague and  I, we found a bug in OpenSceneGraph. Everybody knows that singletons are evil and may cause cancer,  but that does not prevent OpenSceneGraph from using them all over the place. One of them in particular is initialized in the static context, and happens to call the standard library function getenv(). This was the beginning of all our troubles: from time to time, this seemingly benign initialization would cause a deadlock in our software.

Static initialization occurs for instance when you construct an object outside of any function body by writing something like static ObjectBlah myGlobalVar = new ObjectBlah(). The constructors are called before the entry point of your program (ie the main() function). In the case of a shared library, the static initialization generally occurs when the shared library is loaded. On Windows, the DllMain callback function is used for that purpose.

The rest of this post will detail the specific behavior on Windows, and explain why it is prone to deadlocks. I think it is generally a bad practice to call standard library functions on other platforms too, although I do not have a detailed proof to back my words (yet!).

If you read the MSDN documentation of DllMain, you will realize that this function is quite dangerous to use. I extracted and highlighted some parts below:

Warning There are serious limits on what you can do in a DLL entry point. To provide more complex initialization, create an initialization routine for the DLL. You can require applications to call the initialization routine before calling any other routines in the DLL.

Because Kernel32.dll is guaranteed to be loaded in the process address space when the entry-point function is called, calling functions in Kernel32.dll does not result in the DLL being used before its initialization code has been executed. Therefore, the entry-point function can call functions in Kernel32.dll that do not load other DLLs. For example, DllMain can create synchronization objects such as critical sections and mutexes, and use TLS. Unfortunately, there is not a comprehensive list of safe functions in Kernel32.dll.

Windows 2000: Do not create a named synchronization object in DllMain because the system will then load an additional DLL.

One of the reasons why it is so critical to not use LoadLibrary() or any kernel32.dll functions susceptible to call LoadLibrary() (like user32.dll) is because the process private critical section is locked while DllMain is running. This “loader lock” is taken any time a library is loaded but also when functions like GetModuleHandle() or GetModuleFileName() are used.

At this point you might think it is safe to call a standard library function as long as it does not appear to require anything more than kernel32.dll functions and is not using GetModuleHandle() or GetModuleFileName(). This relies on implementation details of the standard library, so it’s a bit border line, but still might work, right? No.. wrong! Now all you need to cause a deadlock is another lock. Guess what? There are plenty of locks taken in the standard library functions…

Principle: do not call any standard library functions that acquire locks in DllMain().

It’s quite another thing to know whether a function uses advanced kernel functions (unlikely to change) than to know whether it does acquire internal locks (subject to change every time Microsoft releases a new CRT). Therefore the following corollary can be deduced:

Corollary: do not use any standard library functions at all in DllMain() (because you really have no idea whether it will change and acquire a lock in the future).

If you use CRT functions in your DllMain() the following events might occur in the wrong preemption order and cause a deadlock. Imagine that you have one thread that calls LoadLibrary() on your Dll. It acquires the loader lock, then executes DllMain() which finally executes your static initializer. At this stage if you use a CRT function, you will try to acquire an internal CRT lock… Meanwhile in the rest of you program you might want to access another CRT function that happens to do the following: attempt to acquire the same CRT lock, then attempt to call LoadLibrary (because it uses an advanced kernel function that requires a new Dll to be loaded). LoadLibrary will attempt to acquire the loader lock too. Boom!

In practice, the bug we discovered involved the following race condition:

  • Thread A
    1. LoadLibrary() called. Acquire loader lock.
    2. DllMain() executes getenv(). Acquire _ENV_LOCK.
  • Thread B
    1. stat() is called. Acquire _ENV_LOCK
    2. stat() calls GetTimeZoneInformation(), which requires ntdll.dll to be loaded.
    3. Then calls LoadLibrary() to get ntdll. Acquire loader lock.

If your threads are preempted such as 1,3,4,2,5 are executed in sequence, then you have a deadlock!

I discovered a posteriori that Richard Chen and Larry Osterman documented the problem. Although in their case they do not explicitly mention the C runtime, it is obvious that their remarks are relevant once you know that the CRT uses internal locks.

References:

Correction:

  • I previously wrote “the rest of this post will detail the behavior on Windows, but the same general principles are true for other platforms as well.“. This was a gross exageration. I’m still convinced that using libc in static initializers is prone to troubles on other platforms as well. Having it all work correctly depends on so many implementation details: kernel (how are shared objects loaded, how system calls works), libc (how does it interact with the kernel), and the linker (does it load stuff in the right order w.r.t. static initializers).

Programming, Windows, c++, win32

How to use WM_SETICON properly

July 17th, 2009

It’s not rocket science, but you’d be surprised to see how many people get this wrong. Here’s a code snippet of what people usually do:

const HICON hicon = ::LoadIcon(
  ::GetModuleHandle(0),
  MAKEINTRESOURCE(IDR_MAINFRAME)
  );
if (hicon)
{
  ::SendMessage(hwnd, WM_SETICON, ICON_BIG, reinterpret_cast(hicon));
  ::SendMessage(hwnd, WM_SETICON, ICON_SMALL, reinterpret_cast(hicon));
}

This code would set the large icon (as displayed in the ALT-TAB switch window dialog) and the small icon (as displayed in the window title bar or the task bar) of the window (whose handle is hwnd).

Except for the error checking which is not particularly thorough, most people would consider this code to be correct. You could also have used LoadImage with LR_DEFAULTSIZE.

The problem reveals itself with an icon which has a sligthly different design for the small 16×16 size compared to the large 32×32 size. The 16×16 icon will look like a scaled down version of the 32×32 one. The reason why the smaller icon looks different in the first place is usually because the large icon’s design is a bit overloaded and cannot be downscaled without becoming unrecognizable; so you will notice!

What’s wrong? If you dig into the documentation you will found the following information:

  • LoadIcon loads the most appropriate icon, at one fixed size, which is SM_CXICON by SM_CYICON pixels, ie the size of a large icon.
  • The small icon should be SM_CXSMICON by SM_CYSMICON pixels.
  • WM_SETICON for ICON_SMALL would downscale a larger icon to SM_CXSMICON by SM_CYSMICON pixels automatically.
  • This odd behavior is due to backward compatibility with legacy versions of windows, which only had one size of icons.

Therefore the fix is easy, all you need to do is call LoadImage twice and explicitely provide the size of your icon.

const HANDLE hbicon = ::LoadImage(
  ::GetModuleHandle(0),
  MAKEINTRESOURCE(IDR_MAINFRAME),
  IMAGE_ICON,
  ::GetSystemMetrics(SM_CXICON),
  ::GetSystemMetrics(SM_CYICON),
  0);
if (hbicon)
  ::SendMessage(hwnd, WM_SETICON, ICON_BIG, reinterpret_cast(hbicon));

const HANDLE hsicon = ::LoadImage(
  ::GetModuleHandle(0),
  MAKEINTRESOURCE(IDR_MAINFRAME),
  IMAGE_ICON,
  ::GetSystemMetrics(SM_CXSMICON),
  ::GetSystemMetrics(SM_CYSMICON),
  0);
if (hsicon)
  ::SendMessage(hwnd, WM_SETICON, ICON_SMALL, reinterpret_cast(hsicon));

Reference (MSDN):

Programming, Uncategorized, Windows, win32