I was reviewing a hang dump for really important service with WinDbg recently, externally it was obvious that our responses were extremely slow and just a casual look at the CPU showed it was hovering between 70 and 90%.
Whenever I see sporadically slow running applications where the data layer is not causing issues I tend to first look at the Thread pool. The command is as follows:
Thread pool command shows the CPU and the threads that are currently in use:
CPU utilization: 81%
Worker Thread: Total: 1 Running: 0 Idle: 1 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 0
Number of Timers: 1
Completion Port Thread:Total: 2 Free: 2 MaxFree: 8 CurrentLimit: 2 MaxLimit: 1000 MinLimit: 4
Now the machine.config is where MaxLimit of 32767 threads and we are currently using none of them, which means our thread pools are not exhausted or over taxed (that max number is weird by the way).
The CPU is infinitely more interesting in this case, because as a rule if the CPU is over 80% new threads do not get allocated. It was also really interesting that we happened to catch this hang dump at exactly 81% because the garbage collector artificially also raises the CPU to 81% to prevent thread allocation, obviously if we cannot allocate threads we cannot get work done.
We can run the following command and look at the stack of each thread:
0cbeeae0 7c822124 77e6baa8 000003ec 00000000 ntdll!KiFastSystemCallRet
0cbeeae4 77e6baa8 000003ec 00000000 00000000 ntdll!NtWaitForSingleObject+0xc.
0cbeeb54 77e6ba12 000003ec ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
0cbeeb68 791fed5c 000003ec ffffffff 791fe901 kernel32!WaitForSingleObject+0x12
0cbeeb74 791fe901 00000001 025372e8 00000001 mscorsvr!GCHeap::WaitUntilGCComplete+0x4f
Ah see! This particular issue is being caused by a GC (WaitUntilGCComplete) so the problem may be that we are being forced to perform too many garbage collections, maybe because of some poorly written code. To confirm this theory over time I started looking at performance counters, specifically ".NET CLR Memory->% Time in GC" which happened to be way over 60%, not good.