Kestrel is the new cross-platform .NET web server (based on libuv) which runs on Linux, Mac and Windows 10 and will, eventually, run on Raspberry Pi. One of the outstanding improvements is the sheer speed. By some measures it is about 20 times faster than ASP.NET running on IIS. This is clearly amazing, and having been a steward of a high performance platform myself, I was curious about some of the changes that were introduced to accelerate performance in such a drastic way.

The most important changes from my standpoint were the ideas behind reducing Garbage Collection pressure and taking advantage of advanced CPU instructions. To be clear, there are other changes, but these are the ones I understood most clearly and could envision applying to existing apps.

Quick Garbage Collection Primer

Most developers I converse with are fully aware that .NET garbage collection (GC) is organized into generations (0, 1 and 2). However, the fact that the GC further divides objects into small and large object heaps is occasionally missed. When an object is large (85000 bytes or more), some attributes and actions associated with it become more significant than if the object were small. For instance, compacting it, meaning copying its memory elsewhere on the heap, is considered an expensive operation for larger objects. I tend to think of garbage collection in the following logical and physical layouts:

Logical view of the GC Heap

  • Generation 0 - For short lived objects
  • Generation 1 - Objects that have survived gen 0, this is a buffer between gen 0 and the long lived gen 2
  • Generation 2 - Objects that have survived gen 1, and large objects (≥ 85000 bytes)

Physical view of the managed heap segment

  • Small object heap (SOH) < 85000 bytes [starts in gen 0]
  • Large object heap (LOH) ≥ 85000 bytes [starts in gen 2, because compaction is expensive]
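
The two views above can be observed from code. The following is a minimal sketch, not something taken from Kestrel: the class and method names are made up for illustration, and the only runtime call it relies on is GC.GetGeneration.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class GcHeapDemo
{
    // Allocates a byte array of the requested size and returns the
    // generation the GC reports for it immediately after allocation.
    public static int GenerationOfNewArray(int sizeInBytes)
    {
        var buffer = new byte[sizeInBytes];
        return GC.GetGeneration(buffer);
    }
}
```

Calling GcHeapDemo.GenerationOfNewArray(1000) will normally report generation 0, while GcHeapDemo.GenerationOfNewArray(100000) reports generation 2, because an array that size is placed on the large object heap from the start.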

A GC occurs when any one of the following three conditions is met:

  • Allocation exceeds the generation 0 or large object threshold
  • System is in a low memory situation
  • System.GC.Collect is called manually

Reducing Garbage Collection Pressure

Having memory managed on our behalf helps us immeasurably, but the process is not cost free, so understanding the GC design is essential for building efficient application servers. We need to be concerned with reducing GC pressure, by which I mean reducing how often the conditions that trigger a GC occur. One really clever way to do this is to avoid continuously allocating strings (which begin in gen 0) by keeping the data as bytes and ensuring that those bytes live in the LOH by using a Memory Pool.

Remember, the goal is to avoid objects being promoted through the generations unnecessarily (which is really CPU intensive). So in Kestrel all the strings that are important to the HTTP request/response life cycle (GET, POST, HEAD, etc.) are created as static bytes, and the buffers used for requests are made part of a contiguous Memory Pool which is then pinned. Pinning the Memory Pool simply prevents the memory from being moved around, which frees the GC from the responsibility of constantly checking whether it needs to be decommitted, further reducing GC pressure. During a normal gen 2 collection the GC will take the opportunity to release segments that have no live objects on them back to the OS (by calling VirtualFree), but for pinned objects this is skipped.
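
To make the idea concrete, here is a minimal sketch of a pinned pool. It is not Kestrel's MemoryPool: the class name, block sizes and API are assumptions made for illustration, it is not thread safe, and it only uses GCHandle and ArraySegment from the base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical, simplified pool: one large slab, pinned once,
// handed out to callers in fixed-size blocks.
public sealed class PinnedSlabPool : IDisposable
{
    private readonly byte[] _slab;
    private readonly GCHandle _pin;
    private readonly int _blockSize;
    private readonly Stack<int> _freeOffsets = new Stack<int>();

    public PinnedSlabPool(int blockSize, int blockCount)
    {
        _blockSize = blockSize;
        // One allocation; at 85000 bytes or more it lands on the LOH.
        _slab = new byte[blockSize * blockCount];
        // Pin the slab so the GC never relocates it.
        _pin = GCHandle.Alloc(_slab, GCHandleType.Pinned);
        // Push offsets in reverse so the first lease starts at offset 0.
        for (int i = blockCount - 1; i >= 0; i--)
        {
            _freeOffsets.Push(i * blockSize);
        }
    }

    public int AvailableBlocks
    {
        get { return _freeOffsets.Count; }
    }

    // ArraySegment<byte> is a struct, so leasing a block allocates nothing.
    public ArraySegment<byte> Lease()
    {
        if (_freeOffsets.Count == 0)
        {
            throw new InvalidOperationException("Pool exhausted");
        }
        return new ArraySegment<byte>(_slab, _freeOffsets.Pop(), _blockSize);
    }

    // Returning a block just makes its offset available again.
    public void Return(ArraySegment<byte> block)
    {
        _freeOffsets.Push(block.Offset);
    }

    public void Dispose()
    {
        if (_pin.IsAllocated)
        {
            _pin.Free();
        }
    }
}
```

A pool created with new PinnedSlabPool(4096, 32) allocates a single 128 KB array up front. After that, leasing and returning blocks only moves integer offsets around, so the request path adds no new objects for the GC to track.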

[Figure: Large Object Heap (LOH) graph]

Avoiding Strings

All HTTP requests arrive at the designated ports as bytes, and normally we would go about the process of converting them to strings. As I noted, however, Kestrel has gone about the business of defining known strings as bytes, so all common comparisons become an integer comparison rather than a string comparison. Luckily enough, the HTTP verbs (including their trailing space) are all 8 bytes or less, so you can define each of them with a static long. This means Kestrel can process many requests without dealing with strings. No strings means no allocations and no deallocations, which means reduced GC pressure … yay.

So when you retrieve a POST from the wire you can do a bitwise compare against a statically assigned constant (actual code is here):

public const string HttpGetMethod = "GET";
public const string HttpPostMethod = "POST";
private readonly static long _httpGetMethodLong = GetAsciiStringAsLong("GET \0\0\0\0");
private readonly static long _httpPostMethodLong = GetAsciiStringAsLong("POST \0\0\0");

/// <summary>
/// Checks that up to 8 bytes from <paramref name="begin"/> correspond to a known HTTP method.
/// </summary>
public static bool GetKnownMethod(this MemoryPoolIterator begin, out string knownMethod) 
{ 
    knownMethod = null; 
    var value = begin.PeekLong(); 
    
    if ((value & _mask4Chars) == _httpGetMethodLong) 
    { 
        knownMethod = HttpGetMethod; 
        return true; 
    } 
    foreach (var x in _knownMethods) 
    { 
        if ((value & x.Item1) == x.Item2) 
        { 
            knownMethod = x.Item3; 
            return true; 
        } 
    } 
    return false; 
}

private readonly static Tuple<long, long, string>[] _knownMethods = new Tuple<long, long, string>[8];

static MemoryPoolIteratorExtensions() 
{ 
    // ...
    _knownMethods[1] = Tuple.Create(_mask5Chars, _httpPostMethodLong, HttpPostMethod);
    // ...
}
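
The excerpt above leaves out GetAsciiStringAsLong and the mask constants, so here is a self-contained sketch of how such helpers might work. Everything in it (AsciiLongDemo, AsciiToLong, MaskForChars, PeekLong, MatchMethod) is a hypothetical stand-in rather than Kestrel's code, and it packs bytes with shifts instead of reading memory directly.

```csharp
using System;

// Hypothetical stand-ins for the helpers the excerpt does not show.
public static class AsciiLongDemo
{
    private static readonly long GetLong = AsciiToLong("GET \0\0\0\0");
    private static readonly long PostLong = AsciiToLong("POST \0\0\0");
    private static readonly long Mask4 = MaskForChars(4);
    private static readonly long Mask5 = MaskForChars(5);

    // Packs exactly 8 ASCII characters into a long,
    // with the first character in the lowest byte.
    public static long AsciiToLong(string s)
    {
        if (s.Length != 8)
        {
            throw new ArgumentException("Expected exactly 8 characters", nameof(s));
        }
        long value = 0;
        for (int i = 0; i < 8; i++)
        {
            value |= (long)(byte)s[i] << (8 * i);
        }
        return value;
    }

    // Builds a mask that keeps only the lowest `count` bytes of a long.
    public static long MaskForChars(int count)
    {
        return count >= 8 ? -1L : (1L << (8 * count)) - 1;
    }

    // Reads the first 8 bytes of the buffer using the same packing as AsciiToLong.
    // Assumes the buffer holds at least 8 bytes.
    public static long PeekLong(byte[] buffer)
    {
        long value = 0;
        for (int i = 0; i < 8; i++)
        {
            value |= (long)buffer[i] << (8 * i);
        }
        return value;
    }

    // Returns "GET" or "POST" when the request line starts with that method
    // followed by a space, otherwise null. The returned strings are literals,
    // so nothing is allocated per request.
    public static string MatchMethod(byte[] requestLine)
    {
        long value = PeekLong(requestLine);
        if ((value & Mask4) == GetLong) return "GET";
        if ((value & Mask5) == PostLong) return "POST";
        return null;
    }
}
```

The mask is what lets a 4 or 5 byte method be compared against an 8 byte read: the bytes that belong to the URL are zeroed out before the comparison, and the padded constant has zeros in the same positions.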

Why do this when a simple IndexOf, Compare or EndsWith method exists? Well, at this layer you are obligated to care about every allocation because each one has a direct consequence on the overall speed. Every microsecond counts.

CPU Instructions

It has been a long time since I have been directly concerned with CPU instructions. In the meantime Intel has introduced Advanced Vector Extensions (AVX), which allow Single Instruction Multiple Data (SIMD) operations on Intel architecture CPUs. In simple terms this means Kestrel can look at more than one byte at a time in a single CPU instruction. Traditionally, to do this you would need to write your code in Assembly language … ugh.

Not to worry, .NET Core uses the RyuJIT compiler (also used by .NET 4.6), which can compile ordinary C# written against the vector types into SIMD instructions such as AVX (check out System.Numerics.Vector).

So what does this really mean? It permits you to perform operations on data wider than the general purpose registers of the CPU. On a 64 bit CPU you can actually perform 128 bit operations by using CPU extensions (256 bits with AVX, and up to 512 bits with AVX-512). With 128 bits you can operate on 16 bytes or 2 longs at a time rather than looping through individual bytes as they are retrieved from the wire.
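
As an illustration of the idea, here is a minimal sketch that searches a buffer for a byte one vector at a time using System.Numerics.Vector. It is not Kestrel's code: the class and method names are made up, and how many bytes each step covers depends on what the hardware and the JIT support.

```csharp
using System;
using System.Numerics;

// Hypothetical example, for illustration only.
public static class VectorSearchDemo
{
    // Returns the index of the first occurrence of `target` in `buffer`, or -1.
    public static int IndexOf(byte[] buffer, byte target)
    {
        // Every lane of this vector holds the byte we are looking for.
        var targetVector = new Vector<byte>(target);
        // Number of bytes processed per step (for example 16 or 32).
        int width = Vector<byte>.Count;
        int i = 0;

        // Compare a whole vector of bytes per iteration.
        for (; i <= buffer.Length - width; i += width)
        {
            var chunk = new Vector<byte>(buffer, i);
            if (Vector.EqualsAny(chunk, targetVector))
            {
                // A lane matched; locate the exact position within this chunk.
                for (int j = 0; j < width; j++)
                {
                    if (buffer[i + j] == target) return i + j;
                }
            }
        }

        // Fall back to a scalar loop for the bytes that do not fill a vector.
        for (; i < buffer.Length; i++)
        {
            if (buffer[i] == target) return i;
        }
        return -1;
    }
}
```

Searching for a carriage return in a request buffer with VectorSearchDemo.IndexOf(buffer, (byte)'\r') skips over chunks with no match in a single comparison each, and only drops to byte-by-byte work once a chunk is known to contain the target.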

Summary

Understanding the Garbage Collection process is critical to building high performance platforms like Kestrel, and while this kind of design consideration is an edge case for most of us, understanding the basics of GC can improve even the simplest modern applications. Kestrel's performance (clocked at 5 million requests a second, I believe) is a testament to a dedicated Microsoft team and its commitment to collaborating with the open source community.
