# An Optimization Exercise

Nick Craver , one of my coworkers at Stack Overflow , tweets out snippets of the Stack Overflow occasionally.  About a week ago he showed off a ContainsToken method which has been tuned for performance.  This set off a little bit of a benchmarking contest , which I participated in.  My final attempt (which is among the fastest) ended up using a lot of tricks, which I think may be of general interest – so I’m going to walk through how the code works.

First, we need to define the problem.  We’re creating a function, which:

• Determines if, when split on a given delimiter, a value string contains a given token
• Ordinal comparison of characters is acceptable (this wasn’t clear from the tweet initially)
• Does not allocate

I’m also allowing for optimizations that exploit Stack Overflow’s production environment, so 64-bit Intel and .NET 4.6+ specific tricks are allowed.  Since Stack Overflow already has some unsafe code in it, unsafe is also acceptable..

My full solution is available on GitHub , my solution is in Program.cs along with several others.  Credit to mattwarren for creating the Benchmark gist .

Here’s my full solution, sans comments, which I’ll be going over line-by-line.

`public static bool ContainsTokenMonty(string value, string token, char delimiter = ';') {   const int charsPerLong = sizeof(long) / sizeof(char);   const int charsPerInt = sizeof(int) / sizeof(char);   const int bytesPerChar = sizeof(char) / sizeof(byte);    if (string.IsNullOrEmpty(token)) return false;   if (string.IsNullOrEmpty(value)) return false;    var delimiterTwice = (delimiter << 16) | delimiter;    var valueLength = value.Length;   var tokenLength = token.Length;    if (tokenLength > valueLength) return false;    int tokenLongs;   bool tokenTrailingInt, tokenTrailingChar;   {     var remainingChars = tokenLength;     tokenLongs = remainingChars / charsPerLong;     tokenTrailingInt = (tokenLength & 0x02) != 0;     tokenTrailingChar = (tokenLength & 0x01) != 0;   }    var tokenByteLength = tokenLength * bytesPerChar;    unsafe   {     fixed (char* valuePtr = value, tokenPtr = token)     {       var curValuePtr = valuePtr;       var endValuePtr = valuePtr + valueLength;        while (true)       {         long* tokenLongPtr = (long*)tokenPtr;         {           for (var i = 0; i < tokenLongs; i++)           {             var tokenLong = *tokenLongPtr;              var valueLong = *((long*)curValuePtr);              if (tokenLong == valueLong)             {               tokenLongPtr++;               curValuePtr += charsPerLong;               continue;             }             else             {               goto advanceToNextDelimiter;             }           }         }          int* tokenIntPtr = (int*)tokenLongPtr;         if (tokenTrailingInt)         {           var tokenInt = *tokenIntPtr;            var valueInt = *((int*)curValuePtr);            if (tokenInt == valueInt)           {             tokenIntPtr++;             curValuePtr += charsPerInt;           }           else           {             goto advanceToNextDelimiter;           }         }          char* tokenCharPtr = (char*)tokenIntPtr;         if (tokenTrailingChar)         {           var tokenChar = *tokenCharPtr;            var valueChar = *curValuePtr;            if (tokenChar == valueChar)           {             tokenCharPtr++;             curValuePtr++;           }           else           {             goto advanceToNextDelimiter;           }         }          if (curValuePtr == endValuePtr || *curValuePtr == delimiter)         {           return true;         }          advanceToNextDelimiter:          while (true)         {           var curVal = *((int*)curValuePtr);            var masked = curVal ^ delimiterTwice;           var temp = masked & 0x7FFF7FFF;           temp = temp + 0x7FFF7FFF;           temp = (int)(temp & 0x80008000);           temp = temp | masked;           temp = temp | 0x7FFF7FFF;           temp = ~temp;           var neitherMatch = temp == 0;            if (neitherMatch)           {             curValuePtr += charsPerInt;             if(curValuePtr >= endValuePtr)             {               return false;             }             continue;           }            var top16 = temp & 0xFFFF0000;           if (top16 != 0)           {             curValuePtr += charsPerInt;              break;           }            var bottom16 = temp & 0x0000FFFF;           if (bottom16 != 0)           {             curValuePtr += 1;           }         }          var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr);         if (remainingBytesInValue < tokenByteLength)         {           return false;         }       }     }   } }`

It’s no accident that this resembles C, performance critical C# often does.  Take note that there are no news (or calls out to functions which may allocate), so this code does zero allocations.

The overall algorithm is as follows:

1. Scan forward in `token` and `value` so long as they are equal
2. If they are not equal, move back to the start of `token` and advance in `value` until after the next `delimiter`
1. If there is no next `delimiter` , return false
3. If the end of `token` is reached…
1. If the next character in `value` is `delimiter` , or the end of `value` has been reached, return true
2. Otherwise, reset to the start of `token`  and advance in `value` until after the next `delimiter`
4. If the end of `value` is reached, return false

Now for the line-by-line.

`const int charsPerLong = sizeof(long) / sizeof(char); const int charsPerInt = sizeof(int) / sizeof(char); const int bytesPerChar = sizeof(char) / sizeof(byte);`

Some convenient constants for later.  These could be hardcoded since long, int,and char are always 8, 4, and 2 bytes long in C# – but I find this style easier to reason about.

`if (string.IsNullOrEmpty(token)) return false; if (string.IsNullOrEmpty(value)) return false;`

This code is copied from the initial tweet, no real optimizations are available here.  The check is important to maintain compatibility with the original code.

`var delimiterTwice = (delimiter << 16) | delimiter;  var valueLength = value.Length; var tokenLength = token.Length;`

Some values I’ll need later.  Doubling the delimiter will be used to test for it in value, and the lengths will be used to for bounds calculations.

`if (tokenLength > valueLength) return false;`

The rest of the code assumes that `token` is no larger than the remaining parts of `value` , so I check it right here to make sure that’s true.  If `token` couldn’t even fit in `value` , we don’t even bother to look at either string and just return false.

`int tokenLongs; bool tokenTrailingInt, tokenTrailingChar; {   var remainingChars = tokenLength;   tokenLongs = remainingChars / charsPerLong;   tokenTrailingInt = (tokenLength & 0x02) != 0;   tokenTrailingChar = (tokenLength & 0x01) != 0; }`

I going to compare `token` to value in 8-byte/4-char/1-long chunks, so I need to know how many `long` s can fit in token.  Because `token` isn’t always a multiple of 4 in length, I also need to see if there are 0, 1, 2, or 3 characters after the `long` s.  Note that the single division is by a constant power-of-two, so it gets compiled into something much faster than actual division.

`var tokenByteLength = tokenLength * bytesPerChar;`

One more useful value before we start actually comparing strings.  Since there will be some pointer math later, knowing the number of bytes (importantly not the number of characters) in `token` will come in handy.

`unsafe {   fixed (char* valuePtr = value, tokenPtr = token)   {     var curValuePtr = valuePtr;     var endValuePtr = valuePtr + valueLength;`

Now for the first scary bits.  The rest of this method is in an unsafe block which means: we can use pointers, and we have to do our own bounds checking with those pointers.

I take and pin pointers to `value` and `token` , make a copy of the value pointer (so it can be mutated, which pointers declared in a fixed block cannot be), and calculate `endValuePtr` so we can use it to do bounds checks on value.

`while (true) {   long* tokenLongPtr = (long*)tokenPtr;   {     for (var i = 0; i < tokenLongs; i++)     {       var tokenLong = *tokenLongPtr;        var valueLong = *((long*)curValuePtr);        if (tokenLong == valueLong)       {         tokenLongPtr++;         curValuePtr += charsPerLong;         continue;       }       else       {         goto advanceToNextDelimiter;       }     }   }`

Here I start comparing the `token` and `value` strings.  Earlier I calculated how many `long` s fit in value, so it’s a simple for loop to make sure they all match.  Each iteration compares 4 characters worth of `value` to `token` , if they match I continue the loop if they don’t we `goto advanceToNextDelimiter` .  This is a big win for larger tokens, having a quarter the comparisons and branches.

Important things to note: the `long*` casts are free, and `curValuePtr` being incremented is faster than you might assume (since `charsPerLong` is a const, the increment value is pre-calculated instead of involving a multiplication).

`int* tokenIntPtr = (int*)tokenLongPtr; if (tokenTrailingInt) {   var tokenInt = *tokenIntPtr;    var valueInt = *((int*)curValuePtr);    if (tokenInt == valueInt)   {     tokenIntPtr++;     curValuePtr += charsPerInt;   }   else   {     goto advanceToNextDelimiter;   } }`

Then essentially the same check, but for a single `int` now – comparing two characters at a time isn’t as good as four but it’s not nothing.  Whether or not this check was necessary was determined earlier, so `tokenTrailingInt` is unchanging across a single call – this is good for branch prediction.

`char* tokenCharPtr = (char*)tokenIntPtr; if (tokenTrailingChar) {   var tokenChar = *tokenCharPtr;    var valueChar = *curValuePtr;    if (tokenChar == valueChar)   {     tokenCharPtr++;     curValuePtr++;   }   else   {     goto advanceToNextDelimiter;   } }`

And then again for a single `char` .

`if (curValuePtr == endValuePtr || *curValuePtr == delimiter) {   return true; }`

If this chunk of code is reached then every character in `token` has been matched, all that’s left is to make sure that either a `delimiter` or the end of `value` has been reached.  If it has then we’ve got a match and return true, otherwise we fall through to `advanceToNextDelimiter` .

`advanceToNextDelimiter:  while (true) {   var curVal = *((int*)curValuePtr);    var masked = curVal ^ delimiterTwice;   var temp = masked & 0x7FFF7FFF;   temp = temp + 0x7FFF7FFF;   temp = (int)(temp & 0x80008000);   temp = temp | masked;   temp = temp | 0x7FFF7FFF;   temp = ~temp;   var neitherMatch = temp == 0;    if (neitherMatch)   {     curValuePtr += charsPerInt;     if(curValuePtr >= endValuePtr)     {       return false;     }     continue;   }`

This code advances `curValuePtr` forward until a `delimiter` is found.  There’s a clever combination of tricks here to allow checking two characters with one branch.

The first trick is advancing two characters at a time (thus the `int*` cast), and being willing to advance 1 character  past the end of `value` .  This works because C# guarantees that a string coerced to a char* is terminated with a null  character (not a null byte), so there’s always two additional bytes at the end of the string for us to read into whether we need them or not.

The second trick is a series of bit twiddlings, as follows:

1. XOR `delimiterTwice` into `curVal` , making the the top or bottom char-sized chunks of `masked` all 0 if they match `delimiter`
• `delimiterTwice` was precalculated
2. AND `masked` with `0x7FFF7FFF` , clearing the high bits in the top and bottom char-sized chunks in `temp` , store the result in `temp`
3. ADD `0x7FFF7FFF` to `temp` , causing the high bits in the top and bottom char-sized chunks of `temp` to be 1 if  any  of the low bits in those respective chunks are set
4. AND `temp` with `0x80008000` , clearing all but the high bits in the top and bottom char-sized chunks in `temp`
5. OR `temp` with `masked` , making sure the high bits in each char-sized chunk are set in `temp` when the high-bits in `masked` are.
6. OR `temp` with `0x7FFF7FFF` , set all the low bits in the char-sized chunks in `temp` which will make temp all 1s if both char-sized chunks in `curVal`   do not match delimiter
7. INVERT `temp` , making it all 0s if step #5 made it all 1s
8. COMPARE `temp` against 0, setting `neitherMatch` if so

This is followed up with an if that’s taken if `neitherMatch` is set in which I increment the pointer into `value` by two `char` s.  There’s then a bounds check, comparing `curValPtr` to `endValuePtr` (which was pre-calculated for this purpose) and false is returned if the bounds have been exceeded.  Note that `endValuePtr` points at the null-terminator, so bailing when `curValPtr` is equal to it is correct.

`var top16 = temp & 0xFFFF0000; if (top16 != 0) {   curValuePtr += charsPerInt;    break; }  var bottom16 = temp & 0x0000FFFF; if (bottom16 != 0) {   curValuePtr += 1; }`

If this code is run a `delimiter` has been found but `curValuePtr` isn’t pointing to that character, just to the int that contains it.   `top16` and `bottom16` contain the top and bottom `char` s from `temp` , whichever one isn’t all zeros is the one that contained the  `delimiter` .  Depending on which is set, `curValuePtr` is incremented so that it points at the first character  after the `delimiter` .

Note because we’re executing on a little-endian architecture but C# is big-endian the increment sizes here appear to be flipped.  Also remember that `token` cannot be empty, and that is checked at the start of the method, so checking the second `char` (in `top16` ) first isn’t an error.

`var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr); if (remainingBytesInValue < tokenByteLength) {   return false; }`

This code runs once the next `delimiter` has been found, and `curValuePtr` advanced to one `char` past it.  It is necessary to reassert that there’s enough space remaining in `value` for `token` to fit, just as was asserted back towards the start of the method.

The space check is done in `byte` s instead of `char` s (the casts are free) because counting the number of `char` s between two pointers is a subtraction followed by a division, but counting the number of `byte` s is only a subtraction.  Pre-calculating `tokenByteLength` was done for this purpose.

Then the while loop begins again, and continues until a match is found or value is exhausted.

Aaaaaaand that’s it.  Performance wise, this is about 15x faster than the original ContainsToken code and something like 40% faster than than the next best implementation.

I’m not sure I’d encourage anyone to use this code, as relies on several subtleties which make it hard to understand.  It also isn’t tuned for 32-bit architectures, or correct on big-endian ones.  Both could be fixed, but require some build tweaks which aren’t particularly interesting or portable themselves.

It sure was fun to write though.