Friday, April 27, 2012

LongString: When things become unequal (comparing two long strings) – Part 2

 

Hi All,

In Part 1 of the LongString post I have shown how we can tackle the problem of running out of memory when manipulating long string using a class similar to the LongString. In Part 2 I will discuss a problem which arises from my implementation of LongString class when implementing the IEquatable<LongString> interface. As you may recall, the LongString is an aggregation of regular strings which are less or equal than a given value. The direct result of this definition is that two equal long strings may have a completely different representation (even if the given value is the same). I will discuss editing operations in Part 3 of this series but you may consider the case where for the given value of 3 the long string (“AB”,”CD”) is equal to the long string (“ABC”,”D”) but they have a different internal representation.

Implementing Equality

The first thing I did was to check how the framework string implements the equality method. You can see the code below but essentially since the strings are represented as a continues memory block the code simply goes over all the bytes and bitwise compares them (with a small optimization which I will not discuss).

  1. private unsafe static bool EqualsHelper(string strA, string strB)
  2.     {
  3.       int i = strA.Length;
  4.       if (i != strB.Length)
  5.       {
  6.         return false;
  7.       }
  8.       char* ptr = &strA.m_firstChar;
  9.       char* ptr2 = &strB.m_firstChar;
  10.       while (i >= 12)
  11.       {
  12.         if (*(long*)ptr != *(long*)ptr2 || *(long*)(ptr + (IntPtr)8 / 2) != *(long*)(ptr2 + (IntPtr)8 / 2) || *(long*)(ptr + (IntPtr)16 / 2) != *(long*)(ptr2 + (IntPtr)16 / 2))
  13.         {
  14.         IL_7D:
  15.           while (i > 0 && *(int*)ptr == *(int*)ptr2)
  16.           {
  17.             ptr += (IntPtr)4 / 2;
  18.             ptr2 += (IntPtr)4 / 2;
  19.             i -= 2;
  20.           }
  21.           return i <= 0;
  22.         }
  23.         ptr += (IntPtr)24 / 2;
  24.         ptr2 += (IntPtr)24 / 2;
  25.         i -= 12;
  26.       }
  27.       goto IL_7D;
  28.     }

I wanted to implement something similar and therefore my first implementation was like this:

  1. public bool EqualsVerySlow(LongString other)
  2. {
  3.   if (Length != other.Length) return false;
  4.   for (int i = 0; i < Length; i++)
  5.   {
  6.     if (this[i] != other[i])
  7.     { return false; }
  8.   }
  9.   return true;
  10. }

But then I discovered that I didn’t implement the indexer yet so I simply use the method from Part 1 which returns the internal indices (within the array of strings and in a string) GetItemIndexForCharIndex which resulted in this simple code:

  1. public char this[int index]
  2. {
  3.   get
  4.   {
  5.     Tuple<int, int> innerIndex = GetItemIndexForCharIndex(index);
  6.     return _items[innerIndex.Item1][innerIndex.Item2];
  7.   }
  8. }

The main problem here is that the implementation of GetItemIndexForCharIndex calculates the indices by going over the entire array of strings. This is OK for the indexer but it is not very good when used in a loop therefore it is time to implement the enumerator. A simple enumerator which “remembers” the current position and moves forward should be enough for this case.

  1. class LongStringEnumerator : IEnumerator<char>
  2. {
  3.   LongString _innerString;
  4.   int _currentIndex;
  5.   int _currentCharIndex;
  6.  
  7.   public LongStringEnumerator(LongString target)
  8.   {
  9.     _innerString = target;
  10.     _currentIndex = 0;
  11.     _currentCharIndex = 0;
  12.   }
  13.  
  14.   public char Current
  15.   {
  16.     get
  17.     { return _innerString._items[_currentIndex][_currentCharIndex]; }
  18.   }
  19.  
  20.   public void Dispose(){}
  21.  
  22.   object System.Collections.IEnumerator.Current
  23.   {
  24.     get
  25.     { return Current; }
  26.   }
  27.  
  28.   public bool MoveNext()
  29.   {
  30.     _currentCharIndex++;
  31.     if (_currentCharIndex >= _innerString._items[_currentIndex].Length)
  32.     {
  33.       _currentCharIndex = 0;
  34.       _currentIndex++;
  35.       if (_currentIndex >= _innerString._items.Count)
  36.       {
  37.         return false;
  38.       }
  39.     }
  40.     return true;
  41.   }
  42.  
  43.   public void Reset()
  44.   {
  45.     _currentCharIndex = 0;
  46.     _currentIndex = 0;
  47.   }
  48. }

Now I can fix the Equal code to use the enumerator instead of the indexer.

  1. public bool Equals(LongString other)
  2. {
  3.   if (Length != other.Length)
  4.     return false;
  5.  
  6.   LongStringEnumerator enumerator = new LongStringEnumerator(this);
  7.   LongStringEnumerator enumerator2 = new LongStringEnumerator(other);
  8.  
  9.   do
  10.   {
  11.     if (enumerator.Current != enumerator2.Current)
  12.       return false;
  13.     enumerator2.MoveNext();
  14.   }
  15.   while (enumerator.MoveNext());
  16.  
  17.   return true;
  18. }

One question remains… How bad is this solution? I have run some performance tests and it turns out that this equals method is about 30 to 50 times slower than the equals method of a regular string. In absolute terms, this equals method takes up to 20ms when comparing two 1,000,000 char long strings with only one different character (in a random position). The original method is so fast that it cannot be measured in ms Smile.

Conclusion

The main problem with the long string is going to be performance (in terms of time) compared to the regular string. We see that if an algorithm required mainly comparisons then the LongString is not a good choice. Still, remember that the LongString was designed for a very specific scenario in which it should perform better. In the next part I will implement some editing methods.

A small note about concurrency. Just out of curiosity I have implemented the slow method using the TPL:

  1. public bool EqualsSlow(LongString other)
  2. {
  3.   if (Length != other.Length)
  4.     return false;
  5.  
  6.   bool isEqual = true;
  7.   Parallel.For(0, Length, (i, loopState) =>
  8.   {
  9.     if (this[i] != other[i])
  10.     {
  11.       isEqual = false;
  12.       loopState.Break();
  13.     }
  14.  
  15.   });
  16.  
  17.   return isEqual;
  18. }

This was indeed faster in some cases over the very slow implementation but I cannot compete with the concurrency done in the CPU (which will be used to run the original Equals method of string). This implementation may give you some ideas on how to improve the Equals implementation of LongString (e.g. run a forward and a backward enumerators in separate tasks).

Thank you for reading,

Boris

Saturday, April 14, 2012

LongString: A long string for the small heap – Part 1

 

Hi All,

Today I am going to write about strings. I think that the most famous feature of the .Net string is its immutability and this property of the string is a source of comfort in many cases but also a source of pain in others. In this post I will describe a scenario where the immutability of a string may cause a serious degradation of performance and suggest a LongString class (with a partial implementation) which may somehow help in this case. To understand the problem in the scenario I am about to describe we have to throw in another .Net player – The large object heap (LOH for short). If you are unsure of how the LOH works you can read this Microsoft article but I will give a brief description of the relevant parts. The LOH is an additional heap used by the CLR to store large objects (and sometimes other object). How large? Currently (.Net 4) the magic number is 85,000 bytes and above. Every object allocated by the CLR which takes more than this number of byes will not be allocated on the regular heap but on the LOH. The problem is that the LOH has two undesirable properties:

  1. The objects on the LOH are always Generation 2 (i.e. you need a full GC run to remove them from memory).
  2. The memory in the LOH is not compacted because the CLR doesn’t want to copy such large objects around the memory. This may cause fragmentation in the LOH and sometimes to an OutOfMemory exception (even when there is plenty of memory remaining).

Now for the problematic scenario. Consider the case where you have a string which can be edited by the user and an internal mechanism that stores that string. Lets assume that for every change done by the user you must update some other data and display that data to the user immediately. In the general case this will not be a big problem because the strings will be pretty small. For every change you will be able to “edit” the internal string (by replacing it with a new instance) and the old instance will be garbage collected, probably as Generation 0. The problem begins when your string exceeds the magic number or actually half of the magic number since string characters are in Unicode. Then every generation of a new string will create a new instance on the LOH and the old instances will not be removed until full GC is run. If you are not careful enough and the string undergoes several manipulations then each such cycle may cause several objects to be allocated on the LOH and create fragmentation.

At this point some of you would stop me and say “Then use a StringBuilder, it is editable”. This is true, the StringBuilder is indeed a sort of editable version of the string class but internally it represents the data as a character array so the same problem remains (although not as severely as with String since the array will be reallocated much less).

I am going to propose an initial implementation of a LongString class which will be only rarely allocated on the LOH. The idea is simple, the LongString will keep an array of strings, each string is of length equal or less than a given number. It looks something like this:

  1. public class LongString
  2. {
  3.   private int _runSize;
  4.   private List<string> _items;
  5.   private int _length;
  6.  
  7.   public LongString(string data, int runSize)
  8.   {
  9.     _runSize = runSize;
  10.     _items = SplitString(data, runSize);
  11.     _length = GetTotalLength();
  12.   }
  13.  
  14.   private int GetTotalLength()
  15.   {
  16.     int count = 0;
  17.     foreach (string item in _items)
  18.     {
  19.       count += item.Length;
  20.     }
  21.  
  22.     return count;
  23.   }

Note that for brevity reasons I have omitted all the range/null checks from the code. Those checks are very important and mandatory in any code!

So, what are the pros and what are the cons of this implementation? The main advantage is that with a high enough runSize we can avoid the LOH altogether. The only part of the object that may grow too large is the items list but for run size of 100 we would be able to represent a string 100 times larger than we could with the regular string class without it being allocated on the LOH. I doubt that in any practical use you will have strings which are over 400Mb big (and you can always increase the runSize even more). The main disadvantage is that we don’t use the regular string class which means compatibility issue with any other existing code and some time and space penalties for all the extra calculations we do and data we store for the functionality of the regular string.

I will start by implementing the Substring method.

  1. private LongString(List<string> items, int runSize)
  2. {
  3.   _runSize = runSize;
  4.   _items = items;
  5.   _length = GetTotalLength();
  6. }
  7.  
  8.  
  9. public string Substring(int startIndex, int length)
  10. {
  11.   Tuple<int, int> innerStratIndex = GetItemIndexForCharIndex(startIndex);
  12.   Tuple<int, int> innerEndIndex = GetItemIndexForCharIndex(startIndex + length - 1);
  13.  
  14.  
  15.   if (innerStratIndex.Item1 == innerEndIndex.Item1) //This is the case when the substring is contained within one item
  16.   {
  17.     return _items[innerStratIndex.Item1].Substring(innerStratIndex.Item2, innerEndIndex.Item2 - innerStratIndex.Item2 + 1);
  18.   }
  19.  
  20.   StringBuilder result = new StringBuilder(); //Can be optimized by calculating the exact length
  21.  
  22.   result.Append(_items[innerStratIndex.Item1].Substring(innerStratIndex.Item2));
  23.  
  24.   for (int i = innerStratIndex.Item1 + 1; i < innerEndIndex.Item1; i++)
  25.   {
  26.     result.Append(_items[i]);
  27.   }
  28.  
  29.   result.Append(_items[innerEndIndex.Item1].Substring(0, innerEndIndex.Item2 + 1));
  30.   return result.ToString();
  31. }
  32.  
  33. public LongString LongSubstring(int startIndex, int length)
  34. {
  35.   Tuple<int, int> innerStratIndex = GetItemIndexForCharIndex(startIndex);
  36.   Tuple<int, int> innerEndIndex = GetItemIndexForCharIndex(startIndex + length - 1);
  37.  
  38.   List<string> result = new List<string>();
  39.   if (innerStratIndex.Item1 == innerEndIndex.Item1) //This is the case when the substring is contained within one item
  40.   {
  41.     result.Add(_items[innerStratIndex.Item1].Substring(innerStratIndex.Item2, innerEndIndex.Item2 - innerStratIndex.Item2 + 1));
  42.   }
  43.   else
  44.   {
  45.     result.Add(_items[innerStratIndex.Item1].Substring(innerStratIndex.Item2));
  46.  
  47.     for (int i = innerStratIndex.Item1 + 1; i < innerEndIndex.Item1; i++)
  48.     {
  49.       result.Add(_items[i]);
  50.     }
  51.  
  52.     result.Add(_items[innerEndIndex.Item1].Substring(0, innerEndIndex.Item2 + 1));
  53.   }
  54.   return new LongString(result, _runSize);
  55. }
  56.  
  57.  
  58. /// <summary>
  59. /// Returns a tuple where the first value is the index of the string that contains startIndex within the _items array
  60. /// and the second value is the offset within that item (string)
  61. /// </summary>
  62. /// <param name="index">The index of the character within the full string</param>
  63. /// <returns></returns>
  64. private Tuple<int, int> GetItemIndexForCharIndex(int index)
  65. {
  66.   int indexRemaining = index;
  67.   int itemIndex = 0;
  68.   while (indexRemaining >= 0)
  69.   {
  70.     if (_items[itemIndex].Length <= indexRemaining)
  71.     {
  72.       indexRemaining -= _items[itemIndex].Length;
  73.       itemIndex++;
  74.       continue;
  75.     }
  76.     return new Tuple<int, int>(itemIndex, indexRemaining);
  77.   }
  78.   return null;
  79. }

The code is quite simple. I have added a special private constructor that allows me to create a LongString from a list of strings. This allows me to speed up the internal cloning process without going through a regular string again. The substring operation is straight forward, cut away the start and the beginning of the target string from the saved string parts and copy all the parts in the middle as they are.

In the next post I will show more implementations of various methods of the LongString class.

Thank you for reading, Boris.