Friday, April 27, 2012

LongString: When things become unequal (comparing two long strings) – Part 2

 

Hi All,

In Part 1 of the LongString post I have shown how we can tackle the problem of running out of memory when manipulating long string using a class similar to the LongString. In Part 2 I will discuss a problem which arises from my implementation of LongString class when implementing the IEquatable<LongString> interface. As you may recall, the LongString is an aggregation of regular strings which are less or equal than a given value. The direct result of this definition is that two equal long strings may have a completely different representation (even if the given value is the same). I will discuss editing operations in Part 3 of this series but you may consider the case where for the given value of 3 the long string (“AB”,”CD”) is equal to the long string (“ABC”,”D”) but they have a different internal representation.

Implementing Equality

The first thing I did was to check how the framework string implements the equality method. You can see the code below but essentially since the strings are represented as a continues memory block the code simply goes over all the bytes and bitwise compares them (with a small optimization which I will not discuss).

  1. private unsafe static bool EqualsHelper(string strA, string strB)
  2.     {
  3.       int i = strA.Length;
  4.       if (i != strB.Length)
  5.       {
  6.         return false;
  7.       }
  8.       char* ptr = &strA.m_firstChar;
  9.       char* ptr2 = &strB.m_firstChar;
  10.       while (i >= 12)
  11.       {
  12.         if (*(long*)ptr != *(long*)ptr2 || *(long*)(ptr + (IntPtr)8 / 2) != *(long*)(ptr2 + (IntPtr)8 / 2) || *(long*)(ptr + (IntPtr)16 / 2) != *(long*)(ptr2 + (IntPtr)16 / 2))
  13.         {
  14.         IL_7D:
  15.           while (i > 0 && *(int*)ptr == *(int*)ptr2)
  16.           {
  17.             ptr += (IntPtr)4 / 2;
  18.             ptr2 += (IntPtr)4 / 2;
  19.             i -= 2;
  20.           }
  21.           return i <= 0;
  22.         }
  23.         ptr += (IntPtr)24 / 2;
  24.         ptr2 += (IntPtr)24 / 2;
  25.         i -= 12;
  26.       }
  27.       goto IL_7D;
  28.     }

I wanted to implement something similar and therefore my first implementation was like this:

  1. public bool EqualsVerySlow(LongString other)
  2. {
  3.   if (Length != other.Length) return false;
  4.   for (int i = 0; i < Length; i++)
  5.   {
  6.     if (this[i] != other[i])
  7.     { return false; }
  8.   }
  9.   return true;
  10. }

But then I discovered that I didn’t implement the indexer yet so I simply use the method from Part 1 which returns the internal indices (within the array of strings and in a string) GetItemIndexForCharIndex which resulted in this simple code:

  1. public char this[int index]
  2. {
  3.   get
  4.   {
  5.     Tuple<int, int> innerIndex = GetItemIndexForCharIndex(index);
  6.     return _items[innerIndex.Item1][innerIndex.Item2];
  7.   }
  8. }

The main problem here is that the implementation of GetItemIndexForCharIndex calculates the indices by going over the entire array of strings. This is OK for the indexer but it is not very good when used in a loop therefore it is time to implement the enumerator. A simple enumerator which “remembers” the current position and moves forward should be enough for this case.

  1. class LongStringEnumerator : IEnumerator<char>
  2. {
  3.   LongString _innerString;
  4.   int _currentIndex;
  5.   int _currentCharIndex;
  6.  
  7.   public LongStringEnumerator(LongString target)
  8.   {
  9.     _innerString = target;
  10.     _currentIndex = 0;
  11.     _currentCharIndex = 0;
  12.   }
  13.  
  14.   public char Current
  15.   {
  16.     get
  17.     { return _innerString._items[_currentIndex][_currentCharIndex]; }
  18.   }
  19.  
  20.   public void Dispose(){}
  21.  
  22.   object System.Collections.IEnumerator.Current
  23.   {
  24.     get
  25.     { return Current; }
  26.   }
  27.  
  28.   public bool MoveNext()
  29.   {
  30.     _currentCharIndex++;
  31.     if (_currentCharIndex >= _innerString._items[_currentIndex].Length)
  32.     {
  33.       _currentCharIndex = 0;
  34.       _currentIndex++;
  35.       if (_currentIndex >= _innerString._items.Count)
  36.       {
  37.         return false;
  38.       }
  39.     }
  40.     return true;
  41.   }
  42.  
  43.   public void Reset()
  44.   {
  45.     _currentCharIndex = 0;
  46.     _currentIndex = 0;
  47.   }
  48. }

Now I can fix the Equal code to use the enumerator instead of the indexer.

  1. public bool Equals(LongString other)
  2. {
  3.   if (Length != other.Length)
  4.     return false;
  5.  
  6.   LongStringEnumerator enumerator = new LongStringEnumerator(this);
  7.   LongStringEnumerator enumerator2 = new LongStringEnumerator(other);
  8.  
  9.   do
  10.   {
  11.     if (enumerator.Current != enumerator2.Current)
  12.       return false;
  13.     enumerator2.MoveNext();
  14.   }
  15.   while (enumerator.MoveNext());
  16.  
  17.   return true;
  18. }

One question remains… How bad is this solution? I have run some performance tests and it turns out that this equals method is about 30 to 50 times slower than the equals method of a regular string. In absolute terms, this equals method takes up to 20ms when comparing two 1,000,000 char long strings with only one different character (in a random position). The original method is so fast that it cannot be measured in ms Smile.

Conclusion

The main problem with the long string is going to be performance (in terms of time) compared to the regular string. We see that if an algorithm required mainly comparisons then the LongString is not a good choice. Still, remember that the LongString was designed for a very specific scenario in which it should perform better. In the next part I will implement some editing methods.

A small note about concurrency. Just out of curiosity I have implemented the slow method using the TPL:

  1. public bool EqualsSlow(LongString other)
  2. {
  3.   if (Length != other.Length)
  4.     return false;
  5.  
  6.   bool isEqual = true;
  7.   Parallel.For(0, Length, (i, loopState) =>
  8.   {
  9.     if (this[i] != other[i])
  10.     {
  11.       isEqual = false;
  12.       loopState.Break();
  13.     }
  14.  
  15.   });
  16.  
  17.   return isEqual;
  18. }

This was indeed faster in some cases over the very slow implementation but I cannot compete with the concurrency done in the CPU (which will be used to run the original Equals method of string). This implementation may give you some ideas on how to improve the Equals implementation of LongString (e.g. run a forward and a backward enumerators in separate tasks).

Thank you for reading,

Boris