Saturday, April 14, 2012

LongString: A long string for the small heap – Part 1

 

Hi All,

Today I am going to write about strings. I think that the most famous feature of the .Net string is its immutability and this property of the string is a source of comfort in many cases but also a source of pain in others. In this post I will describe a scenario where the immutability of a string may cause a serious degradation of performance and suggest a LongString class (with a partial implementation) which may somehow help in this case. To understand the problem in the scenario I am about to describe we have to throw in another .Net player – The large object heap (LOH for short). If you are unsure of how the LOH works you can read this Microsoft article but I will give a brief description of the relevant parts. The LOH is an additional heap used by the CLR to store large objects (and sometimes other object). How large? Currently (.Net 4) the magic number is 85,000 bytes and above. Every object allocated by the CLR which takes more than this number of byes will not be allocated on the regular heap but on the LOH. The problem is that the LOH has two undesirable properties:

  1. The objects on the LOH are always Generation 2 (i.e. you need a full GC run to remove them from memory).
  2. The memory in the LOH is not compacted because the CLR doesn’t want to copy such large objects around the memory. This may cause fragmentation in the LOH and sometimes to an OutOfMemory exception (even when there is plenty of memory remaining).

Now for the problematic scenario. Consider the case where you have a string which can be edited by the user and an internal mechanism that stores that string. Lets assume that for every change done by the user you must update some other data and display that data to the user immediately. In the general case this will not be a big problem because the strings will be pretty small. For every change you will be able to “edit” the internal string (by replacing it with a new instance) and the old instance will be garbage collected, probably as Generation 0. The problem begins when your string exceeds the magic number or actually half of the magic number since string characters are in Unicode. Then every generation of a new string will create a new instance on the LOH and the old instances will not be removed until full GC is run. If you are not careful enough and the string undergoes several manipulations then each such cycle may cause several objects to be allocated on the LOH and create fragmentation.

At this point some of you would stop me and say “Then use a StringBuilder, it is editable”. This is true, the StringBuilder is indeed a sort of editable version of the string class but internally it represents the data as a character array so the same problem remains (although not as severely as with String since the array will be reallocated much less).

I am going to propose an initial implementation of a LongString class which will be only rarely allocated on the LOH. The idea is simple, the LongString will keep an array of strings, each string is of length equal or less than a given number. It looks something like this:

  1. public class LongString
  2. {
  3.   private int _runSize;
  4.   private List<string> _items;
  5.   private int _length;
  6.  
  7.   public LongString(string data, int runSize)
  8.   {
  9.     _runSize = runSize;
  10.     _items = SplitString(data, runSize);
  11.     _length = GetTotalLength();
  12.   }
  13.  
  14.   private int GetTotalLength()
  15.   {
  16.     int count = 0;
  17.     foreach (string item in _items)
  18.     {
  19.       count += item.Length;
  20.     }
  21.  
  22.     return count;
  23.   }

Note that for brevity reasons I have omitted all the range/null checks from the code. Those checks are very important and mandatory in any code!

So, what are the pros and what are the cons of this implementation? The main advantage is that with a high enough runSize we can avoid the LOH altogether. The only part of the object that may grow too large is the items list but for run size of 100 we would be able to represent a string 100 times larger than we could with the regular string class without it being allocated on the LOH. I doubt that in any practical use you will have strings which are over 400Mb big (and you can always increase the runSize even more). The main disadvantage is that we don’t use the regular string class which means compatibility issue with any other existing code and some time and space penalties for all the extra calculations we do and data we store for the functionality of the regular string.

I will start by implementing the Substring method.

  1. private LongString(List<string> items, int runSize)
  2. {
  3.   _runSize = runSize;
  4.   _items = items;
  5.   _length = GetTotalLength();
  6. }
  7.  
  8.  
  9. public string Substring(int startIndex, int length)
  10. {
  11.   Tuple<int, int> innerStratIndex = GetItemIndexForCharIndex(startIndex);
  12.   Tuple<int, int> innerEndIndex = GetItemIndexForCharIndex(startIndex + length - 1);
  13.  
  14.  
  15.   if (innerStratIndex.Item1 == innerEndIndex.Item1) //This is the case when the substring is contained within one item
  16.   {
  17.     return _items[innerStratIndex.Item1].Substring(innerStratIndex.Item2, innerEndIndex.Item2 - innerStratIndex.Item2 + 1);
  18.   }
  19.  
  20.   StringBuilder result = new StringBuilder(); //Can be optimized by calculating the exact length
  21.  
  22.   result.Append(_items[innerStratIndex.Item1].Substring(innerStratIndex.Item2));
  23.  
  24.   for (int i = innerStratIndex.Item1 + 1; i < innerEndIndex.Item1; i++)
  25.   {
  26.     result.Append(_items[i]);
  27.   }
  28.  
  29.   result.Append(_items[innerEndIndex.Item1].Substring(0, innerEndIndex.Item2 + 1));
  30.   return result.ToString();
  31. }
  32.  
  33. public LongString LongSubstring(int startIndex, int length)
  34. {
  35.   Tuple<int, int> innerStratIndex = GetItemIndexForCharIndex(startIndex);
  36.   Tuple<int, int> innerEndIndex = GetItemIndexForCharIndex(startIndex + length - 1);
  37.  
  38.   List<string> result = new List<string>();
  39.   if (innerStratIndex.Item1 == innerEndIndex.Item1) //This is the case when the substring is contained within one item
  40.   {
  41.     result.Add(_items[innerStratIndex.Item1].Substring(innerStratIndex.Item2, innerEndIndex.Item2 - innerStratIndex.Item2 + 1));
  42.   }
  43.   else
  44.   {
  45.     result.Add(_items[innerStratIndex.Item1].Substring(innerStratIndex.Item2));
  46.  
  47.     for (int i = innerStratIndex.Item1 + 1; i < innerEndIndex.Item1; i++)
  48.     {
  49.       result.Add(_items[i]);
  50.     }
  51.  
  52.     result.Add(_items[innerEndIndex.Item1].Substring(0, innerEndIndex.Item2 + 1));
  53.   }
  54.   return new LongString(result, _runSize);
  55. }
  56.  
  57.  
  58. /// <summary>
  59. /// Returns a tuple where the first value is the index of the string that contains startIndex within the _items array
  60. /// and the second value is the offset within that item (string)
  61. /// </summary>
  62. /// <param name="index">The index of the character within the full string</param>
  63. /// <returns></returns>
  64. private Tuple<int, int> GetItemIndexForCharIndex(int index)
  65. {
  66.   int indexRemaining = index;
  67.   int itemIndex = 0;
  68.   while (indexRemaining >= 0)
  69.   {
  70.     if (_items[itemIndex].Length <= indexRemaining)
  71.     {
  72.       indexRemaining -= _items[itemIndex].Length;
  73.       itemIndex++;
  74.       continue;
  75.     }
  76.     return new Tuple<int, int>(itemIndex, indexRemaining);
  77.   }
  78.   return null;
  79. }

The code is quite simple. I have added a special private constructor that allows me to create a LongString from a list of strings. This allows me to speed up the internal cloning process without going through a regular string again. The substring operation is straight forward, cut away the start and the beginning of the target string from the saved string parts and copy all the parts in the middle as they are.

In the next post I will show more implementations of various methods of the LongString class.

Thank you for reading, Boris.

2 comments:

  1. Interesting! Did you have a case when this came useful?

    ReplyDelete
    Replies
    1. Yes. We have a parser which creates AST elements and an automatic code generator. The problem began when the automatic code generator would send very long strings (megabyte long) as parameters to methods and those parameters would be stored to the AST nodes as strings. Then the user would edit the parameters and each time the parser would run and update the AST dumping the old string. But then the problem didn't end because the string contained various language syntactic sugar (for example multiline strings in the editor) that you had to remove when storing to the AST. This process would create even more temporary strings and string builders eventually filling the memory (especially when more than one parser runs at the same time).

      In Part 2 I will show the implementation of other basic editing methods.

      Delete