In Part 1 of the LongString post I have shown how we can tackle the problem of running out of memory when manipulating long string using a class similar to the LongString. In Part 2 I will discuss a problem which arises from my implementation of LongString class when implementing the IEquatable<LongString> interface. As you may recall, the LongString is an aggregation of regular strings which are less or equal than a given value. The direct result of this definition is that two equal long strings may have a completely different representation (even if the given value is the same). I will discuss editing operations in Part 3 of this series but you may consider the case where for the given value of 3 the long string (“AB”,”CD”) is equal to the long string (“ABC”,”D”) but they have a different internal representation.
The first thing I did was to check how the framework string implements the equality method. You can see the code below but essentially since the strings are represented as a continues memory block the code simply goes over all the bytes and bitwise compares them (with a small optimization which I will not discuss).
I wanted to implement something similar and therefore my first implementation was like this:
But then I discovered that I didn’t implement the indexer yet so I simply use the method from Part 1 which returns the internal indices (within the array of strings and in a string) GetItemIndexForCharIndex which resulted in this simple code:
The main problem here is that the implementation of GetItemIndexForCharIndex calculates the indices by going over the entire array of strings. This is OK for the indexer but it is not very good when used in a loop therefore it is time to implement the enumerator. A simple enumerator which “remembers” the current position and moves forward should be enough for this case.
Now I can fix the Equal code to use the enumerator instead of the indexer.
One question remains… How bad is this solution? I have run some performance tests and it turns out that this equals method is about 30 to 50 times slower than the equals method of a regular string. In absolute terms, this equals method takes up to 20ms when comparing two 1,000,000 char long strings with only one different character (in a random position). The original method is so fast that it cannot be measured in ms .
The main problem with the long string is going to be performance (in terms of time) compared to the regular string. We see that if an algorithm required mainly comparisons then the LongString is not a good choice. Still, remember that the LongString was designed for a very specific scenario in which it should perform better. In the next part I will implement some editing methods.
A small note about concurrency. Just out of curiosity I have implemented the slow method using the TPL:
This was indeed faster in some cases over the very slow implementation but I cannot compete with the concurrency done in the CPU (which will be used to run the original Equals method of string). This implementation may give you some ideas on how to improve the Equals implementation of LongString (e.g. run a forward and a backward enumerators in separate tasks).
Thank you for reading,