StringBuilders magic for very large strings

The StringBuilder class is used to create mutable sequences of characters. Strings are immutable, so if you need to perform multiple operations on a string, it is better to use a StringBuilder instead of a string. This is especially useful when you need to concatenate a large number of strings. But there is more magic to it, especially when we go BIG!

StringBuilder

Okay, a very short recap of what the StringBuilder is all about. It is a class that represents a mutable string of characters. string is immutable and if you perform operations like string.Concat, + or +=, a new string is created every time. This is not the case with StringBuilder. It is mutable, and you can append, insert, remove, replace, and concatenate strings without creating a new instance of StringBuilder. The basic idea is to have an internal array (just like List<T>) that can be resized when needed. Okay, but you probably knew that, and if not, there are many resources out there that explain the basics of StringBuilder.

BIG strings

But what if you need to create a very large string? Let's say you need to create a string with 100'000 characters. Sure, that isn't a problem, as you might have guessed, and that is not what this blog post is all about. Let's create a StringBuilder and a List with 100'000 characters. Yes, if you need "append-only" functionality, you can use a List<char> and then create a string from it. That is what the .NET team does internally a few times - for performance reasons.

Before we do that let's create a baseline benchmark with only 10000 characters.

[MemoryDiagnoser]
public class StringBenchmark
{
    private readonly StringBuilder stringBuilder = new();
    private readonly List<string> list = new();
    private const int IterationCount = 1000;
    private const string TenCharacters = "0123456789";

    public static void Main() => BenchmarkRunner.Run<StringBenchmark>();
    
    [Benchmark]
    public string StringBuilder()
    {
        for (var i = 0; i < IterationCount; i++)
        {
            stringBuilder.Append(TenCharacters);
        }

        return stringBuilder.ToString();
    }
    
    [Benchmark]
    public string List()
    {
        for (var i = 0; i < IterationCount; i++)
        {
            list.Add(TenCharacters);
        }

        return string.Concat(list);
    }
}

Before I show you any results, please "ignore" the time benchmark here. That is not so important right now. What is important is the memory consumption. Let's run the benchmark.

| Method        | Mean     | Error     | StdDev    | Gen0   | Gen1   | Gen2   | Allocated |
|-------------- |---------:|----------:|----------:|-------:|-------:|-------:|----------:|
| StringBuilder | 2.613 us | 0.0142 us | 0.0126 us | 6.3477 | 0.3967 |      - |  51.95 KB |
| List          | 6.969 us | 0.0320 us | 0.0283 us | 4.3716 | 0.0229 | 0.0076 |  35.81 KB |

The fascinating thing is that the StringBuilder uses more memory than the List. Yes, I know, you can also give the initial capacity to the StringBuilder and List and it will use less memory. While this is true, you will see in a second why that "doesn't" matter when we go BIG!

Let's go BIG: Let's take "just" 1'000'000 characters. We will use the same benchmark as before, but with 10000 Iterations and the string is 100 characters long.

[MemoryDiagnoser]
public class StringBenchmark
{
    private const int IterationCount = 10000;

    private const string OneHundredCharacters =
        "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789";

    public static void Main() => BenchmarkRunner.Run<StringBenchmark>();
    
    [Benchmark]
    public string StringBuilder()
    {
        StringBuilder stringBuilder = new();
        for (var i = 0; i < IterationCount; i++)
        {
            stringBuilder.Append(OneHundredCharacters);
        }

        return stringBuilder.ToString();
    }
    
    [Benchmark]
    public string List()
    {
        List<string> list = new();
        for (var i = 0; i < IterationCount; i++)
        {
            list.Add(OneHundredCharacters);
        }

        return string.Concat(list);
    }
}

The results:

| Method        | Mean     | Error    | StdDev   | Gen0      | Gen1      | Gen2     | Allocated |
|-------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| StringBuilder | 553.3 us | 17.36 us | 51.20 us |  696.2891 |  696.2891 | 210.9375 |   3.83 MB |
| List          | 652.8 us | 16.35 us | 48.22 us | 1030.2734 | 1018.5547 | 998.0469 |   6.15 MB |

Okay. The StringBuilder uses way less memory than the List. And also way less Gen2! What is going on here? To understand why StringBuilder is optimized for such cases, we have to resort to other means. I will not use the benchmarking but the performance counters of .NET itself. Basically, they provide you a metric of certain things that are going on. For example, in ASP.NET, they can tell you how many requests you are getting per second or how many active requests are currently running. But they can also tell you how much memory is used, and more importantly, how big the Large Object Heap is. Rider has a nice window for that - so let's use that.

Here the memory consumption of the StringBuilder: StringBuilder

Look at the LOH (Large Object Heap). We have exactly 0 bytes there! Keep that in mind. Now let's do the same thing, but for the List:

List

WOW! We have over 6 Megabytes in the LOH! That is a huge difference.

The Large Object Heap

Before we continue, let's have a small excursion to the Large Object Heap. The LOH is a heap in the .NET runtime that is used to store large objects (hence the name). Large objects are objects that are 85'000 bytes or larger. The LOH is not compacted like the normal heap. To understand what compacting is, have a look at: The garbage collector in .NET - Part 2: Compacting. Like in the old Windows time, where the defragmentation tool was king, the compacting has a similar purpose: Move chunks of memory together to free up space. Not only was this important when the virtual space was very small, but it also could help with caching as data was more local.

The exception to this rule is the LOH. The LOH is not compacted. The .NET team imperically found out that compacting stuff over 85'000 bytes is not worth it. Therefore we have this extra "thing".

What does this have to do with StringBuilder and List?

The magic of StringBuilder

A List is just one big chunk of memory as it backed by one array. So if you have stuff that is over 85'000 bytes, it will end up in the LOH. So for example a list of integers (that is 4 bytes) with more than 21'250 elements will end up in the LOH as it has over 85'000 bytes.

But for example a LinkedList will not. Why? Well, because it isn't backed by one big array but by many small nodes. So the LOH is not used. And the StringBuilder is similar.

The StringBuilder is not one array internally. If you ever looked into the StringBuilder you find the following:

public sealed class StringBuilder : ISerializable
{
    internal char[] m_ChunkChars;
    internal StringBuilder m_ChunkPrevious;

The StringBuilder is basically a linked list of other StringBuilder instances. Every chunk can get up-to 8'000 characters. If you append more, a new chunk is created and the link between those two instances is created! Very very clever! Technically that also brings some downsides - not only is the code more complex, but certain operations are not O(1). Imagine the indexer: myStringBuilder[10_000]. This is not O(1), as you have to traverse the linked list to get to the right chunk. However, for memory consumption, it is a very clever solution.

Why does the List use so much memory?

In the first example with "few" characters, we saw that List used less memory than StringBuilder, which exploded when we went BIG. The reason is the differences in grow strategies. How does a List grow:

int newCapacity = _items.Length == 0 ? DefaultCapacity : 2 * _items.Length;

Source: List.cs

At least until the current preview (.NET 9 preview 4), but also likely that it stays like that, a List has an exponential growth. So imagine the worst case here: Imagine you have 100'000 characters in your List and your List is full to the brim. You add one more character. The List will create a new array with 200'000 characters. The old 100'000 (after the content is copied into the new array) will be thrown away!

The StringBuider is different. It basically has two mechanism. The first is the m_ChunkChars array. But that is capped at 8'000 characters. If you append more, a new StringBuilder with a new m_ChunkChars array is created. SO that is the second mechanism. But there is no need to throw away stuff by default (don't get me wrong there are scenarios where this happens, but for this discussion it is not relevant).

10
An error has occurred. This application may no longer respond until reloaded. Reload x