Intro

Hello :) And welcome back. We got a big update today. As you might have noticed, this is the third post, which was supposed to be the last. Well I extended the last post to be the fourth one instead.

Why?

Well, because there was a memory leak in the application, so I couldn’t run it for more than 50 million IP’s before it would crash. Yikes.

Doing Everything Else Than Fixing The Memory Leak

I’m actually really proud of this. When I first found out that there was a memory leak, I initially thought it was because of the Ping class (it wasn’t btw). After reading through my code for what felt like a week, I decided to just slim down the class. The reasoning being the sheer amount of pings/second (16000/seconds).

It’s 256 threads running, each pinging with a 16ms timeout. That’s quite a lot of presure on the GC. And the server is only running on 2 virtual threads… So it’s not the best environment for the GC to work in.

A quick git clone later, I had the dotnet runtime downloaded. I copy/pasta all of the necessary classes into an empty BenchmarkDotNet project, and wrote my code as I normaly would.

Default:

[Benchmark]
[Arguments(1)]
public IPStatus? Default(int itt)
{
    IPStatus status = new();

    Ping ping = new();

    for (int i = 0; i < itt; i++)
    {
        status = ping.Send("127.0.0.1").Status;
    }

    return status;
}

DefaultWithExtra:

[Benchmark]
[Arguments(1)]
public IPStatus? DefaultWithExtra(int itt)
{
    IPStatus status = new();

    Ping ping = new();

    byte[] buf = [];

    for (int i = 0; i < itt; i++)
    {
        status = ping.Send(IPAddress.Parse("127.0.0.1"), 0, buf, null).Status;
    }

    return status;
}

Custom:

[Benchmark]
[Arguments(1)]
public IPStatus CustomPingTest(int itt)
{
    IPStatus status = new();

    for (int i = 0; i < itt; i++)
    {
        status = CustomPing.SendIcmpEchoRequestOverRawSocket(Parse("127.0.0.1".AsSpan()), 0);
    }

    return status;
}

The very first thing I did when cutting off the crust, was to call the lowest abstracted method possible. If we just run ping.Send("127.0.0."), then there are 4 chained methods, before we reach SendPingCore. But with ping.Send(IPAddress.Parse("127.0.0.1"), 0, buf, null), we just call the method calling SendPingCore.

The first parameter is an IPAddress object, the second is the timeout (We want as fast response as possible). The third is an empty buffer, so we make sure we send as little data as possible. The fourth is the options. We give it null for default options.

The results:

| Method           | itt | Mean     | Error    | StdDev   | Median   | Rank | Gen0   | Allocated |
|----------------- |---- |---------:|---------:|---------:|---------:|-----:|-------:|----------:|
| Default          | 1   | 28.29 us | 0.039 us | 0.196 us | 28.24 us |    3 | 0.1831 |    1592 B |
| DefaultWithExtra | 1   | 25.01 us | 0.021 us | 0.106 us | 25.00 us |    2 | 0.1221 |    1168 B |
| CustomPingTest   | 1   | 24.16 us | 0.017 us | 0.088 us | 24.14 us |    1 | 0.0916 |     808 B |

Pretty impressive, huh? I’m surprised at how close my stripped down version is to the lowest abstracted default. But what happenes if we run the loop a thousand times?

| Method           | itt  | Mean     | Error    | StdDev   | Median   | Rank | Gen0     | Allocated  |
|----------------- |----- |---------:|---------:|---------:|---------:|-----:|---------:|-----------:|
| Default          | 1000 | 28.01 ms | 0.027 ms | 0.135 ms | 27.98 ms |    3 | 156.2500 | 1336.18 KB |
| DefaultWithExtra | 1000 | 25.16 ms | 0.065 ms | 0.332 ms | 25.17 ms |    2 |  93.7500 |  976.75 KB |
| CustomPingTest   | 1000 | 24.35 ms | 0.032 ms | 0.167 ms | 24.31 ms |    1 |  93.7500 |  789.08 KB |

That’s cool an all, but it’s not very close to real-world usecases, if we need to scan the whole internet. But I also don’t want to wait a year for it to be done. So let’s test with a million itterations in the ping loop. (In the two previous tests, I had set it to LongRunJob, but with this, I’ll set it to ShortRunJob, since I’m mostly only interested in the memory usage)

| Method           | itt     | Mean    | Error   | StdDev  | Rank | Gen0        | Gen1      | Allocated  |
|----------------- |-------- |--------:|--------:|--------:|-----:|------------:|----------:|-----------:|
| Default          | 1000000 | 28.02 s | 0.273 s | 0.015 s |    1 | 163000.0000 | 2000.0000 | 1304.63 MB |
| DefaultWithExtra | 1000000 | 24.86 s | 0.227 s | 0.012 s |    1 | 119000.0000 | 2000.0000 |  953.68 MB |
| CustomPingTest   | 1000000 | 24.20 s | 0.267 s | 0.015 s |    1 |  96000.0000 | 2000.0000 |  770.57 MB |

But yeah, I removed all of the other methods calling SendPingCore. The speed increase was well within margin of error, so we continue.

The first “real” speed increase I stumpled upon was testing wether if SendIcmpEchoRequestOverRawSocket was getting run or not in SendPingCore:

private PingReply SendPingCore(
      IPAddress address,
      byte[] buffer,
      int timeout,
      PingOptions options)
    {
      return !RawSocketPermissions.CanUseRawSockets(address.AddressFamily) ? this.SendWithPingUtility(address, buffer, timeout, options) : Ping.SendIcmpEchoRequestOverRawSocket(address, buffer, timeout, options);
    }

If you run the application with sudo (we’re using Linux btw), then it’s SendIcmpEchoRequestOverRawSocket getting executed. If we’re not running as sudo, then it’s SendWithPingUtility. And it just so happened that I run the application as sudo on my server.

So we cut out that if statement, since we don’t need it at all. I did the same with all of the methods that was called in the program flow. Checking if the methods are getting run, and if I can remove any if/else statements.

As it turns out, there is a lot of if/else statements that could be removed, since they would never apply to my usecase. I even cut out the whole IPv6 handling code, as it was completely irrelevant to my usecase.

Another thing I did was to remove any object that was either not used, only assigned, or just inline the values, if they never changed.

I also modified the IPAddress.Parse method, that you see in the CustomPingTest method.

So far, so good. I like this result, even though it’s not as groundbreaking as it could be. The only thing holding me back from optimizing it further, is the internal static unsafe Interop.Error Send method getting called from the SocketPal class.

It uses [LibraryImport("libSystem.Native", EntryPoint = "SystemNative_Send")] and __PInvoke. It was spitting errors out the ass when I tried to implement it, and no matter what I did, I could not make it work. I was so close to be able to make the Ping class into a struct.

Oh well. On to the next topic.

Finding The Memory Leak

So, I did some doom-scrolling in the Microsoft.Data.Sqlite library code, and found this line:

SqliteCommand --> SqliteConnection --> SqliteConnectionFactory

protected SqliteConnectionFactory()
{
    AppDomain.CurrentDomain.DomainUnload += (_, _) => ClearPools();
    AppDomain.CurrentDomain.ProcessExit += (_, _) => ClearPools();

    ...
}

Ladies and Gentlemen, we got ’em.

What do you mean, “we got ’em?” I hear you say. Well, here is the thing with C# and dotnet. There are very few ways of creating memory leaks without using an unsafe context. And events are one of them, if not the most common memory leak.

The astute reader might have already noticed why it’s a memory leak. We never unsubscribe from the events. What we’re looking at is an event delegate, that whenever we subscribe to the event, the event handlers, DomainUnload and ProcessExit in this case, will hold a reference to the subscriber.

And if the subscriber is garbage collected before the event has been unsubscribed, the event handler will still keep the reference to the subscriper.

Fixing The Memory Leak

I have two fixes for the problem. One easy. And one also easy, but more steps are required.

The first fix, and the one I ended up with, was to just create the SqliteCommand once, and then just re-use it. Easy and simple. It’s the SqliteCommand that implements SqliteConnection, than then implements the memory leak. So if we just do something like this:

...
{
    SqliteConnection connection = new(ConnectionString);
    connection.Open();

    SqliteCommand command = new(InsertInto, connection);

    command.Parameters.AddWithValue("@ip", ip.Ip);
    command.Parameters.AddWithValue("@responseCode", ip.ResponseCode);
    _ = command.ExecuteNonQuery();

    command.Parameters.Clear();
}
....

That was an example from a benchmark, we will look into in a bit.

The other fix? Well, that would be to implement the IDisposable pattern, and unsubscribe from the event handlers when the SqliteCommand object had been collected. That code could look something like this:

public void Dispose()
{
    AppDomain.CurrentDomain.DomainUnload -= (_, _) => ClearPools();
    AppDomain.CurrentDomain.ProcessExit -= (_, _) => ClearPools();
    ClearPools();
}

Here, we unsubscribe from the event handler in the Dispose() method, making sure that there is no dangling reference anywhere. Although this might be the best fix, in my opinion, I would still have to compile the Sqlite library, and then figure out how to import it without the NuGet package manager. It’s a little too much for this project, so I’ll leave it as an exercise for the reader.

Now onto the implementation. Before actually implementing the fix, I create a BenchmarkDotNet project to test if the fix is worth anything, and come up with potential upgrades/optimizations.

Here is my initial benchmarks:

private const int Itterations = 10_000;

[Benchmark]
[Arguments(Itterations)]
public void Default(int count)
{
    Ip ip = new()
    {
        Ip1 = 255,
        Ip2 = 255,
        Ip3 = 255,
        Ip4 = 255,
        ResponseCode = 1001
    };

    for (int i = 0; i < count; i++)
    {
        using SqliteConnection connection = new(ConnectionString);
        connection.Open();

        using SqliteCommand command = new(InsertIntoDiscarded, connection);

        command.Parameters.AddWithValue("@ip1", ip.Ip1);
        command.Parameters.AddWithValue("@ip2", ip.Ip2);
        command.Parameters.AddWithValue("@ip3", ip.Ip3);
        command.Parameters.AddWithValue("@ip4", ip.Ip4);
        command.Parameters.AddWithValue("@responseCode", ip.ResponseCode);
        _ = command.ExecuteNonQuery();

        connection.Close();
    }
}

[Benchmark]
[Arguments(Itterations)]
public void Fixed(int count)
{
    SqliteConnection connection = new(ConnectionString);
    connection.Open();

    SqliteCommand command = new(InsertIntoDiscarded, connection);

    Ip ip = new()
    {
        Ip1 = 255,
        Ip2 = 255,
        Ip3 = 255,
        Ip4 = 255,
        ResponseCode = 1001
    };

    for (int i = 0; i < count; i++)
    {
        command.Parameters.AddWithValue("@ip1", ip.Ip1);
        command.Parameters.AddWithValue("@ip2", ip.Ip2);
        command.Parameters.AddWithValue("@ip3", ip.Ip3);
        command.Parameters.AddWithValue("@ip4", ip.Ip4);
        command.Parameters.AddWithValue("@responseCode", ip.ResponseCode);
        _ = command.ExecuteNonQuery();

        command.Parameters.Clear();
    }
}

The only difference between these two methods, is that the connection and command are getting instantiated with each loop. And embarrassingly, I did this in my scanner. I wrote code like this, and that’s the reason for the memory leak.

So how bad is the memory used?

| Method  | count  | Mean    | Error    | StdDev   | Rank | Gen0       | Allocated |
|-------- |------- |--------:|---------:|---------:|-----:|-----------:|----------:|
| Default | 100000 | 6.012 s | 0.0531 s | 0.0762 s |    2 | 40000.0000 | 320.44 MB |
| Fixed   | 100000 | 5.634 s | 0.0466 s | 0.0698 s |    1 | 23000.0000 | 189.21 MB |

The default version is using 1.6x more memory than the fixed version. Plus we get a nice speed boost also. And this is just for 100.000 rows inserted. let’s try 1.000.000 rows.

| Method  | count   | Mean    | Error | Rank | Gen0        | Gen1      | Allocated |
|-------- |-------- |--------:|------:|-----:|------------:|----------:|----------:|
| Default | 1000000 | 58.86 s |    NA |    2 | 401000.0000 | 5000.0000 |   3.13 GB |
| Fixed   | 1000000 | 53.98 s |    NA |    1 | 237000.0000 | 5000.0000 |   1.85 GB |

The speed boost isn’t actually all that great, but it is there. Although this is only one million rows, the actual number we could potentially store, given that we have to scan around 4.4 billion IP’s, is around 4.4 billion rows. That’s the absolute maximum that the project could potentially have to store.

So every little boost we can get is awesome. Another, in my opinion, huge speed boost we could achieve, is to compress the 4 columns of each byte of the Ip, into one column. We could just bitshift them together. And we could keep the response code in it’s own column, so we can keep the IP as a uint, and not a ulong, since we would need 32 bits more, to store the response code in the combined number.

You know what, I’ll do you one better. We could even combine multiple rows into one, and then compress it with a DeflateStream set to CompressionLevel.SmallestSize. Yeah, this could definitely work.

Here is the set up for the benchmarks:

[MemoryDiagnoser]
[RankColumn]
[ShortRunJob]
public class Test
{
    private const string DefaultConnection = "Data Source=Test1.db";
    private const string BetterConnection = "Data Source=Test2.db";
    private const string BestConnection = "Data Source=Test3.db";

    private const string DefaultStatement = "PRAGMA synchronous = OFF; PRAGMA temp_store = MEMORY;" +
                                               " PRAGMA journal_mode = MEMORY; PRAGMA foreign_keys = off;" +
                                               " INSERT INTO Discarded (Ip1, Ip2, Ip3, Ip4, ResponseCode)" +
                                               " VALUES (@ip1, @ip2, @ip3, @ip4, @responseCode)";

    private const string BetterStatement = "PRAGMA synchronous = OFF; PRAGMA temp_store = MEMORY;" +
                                         " PRAGMA journal_mode = MEMORY; PRAGMA foreign_keys = off;" +
                                         " INSERT INTO Discarded (PackedIp, ResponseCode)" +
                                         " VALUES (@packedIp, @responseCode)";

    private const string BestStatement = "PRAGMA synchronous = OFF; PRAGMA temp_store = MEMORY;" +
                                               " PRAGMA journal_mode = MEMORY; PRAGMA foreign_keys = off;" +
                                               " INSERT INTO Discarded (PackedData)" +
                                               " VALUES (@packedData)";

    private const int Itterations = 1_000;
    ...

The new default method:

...
[Benchmark]
[Arguments(Itterations)]
public void Default(int count)
{
    SqliteConnection connection = new(DefaultConnection);
    connection.Open();

    SqliteCommand command = new(DefaultStatement, connection);

    Ip ip = new()
    {
        Ip1 = 255,
        Ip2 = 255,
        Ip3 = 255,
        Ip4 = 255,
        ResponseCode = 1001
    };

    for (int i = 0; i < count; i++)
    {
        command.Parameters.AddWithValue("@ip1", ip.Ip1);
        command.Parameters.AddWithValue("@ip2", ip.Ip2);
        command.Parameters.AddWithValue("@ip3", ip.Ip3);
        command.Parameters.AddWithValue("@ip4", ip.Ip4);
        command.Parameters.AddWithValue("@responseCode", ip.ResponseCode);
        _ = command.ExecuteNonQuery();

        command.Parameters.Clear();
    }
}
...

The better method, where we only bitshift the IP’s into one number, and keep the response code in it’s own column:

...
[Benchmark]
[Arguments(Itterations)]
public void Better(int count)
{
    SqliteConnection connection = new(BetterConnection);
    connection.Open();

    SqliteCommand command = new(BetterStatement, connection);

    Ip ip = new()
    {
        Ip1 = 255,
        Ip2 = 255,
        Ip3 = 255,
        Ip4 = 255,
        ResponseCode = 1001
    };

    for (int k = 0; k < count; k++)
    {
        command.Parameters.AddWithValue("@packedIp", ip.PackIp());
        command.Parameters.AddWithValue("@responseCode", ip.ResponseCode);
        _ = command.ExecuteNonQuery();

        command.Parameters.Clear();
    }
}
...

The best method, where we bitshift, append the response code at the end of the IP, collect 2048 IP’s into one string, compress it, and insert it into one column.

Ez.

...
[Benchmark]
[Arguments(Itterations)]
public void Best(int count)
{
    SqliteConnection connection = new(BestConnection);
    connection.Open();

    SqliteCommand command = new(BestStatement, connection);

    Ip ip = new()
    {
        Ip1 = 255,
        Ip2 = 255,
        Ip3 = 255,
        Ip4 = 255,
        ResponseCode = 1001
    };

    int j = 0;
    int i = 0;

    DefaultInterpolatedStringHandler interpolatedStringHandler = new(2, 2);

    while (true)
    {
        if (i >= count)
        {
            command.Parameters.AddWithValue("@packedData", CompressionUtil.CompressG(interpolatedStringHandler.ToStringAndClear()));
            _ = command.ExecuteNonQuery();

            command.Parameters.Clear();

            break;
        }

        if (j == 2048)
        {
            command.Parameters.AddWithValue("@packedData", CompressionUtil.CompressG(interpolatedStringHandler.ToStringAndClear()));
            _ = command.ExecuteNonQuery();

            command.Parameters.Clear();

            j = 0;
        }

        interpolatedStringHandler.AppendFormatted(ip.PackIp());
        interpolatedStringHandler.AppendLiteral(":");
        interpolatedStringHandler.AppendFormatted(ip.ResponseCode);
        interpolatedStringHandler.AppendLiteral(",");

        j++;
        i++;
    }
}

Are you ready for the result? Let me just warn you first, it’s quite surprising. As in, it’s pretty wild. Really really wild.

| Method  | count  | Mean         | Error      | StdDev     | Rank | Gen0       | Gen1    | Allocated |
|-------- |------- |-------------:|-----------:|-----------:|-----:|-----------:|--------:|----------:|
| Default | 100000 | 5,527.820 ms | 31.4127 ms | 45.0511 ms |    3 | 23000.0000 |       - | 189.21 MB |
| Better  | 100000 | 5,231.529 ms | 38.6897 ms | 57.9089 ms |    2 | 13000.0000 |       - | 103.76 MB |
| Best    | 100000 |     7.821 ms |  0.0712 ms |  0.0975 ms |    1 |   578.1250 | 78.1250 |   4.66 MB |

Yeah, this is getting ridiculous.

What about a million rows? Do note that I stopped including the other methods here, as the numbers would be too large to really comprehend in the comparison

| Method | count   | Mean     | Error    | StdDev   | Rank | Gen0      | Gen1      | Allocated |
|------- |-------- |---------:|---------:|---------:|-----:|----------:|----------:|----------:|
| Best   | 1000000 | 75.75 ms | 0.119 ms | 0.603 ms |    1 | 5714.2857 | 1285.7143 |  46.62 MB |

Do we dare test for one billion?

| Method | count      | Mean    | Error | Rank | Gen0         | Gen1         | Allocated |
|------- |----------- |--------:|------:|-----:|-------------:|-------------:|----------:|
| Best   | 1000000000 | 1.326 m |    NA |    1 | 5812000.0000 | 1452000.0000 |  45.53 GB |

This is just silly. But what about size diferences of the databases?

Well:

Default: 2415MB
Better: 1909MB
Best: 9MB

Wait wait wait wait. Hold the phone. Only 9 MegaBytes? Yep. Let me create a bar chart for you to illustrate just how big of a difference that is.

Image of bar chart comparing database sizes

Imagine my surprise when I first saw the results. I simply couldn’t believe it.

TLDR

The TLDR of this, is that bitshifting the 4 bytes of the IP’s into one number –> appending that IP and response code into a string (with DefaultInterpolatedStringHandler) –> collecting 2048 strings –> compressing the string –> inserting that into the database, is 706x faster, than just inserting each byte of IP and response code into each column and row, in the database.

Pretty wild. We’re essentially doing more, by doing more.

Misc

These are the methods I couldn’t find a good place to include in the blog post.

This is the Ip struct, and public uint PackIp() is the method that bitshifts the ip into one number.

public struct Ip
{
    public int Ip1 { get; set; }

    public int Ip2 { get; set; }

    public int Ip3 { get; set; }

    public int Ip4 { get; set; }

    public int ResponseCode { get; set; }

    public override string ToString()
    {
        return $"{Ip1}.{Ip2}.{Ip3}.{Ip4}";
    }

    public uint PackIp()
    {
        return (uint)(Ip1 << 24) | (uint)(Ip2 << 16) | (uint)(Ip3 << 8) | (uint)Ip4;
    }
}

This is the compression method, as well as decompression:

public static class CompressionUtil
{
    public static string Compress(string text)
    {
        // Convert the string to a byte array
        byte[] byteArray = Encoding.UTF8.GetBytes(text);

        // Use a MemoryStream to hold the compressed data
        using MemoryStream memoryStream = new();

        // Create a DeflateStream for compression
        using (DeflateStream compressionStream = new(memoryStream, CompressionLevel.SmallestSize))
        {
            // Write the byte array to the compression stream
            compressionStream.Write(byteArray, 0, byteArray.Length);
        }

        // Convert the compressed byte array from the MemoryStream to a string (Base64 encoding is common for this)
        return Convert.ToBase64String(memoryStream.ToArray());
    }

    public static string Decompress(string compressedText)
    {
        // Convert the Base64 encoded string back to a byte array
        byte[] byteArray = Convert.FromBase64String(compressedText);

        // Use a MemoryStream to hold the decompressed data
        using MemoryStream memoryStream = new(byteArray);

        // Create a DeflateStream for decompression
        using DeflateStream decompressionStream = new(memoryStream, CompressionMode.Decompress);

        // Read the decompressed byte array from the decompression stream
        using StreamReader reader = new(decompressionStream);

        return reader.ReadToEnd();
    }
}