Disclaimer: This post is going to be long.

Why?

I have a small obsession with collecting different types of data, statistics and everything in between. And since my last project, I wanted to challenge myself even further. And this time, I wanted to build something more useful than a terminal video player.

So with this, I decided to create a search engine. It’s not that big of a project, I told myself at the beginning. It’s just going to be a two-week project, I thought. No. Absolutely not. This was so much more than I had anticipated.

And there is a reason for that: My requirements.

Do note that it’s not that straightforward to confirm or verify that, for example, “The runtime should be as lean as possible.” So the checkmarks are just a rough educated guess/confirmation.

The application requirements entail:

  1. The backend should be as multithreaded as realistically possible.
  2. The application should use as little memory as possible.
  3. The application should run for as long as absolutely possible with as few problems as possible.
  4. The runtime should be as lean as possible.
  5. The application should run as fast as possible (except certain code paths. Later on that).
  6. The application should use as little CPU as realistically possible.
  7. I want to use as few external libraries as possible.
  8. The application should be implemented as the classic three layer application. View-, logic- and data-layer.
  9. An actual search function.
  10. As many parts of the project as possible should run on .NET. That includes:
    • Frontend should be made in Blazor.
    • Backend should be a console application.

The data and statistics requirements:

  1. Save the Ip of the server.
  2. Save if port 80 or 443 is open.
  3. Save response code from pinging the server.
  4. The response code for Ips that don’t reply with Success.
  5. [1/2] Title and description of the site running on the server, for both port 80 and 443.
  6. Url for the website running on the server, for both port 80 and 443.
  7. What type of server the website is running on, e.g. Nginx, Apache…
  8. Check for robots.txt.
  9. Collect an assortment of “tags” per site.

The Challenges.

Scanning.

In the beginning, the biggest challenge was how to split the scanner into multiple scanners that would run in separate threads. Or really, just how to go about it. My first idea was to make a list of Ips I could scan, and then just pass them to the scanner. Maybe add them to a ConcurrentQueue<QueueItem> and consume them from each thread. But that goes against point 2. in the application requirements.

The best idea I came up with was to have four nested for loops in each thread when scanning, because an Ip is just 4 bytes. Each loop goes to 256. So all we need is a method like Scan(i, j, k, l), where i, j, k and l are the loop counters, each corresponding to one byte in the Ip. We can then segment all of the Ips into ranges defined by the first byte. So each Ip range is X.256.256.256, where X is split up into multiple ranges. That means we can scan one range per thread. And for simplicity, the thread count is a power of two: 2, 4, 8, 16… threads.
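
To make the idea concrete, here’s a rough sketch of how the ranges could be handed out, one per thread. ScanRange is a placeholder name for the method containing the four nested loops; the real code passes a scanSettings object with Start and End, as seen further down.

// Rough sketch of the range split described above - ScanRange is a placeholder
// for the method containing the four nested loops.
int threadCount = 8;                    // a power of two: 2, 4, 8, 16...
int rangeSize = 256 / threadCount;

for (int t = 0; t < threadCount; t++)
{
    int start = t * rangeSize;          // first byte of the Ip, inclusive
    int end = start + rangeSize;        // exclusive

    new Thread(() => ScanRange(start, end)).Start();
}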

Pretty smart, huh?

Another challenge is how to balance the accuracy of the scanner. How I scan is a multi-part process. First I send a ping to the Ip with a timeout of 128ms. Then, if the response is Success, we send a TCP open request to port 80 and 443. If either of those two ports is open, we create an Unfiltered object and add it to our ConcurrentQueue<QueueItem>. This will be detailed later on. If the ping fails, or neither port is open, we create a Discarded object and add that to our ConcurrentQueue<QueueItem> instead.

This is the exact reason why I made the scanner multithreaded, so I can run multiple “long” running tasks in parallel. Why not use async instead of creating threads? Well, dear reader, the reason is that a separate thread is better suited for longer-running work. While an async task is great for not blocking the main thread and being able to wait for the result without callbacks, it’s simply not what the project needs. And since I’m using a producer/consumer pattern, having long-running threads seemed more appropriate than having to create a task and wait for it.

Yes, you can create a long-running Task, but that’s not really a viable solution here.

And from university, I’ve also learned that a task is a job to be done and returned, while a thread is a worker. Tasks also run on the thread pool, where .NET turns “async tasks” into state machines scheduled on pool threads, while a dedicated thread runs on its own and doesn’t share state with other threads.

Another big problem is the number of Ips to scan. If I had to ping every single Ip in the Ipv4 space, and each ping takes up to 128ms, a single thread would need over 17 years, and even 8 threads running in parallel would still need a couple of years. So I decided to cut out some unnecessary Ip ranges. A quick DuckDuckGo search later, I found a neat list of reserved Ip ranges. And thought to myself: “Do I really need to scan the multicast Ip range for websites? Do I really need to scan my loopback addresses?”

And a few if statements later, we got a rather primitive, but working, Ip filter.

for (int i = scanSettings.Start; i < scanSettings.End; i++)
{
    if (i is 0 or 10 or 127 or 256) continue;
    if (i is >= 224 and <= 239) continue;

    for (int j = 1; j < 256; j++)
    {
        if (i == 169 && j == 254) continue;
        if (i == 192 && j == 168) continue;
        if (i == 198 && (j == 18 || j == 19)) continue;
        if (i == 172 && j is >= 16 and <= 31) continue;

        for (int k = 1; k < 256; k++)
        {
            if (i == 192 && k == 2) continue;
            if (i == 192 && j == 88 && k == 99) continue;

            for (int l = 1; l < 256; l++)
            {
...

Why do I exclude iteration 0 in all loops? Because that gave me problems. As in, the ping implementation in C# didn’t want to scan those addresses and threw exceptions. The list of reserved Ip ranges can be found here

It’s not a lot of Ips that get filtered out, but it is still a good amount, considering the 128ms ping timeout, and that creating a TCP connection and having it fail can take up to 300ms.

Database

But that’s not all. There were a few more challenges, the biggest being how to handle the database. My initial implementation used EF Core with Postgres as the database. It worked really well and was really fast, around 2-3ms per row inserted. But the fact that it was eating memory like there was no tomorrow was more than concerning. At one point, the LOH ended up at 203Mb, and Heap Generation 2 was growing at an alarming rate.

After researching a bit, I came to the understanding that you really shouldn’t be re-using the EF Core context. Always create a new context whenever you need to insert. Yes, even with the using keyword, it still eats memory. The problem now is that it’s slow, around 7-10ms per insert, but with way less memory usage (still higher than desirable). This implementation broke points 2. and 5. in the application requirements.
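
In code, that pattern looks roughly like this. ScanDbContext is a placeholder name, not the actual context type from the project:

// Rough sketch: a fresh, short-lived DbContext per insert instead of one long-lived,
// re-used instance. ScanDbContext is a placeholder name.
using (ScanDbContext db = new())
{
    db.Add(discarded);   // the entity to persist
    db.SaveChanges();    // flush immediately, then dispose the context
}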

So what now?

Enter the good ol’ reliable: SQLite.

Yep. SQLite. Ok, so first off, it’s really minimal in terms of memory usage and really efficient. Second off, I’m quite used to the SQL syntax variant that SQLite uses and feel comfortable writing optimized queries for it. The idea here is that since SQLite isn’t completely “thread safe”, I have to handle that myself. Which is completely fine, because that means I finally have a project where I can use a producer/consumer pattern to its fullest.

So, what to do? Simple, really. I create a separate thread with a single while(isRunning) loop that works through a ConcurrentQueue<QueueItem> of QueueItem objects I’ve created. And this is also where it gets a lil’ funny, maybe a lil’ silly even. This is a limitation I’ve had a hard time working around: since I want only a single entry point to the database, I’m limited in how I can handle the database.

My idea: The QueueItem object looks like this:

public class QueueItem
{
    public Unfiltered? Unfiltered { get; set; }
    public Runtime? Runtime { get; set; }
    public Filtered? Filtered { get; set; }
    public Discarded? Discarded { get; set; }
    public Operations Operations { get; set; }
}

Notice that everything except Operations is nullable. There is a good reason for that. Operations should always be set, and then, depending on what kind of object we want to use with the database, we set that one too. So let’s say I want to insert a Discarded object: I create the object, set Operations to the enum value Insert, add the item to the queue, and wait for the database handler to get to it.

Operations looks like this:

public enum Operations
{
    Insert,
    Update,
    Stop,
    Optimize,
}
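
Putting the insert example from above into code, it ends up looking something like this (the property names on Discarded are illustrative, not the exact ones from the project):

// Illustrative only - the property names on Discarded are simplified.
QueueItem item = new()
{
    Discarded = new Discarded { Ip = ip, ResponseCode = (int)responseCode },
    Operations = Operations.Insert,
};

_queue.Enqueue(item);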

Or maybe I want to update a row in the runtime database. I just read the row into a Runtime object, change what I need, set Operations to Update, add the object to the queue, and let the application handle the rest.

The while(isRunning) loop works through the ConcurrentQueue<QueueItem>, checks what operation we want to do and what kind of object we’re working with, and then executes the method that handles that combination, like private static void InsertRuntime(Runtime runtime) or private static void UpdateUnfiltered(Unfiltered unfiltered).
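
I won’t paste the actual insert methods, but as a rough illustration (using Microsoft.Data.Sqlite here, with a simplified table and column layout, not the exact code from the project), one of them could look something like this:

// Rough illustration of an insert method, using Microsoft.Data.Sqlite and a simplified
// Discarded table with Ip and ResponseCode columns - not the exact project code.
// _connection is a single SqliteConnection owned by the database handler thread.
private static void InsertDiscarded(Discarded discarded)
{
    using SqliteCommand command = _connection.CreateCommand();
    command.CommandText = "INSERT INTO Discarded (Ip, ResponseCode) VALUES ($ip, $responseCode)";
    command.Parameters.AddWithValue("$ip", discarded.Ip);
    command.Parameters.AddWithValue("$responseCode", discarded.ResponseCode);
    command.ExecuteNonQuery();
}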

Since this is a queue, I can’t retrieve any data through it back to the calling method. So this is where I break my perfect implementation. To retrieve data from the databases, I just have a public method, accessible from everywhere that has a reference to the database handler. I don’t like it, but it was really the only way I could think of. But hey, at least I use a lock object. Just to be on the safe side.
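
Roughly, that read path looks something like this; the method and helper names are just placeholders:

// Rough illustration of the lock-guarded read path - names are placeholders.
private static readonly object _lock = new();

public static Runtime? GetRuntime(long id)
{
    lock (_lock)
    {
        // Read directly from the database, bypassing the queue,
        // since the queue can't return data to the caller.
        return ReadRuntimeRow(id);   // hypothetical helper doing the actual SELECT
    }
}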

And now, with my implementation, it takes around 0-1ms per row inserted, mostly less than 1ms. Pretty good, if I should say so myself. And just to make sure I adhere to point 2. even more, I limit the number of items the queue can hold to 1k. Granted, I do that check per thread, so it’s not super accurate and the count ends up around 2k from time to time, but it’s within safe limits. If the count reaches 1k, each thread that detects it waits for 500ms, so the queue can get emptied faster.
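
On the scanner side, that check amounts to something along these lines:

// Simple backpressure on the producer side: if the shared queue holds 1k items or
// more, wait half a second before adding more, so the database handler can catch up.
if (_queue.Count >= 1_000)
{
    Thread.Sleep(500);
}

_queue.Enqueue(queueItem);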

On the database handler side, if the queue is empty, we wait 10ms so that we don’t loop unnecessarily. But with how fast the database handler can insert, that 1k threshold will never be reached, not even with 32 threads scanning simultaneously. This little endeavor satisfies points 1., 2., 5., and 6. of our application requirements.
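
The handler loop itself is just a skeleton around the dispatch code shown in the Database Handling section further down; roughly:

// Skeleton of the database handler loop. The dispatch on Operations is shown
// in the Database Handling section below; stop handling is simplified here.
while (isRunning)
{
    if (_queue.IsEmpty)
    {
        Thread.Sleep(10);   // nothing queued - don't spin on an empty queue
        continue;
    }

    // ... dequeue, switch on Operations, and break when a Stop item comes through.
}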

Indexing And Filtering.

Oh boy, where to start with this.

This turned out to be harder than I had imagined, really. And I’m not even done yet. I only collect urls and titles for the websites. That’s all I have managed to implement so far. And it’s not all that clean, if I’m being honest.

The way I get the url from an Ip is to run an external bash script (yikes) that runs the curl command. The script saves curl’s standard output to the ContentPort80.txt and ContentPort443.txt files, and its standard error to the HeaderPort80.txt and HeaderPort443.txt files. The url can often be read from the certificate you retrieve when using curl, and that’s what I’m saving to the header files. And when using the verbose flag with curl, I also get the index page. That’s where I get the title from.
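
From the C# side, kicking the script off is nothing fancy; roughly something like this (the script name and argument are placeholders, the output file names follow the description above):

// Rough sketch of invoking the external curl script. The script name and argument
// are placeholders; the output file names match the description above.
ProcessStartInfo startInfo = new()
{
    FileName = "/bin/bash",
    Arguments = $"curl-scan.sh {ip}",
};

using (Process? process = Process.Start(startInfo))
{
    process?.WaitForExit();
}

// The script redirects curl's standard output and standard error into these files.
string content80 = File.ReadAllText("ContentPort80.txt");
string header80 = File.ReadAllText("HeaderPort80.txt");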

The next update and fix I’m going to do here is to only use this approach for the url, and then maybe use HttpClient to get the HTML elements. But that’s for the next blog post.

The Implementations

Ip Scanning And Pre-Filtering

The Ip scanner is quite simple, really. First, we send a ping to an Ip, wait at most 128ms and handle the response. If the result is not Success, we add that Ip to the discarded database and continue to the next iteration of the loop with a new Ip. We set the default responseCode to IPStatus.Unknown, since sometimes, if the server on that specific Ip has set up its firewall in a certain way, the pinger throws an exception.

If the result is Success, then we try to create a TCP connection to that Ip on ports 80 and 443. Trying to connect to both ports can take up to 300ms total. Do notice that I return the result as a ValueTuple<int, int>, since I just return the ports that are open. If a port isn’t open, I return 0 instead of the port. But thinking back on it now, I should probably return a ValueTuple<bool, bool> instead.

Notice the empty catch statement. We don’t use it for anything, since logging the error is not really useful in this scenario, and it would just flood the console.

string ip = $"{i}.{j}.{k}.{l}";

using Ping pinger = new();
IPStatus responseCode = IPStatus.Unknown;

try
{
    // Sometimes, if the pinger gets a Destination Unreachable Communication administratively prohibited response, the pinger will throw an exception.
    // https://en.wikipedia.org/wiki/Internet_Control_Message_Protocol?useskin=vector#Control_messages
    responseCode = pinger.Send(ip, 128).Status;
}
catch
{/**/}

if (responseCode != IPStatus.Success)
{
    QueueItem superDiscardedObject = CreateDiscardedSuperObject(ip, (int)responseCode);
    _queue.Enqueue(superDiscardedObject);
    continue;
}

(int, int) ports = TcpClientHelper.CheckPort(ip, 80, 443);

The TcpClientHelper.CheckPort() method is quite simple, really. But I don’t like it. And there is a reason for that: it relies on try/catch statements for control flow, because if the connection fails, it throws an exception. It would be nice if it returned a result, maybe a Result.Error, instead of throwing, but that’s just how it is.

Looking back, I should probably just have a params int[] ports parameter instead of separate port arguments. That way, I could just iterate over the array instead of having duplicate code. Maybe for the next update; a sketch of that idea is shown after the method below.

public static (int, int) CheckPort(string ip, int port1, int port2)
{
    int number1;
    int number2;

    // This would be way cleaner if the TcpClient didn't throw an exception if the destination couldn't be reached,
    // and it would just return a Result.Error, for example.
    try
    {
        using TcpClient client = new();
        client.Connect(ip, port1);
        number1 = port1;
    }
    catch
    {
        number1 = 0;
    }

    try
    {
        using TcpClient client = new();
        client.Connect(ip, port2);
        number2 = port2;
    }
    catch
    {
        number2 = 0;
    }

    return (number1, number2);
}
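
And for reference, the params int[] variant mentioned above could look roughly like this. It’s a sketch, not the current implementation:

// Sketch of the params-based variant - not the current implementation.
public static int[] CheckPorts(string ip, params int[] ports)
{
    int[] results = new int[ports.Length];

    for (int index = 0; index < ports.Length; index++)
    {
        try
        {
            using TcpClient client = new();
            client.Connect(ip, ports[index]);
            results[index] = ports[index];   // open: report the port number
        }
        catch
        {
            results[index] = 0;              // closed or unreachable: report 0
        }
    }

    return results;
}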

And then we just check whether either port is open and, depending on that, create an Unfiltered or Discarded object.

if (ports is { Item1: 0, Item2: 0 })
{
    QueueItem superDiscardedObject = CreateDiscardedSuperObject(ip, (int)responseCode);
    _queue.Enqueue(superDiscardedObject);
    continue;
}

QueueItem superUnfilteredObject = CreateUnfilteredSuperObject(ip, (int)responseCode, ports);

_queue.Enqueue(superUnfilteredObject);

Database Handling

The database handling is actually super simple. We just try to dequeue an item, and if there is one, we check the operation and then run the corresponding method for that object type and operation. This all runs in a while loop. There is a stop condition as well: the stop boolean is changed from the thread handling class that manages all threads. More on that in the next post.

_queue.TryDequeue(out QueueItem? queueItem);
if (queueItem is null) { continue; }

switch (queueItem.Operations)
{
    case Operations.Insert when queueItem.Discarded is not null:
        InsertDiscarded(queueItem.Discarded);
        break;
    case Operations.Insert when queueItem.Unfiltered is not null:
        InsertUnfiltered(queueItem.Unfiltered);
        break;
    case Operations.Insert when queueItem.Runtime is not null:
        InsertRuntime(queueItem.Runtime);
        break;
    case Operations.Insert when queueItem.Filtered is not null:
        InsertFiltered(queueItem.Filtered);
        break;
    case Operations.Update when queueItem.Unfiltered is not null:
        UpdateUnfiltered(queueItem.Unfiltered);
        break;
    case Operations.Stop:
        stop = true;
        break;
}

if (stop)
{
    break;
}

Filtering And Indexing

This will be discussed in the next blog post.

Communication Between Frontend And Backend

This will be discussed in the next blog post.

The Environment And Further Optimizations

The environment that the project will be running on is my VPS (Virtual Private Server). It’s rather low-powered, but cheap nonetheless. The hardware goes as follows:

  • OS: Ubuntu 22.04.4 LTS x86_64
  • Host: KVM/QEMU
  • Kernel: 5.15.0-116-generic
  • CPU: QEMU Virtual version 2.5+ (1) @ 2.593GHz
  • Memory: 957MiB
  • Zram: 1.9G lzo-rle

With this little RAM and this few cores, it’s crucial that the application is in tip-top shape. That includes, but is not limited to, the compile flags. One thing I’ve noticed is that not many developers pay much mind to compile flags, especially the .csproj file. You can actually fine-tune your projects quite a lot with this file. It’s not only used to list the libraries or folders/files to include; it’s also used to define the flags used when compiling.

And since I’ve been focusing on minimizing the runtime overhead and memory usage, I’ve included quite a bunch of flags. I will try to explain the usage and effect of each flag. But do note that some flags were quite hard to find and are not listed on any official Microsoft documentation pages that I could find.

<!-- These four flags are from default .csproj -->
<OutputType>Exe</OutputType>
<TargetFramework>net8.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>

<!-- A good chunk of these comes from this page:
https://learn.microsoft.com/en-us/dotnet/core/deploying/trimming/trimming-options?pivots=dotnet-8-0#trim-framework-library-features -->

<!-- Compiling for only x64 can cut down on unnecessary x86 code. -->
<Platform>x64</Platform>

<!-- This is more obscure, but from what I've gathered, it's not set by default and can optimize some code paths. -->
<Optimize>true</Optimize>

<!-- AOT (Ahead Of Time) is the new kid on the block. Instead of compiling and having a JIT
(Just In Time) compiler running, we remove the JIT and optimize everything during compile time instead of at runtime. -->
<PublishAot>true</PublishAot>

<!-- Remove unnecessary libraries or assemblies, slimming down the application size. -->
<PublishTrimmed>true</PublishTrimmed>

<!-- Granularity of trimming. Here we remove as much as possible. -->
<TrimMode>full</TrimMode>

<!-- Remove symbols from the trimmed application, including embedded PDBs and separate PDB files. -->
<TrimmerRemoveSymbols>true</TrimmerRemoveSymbols>

<!-- Removes the ability to debug the code, further slimming it down in size and memory usage, as some debug symbols
and logic are removed. -->
<DebuggerSupport>false</DebuggerSupport>

<!-- We don't need this encoding anywhere, so we remove the code for handling unsafe UTF-7 -->
<EnableUnsafeUTF7Encoding>false</EnableUnsafeUTF7Encoding>

<!-- When set to false, removes EventSource-related code and logic. -->
<EventSourceSupport>false</EventSourceSupport>

<!-- When set to true, removes globalization-specific code and data. -->
<InvariantGlobalization>true</InvariantGlobalization>

<!-- When set to false, removes metadata update–specific logic related to hot reload. -->
<MetadataUpdaterSupport>false</MetadataUpdaterSupport>

<!-- Remove stack traces. Further lowers overhead. -->
<StackTraceSupport>false</StackTraceSupport>

<!-- When set to true, strips exception messages for System.* assemblies. -->
<UseSystemResourceKeys>true</UseSystemResourceKeys>

<!-- When set to false, removes code related to diagnostics support for System.Net.Http. -->
<HttpActivityPropagationSupport>false</HttpActivityPropagationSupport>

<!-- ReadyToRun is a mix of JIT and AOT: everything that can't be AOT compiled gets JIT compiled. It's mostly used with
ASP.NET, but it's still quite useful. However, since everything in this application can be AOT compiled, we don't need the
overhead of a JIT. -->
<PublishReadyToRun>false</PublishReadyToRun>

<!-- Makes it more portable. -->
<PublishSingleFile>true</PublishSingleFile>


<!-- These are more obscure flags and wasn't easy to track down. -->
<!-- https://github.com/dotnet/runtimelab/blob/feature/NativeAOT/docs/using-nativeaot/optimizing.md -->
<!-- This completely disables the reflection metadata generation. -->
<IlcDisableReflection>true</IlcDisableReflection>

<!-- allows the compiler to remove reflection metadata from things that were not visible targets of reflection. -->
<IlcTrimMetadata>true</IlcTrimMetadata>

<!-- This disables generation of stack trace metadata that provides textual names in stack traces. -->
<IlcGenerateStackTraceData>false</IlcGenerateStackTraceData>

<!-- Folds method bodies with identical bytes (method body deduplication). -->
<IlcFoldIdenticalMethodBodies>true</IlcFoldIdenticalMethodBodies>

<!-- When generating optimized code, favor smaller code size. -->
<IlcOptimizationPreference>Size</IlcOptimizationPreference>

Phew, that was quite a lot. In the backend project, the memory usage got cut down, on average, by about 5-8Mb. At startup and idle, the application sits at about 45Mb. I can’t say for sure how much it changed on the frontend, since the frontend hasn’t been implemented yet lol. More to come in the next post.

Thank You For Reading :)