Intro
Welcome back! It’s been about a month since I promised a new blog post, and here it is. A lot has happened since last time. I’ve had to rewrite a large amount of code and restructure the control flow of the search engine/scanner.
The biggest problem with the scanner was that the control flow was all over the place. For example, the only way to gracefully exit the application was either to scan the entire internet or to queue a QueueItem with the command Stop. And that’s really not ideal.
Another issue was that handler classes were being initialized in the wrong methods or member space. Again, not great. So I rewrote the entire control flow for initializing handlers, streamlining the process. See the code example below.
public class ThreadHandler
{
    private readonly DbHandler _dbHandler;
    private readonly Communication _communication;
    private readonly IpScanner _ipScanner;
    private readonly ContentFilter _contentFilter;

    private bool _communicationStopped;
    private bool _ipScannerStopped;
    private bool _contentFilterStopped;
    private bool _stopSignal;

    public ThreadHandler()
    {
        ConcurrentQueue<QueueItem> contentQueue = new();
        ConcurrentQueue<Discarded> discardedQueue = new();

        _dbHandler = new(contentQueue, discardedQueue);
        _communication = new(_dbHandler, this);
        _ipScanner = new(contentQueue, _dbHandler, discardedQueue);
        _contentFilter = new(contentQueue, _dbHandler);
    }
    ...
The code example above is a snippet of the thread handler. As the name suggests, it handles the threads. One requirement I had was to run as much functionality in parallel as possible. The scanner utilizes 32 threads, the filter runs in its own thread, and the database handler also employs multiple threads.
Therefore, having one class that initializes the handlers and waits for them to finish seemed like the right approach. And given how clean the class and control flow turned out, I’m quite satisfied with the result.
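The run-and-wait part of the class isn’t shown above, but the start-and-join pattern it relies on is easy to demonstrate. The sketch below is a self-contained stand-in, not the project’s code (the real scanner starts 32 scanner threads plus the filter and database threads):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class ThreadHandlerSketch
{
    static int _workDone;

    static void Main()
    {
        List<Thread> threads = new();

        // Start a pool of worker threads (32 in the real scanner; 4 here).
        for (int i = 0; i < 4; i++)
        {
            Thread worker = new(() => Interlocked.Increment(ref _workDone));
            worker.Start();
            threads.Add(worker);
        }

        // The handler class then simply blocks until every worker has exited.
        foreach (Thread thread in threads)
        {
            thread.Join();
        }

        Console.WriteLine(_workDone); // prints 4
    }
}
```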
Now, the only way to shut down the application is through a small client I created that interacts with the scanner. It’s a separate console application that allows me to send commands to the scanner, queue a database reindex, and so on. This is why I pass this in the line _communication = new(_dbHandler, this);. The thread handler (the class from the code snippet) has a method to gracefully exit all running threads.
The code below shows the stop method. _stopSignal is a boolean used to control any do-while loops within the thread handler class. _ipScanner, _contentFilter, and _dbHandler can be stopped from the calling thread. However, _communication needs to be stopped from a separate thread because the communication library I’m using cannot be reliably stopped from the same thread on which it runs. The database handler can only be stopped after all other handlers have stopped. This ensures that there are no running jobs still queuing objects to be saved in the database.
public void Stop()
{
    _stopSignal = true;
    _ipScanner.Stop();
    _contentFilter.Stop();
    StopCommunicator();

    bool stopping = true;
    while (stopping)
    {
        if (_communicationStopped && _ipScannerStopped && _contentFilterStopped)
        {
            _dbHandler.Stop();
            stopping = false;
        }
        else
        {
            Thread.Sleep(3000); // Poll every three seconds until all handlers report stopped
        }
    }
}
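StopCommunicator itself isn’t shown in the post. Given that the communication library has to be stopped from a different thread, a minimal sketch might look like this (the method body is my assumption, not the actual implementation):

```csharp
// Hypothetical sketch: stop the communicator from a throwaway thread,
// then flag completion so the polling loop in Stop() can proceed.
private void StopCommunicator()
{
    new Thread(() =>
    {
        _communication.Stop();        // assumed library call
        _communicationStopped = true; // polled by Stop()
    }).Start();
}
```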
New Requirements
I’ve added and changed a few of the requirements.
The application requirements entail:
- The backend should be as multithreaded as realistically possible.
- The application should use as little memory as possible.
- The application should run for as long as absolutely possible with as few problems as possible.
- The runtime should be as lean as possible.
- The application should run as fast as possible (except certain code paths; more on that later).
- The application should use as little CPU as realistically possible.
- I want to use as few external libraries as possible.
- The application should be implemented as a classic three-layer application: view, logic, and data layers.
- An actual search function.
The data and statistics requirements:
- Save the IP of the server.
- Save whether port 80 or 443 is open.
- Save the response code from pinging the server.
- The response code for IPs that don’t reply with Success.
- Title and description of the site running on the server, for both port 80 and 443.
- URL for the website running on the server, for both port 80 and 443.
- What type of server the website is running on, e.g. Nginx, Apache…
- Check for robots.txt.
- HTTP version.
- Certificate issuer country (if the server is using SSL).
- Certificate organization name (who issued the certificate).
- IPv6 (if the server also has an IPv6 address assigned).
- TLS version.
- Cipher suite.
- Key exchange algorithm.
- Public key type (some servers use multiple types; I collect up to 3).
- Accept-Encoding (what kind of encoding the connection accepts, as in negotiate,accept-language,accept-encoding).
- Compression algorithm.
- ALPN (Application-Layer Protocol Negotiation).
- Collect an assortment of “tags” per site.
I’ve decided to create the frontend in Vue3 with TypeScript. It’s been some time since I’ve had my hands in anything non-dotnet related, so Vue3 it is. I was deciding between Vue and React, but the ease of use of Vue and the need for useEffect/useState in React tipped me over the edge toward Vue.
The Database
Initially, the database handler processed every single database insert, select, update, and delete action from a single thread. This approach was sufficient when the scanner ran with 16 threads. However, increasing the scanner to 32 threads overloaded the system.
The solution involved splitting the objects to be handled by different threads. Before diving into that, let me provide a brief overview of the object types:
- Discarded: These objects consist of IPs and response codes that indicate anything other than success. If a server doesn’t respond or returns a non-success code, it’s added to a Discarded object and inserted into the discarded database.
- Unfiltered and Filtered: These are the primary objects used for processing scanned data.
The unfiltered object looks like this:
public class Unfiltered
{
    public int Id { get; set; }
    public string Ip { get; set; } = "";
    public int Port1 { get; set; }
    public int Port2 { get; set; }
    public int Filtered { get; set; }
}
It would be nice if I could save the IP as a number, because that way I could allocate the object as a struct, which would create less pressure on the garbage collector.
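An IPv4 address fits in 32 bits, so packing it into a uint is straightforward. A self-contained sketch of such a conversion (the helper names are mine, not from the project):

```csharp
using System;

static class IpConverter
{
    // Packs a dotted-quad IPv4 string into a 32-bit unsigned integer
    // (big-endian: "1.2.3.4" -> 0x01020304).
    public static uint ToUInt(string ip)
    {
        string[] parts = ip.Split('.');
        if (parts.Length != 4) throw new FormatException("Not an IPv4 address");
        uint result = 0;
        foreach (string part in parts)
        {
            result = (result << 8) | byte.Parse(part);
        }
        return result;
    }

    // Converts the packed integer back to dotted-quad form.
    public static string ToDotted(uint value) =>
        $"{value >> 24}.{(value >> 16) & 0xFF}.{(value >> 8) & 0xFF}.{value & 0xFF}";

    static void Main()
    {
        uint packed = ToUInt("1.2.3.4");
        Console.WriteLine(packed);           // prints 16909060 (0x01020304)
        Console.WriteLine(ToDotted(packed)); // prints 1.2.3.4
    }
}
```

With the string gone, Unfiltered could become a struct of plain integers, which the GC never has to track individually.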
And the filtered object looks like this:
public class Filtered
{
    public string Ip { get; set; } = "";
    public string Title1 { get; set; } = "";
    public string Title2 { get; set; } = "";
    public string Description1 { get; set; } = "";
    public string Description2 { get; set; } = "";
    public string Url1 { get; set; } = "";
    public string Url2 { get; set; } = "";
    public int Port1 { get; set; }
    public int Port2 { get; set; }
    public string ServerType1 { get; set; } = "";
    public string ServerType2 { get; set; } = "";
    public bool RobotsTXT1 { get; set; }
    public bool RobotsTXT2 { get; set; }
    public string HttpVersion1 { get; set; } = ""; // Could be made into an int
    public string HttpVersion2 { get; set; } = "";
    public string ALPN { get; set; } = ""; // Application Layer Protocol Negotiation, which allows clients and servers
                                           // to agree on a common application layer protocol during the TLS handshake process.
    public string CertificateIssuerCountry { get; set; } = "";
    public string CertificateOrganizationName { get; set; } = "";
    public string IpV6 { get; set; } = "";
    public string TlsVersion { get; set; } = ""; // Could be made into an int
    public string CipherSuite { get; set; } = "";
    public string KeyExchangeAlgorithm { get; set; } = "";
    public string PublicKeyType1 { get; set; } = "";
    public string PublicKeyType2 { get; set; } = "";
    public string PublicKeyType3 { get; set; } = "";
    public string AcceptEncoding1 { get; set; } = "";
    public string AcceptEncoding2 { get; set; } = "";
    public string Connection1 { get; set; } = ""; // Fx: keep-alive
    public string Connection2 { get; set; } = "";
}
If a variable ends with either “1” or “2,” it indicates the data originates from port 80 or 443, respectively.
To address performance bottlenecks, I split the database handler into two methods: one for discarded objects and another for unfiltered and filtered objects. This separation was crucial because the scanner often generates a large volume of discarded objects, overwhelming a single queue.
Dividing the queues also simplified subsequent logic. Now, I have two distinct queues: one for discarded objects and another for all other items. The discarded object queue benefits from multiple consumer threads, enabling rapid processing. Two consumers have proven sufficient for this task. The queue for unfiltered and filtered objects remains consistently low, averaging less than two items per second.
Each instance of the discarded database method initializes a new discarded database for exclusive use by that method. The code example below illustrates this process:
private void RunDiscarded(object obj)
{
    DiscardedDbHandlerSetting discardedDbHandlerSetting = (DiscardedDbHandlerSetting)obj;
    Console.WriteLine($"Discarded DbHandler started with thread: ({discardedDbHandlerSetting.ThreadId})");

    string connectionString = CreateDiscardedDb(discardedDbHandlerSetting.ThreadId); // Get a new database for this thread

    while (!_stop)
    {
        if (_discardedQueue.IsEmpty || _pause)
        {
            Thread.Sleep(10);
            _paused = true;
            continue;
        }

        _discardedQueue.TryDequeue(out Discarded? queueItem);
        if (queueItem is null) { continue; }

        InsertDiscarded(queueItem, connectionString);
    }

    discardedDbHandlerSetting.Handle!.Set();
    Console.WriteLine("Discarded DbHandler stopped.");
}
It’s pretty simple, really. This method can be initiated by multiple threads without creating any deadlock, since each thread inserts only into its own database.
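The per-thread-database idea is easy to demo in isolation. Below is a self-contained sketch (not the project’s code) in which two consumers drain one shared ConcurrentQueue, each writing into its own private store, a stand-in for its own SQLite file, so the consumers never contend on writes:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class DiscardedDemo
{
    static readonly ConcurrentQueue<int> Queue = new();
    static volatile bool _stop;

    static void Main()
    {
        for (int i = 0; i < 1000; i++) Queue.Enqueue(i);

        List<Thread> consumers = new();
        var stores = new List<int>[] { new(), new() };

        // Two consumers, each with an exclusive store (stand-in for a per-thread database).
        for (int id = 0; id < 2; id++)
        {
            List<int> store = stores[id];
            Thread t = new(() =>
            {
                while (!_stop || !Queue.IsEmpty)
                {
                    if (Queue.TryDequeue(out int item)) store.Add(item);
                    else Thread.Sleep(1);
                }
            });
            t.Start();
            consumers.Add(t);
        }

        _stop = true;
        foreach (Thread t in consumers) t.Join();

        // Every item ends up in exactly one store.
        Console.WriteLine(stores[0].Count + stores[1].Count); // prints 1000
    }
}
```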
And the method that creates a discarded database is also rather simple. I don’t think I need to explain it too much:
private string CreateDiscardedDb(int threadNumber)
{
    string databaseName = $"Data Source=../../../../Models/Discarded{threadNumber}.db";
    const string createStatement = "CREATE TABLE IF NOT EXISTS Discarded (Id INTEGER NOT NULL, Ip TEXT NOT NULL, ResponseCode INTEGER NOT NULL, PRIMARY KEY(Id AUTOINCREMENT))";

    _discardedConnectionStrings.Add(databaseName);

    using SqliteConnection connection = new(databaseName);
    connection.Open();

    using SqliteCommand command = new(createStatement, connection);
    command.ExecuteNonQuery();

    return databaseName;
}
The line _discardedConnectionStrings.Add(databaseName); adds the name of each newly created discarded database to a list called _discardedConnectionStrings. This list proves invaluable for another method that counts rows within these discarded databases, allowing for efficient tracking and retrieval of discarded data.
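The row-counting method itself isn’t shown in the post. Assuming Microsoft.Data.Sqlite and the list populated above, it might look roughly like this sketch (the method name CountDiscarded is mine):

```csharp
// Hypothetical: sum the rows across every per-thread discarded database.
private long CountDiscarded()
{
    long total = 0;
    foreach (string connectionString in _discardedConnectionStrings)
    {
        using SqliteConnection connection = new(connectionString);
        connection.Open();
        using SqliteCommand command = new("SELECT COUNT(*) FROM Discarded", connection);
        total += (long)command.ExecuteScalar()!; // COUNT(*) comes back as a long
    }
    return total;
}
```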
The Search Function
Implementing a search function required some ingenuity given the constraints of the current setup. While not ideal, the approach is remarkably straightforward:
- User Input: When a user enters a search term (e.g., “C#” or “beef”), the search function retrieves rows from the database whose title or description field is non-empty.
- Fuzzy Matching: A fuzzy search algorithm analyzes the title and description fields, comparing them to the user’s input.
- Score Threshold: If the fuzzy search score exceeds a threshold of 75 (indicating a strong match), the corresponding URL, title, and description are packaged into a “search object” and sent back to the user.
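The post doesn’t say which fuzzy algorithm is used, so here is a self-contained stand-in: a Levenshtein-based similarity score scaled to 0–100, which would then be compared against the 75 threshold before a row is packaged into a search object.

```csharp
using System;

static class FuzzyScore
{
    // Similarity as a percentage: 100 = identical, 0 = completely different.
    public static int Score(string a, string b)
    {
        a = a.ToLowerInvariant();
        b = b.ToLowerInvariant();
        int maxLen = Math.Max(a.Length, b.Length);
        if (maxLen == 0) return 100;
        return (int)Math.Round(100.0 * (maxLen - Levenshtein(a, b)) / maxLen);
    }

    // Classic two-row dynamic-programming edit distance.
    private static int Levenshtein(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }

    static void Main()
    {
        Console.WriteLine(Score("beef", "beef"));    // prints 100
        Console.WriteLine(Score("nginx", "nginx!")); // prints 83, above the 75 cutoff
    }
}
```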
While functional, the current search implementation faces performance limitations:
- Row-by-row Retrieval: Retrieving rows individually from the database can be inefficient. Chunking together row selects could significantly improve speed.
- Tag System: Implementing a tag system would enhance search accuracy and relevance. Analyzing website content and building a dictionary of words (excluding common ones like “and,” “yes,” or “.”) could provide valuable insights into a site’s topic.
Frontend
Sorry guys, but the frontend will be showcased in the next blog post.