Making a search engine in Godot

Welcome back to WebUrbEx, my Godot project where you explore the retro internet.

So I’ve got all of these decidedly unfancy websites…

A website with a royalty-free image of a bunch of women working a machine

but the only way to access them is me manually adding them to the Bookmarks tab at the top. This isn’t particularly scalable, or moddable. Let’s see how we can change that.

So how do scrubs open websites? URLs, aka “the web address”. These are broken down into a bunch of different sections corresponding to different nerd stuff that means stuff. If you don’t care, just scroll down to the next image.

a URL annotated

The first bit is the “scheme”; in practice it’s used to specify either the context of what the browser’s loading, or the protocol… which I’m not going to explain, because that would mean trying to explain the OSI model. Suffice it to say this bit just tells your browser it’s going on the internet to look at a website. There are a bunch of other protocols like SMTP or FTP or RUDP or IPP or other meaningless acronyms, but they’re used for different stuff. You can also put file:// here if you want your browser to look at files on your computer, usually something it just downloaded. Anyway, it’s a bunch of nerd shit, and the only bit you actually need to know is: only put your payment info in if it says HTTPS and has the padlock next to it.

The Top Level Domain could be a country code (such as .uk, .jp, .au, .tk, or .io (which, fun fact, might not exist soon because that country no longer exists, so hopefully itch finds a new top level domain)); basically it has nothing to do with where the site is hosted, just which government they paid for their URL. There are also ones like .com for “company”, .gov for “government”, .edu for “university”, .net for “neckbeard programmers”, and .zip for “why did they make this a domain, this is going to make hacking people a lot easier”, where you buy those from nebulous international bodies, and sometimes they get mashed together. Anyway, this one’s also boring nerd shit and not important; just don’t click on weird links in your emails.

The regular Domain Name is usually the name of the website or at least a secondary name for it. Ideally you want one that’s distinct. It’s probably a good idea to have it be the name people associate with your site so it shows up in search results, but you can also have it be a secret second name for your site if you want. I’m not your dad.

The Sub-Domain can be a lot of things. Wikipedia uses it to split its site up into languages, platforms like Blogspot or Neocities use it to separate the different blogs in their network, some sites use it for jamming a separate store domain onto their site, and a lot will use m.domain to separate off the mobile version of their site. It can be useful, but a lot of sites just stick www on there, meaning “the website”, and then use the next thing…

The Path resembles a file system. If you imagine every website to be a file system hosted on a server, this would be the path to where on that server a page is hosted. Got that? Great. Now, a lot of websites don’t work that way, but it’s an easy enough analogy that there’s no harm in pretending they do.

And finally there are the arguments (queries and fragments), preceded by a ? or a #. Queries are extra data passed on, and fragments are “if you’re on a webpage, go to this subheading”.
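Put together, a full example breaks down like so (this one’s mine, borrowing the Wikipedia subdomain scheme from above):

```
https://en.wikipedia.org/wiki/Hippopotamus?action=history#Etymology

scheme     https
subdomain  en
domain     wikipedia
TLD        .org
path       /wiki/Hippopotamus
query      ?action=history
fragment   #Etymology
```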

Anyway, like, most of that isn’t remotely relevant to what I’m doing, and half of it was probably wrong, but apart from that you can say you’ve learned something now, right?

Anyway those of you who tuned out can come back now.

A hippo

Over in WebUrbEx, things are a little bit less sophisticated on the addressing side. Like I expounded on in the last article, the websites are made up of two files: “content.txt” and “style.txt”, with style.txt being just some config settings and content.txt having all the text, BBCode, and EWA tags that tell the program how to put the site together when it loads it in. To expand a little, the game looks for content.txt in a path built from the folder the Godot executable is in, then /unpacked/, then the folders inside that, taking the site’s name from the subfolder it’s in.

a file system path annotated next to a block of text defining a page
At the moment unpacked is next to the Godot editor exe
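Concretely, the layout looks something like this (using the samplesite example from the screenshots):

```
<folder containing the Godot executable>/
└── unpacked/
    └── samplesite/
        ├── content.txt
        └── style.txt
```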

Again, at the moment I’m specifying all of these relative addresses manually, so let’s automate going through all of the “content.txt” files in any folder in the unpacked directory and grab their relative folder paths so I don’t have to do that anymore.

code
There’s a bit more in the for loop…
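For the gist of it, a minimal version of that scan in Godot 4 looks something like this. The function name and return shape are mine, not the project’s, and as the caption says, the real loop does more:

```gdscript
# Walks <exe folder>/unpacked/ and records every subfolder that holds a
# content.txt, keyed by the subfolder's name. A sketch, not the real code.
func scan_unpacked() -> Dictionary:
    var sites := {}
    var root := OS.get_executable_path().get_base_dir().path_join("unpacked")
    var dir := DirAccess.open(root)
    if dir == null:
        push_error("Couldn't open " + root)
        return sites
    dir.list_dir_begin()
    var entry := dir.get_next()
    while entry != "":
        if dir.current_is_dir() and not entry.begins_with("."):
            var content := root.path_join(entry).path_join("content.txt")
            if FileAccess.file_exists(content):
                sites[entry] = content   # site name -> its content.txt
        entry = dir.get_next()
    dir.list_dir_end()
    return sites
```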

Lovely.

At the moment, unlike real URLs, instead of specifying a domain we’re just entering the file path of the site, so to get to the above site you would have to give the address “retrohost/samplesite”. But at the moment the EWA framework also treats the “/” as the delimiter between the site address and the page you’re on. To stop the program from getting them confused, and to make the site address look more like a web address, let’s have the program save the file path with “.”s in the place of “/”s and only swap them back when accessing the file.

a website with the URL the wrong way around

Yeah, that looks good. But we can do one better. Back in the section on URLs we saw that the domain name works backwards from the way that file systems work: subdomain.domain.tld. If we take the retrohost folder to be the domain, and samplesite to be a subdomain for someone’s blog on retrohost, then really that should be samplesite.retrohost.

code
Could make these static, I suppose…

These two functions in the class that builds the database of sites should do the trick. I just have to remember to always use them.
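My best guess at the shape of those two functions (names invented; note the caption about making them static):

```gdscript
# Address form reverses the folder path and swaps "/" for ".", so
# "retrohost/samplesite" becomes "samplesite.retrohost" and back.
static func path_to_address(path: String) -> String:
    var parts := path.split("/")
    parts.reverse()
    return ".".join(parts)

static func address_to_path(address: String) -> String:
    var parts := address.split(".")
    parts.reverse()
    return "/".join(parts)
```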

a website with the URL the right way around

There we go. I’ll hold off on top level domains in case someone buys samplesite.retrohost.net and puts anything weird on it, but it would be possible.

Right, so we’ve got sites loading from folder paths turned into URLs… wait, this article was supposed to be about search engines…

classic Google

Alright, what’s up skibidi gamer nation, today we’re here to talk about the way that people who are not scrubs find websites. That’s right, it’s search engines, here to feed you hallucinated AI summaries, advertisements, and tangentially related garbage from Quora until you give up and start searching for answers to your questions on Reddit instead. That’s my normal experience, but for some reason Google is fucking up today and giving me actually good results, so you’ll have to imagine I was able to get a screenshot of the garbage pile.

While it’s good that you can now get to sites via URL in WebUrbEx, it’s not exactly convenient or appropriate. WebUrbEx is supposed to be a representation of someone in the present falling down a rabbit hole of looking at a bunch of old and abandoned websites, so that means modern comforts and capabilities… also, search engines have been around since the mid-’90s anyway.

Enter Morrigan. No, it absolutely wasn’t called anything else last post. I know how to double-check that I’m not infringing on any trademarks before using something; what do you think I am, some sort of copyright daredevil?

A scene titled Morrigan Search

So if we want Morrigan to find a page, we need to have it take in whatever’s in the search bar and match that against the title or contents of that page.
Problem: if we do that, all it’d do is spit out all of the pages that match the terms, in alphabetical order. Handy… but not that handy. So, some ground rules (there’s a rough scoring sketch after this list):

  • If part of the term the user searched for shows up multiple times, the result is probably more relevant.
  • We should prioritise matching more specific words (like “Hamburger”) over generic words like “It” or “The”. An easy-to-implement discriminator is checking how long a word is.
  • If sites have a base rank, pages should build on that rank.
  • Case shouldn’t affect search results.
  • Search operators (+ for “exclude anything missing this word”, - for “exclude anything with this word”, and quotes for “make sure this quotation is exactly in the result”) should, on top of excluding results, have more of an effect on priority when ranking other metrics.
  • Of course, if two pages get the same score, it’s probably going to fall back to alphabetical order, but… well, at least it’s consistent behaviour.
  • There needs to be a way to opt certain sites out of search results: something equivalent to robots.txt that “prevents the search engine from crawling the site”. This, while having a nugget of realism in it, serves the gameplay purpose of making it possible to hide sites from the player and make finding things a bit of a puzzle.
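Here’s that promised sketch: a toy version of the scoring covering a few of the bullets (repeats, word length, case, base rank). Every name and weight in it is mine, not the game’s actual code:

```gdscript
# Toy scoring pass: repeats add relevance, longer words count for more,
# case is ignored, and pages build on a base rank. All placeholders.
func score_page(terms: PackedStringArray, page_text: String, base_rank: float) -> float:
    var score := base_rank                # pages inherit the site's rank
    var haystack := page_text.to_lower()  # case shouldn't affect results
    for term in terms:
        var needle := term.to_lower()
        # each occurrence scores, weighted by length so "Hamburger"
        # outranks "It" or "The"
        score += haystack.count(needle) * needle.length()
    return score
```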

Yeah, that seems easy…

Several Days Later

To give myself credit, I almost managed to get through that one without any bizarre or cruel bugs getting in the way of progress.

code written to make errors when faced with a string containing a single comma

Almost. (I knew where this string was coming from and eventually nuked it at the source, but how on earth it got here I could not say.)

So, where were we?

The search engine does a loose pass on all sites’ names, page names (if cached), and site tags, then a much stricter but more thorough pass on what exists in each site’s content. I’m aware this has the potential to miss some results in exchange for a performance gain I have no idea if I even need, but I’d rather err on the safe side, at least until I can tell how well things scale.
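In rough, paraphrased GDScript, that two-pass flow is something like the below; every name here is a stand-in, not the shipped code:

```gdscript
# Cheap pass over cached names/tags first; the stricter full-content pass
# only runs when the metadata pass misses. Fields are assumptions (tags
# assumed to be one space-separated String).
func find_matches(terms: PackedStringArray, sites: Array) -> Array:
    var hits := []
    for site in sites:
        var meta: String = (site["name"] + " " + site["tags"]).to_lower()
        var loose_hit := false
        for term in terms:
            if meta.contains(term.to_lower()):   # loose: any term matches
                loose_hit = true
                break
        if loose_hit:
            hits.append(site)
            continue
        var content: String = site["content"].to_lower()
        var strict_hit := true
        for term in terms:
            if not content.contains(term.to_lower()):  # strict: all terms
                strict_hit = false
                break
        if strict_hit:
            hits.append(site)
    return hits
```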

Here are some snippets of the code in action:

code

The search engine has to respect search operators like quotes, includes, and excludes?

Done, done and done.
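For flavour, here’s one way the operators could be peeled off a raw query before matching. This is my illustration, not the game’s parser:

```gdscript
# Splits a raw query into "quoted phrases", +required words, -excluded
# words, and plain words. RegEx is Godot 4's built-in regex class.
func parse_query(raw: String) -> Dictionary:
    var phrases: Array[String] = []
    var required: Array[String] = []
    var excluded: Array[String] = []
    var plain: Array[String] = []
    var regex := RegEx.new()
    regex.compile("\"([^\"]+)\"")        # pull out quoted phrases first
    for m in regex.search_all(raw):
        phrases.append(m.get_string(1))
    raw = regex.sub(raw, "", true)       # strip the phrases from the query
    for word in raw.split(" ", false):   # then sort words by their prefix
        if word.begins_with("+"):
            required.append(word.substr(1))
        elif word.begins_with("-"):
            excluded.append(word.substr(1))
        else:
            plain.append(word)
    return {"phrases": phrases, "required": required,
            "excluded": excluded, "plain": plain}
```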

We have a loose mode for site tags that doesn’t necessarily need the required terms to match, and a strict mode for pages.

code

This bit generates the list of sites that match the query. The exact checking algorithm is both sprawling and off in its own function, CheckStringMatches().

This function is just one big series of list compares to score search results based on the criteria set out in the bullet list above. The criteria themselves are actually defined in the player-facing class for the search engine scene, so I can edit them in the editor a lot quicker, and they get passed to this via some vectors to cut down on just how many arguments the function has.
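I can only guess at how those editor-facing criteria are exposed, but in GDScript it would look something like this (names and values invented for illustration):

```gdscript
# Hypothetical editor-tweakable weights, bundled into vectors so the
# scoring function doesn't need a dozen separate arguments.
@export var match_weights := Vector3(2.0, 1.0, 0.5)  # e.g. title / tag / body hits
@export var length_bonus := 0.25                     # per character of a matched word
```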

code

While splitting things off into their own functions is good for code reuse (and I do reuse this one in the section below), it’s also good for debugging. There have been so many times in this project where creating what is essentially a single point of failure used by several objects has made working out that I implemented something wrong a breeze.

code

And the end result:

code

This is probably the most “fun” bit of the code.

For each site and page it lists, the search engine pulls up a little blurb of text. For sites it just grabs from the about section, but for pages I decided to have it hunt through each page to find one of the words you searched for and drop in everything following it. It also has some guards to stop a blurb from being 100 line breaks.
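As a sketch of that page-blurb logic (the cap value and all names are invented for the example):

```gdscript
# Finds the first searched word in the page and returns everything after
# it, capped in length and flattened so the blurb isn't 100 line breaks.
func make_blurb(page_text: String, terms: PackedStringArray, max_len: int = 160) -> String:
    var lower := page_text.to_lower()
    for term in terms:
        var at := lower.find(term.to_lower())
        if at != -1:
            return page_text.substr(at, max_len).replace("\n", " ").strip_edges()
    # fallback: no term found, just take the top of the page
    return page_text.substr(0, max_len).replace("\n", " ").strip_edges()
```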

  • 2 search results
  • a full page of search results with the next page button hovered
  • search results showing a link to the site search.morrigan.odd
  • no search results

So that’s how you make a search engine.

Now, I did say at the end of the last post that I was going to release this build after I was done with the search engine but there’s one more thing I want to get done, and it’s on the visual side of things…

Don’t adjust your sets, folks,

-Matt
