Advertisement


Locust Swarm - Web Apps, Viral Campaigns, cutting-edge SEO
Home | About | Web Apps | Contact

A High-Performance Hybrid Search Engine Using Perl

Posted November 4, 2010 by Head Hopper and filed under Perl, search

Let's say you have a site with more than 20,000 pages, and you need a search function.

These days, people are so spoiled by Google, they expect search to be lightning fast, and highly accurate. If they have to wait for more than 3 seconds, they start dissing your site. You could solve the problem by using G's site search, that is if you have no pride and dignity left at all.

So you want to build something yourself, and it's gotta be snappy.

The problem is that with 20,000 pages or more, you need to get your game on otherwise your search function is going to take as long as a minute, and deliver crappy results.

You can dynamically grep through your pages on the fly if they are only about 100 or fewer pages on your site. But anything more and you need to have a well-planned, sophisticated solution.

Search functions are generally built using either a reverse index or a vector comparison.

Reverse Index

A reverse index is pretty basic. Let's say your page contains about 500 words. You create a file for each word, and make sure it has the page ID on it.

Let's say the page ID is 42 and one of the words on the page is "mango".

You get a file called mango.txt and it has 42 in it.

If you have another page (say page ID 87) and it also has the word "mango" in it, then this is added to mango.txt and the contents of mango.txt become as follows:

    42
    87

You can then get fancy and start including weights based on the number of occurrences, whether it occurs in the title, in bold text, and so on - in short a very primitive version of what search engines used to do. Calculate a weight using your own primeval Google algorithm and include it in the file:

    42:17
    87:2

This works actually surprisingly well.... up to about 20,000 pages. Thereafter, your "mango" file may contain dozens of lines. Not a problem! But if you have a file with something common like "free" or "plant" (if your niche is marketing or horticulture respectively), then you get thousands of lines in your keyword file. Still not a problem! But then you get people searching for multiple words, each of which has thousands of pages, such as "free marketing tips" or "plant soil water" and your little search engine will take a long time (10+ seconds) to sort it all out, even if you have a blazing fast server.

Vector Space

An alternative is a vector space search engine. This is documented on the web, courtesy of Maciej Ceglowski.

Basically, it uses vectors to compare pages. You draw up a lexicon (a long list of all words that show up at least once) of your site. It would typically be 50,000 words or more, but for simplicity's sake let's say your lexicon is 5 words.

    reindeer monkey cucumber mango vanilla

A page with the words "mango" and "vanilla" would get a vector that looks like this:

    0 0 0 1 1

A page with the word "monkey" only would get a vector that looks like this:

    0 1 0 0 0

And finally a page with the words "cucumber" and "mango" would have a vector like this:

    0 0 1 1 0

You compare these vectors using a mathematical engine called PDL, and it tells you which vectors are most similar. Think of vectors as arrows pointing through three-dimensinal space comprised of words floating around the room.

The result of your comparison will be that the first and third pages are similar, and that's how you compare the search query with the pages. You construct a vector for your search query and a vector for each of your pages.

I should note that if your lexicon is tens of thousands of words you will have to use some clever techniques to draw up your vectors, because average systems can't handle strings that long. In my case, I save the above three vectors by marking only the location of positions that have a word in them, and rebuld the vectors when I need them, as follows:

    4,5

    2

    3,4

So far so good. The problem is that with thousands of pages, even comparing vectors is slow.

So what I did is, I devised a hybrid reverse-index and vector search engine. It collects a list of all pages which have all the query keywords, using standard reverse-index methods. Then it compares this list of all relevant pages found with the search query, using vectors. Clever huh? The result is a fairly speedy, accurate search function. To see it in action, go search at Qondio or Ah Yer! (I should note that due to high traffic Ah Yer! is a little overloaded to begin with so if it's slow it's not because the search function is slow).





Pages
Home
About
Custom Web Apps
SEO
Contact

Copyright Locust Swarm 2010. All rights reserved.
Locust image by an earnest young man named Richard X Thripp.