Findkit is a crawler-based search toolkit that stores web pages in a search index. But what is an index, and how are pages scored when searching? Let's take a look.
In short, an index is a database mapping from every word on every page to the URLs of the pages where that word appears. Each entry also carries metadata such as the word's position in the text and its term frequency. The index uses a data structure that makes looking up full words extremely fast.
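As a rough illustration of the idea, here's a minimal inverted index sketched in TypeScript. This is not Findkit's actual storage format, just the general shape of the data structure:

```ts
// A minimal sketch of an inverted index: not Findkit's actual storage
// format, just an illustration of the idea.

interface Posting {
    url: string;
    positions: number[]; // word positions within the page text
    termFrequency: number; // how many times the word appears on the page
}

// Map from a word to the pages (and positions) where it appears.
type InvertedIndex = Map<string, Posting[]>;

function indexPage(index: InvertedIndex, url: string, text: string) {
    const words = text.toLowerCase().split(/\W+/).filter(Boolean);
    const positions = new Map<string, number[]>();
    words.forEach((word, i) => {
        const list = positions.get(word) ?? [];
        list.push(i);
        positions.set(word, list);
    });
    for (const [word, pos] of positions) {
        const postings = index.get(word) ?? [];
        postings.push({ url, positions: pos, termFrequency: pos.length });
        index.set(word, postings);
    }
}

const index: InvertedIndex = new Map();
indexPage(index, "https://example.com/cars", "cars and more cars");
console.log(index.get("cars")); // -> the pages where "cars" appears
```

Because all the postings live behind a single map key, finding every page that contains a given full word is one lookup, which is what makes the index so fast.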
Since a single word can appear on multiple pages, a score is calculated to determine which pages are the most relevant to the given search terms. The heart of the scoring is the Okapi BM25 ranking function, sketched in code after the lists below, which gives a higher score when:
- The search terms appear more often on a page
- The search terms match words that are otherwise uncommon in the index
In addition to this, a higher score is given when:
- A match is found in the title
- The search terms appear together as a phrase rather than scattered around the page
- A word matches exactly rather than through a different inflection form
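To make the two core BM25 factors concrete, here is a simplified single-term version of the function. This is illustrative only, not Findkit's exact implementation, and the title, phrase, and exact-match boosts listed above are applied on top of a score like this:

```ts
// A simplified single-term BM25 score. k1 and b are the usual tuning
// parameters; this is illustrative, not Findkit's exact implementation.
function bm25(
    tf: number, // term frequency: how often the term appears on the page
    docLength: number, // number of words on the page
    avgDocLength: number, // average page length in the index
    totalDocs: number, // total pages in the index
    docsWithTerm: number, // pages containing the term
    k1 = 1.2,
    b = 0.75,
): number {
    // Rare terms get a high IDF; very common terms approach zero.
    const idf = Math.log(
        1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5),
    );
    // Term frequency saturates: the 10th occurrence adds less than the 2nd.
    const tfPart =
        (tf * (k1 + 1)) /
        (tf + k1 * (1 - b + (b * docLength) / avgDocLength));
    return idf * tfPart;
}

// A term appearing 3 times on an average-length page, in 10 of 1000 pages:
console.log(bm25(3, 100, 100, 1000, 10)); // rare term -> high score
console.log(bm25(3, 100, 100, 1000, 900)); // common term -> low score
```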
Handling Inflection Forms
For example, how can we find a page containing the word “cars” when we search for “car”? An easy solution would be to do some sort of prefix matching, but it would not work very well. Not all base forms are prefixes of their inflection forms, and matching between two different inflection forms, e.g. “walking” and “walked”, would not work at all. It would also not be optimal for speed, since the index is optimized for finding full words.
Ideally we would do lemmatization, the process of transforming inflection forms into base forms, e.g. “walking” into “walk”: the base forms would be stored in the index, and the same transformation applied to the search terms during search. Unfortunately natural languages are highly irregular, and doing this well can be very compute intensive, possibly requiring an AI-powered solution to get good results, which makes it slow and expensive.
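To see why, consider that a proper lemmatizer essentially has to know every irregular form in the language, either as a large lookup table or as a model that can infer them. A few English examples show forms no simple rule can derive:

```ts
// Lemmatization needs per-language knowledge of irregular forms.
// A real lemmatizer is a large dictionary and/or a model; this tiny
// map only shows why simple suffix rules are not enough.
const lemmas: Record<string, string> = {
    walking: "walk", // regular: a rule could handle this
    walked: "walk",
    went: "go", // irregular: no rule gets from "went" to "go"
    better: "good",
    mice: "mouse",
};
```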
Luckily we can cheat a bit. Since the words stored in the index are not actually visible to users, they are only used for matching behind the scenes, we can store the stems of the words: the part of a word that stays the same across its inflections. The stem can be the actual base form or just some common part. For example, both “automatically” and “automatical” stem to “automat”. This is called stemming.
Stemming algorithms are much simpler and thus faster and cheaper to run. Check out our documentation for the languages that have stemming support. Stemming algorithms are heuristics-based, meaning they are not perfect and can fail on irregular words, but when combined with the other search terms the total score can still produce relevant results.
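As a toy example of how a heuristic stemmer works, a handful of suffix-stripping rules already reproduces the examples above. Real stemming algorithms, such as the Snowball family of stemmers, use far more rules and conditions per language:

```ts
// A toy suffix-stripping stemmer. Real stemmers (e.g. the Snowball
// stemmers) have far more rules and conditions per language; this
// only illustrates the idea.
function stem(word: string): string {
    const suffixes = ["ically", "ical", "ing", "ed", "ly", "s"];
    for (const suffix of suffixes) {
        if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
            return word.slice(0, -suffix.length);
        }
    }
    return word;
}

console.log(stem("automatically")); // "automat"
console.log(stem("automatical")); // "automat"
console.log(stem("walking"), stem("walked")); // "walk" "walk"
console.log(stem("cars")); // "car"
```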
The Last but Not Least
For a “search as you type” experience, the last word requires special handling so that the results keep improving as the search terms are being typed. The last word is matched as a raw prefix to get these results. If we matched it as a full word, it would not get many good matches before it is completely typed out.
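Here's a sketch of how such a query could be matched, reusing the toy InvertedIndex and stem helpers from the earlier sketches: every term except the last is looked up as a full stemmed word, while the last term is compared as a raw prefix:

```ts
// A sketch of "search as you type" matching: every word except the
// last is matched as a full (stemmed) word, the last word as a raw
// prefix. Uses the toy InvertedIndex and stem helpers from above.
function matchQuery(index: InvertedIndex, query: string): Set<string> {
    const terms = query.toLowerCase().trim().split(/\s+/);
    const last = terms.pop() ?? "";
    const urls = new Set<string>();

    // Full-word lookups are single map accesses.
    for (const term of terms) {
        for (const posting of index.get(stem(term)) ?? []) {
            urls.add(posting.url);
        }
    }

    // The still-incomplete last word matches any indexed word it
    // prefixes, so "ca" already finds pages for "car", "cars", "cats".
    // (A real engine would use a sorted term dictionary or a trie here
    // instead of scanning every key.)
    for (const [word, postings] of index) {
        if (word.startsWith(last)) {
            postings.forEach((p) => urls.add(p.url));
        }
    }
    return urls;
}
```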
Improving with AI
I already mentioned that AI tools could be used to implement a better version of stemming, but that is not the best way to leverage AI in search. AI can be used to detect the semantic similarity between two pieces of text. A machine learning model can turn a text into a vector, basically a list of numbers; the semantic similarity between two texts can then be measured by comparing their vectors, and the best matches found with the k-nearest neighbors algorithm.
When vectors are created for both the indexed content and the search terms, we can get a very intelligent search, as the models understand text and push synonyms, context, semantic meaning, etc. closer to each other. Some models can even understand different languages, so it would be possible to search an index populated with German text using English search terms! This is often called Vector Search or Semantic Search.
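As a rough sketch of the mechanics, semantic search boils down to comparing vectors and keeping the nearest ones. Here the page and query vectors are assumed to come from some embedding model; the model call itself is omitted:

```ts
// Cosine similarity between two embedding vectors: values near 1 mean
// similar meaning, values near 0 mean unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force k-nearest-neighbors over the indexed embeddings. The
// embeddings are assumed to come from the same model as the query
// vector; real systems use approximate nearest-neighbor indexes
// instead of a full scan.
function semanticSearch(
    pages: Array<{ url: string; embedding: number[] }>,
    queryEmbedding: number[],
    k: number,
) {
    return pages
        .map((page) => ({
            url: page.url,
            score: cosineSimilarity(page.embedding, queryEmbedding),
        }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k);
}
```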
Read more about AI search in the announcement article.