#TIL: Search Indexing

A number of the new APIs and projects I’m working with include some level of searchable index, using utilities like Lucene, ElasticSearch, and Nest. I know absolutely nothing about these things, so I decided to go on a bit of a fact-finding mission to get myself more acquainted. Starting at the basics, I first learned about Search Indexing.

Learning mostly from the wikipedia article, the main idea behind search indexing is to create a much smaller, searchable resource that points you towards the actual document you’re trying to find. Much like how indexes work in a database server, an index consists of an abbreviated form of the document (keywords, suffixes, etc.) and a pointer to the original document.

Because the dataset is much smaller compared to the original collection of documents, a computer can search through the index significantly faster compared to if it had to search through each document in a collection every query. Furthermore, we can structure the index to optimize for speed using a tree structure, hash table, or any other optimized structure.

Often, search engines will use an Inverted Index to locate relevent documents, and then later rank these documents based on relevance. An inverted index is simply a structure that stores the fact that a document has something in it. If we search for “the cow says moo”, an inverted index will return all documents have have each of those words within them at least once. Some inverted indexes also contain information such as term frequency or position of the terms in the document, which can then be used for ranking. This gives the search engine a quick way of identifying the documents it needs to rank.

Of course, to build this index a search engine also needs to parse the documents that it can search. This consists of tokenization (identifying individual words and phrases), forward indexing, and then the building of the inverted index. Forward indexing is simply a list of a document and each of its relevant terms, so that it can easily be turned into an inverted index.

The tokenization process may also take into account other formatting, such as document titles and fonts, to give terms a different priority in the index. If a document has a clear title, the terms in that title may be given priority in the index. That way future searches for those terms can favour documents that have those words in their title over documents that may just mention them in the subtext.

Lucene (specifically Lucene.Net) is the framework we’ve been using to provide all this functionality for us – there will be a future TIL on how exactly we query these indexes.