The Open Index consists of different, independent indexes. They all implement a simple query protocol that integrates the results into one final result list. This post outlines how the relevant indexes within an Open Index are selected for a specific query. Or how they could be selected, for this is just one way of doing it.
One of the key elements of the Open Index is the Meta Index (there may be more than one). It does not index documents; it indexes indexes. It records, among other things, which index uses which vocabularies. It also indexes the vocabularies themselves. Let me explain that.
When an index joins an Open Index, it registers itself with the Meta Index by telling the Meta Index its own location. It also tells the Meta Index which vocabularies it uses to index its documents. Many indexes use the Dublin Core vocabulary for indexing, but more specialized indexes use more specialized vocabularies such as FOAF (for social relationships), Geo (longitude and latitude info) or MusicBrainz (to describe music, performances, concerts). So now the Meta Index knows exactly which indexes specialize in which vocabularies.
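The registration step can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name MetaIndex, its methods, the short vocabulary keys and the example.org locations are all made up here, and the real registration protocol is not described in this post.

```python
from collections import defaultdict


class MetaIndex:
    """Toy Meta Index: remembers which index locations use which vocabularies."""

    def __init__(self):
        # vocabulary name -> set of index locations (hypothetical structure)
        self.indexes_by_vocabulary = defaultdict(set)

    def register(self, location, vocabularies):
        """An index announces its location and the vocabularies it uses."""
        for vocabulary in vocabularies:
            self.indexes_by_vocabulary[vocabulary].add(location)

    def indexes_for(self, vocabulary):
        """Return the registered index locations for one vocabulary, sorted."""
        return sorted(self.indexes_by_vocabulary.get(vocabulary, set()))


meta = MetaIndex()
# Hypothetical locations; "dc" stands for Dublin Core.
meta.register("http://music.example.org/index", ["dc", "musicbrainz"])
meta.register("http://people.example.org/index", ["dc", "foaf"])

print(meta.indexes_for("musicbrainz"))
print(meta.indexes_for("dc"))
```

Both example indexes end up under Dublin Core, while only the music index is listed for MusicBrainz.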
For every vocabulary reference the Meta Index receives, it retrieves the description of the vocabulary itself (does this make it a meta-meta index, then?). That means that it now also knows which fields each vocabulary contains. For example, the MusicBrainz vocabulary contains fields like Album, Artist, Track and so on. For some vocabularies, it might even know the possible values too. The GTAA (an extensive vocabulary for TV etc.), for example, contains approximately 97,000 Persons, 27,000 Names, 14,000 Locations and 18,000 (TV) Makers. The Meta Index knows all these names, and it knows whether a name represents a person or a location.
Querying for Vocabularies
For any given query, the Meta Index can tell which of the indexes in the Open Index will give meaningful results. How is this done? In two ways. Suppose someone enters the query “artist=lennon”. If you send this query to the Meta Index, it will look up which vocabularies have a field named ‘artist’ (ignoring a few problems that arise during matching here), then it will look up the indexes registered for these vocabularies and send you this list of indexes. The next step is to send the same query to these indexes and integrate the results.
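Here is a minimal sketch of that first lookup, from field name to vocabularies to indexes. The two tables, the function names and the example.org locations are assumptions for illustration, and the matching is a naive exact comparison that sidesteps the matching problems mentioned above.

```python
# Hypothetical Meta Index contents: the fields each vocabulary defines ...
FIELDS_BY_VOCABULARY = {
    "dc": {"title", "creator", "subject"},
    "musicbrainz": {"album", "artist", "track"},
    "foaf": {"name", "knows"},
}
# ... and the indexes registered for each vocabulary (made-up locations).
INDEXES_BY_VOCABULARY = {
    "dc": {"http://library.example.org/index", "http://music.example.org/index"},
    "musicbrainz": {"http://music.example.org/index"},
    "foaf": {"http://people.example.org/index"},
}


def parse_field_query(query):
    """Split 'field=value' into a lowercased field name and a value."""
    field, _, value = query.partition("=")
    return field.strip().lower(), value.strip()


def indexes_for_field_query(query):
    """Find vocabularies that define the field, then the indexes using them."""
    field, _value = parse_field_query(query)
    indexes = set()
    for vocabulary, fields in FIELDS_BY_VOCABULARY.items():
        if field in fields:  # naive exact match on the field name
            indexes |= INDEXES_BY_VOCABULARY.get(vocabulary, set())
    return sorted(indexes)


print(indexes_for_field_query("artist=lennon"))
```

The caller would then send “artist=lennon” to each index in the returned list and merge the answers.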
Now suppose you enter a simpler query such as “yvon jaspers”. The Meta Index could look up the phrase “yvon jaspers” and find it in the GTAA's list of names of television Makers. So it gives you the list of GTAA indexes, and you could take this as a suggestion to include these in your query.
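The second lookup, from a free-text query to known vocabulary values, might look roughly like this. The data layout, the suggest function and the index location are assumptions; “example city” is an invented placeholder entry, not real GTAA data.

```python
# Hypothetical store of known values: (vocabulary, category) -> lowercase names.
KNOWN_VALUES = {
    ("gtaa", "maker"): {"yvon jaspers"},
    ("gtaa", "location"): {"example city"},  # invented placeholder entry
}
# Made-up location of an index registered for the GTAA vocabulary.
INDEXES_BY_VOCABULARY = {"gtaa": ["http://tv.example.org/index"]}


def suggest(query):
    """Return one suggestion per vocabulary category that knows the query text."""
    needle = " ".join(query.lower().split())  # normalize case and whitespace
    suggestions = []
    for (vocabulary, category), names in KNOWN_VALUES.items():
        if needle in names:
            suggestions.append({
                "vocabulary": vocabulary,
                "category": category,
                "indexes": INDEXES_BY_VOCABULARY.get(vocabulary, []),
            })
    return suggestions


for hint in suggest("yvon jaspers"):
    print(hint["category"], hint["indexes"])
```

A query the Meta Index does not recognize simply yields an empty list, in which case the caller has to fall back on some default set of indexes.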
Automated versus User-Controlled
The examples above assume that you are in control of querying the Meta Index and of deciding what to do with the hits it gives you. In practice, however, the interaction with the Meta Index will be invisible to users. A search portal might show the suggestions from the Meta Index and leave the user free to direct the query to one or more of the suggested indexes, for example by asking “did you mean to search for TV maker ‘yvon jaspers’?”. Another search portal might simply take the hints from the Meta Index directly, carry out the user's search query on all of them and just show the results.
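The two portal styles can be contrasted in a small sketch. Everything here is a stand-in: ask_meta_index and search_index fake the network calls, and the labels, locations and result strings are made up for illustration.

```python
def ask_meta_index(query):
    """Stand-in for a Meta Index call; returns hypothetical hints."""
    if query.lower() == "yvon jaspers":
        return [{"label": "TV maker", "index": "http://tv.example.org/index"}]
    return []


def search_index(index, query):
    """Stand-in for querying one member index; returns a fake hit."""
    return [f"{index}: result for '{query}'"]


def suggesting_portal(query):
    """User-controlled style: show the hints and let the user decide."""
    return [
        f"Did you mean to search for {hint['label']} '{query}'?"
        for hint in ask_meta_index(query)
    ]


def automatic_portal(query):
    """Automated style: query every hinted index and merge the hits."""
    hits = []
    for hint in ask_meta_index(query):
        hits.extend(search_index(hint["index"], query))
    return hits


print(suggesting_portal("yvon jaspers"))
print(automatic_portal("yvon jaspers"))
```

Both portals use the same Meta Index answer; they differ only in who decides what happens with it.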
Freedom of Design Choices
It all comes down to decoupling design choices: creating flexibility because we cannot see into the future. With current technology (big integrated indexes), the choice of a particular search engine often implies many other choices that you only become aware of later. One such decision is the way the search engine deals with multiple indexes. The Open Index allows such decisions to be made separately. The approach outlined above is only our first take on how we will do it. It could be any other algorithm in the future.
In the next installment of this blog I’ll cover the efficiency and scalability of the Open Index.