Open Index: Query Resolving
This second installment about the Open Index deals with how search queries are processed. Each index contains RDF descriptions (metadata) for objects identified by URIs (identifiers). The objects themselves are not in the index; only the metadata and the URIs are.
Late integration means that integration happens late in the process: during query processing. This is quite different from integrating metadata databases. It is also quite different from the current practice of leaving the databases in place and integrating the metadata into one monolithic index.
Wide or deep
In the Open Index, an index that contains little metadata for many objects is called a wide index. An index that contains much metadata for a small set of objects is called a deep index. Current-style indexes are often both wide and deep, as they try to encompass everything. Indexes may contain overlapping sets of URIs; for example, one index might enrich another.
The index that receives a query from a client (e.g. a portal) is called the leading index. It determines which other indexes to involve, and it takes care of the final ranking and faceting. It returns to the user a top list of matching records, along with facet data.
The leading index executes the query itself and also sends it to one or more other indexes. It asks the other indexes to return only the URIs (conceptually). The leading index integrates these sets using set arithmetic, resulting in one set of hits. This is the same arithmetic that is found deep inside search engines.
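To make the set arithmetic concrete, here is a minimal sketch in Python. It assumes each index can be modelled as a mapping from a query term to a set of URIs; the index contents, the `urn:` identifiers and the function names are all hypothetical and only serve as illustration. Which operation applies (intersection or union) depends on how the query combines its clauses.

```python
from typing import Dict, Iterable, Set

# Hypothetical stand-in for an index: query term -> set of matching URIs.
Index = Dict[str, Set[str]]


def hits_for(index: Index, term: str) -> Set[str]:
    """Return only the URIs an index yields for a term (no metadata)."""
    return set(index.get(term, set()))


def combine_and(uri_sets: Iterable[Set[str]]) -> Set[str]:
    """AND across indexes: a URI must be a hit in every set."""
    sets = list(uri_sets)
    return set.intersection(*sets) if sets else set()


def combine_or(uri_sets: Iterable[Set[str]]) -> Set[str]:
    """OR across indexes: a URI must be a hit in at least one set."""
    return set().union(*uri_sets)


# A wide index with titles and a deep index with annotations (made-up data).
wide_index: Index = {"darwin": {"urn:a", "urn:b", "urn:c"}}
deep_index: Index = {"finch": {"urn:b", "urn:c", "urn:d"}}

leading_hits = hits_for(wide_index, "darwin")
remote_hits = hits_for(deep_index, "finch")

both = combine_and([leading_hits, remote_hits])
either = combine_or([leading_hits, remote_hits])
```

Here `both` contains the URIs known to match in the wide index and in the deep index, while `either` contains every URI that matched anywhere.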
Based on the set of hits, the leading index adds facets. These are correct and complete facets, not estimates. (How facets can be distributed is the subject of a later, more advanced blog.)
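The facets are exact because they are counted over the complete, merged set of hits rather than over a sample. A minimal sketch, assuming the leading index holds the facet values per URI in a simple mapping (the field values and URIs are made up):

```python
from collections import Counter
from typing import Dict, List, Set


def facet_counts(hits: Set[str], facet_values: Dict[str, List[str]]) -> Counter:
    """Count facet values over every URI in the final hit set."""
    counts: Counter = Counter()
    for uri in hits:
        # A URI without a value for this facet simply contributes nothing.
        for value in facet_values.get(uri, []):
            counts[value] += 1
    return counts


# Hypothetical facet field "type", as known to the leading index.
type_facet = {
    "urn:a": ["book"],
    "urn:b": ["book"],
    "urn:c": ["article"],
}

final_hits = {"urn:b", "urn:c"}
counts = facet_counts(final_hits, type_facet)
```

Note that `urn:a` is not counted, because it is not part of the final set of hits.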
The Open Index is technology agnostic and derives its approach to integrating relevancy from the way indexes are intended to be used. It makes the following assumptions:
- The deeper the index, the more relevant its hits are. It is more specialized.
- The more indexes yield hits on a URI, the more relevant it is.
- Native relevancy of the leading index can be used when the width of this index determines the full scope of the query.
A later blog will go into more detail about relevancy.
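As a rough illustration of the first two assumptions only, the sketch below gives each index a hypothetical numeric depth weight and scores a URI by summing the weights of the indexes that yielded it. The weights, names and scoring formula are assumptions made for this example; they are not the Open Index's actual relevancy model, and the native relevancy of the leading index is left out.

```python
from typing import Dict, List, Set


def rank(hits_per_index: Dict[str, Set[str]], depth: Dict[str, float]) -> List[str]:
    """Order URIs by the summed depth weight of the indexes that returned them."""
    scores: Dict[str, float] = {}
    for index_name, uris in hits_per_index.items():
        for uri in uris:
            # Deeper indexes add more; every extra index adds something.
            scores[uri] = scores.get(uri, 0.0) + depth[index_name]
    # Highest score first; ties are broken by URI to keep the order stable.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


hits_per_index = {
    "wide": {"urn:a", "urn:b"},
    "deep": {"urn:b", "urn:c"},
}
depth = {"wide": 1.0, "deep": 3.0}  # hypothetical weights

ranking = rank(hits_per_index, depth)
```

With these made-up weights, `urn:b` comes first because both indexes yield it, followed by `urn:c` (deep index only) and `urn:a` (wide index only).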
Once a top list of hits has been determined, the leading index gathers the RDF documents for this top list. It does so by asking the same indexes for their parts; it then merges the results into one RDF description and sends this to the client.
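Merging the parts can be pictured as taking the union of the triples each index returns for a URI. The sketch below models triples as plain Python tuples instead of using an RDF library; the predicates and values are hypothetical. It also assumes there are no blank nodes, which a real RDF merge would have to keep apart.

```python
from typing import Iterable, Set, Tuple

# A triple as (subject, predicate, object); a description is a set of triples.
Triple = Tuple[str, str, str]


def merge_descriptions(parts: Iterable[Set[Triple]]) -> Set[Triple]:
    """Union the partial descriptions; identical triples collapse into one."""
    merged: Set[Triple] = set()
    for part in parts:
        merged |= part
    return merged


# Hypothetical parts returned by a wide and a deep index for the same URI.
from_wide = {
    ("urn:b", "http://example.org/title", "On the Origin of Species"),
}
from_deep = {
    ("urn:b", "http://example.org/title", "On the Origin of Species"),
    ("urn:b", "http://example.org/annotation", "Mentions Galapagos finches"),
}

description = merge_descriptions([from_wide, from_deep])
```

The title triple occurs in both parts but appears only once in the merged description, next to the annotation that only the deep index knows about.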
In the following installment of this blog I will describe how indexes are found and how optimizations make the basic idea outlined so far scale up.