A Faster Join for Solr/Lucene

Reducing Index Maintenance Costs with Join.

The previous post introduced the reasons why we want blazingly fast join functionality in Lucene: reduction of index maintenance costs.

This post details how we improved the speed of Lucene’s Query-Time Join a thousandfold. We achieved this by looking at usage patterns instead of at the technology. Lucene’s default Join is a truly Lucene-like way of performing Joins between arbitrary terms in arbitrary fields in two indexes. Our Join more or less turns a Lucene index into a relational table and provides Join functionality as you would expect from a relational database. That might sound restrictive to Lucene adepts, but it offers unprecedented possibilities.

Keys, keys, keys

Lucene joins indexes based on terms. Our first observation is that these terms in fact play the role of keys. In database lingo you would call them primary or foreign keys. Nowadays, most people use either UUIDs or URIs to identify things, but these are hard to deal with: they take up a lot of space, are expensive to compare and do not compress well. Data management software internally always translates long, text-based identifiers to small integers. (As a matter of fact, Lucene also uses them and calls them docid and ord.) Our Join implementation is based on such small, monotonically increasing numbers, and we call them keys.

Multiple Values

The second observation is that Lucene supports join on fields that contain multiple values for one document. But key cells in databases always contain exactly one value for each row. If you need multiple values, you create a separate table with multiple rows sharing the same (foreign) key and then… join it. Since join is a database concept after all, we might want to be consistent and use single-valued fields exclusively. If we need multiple values, we create a separate index and then… join it! So in our Join implementation, every document in a Lucene index gets one and only one (though non-unique) key. This does not restrict you, as long as you are prepared to adapt your data model. Read on.

Performance

The performance gain of using small integers over strings is tremendous. Small integers use significantly less memory and can be fetched, stored and compared with a single machine instruction. Having only single values (one key) per field per document means they can be organized efficiently in one-dimensional arrays or bitsets. Much research in the domain of Information Retrieval deals with fast storage and intersection of such arrays and bitsets, and many of those results are available in Lucene! That works marvelously with our keys.

Translate Identifiers to Single Valued Keys

So how do we go about translating string identifiers to numeric keys? The implementation is downright easy in Lucene if you use the new taxonomy functionality. The taxonomy is a proper dictionary, mapping every term onto a small number: the key. Lucene can store this key very efficiently using the relatively new NumericDocValuesField feature. During indexing, we use it like this to store a key in the field key (pseudo-code):

TaxonomyWriter keysDict = new DirectoryTaxonomyWriter(...);
Document doc = new Document();
// addCategory returns the ordinal for this identifier: that ordinal is our key
long key = keysDict.addCategory(new CategoryPath(<identifier>));
doc.add(new NumericDocValuesField("key", key));
indexWriter_A.addDocument(doc);

It is essential that one uses the same TaxonomyWriter for every index so that identifiers in all indexes get mapped onto the same keys.

(A completely different way of creating keys is to use another database’s key generation mechanism. Virtuoso, for example, exposes its IRI_TO_ID and ID_TO_IRI functions. Using these to obtain keys opens up the possibility of joining between Lucene and Virtuoso. Expect a blog about that!)

Create Single Valued Keys

The restriction to single-valued keys may require changes to the data model. In our case we had the denormalized form in one index:

id:<uri_1>  title:"The Hobbit"  location:<loc_1>,<loc_2>, ...

We split this into two indexes:

id:<uri_1>  title:"The Hobbit"

and:

id:<uri_1>  location:<loc_1>
id:<uri_1>  location:<loc_2>
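
To make this concrete, here is a minimal indexing sketch for this split (a sketch only: the variable names taxonomyDirectory, indexWriter_A and indexWriter_B are assumptions, and the calls follow the Lucene 4.x facet and DocValues APIs). The key is stored in the field key in index A and in otherKey in index B, matching the query examples further down. Because both indexes obtain their keys from one and the same taxonomy, <uri_1> translates to the same number everywhere:

TaxonomyWriter keysDict = new DirectoryTaxonomyWriter(taxonomyDirectory);

// The identifier <uri_1> gets exactly one key, however often we ask for it.
int key = keysDict.addCategory(new CategoryPath("uri_1"));

// Index A: one document per title, carrying the key.
Document titleDoc = new Document();
titleDoc.add(new TextField("title", "The Hobbit", Field.Store.YES));
titleDoc.add(new NumericDocValuesField("key", key));
indexWriter_A.addDocument(titleDoc);

// Index B: one document per location, each carrying the same (non-unique) key.
for (String location : new String[] { "loc_1", "loc_2" }) {
    Document locationDoc = new Document();
    locationDoc.add(new StringField("location", location, Field.Store.YES));
    locationDoc.add(new NumericDocValuesField("otherKey", key));
    indexWriter_B.addDocument(locationDoc);
}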

Querying

Now we have two or more indexes in which the terms we wish to use for joining have been replaced by keys. We can join these indexes at query time by specifying which fields to use for joining. To keep things simple, we assume that one index A simply filters the results from another index B. In practice things are (much) more complicated; too complicated for a blog.

Expressing Join

For join to really work well, we will need a query language that supports join. The query language built into Lucene offers no support for this, but luckily we can do without.  There is another language that does support joins and that is the language consisting of Java classes centered around Query, Filter and Collector.

Collecting Terms… eh Keys

The first step is to collect the keys for the terms we want to match between indexes. This step is straightforward (if you leave out caching and all the devious Java markup, it is just two lines of useful code). We created a Collector that, for each hit, looks up the key and stores it in a bitset. It needs to know the name of the field containing the key. Use it as follows:

KeyCollector keyCollector = new KeyCollector("key");
indexSearcher_A.search(<query_a>, keyCollector);
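
To give an idea of what such a collector looks like, here is a stripped-down sketch (written against the Lucene 4.x Collector and DocValues APIs, and without the caching the real Meresco implementation has):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.OpenBitSet;

// Sketch: for every matching document, record the value of its key field in a bitset.
public class KeyCollector extends Collector {
    private final String keyFieldName;
    private final OpenBitSet collectedKeys = new OpenBitSet();
    private NumericDocValues keyValues;

    public KeyCollector(String keyFieldName) {
        this.keyFieldName = keyFieldName;
    }

    @Override
    public void setNextReader(AtomicReaderContext context) throws IOException {
        // Per segment, fetch the DocValues that hold the keys.
        this.keyValues = context.reader().getNumericDocValues(this.keyFieldName);
    }

    @Override
    public void collect(int doc) throws IOException {
        // The useful work: look up the key of this hit and remember it.
        if (this.keyValues != null) {
            this.collectedKeys.set(this.keyValues.get(doc));
        }
    }

    @Override
    public void setScorer(Scorer scorer) {
        // Scores are not needed for collecting keys.
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public OpenBitSet getCollectedKeys() {
        return this.collectedKeys;
    }
}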

Filtering Terms… eh Keys

The second step is to ask the KeyCollector for a KeyFilter to be used in another index. The filter is also not too complicated if you leave out caching and Java markup. It needs to know the name of the field containing the key.  Use it as follows:

KeyFilter keyFilter = keyCollector.getFilter("otherKey");
TopDocs results = indexSearcher_B.search(<query_b>, keyFilter, <n>);
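
For illustration, the filter handed out by getFilter presumably wraps the collected bitset in something along these lines (again a sketch against the Lucene 4.x Filter API, without caching; the constructor arguments are assumptions):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.BitsFilteredDocIdSet;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.OpenBitSet;

// Sketch: keep only documents whose key was collected on the other index.
public class KeyFilter extends Filter {
    private final OpenBitSet keys;
    private final String keyFieldName;

    public KeyFilter(OpenBitSet keys, String keyFieldName) {
        this.keys = keys;
        this.keyFieldName = keyFieldName;
    }

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
        NumericDocValues keyValues = context.reader().getNumericDocValues(this.keyFieldName);
        int maxDoc = context.reader().maxDoc();
        OpenBitSet matching = new OpenBitSet(maxDoc);
        if (keyValues != null) {
            for (int doc = 0; doc < maxDoc; doc++) {
                if (this.keys.get(keyValues.get(doc))) {
                    matching.set(doc);
                }
            }
        }
        // OpenBitSet is itself a DocIdSet; honour deleted documents via acceptDocs.
        return BitsFilteredDocIdSet.wrap(matching, acceptDocs);
    }
}

With these two sketches together, getFilter("otherKey") could amount to no more than new KeyFilter(getCollectedKeys(), "otherKey").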

Done. Now the results in index B are filtered based on a query on index A.  And it is fast, blazingly fast, thanks to Lucene’s excellent DocValues and Collector APIs.

But it needs to be even faster

Although this duo of KeyCollector and KeyFilter improves the speed of Lucene’s built-in Join by a factor of 50 or so, it is not fast enough. It would get the raw processing time for our 8-second query down to 160 ms, but that is still too much. You’ll need to add all kinds of post-processing to those 160 ms and then you’ll end up way too close to 1 s. Any raw processing time not substantially under 100 ms makes me personally very nervous. It would require many machines to deal with the loads we have.

Caching

With caching, we can get the processing time down by another factor of 20, leaving a mere 8 ms. That yields a total 1000-fold speedup compared to where we started.

How that is achieved might be the subject of the next blog. But you can already peek at Meresco’s Lucene code on GitHub.

Conclusion

By making some observations about the nature of joins, by firmly deciding to follow the relational interpretation of join, and by accepting that anything remains possible if you adapt your data model, we managed to speed up joins 1000-fold.

The result is completely orthogonal to other Lucene functionality such as faceting, sorting, scoring, filtering, etc. All intermediate steps and results are in the hands of the programmer.

The results are about to go into production (March) on a Dutch national library search engine, joining 12,000,000+ titles with 25,000,000+ holdings. The contribution of join to the query response times is far down on the list of bottlenecks: a few ms.

7 thoughts on “A Faster Join for Solr/Lucene”

  1. Jana

    I have 3 cores (relational) running in Solr; I tried joining using SolrJ with {!join}, but performance is too bad.

    Is there a way I can use the Lucene API (as you mentioned) to point at Solr indexes? Any kind of suggestion would really help me.

    Thanks
    Jana

  2. Pingback: Performance index Lucene | Seecr

  3. Mee Cee

    Hi,
    Thank you for this article; I think it can help me a lot.
    Do you have a way to send me a full example of how to index and query the key?

    Thank you MC

  4. Pingback: The NBC+ Search Platform | Seecr

  5. Erik J. Groeneveld

    Josef, thanks for your reply. You have an interesting point here. It would of course be nice if the ranks in both (or more) indexes could influence the final rank. As we speak we are working on a possible solution.

    There are more ways of having index B influence the ranking of A. You could have something like Lucene’s term coordination that basically just ranks higher if more clauses of a boolean query match.

    There is also the possibility to count the number of items on the n-side of the 1:n relation. And, if you are doing so, you could also apply a more complex rule to combine the weights.

    The solution we are currently building is one that assumes a 1:1 relation and simply weighs the ranks from both indexes. This simple approach gives us the freedom to work on the hard parts of making it both CPU and memory-efficient.

    Later this year, we will work on counting relations as well.

    • Josef

      Thanks Erik for your response.
      I see, this is a hard issue to find a good solution for.

      I have just started to investigate possible solutions. Combining the weights from two (or more) indexes is however not a simple problem, especially when resource consumption and speed are an issue. We always have 1:N relations and would need to come up with a generic solution that works across several indexes.
      Need to dive deeper into Lucene, hoping to find answers there.

      Thanks again for your thoughts.

      Josef

  6. Josef

    Thanks a lot for the very interesting and insightful article and thanks for sharing your code.
    I came across the need to use joins (modelled after relational database joins) on different Lucene indexes as well.
    One thing that I did not find a solution for is how to preserve the ranking of search results across different indexes. Let’s say I am searching on index A and index B, which I join using the methods described above. My search criteria are applied to fields on index A and index B.
    However, I would only get ranking on the last searched index (in the case above, index B), but lose the ranking results from index A.

    Did you find or think about a solution to this problem, i.e. combining the ranking of results from index A and index B?

    Thanks a lot and best regards,
    Josef
