Index Maintenance: Cheaper and Better
Last year Seecr has been on a quest to make maintenance of large integrated search engines easier, cheaper and at the same time better.
We achieved that goal by reorganizing the indexing process and applying a technique called Late Integration. We created an exciting extension for the Solr/Lucene search engine to make that possible.
In a series of blogs, we will present the problem, the solution and the code. This completes our previous series of blogs about the Open Index: Late Integration, Query Resolving and how to find relevant indexes.
Relations and Denormalization
Traditionally (that is, during the last decade), integrated search meant collecting data from various sources and integrating it in an central search engine. This approach works well but runs into problems when the data sets become larger and more divers. And what finally kills this approach is the presence of 1:n relations between data sets.
Traditional databases and triple stores deal with relations explicitly and provide query languages (SQL and SPARQL) that incorporate the notion of relations. With Lucene, the predominant approach is to de-normalize the data in order to create one big index. This denormalization is what becomes the bottleneck when the index grows.
Denormalizing: The Problem
Suppose you have a corporate catalogue of product descriptions and each local branch maintains a database, linking to this central catalogue, adding information about the local stock, price, promotions, reservation, et cetera. Providing Solr/Lucene-based search functionality spanning both the central catalogue as well as the local database would require both data sources to be de-normalized. Doing so runs into a few problems:
- The 1-N nature of the relationship yields an explosion of data. That’s the nature of denormalization. The most prominent example I know of is the travel agency that maintained one index with over 109 trips, resulting from heavy denormalizing destinations, hotels, dates, lengths and features like dog admittance and smoking rooms.
- Both sources have their own update cycle (life cycle). Some data might be constantly updated, for example room availability, while other data is rather stable, such as room descriptions. Feeding both into the same update cycle often limits the update rate of more volatile data.
- Both sources require their own specific expertise. The extraction of the data out of its original environment poses a problem: away from the experts that know it by heart and into the centrally managed domain of the generalists.
It would be so much nicer to have separate indexes: one for the corporate catalogue and one for the local stock. Each update in local stock would require just one simple update in the stock index, just as one item update in the catalogue requires only one update in the catalogue index. So, an catalogue item update need not to go through the denormalize process, avoiding the forced update of all local stock related to this item.
Having separate indexes makes it almost trivial to differentiate the calculation of fields (tokenization, ranking, normalization etc) per index. Also, the update rate can vary easily between indexes, simply because they are independent.
If we only could join the indexes during query time…
Relations in Lucene: Join
If we could bring the notion of relations to Lucene, and we are able to efficiently query along relations, we could avoid denormalization, making the indexing process easier while at the same time creating opportunities to have more specialized indexing when needed.
What is Join?
The idea of joins stems from relational databases (RDBMS) and the structured query language (SQL). The main idea is to normalize the data into separate tables, and to re-combine these tables during queries. Queries express relations between those tables and data is recombined based on the relations. This is called joining.
The same idea is present in RDF triple stores. The Sparql query language allows queries to span multiple relations between concepts. The actual join however is much more natural: it happens implicitly because of the notion of everything being a relation, even individual properties.
Coincidentally, last years, Solr/Lucene has seen the addition of functionality for joining. This allows one to indicate a specific field in one index, and one specific field in another index, and Lucene would then join documents from both indexes if and when they contain the same value in the indicated fields. But since it is not fast enough, we had to rethink and redesign it for us to work. We needed it to be as fast as any other query in Lucene. So we made it that way.
In Solr/Lucene a special utility helps an application programmer to create a query that under the hood collects terms in one index and filters on these terms in another. Using this utility (JoinUtil.createJoinQuery) we run a typical query on our indexes: one containing 12,000,000+ book titles and one containing 25,000,000+ individual books held by libraries in the Netherlands. The question is: for a certain selection, which books are present at a certain library or group of libraries?
This query took 8679 ms to execute.
Regardless of hardware, this is too much. Way too much, as each and every query will come with such a filter. We need it down to 8 ms. That is our target.
A 1000 times faster
The reasons Lucene’s Join is not fast enough lays in the generic nature of it’s approach. It works on terms: strings of random length. It builds sets of terms and makes intersections or unions of them. However, for computers, dealing with strings is expensive. They take a lot of memory, and comparison is slow, especially if almost every string is an URI, with long common prefixes. Secondly, the intersections and unions are made by scanning the terms of an index, which is a slow process.
But Lucene is open source, is rather well though-about, and has healthy discussions between experts. So we started reading and we improved the speed of joining by doing the following:
- Translate terms (strings) into numbers. This will speed up the creation of sets as well as taking the intersection.
- Closely follow the Lucene segmenting architecture to exploit caching per segment on a lowest possible level.
We set out to do so and learned a lot about Lucene and joining algorithms!
We were very surprised by the results: joining has become so fast, it is now a non-issue for us. This means we can use it whenever we decide that maintaining a separate index is beneficial, without worrying about the performance loss of queries.
Exactly how we did it will be the subject of a following, more technical, blog post.
In the next post, we will outline our approach, with real working code, and show how join can be combined orthogonally with other Lucene functionality. At the end, we will contribute this code to the marvellous Lucene-community.