Index Maintenance: Cheaper and Better
Last year Seecr has been on a quest to make maintenance of large integrated search engines easier, cheaper and at the same time better.
We achieved that goal by reorganizing the indexing process and applying a technique called Late Integration. We created an exciting extension for the Solr/Lucene search engine to make that possible.
In a series of blogs, we will present the problem, the solution and the code. This completes our previous series of blogs about the Open Index: Late Integration, Query Resolving and how to find relevant indexes.
Relations and Denormalization
Traditionally (that is, during the last decade), integrated search meant collecting data from various sources and integrating it in an central search engine. This approach works well but runs into problems when the data sets become larger and more divers. And what finally kills this approach is the presence of 1:n relations between data sets.
Traditional databases and triple stores deal with relations explicitly and provide query languages (SQL and SPARQL) that incorporate the notion of relations. With Lucene, the predominant approach is to de-normalize the data in order to create one big index. This denormalization is what becomes the bottleneck when the index grows.
Denormalizing: The Problem
Suppose you have a corporate catalogue of product descriptions and each local branch maintains a database, linking to this central catalogue, adding information about the local stock, price, promotions, reservation, et cetera. Providing Solr/Lucene-based search functionality spanning both the central catalogue as well as the local database would require both data sources to be de-normalized. Doing so runs into a few problems:
- The 1-N nature of the relationship yields an explosion of data. That’s the nature of denormalization. The most prominent example I know of is the travel agency that maintained one index with over 109 trips, resulting from heavy denormalizing destinations, hotels, dates, lengths and features like dog admittance and smoking rooms.
- Both sources have their own update cycle (life cycle). Some data might be constantly updated, for example room availability, while other data is rather stable, such as room descriptions. Feeding both into the same update cycle often limits the update rate of more volatile data.
- Both sources require their own specific expertise. The extraction of the data out of its original environment poses a problem: away from the experts that know it by heart and into the centrally managed domain of the generalists.
It would be so much nicer to have separate indexes: one for the corporate catalogue and one for the local stock. Each update in local stock would require just one simple update in the stock index, just as one item update in the catalogue requires only one update in the catalogue index. So, an catalogue item update need not to go through the denormalize process, avoiding the forced update of all local stock related to this item.
Having separate indexes makes it almost trivial to differentiate the calculation of fields (tokenization, ranking, normalization etc) per index. Also, the update rate can vary easily between indexes, simply because they are independent.
If we only could join the indexes during query time…
Relations in Lucene: Join
If we could bring the notion of relations to Lucene, and we are able to efficiently query along relations, we could avoid denormalization, making the indexing process easier while at the same time creating opportunities to have more specialized indexing when needed.
What is Join?
The idea of joins stems from relational databases (RDBMS) and the structured query language (SQL). The main idea is to normalize the data into separate tables, and to re-combine these tables during queries. Queries express relations between those tables and data is recombined based on the relations. This is called joining.
The same idea is present in RDF triple stores. The Sparql query language allows queries to span multiple relations between concepts. The actual join however is much more natural: it happens implicitly because of the notion of everything being a relation, even individual properties.
Lucene Join
Coincidentally, last years, Solr/Lucene has seen the addition of functionality for joining. This allows one to indicate a specific field in one index, and one specific field in another index, and Lucene would then join documents from both indexes if and when they contain the same value in the indicated fields. But since it is not fast enough, we had to rethink and redesign it for us to work. We needed it to be as fast as any other query in Lucene. So we made it that way.
Speed?
In Solr/Lucene a special utility helps an application programmer to create a query that under the hood collects terms in one index and filters on these terms in another. Using this utility (JoinUtil.createJoinQuery) we run a typical query on our indexes: one containing 12,000,000+ book titles and one containing 25,000,000+ individual books held by libraries in the Netherlands. The question is: for a certain selection, which books are present at a certain library or group of libraries?
This query took 8679 ms to execute.
Regardless of hardware, this is too much. Way too much, as each and every query will come with such a filter. We need it down to 8 ms. That is our target.
A 1000 times faster
The reasons Lucene’s Join is not fast enough lays in the generic nature of it’s approach. It works on terms: strings of random length. It builds sets of terms and makes intersections or unions of them. However, for computers, dealing with strings is expensive. They take a lot of memory, and comparison is slow, especially if almost every string is an URI, with long common prefixes. Secondly, the intersections and unions are made by scanning the terms of an index, which is a slow process.
But Lucene is open source, is rather well though-about, and has healthy discussions between experts. So we started reading and we improved the speed of joining by doing the following:
- Translate terms (strings) into numbers. This will speed up the creation of sets as well as taking the intersection.
- Closely follow the Lucene segmenting architecture to exploit caching per segment on a lowest possible level.
We set out to do so and learned a lot about Lucene and joining algorithms!
Results
We were very surprised by the results: joining has become so fast, it is now a non-issue for us. This means we can use it whenever we decide that maintaining a separate index is beneficial, without worrying about the performance loss of queries.
Exactly how we did it will be the subject of a following, more technical, blog post.
In the next post, we will outline our approach, with real working code, and show how join can be combined orthogonally with other Lucene functionality. At the end, we will contribute this code to the marvellous Lucene-community.
Pingback: The NBC+ Search Platform | Seecr
Hi,
This is very interresting. Just a few questions : what is your roadmap for this improvement? Do you plan to share the code as a Solr plugin? To submit it to the Apache community to make it part of the trunk?
Aurélien,
Thanks for your interest. The roadmap is to continuously improve it. Much of that is already done and running in productions systems. However, my write-up lags behind.
The ranking based on another index (see my reply March 31st) is already finalized and running fine. It allows one to give to separate queries, one for searching, yielding a topic rank, and the other only for ranking, for example on recency, popularity, type, collection, ownership etc. The ranking query may be on a separate (joined) index. This allows us to maintain a light-weight static rank index separate from the main index.
At this moment we are writing a new Collector interface (and implementation) that allows any complicated query to be carried out by a thread-pool. As you might know, Solr/Lucene only parallelizes simple queries, not those using collectors. Code, see here: https://github.com/seecr/meresco-lucene/tree/supercollector/src/org/meresco/lucene.
I have offered my code at the Apache Solr/Lucene group. See https://issues.apache.org/jira/browse/LUCENE-5291. However, I did not hear from them for a long time. If you wish you can take it from here: https://github.com/seecr/meresco-lucene/tree/master/src/org/meresco/lucene. It is well-maintained and kept up-to-date.
We use Lucene’s low level API’s and dropped Solr altogether since it does not really add value to us while it is serially in the way if you want to hack Lucene.
I’ll try to find time for more write-ups.
Erik
Erick,
Thank you for your answer and for sharing your work. I hope that the Lucene/Solr community will consider your plugin : it could be a good improvement for Lucene/Solr. I have some new questions for you : will you keep the GPL license or will you release some part of code under apache license? I have another question about query parsing. If we want to let the user query the data as if it was flattened data : let’s consider the following query : “book_title:LuceneInAction AND library_city:Utrecht” need to be transform to something with a join. Did you also consider this problematic?
Thanks,
Aurelien
Aurélien, if and when our code qualifies for inclusion in Solr/Lucene, I will change the license for that part. Regarding query parsing: that actually is the hardest part. No kidding. The query you suggest only implicitly joins two entities. The design and implementation of a suitable query language is currently one of our metiers. –Erik
Thanks for this post, And me too waiting for your next post 🙂 Cheers!!
Excellent post, Eric. Looking forward to the next part …
I have one problem with these query-time-joins though: The result documents do not have the fields from both indices combined but are retrieved from one index only. In your example scenario one would like to get the price from the local stock index with the description from the central catalog with one (join) query. Seems like this is not possible with Lucene’s joinQuery.
Yes, it would be convenient if, in the example, you would get the information contained in both indexes. In our application scenario we carry over some details from one index to the other. This includes counts, weights and facets but not stored fields, although this is possible.
We generally do not store data in Lucene indexes. We only use Lucene for querying and retrieve the actual data from other systems. So once we get the Lucene results, we grab the data from different places and compose the complente result which is sent to the client.
It is almost trivial to collect all information from all indexes involved. However, the processing of queries spanning multiple indexes is not trivial, and it is this piece of you’ll probably need to modify.
The way Lucene’s default Join implementation performs the Join makes accessing the results very hard if at all possible. The reason for that is that it collects all terms in one index and scans them in the other. This happens while performing the query. The Join implementation I described here is a sequence of separately executing queries on different indexes, and then join them the way you want. This leaves many possibilities for getting and combining results.
Very important and one of the best post Erik. I have a one question, Can we do the Lucene Query Join in distributed environment, and can you also explore the above post.
Yes, you can do distributed join in an efficient way provided that you have only 1:n or 1:1 relations and you have a suitable distributions key. In fact, you’ll run into the trouble every distributed database runs into, and you can solve these problems in basically the same way. This year we will most likely apply a quorum-based distribution policy in one of our own systems. With some luck, I’ll find the time to write something about it.
I am very interested in reading the next post.
Where can I find it?
Miriam, I have the draft ready. It will soon be published. Erik