Solr focuses on getting the most out of one index type: Lucene. Meresco supports a number of different index types, each specialized for a specific task. Queries are split, each part is processed by the most appropriate index, and the results are integrated. This ensures that all types of queries are processed within tens or at most hundreds of milliseconds.
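To make the split-and-integrate idea concrete, here is a minimal sketch in Python. It is not Meresco's actual API: the `run_query` helper, the index names and the toy lookup functions are all assumptions made for illustration. Each clause goes to one index, each index returns a set of document ids, and the sets are intersected.

```python
def run_query(clauses, indexes):
    """Send each clause to its index and intersect the resulting id sets.

    `clauses` is a list of (index_name, argument) pairs; `indexes` maps an
    index name to a function returning a set of document ids. Both shapes
    are hypothetical, chosen only for this illustration.
    """
    result = None
    for index_name, argument in clauses:
        hits = indexes[index_name](argument)
        result = hits if result is None else result & hits
    return result if result is not None else set()


# Toy stand-ins for a fulltext index and a range index.
indexes = {
    "fulltext": lambda word: {1, 2, 3} if word == "history" else set(),
    "range": lambda bounds: {2, 3, 4} if bounds == (2009, 2010) else set(),
}

combined = run_query([("fulltext", "history"), ("range", (2009, 2010))], indexes)
```

Only documents 2 and 3 satisfy both clauses here. A real system would also need to handle OR and NOT and to order the clauses by cost, which this sketch leaves out.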
Each type of index has distinct and unique properties such as specific query algorithms, optimized access patterns and scalability. We will introduce each index type below together with a short characterization.
Fulltext Index
This index is optimized for queries on combinations of words, literal phrases, words near other words, and so on. It is implemented with Lucene, which is known to scale very well. Meresco helps it scale by keeping it small; see this post about Storage versus Index.
Facet Index
The facet index specializes in drilldown (faceting) queries, dynamic clustering and tag clouds. It produces exact results even on large data sets, which is one of Meresco's unique selling points. Meresco uses custom data structures and algorithms which scale to billions of postings on a single node.
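As an illustration of what an exact drilldown computes, the following sketch counts field values over the set of matching records. It is a naive Python version, not Meresco's custom data structure; the `facet_counts` function and the record layout are assumptions made for this example.

```python
from collections import Counter


def facet_counts(records, hits, field):
    """Count every value of `field` across the matching records (exact, not sampled)."""
    counts = Counter()
    for doc_id in hits:
        for value in records[doc_id].get(field, []):
            counts[value] += 1
    # Most frequent values first, as a drilldown or tag cloud would show them.
    return counts.most_common()


# Hypothetical records: document id -> field -> list of values.
records = {
    1: {"subject": ["history", "art"]},
    2: {"subject": ["history"]},
    3: {"subject": ["physics"]},
}
hits = [1, 2]  # ids returned by some other index for the current query

drilldown = facet_counts(records, hits, "subject")
```

Here `drilldown` holds `history` with a count of 2 and `art` with a count of 1. The counts are exact because every hit is visited; making that fast on billions of postings is where specialized structures come in.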
Dictionary Index
This index supports fast lookup of arbitrary textual information related to keys. It supports simple lookup ‘queries’ only. It is implemented using Berkeley DB, which is known for its good scalability and performance. It is being used to scale up set and metadataPrefix queries in OAI-PMH to tens of millions of records. This post describes the process: Dependable OAI Repositories.
Sorted Dictionary Index
This index supports extremely fast lookup of simple numeric information attached to alphabetically ordered terms. It supports prefix queries such as those needed for auto-complete. It is implemented using a Burst Trie.
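A prefix query over ordered terms can be sketched as follows. Note that this uses a plain sorted list with binary search rather than a Burst Trie, so it only illustrates the query semantics, not the actual implementation; the `prefix_lookup` function and the sample counts are assumptions.

```python
from bisect import bisect_left


def prefix_lookup(terms, counts, prefix, limit=10):
    """Return up to `limit` (term, number) pairs whose term starts with `prefix`.

    `terms` must be sorted; `counts` maps each term to its numeric value.
    """
    results = []
    index = bisect_left(terms, prefix)  # first term >= prefix
    while index < len(terms) and len(results) < limit:
        term = terms[index]
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        results.append((term, counts[term]))
        index += 1
    return results


# Hypothetical term frequencies.
counts = {"meresco": 12, "merge": 3, "metadata": 7, "lucene": 9, "merlin": 1}
terms = sorted(counts)

completions = prefix_lookup(terms, counts, "mer")
```

With this data, `completions` contains `meresco`, `merge` and `merlin` in alphabetical order, each with its number attached, which is what an auto-complete box needs.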
Triple Store
This index supports queries about arbitrary relationships between objects (graph-inference) typically through SPARQL or extensions to CQL. It is implemented with rdflib and OWLIM, the former being simple, the latter being one of the most scalable and fast triple stores around. An application is relating traditional records to social metadata such as tagging, ratings and reviews. A lot is going to happen around here.
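The kind of relationship query meant here can be sketched with a tiny in-memory pattern matcher. This is plain Python and deliberately not the rdflib or OWLIM API, nor SPARQL; the `match` function and the identifiers in the sample triples are made up for the example.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


# Hypothetical social metadata attached to traditional records.
triples = [
    ("record:1", "hasTag", "tag:history"),
    ("record:2", "hasTag", "tag:history"),
    ("record:1", "hasRating", "5"),
]

# Which records carry the tag "history"?
tagged = [s for s, _, _ in match(triples, predicate="hasTag", obj="tag:history")]
```

Here `tagged` lists `record:1` and `record:2`. A real triple store adds indexes per position and inference on top of this basic pattern matching.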
Range Index
This index supports ultra-fast retrieval of data contained in numerical ranges. It supports range queries such as 20090101 < date <= 20101231. Meresco has its own optimized implementation. This index is so small that it scales to billions of documents even on a single node.
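The semantics of a query like 20090101 < date <= 20101231 can be sketched with two binary searches over sorted values. This is only an illustration in Python, not Meresco's optimized implementation; `range_query` and the sample data are assumptions.

```python
from bisect import bisect_right


def range_query(sorted_dates, doc_ids, low, high):
    """Return the ids of documents with low < date <= high.

    `sorted_dates` is ascending and `doc_ids[i]` belongs to `sorted_dates[i]`.
    """
    start = bisect_right(sorted_dates, low)   # first position with date > low
    stop = bisect_right(sorted_dates, high)   # first position with date > high
    return doc_ids[start:stop]


# Hypothetical (date, document id) pairs, dates encoded as YYYYMMDD integers.
entries = sorted([
    (20081215, "a"),
    (20090101, "b"),
    (20090630, "c"),
    (20101231, "d"),
    (20110101, "e"),
])
sorted_dates = [date for date, _ in entries]
doc_ids = [doc_id for _, doc_id in entries]

matches = range_query(sorted_dates, doc_ids, 20090101, 20101231)
```

Document `b` falls exactly on the lower bound and is excluded, while `d` falls on the upper bound and is included, so `matches` holds `c` and `d`.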
N-gram
The n-gram index is capable of performing approximate matches and is hence used for suggestions in ‘Did you mean?’-like solutions. More generally it allows for language-neutral queries. This index sits on top of the Lucene index, but is slated to be replaced by a faster and more specialized one in 2010.
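Approximate matching with n-grams can be sketched as follows: split terms into overlapping character trigrams and suggest the known term that shares the most of them with the query. The padding character, the Jaccard score and the `suggest` function are choices made for this example, not a description of how Meresco scores suggestions.

```python
def ngrams(word, n=3):
    """Return the set of character n-grams of `word`, padded at both ends."""
    padded = "$" * (n - 1) + word.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(query, vocabulary, n=3):
    """Return the vocabulary term whose n-grams overlap most with the query's."""
    query_grams = ngrams(query, n)

    def score(term):
        term_grams = ngrams(term, n)
        # Jaccard similarity: shared n-grams divided by all distinct n-grams.
        return len(query_grams & term_grams) / len(query_grams | term_grams)

    return max(vocabulary, key=score)


# Hypothetical list of indexed terms.
vocabulary = ["lucene", "meresco", "metadata", "facet"]

suggestion = suggest("merseco", vocabulary)
```

The misspelled `merseco` shares five trigrams with `meresco` and only two with `metadata`, so `suggestion` is `meresco`. Because the method works on raw character sequences, it needs no knowledge of the language of the text.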
Meresco can maintain these index types in sync both during batch and real-time updates. Together, these indexes deliver fast results to queries, even if those queries are complicated and demanding such as tag clouds, auto-complete, clustering, term suggestions, did-you-mean and relationship queries.