In many search engines the data and the indices are stored together, creating a single huge entity. This approach can lead to a number of problems, ranging from difficult backups to performance issues. With these systems, access to the data is also limited to what their respective APIs offer.
Meresco works differently. It uses an index only for what an index does best: returning the identifiers of documents that match a query. Compare it to the index of a book, which gives you the number of the page that covers a certain topic rather than the page itself.
Similarly, the identifiers returned by a Meresco index point to documents in a separate storage. The result is a simple index that, even for millions of documents, typically stays small enough to fit entirely in memory, which gives an obvious speed advantage.
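The split between index and storage can be illustrated with a small sketch. This is not Meresco's API: the `TinyIndex` class, the `storage` dictionary and the `doc:1`/`doc:2` identifiers are all made up for illustration. The point is only that a query yields identifiers, and the documents are then fetched from somewhere else.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical, not Meresco code).

    Maps each term to the set of identifiers of documents containing it.
    It never holds the documents themselves.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, identifier, text):
        # Naive tokenisation: lowercase and split on whitespace.
        for term in text.lower().split():
            self._postings[term].add(identifier)

    def query(self, term):
        # Returns identifiers only, sorted for a stable result.
        return sorted(self._postings.get(term.lower(), set()))


# Stand-in for the separate storage: identifier -> document in native format.
storage = {
    "doc:1": "<record>Storage and index are kept apart</record>",
    "doc:2": "<record>An index returns identifiers</record>",
}

index = TinyIndex()
index.add("doc:1", "storage and index are kept apart")
index.add("doc:2", "an index returns identifiers")

hits = index.query("index")                 # identifiers from the index
documents = [storage[i] for i in hits]      # content from the storage
```

Because the index holds nothing but terms and identifiers, its size grows with the vocabulary and the number of documents, not with the size of the documents.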
The data is stored in a Meresco Storage, which is essentially a well-defined directory structure. Identifiers are used to pinpoint the directory in which the data is stored. This means that the data can live on almost any filesystem (although some, such as ext2 and ext3, impose a limit on the number of subdirectories in a directory).
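How an identifier might pinpoint a directory can be sketched as follows. The layout here is an assumption chosen for illustration, not the scheme Meresco Storage actually uses: the identifier is hashed, and the first characters of the hash become two levels of subdirectories, which keeps the number of entries per directory well below the ext2/3 limit. The function names, the `record.xml` part name and the example identifier are hypothetical as well.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_for(base, identifier):
    """Map an identifier to a directory (assumed layout, not Meresco's)."""
    digest = hashlib.sha1(identifier.encode("utf-8")).hexdigest()
    # Two levels of 256 buckets each spread the documents out, so no
    # single directory collects an excessive number of subdirectories.
    return Path(base) / digest[:2] / digest[2:4] / digest


def put(base, identifier, part_name, data):
    """Store one part of a document as a plain file in its directory."""
    directory = directory_for(base, identifier)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / part_name).write_bytes(data)


def get(base, identifier, part_name):
    """Read a part back; the bytes on disk are the data, no decoding step."""
    return (directory_for(base, identifier) / part_name).read_bytes()


with tempfile.TemporaryDirectory() as base:
    put(base, "oai:example.org:42", "record.xml", b"<record/>")
    restored = get(base, "oai:example.org:42", "record.xml")
    relative_parts = directory_for(base, "oai:example.org:42").relative_to(base).parts
```

Since every part is an ordinary file, any other tool can open it by path without going through the storage code at all.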
Having all data on disk in its native format makes it easier to control and maintain. Data can be read immediately, without having to be decoded or transformed in any way. Data enrichment tools, for example tools that extract metadata from PDF files or digital images, can do their work in the background directly on the data files.
Many systems come with their own caching mechanisms. Meresco Storage instead takes advantage of the disk caching capabilities of modern Unix systems. This results in fast data lookups with no added complexity.
By keeping only identifiers, a Meresco index stays simple, small and fast. The accompanying storage offers fast retrieval of the stored documents in their native formats.