Recently, Bibliotheek.nl and the Dutch Royal Library started working together to develop the Open Index. Gerard Kuys and I developed the idea.
What does it do?
The Open Index makes maintaining search indexes easier while leaving more room for specialized metadata formats. It makes no choices regarding functionality or technology unless they are essential for the concept to work properly. It is Open in the sense that:
- Any metadata format can play.
- Any technology can be used.
No more monolithic indexes
The main idea is to stop creating one big index for all metadata sets because of these disadvantages:
- All metadata formats are reduced to a common denominator, so the richer metadata is ignored.
- All queries are reduced to what one specific type of index supports, yet not every search is full-text.
- Update processing becomes a bottleneck, especially when trying to avoid the first problem and work around the second.
This causes organisational and technical costs to rise while failing to deliver on specialization.
Specialization
The Open Index does not integrate indexes; it integrates search results. Integrate, not federate! This allows maintainers of specialized sets to make their own choices regarding metadata, technology and update processing when creating their own, independent indexes. These indexes then join a bigger Open Index by providing unified identifiers and a standard search protocol.
Unified Resource Identifiers
Each index must use URIs to identify what is in the index. This is simply good practice that is being applied more and more widely, but here it is essential.
Standard Search Protocol
Each index should implement a standard protocol for searching. The query language may vary (see the second disadvantage of monolithic indexes above), but standardizing on one or two does not hurt. What is essential is that two types of results are supported:
- Type A: a top list of the complete records for the best ranked results.
- Type B: a complete list of only the URIs of all the results.
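The two result types can be sketched as a small interface. This is only a minimal illustration in Python, not part of any Open Index specification: the names `Record`, `Index`, `top_records` and `all_uris`, as well as the toy `InMemoryIndex` with its substring matching and length-based ranking, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Record:
    uri: str    # identifier shared across all indexes
    title: str  # stand-in for the index's own, richer metadata


class Index(Protocol):
    """Hypothetical contract every participating index would offer."""

    def top_records(self, query: str, limit: int) -> list[Record]:
        """Complete records for the best ranked results (top list)."""
        ...

    def all_uris(self, query: str) -> set[str]:
        """Only the URIs, but for every result."""
        ...


class InMemoryIndex:
    """Toy index: a query matches as a case-insensitive substring of the title."""

    def __init__(self, records: list[Record]) -> None:
        self._records = records

    def _matches(self, query: str) -> list[Record]:
        q = query.lower()
        return [r for r in self._records if q in r.title.lower()]

    def top_records(self, query: str, limit: int) -> list[Record]:
        # Real ranking is out of scope; shorter titles simply come first.
        return sorted(self._matches(query), key=lambda r: len(r.title))[:limit]

    def all_uris(self, query: str) -> set[str]:
        return {r.uri for r in self._matches(query)}
```

How each index matches and ranks internally stays its own business; only the shape of the two answers is shared.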
Peer to peer
Indexes are arranged in a peer-to-peer fashion. Any index may handle user queries and is then called the leading index for the duration of that particular query. User queries are answered by returning results of type A (the top list of complete records). The leading index uses type B results (the URI-only lists) from other indexes to fulfill the request. The algorithms are packaged in reusable components and deserve a separate blog post.
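Since the actual algorithms are left for a later post, the following is only one possible reading of how a leading index might combine its own ranked results with URI-only results from its peers: it keeps those of its own records whose URI every peer also reported. The function name `lead_query`, the dictionary-shaped records, the toy data and the intersection strategy itself are all assumptions for illustration, not the Open Index algorithm.

```python
def lead_query(own_ranked, peer_uri_sets, limit=10):
    """Build the top list for the user, restricted by peer results.

    own_ranked    -- the leading index's own records, best first; each is a
                     dict with at least a "uri" key (hypothetical shape)
    peer_uri_sets -- one set of URIs per peer index (URI-only results)
    limit         -- size of the top list returned to the user
    """
    top = []
    for record in own_ranked:
        # Keep a record only if every peer also reported its URI.
        if all(record["uri"] in uris for uris in peer_uri_sets):
            top.append(record)
            if len(top) == limit:
                break
    return top


# Toy data: a catalogue index leads, an availability index acts as peer.
catalogue_hits = [
    {"uri": "urn:example:book:1", "title": "Max Havelaar"},
    {"uri": "urn:example:book:2", "title": "De Aanslag"},
    {"uri": "urn:example:book:3", "title": "Het Achterhuis"},
]
available_now = {"urn:example:book:1", "urn:example:book:3"}

result = lead_query(catalogue_hits, [available_now], limit=2)
print([r["title"] for r in result])  # ['Max Havelaar', 'Het Achterhuis']
```

In this sketch the peers only ship URIs, so the leading index never has to understand another index's metadata format; the shared identifiers are what make the combination possible.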
What’s next?
To explain the concept clearly, I will write some more blog posts about:
- Query resolving: about how the actual integration works.
- Finding and selecting indexes: how indexes find other indexes to work with.
- Efficiency optimizations: what is needed to make it work with large indexes and big query loads.
Or call or e-mail me if you don’t want to wait! I am happy to discuss this subject.