The web is moving towards linked data. Many data collections are available as Linked Data, including Dutch scientific libraries, museums and archives. What can we do with all this data? What tools do we need? The good news is that Linked Data can be adopted incrementally.
What problem does Linked Data address?
Objects in libraries, museums and archives are increasingly described by experts who are not affiliated with these institutions. The resulting descriptions relate to persons, places and other concepts over which none of the experts or institutions can claim authority. Institutions are no longer the only authority on specific data collections, and they are certainly not authoritative on all the concepts their collections relate to. Maintaining collections of authoritative information is becoming increasingly difficult. Life cycle management of metadata records (possibly even maintaining different versions for different communities) becomes a major challenge. Failing to maintain a collection that is clearly authoritative, yet not isolated, undermines the existence of museums, archives and libraries, and of any other (broker) service that aims to add value.
How does Linked Data solve the problem?
Linked data allows everyone to make statements about everything [RDF Concepts]. It does not encapsulate knowledge about objects in a record, but represents the knowledge as a set of statements about the object. The record becomes a set of statements. This is simple, but fundamental. Consider the following record about a certain piece of art:
Record 5832:
identifier = http://institute.org/987639
title = "A true work of Art"
creator = "V. van Gogh"
This could be represented with (at least) two statements:
http://institute.org/5832 has a title whose value is “A true work of Art”
http://institute.org/5832 has a creator whose value is “V. van Gogh”
The fundamental change here is that record 5832 no longer plays a role when exchanging data. The record has been artificially created to describe an object, but the record itself is not important; only the statements it introduces are. (Linked Data can transparently introduce intermediate objects to group statements; however, these are not manageable items ‘to worry about’ as records were.) Only these sets of statements are exchanged. Maintaining an authoritative collection comes down to carefully selecting sets of statements to join.
Each statement consists of a subject, a predicate and an object. Subject and predicate are always URIs, while the object can be a URI or a value. In the example above, the creator statement could have been:
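As a minimal sketch of this idea, the two statements can be written as plain three-part tuples in Python. The predicate URIs are an assumption (Dublin Core terms are a common choice, but the example record does not name a vocabulary), and no RDF library is involved.

```python
# Predicate URIs are assumed for illustration; the example record names no vocabulary.
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"


def record_to_statements(subject, fields):
    """Turn a flat record into a set of (subject, predicate, object) statements."""
    return {(subject, predicate, value) for predicate, value in fields.items()}


statements = record_to_statements(
    "http://institute.org/5832",
    {DC_TITLE: "A true work of Art", DC_CREATOR: "V. van Gogh"},
)

# A second source that happens to make one of the same statements.
other_source = {("http://institute.org/5832", DC_TITLE, "A true work of Art")}

# Joining collections is a set union; identical statements collapse into one.
merged = statements | other_source
```

Because statements are values in a set, merging two sources needs no record matching or de-duplication step; that is the sense in which the record boundary disappears.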
http://institute.org/5832 has a creator whose URI is info:eu-repo/dai/nl/071792279
For the actual name of this person, we will have to look for statements that have info:eu-repo/dai/nl/071792279 as the subject. The object of such a statement might again be a URI, so we have to repeat the process until we find a value.
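The lookup described here could be sketched as follows. This is an illustration under assumptions: the predicate URIs, the `is_uri` check and the `resolve_value` helper are hypothetical, and the statement linking the DAI to a name is made up to complete the example.

```python
CREATOR = "http://purl.org/dc/terms/creator"  # assumed vocabulary
NAME = "http://xmlns.com/foaf/0.1/name"       # assumed vocabulary

statements = {
    ("http://institute.org/5832", CREATOR, "info:eu-repo/dai/nl/071792279"),
    # Assumed statement from another source, added so the lookup can succeed.
    ("info:eu-repo/dai/nl/071792279", NAME, "V. van Gogh"),
}


def is_uri(term):
    """Crude check for this sketch: known schemes count as URIs."""
    return term.startswith(("http://", "https://", "info:"))


def resolve_value(statements, term, predicate, max_hops=10):
    """Follow `predicate` from `term` until a plain value is found.

    Returns None when nothing is said about a URI, or when the hop limit
    is reached (which also guards against cycles).
    """
    for _ in range(max_hops):
        if not is_uri(term):
            return term
        matches = sorted(o for s, p, o in statements if s == term and p == predicate)
        if not matches:
            return None
        term = matches[0]
    return None


creator = next(
    o for s, p, o in statements
    if s == "http://institute.org/5832" and p == CREATOR
)
name = resolve_value(statements, creator, NAME)
```

A URI for which no statements are available simply stays unresolved (`None` here), which is the case the steps further on handle by showing a link instead of a value.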
How can institutions take advantage of Linked Data?
If the world around an institution is a cloud of Linked Data sources, the center of the cloud is where the institution has most of its authority. Surrounding this authority center are the related data sources on which the institution has less authority. Together we call this the authority cloud.
With this as a reference, take the following small steps:
- Start seeing data collections as statements, both in your Authority Cloud and outside it. Do not worry when they are not in RDF; that is not required.
- Start using global persistent identifiers for all your objects. This allows you and others to make statements about the objects and to have meaningful joins.
- Start gathering triples from the sources within your Authority Cloud in a Triple Store. When sources are not in RDF, just use simple tools to extract triples.
- Populate your local services using the Triple Store to resolve others' statements. For example, while indexing your own metadata, use the Triple Store to create additional search fields, facets, tag clouds etc.
- While displaying objects, turn unresolved statements into clickable links.
- For advanced users: start making use of the Triple Store’s query capabilities for enhancing your services.
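The middle steps (gathering triples, enriching an index, and linking what stays unresolved) might look roughly like this. It is a sketch only: `TripleStore` is a tiny in-memory stand-in rather than a real product, and the vocabulary URIs, helper names and the example.org URI are assumptions.

```python
from collections import defaultdict
from html import escape

CREATOR = "http://purl.org/dc/terms/creator"  # assumed vocabulary
NAME = "http://xmlns.com/foaf/0.1/name"       # assumed vocabulary


class TripleStore:
    """In-memory stand-in for a real triple store (illustration only)."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        pairs = self._by_subject.get(subject, ())
        return sorted(o for p, o in pairs if p == predicate)


def is_uri(term):
    return term.startswith(("http://", "https://", "info:"))


def search_values(store, subject, predicate, label=NAME):
    """Values for a search field or facet, with URIs replaced by known labels."""
    values = []
    for obj in store.objects(subject, predicate):
        labels = store.objects(obj, label) if is_uri(obj) else [obj]
        values.extend(labels or [obj])  # keep the URI when nothing resolves it
    return values


def display(store, obj, label=NAME):
    """Show a value or resolved label; fall back to a clickable link."""
    if not is_uri(obj):
        return escape(obj)
    labels = store.objects(obj, label)
    if labels:
        return escape(labels[0])
    return '<a href="{0}">{0}</a>'.format(escape(obj, quote=True))


# Gather triples from trusted sources.
store = TripleStore()
store.add("http://institute.org/5832", CREATOR, "info:eu-repo/dai/nl/071792279")
store.add("info:eu-repo/dai/nl/071792279", NAME, "V. van Gogh")

# Enrich the index, then render something the store cannot resolve.
facet = search_values(store, "http://institute.org/5832", CREATOR)
link = display(store, "http://example.org/unresolved")
```

The existing indexing and display code stays in charge; the store is only consulted at the two points where a URI needs a human-readable value.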
What tools are needed to deal with Linked Data?
Keep your tools! Unless you are dissatisfied with them, of course, retain your investment. You will, however, need a scalable Triple Store in your own data center. Since this Triple Store contains all the statements you decided need resolving before offering your service, it must be fast and readily available.
In the next installment of this blog, we will outline how MERESCO can be used to implement Linked Data.
4 thoughts on “What to do with Linked Data?”
Could what @jvanvuuren points out be added as step 7:
7. Start using fine grained provenance information?
I wasn’t clear about the fact that I assume equal authority within your Authority Cloud. For example, when your records contain author URIs and you load author information from an institution you trust, you can just merge the triples. That is what happens today (using other technologies), and you can continue doing so. After some time you will definitely want to distinguish who said what, but by that time you will already have your (RDF) tools in place, making life easier.
I got a reply by e-mail from Frits van Latum suggesting this type of graph for dealing with tags being added to objects. I am posting it here assuming Frits agrees:
As stated, a record describes a set of statements about everything (e.g. a set of triples). But apart from its content (e.g. triples about objects), a record is also implicitly used to define the “source” of the data. If you do not bring this information into the equation, you will lose relevant data. This matters most when the triples are subjective: when querying the data, you would like to include characteristics of the “source” of these data (such as experience, role or organization).
This means that if the source of the record is not defined in the record itself, then before you abandon the principle of a “record” you need to store not only the triples in the record, but also the triples about the record.
In theory, a “record” is just an object like any other. In triple terms you can define: “record 5832” contains “triple 123”. You can also define: “record 5832” is supplied by “organization ABC”. This implies that “triple 123” is supplied by “organization ABC”. Only if the latter is stored does the “record” object become irrelevant.
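One way to picture this is to store each statement together with its supplier, as a four-part tuple. The sketch below is an assumption about how that could look, not something proposed in the comment itself: the `Quad` type, the tag predicate and “organization XYZ” are hypothetical.

```python
from collections import namedtuple

# A statement plus the party that supplied it.
Quad = namedtuple("Quad", "subject predicate object source")

TAG = "http://example.org/terms/tag"  # hypothetical predicate for user tags


def load_record(supplier, triples):
    """Attach the supplier to every statement, so the record can be dropped."""
    return [Quad(s, p, o, supplier) for s, p, o in triples]


def said_by(quads, source):
    """Select the statements supplied by one source."""
    return [(q.subject, q.predicate, q.object) for q in quads if q.source == source]


quads = load_record(
    "organization ABC",
    [("http://institute.org/5832", TAG, "impressionism")],
) + load_record(
    "organization XYZ",  # hypothetical second supplier
    [("http://institute.org/5832", TAG, "overrated")],
)

from_abc = said_by(quads, "organization ABC")
```

Once the supplier travels with each statement, queries can filter or weigh subjective statements by who made them without needing the original record.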
Thanks for your remark. I assume that with ‘resolvable’ you mean HTTP+DNS resolvable? If so, I agree that this is important when displaying unresolved statements in a web browser. This is what is suggested in step 5.
If KNAW had a resolver, say http://info.knaw.nl/, then what would the complete URI look like? Something like http://info.knaw.nl/eu-repo/dai/nl/071792279?
When ‘resolvable’ is limited to resolvable within your own Authority Cloud (up to step 4), which is suggested to be in your Triple Store, then it would be different. Then it may be enough to have harvested DAIs from KNAW/NARCIS, and those DAIs will match whether they are HTTP+DNS resolvable or not.
This of course depends on the availability of DAIs for harvesting. I don’t know if there is such a service from KNAW, do you?
As a last remark: there are mechanisms in RDF/XML to make abbreviations, such as namespaces and entities. Defining an abbreviation for ‘info:’ would make the URI resolvable. This is not intended in the example, however; I wanted it to stay as short as possible.
An essential feature of Linked Data is that the URIs are resolvable. This allows for following the links from one URI to other URIs (to go from one node to other nodes). The URI info:eu-repo/dai/nl/071792279 in your example fails in this respect. info URIs are by definition not resolvable, which makes them unsuitable for Linked Data. When you use info URIs and you want your data to be Linked Data, you must combine them with resolvable URIs and relate the URIs with owl:sameAs.
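A rough sketch of that combination, using plain tuples: the http URI below is the hypothetical resolver address floated elsewhere in this discussion, not a known service, and the helper name is made up.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

statements = {
    # Hypothetical resolvable counterpart of the info URI.
    (
        "info:eu-repo/dai/nl/071792279",
        OWL_SAME_AS,
        "http://info.knaw.nl/eu-repo/dai/nl/071792279",
    ),
}


def resolvable_aliases(statements, uri):
    """Return http(s) URIs declared owl:sameAs `uri`, in either direction."""
    aliases = set()
    for s, p, o in statements:
        if p != OWL_SAME_AS:
            continue
        if s == uri:
            aliases.add(o)
        elif o == uri:
            aliases.add(s)
    return sorted(a for a in aliases if a.startswith(("http://", "https://")))


links = resolvable_aliases(statements, "info:eu-repo/dai/nl/071792279")
```

The info URI can keep serving as the matching key inside a collection, while the alias gives browsers and crawlers something to follow.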