While constructing the semantic architecture of the IRL Linked Data platform, we make a clear distinction between (i) the curation of the encoded data contained in the records as well as the long-term preservation thereof, managed by a digital archivist, and (ii) the analysis and interpretation of that data to answer specific questions for historians and researchers. To support the two processes and to maintain the clear separation of concerns, two distinct but interrelated knowledge bases are developed [1].
The first knowledge base contains the records encoded in RDF, using a “flat” ontology that captures the information in those records lexically. During encoding, any noise such as errors or missing values is retained to preserve the original historical record and its provenance. This first knowledge base then serves as input for populating a second knowledge base, which uses more expressive ontologies and applies specific assumptions and interpretations of values prior to the data analysis.
In this post, we present the ontology for representing vital records, which we will call the records ontology from now on, and elaborate on its construction and validation. We had to create this ontology because, to the best of our knowledge, no ontology for vital records was available.
The latest version of the records ontology can be found by following http://www.purl.org/net/irish-record-linkage/records
Construction of the records ontology
Births, deaths and marriages were captured per district (within a union, within a county) as single records on register pages. A page can contain up to 10 records, after which it is signed off by the registrar and sent to the superintendent registrar for inspection and validation. To create a first version of the records ontology, we merely “lifted” the information one could see on one such register page to an ontology.
To minimize interpretation, we chose to develop a “flat” ontology, which means that most information found on such a register page is captured as literals. For example, instead of creating a concept Person that has a forename and a surname, we relate these attributes directly to the concept Record.
For the records ontology, we have defined the following concepts:
- A RegisterPage, which contains records.
- A Record, which represents the different types of records. Each record must belong to a register page, and each register page can have zero (blank pages) or more records.
- A Certificate and a MarriageRecord, both subclasses of the concept Record. The former has only one person as its subject, the latter two. The two concepts are disjoint, which means that no instance of a certificate can be an instance of a marriage record and vice versa.
- Finally, two disjoint subclasses of the concept Record that represent birth and death records: BirthRecord and DeathRecord.
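The constraints in the list above can be illustrated with a minimal Python sketch that treats a knowledge base as a set of (subject, predicate, object) tuples. The `rec:` and `ex:` prefixes, the instance names, and the helper `violations` are hypothetical and only serve the illustration. The check is closed-world and hand-written, so it is not a substitute for an OWL reasoner.

```python
# Hypothetical prefixed names; only the class names and hasRecord come from the post.
TYPE = "rdf:type"
DISJOINT = [
    ("rec:Certificate", "rec:MarriageRecord"),
    ("rec:BirthRecord", "rec:DeathRecord"),
]
RECORD_CLASSES = {
    "rec:Record",
    "rec:Certificate",
    "rec:MarriageRecord",
    "rec:BirthRecord",
    "rec:DeathRecord",
}


def violations(triples):
    """Return readable problems found in a set of (s, p, o) triples."""
    types = {}
    for s, p, o in triples:
        if p == TYPE:
            types.setdefault(s, set()).add(o)
    # Every object of hasRecord is a record that sits on some register page.
    on_page = {o for s, p, o in triples if p == "rec:hasRecord"}

    problems = []
    for subject in sorted(types):
        classes = types[subject]
        for a, b in DISJOINT:
            if a in classes and b in classes:
                problems.append(f"{subject} is typed as both {a} and {b}")
        if classes & RECORD_CLASSES and subject not in on_page:
            problems.append(f"{subject} does not belong to any register page")
    return problems


example = {
    ("ex:page1", TYPE, "rec:RegisterPage"),
    ("ex:page1", "rec:hasRecord", "ex:r1"),
    ("ex:r1", TYPE, "rec:BirthRecord"),
    ("ex:r2", TYPE, "rec:BirthRecord"),
    ("ex:r2", TYPE, "rec:DeathRecord"),
}
for problem in violations(example):
    print(problem)
```

Here `ex:r1` passes both checks, while `ex:r2` is reported twice: once for being typed with two disjoint classes and once for not being attached to a register page.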
The only object properties (relations between two concepts) we needed relate records to register pages. All other properties are datatype properties. Each datatype property is attached to the most general concept to which it applies. For instance, all records are signed off by a registrar on a certain date. The date of registration and the information on the registrar are therefore related to the concept Record, so that all subtypes of this class inherit these properties.
<owl:DatatypeProperty rdf:about="&records;dateOfRegistration">
    <rdfs:label rdf:datatype="&xsd;string">date of registration</rdfs:label>
    <rdfs:comment rdf:datatype="&xsd;string">The registration date of a record.</rdfs:comment>
    <rdfs:domain rdf:resource="&records;Record"/>
    <rdfs:range rdf:resource="&rdfs;Literal"/>
</owl:DatatypeProperty>
One of the challenges is to capture the domain as well as possible while still maintaining a valid OWL 2 [2] ontology. As explained by Motik and Horrocks in [3], it is difficult to reason about date and time intervals, and therefore only specific points in time (captured by both xsd:dateTime and xsd:dateTimeStamp) were “amenable for implementation” and those “can be handled by techniques similar to the ones for numbers.” Together with the digital archivist, we chose not to capture dates mentioned in records as instances of xsd:dateTime, as we do not know the exact times and were uncomfortable encoding “default” times. We thus declared the range of these properties as rdfs:Literal, but provided encoding guidelines that strongly encourage the use of xsd:date.
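One way to picture such an encoding guideline is the following minimal Python sketch. It types a value as xsd:date only when it is a valid calendar date and otherwise keeps the transcribed text as an untyped literal, so that noise is preserved. The helper name `encode_date` and the sample values are hypothetical, and the actual guidelines given to the archivist may differ.

```python
from datetime import date

XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def encode_date(raw):
    """Return (lexical form, datatype IRI or None) for a date as transcribed."""
    try:
        parsed = date.fromisoformat(raw.strip())
    except ValueError:
        # Not a valid calendar date: keep the value exactly as written, untyped.
        return raw, None
    # Valid date: emit the canonical YYYY-MM-DD form and type it as xsd:date.
    return parsed.isoformat(), XSD_DATE


for value in ["1864-01-02", "1864-02-30", "2nd Jan 1864"]:
    print(value, "->", encode_date(value))
```

Only the first sample is typed as xsd:date. The impossible date and the free-text date are kept verbatim, which is permitted because the range of the property is rdfs:Literal.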
Assessing the ontology
The records ontology was checked for problems using the OOPS! Ontology Pitfall Scanner [4] (http://oeg-lia3.dia.fi.upm.es/oops/catalogue.jsp). OOPS! automatically scans an ontology for common or potential problems, drawing on experience from many ontology projects.
Minor problems surfaced, such as missing documentation (comments and labels) and missing ontology annotations, and these were quickly rectified. While using OOPS!, an interesting question arose. Initially, we did not provide an inverse relation for the predicate hasRecord, which has RegisterPage as its domain and Record as its range. OOPS! suggested that we declare an inverse relation, which is useful for browsing through the data by means of, for instance, faceted browsing.
After we declared the inverse property and re-evaluated the ontology, however, the framework suggested that we also explicitly declare the domain and range of the inverse property. Ontologies are supposed to be minimally redundant, and a reasoner can infer the domain and range of the inverse property from the original relation. This redundancy can lead to errors if one changes the domain and range of one relation but not the other. Though one can debate whether this really poses a problem (the redundancy does help a human reader understand the inverse relation), we decided to declare those “missing” domains and ranges.
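The claim that the domain and range of an inverse property are derivable can be illustrated with a small Python sketch over axiom tuples. The inverse property name `rec:isRecordOf` and the `rec:` prefix are hypothetical, since the post does not name the inverse, and the function mimics only this one inference rather than a full reasoner.

```python
def infer_inverse_signature(axioms, prop, inverse):
    """Derive domain/range axioms for `inverse` by swapping those of `prop`."""
    inferred = set()
    for s, p, o in axioms:
        if s != prop:
            continue
        if p == "rdfs:domain":
            # The domain of a property is the range of its inverse.
            inferred.add((inverse, "rdfs:range", o))
        elif p == "rdfs:range":
            # The range of a property is the domain of its inverse.
            inferred.add((inverse, "rdfs:domain", o))
    return inferred


axioms = {
    ("rec:hasRecord", "rdfs:domain", "rec:RegisterPage"),
    ("rec:hasRecord", "rdfs:range", "rec:Record"),
    # Explicit, redundant declarations for the hypothetical inverse.
    ("rec:isRecordOf", "rdfs:domain", "rec:Record"),
    ("rec:isRecordOf", "rdfs:range", "rec:RegisterPage"),
}

inferred = infer_inverse_signature(axioms, "rec:hasRecord", "rec:isRecordOf")
declared = {a for a in axioms if a[0] == "rec:isRecordOf"}
print(inferred == declared)
```

Comparing the inferred axioms with the declared ones, as in the last lines, is one way to catch the drift described above, where one relation is edited and the other is not.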
Ongoing work
The digital archivist is currently encoding the different records in a relational database using adequate input mechanisms. We adopted R2RML [5], a mapping language, to generate RDF triples from the relational database. The generated triples are loaded into a triplestore to constitute our first knowledge base. The construction of the second knowledge base and of the second, more expressive ontology will be the subject of a second post in the near future.
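To give a feel for what such a mapping produces, the following Python sketch turns rows of a relational table into N-Triples lines. It is a plain-Python stand-in and not R2RML itself. The table `birth_record`, its columns, the base IRI, and the `#` namespace for the records ontology are all assumptions made for the illustration.

```python
import sqlite3

BASE = "http://example.org/irl/"  # hypothetical base IRI for instances
REC = "http://www.purl.org/net/irish-record-linkage/records#"  # assumed namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(conn):
    """Map each row of the hypothetical birth_record table to N-Triples lines."""
    lines = []
    query = "SELECT id, date_of_registration FROM birth_record ORDER BY id"
    for record_id, registered in conn.execute(query):
        subject = f"<{BASE}record/{record_id}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{REC}BirthRecord> .")
        # A missing value yields no triple, so gaps in the source stay gaps.
        if registered is not None:
            literal = escape_literal(registered)
            lines.append(f'{subject} <{REC}dateOfRegistration> "{literal}" .')
    return lines


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE birth_record (id INTEGER PRIMARY KEY, date_of_registration TEXT)"
)
conn.executemany(
    "INSERT INTO birth_record VALUES (?, ?)", [(1, "1864-01-02"), (2, None)]
)
triples = rows_to_ntriples(conn)
print("\n".join(triples))
```

In an R2RML setting, the subject template and the column-to-predicate pairs would be stated declaratively in the mapping document instead of in code.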
References
[1] C. Debruyne, O. Beyan, S. Decker and S. Collins. Using Semantic Technologies to Create Virtual Families from Historical Vital Records, 1st EUON Workshop, 2014.
[2] W3C. OWL 2 Web Ontology Language Document Overview (Second Edition), 2012. Via http://www.w3.org/TR/owl2-overview/ (last accessed December 2, 2014).
[3] B. Motik and I. Horrocks. OWL Datatypes: Design and Implementation. Springer Berlin Heidelberg, 2008.
[4] M. Poveda-Villalón, M. C. Suárez-Figueroa and A. Gómez-Pérez. Validating Ontologies with OOPS!. Knowledge Engineering and Knowledge Management, Springer Berlin Heidelberg, 2012, pp. 267-281.
[5] W3C. R2RML: RDB to RDF Mapping Language, 2012. Via http://www.w3.org/TR/r2rml/ (last accessed December 2, 2014).