The KNOT Data Model - Structure
As explained in the overview to KNOT-DM, the model is focused on the intersection of three domains that reflect specific aspects of the project’s central theme:
- Academia — representing the universities and their agents (academics, researchers, departments, laboratories, students) as well as the activity of research and the digital scholarly objects created by it.
- Public data — representing how digital scholarly objects are publicly available in order to facilitate and foster further research and use.
- (Digital) Cultural heritage — representing both the subjects of academic research (people, works, physical objects) and its outputs (datasets, software, digital-born or digitized objects) as examples of (digital) cultural heritage.
This intersection provides the structural outline of the KNOT-DM, which is organized in three segments that reflect the specificities of each domain and make use of concepts from an internationally recognized standard to describe entities, activities, agents, themes, relationships, and spatio-temporal information. The segments and their relative standards are:
- Public Data Segment — modeled with the Data Catalog Vocabulary (DCAT), a W3C RDF vocabulary “designed to facilitate interoperability between data catalogs” published online [1]. It is used to represent digital scholarly objects as a type of public data using conceptual entities such as datasets or data services. KNOT-DM combines ontological entities and their usage from the DCAT W3C specification as well as the European and Italian Application Profiles (DCAT AP and DCAT AP IT).
- Academic Provenance Segment — modeled with the PROV Ontology (PROV-O), a W3C OWL2 Web Ontology Language designed to “represent provenance information generated in different systems and under different contexts” [2]. It is used to represent the specificities of academic research activity and production and was chosen because it is already integrated into DCAT.
- Cultural Heritage Information Segment — modeled with the CIDOC Conceptual Reference Model (CIDOC CRM), a “living standard” that offers “a theoretical and practical tool for information integration in the field of cultural heritage” [3]. It is used to represent specific activities within a research project as well as the cultural heritage dimension of digital scholarly objects. KNOT-DM primarily makes use of the CRMdig and LRMoo extensions of CIDOC CRM. The former models “the steps and methods of production ("provenance") of digitization products and synthetic digital representations” [4] and as such offers various concepts that align with standard academic research project practices, while the latter is “intended to capture and represent the underlying semantics of bibliographic information and to facilitate the integration, mediation, and interchange of bibliographic and museum information” [5] and is primarily used to enable relations with the WEMI (Work, Expression, Manifestation, Item) concept of the Functional Requirements for Bibliographic Records model (FRBR) [6].
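As a rough illustration of how the segments map onto RDF namespaces, the sketch below expands compact names such as dcat:Dataset into full IRIs. It covers one namespace per segment, using the commonly published URIs. These are assumptions that should be confirmed against KNOT-O, and the CRMdig and LRMoo namespaces are deliberately left out for the same reason. The `expand` helper is hypothetical and not part of the project.

```python
# Prefix-to-namespace map, one entry per KNOT-DM segment.
# The URIs are the commonly published ones (an assumption here);
# check them against the KNOT Ontology before relying on them.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",           # Public Data Segment
    "prov": "http://www.w3.org/ns/prov#",           # Academic Provenance Segment
    "crm": "http://www.cidoc-crm.org/cidoc-crm/",   # Cultural Heritage Information Segment
}


def expand(curie: str) -> str:
    """Expand a compact name such as 'dcat:Dataset' into a full IRI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES or not local:
        raise ValueError(f"unknown prefix or malformed name: {curie!r}")
    return PREFIXES[prefix] + local


print(expand("dcat:Dataset"))
print(expand("prov:Activity"))
```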
Figures 1 and 2 summarise the core of KNOT-DM by highlighting key classes (yellow boxes) and properties (arrows) from each standard, their interconnections, and their relationship to examples of class instances (pink dots) central to the project. Further details on the full extent of each segment and the notation used in the project are provided in the next section.
The use of internationally recognised and complementary standards stems from an early design decision that KNOT-DM should be focused on reusability rather than the creation of new ontological components, in order to allow flexibility and ensure its eventual compatibility with the Digital Library infrastructure. This is exemplified by the use of DCAT, which is the standard used by the Italian Department for Digital Transformation and is one of the recommended ontologies from the national OntoPia network of ontologies and controlled vocabularies promoted by the Ministry of Culture and ICDP. Furthermore, this flexibility should also enable the KNOT project to integrate existing information and data about academic research in order to minimise duplication and foster interconnectivity. One early area of interest in this regard is exploring how KNOT can integrate with the IRIS research management system offered to Italian universities by CINECA.
All of the KNOT-DM terms and concepts are collected and expressed in RDF via the KNOT Ontology (KNOT-O) while some of its thematic and indexing requirements are expressed in RDF via the KNOT Taxonomy and KNOT Technology Thesaurus, two controlled vocabularies based on the SKOS data model. The use of existing standards based on, or adapted to, RDF enables the KNOT project to publish its metadata as Linked Open Data (LOD) to ensure machine readability and maximise its interoperability with other existing data available online.
The SAMOD methodology was used as a reference point in the design of the KNOT-DM to create an iterative workflow:
- Once the central theme was chosen, the domain was designed by researching and evaluating existing standards against a first set of requirements, drawn in part from the information acquired during the census of existing academic research projects.
- Each standard was then assembled into a segment and evaluated against a first set of competency questions (who, what, where, when, how) by modeling a small subset of data from the census (12 entries).
- Following this, further refinements were made to which elements of the standards the model should use and how they should connect and be used. Insights from this step also led to the creation of the controlled vocabularies.
- Lastly, the first version of the ontology and documentation was created, drawing on the insights acquired in each previous step.
Following publication of the first version of the data model in the summer of 2023, further evaluation will continue with particular attention to the model’s ability to answer the evolving requirements of the project’s integration into the national infrastructure.
All elements of the KNOT-DM, including KNOT-O and the KNOT Taxonomy and KNOT Technological Thesaurus controlled vocabularies, are published in English and Italian in LOD-friendly formats and available from GitHub under a Creative Commons BY-NC 4.0 license.
The KNOT Data Model - Segments
The tabs below give more details on the full extent of each segment, with a focus on the key classes within each and how they should be used. See the Modules section for detailed practical examples and the Ontology for full details of all properties and classes.
The following legend applies to all graphical representations and textual highlights based on the "Graphical Framework for OWL Ontologies" [7]: a yellow box indicates a class, a blue arrow indicates an object property, a green arrow indicates a data property, a black arrow indicates a predicate between two entities, and a pink dot indicates an instance of a class (also referred to as an individual).
The Research Project as Public Data
Figures 3 to 5 give a complete overview of the public data segment of the KNOT-DM, which is the largest in terms of classes and properties available for use. Each figure details the properties and classes connected to one of four central DCAT concepts: Catalog, Dataset, Data Service, and Distribution.
Figure 3 details the first part of the segment, centered on dcat:Catalog. This class is defined within KNOT-DM as a “catalogue or repository that hosts the datasets or data services being described” [8] based on the DCAT Application Profile. It also takes into consideration the W3C DCAT definition, “a curated collection of metadata about resources (e.g., datasets and data services in the context of a data catalog),” [1] and a usage note that web-based data catalogs represent a single instance of this class.
These definitions are the starting point for the conceptualization of the class within KNOT-DM as representing the academic research projects as containers, rather than activities, for outputs. The most common and practical example of this conceptualization is a website where the research project activity and outputs are documented (representing the curated collection of metadata about resources) and made available to the public, whether via download or through interaction with specific user interfaces (representing the hosting of datasets or data services).
Furthermore, the W3C DCAT specification explicitly makes dcat:Catalog a subclass of dcat:Dataset, which adds another layer to this conceptualization: it allows KNOT-DM to treat the academic research project as container as another form of output from the research project as activity, meaning that websites can be described as another type of digital object produced by research.
Figure 4 details the second part of the segment, centered on dcat:Dataset. It is defined within KNOT-DM as “a collection of data published or curated by a single agent, and available for access or download in one or more representations” based on the W3C specification and an additional note that states “data comes in many forms including numbers, words, pixels, imagery, sound and other multi-media, and potentially other types, any of which might be collected into a dataset.” [1]
Within KNOT-DM this class is one of the representations of digital scholarly objects, whatever their form. This last point is key because while the DCAT standard is primarily used in the EU and Italy to classify datasets from public entities, such as government agencies, the KNOT-DM uses it to cover all the forms digital scholarly objects can take: a piece of software, a textual corpus, a relational database, a collection of annotated images, an interactive map, or a visualization service.
This class also offers a direct connection to the Academic Provenance segment of the KNOT-DM via prov:wasGeneratedBy, which DCAT includes in order to allow users to further “describe the activity that generated a dataset.” [1] The DCAT specification notes that in order to use this property the context resource should also be classified as a prov:Entity, which is the approach KNOT-DM takes. This approach to multiple classifications of individuals is detailed in the section about Conceptually Similar Classes.
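The dual classification described above can be sketched with plain subject-predicate-object tuples. This is a minimal illustration rather than the project's tooling: the `ex:` identifiers and both helper functions are hypothetical, and the check only covers the single rule stated in the text.

```python
Triple = tuple[str, str, str]


def describe_dataset(dataset: str, activity: str) -> list[Triple]:
    """Type a dataset in both DCAT and PROV-O and link it to its activity."""
    return [
        (dataset, "rdf:type", "dcat:Dataset"),
        (dataset, "rdf:type", "prov:Entity"),
        (dataset, "prov:wasGeneratedBy", activity),
    ]


def missing_entity_typing(triples: list[Triple]) -> set[str]:
    """Return subjects that use prov:wasGeneratedBy without being typed prov:Entity."""
    entities = {s for s, p, o in triples if p == "rdf:type" and o == "prov:Entity"}
    generated = {s for s, p, o in triples if p == "prov:wasGeneratedBy"}
    return generated - entities


# Hypothetical identifiers for illustration only.
good = describe_dataset("ex:imageDataset", "ex:project")
bad = [
    ("ex:otherDataset", "rdf:type", "dcat:Dataset"),
    ("ex:otherDataset", "prov:wasGeneratedBy", "ex:project"),
]
print(missing_entity_typing(good))
print(missing_entity_typing(bad))
```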
Figure 5 details the third and final part of the Public Data Segment, centered on dcat:Distribution and dcat:DataService. These two classes are closely connected by virtue of giving access to the data included in the catalogue; however, they have some key differences.
A distribution is defined within KNOT-DM as an “accessible form of a dataset such as a downloadable file” [1], based on the W3C specification. However, the term “accessible form” could be interpreted as covering non-downloadable forms and so KNOT-DM makes the limitation of distribution to downloadable forms explicit (this is also implied in the EU Application Profile usage notes and definitions). A data service is defined within KNOT-DM as “a collection of operations that provides access to one or more datasets or data processing functions,” based on the DCAT AP specification [8].
Furthermore, KNOT-DM makes explicit reference to the usage note included in DCAT AP 2.1.1 regarding distinctions between these two classes which adds the following conceptual dimensions to both [9]:
- A distribution cannot exist without its dataset while both a dataset and data service are conceptual entities of their own.
- A distribution is not required to be the result of the data service operations.
- Anything not intended to provide a downloadable form of a dataset is a data service.
- A data service offers smarter, more interactive ways to access the data.
As such, within KNOT-DM dcat:Distribution represents the downloadable form of a dataset only, and it is possible for a dataset to exist without one, for example when a relational database can be accessed via a search interface but not directly downloaded. dcat:DataService represents all other web services that give a user access to the data contained within a dataset via any number of processing functions. To follow the previous example, the search interface for the relational database would constitute a data service in this case. Within the domain of the KNOT project, data services can represent anything from API endpoints to website search interfaces and analytics services. In many cases it is possible that the data service is part of the same website that is represented by dcat:Catalog. Therefore a research project's website may fulfill the conceptual function of both a container for the project's outputs, by collecting the relevant (meta)data in one place, and (in part) a web service that gives access to the underlying data(sets) and to data processing functions for interacting with them, such as search functionalities (to use one of the most common examples). Lastly, by virtue of its ability to give users access to the datasets, a data service that was created as part of a research project may also be represented as another type of research output and therefore classified as a prov:Entity (see Provenance section for more on the use of that class).
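The distinction between the two classes comes down to one question: is the access point a downloadable form of the dataset? A minimal sketch of that rule follows, assuming a hypothetical helper `link_access_point` and hypothetical `ex:` identifiers. The property directions follow the DCAT specification (a dataset points to its distribution, while a data service points to the dataset it serves).

```python
Triple = tuple[str, str, str]


def link_access_point(dataset: str, access_point: str, downloadable: bool) -> list[Triple]:
    """Classify an access point and link it to its dataset.

    Downloadable forms become dcat:Distribution; anything else that gives
    access to the data becomes dcat:DataService.
    """
    if downloadable:
        return [
            (access_point, "rdf:type", "dcat:Distribution"),
            (dataset, "dcat:distribution", access_point),
        ]
    return [
        (access_point, "rdf:type", "dcat:DataService"),
        (access_point, "dcat:servesDataset", dataset),
    ]


# A relational database reachable only through a search interface
# has a data service but no distribution.
for triple in link_access_point("ex:database", "ex:searchInterface", downloadable=False):
    print(triple)
```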
A Practical Example
Figure 6 details a simplified practical example of how the four DCAT classes come together in KNOT-DM: a research project takes place over the course of two years with the goal of creating a dataset of images and making it available to users for interaction via a web interface and via download. The final outputs of the project are: a website, which contains all the outputs and documentation (represented by dcat:Catalog, dcat:dataset, and dcat:service) as well as a section where users can interact with the dataset’s contents via various functionalities (represented by dcat:DataService and dcat:servesDataset), and the dataset itself (represented by dcat:Dataset). The latter is also made available for direct download to encourage further use (represented by dcat:Distribution and dcat:distribution). The dataset, data service and website (three outputs) are connected to the activity of the research project, which is modeled using PROV-O as detailed in the next section (represented by prov:Activity and prov:wasGeneratedBy).
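The example above can be written out as a small set of triples. The sketch below is one possible encoding, not the content of Figure 6: every `ex:` identifier is hypothetical, and the two lookup helpers exist only to show how the individuals connect.

```python
# Hypothetical identifiers (ex:...) for the individuals in the example.
TRIPLES = [
    ("ex:website", "rdf:type", "dcat:Catalog"),
    ("ex:website", "dcat:dataset", "ex:imageDataset"),
    ("ex:website", "dcat:service", "ex:browseInterface"),
    ("ex:browseInterface", "rdf:type", "dcat:DataService"),
    ("ex:browseInterface", "dcat:servesDataset", "ex:imageDataset"),
    ("ex:imageDataset", "rdf:type", "dcat:Dataset"),
    ("ex:imageDataset", "dcat:distribution", "ex:imageDownload"),
    ("ex:imageDownload", "rdf:type", "dcat:Distribution"),
    # The three outputs point back to the research project activity.
    ("ex:project", "rdf:type", "prov:Activity"),
    ("ex:website", "prov:wasGeneratedBy", "ex:project"),
    ("ex:imageDataset", "prov:wasGeneratedBy", "ex:project"),
    ("ex:browseInterface", "prov:wasGeneratedBy", "ex:project"),
]


def objects(subject: str, predicate: str) -> list[str]:
    """All objects for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate: str, obj: str) -> list[str]:
    """All subjects for a given predicate and object."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


print(subjects("prov:wasGeneratedBy", "ex:project"))
print(objects("ex:imageDataset", "dcat:distribution"))
```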
See the Modules section for further detailed practical examples of how this segment can be used to describe academic research projects and the Ontology for the definitions of all the different properties and classes.
The Research Project as Academic Activity
Figure 7 gives a complete overview of the academic provenance segment of the KNOT-DM, which is centered on the Activity and Entity classes from PROV-O.
prov:Activity is defined as “something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.” [2] Within KNOT-DM this class represents academic research projects as activities that use and generate entities, such as digital scholarly objects. Therefore within KNOT-DM the aspects of an academic research project can be represented as two individuals, one classified as dcat:Catalog to represent the activity as a container for outputs, and one classified as prov:Activity to represent it as an activity in the traditional sense. The prov:wasGeneratedBy property connects these two individuals, making explicit that the activity generated all the outputs, including the container.
prov:Entity is defined as “a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.” [2] Within KNOT-DM this class represents digital scholarly objects, as well as the things that go into a research activity from concepts to physical or digital objects. As mentioned in the Public Data section this means that digital scholarly objects are classified as individuals belonging to both DCAT and PROV-O in order to satisfy the connection between the Public Data and Academic Provenance Segments.
The relationship between these two PROV-O classes can be detailed in a variety of ways. The properties prov:used, prov:wasInfluencedBy, prov:generated, and dct:subject connect an activity to an entity (including abstract concepts), while prov:wasInformedBy connects two activities together to indicate a sharing of entities and prov:wasDerivedFrom connects two entities together to indicate a transformation. Lastly, dct:isReferencedBy can be used to connect an activity to existing publications about it. See the Modules section for details on their usage.
In addition, the range of the prov:wasInformedBy property was extended to also include crmdig:D7_Digital_Machine_Event in order to establish a clear connection between the Academic Provenance and Cultural Heritage Information Segments of KNOT-DM, similar to the connection created with the Public Data Segment by the prov:wasGeneratedBy property. More details on this connection are given in the Cultural Heritage Information section.
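The property usage described in this section can be summarised as a small lookup table with a validity check. This is a sketch of the usage as stated in the text, not a reproduction of the formal domains and ranges declared in PROV-O or KNOT-O. dct:isReferencedBy is left out because the text does not name a class for the publications it points to, and the `is_valid` helper is hypothetical.

```python
ACTIVITY = "prov:Activity"
ENTITY = "prov:Entity"
D7 = "crmdig:D7_Digital_Machine_Event"

# property -> (classes allowed on the subject, classes allowed on the object),
# following the usage described in the text above.
USAGE = {
    "prov:used": ({ACTIVITY}, {ENTITY}),
    "prov:generated": ({ACTIVITY}, {ENTITY}),
    "prov:wasInfluencedBy": ({ACTIVITY}, {ENTITY}),
    "dct:subject": ({ACTIVITY}, {ENTITY}),
    "prov:wasDerivedFrom": ({ENTITY}, {ENTITY}),
    # Range extended in KNOT-DM to include D7 events.
    "prov:wasInformedBy": ({ACTIVITY}, {ACTIVITY, D7}),
}


def is_valid(prop: str, subject_classes: set[str], object_classes: set[str]) -> bool:
    """Check that subject and object each carry at least one allowed class."""
    allowed_subjects, allowed_objects = USAGE[prop]
    return bool(allowed_subjects & subject_classes) and bool(allowed_objects & object_classes)


print(is_valid("prov:wasInformedBy", {ACTIVITY}, {D7}))
print(is_valid("prov:wasDerivedFrom", {ACTIVITY}, {ENTITY}))
```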
A Practical Example continued
Figure 8 details the continuation of the simplified practical example we began in the previous section, now showing how the main classes in PROV-O are used: our research project generated three entities (represented by prov:Activity, prov:generated, and prov:Entity): the website, the dataset, and the data service (each of which is also classified using DCAT as stated previously). The project, as articulated through its research question, wants to explore Italian medieval architecture, which is therefore a concept that influenced the research activity and is its subject (represented by prov:Entity, prov:wasInfluencedBy, and dct:subject). The project also made use of an existing image collection to assemble its final dataset, making the latter a derivation (represented by prov:used and prov:wasDerivedFrom).
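One way to read the derivation in this example is as a chain that can be followed from an output back to its sources. The sketch below encodes the PROV-O statements from the example as triples and walks prov:wasDerivedFrom links. The `ex:` identifiers and the `derived_sources` helper are hypothetical.

```python
# Hypothetical identifiers (ex:...) for the individuals in the example.
TRIPLES = [
    ("ex:project", "prov:generated", "ex:website"),
    ("ex:project", "prov:generated", "ex:imageDataset"),
    ("ex:project", "prov:generated", "ex:browseInterface"),
    ("ex:project", "prov:wasInfluencedBy", "ex:italianMedievalArchitecture"),
    ("ex:project", "dct:subject", "ex:italianMedievalArchitecture"),
    ("ex:project", "prov:used", "ex:existingImageCollection"),
    ("ex:imageDataset", "prov:wasDerivedFrom", "ex:existingImageCollection"),
]


def derived_sources(entity: str, triples: list[tuple[str, str, str]]) -> list[str]:
    """Follow prov:wasDerivedFrom links transitively, starting from entity."""
    found: list[str] = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "prov:wasDerivedFrom" and o not in found:
                found.append(o)
                frontier.append(o)
    return found


print(derived_sources("ex:imageDataset", TRIPLES))
```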
See the Modules section for further detailed practical examples of how this segment can be used to describe academic research projects and the Ontology for the definitions of all the different properties and classes.
The Research Project as Cultural Heritage
Figure 9 gives a complete overview of the Cultural Heritage Information Segment of the KNOT-DM, which is centered on key classes from two CIDOC CRM extensions: CRMdig (D7 Digital Machine Event and D1 Digital Object) and LRMoo, formerly known as FRBRoo (F5 Item and F3 Manifestation).
crmdig:D7_Digital_Machine_Event is defined as comprising “events that happen on physical digital devices following a human activity that intentionally caused its immediate or delayed initiation and results in the creation of a new instance of D1 Digital Object on behalf of the human actor,” [4] while crmdig:D1_Digital_Object is defined as comprising “identifiable immaterial items that can be represented as sets of bit sequences, such as data sets, e-texts, images, audio or video items, software, etc., and are documented as single units.” [4]
Within KNOT-DM instances of D7 and its subclasses (which cover more specific types of activities that can take place on a digital device) can be used to represent specific activities within the larger research project activity (a prov:Activity) and are connected to it by prov:wasInformedBy as described in the Academic Provenance section.
Instances of D1 and its subclasses (which cover more specific types of digital objects) meanwhile are a third way to represent digital scholarly objects as well as the objects that go into a project but specifically limited to the digital realm and within the context of cultural heritage, a dimension that is not explicit in either DCAT or PROV-O.
KNOT-DM makes use of two D7 subclasses: crmdig:D2_Digitization_Process, which represents the specific activity of transforming a physical object into a digital one, and crmdig:D10_Software_Execution, which can be used to represent more generic computational or software tasks run on a digital device, for example the transformation of text strings into XML. The data model also makes use of two D1 subclasses: crmdig:D9_Data_Object, which represents specific instances of D1 that contain quantitative properties (for example a linguistic corpus) and were created by D2, and crmdig:D14_Software, which represents both the software programs or computer code that an instance of D7 can use and a type of digital object that can be created by the research project activity.
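The subclass relationships named above can be captured in a small parent map, which makes it easy to check whether an individual typed with a specific class also counts as a D7 event or a D1 object. This is a simplified sketch: each subclass is mapped straight to the parent given in the text, so any intermediate classes in the CRMdig hierarchy are not represented, and the `is_a` helper is hypothetical.

```python
# Subclass -> parent, limited to the four subclasses KNOT-DM uses.
# Intermediate CRMdig classes, if any, are intentionally omitted.
SUBCLASS_OF = {
    "crmdig:D2_Digitization_Process": "crmdig:D7_Digital_Machine_Event",
    "crmdig:D10_Software_Execution": "crmdig:D7_Digital_Machine_Event",
    "crmdig:D9_Data_Object": "crmdig:D1_Digital_Object",
    "crmdig:D14_Software": "crmdig:D1_Digital_Object",
}


def is_a(cls: str, ancestor: str) -> bool:
    """True if cls equals ancestor or reaches it by following parent links."""
    current = cls
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)
    return False


print(is_a("crmdig:D2_Digitization_Process", "crmdig:D7_Digital_Machine_Event"))
print(is_a("crmdig:D14_Software", "crmdig:D7_Digital_Machine_Event"))
```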
frbroo:F5_Item is defined as comprising “physical objects (printed books, scores, CDs, DVDs, CD-ROMS, etc.) that were produced by (P186i) an industrial process involving a given instance of F3 Manifestation,” while frbroo:F3_Manifestation is defined as comprising “products rendering one or more Expressions. A Manifestation is defined by both the overall content and the form of its presentation.” [10] Within KNOT-DM these two classes are used specifically in conjunction with D2. As such F5 represents the physical item that was used for digitization while F3 represents the manifestation that was used to produce the item. The latter can also be related to a crm:E1_Entity via crm:P129_is_about in order to attach the subject to the manifestation (for example an author or a historical period).
A Practical Example concluded
Figure 10 details the last iteration of the simplified practical example, showing how the main classes in CIDOC are used. Our research project was informed by two specific tasks: the digitization of old photographs and the creation of the dataset (represented by prov:wasInformedBy, crmdig:D2_Digitization_Process, and crmdig:D7_Digital_Machine_Event). The old photographs depict specific buildings of interest related to the project's subject of Italian medieval architecture (represented by crm:E18_Physical_Thing, crm:P129_is_about, and crm:E1_Entity). Creating the dataset involved using the resulting scans, alongside the existing collection described in the previous step (represented by crmdig:D9_Data_Object, crmdig:D1_Digital_Object, crmdig:L10_had_input, crmdig:L11_had_output, and crm:P106_is_composed_of). This example is simplified so that the digitization process covers multiple photographs, though in reality a single instance of D2 should represent the digitization of each photo. Equally, the tasks that informed the research project could be extended to cover the creation of the data service and website (see the Modules section for a more detailed example). Should the information be available, the entities depicted in these photos can in turn be described using CIDOC.
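The note about using one D2 instance per photograph can be sketched as a loop that creates a separate digitization process and scan for each photo. This is an illustration under several assumptions: the `ex:` identifiers and the naming scheme are hypothetical, the property directions are a reading of the text rather than a copy of Figure 10, and the property that links each process to the photo it digitized is left out because the text does not name it.

```python
Triple = tuple[str, str, str]


def digitization_triples(project: str, photos: list[str]) -> list[Triple]:
    """Create one D2 process and one resulting scan per photograph."""
    triples: list[Triple] = []
    for i, photo in enumerate(photos, start=1):
        process = f"ex:digitization{i}"  # hypothetical naming scheme
        scan = f"ex:scan{i}"
        triples += [
            (photo, "rdf:type", "crm:E18_Physical_Thing"),
            (process, "rdf:type", "crmdig:D2_Digitization_Process"),
            (process, "crmdig:L11_had_output", scan),
            (scan, "rdf:type", "crmdig:D9_Data_Object"),
            # Each task is connected to the overall research activity.
            (project, "prov:wasInformedBy", process),
        ]
    return triples


result = digitization_triples("ex:project", ["ex:photo1", "ex:photo2"])
for triple in result:
    print(triple)
```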
See the Modules section for further detailed practical examples of how this segment can be used to describe academic research projects and the Ontology for the definitions of all the different properties and classes.
Conceptually Similar Classes
As we've seen in the previous sections of this overview, the three standards used within KNOT-DM segments each include classes which refer to similar concepts such as entities and activities.
While this could be dealt with by using the owl:equivalentClass axiom or the property rdfs:subClassOf (pointing to a new specifically created class), these classes are defined with enough difference that KNOT-DM does not consider them equivalent. Furthermore, the goal of KNOT-DM is to reuse existing standards as much as possible rather than create new classes or properties. As such it was decided during design that individuals described with KNOT-DM should be assigned each class as appropriate.
The conceptually similar classes within KNOT-DM cover the following:
- People and/or organizations involved in academic research projects.
  - prov:Agent with the implication that the agent bears some form of responsibility towards a prov:Activity or prov:Entity.
  - foaf:Agent with the implication that the agent is responsible for an action that is further defined by properties such as dct:creator or dct:publisher.
  - crm:E39_Actor with a similar implication of responsibility as prov:Agent but in relation to specific activities within the overall academic research project activity (prov:Activity).
  - Example: A researcher within our example project is responsible for the technological infrastructure that makes the final dataset available but is not involved in or responsible for the specific activities that created the dataset, such as the digitization of photos. This researcher should be given the prov:Agent and foaf:Agent classes. Conversely, a researcher involved in creating the dataset but not in the development of the website or data service should be given the crm:E39_Actor and prov:Agent classes.
- Locations where research activity took place or which are represented in the results of academic research.
  - prov:Location with the implication that it can be a geographical or non-geographic place where activity took place.
  - dct:Location with the implication that it is a spatial region or named place mentioned or covered by the data.
  - crm:E53_Place with a similar implication as dct:Location but relating to activities.
  - Example: the dataset created by our example project includes writings about Italy, while the project itself took place in Rome. Therefore dct:Location should be used to indicate Italy while prov:Location and crm:E53_Place should be used to indicate Rome.
- Entities representing objects and subjects that are used and generated by academic research projects.
  - prov:Entity with the implication that it can represent things with fixed aspects that are real or imaginary and used or generated in the activity of a research project.
  - dcat:Dataset with the implication that it can only represent a collection of data created and published as a result of a research project activity.
  - dcat:DataService with the implication that it can only represent a data service created by the research activity to give access to dataset(s) also created by it.
  - crmdig:D1_Digital_Object (and its subclasses), crm:E18_Physical_Thing (and its subclass F5), and crm:E1_Entity with the implication that D1 and its subclasses are used to represent only digital objects created and/or used by specific tasks within the research activity (and which may ultimately form part of or also be a dcat:Dataset) while E18 is used to represent physical objects that undergo a process of digitization and E1 is used to represent the entity or concept that the physical object is about.
  - Example: the dataset created by our example project is an instance of prov:Entity, dcat:Dataset and crmdig:D1_Digital_Object, while the physical representations of the photos used in the digitization process are instances of crm:E18_Physical_Thing and/or frbroo:F5_Item and the results of this process are instances of crmdig:D9_Data_Object. The subject of the research (Italian medieval architecture) can be represented as both prov:Entity and crm:E1_Entity.
- Activities representing the overall academic research project as well as specific things that happen within it.
  - prov:Activity is used to represent the overall research project activity.
  - crmdig:D7_Digital_Machine_Event (and its subclasses) is used to represent specific sub-activities or tasks within the wider activity.
  - These classes are connected via prov:wasInformedBy.
  - Example: our example project is an instance of prov:Activity while the digitization of the photos and creation of the dataset are instances of crmdig:D2_Digitization_Process and crmdig:D10_Software_Execution (or crmdig:D7_Digital_Machine_Event if there are not enough specific details).
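The agent example in the list above suggests a simple rule for choosing which classes to assign to a person or organization. The sketch below is one possible reading of that example, not a rule stated in KNOT-O: it assumes every agent receives prov:Agent, that foaf:Agent reflects responsibility at the level of the overall project and its published outputs, and that crm:E39_Actor reflects involvement in specific tasks. The function and parameter names are hypothetical.

```python
def agent_classes(overall_responsibility: bool, specific_tasks: bool) -> set[str]:
    """Pick the agent classes for an individual (illustrative reading only)."""
    classes = {"prov:Agent"}
    if overall_responsibility:
        # Responsible for an action expressed via dct:creator, dct:publisher, etc.
        classes.add("foaf:Agent")
    if specific_tasks:
        # Involved in specific activities such as the digitization of photos.
        classes.add("crm:E39_Actor")
    return classes


# Researcher responsible for the infrastructure only.
print(sorted(agent_classes(overall_responsibility=True, specific_tasks=False)))
# Researcher involved only in creating the dataset.
print(sorted(agent_classes(overall_responsibility=False, specific_tasks=True)))
```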
References
[1] Albertoni, Riccardo. n.d. “Data Catalog Vocabulary (DCAT) - Version 2.” Www.w3.org. Accessed June 19, 2023. https://www.w3.org/TR/vocab-dcat-2/.
[2] “PROV-O: The PROV Ontology.” n.d. Www.w3.org. Accessed July 11, 2023. https://www.w3.org/TR/prov-o/.
[3] “Home.” n.d. Cidoc-crm.org. Accessed July 11, 2023. https://www.cidoc-crm.org/.
[4] “Home.” n.d. Cidoc-crm.org. Accessed July 11, 2023. https://www.cidoc-crm.org/crm-dig/.
[5] “Home.” n.d. Cidoc-crm.org. Accessed July 11, 2023. https://www.cidoc-crm.org/frbroo/.
[6] “Functional Requirements for Bibliographic Records (FRBR).” 2009. In Encyclopedia of Library and Information Sciences, Third Edition, 1884–91. CRC Press.
[7] Peroni, Silvio. “Graffoo - Graphical Framework for OWL Ontologies.” n.d. Essepuntato.It. Accessed July 13, 2023. https://essepuntato.it/graffoo/.
[8] Van Nuffelen, Bert. “DCAT Application Profile for data portals in Europe Version 2.1.0.” Joinup.ec.europa.eu. Accessed July 11, 2023. https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/dcat-application-profile-data-portals-europe.
[9] Van Nuffelen, Bert. “Usage guide on Datasets, Distributions and Data Services.” Github.com. Accessed July 11, 2023. https://github.com/SEMICeu/DCAT-AP/blob/master/releases/2.1.1/usageguide-dataset-distribution-dataservice.md.
[10] Bekiari, Chryssoula, Martin Doerr, Patrick Le Boeuf, Trond Aalberg, George Bruseker, Günther Görz, and Mika Nyman. n.d. “LRM OO (Formerly FRBR OO ) Object-Oriented Definition and Mapping from IFLA LRM.” Cidoc-crm.org. Accessed July 11, 2023. https://www.cidoc-crm.org/frbroo/sites/default/files/LRMoo_V0.9%28draft%20for%20WLIC%202022%29.pdf.
Modules
Learn more about how to use the different modules within KNOT-DM.
Controlled Vocabularies
Learn more about the Controlled Vocabularies used in KNOT-DM, including those created specifically for the project.
Ontology
Learn more about the KNOT Ontology, which expresses the data model in RDF.