Friday, April 1, 2016

Controlled values for Establishment Means from Darwin Core

This post is the third in a series about using SKOS to define controlled values.  In the first post, I discussed the differences between thesauri and ontologies, and the pros and cons of using them to define vocabulary terms in various situations.  In the second post, I experimented with using SKOS to describe the SERNEC Live Plant Image Group standardized views for live plants as a thesaurus that could be used as controlled values for the term Iptc4xmpExt:CVterm from Audubon Core.  In this post, I want to talk about how one might use SKOS to define a controlled vocabulary to provide values for the Darwin Core property dwc:establishmentMeans.

<http://bioimages.vanderbilt.edu/ind-baskauf/10113> dwc:establishmentMeans "managed".

What is dwc:establishmentMeans?

To indicate "the process by which the biological individual(s) represented in the Occurrence became established at the location", the Darwin Core vocabulary provides the term dwc:establishmentMeans.  The term definition specifies that best practice is to use a controlled vocabulary for the value.  A comment associated with the term suggests possible values, but there is not currently any officially sanctioned "controlled vocabulary" to provide the authoritative values.  If TDWG were to create such a controlled vocabulary, how should it be done?

What is a TDWG vocabulary?

In the context of my role as convener of the TDWG Vocabulary Maintenance Specification Task Group, one question that has been on my mind recently is: what is a TDWG vocabulary?  This question is relevant to the issue of constructing controlled vocabularies for TDWG because TDWG is a standards organization that borrows vocabulary terms from other organizations. That has implications because it means that TDWG's vocabularies may need to be structured in a more complex manner than an ad hoc vocabulary constructed by an individual from scratch.

A simplistic answer is to say that a TDWG vocabulary is a collection of terms that have been chosen by TDWG for some purpose.  However, in the context of RDF, that isn't specific enough.  What if the terms are adopted from other namespaces outside of TDWG's control (such as Dublin Core terms)?  What is the difference (if any) between a TDWG Standard and a TDWG Vocabulary?  Can they be the same thing?  Should they be the same thing?  How do we differentiate between the vocabulary itself and a representation of that vocabulary in a particular form such as CSV, XML, or RDF?  The answers to all of these questions should make sense when the data are expressed as RDF, but the relationships that we describe must also make sense conceptually outside of the world of RDF, since many users of the vocabularies won't know or care about RDF.

I think that it makes sense to differentiate between a TDWG standard and a TDWG vocabulary. TDWG standards may include vocabularies, but they may also include user guides and other documents that aren't part of the vocabulary.  This is true in the case of the Darwin Core Standard, which includes the Darwin Core vocabulary, but also includes a text guide, an XML guide, and other informative documents.  A TDWG standard may not establish any vocabulary at all, as in the case of the GUID Applicability Statement.  It makes sense to me that a TDWG standard should probably be typed as a dcterms:Standard, with a TDWG vocabulary being related to its containing standard by dcterms:isPartOf.

It would be tempting to type a TDWG vocabulary as a voaf:Vocabulary [1]. However, voaf:Vocabulary is defined as "A vocabulary used in the linked data cloud. ..."  Although TDWG vocabularies might be used in the context of Linked Data, they are commonly used in non-RDF contexts.  Since voaf:Vocabulary rdfs:subClassOf void:Dataset, asserting that a TDWG vocabulary was a voaf:Vocabulary would also entail that the TDWG vocabulary was a void:Dataset [2]. Since a void:Dataset is defined as a "set of RDF triples that are published, maintained or aggregated by a single provider", that could be a problem, since the terms of a TDWG vocabulary may be defined in human-readable form (e.g. Audubon Core) and used in a non-RDF format such as CSV without necessitating that those terms ever be part of a set of RDF triples.  For lack of a better term, I've settled on dcmitype:Dataset as a type for TDWG vocabularies.  It's defined as "Data encoded in a defined structure." with examples given as "lists, tables, and databases", which would seem to apply to the list of terms that a TDWG vocabulary contains.

In a simple world, vocabulary terms would be defined in a single vocabulary document and the terms would be related to that document via rdfs:isDefinedBy.  The common practice followed by vocabularies such as Darwin and Dublin Core is to describe (using RDF) a defining entity that's identified by the term namespace, then relate terms minted by the vocabulary to the defining entity using rdfs:isDefinedBy.  That's a good trick, because if the server is set up right, dereferencing either the term URI (e.g. http://rs.tdwg.org/dwc/terms/recordedBy) or the URI of the defining entity (i.e. the namespace; for example http://rs.tdwg.org/dwc/terms/) will retrieve the document representation of the defining entity in either web page form or RDF, depending on what the client asks for.

Unfortunately, TDWG vocabularies don't live in a simple world.  Both of the existing TDWG vocabularies (Darwin Core and Audubon Core) borrow terms from other vocabularies in addition to minting terms of their own. I've been careful to avoid calling the defining entity the "vocabulary", because the defining entity identified by the namespace URI generally only includes the vocabulary terms minted specifically for that vocabulary, and doesn't include the other terms in the vocabulary that are borrowed.  Instead, I'll call the defining entity a "term list". By that nomenclature, Darwin Core would consist of two term lists: the list of terms defined by the Darwin Core standard itself (those in the dwc: namespace) and a list of the Dublin Core terms that have been borrowed for inclusion in Darwin Core (from the dcterms: namespace).  TDWG could create its own list of terms that includes the subset of Dublin Core terms used in Darwin Core, assign that term list a URI, and assert that each of the borrowed terms is related to that list by dcterms:isPartOf.  However, TDWG should NOT assert that those terms are related to its list by rdfs:isDefinedBy, because it's Dublin Core's term list (identified by the URI http://purl.org/dc/terms/) that actually does the defining.
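
A minimal Turtle sketch of this borrowed-term pattern might look like the following.  The term list URI is made up for illustration and won't dereference; dcterms:modified simply stands in for any Dublin Core term that Darwin Core borrows.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix dcterms: <http://purl.org/dc/terms/>.

# Hypothetical URI for a TDWG-maintained list of borrowed Dublin Core terms.
<http://rs.tdwg.org/example/dwc-dcterms-list>
     rdfs:label "Dublin Core terms borrowed by Darwin Core (illustrative)"@en.

# TDWG asserts that the borrowed term is part of its own list.
dcterms:modified dcterms:isPartOf <http://rs.tdwg.org/example/dwc-dcterms-list>.

# The defining entity remains Dublin Core's own term list, not the TDWG list.
dcterms:modified rdfs:isDefinedBy <http://purl.org/dc/terms/>.
```

The point of keeping the two properties separate is that a client can find out which TDWG list a term belongs to without being misled about who is responsible for its definition.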



The diagram above shows how I'm applying the relationships I've just described to the pretend TDWG controlled vocabulary I'm creating.  At the bottom is a term whose label is "native".  It is both part of, and defined by, a term list created by the Global Biodiversity Information Facility (GBIF).  That term list is incorporated in an imaginary TDWG controlled vocabulary intended to be used with the Darwin Core term dwc:establishmentMeans.  In this situation, GBIF hasn't actually created an RDF term list for its terms, so I'm simplifying the situation I described above by assuming that TDWG can either get GBIF to create an RDF term list document of the sort we want (so that it will be served when the namespace URI is dereferenced), or else TDWG will make the assertions about the term list in its own RDF document (and nothing would be served as RDF when the namespace URI is dereferenced).  Since I'm assuming that TDWG is accepting the entire GBIF term list as-is, there isn't a need for TDWG to create its own separate term list containing a subset of the GBIF terms.  (In the next post I'll show an example where a subset list is required.)

Notice that I explicitly typed the vocabulary as a dcmitype:Dataset, but that I typed the term list as a dcat:Dataset.  The term dcat:Dataset comes from the Data Catalog Vocabulary (DCAT) W3C Recommendation.  Conveniently, there is no assumption that a dcat:Dataset is expressed as RDF; it's defined as "a collection of data, published or curated by a single agent, and available for access or download in one or more formats."  So the available forms could include RDF/XML or RDF/Turtle, but they could also include non-RDF formats such as HTML or PDF.  The available formats are typed as dcat:Distribution and a DCAT dataset is linked to its distributions using the term dcat:distribution.  Also conveniently, declaring the term list to be a dcat:Dataset entails that it is also of type dcmitype:Dataset. [3]
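
As a rough sketch of how a term list could be linked to its distributions, the Turtle below describes the GBIF term list with one RDF and one non-RDF distribution.  The distribution URIs under example.org and the Turtle download URL are assumptions made up for illustration; only the XML file location comes from GBIF.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix emv: <http://rs.gbif.org/vocabulary/gbif/establishment_means/>.

# The term list itself, identified by the namespace URI.
emv: a dcat:Dataset;
     dcat:distribution <http://example.org/distribution/emv-turtle>,
                       <http://example.org/distribution/emv-xml>.

# Hypothetical RDF/Turtle serialization of the term list.
<http://example.org/distribution/emv-turtle> a dcat:Distribution;
     dcterms:title "Establishment means term list as RDF/Turtle"@en;
     dcat:downloadURL <http://example.org/establishment_means.ttl>.

# The existing XML document, described as a non-RDF distribution.
<http://example.org/distribution/emv-xml> a dcat:Distribution;
     dcterms:title "Establishment means term list as XML"@en;
     dcat:downloadURL <http://rs.gbif.org/vocabulary/gbif/establishment_means.xml>.
```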

So in summary, the relationships are modeled like this:

A term dcterms:isPartOf a term list (types dcat:Dataset and dcmitype:Dataset).
A term list dcterms:isPartOf a vocabulary (type dcmitype:Dataset).
A vocabulary dcterms:isPartOf a TDWG standard (type dcterms:Standard).
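
Written out as Turtle, the three levels might look like the sketch below.  The vocabulary URI is made up, and the standard URI and its label are placeholders assumed purely for illustration.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcmitype: <http://purl.org/dc/dcmitype/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix emv: <http://rs.gbif.org/vocabulary/gbif/establishment_means/>.

# A term is part of a term list.
emv:native dcterms:isPartOf emv:.

# The term list is part of a vocabulary.
emv: a dcat:Dataset;
     dcterms:isPartOf <http://rs.tdwg.org/cvterms/establishment_means>.

# The vocabulary is part of a standard (placeholder URI).
<http://rs.tdwg.org/cvterms/establishment_means> a dcmitype:Dataset;
     dcterms:isPartOf <http://example.org/tdwg/imaginary-standard>.

<http://example.org/tdwg/imaginary-standard> a dcterms:Standard;
     rdfs:label "Imaginary TDWG standard"@en.
```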

Considerations for creating a controlled vocabulary for dwc:establishmentMeans

Having established what I mean by "vocabulary" and "term list", I'll list some things that should be considered with respect to a controlled vocabulary for dwc:establishmentMeans as opposed to the kind of controlled vocabulary I laid out for Iptc4xmpExt:CVterm in my last post:

  • Unlike the property Iptc4xmpExt:CVterm, which is part of a relatively new TDWG standard and for which URI values are a preferred option, the property dwc:establishmentMeans has a long history of having literal values, often stored and transmitted as simple text strings in cells of a CSV table.
  • There is a commonly used, small set of literal values for dwc:establishmentMeans established by GBIF, a major aggregator of Darwin Core-described metadata.  In contrast, there are numerous options for controlled values for Iptc4xmpExt:CVterm.
  • As with Iptc4xmpExt:CVterm, it would be beneficial to establish a URI for the values of dwc:establishmentMeans so that those values would be globally unique.  
  • The Darwin Core RDF Guide specifies that the property http://rs.tdwg.org/dwc/iri/establishmentMeans (dwciri:establishmentMeans) should be used with URI values.  dwciri:establishmentMeans is distinct from dwc:establishmentMeans, which is intended for use with literal values.  In contrast, Audubon Core allows Iptc4xmpExt:CVterm to be used with either literals or URIs.
  • If TDWG designated a particular controlled vocabulary as a standard whose terms were to be used as values for establishmentMeans, then that vocabulary would be managed via a TDWG standards process.  That means that any proposed changes to the vocabulary itself would have to go through a potentially laborious and time-consuming process.  TDWG does not specify a particular controlled vocabulary for the values of Iptc4xmpExt:CVterm, so there is no particular procedure that would need to be followed for changes to the vocabularies that provide the Iptc4xmpExt:CVterm values.

Use Cases

Here are some use cases I delineated for a controlled vocabulary for dwc:establishmentMeans/dwciri:establishmentMeans:
  1. Discover from use of the existing GBIF URIs that those terms are included as part of a standard TDWG vocabulary.
  2. For each controlled value, discover the single string that should serve as the literal value for dwc:establishmentMeans.
  3. Determine preferred labels for the controlled vocabulary terms in multiple languages from a source other than the standards document itself.  (The preferred labels should not be included in the standards document since additions or changes to those labels should not invoke the TDWG standards process.)  This is to facilitate term selection by non-English language users.
  4. Discover broader controlled value terms that are entailed by the hierarchical relationships expressed in the human-readable description of the GBIF controlled value terms.
  5. Discover alternate labels that might be useful for disambiguation when cleaning literal value data.  

The Controlled Vocabulary

(A document containing the complete graph for the examples below is here.)  Here's how I described the imaginary TDWG controlled vocabulary:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix dwc: <http://rs.tdwg.org/dwc/terms/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcmitype: <http://purl.org/dc/dcmitype/>.
@prefix dcam: <http://purl.org/dc/dcam/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix emv: <http://rs.gbif.org/vocabulary/gbif/establishment_means/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dwciri: <http://rs.tdwg.org/dwc/iri/>.

<http://rs.tdwg.org/cvterms/establishment_means> a skos:ConceptScheme,
                                                   dcmitype:Dataset;
     rdfs:label "TDWG Establishment Means Vocabulary"@en;
     dcterms:title "TDWG Establishment Means Vocabulary"@en;
     rdfs:comment "This is a controlled vocabulary whose IRIs are intended to be used as values for dwciri:establishmentMeans."@en;
     skos:note "This is a controlled vocabulary whose IRIs are intended to be used as values for dwciri:establishmentMeans."@en;
     skos:hasTopConcept emv:native,
                        emv:introduced,
                        emv:uncertain;
     dc:publisher "Biodiversity Information Standards"@en.

Notes:
  1. The URI for the vocabulary is made up and won't dereference.
  2. I expressed the label for the vocabulary two ways.  rdfs:label is the most generic way to do this and virtually all RDF clients could be assumed to understand it.  dcterms:title is also well-known, and using it is recommended by Linked Open Vocabularies (LOV) as a best practice for vocabularies (although LOV doesn't distinguish between "vocabularies" and "term lists" as I've done here).
  3. Similarly, I've described the vocabulary using both the well-known rdfs:comment and the SKOS term skos:note (since I'll be using SKOS to describe the controlled values).
  4. I have linked using skos:hasTopConcept to the three terms that are at the top of the hierarchy.

GBIF defines the controlled vocabulary terms in XML here. GBIF doesn't provide an RDF implementation for its term list.  If it did, it could assert something like this:

emv: a dcat:Dataset;
     rdfs:label "GBIF Establishment Means Term List"@en;
     dcterms:title "GBIF Establishment Means Term List"@en;
     rdfs:comment "This document contains definitions for controlled vocabulary terms to be used as values for dwciri:establishmentMeans."@en;
     skos:note "This document contains definitions for controlled vocabulary terms to be used as values for dwciri:establishmentMeans."@en;
     dcterms:modified "2015-02-13"^^xsd:date;
     dcterms:license <https://creativecommons.org/licenses/by/4.0/>;
     rdfs:seeAlso <http://rs.gbif.org/vocabulary/gbif/establishment_means.xml>,
                  <http://terms.tdwg.org/wiki/>; # link to translations document
     dcterms:isPartOf <http://rs.tdwg.org/cvterms/establishment_means>;
     dc:publisher "Global Biodiversity Information Facility"@en.

Notes:
  1. There are two rdfs:seeAlso links.  One, which is not dereferenceable as RDF, is to the XML document that's currently online.  The other is to a translations document that I'll talk about later in the post.  I've assumed that it would be accessible via the TDWG terms wiki, so I used the terms wiki URI as the object.
  2. I've listed the publisher as GBIF.  That's because GBIF minted the URIs that I'm using, even though they didn't create this RDF.  It's possible that they will never assert RDF like this - maybe these assertions will be made by TDWG or someone else.  But it is their list of terms.

Here is a snippet of RDF for one of the terms:

emv:native skos:definition "A species that is a part of the balance of nature that has developed over hundreds or thousands of years in a particular region or ecosystem. The word native should always be used with a geographic qualifier (for example, native to New England)."@en;
           rdfs:comment "A species that is a part of the balance of nature that has developed over hundreds or thousands of years in a particular region or ecosystem. The word native should always be used with a geographic qualifier (for example, native to New England)."@en;
           rdf:value "native";
           rdfs:label "native"@en;
           skos:inScheme <http://rs.tdwg.org/cvterms/establishment_means>;
           rdfs:isDefinedBy emv:;
           dcterms:isPartOf emv:;
           dcam:memberOf emv:;
           a skos:Concept.

Notes:
  1. The definition is the one given in the current GBIF XML.  The definition is linked twice - once as a skos:definition property (using that SKOS property since the term is typed as a skos:Concept) and again as an rdfs:comment (since using this property is somewhat standard for term definitions and RDFS is more "well-known" than SKOS).
  2. The term is linked to the GBIF term list using rdfs:isDefinedBy.  As I mentioned earlier, it is a somewhat standard practice to use this term to link vocabulary terms to a defining resource that comes up when the term is dereferenced.
  3. The English rdfs:label property is provided as a normal practice in TDWG vocabularies so that consuming applications will have something to display to humans in lieu of the URI.  However, the multi-lingual preferred labels will be provided in a separate document.
  4. The term is typed as a skos:Concept for the reasons that I discussed in my previous post.  Briefly, the term emv:native is used to organize circumstances of organisms, not to define what it means for an organism to be "native".

<http://bioimages.vanderbilt.edu/ind-baskauf/42050> dwciri:establishmentMeans emv:native.

Specifying the controlled value string

In the design considerations, I mentioned that it has been a longstanding practice within the biodiversity informatics community to use literals as values for many Darwin Core terms whose purpose is to categorize resources.  It is an essential characteristic of those controlled value literals that there should be a single string that is used to represent a particular category.  It seems like this approach should be uncomplicated, but a proliferation of spellings, language variants, and abbreviations makes it possible for data providers to use many different literals for a particular category [4].  An important function of a standard controlled vocabulary would be to specify that single required string for each category.  Unfortunately, SKOS isn't designed for such a task.  Although SKOS provides a mechanism for specifying the preferred label for a concept using the skos:prefLabel property, SKOS allows a different skos:prefLabel value for each language.  That is fine if one assumes that the term URI will serve as the unambiguous identifier for the category concept, but it would be counterproductive in the circumstance where a single literal string is needed to identify the category in lieu of a URI.

I spent considerable time pondering an appropriate property to specify the controlled literal value.  I finally settled on rdf:value.  It has the desirable characteristic that it is well-known (as part of the basic RDF vocabulary), but its meaning is vague enough that it could be used in this way.  Its definition states that it "may be used in describing structured values.  rdf:value has no meaning on its own." and goes on to say "Despite the lack of formal specification of the meaning of this property, there is value in defining it to encourage the use of a common idiom in examples of this kind."  I refer you to the RDF specification for the "example of this kind".  Basically, rdf:value is used when one wants to provide a literal value for a resource without expressing that literal value directly as the object of the triple.  The disadvantage of making the literal value be the direct object of the triple is that one could not further describe the resource represented by the literal, since literals can't be used as the subjects of triples.  In the example provided in the RDF specification, rdf:value was used to link to the literal value from a node representing the resource, and the additional information provided about the resource was the units associated with the value.
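
For readers who don't want to look up the specification, the structured-value idiom can be sketched roughly as follows.  This is modeled loosely on the kind of example described above rather than copied from the specification; every example.org name is a placeholder.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix ex: <http://example.org/terms/>.

# The weight is a node in its own right, so it can carry both the
# number (via rdf:value) and the units that qualify that number.
<http://example.org/item/10245> ex:weight [
     rdf:value "2.4"^^xsd:decimal;
     ex:units ex:kilograms
     ].
```

The controlled vocabulary use is analogous: the term URI plays the role of the intermediate node, rdf:value carries the single controlled string, and other properties of the term supply the additional description.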

From http://dublincore.org/documents/dc-rdf/ (c) DCMI, CC BY
Another example of the use of rdf:value is provided in the Dublin Core guide for expressing Dublin Core metadata as RDF.  The example provided in section 4 of the guide covers a situation quite similar to ours.  Their example shows how a cataloging subject could be represented as an RDF representation based on the Dublin Core Abstract Model (DCAM).  The subject itself is a non-literal value node that is identified by a URI (just like our GBIF concept for the "native" category).  Objects of rdf:value triples provide values that are text and abbreviation literals that represent the subject.  In an RDF representation of the DCAM, it appears that the "value string" may be a plain literal without a language tag.  Although their example shows a language tag, I think in the situation of a controlled vocabulary literal, it's best to omit the language tag, since we don't want to imply that there are other acceptable values in other languages.  The value that we supply is the ONLY acceptable value.

An additional feature of the Dublin Core example is use of the term dcam:memberOf to specify the "vocabulary encoding scheme" associated with the controlled value.  In my example, I have asserted the triple

emv:native dcam:memberOf emv:.

which by the range of dcam:memberOf entails that the GBIF term list is a dcam:VocabularyEncodingScheme.  That seems appropriate, since the GBIF term list is a scheme for encoding a vocabulary and meets the definition of dcam:VocabularyEncodingScheme: "An enumerated set of resources."  I'm not sure how important it would be to provide this triple, since I'm not sure whether this aspect of the DCAM is widely used and "understood" by clients.  But I don't think it is a bad thing to follow the DCAM, since it is the model that underlies what may be the most well-known metadata vocabulary (Dublin Core). 

SKOS relationships in the vocabulary

Some of the terms in the controlled vocabulary (emv:naturalised, emv:invasive, and emv:managed) are narrower subcategories of emv:introduced.  Following the normal practice for SKOS, they are linked using skos:broader like this:

emv:naturalised skos:broader emv:introduced.
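
Spelled out for the whole hierarchy, the SKOS relationships could be sketched like this.  The use of skos:topConceptOf is an assumption added here for illustration; it is the inverse of the skos:hasTopConcept links asserted on the vocabulary.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix emv: <http://rs.gbif.org/vocabulary/gbif/establishment_means/>.

# The three subcategories each point up to emv:introduced.
emv:naturalised skos:broader emv:introduced.
emv:invasive    skos:broader emv:introduced.
emv:managed     skos:broader emv:introduced.

# The three terms at the top of the hierarchy (made-up vocabulary URI).
emv:native     skos:topConceptOf <http://rs.tdwg.org/cvterms/establishment_means>.
emv:introduced skos:topConceptOf <http://rs.tdwg.org/cvterms/establishment_means>.
emv:uncertain  skos:topConceptOf <http://rs.tdwg.org/cvterms/establishment_means>.
```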

The property skos:inScheme is used to relate each term to the TDWG vocabulary.  The range of  skos:inScheme entails that the vocabulary is a skos:ConceptScheme in addition to a dcmitype:Dataset.  I was a bit uncertain about whether it was best to call the TDWG vocabulary the concept scheme.  It would also be possible to instead assert that the GBIF term list was the concept scheme. The notes on skos:ConceptScheme say:
The notion of an individual SKOS concept scheme corresponds roughly to the notion of an individual thesaurus, classification scheme, subject heading system or other knowledge organization system. 
However, in most current information systems, a thesaurus or classification scheme is treated as a closed system — conceptual units defined within that system cannot take part in other systems (although they can be mapped to units in other systems). 
Although SKOS does take a similar approach, there are no conditions preventing a SKOS concept from taking part in zero, one, or more than one concept scheme.
In this particular instance, the controlled vocabulary contains only terms "borrowed" from a single GBIF vocabulary.  But I could imagine a situation where terms from several vocabularies might be borrowed to be included in a single TDWG-defined concept scheme that serves as a controlled vocabulary associated with a standard.  As indicated by the quote above, this "mix and match" approach to creating concept schemes is allowed by the SKOS model.  In a case where terms from several vocabularies were reused to construct a concept scheme for a controlled vocabulary, one would not want to depend on the entity defining the terms to assert the concept scheme - that entity might not even consider the terms to be a skos:Concept. The assertions that terms are SKOS concepts and members of a concept scheme could be made in a TDWG document even if the term definitions were made elsewhere.  So under these circumstances, I think it makes sense to consider the TDWG vocabulary to be the concept scheme rather than the GBIF term list.
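
A hedged sketch of what such a "mix and match" scheme could look like in a TDWG document follows.  The scheme URI, the second namespace (other:), and the term other:cultivated are all hypothetical and exist only to illustrate the pattern.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix emv: <http://rs.gbif.org/vocabulary/gbif/establishment_means/>.
@prefix other: <http://example.org/other-vocabulary/>.

# Hypothetical TDWG-asserted concept scheme drawing on two sources.
<http://example.org/tdwg/mixed_scheme> a skos:ConceptScheme.

# A term defined by GBIF, typed and placed in the scheme by TDWG.
emv:native a skos:Concept;
     skos:inScheme <http://example.org/tdwg/mixed_scheme>;
     rdfs:isDefinedBy emv:.

# A term defined elsewhere; its definer need not know about SKOS at all.
other:cultivated a skos:Concept;
     skos:inScheme <http://example.org/tdwg/mixed_scheme>;
     rdfs:isDefinedBy other:.
```

In this arrangement the defining entities stay responsible for the definitions, while the SKOS typing and scheme membership live entirely in the document that assembles the controlled vocabulary.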


Satisfying the use cases

The SPARQL queries (below) that satisfy the use cases can be run on the Vanderbilt Heard Library SPARQL endpoint at http://rdf.library.vanderbilt.edu/sparql?view.  The endpoint "knows" about the namespace abbreviations used in the examples, so the example queries can be run without the need for namespace declarations.

1. Discover from use of existing GBIF URIs that those terms are included as part of a standard TDWG vocabulary.

Satisfying this use case would depend on inclusion of the triple: 

emv: dcterms:isPartOf <http://rs.tdwg.org/cvterms/establishment_means>.

in the document provided when emv: namespace terms are dereferenced.  That would require a certain level of cooperation between GBIF and TDWG, which is actually pretty likely, so I've put that triple in my RDF description of the GBIF term list.[5]  Assuming that the TDWG vocabulary is related to some standard using dcterms:isPartOf, the following query would discover the identities of both the standard and the vocabulary that are related to the term emv:naturalised:

SELECT ?standard ?standardName ?vocab ?vocabName WHERE
     {
     emv:naturalised dcterms:isPartOf+ ?standard.
     ?standard a dcterms:Standard.
     ?standard rdfs:label ?standardName.
     ?vocab dcterms:isPartOf? ?standard.
     ?vocab a dcmitype:Dataset.
     ?vocab rdfs:label ?vocabName.
     }

This query uses the SPARQL property path "+" form to find standards one or more dcterms:isPartOf links from the term emv:naturalised.  This would allow for a less complex set of hierarchical relationships than the model I described:

term isPartOf termList isPartOf vocabulary isPartOf standard

In order for the query to discover the vocabulary, it uses the SPARQL property path "?" form to find vocabularies zero or one dcterms:isPartOf links from the standard.[6]  Using zero or one links rather than just one link allows for the vocabulary to be synonymous with the standard (although I argued that wasn't a good idea) as long as it's typed as both a dcterms:Standard and a dcmitype:Dataset.[7]  

2. For each controlled value, discover the single string that should serve as the literal value for dwc:establishmentMeans.

That one is easy:

SELECT ?value ?description WHERE
     {
     ?term skos:inScheme <http://rs.tdwg.org/cvterms/establishment_means>.
     ?term rdf:value ?value.
     ?term rdfs:comment ?description.
     }

3. Determine preferred labels for the controlled vocabulary terms in multiple languages from a source other than the standards document itself.  (The preferred labels should not be included in the standards document since additions or changes to those labels should not invoke the TDWG standards process.)  This is to facilitate term selection by non-English language users.

In order to satisfy this use case, we need a separate document that contains the preferred labels for various languages.  This separate document would not be part of the standard itself, and therefore could be changed or added to without invoking any standards change policy.  Here is a possible document:

emv:native skos:prefLabel "native"@en;
           skos:prefLabel "nativo"@es;
           skos:altLabel "indigenous"@en;
           skos:altLabel "reintroduced"@en.

emv:introduced skos:prefLabel "introduced"@en;
               skos:prefLabel "introducido"@es;
               skos:altLabel "exotic"@en;
               skos:altLabel "alien"@en.

emv:naturalised skos:prefLabel "naturalised"@en;
                skos:prefLabel "naturalizado"@es;
                skos:altLabel "naturalized"@en-US.

emv:invasive skos:prefLabel "invasive"@en;
             skos:prefLabel "invasor"@es.

emv:managed skos:prefLabel "managed"@en;
            skos:prefLabel "gestionado"@es;
            skos:altLabel "cultivated"@en;
            skos:altLabel "captive"@en.

emv:uncertain skos:prefLabel "uncertain"@en;
              skos:prefLabel "incierto"@es;
              skos:altLabel "unknown"@en.

(The Spanish translations were done using Google translate, so take them with a grain of salt.)  To display all of the values in a particular language (perhaps to generate a pick list), one could use this query:

SELECT ?strippedLabel WHERE
     {
     ?term skos:inScheme <http://rs.tdwg.org/cvterms/establishment_means>.
     ?term skos:prefLabel ?label.
     FILTER(langMatches(lang(?label), "es"))
     BIND (str(?label) AS ?strippedLabel)
     }

For a language other than Spanish, replace "es" with the ISO 639-1 code for the other language (assuming that preferred labels are available for that language).  You can try it with "en".

4. Discover broader controlled value terms that are entailed by the hierarchical relationships expressed in the human-readable description of the GBIF controlled value terms.

The following query would discover labels of concepts that were broader than the term emv:naturalised:

SELECT ?termName WHERE
     {
     emv:naturalised skos:broader+ ?term.
     ?term rdfs:label ?termName.
     }

In this example, there is only one solution ("introduced"@en) [8], but for more complex controlled vocabularies, using the SPARQL property path "+" form (one or more skos:broader links) would allow a user to discover broader concepts on any level.  This query could be combined with the language filters used in the previous example to generate labels in any of the supported languages.

5. Discover alternate labels that might be useful for disambiguation when cleaning literal value data.

In addition to the preferred labels in various languages, the "translations" document can use skos:altLabel to link to alternate but synonymous labels, as well as spelling variants.  The following query will list strings that are either preferred or alternate labels in any language, along with their corresponding controlled vocabulary string (to be used as a literal value for dwc:establishmentMeans) and URI (to be used as a value for dwciri:establishmentMeans).

SELECT DISTINCT ?strippedLabel ?cvString ?term WHERE
     {
     ?term skos:inScheme <http://rs.tdwg.org/cvterms/establishment_means>.
       {?term skos:prefLabel ?label.}
          UNION
       {?term skos:altLabel ?label.}
     ?term rdf:value ?cvString.
     BIND (str(?label) AS ?strippedLabel)
     }

In this query, I used the DISTINCT keyword, since in some cases the same string might serve as a label in several languages.

Summary

1. If a controlled vocabulary is to be specified as part of a standard, and if that vocabulary borrows terms from other vocabularies, the structure of the RDF will probably be more complex than a non-standard controlled vocabulary established by a single provider.

2. Elements of the controlled vocabulary that are likely to change or be extended (such as preferred and alternate labels) should be specified outside of the standards documents.

3. The property rdf:value could be used to specify a literal value that is composed of the single string that is "the controlled value" for that term.  This is important information in cases where metadata are conventionally stored and transmitted as CSV files (using literals exclusively) rather than as RDF (making use of URIs).

4. Although I've used RDF/Turtle to describe the properties that relate the components of the controlled vocabulary, many controlled vocabulary users won't actually care about the RDF representation.  Nevertheless, being able to clearly describe those relationships has value in itself.

I'm going to wrap up this series of posts by looking at a case where the recommended controlled vocabulary for a term is completely specified outside of TDWG: the Getty Thesaurus of Geographic Names, which is specified for use as values for the Darwin Core property dwc:country.

Notes

[1] From the Vocabulary of a Friend specification.
[2] See the W3C Interest Group Note on VoID.
[3] In addition to the notions of dcat:Dataset and dcmitype:Dataset, there is also the notion of an "RDF dataset" as defined by the SPARQL Query Language for RDF W3C Recommendation, Section 8.  In that context, an RDF dataset is a specific collection of RDF graphs.  In this post, I generally refer to a "dataset" in the general sense of Dublin Core.  However, the notion of an RDF dataset is an important one in the context of containment and provenance of triples.  For the purpose of tracking the contents of versions of a standard over time, provenance is also important, so RDF datasets may be relevant in that context.  See Section 5.3 of the SKOS primer for more on this subject.
[4] See https://soyouthinkyoucandigitize.wordpress.com/2013/07/18/data-diversity-of-the-week-sex/ for an interesting example.
[5] If GBIF doesn't provide an RDF document that is returned when the GBIF terms are dereferenced, a client couldn't "discover" the relationship to the standard by "following its nose".  However, the query would still work as long as a client had access to all of the triples in the examples, for example if TDWG asserted the triples and they were loaded into a triplestore queriable by SPARQL.
[6] For unknown reasons, the Heard Library Callimachus-based SPARQL endpoint hangs when the SPARQL property path "?" form is used in a query.  To make the query work on the endpoint, you can substitute
?vocab dcterms:isPartOf* ?standard.
for the sixth line of the query.  This "*" form is for 0-to-many property matches rather than 0 or 1 property matches.
[7] The query would mess up if the standard were the same as the vocabulary, and the term list were typed as a dcmitype:Dataset (explicitly or through entailment) and dcterms:isPartOf the vocabulary.  But for the query to work at all would require some level of conformance to the pattern I laid out, so it's probably not productive anyway to waste time creating a complex query that would catch all of the possible ways to not conform.
[8] There is only one solution if the query operated only on asserted triples.  If reasoning were enabled, this query would also find all of the preferred and alternate labels for emv:naturalised in all languages, since both skos:prefLabel and skos:altLabel are rdfs:subPropertyOf rdfs:label.
