Steve Baskauf's blog: Controlled values for Subject Category from Audubon Core

Disclaimer: this post contains opinions that are entirely my own and do not represent any official position related to any of my work as part of TDWG.

A while ago, I got drafted to serve as the convener of a task group of Biodiversity Information Standards (TDWG) charged with (among other things) drafting a Standards Documentation Specification. This specification should lay out how TDWG standards would be structured to make them clear and understandable for humans and machines. With respect to human-readable documents, there were precedents to follow, and there is one precedent (Darwin Core) to follow for machine-readable representations. So we've made some progress on a draft specification.

However, what has really caused me to stall out on this work is trying to figure out how TDWG should specify controlled vocabularies. There are several places in TDWG vocabulary standards where it is stated that best practice is to use a "controlled vocabulary" for values associated with a property. Here are some examples:

Iptc4xmpExt:CVterm (from Audubon Core)
dwc:establishmentMeans (from Darwin Core)
dwc:country (from Darwin Core)

The questions spinning around in my head were:

What exactly is a controlled value?
What is the purpose of a controlled value?
How should one describe controlled values using RDF?

These questions set me off on an exploration of ISO 25964 and SKOS, which I wrote about in my previous blog post. In that post, I asserted that perhaps the most important question to be answered before constructing a vocabulary of any sort is to carefully lay out the use cases to be satisfied by that vocabulary. That seems obvious, but the TDWG email discussion list is littered with long arguments that would have been more productive if use cases had been clearly laid out at the start (mea culpa!). So I've spent some time recently thinking about what the use cases are for controlled vocabularies in the TDWG context. I've concluded that "controlled vocabulary" is a pretty vague term and that there really are several intended purposes for controlled vocabularies in the context of TDWG properties. The differences in these intended purposes should inform the design of the RDF that specifies the controlled vocabulary terms. In this and subsequent posts, I'm going to talk about my attempts to specify three experimental controlled vocabularies to be used as values for TDWG terms, and to lay out the use cases to be satisfied in each situation.

Subject Category from Audubon Core (Iptc4xmpExt:CVterm)

Audubon Core is "a set of vocabularies designed to represent metadata for biodiversity multimedia resources and collections".[1] Although Audubon Core mints some new terms, wherever possible it reuses existing terms. One of these reused terms is http://iptc.org/std/Iptc4xmpExt/2008-02-29/CVterm (abbreviated as Iptc4xmpExt:CVterm), which comes from the IPTC Standard Photo Metadata terms and is labeled by Audubon Core as "Subject Category". The property Iptc4xmpExt:CVterm supports classification of media items by linking to controlled vocabulary terms for subjects of the items.

Audubon Core is considered to be a data model that does not prefer any particular implementation. As such, it can be used in structured text, such as CSV, or as RDF (although a final RDF implementation has not been completed at this point). Because Audubon Core may be used in implementations that are predominately text-based, unqualified literals are permitted as values instead of URIs if the term is either from one of the Audubon Core recommended sets, or if the source vocabulary is specified using the ac:subjectCategoryVocabulary (i.e. http://rs.tdwg.org/ac/terms/subjectCategoryVocabulary) property. However, in the context of RDF, it is probably better for the object of Iptc4xmpExt:CVterm to be a URI whenever possible, since that would guarantee uniqueness, and permit discovery of other properties of the controlled value, such as preferred human-readable labels.

SERNEC Live Plant Image Group standardized image views

Because I manage the Bioimages website, I'm very interested in categorizing plant images in a systematic way. In 2008, Bruce Kirchoff and I created a system of standardized views that would allow plant images to be organized in a systematic way based on the plant part, and the viewing angle and orientation of the part. These views were published in Vulpia 7:16-30 and were subsequently vetted by a group of live plant photographers under the auspices of the Southeast Regional Network of Expertise and Collections (SERNEC). I made an attempt at an RDF representation of the standardized views in the form of an ontology, but at the time I was struggling with the task because I was pretty sure that what I was doing was not "right". So the effort to continue development of the standardized views and extend them to new plant groups and animals stalled out.

After educating myself recently about SKOS and thesauri, it was clear to me why formalizing the standardized views as an ontology back then was the wrong approach. The views form a hierarchy; for example, a view of the perianth of a flower was categorized in the group of views associated with inflorescences, and inflorescence views were one of the categories of views that applied to herbaceous angiosperms. My initial attempt modeled the views as classes, and the hierarchical relationships were expressed using rdfs:subClassOf. This allowed a reasoner to infer membership in classes at higher levels in the hierarchy. There were two problems with this approach. One was that I really wanted to use the views as values with the Iptc4xmpExt:CVterm property. But as classes, it would be more appropriate to make the link using rdf:type. The second problem was that the subclassing didn't make any sense. If I said

perianth subClassOf inflorescence subClassOf herbaceous angiosperm

do I really mean that something that is a perianth is also an inflorescence and is also an herbaceous angiosperm? Not really. I suppose that I could get around this problem by saying that what I was defining was a "view" of perianth and a "view" of an inflorescence and a "view" of an herbaceous angiosperm, and in that context the subclassing might make some sense. But if I make the link using rdf:type and say something like

<image> a <perianth>.

then am I saying that an image is a perianth? Am I saying that an image is a view of a perianth? What exactly am I saying?

Designing a SLPIG standardized view thesaurus

I could get around this confusion if I consider a view to be a skos:Concept rather than a class. The purpose of a skos:Concept is to allow humans to categorize things in a knowledge organization system. That's exactly what the standardized views are for: to categorize images, not really to describe the nature of images, or of perianth, or of herbaceous angiosperms.

Once I had that epiphany, then I knew that what I wanted to do was to design a thesaurus, not an ontology. A major design consideration is that the thesaurus needed to reflect the structure of the existing view hierarchy, since the hierarchy was already in use. It should also satisfy these use cases:

Allow a user to select a view by presenting labels for concepts that are appropriate for a particular category of organisms (e.g. gymnosperms, herbaceous angiosperms, etc.)
Allow images to be grouped by major categories (leaf, bark, fruit, etc.) and within those categories by views that were appropriate for each major category.
Group images that fall into the same major category regardless of whether they are in the same category of organism (e.g. show bark of any trees whether they are woody angiosperms or gymnosperms).
Support labels in multiple languages.

There were several ways I could have achieved these design goals. I decided to create a skos:ConceptScheme for each of the organism groups: woody angiosperms, herbaceous angiosperms, and gymnosperms. Within each concept scheme, the top concepts were the major view categories for that group (e.g. entire organism, stem, leaf, inflorescence, fruit, and seed for herbaceous angiosperms). The views within the major categories were linked to their category using skos:broader. Categories and views that were the same as those in another organism group's concept scheme were linked using skos:exactMatch. skos:prefLabel was used to specify the preferred label for a language. Here is an incomplete snippet of the part of the thesaurus that organizes the views that apply to herbaceous angiosperms:

<http://bioimages.vanderbilt.edu/rdf/stdview#02>
    a skos:ConceptScheme;
    rdfs:isDefinedBy <http://bioimages.vanderbilt.edu/rdf/stdview>;
    rdfs:seeAlso <http://www.cals.ncsu.edu/plantbiology/ncsc/vulpia/pdf/Baskauf_&_Kirchoff_Digital_Plant_Images.pdf>;
    skos:hasTopConcept <http://bioimages.vanderbilt.edu/rdf/stdview#0200>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#0201>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#0202>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#0203>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#0204>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#0205>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#0206>;
    skos:note "II. Herbaceous angiosperm views"@en;
    skos:prefLabel "herbaceous angiosperms"@en.

<http://bioimages.vanderbilt.edu/rdf/stdview#0203>
    a skos:Concept;
    skos:definition "II.C. Leaf"@en;
    skos:exactMatch <http://bioimages.vanderbilt.edu/rdf/stdview#0104>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#0304>;
    skos:inScheme <http://bioimages.vanderbilt.edu/rdf/stdview#02>;
    skos:prefLabel "leaf"@en.

<http://bioimages.vanderbilt.edu/rdf/stdview#020302>
    a skos:Concept;
    skos:broader <http://bioimages.vanderbilt.edu/rdf/stdview#0203>;
    skos:closeMatch <http://bioimages.vanderbilt.edu/rdf/stdview#010401>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#020301>,
        <http://bioimages.vanderbilt.edu/rdf/stdview#030401>;
    skos:definition "II.C.2. leaf on the upper stem, with the apex up"@en;
    skos:inScheme <http://bioimages.vanderbilt.edu/rdf/stdview#02>;
    skos:prefLabel "upper stem leaves"@en.

<http://bioimages.vanderbilt.edu/rdf/stdview#020303>
    a skos:Concept;
    skos:broader <http://bioimages.vanderbilt.edu/rdf/stdview#0203>;
    skos:definition "II.C.3. margin of upper surface of leaf; part of the lower surface of another leaf with major veins visible should be shown behind the upper surface"@en;
    skos:exactMatch <http://bioimages.vanderbilt.edu/rdf/stdview#010402>;
    skos:inScheme <http://bioimages.vanderbilt.edu/rdf/stdview#02>;
    skos:prefLabel "margin of upper and lower leaf surface"@en.

The entire thesaurus can be retrieved from http://bioimages.vanderbilt.edu/rdf/stdview.rdf as RDF/XML or http://bioimages.vanderbilt.edu/rdf/stdview.ttl as RDF/Turtle. One annoying thing is that the RDF editor I use (rdfEditor) balks if I use the namespace abbreviation stdview: for http://bioimages.vanderbilt.edu/rdf/stdview# because that causes the local name string to begin with a numeric character. I can't remember if that's just a problem in XML or if it really applies to Turtle as well. In any case, that's why the full URIs are listed in the example above instead of abbreviating them as something like stdview:020302. The SPARQL endpoints I've experimented with don't seem to mind the abbreviations, however.

In the example, the specific view of the margin of the upper and lower leaf surface is linked to the general category of leaf views using skos:broader. It's linked to the herbaceous angiosperm organism group using skos:inScheme. It's linked to the view of the margin of the upper and lower leaf surface in the woody angiosperm concept scheme using skos:exactMatch. The view of an upper stem leaf isn't exactly the same thing as a view of a lower stem leaf (stdview:020301), nor of a view of a whole leaf in woody angiosperms (stdview:010401), nor of a needle in gymnosperms (stdview:030401). But it's similar to those views, so it's linked to them using skos:closeMatch.
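To make those links concrete, here is a minimal JavaScript sketch that holds a few of the triples from the snippet in memory and follows skos:broader from the leaf category down to its views. The triple array and the helper names (objectsOf, viewLabelsInCategory) are my own illustrative assumptions and are not part of the Bioimages code; only the URIs and labels come from the thesaurus snippet.

```javascript
// Hypothetical in-memory copy of a few triples from the thesaurus snippet.
const STDVIEW = 'http://bioimages.vanderbilt.edu/rdf/stdview#';
const SKOS = 'http://www.w3.org/2004/02/skos/core#';

const triples = [
  [STDVIEW + '0203', SKOS + 'inScheme', STDVIEW + '02'],
  [STDVIEW + '0203', SKOS + 'prefLabel', 'leaf'],
  [STDVIEW + '020302', SKOS + 'broader', STDVIEW + '0203'],
  [STDVIEW + '020302', SKOS + 'prefLabel', 'upper stem leaves'],
  [STDVIEW + '020303', SKOS + 'broader', STDVIEW + '0203'],
  [STDVIEW + '020303', SKOS + 'prefLabel', 'margin of upper and lower leaf surface'],
];

// Return every object stated for a given subject and predicate.
function objectsOf(subject, predicate) {
  return triples
    .filter(([s, p]) => s === subject && p === predicate)
    .map(([, , o]) => o);
}

// Return the preferred label of every view whose skos:broader is the category.
function viewLabelsInCategory(categoryUri) {
  return triples
    .filter(([, p, o]) => p === SKOS + 'broader' && o === categoryUri)
    .map(([s]) => objectsOf(s, SKOS + 'prefLabel')[0]);
}

console.log(viewLabelsInCategory(STDVIEW + '0203'));
```

A real application would ask a triplestore rather than filter an array, but the traversal is the same one the SPARQL queries later in this post perform.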

Using the SLPIG standardized view thesaurus

The 13958 images in the Bioimages database are all categorized using standard view URIs as values of the Iptc4xmpExt:CVterm property. (See http://bioimages.vanderbilt.edu/tsn/19312 as an example of how the SLPIG views are used to sort out images.) The thesaurus has been loaded in the Vanderbilt Heard Library triplestore and is queryable at its SPARQL endpoint. So we can test out the thesaurus there by pasting the queries that follow into the endpoint's query box.

Use case 1 (Allow a user to select a view by presenting labels for concepts that are appropriate for a particular category of organisms, e.g. gymnosperms, herbaceous angiosperms, etc.):

This query shows all of the categories and views for the scheme labeled "woody angiosperms":

PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>
PREFIX stdview: <http://bioimages.vanderbilt.edu/rdf/stdview#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?categoryLabel ?viewLabel
WHERE {
    ?scheme skos:prefLabel "woody angiosperms"@en.
    ?scheme skos:hasTopConcept ?viewCategory.
    ?view skos:broader ?viewCategory.
    ?viewCategory skos:prefLabel ?categoryLabel.
    ?view skos:prefLabel ?viewLabel.
}

You can replace "woody angiosperms" with "herbaceous angiosperms" or "gymnosperms" to display the categories and views for other groups. An application could present a user with a pick list of categories, then views after a particular scheme is chosen. The pick list could be used to categorize images as their metadata were recorded.
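An application that swaps the scheme label in and out would probably build the query string in code. The following is a minimal sketch of that idea, not code from the Bioimages site: the function names buildSchemeQuery and escapeSparqlString are hypothetical, and the ORDER BY clause is my own addition so that a pick list would come back sorted.

```javascript
// Escape characters that would end or corrupt a double-quoted SPARQL literal.
function escapeSparqlString(text) {
  return text
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
}

// Build the use case 1 query for one concept scheme label (hypothetical helper).
function buildSchemeQuery(schemeLabel, lang = 'en') {
  // Reject anything that does not look like a language tag.
  if (!/^[a-zA-Z]+(-[a-zA-Z0-9]+)*$/.test(lang)) {
    throw new Error('invalid language tag: ' + lang);
  }
  return [
    'PREFIX skos: <http://www.w3.org/2004/02/skos/core#>',
    'SELECT ?categoryLabel ?viewLabel',
    'WHERE {',
    `  ?scheme skos:prefLabel "${escapeSparqlString(schemeLabel)}"@${lang}.`,
    '  ?scheme skos:hasTopConcept ?viewCategory.',
    '  ?view skos:broader ?viewCategory.',
    '  ?viewCategory skos:prefLabel ?categoryLabel.',
    '  ?view skos:prefLabel ?viewLabel.',
    '}',
    'ORDER BY ?categoryLabel ?viewLabel',
  ].join('\n');
}

console.log(buildSchemeQuery('gymnosperms'));
```

Escaping the label matters as soon as the string comes from user input, since an unescaped quotation mark would otherwise end the literal early.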

Use case 2 (Allow images to be grouped by major categories (leaf, bark, fruit, etc.) and within those categories by views that were appropriate for each major category):

I've built a test application at http://bioimages.vanderbilt.edu/sparql-search.htm that uses SPARQL queries to narrow the search categories. To see how the dropdown controls were created, view the page source. The guts of the SPARQL queries and the dialogue with the Heard Library endpoint can be seen in the http://bioimages.vanderbilt.edu/sparql-search.js file. Each of the dropdown pick lists is populated by querying the endpoint to find out what values are available for each of the categories. Here's the function that generates the query that requests the data needed to populate the category dropdown (using some jQuery calls in addition to generic JavaScript):

function setCategoryOptions(passedGenus) {
    // create URI-encoded query string
    var string = "PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>" +
                 "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>" +
                 "PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>" +
                 "PREFIX foaf: <http://xmlns.com/foaf/0.1/>" +
                 "PREFIX dsw: <http://purl.org/dsw/>" +
                 'SELECT DISTINCT ?category WHERE {' +
                 "?identification dwc:genus " + passedGenus + "." +
                 "?organism dsw:hasIdentification ?identification." +
                 "?organism foaf:depiction ?image." +
                 "?image Iptc4xmpExt:CVterm ?view." +
                 "?view skos:broader ?featureCategory." +
                 "?featureCategory skos:prefLabel ?category." +
                 '}' +
                 'ORDER BY ASC(?category)';
    var encodedQuery = encodeURIComponent(string);

    // send query to endpoint and request the results as XML
    $.ajax({
        type: 'GET',
        url: 'http://rdf.library.vanderbilt.edu/sparql?query=' + encodedQuery,
        headers: {
            Accept: 'application/sparql-results+xml'
        },
        success: parseCategoryXml
    });
}

After the function runs, it passes the results XML to a function that pulls the values from the elements and uses them to populate the dropdown lists. The function shown above runs when the page loads, and in that case the value of passedGenus is ?genus, which places no restrictions on the genus in the query. However, the function also fires when there is a change in the genus drop-down. In that case, the selected value of the genus is inserted into the query as a literal (e.g. "Acer"). The query then finds out what categories are actually present in the data for that genus (as opposed to what categories "should" be appropriate for that genus). So for example, if the genus Bradburia is selected, only the categories "entire organism", "inflorescence", and "leaf" are loaded into the pick list because there aren't any stem, fruit, or seed images in the database. One downside of this is that the query takes long enough to run that the dropdown sometimes doesn't get populated with the appropriate values before the user makes a selection. As I noted in an earlier post, Callimachus (which the Heard Library endpoint is currently using) runs queries much more slowly than Stardog, so it's possible that performance here would be much better with a faster endpoint.
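The parsing function itself (parseCategoryXml) isn't shown in this post. As a rough illustration of the "pull the values out of the results" step, here is a sketch that assumes the results were requested in the W3C SPARQL 1.1 Query Results JSON format rather than XML; whether a given endpoint offers that format is an assumption to check. The function name extractValues and the sample object are hypothetical; the sample is hand-written to mirror the Bradburia example and is not real endpoint output.

```javascript
// Pull the values bound to one variable out of a SPARQL JSON results object.
// Bindings that leave the variable unbound are skipped.
function extractValues(results, variableName) {
  return results.results.bindings
    .filter((binding) => variableName in binding)
    .map((binding) => binding[variableName].value);
}

// Hand-written sample in the SPARQL 1.1 Query Results JSON shape (not real data).
const sampleResults = {
  head: { vars: ['category'] },
  results: {
    bindings: [
      { category: { type: 'literal', 'xml:lang': 'en', value: 'entire organism' } },
      { category: { type: 'literal', 'xml:lang': 'en', value: 'inflorescence' } },
      { category: { type: 'literal', 'xml:lang': 'en', value: 'leaf' } },
    ],
  },
};

// These strings are what would be written into the category pick list.
console.log(extractValues(sampleResults, 'category'));
```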

Once the user selects a category, that fires the function to query the endpoint to find the views that fall into that category:

function setViewOptions(passedCategory) {
    // create URI-encoded query string
    var string = "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>" +
                 'SELECT DISTINCT ?viewLabel WHERE {' +
                 '?featureCategory skos:prefLabel ' + passedCategory + '.' +
                 '?view skos:broader ?featureCategory.' +
                 '?view skos:prefLabel ?viewLabel.' +
                 '}' +
                 'ORDER BY ASC(?viewLabel)';
    var encodedQuery = encodeURIComponent(string);

...

As before, the XML results eventually end up on the drop-down pick list for selecting the view. Unlike the previous query, it doesn't (at this point) display only the labels that are used for images in the database that meet the other search criteria - it displays all possible views that fall into the selected category. It would be nice to restrict the views to those used in images that meet the other search criteria, but I haven't spent the time necessary to make the code that complex.
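One possible way to get that restriction, sketched here and untested against the endpoint, is to reuse the image-related triple patterns from the category query so that only views actually attached to images of the chosen genus are returned. The function name buildRestrictedViewQuery is hypothetical. Unlike setViewOptions, this sketch takes a bare label such as bark and adds the quotation marks and the @en tag itself, which is also an assumption about how the caller would pass values.

```javascript
// Escape a string for use inside a double-quoted SPARQL literal.
function escapeSparqlString(text) {
  return text.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
}

// Hypothetical variant of the view query: list only views that are used by
// images of the chosen genus within the chosen category.
// Pass null for genus to leave the genus unrestricted (the ?genus case).
function buildRestrictedViewQuery(genus, categoryLabel) {
  const genusTerm = genus === null ? '?genus' : `"${escapeSparqlString(genus)}"`;
  return [
    'PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>',
    'PREFIX skos: <http://www.w3.org/2004/02/skos/core#>',
    'PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>',
    'PREFIX foaf: <http://xmlns.com/foaf/0.1/>',
    'PREFIX dsw: <http://purl.org/dsw/>',
    'SELECT DISTINCT ?viewLabel WHERE {',
    `  ?identification dwc:genus ${genusTerm}.`,
    '  ?organism dsw:hasIdentification ?identification.',
    '  ?organism foaf:depiction ?image.',
    '  ?image Iptc4xmpExt:CVterm ?view.',
    '  ?view skos:broader ?featureCategory.',
    `  ?featureCategory skos:prefLabel "${escapeSparqlString(categoryLabel)}"@en.`,
    '  ?view skos:prefLabel ?viewLabel.',
    '}',
    'ORDER BY ASC(?viewLabel)',
  ].join('\n');
}

console.log(buildRestrictedViewQuery('Acer', 'bark'));
```

The extra patterns would make the query slower on an endpoint that is already slow, so this trades responsiveness for a shorter, more relevant pick list.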

Use case 3 (Group images that fall into the same major category regardless of whether they are in the same category of organism, e.g. show bark of any trees whether they are woody angiosperms or gymnosperms):

The test SPARQL web search JavaScript cheats on this one by including a triple pattern that requires a match to the label string:

             +'?featureCategory skos:prefLabel '+passedCategory+'.'

If the category passed to the function were "bark"@en, then the triple pattern would be

?featureCategory skos:prefLabel "bark"@en.

causing the views to be returned for any category that has the preferred label "bark"@en. That's fine as long as the major categories have the same preferred label, but if I'd used the labels "angiosperm bark"@en and "gymnosperm bark"@en, that trick wouldn't work.

A better approach would be to make use of this information:

stdview:0102 skos:exactMatch stdview:0302.

where stdview:0102 is the bark category from the woody angiosperm concept scheme and stdview:0302 is the bark category from the gymnosperm concept scheme.

Let's say that I want to find all of the views that are in any category that is equivalent to the category of the view of http://bioimages.vanderbilt.edu/baskauf/41954 (a photo of redwood bark). Maybe I'm in the redwood forest and I want to see all of the kinds of bark that might be there, and I don't care whether the tree is an angiosperm or a gymnosperm. I could use this query to discover the other bark views:

SELECT DISTINCT ?otherViews ?label
WHERE {
<http://bioimages.vanderbilt.edu/baskauf/41954> Iptc4xmpExt:CVterm ?view.
?view skos:broader ?viewCategory.
?viewCategory skos:exactMatch* ?equivCategory.
?otherViews skos:broader ?equivCategory.
?otherViews skos:prefLabel ?label.
}

which produces these results:

otherViews      label
stdview:030202  bark of a medium tree or large branch@en
stdview:030201  bark of a large tree@en
stdview:030203  bark of a small tree or small branch@en
stdview:030200  unspecified bark view@en
stdview:010201  bark of a large tree@en
stdview:010203  bark of a small tree or small branch@en
stdview:010202  bark of a medium tree or large branch@en
stdview:010200  unspecified bark view@en

Although some of the labels are the same, all of the view URIs are different because they fall into two different concept schemes. The key triple pattern in this graph pattern is:

?viewCategory skos:exactMatch* ?equivCategory.

In that triple pattern, I used the "*" arbitrary length path matching operator, which matches with paths zero to many properties long. In theory, I could have used the "?" operator (paths zero or one property long), except that for whatever reason, doing that hangs the Callimachus endpoint. I used "*" rather than "+" (paths one to many properties long) because I also want the query to pick up the views that are in the same category as the redwood bark picture (stdview:0302).
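The difference between the operators can be seen in a small in-memory simulation. This is only a sketch of the reachability idea behind SPARQL property paths, not how any endpoint implements them; the function name reachable and the two edge maps are hypothetical. The first map deliberately contains a single one-directional link so that the difference between "*" and "+" is visible. When the links are stated in both directions, "+" also gets back to the starting category by going there and back, as the second map shows.

```javascript
// Nodes reachable from `start` by following `edges`, in the style of a SPARQL path.
// minSteps = 0 models "*" (zero or more steps); minSteps = 1 models "+" (one or more).
function reachable(edges, start, minSteps) {
  const found = new Set();
  const queue = [];
  if (minSteps === 0) {
    // A zero-length path always matches the start node itself.
    found.add(start);
    queue.push(start);
  } else {
    // Require at least one step before anything counts as a match.
    for (const next of edges[start] || []) {
      if (!found.has(next)) {
        found.add(next);
        queue.push(next);
      }
    }
  }
  // Breadth-first expansion over the remaining links.
  while (queue.length > 0) {
    const node = queue.shift();
    for (const next of edges[node] || []) {
      if (!found.has(next)) {
        found.add(next);
        queue.push(next);
      }
    }
  }
  return [...found].sort();
}

// Hypothetical skos:exactMatch links between the two bark categories.
const oneWay = { 'stdview:0302': ['stdview:0102'] };
const bothWays = { 'stdview:0302': ['stdview:0102'], 'stdview:0102': ['stdview:0302'] };

console.log(reachable(oneWay, 'stdview:0302', 0));   // "*" includes the start category
console.log(reachable(oneWay, 'stdview:0302', 1));   // "+" omits it when there is no way back
console.log(reachable(bothWays, 'stdview:0302', 1)); // "+" reaches it again through the cycle
```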

When I was writing the thesaurus, one design decision I had to make was whether to assume that the endpoint would support reasoning using SKOS as a schema (Tbox). According to the SKOS spec, skos:exactMatch is transitive and symmetric, so if I knew for sure that the endpoint were going to infer the entailed relationships, I could have related multiple equivalent categories for entire organisms in my thesaurus like this:

stdview:0101 skos:exactMatch stdview:0201.
stdview:0201 skos:exactMatch stdview:0301.
stdview:0101 skos:exactMatch stdview:0401.

or in any other way that connected the equivalent categories via at least one link. However, since I wasn't sure that endpoints would have that capability, I stated the relationships like this:

stdview:0101 skos:exactMatch stdview:0201.
stdview:0101 skos:exactMatch stdview:0301.
stdview:0101 skos:exactMatch stdview:0401.

stdview:0201 skos:exactMatch stdview:0101.
stdview:0201 skos:exactMatch stdview:0301.
stdview:0201 skos:exactMatch stdview:0401.
stdview:0301 skos:exactMatch stdview:0101.
stdview:0301 skos:exactMatch stdview:0201.
stdview:0301 skos:exactMatch stdview:0401.
stdview:0401 skos:exactMatch stdview:0101.
stdview:0401 skos:exactMatch stdview:0201.
stdview:0401 skos:exactMatch stdview:0301.

which causes every equivalent category to be explicitly linked to every other equivalent category via skos:exactMatch. That was annoying, but safe. Another possibility would have been to have just defined a single category concept like stdview:entireOrganism and reused it in all of the concept schemes. There is nothing in the SKOS guidelines that says that a concept cannot be used in several concept schemes. However, since the view categories had already been assigned category URIs that were in use, it seemed best to keep using those and to link them with skos:exactMatch.
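Writing out every pair by hand is error-prone, so a list like the one above could be generated instead. Here is a minimal sketch; the function name allPairsExactMatch is hypothetical and I did not actually produce the thesaurus this way. For n equivalent concepts it emits n(n-1) statements, which is 12 for the four entire organism categories.

```javascript
// Generate an explicit skos:exactMatch statement for every ordered pair of
// distinct concepts in a group of equivalent concepts.
function allPairsExactMatch(concepts) {
  const statements = [];
  for (const subject of concepts) {
    for (const object of concepts) {
      if (subject !== object) {
        statements.push(`${subject} skos:exactMatch ${object}.`);
      }
    }
  }
  return statements;
}

// The four "entire organism" categories named in the listing above.
const entireOrganism = ['stdview:0101', 'stdview:0201', 'stdview:0301', 'stdview:0401'];
console.log(allPairsExactMatch(entireOrganism).join('\n'));
```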

Use case 4 (Support labels in multiple languages):
At present, the preferred labels are only given in English. But they are language-tagged literals, so at some point in the future when preferred labels are provided for other languages, preferred labels for one language could be distinguished from preferred labels in other languages by using a filter. For example, one could add

FILTER(langMatches(lang(?label), "en"))
BIND (str(?label) AS ?strippedLabel)

The FILTER statement requires the labels to be some variety of English (en, en-US, en-GB, etc.) and the second statement binds the string part of the label (minus the language tag) to a new variable ?strippedLabel, which can be displayed to users. You can try adding this to the query above [2], although it will only work for the English language tag "en" at present.
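If the filtering were done on the client side instead of in the query, the same matching rule could be approximated in a few lines. This sketch imitates the basic behavior of langMatches (case-insensitive match on the tag or on a prefix followed by a hyphen, with "*" matching any tagged literal); it is not a full implementation of language-range matching. The function names and the sample labels are hypothetical, and the non-English label is invented for illustration since the thesaurus currently has English labels only.

```javascript
// Approximate SPARQL langMatches(): does a language tag fall within a range?
function langMatches(tag, range) {
  if (tag === '') return false;    // untagged literals never match
  if (range === '*') return true;  // "*" matches any literal that has a tag
  const t = tag.toLowerCase();
  const r = range.toLowerCase();
  return t === r || t.startsWith(r + '-');
}

// Keep only labels in the requested language and strip the tag,
// which is what FILTER plus BIND(str(...)) does in the query.
function labelsForLanguage(labels, range) {
  return labels
    .filter((label) => langMatches(label.lang, range))
    .map((label) => label.value);
}

// Hypothetical labels for one concept.
const sampleLabels = [
  { value: 'leaf', lang: 'en' },
  { value: 'leaf', lang: 'en-US' },
  { value: 'Blatt', lang: 'de' }, // invented German label
  { value: 'leaf', lang: '' },    // plain literal with no language tag
];

console.log(labelsForLanguage(sampleLabels, 'en'));
console.log(labelsForLanguage(sampleLabels, 'de'));
```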

Conclusions

1. I was really very pleased with how this thesaurus has worked out. I was able to keep it relatively simple, with only two levels in its concept hierarchy. The semantics of SKOS seem to be right for the task and thus far I haven't thought of any tasks that I haven't been able to easily write SPARQL queries to complete. I'm getting an immediate bang for my buck by being able to search for images after loading the thesaurus triples and accessing them through the Heard Library SPARQL endpoint.

2. I think that if I were starting from scratch, I'd still define a concept scheme for each organism group (herbaceous angiosperms, gymnosperms, etc.) but would designate a single concept for each category and view to avoid having to declare many concepts as equivalent.

3. I should note that I'm not considering that this thesaurus would be the controlled vocabulary for Iptc4xmpExt:CVterm in Audubon Core. It would be a controlled vocabulary that could be used with Iptc4xmpExt:CVterm. It would have value to the extent that it were widely used. The point I'm trying to make in this post is that it would be advantageous for the value of Iptc4xmpExt:CVterm to be populated with URIs that dereference to SKOS concepts rather than populated with strings that would have to be cleaned up and reconciled with some list of standardized strings.

4. This thesaurus uses SKOS in a fairly conventional way. It assumes that the controlled values will be specified completely and sufficiently by associating the image with only a single view URI in the metadata, rather than using literals and requiring aggregators to perform string matching. As you'll see in upcoming posts, this won't be the case for other controlled vocabularies with which I'm experimenting. In this case, I make no attempt to associate strings with the image record because I assume that the labels may change or be added at any time, and that producers and consumers can access the labels at will from the thesaurus.

Endnotes

[1] http://terms.tdwg.org/wiki/Audubon_Core
[2] Like this:

PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>
PREFIX stdview: <http://bioimages.vanderbilt.edu/rdf/stdview#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?otherViews ?strippedLabel
WHERE {
<http://bioimages.vanderbilt.edu/baskauf/41954> Iptc4xmpExt:CVterm ?view.
?view skos:broader ?viewCategory.
?viewCategory skos:exactMatch* ?equivCategory.
?otherViews skos:broader ?equivCategory.
?otherViews skos:prefLabel ?label.
FILTER(langMatches(lang(?label), "en"))
BIND (str(?label) AS ?strippedLabel)
}


Monday, March 21, 2016
