Monday, December 19, 2011

Towards an interactive taxonomic article: displaying an article from ZooKeys

One of the things I keep revisiting is the way we display scientific articles. Apart from Nature's excellent iPhone and iPad apps, most efforts to re-imagine how we display articles are little more than glorified PDF viewers (e.g., the PLoS iPad app).

Part of the challenge is that if we make the article more interactive we immediately confront the problem of how to link to other content. For example, we may have a lovingly crafted ePub view (e.g., Nature's apps), but what happens when the user clicks on a citation to another paper? If the paper is published by the same journal, then potentially it could be viewed using the same viewer, but if not then we are at the mercy of the other publisher. They will have their own ideas of how to display articles, so the simplest fallback is to display the cited article in a web browser view. The problem with this is that it breaks the user experience - the other publisher is unlikely to follow the same conventions for displaying an article and its links. If we are lucky the cited article might be published in an Open Access journal that provides, say, XML based on the NLM DTD standard. Knowing whether an article is Open Access or not is not straightforward, and different journals have their own unique interpretation of the NLM standard.

Then there is the issue of other kinds of content, such as taxonomic names, specimens, DNA sequences, geographic localities, etc. We lack decent services for many of these objects. As a result, efforts like PLoS Biodiversity Hub end up being underwhelming collections of reformatted journal articles, rather than innovative integrations of biodiversity knowledge.

With these issues in mind I've started playing with ZooKeys XML, initially looking at ways to display the article beyond the conventional format. Ultimately I'd like to embed the article in a broader web of citations and data. ZooKeys articles are available in PDF, HTML, and XML. The HTML has links to taxon pages, maps, etc., which is nice, but I personally find this a little jarring because it interrupts the reading experience. The ZooKeys web site also surrounds the article with all the paraphernalia of a publisher's web site:

[Screenshot: the article as displayed on the ZooKeys web site]

As a first experiment, I've taken the XML for the article "At the lower size limit for tetrapods, two new species of the miniaturized frog genus Paedophryne (Anura, Microhylidae)" (http://dx.doi.org/10.3897/zookeys.154.1963) and used an XSLT style sheet to reformat it. I've borrowed some ideas from Nature's apps, such as the font for the title, displaying the abstract in bold, and showing all the figures in the article as thumbnails near the top. I've also added some basic interactivity, which you can see in the video below. Instead of figures being in one place in the article, wherever a figure is mentioned (e.g., "Fig. 1") you can click on the reference and the figure appears. If the article displays a point locality using latitude and longitude, then instead of launching a separate browser window with a Google map, you click on the locality and the map appears. The idea is that the flow of reading isn't interrupted: figures, maps, and citations all appear in the text.


This demo (which you can see live at http://iphylo.org/~rpage/zookeys) is limited, but most of its functionality comes from simply reformatting XML using XSLT. There's a little bit of jQuery for animation, and I ended up having to write a PHP script to convert verbatim latitude and longitude coordinates to the decimal coordinates expected by Google Maps, but it's all very lightweight. It wouldn't take much to add some JSON queries to make the taxon names clickable (e.g., showing a summary of a taxon from EOL). Because ZooKeys uses the NLM DTD for its XML, some of this code could also be applied to other journals, such as PLoS, so we could start to grow a library of linked, interactive taxonomic articles.
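To give a flavour of the coordinate conversion mentioned above, here is a minimal sketch in JavaScript rather than the PHP used in the demo. It is not the script from the demo: the function name dmsToDecimal and the accepted input format (degrees, optional minutes and seconds, then a hemisphere letter) are assumptions made for illustration, and real verbatim coordinates are likely to be messier.

```javascript
// Hypothetical helper: convert a verbatim coordinate such as 10°30'36"S
// to signed decimal degrees. The accepted format is an assumption.
function dmsToDecimal(verbatim) {
  const pattern = /(\d+(?:\.\d+)?)\s*°\s*(?:(\d+(?:\.\d+)?)\s*['′]\s*)?(?:(\d+(?:\.\d+)?)\s*(?:"|″|'')\s*)?([NSEW])/i;
  const match = verbatim.match(pattern);
  if (!match) {
    return null; // not a format this sketch understands
  }
  const degrees = parseFloat(match[1]);
  const minutes = match[2] ? parseFloat(match[2]) : 0;
  const seconds = match[3] ? parseFloat(match[3]) : 0;
  let decimal = degrees + minutes / 60 + seconds / 3600;
  // Southern and western hemispheres are negative in decimal notation.
  const hemisphere = match[4].toUpperCase();
  if (hemisphere === 'S' || hemisphere === 'W') {
    decimal = -decimal;
  }
  return decimal;
}

console.log(dmsToDecimal("10°30'S"));
console.log(dmsToDecimal("149°15'E"));
```

A latitude and longitude converted this way can be handed to a map as a pair of decimal numbers.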

Monday, December 12, 2011

Exporting data from Australian Faunal Directory on CouchDB

Quick note to self about exporting data from my Australian Faunal Directory on CouchDB project. To export data from a CouchDB view you can use a list function (see Formatting with Show and List). Following the example on the Kanapes IDE blog, I created the following list function:

{
    "_id": "_design/publication",
    "_rev": "14-467dee8248e97d874f1141411f536848",
    "language": "javascript",
    "lists": {
        "tsv": "function(head,req) {
            var row;
            start({
                'headers': {
                    'Content-Type': 'text/tsv'
                }
            });
            while(row = getRow()) {
                send(row.value + '\\t' + row.key + '\\n');
            }
        }"
    },
    "views": {
        .
        .
        .
    }
}


I can use this function with the view below, which lists Australian Faunal Directory publications by UUID ("value"), indexed by DOI ("key").

[Screenshot: the CouchDB view]

I can get the tab-delimited dump from http://localhost:5984/afd/_design/publication/_list/tsv/doi. Note that instead of, say, /afd/_design/publication/_view/doi to get the view, we use /afd/_design/publication/_list/tsv/doi to get the tab-delimited dump.
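Once the dump has been saved, reading it back is straightforward. The following is a rough sketch, assuming each line has the form UUID, tab, DOI (the order the list function sends them in); the function name parseDump is hypothetical and the identifiers in the example are placeholders, not real records.

```javascript
// Hypothetical helper: turn the tab-delimited dump into a DOI -> UUID lookup.
// Assumes one "UUID<TAB>DOI" pair per line, as emitted by the list function.
function parseDump(text) {
  const doiToUuid = new Map();
  for (const rawLine of text.split('\n')) {
    const line = rawLine.replace(/\r$/, ''); // tolerate Windows line endings
    const tab = line.indexOf('\t');
    if (tab === -1) {
      continue; // skip blank or malformed lines
    }
    const uuid = line.slice(0, tab);
    const doi = line.slice(tab + 1);
    doiToUuid.set(doi, uuid);
  }
  return doiToUuid;
}

// Placeholder data for illustration only.
const example = 'uuid-1\t10.1000/abc\nuuid-2\t10.1000/def\n';
console.log(parseDump(example).get('10.1000/abc'));
```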

I've created files listing DOIs and BioStor ids for publications in the Australian Faunal Directory. I'll play with lists a bit more, especially as I would like to extract the mapping from the Australian Faunal Directory on CouchDB project and add it to the iTaxon project.

Sunday, December 11, 2011

DNA Barcoding, the Darwin Core Triplet, and failing to learn from past mistakes

Given various discussions about identifiers, dark taxa, and DNA barcoding that have been swirling around the last few weeks, there's one notion that is starting to bug me more and more. It's the "Darwin Core triplet", which creates identifiers for voucher specimens in the form <institution-code>:<OPTIONAL collection-code>:<specimen-id>. For example,

MVZ:Herp:246033

is the identifier for specimen 246033 in the Herpetology collection of the Museum of Vertebrate Zoology (see http://arctos.database.museum/guid/MVZ:Herp:246033).
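To make the structure of the triplet concrete, here is a small sketch that splits one into its parts. The function names and the field names in the returned object are my own labels for illustration; the only URL pattern used is the Arctos one shown above.

```javascript
// Hypothetical helper: split a Darwin Core triplet into its components.
// The collection code is optional, so two-part identifiers are accepted too.
function parseTriplet(triplet) {
  const parts = triplet.split(':');
  if (parts.some((part) => part === '')) {
    return null; // empty component, not a usable identifier
  }
  if (parts.length === 3) {
    return { institutionCode: parts[0], collectionCode: parts[1], specimenId: parts[2] };
  }
  if (parts.length === 2) {
    return { institutionCode: parts[0], collectionCode: null, specimenId: parts[1] };
  }
  return null;
}

// Build the Arctos link for a triplet, following the URL pattern given above.
function arctosUrl(triplet) {
  return 'http://arctos.database.museum/guid/' + triplet;
}

console.log(parseTriplet('MVZ:Herp:246033'));
console.log(arctosUrl('MVZ:Herp:246033'));
```

Note that nothing in the string itself says where to resolve it; the link only works because we already know which database to ask.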

On the face of it this seems a perfectly reasonable idea, and goes some way towards addressing the problem of linking GenBank sequences to vouchers (see, for example, http://dx.doi.org/10.1016/j.ympev.2009.04.016, preprint at PubMed Central). But I'd argue that this is a hack, and one which potentially will create the same sort of mess that citation linking was in before the widespread use of DOIs. In other words, it's a fudge to postpone adopting what we really need, namely persistent resolvable identifiers for specimens.

In many ways the Darwin Core triplet is analogous to an article citation of the form <journal>, <volume>:<starting page>. In order to go from this "triplet" to the digital version of the article we've ended up with OpenURL resolvers, which are basically web services that take this triple and (hopefully) return a link. In practice building OpenURL resolvers gets tricky, not least because you have to deal with ambiguities in the <journal> field. Journal names are often abbreviated, and there are various ways those abbreviations can be constructed. This leads to lists of standard abbreviations of journals and/or tools to map these to standard identifiers for journals, such as ISSNs.
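As a rough illustration of what such a resolver is asked, the sketch below turns a journal, volume, and starting page into an OpenURL-style query string. The resolver address is a made-up placeholder, the function name is hypothetical, and the citation is invented for the example; the keys genre, title, volume, and spage follow the older key/value style of OpenURL.

```javascript
// Hypothetical helper: build an OpenURL-style query from a citation "triplet".
// resolverBase is a placeholder; a real resolver would have its own address.
function buildOpenUrl(resolverBase, citation) {
  const params = new URLSearchParams({
    genre: 'article',
    title: citation.journal,          // journal name, possibly abbreviated
    volume: String(citation.volume),
    spage: String(citation.startPage) // starting page
  });
  return resolverBase + '?' + params.toString();
}

// Invented citation for illustration only.
console.log(buildOpenUrl('http://resolver.example.org/openurl', {
  journal: 'J. Example Biol.',
  volume: 12,
  startPage: 345
}));
```

Everything hinges on the resolver recognising whatever string ends up in the title field.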

This should sound familiar to anybody dealing with specimens. Databases such as the Registry of Biological Repositories and the Biodiversity Collections Index have been created to provide standardised lists of collection abbreviations (such as MVZ = Museum of Vertebrate Zoology). Indeed, one could easily argue that what we need is an OpenURL for specimens (and I've done exactly that).

As much as there are advantages to OpenURL (nicely articulated in Eric Hellman's post When shall we link?), ultimately this will end in tears. Linking mechanisms that depend on metadata (such as museum acronyms and specimen codes, or journal names) are prone to break as the metadata changes. In the case of journals, publishers can rename entire back catalogues and change the corresponding metadata (see Orwellian metadata: making journals disappear), journals can be renamed, merged, or moved to new publishers. In the same way, museums can be rebranded, specimens moved to new institutions, etc. By using a metadata-based identifier we are storing up a world of hurt for someone in the future. Why don't we look at the publishing industry and learn from them? By having unique, resolvable, widely adopted identifiers (in this case DOIs) scientific publishers have created an infrastructure we now take for granted. I can read a paper online, and follow the citations by clicking on the DOIs. It's seamless and by and large it works.

One could argue that a big advantage of the Darwin Core triplet is that it can identify a specimen even if it doesn't have a web presence (which is another way of saying that maybe it doesn't have a web presence now, but it might in the future). But for me this is the crux of the matter. Why don't these specimens have a web presence? Why is it the case that biodiversity informatics has failed to tackle this? It seems crazy that in the context of digital data (DNA sequences) and digital databases (GenBank) we are constructing unresolvable text strings as identifiers.

But, of course, much of the specimen data we care about is online, in the form of aggregated records hosted by GBIF. It would be technically trivial for GBIF to assign a decent identifier to these (for example, a DOI) and we could complete the link between sequence and specimen. There are ways this could be done such that these identifiers could be passed on to the home institutions if and when they have the infrastructure to do it (see GBIF and Handles: admitting that "distributed" begets "centralized").

But for now, we seem determined to postpone having resolvable identifiers for specimens. The Darwin Core triplet may seem a pragmatic solution to the lack of specimen identifiers, but it seems to me it's simply postponing the day we actually get serious about this problem.





Tuesday, December 06, 2011

Google doesn't like BioStor anymore

According to Google Analytics BioStor has experienced a big drop in traffic since the start of October:

[Chart: Google Analytics visits to BioStor]

At one point I was getting something like 4500 visits a week; now it's just over a thousand a week. I'm guessing this is due to Google's 'Panda' update. I suspect part of the problem is that in terms of text content BioStor is actually pretty thin. For each article there is some metadata and a few links, so it probably looks a little like a link farm. The bulk of the content is in the page images, which, of course, Google can't read.

I'd be interested to know of any other sites in the field that have been affected in the same way (or, indeed, sites which have seen no change in their traffic since October).

Monday, December 05, 2011

These are my species - finding the taxonomic names I published using Mendeley

The latest addition to my mapping of taxonomic names to the literature (http://iphylo.org/~rpage/itaxon/) is the ability for authors with Mendeley accounts to find the names they've published. This is an extension of the "I wrote that" tool I developed earlier.

Let's say I want to show the names that a given author has published. I could search by that author's name, but that raises all sorts of issues (see my earlier posts ReaderMeter: what's in a name? and Equivalent author names), especially for this database where I have incomplete citations and in many cases lack author names beyond surname.

Another way to tackle the problem is to start with a list of publications for an author; then all I need to do is match that list to the publications in my taxonomic database. If both lists have identifiers for the publications, such as DOIs, then the task is trivial. But where do I get these lists?

An obvious source is Mendeley, where people are building lists of their own publications (as well as other publications that they are interested in). For example, my publications are listed at http://www.mendeley.com/profiles/roderic-page/.

But I don't want to have to get these lists myself. I'd much rather that a Mendeley user could go to my taxonomic database and say "I have this Mendeley account, show me the names I've published". One reason I'd like to do this is that if I want people to engage with this project it would be nice to be able to offer an immediate reward, in this case, a place where you can show your contribution to the task of cataloguing life on this planet.

Finding my taxonomic names

If you have a Mendeley account here's what you do:

Go to http://iphylo.org/~rpage/itaxon/. At the top right you will see a "Sign in using Mendeley" link.

Click this and you will be taken to Mendeley where you will be asked if you'd like to allow http://iphylo.org/~rpage/itaxon/ to connect to your account (if you're already logged in to Mendeley then you'll see an Accept button, otherwise Mendeley will ask you to log in).

If you click on Accept then you will be taken back to my site and you should now see your profile name and picture on the top right:

[Screenshot: profile name and picture at the top right of the page]

If you click on the Profile link then my site will talk to Mendeley and get a list of your papers and look for them in my database. If it finds a paper it outputs the taxonomic names published in that paper. For example, here is my profile:

[Screenshot: my profile page listing taxonomic names]

Listed are the species of bird lice in the genus Dennyus described in a paper on which I was a coauthor (http://dx.doi.org/10.1046/j.1365-3113.1996.d01-13.x).

This list is incomplete: earlier papers of mine on crab and isopod taxonomy aren't listed because they lack identifiers. This is something I need to work on, but for now this seems like a simple way to enable someone to go to the http://iphylo.org/~rpage/itaxon/ mapping between taxonomic names and literature and find the names they've authored.
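The matching step itself can be sketched in a few lines. This is not the code running on the site: the function names, the shape of the paper objects, and the idea of holding the database as a Map from DOI to names are all assumptions for illustration, and the names in the example are placeholders. It also shows why papers without a DOI drop out of the list.

```javascript
// DOIs are case-insensitive, so compare them in a normalised form.
function normalizeDoi(doi) {
  return doi.trim().toLowerCase().replace(/^(https?:\/\/(dx\.)?doi\.org\/|doi:)/, '');
}

// Hypothetical matcher: authorPapers is a list of { title, doi } objects
// (doi may be missing); namesByDoi maps a normalised DOI to the taxonomic
// names published in that paper.
function namesForAuthor(authorPapers, namesByDoi) {
  const found = [];
  for (const paper of authorPapers) {
    if (!paper.doi) {
      continue; // no identifier, so nothing to match on
    }
    const doi = normalizeDoi(paper.doi);
    const names = namesByDoi.get(doi);
    if (names) {
      found.push({ doi: doi, names: names });
    }
  }
  return found;
}

// Placeholder names for illustration only.
const database = new Map([
  ['10.1046/j.1365-3113.1996.d01-13.x', ['placeholder name 1', 'placeholder name 2']]
]);
const papers = [
  { title: 'A paper with a DOI', doi: 'http://dx.doi.org/10.1046/J.1365-3113.1996.D01-13.X' },
  { title: 'An older paper without a DOI' }
];
console.log(namesForAuthor(papers, database));
```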

If you have a Mendeley account, and your list of publications in Mendeley includes papers describing new animal species, go to http://iphylo.org/~rpage/itaxon/ and try it out.