Friday, July 12, 2013

Learning from eLife: GitHub as an article repository

While playing with my eLife Lens-inspired article viewer and some recent articles from ZooKeys, I regularly come across articles that are incorrectly marked up. As a quick reminder, my viewer takes the DOI for a ZooKeys article (just append it to http://bionames.org/labs/zookeys-viewer/?doi=, e.g. http://bionames.org/labs/zookeys-viewer/?doi=10.3897/zookeys.316.5132), fetches the corresponding XML, and displays the article.
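As a small illustration of the "just append the DOI" step, here is a minimal Python sketch. The base URL is the one given above; the function name `viewer_url` and the handling of an optional `doi:` prefix are my own assumptions, not something the viewer requires.

```python
VIEWER_BASE = "http://bionames.org/labs/zookeys-viewer/?doi="


def viewer_url(doi):
    """Build a viewer link by appending the DOI to the base URL."""
    doi = doi.strip()
    # Assumption: tolerate a leading "doi:" prefix, which people often paste in.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    return VIEWER_BASE + doi


print(viewer_url("10.3897/zookeys.316.5132"))
```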

Taking the article above as an example, I was browsing the list of literature cited and trying to find those articles in BioNames or BioStor. Sometimes an article that should have been found wasn't, and on closer investigation the problem was that the ZooKeys XML has mangled the citation. To illustrate, take the following XML:

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten. Stuttgarter Beiträge zur Naturkunde.</article-title> <source>Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

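Look at the contents of the article-title (title) and source (journal) tags. The journal name "Stuttgarter Beiträge zur Naturkunde." has ended up inside the title, leaving only "Serie A (Biologie)" as the journal. The actual title and journal should look like this: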

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten.</article-title> <source>Stuttgarter Beiträge zur Naturkunde. Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

Tools that rely on accurately parsed metadata to find articles, such as OpenURL resolvers, will fail in cases like this. Of course, we could use tools that don't have this requirement, but we could also fix the XML so that OpenURL resolvers will succeed.
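To make the failure concrete, here is a rough Python sketch of how a citation like the one above might be turned into an OpenURL-style query. It is an illustration under assumptions, not how my viewer or any particular resolver works: the function `ref_to_openurl_params` is hypothetical, the keys follow the common OpenURL 0.1 names (`atitle`, `title`, `volume`, `spage`, ...), and the XML is trimmed (no `xlink` attribute) so that it parses without a namespace declaration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARTICLE = ("Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: "
           "Bestimmungstabellen und Angaben zur Verbreitung und "
           "Ökologie der einzelnen Arten.")

# Trimmed version of the reference; only title and journal vary.
TEMPLATE = (
    '<ref id="B112"><mixed-citation><person-group>'
    '<name><surname>Tschorsnig</surname> <given-names>HP</given-names></name>'
    '<name><surname>Herting</surname> <given-names>B</given-names></name>'
    '</person-group> (<year>1994</year>) '
    '<article-title>{atitle}</article-title> <source>{source}</source> '
    '<volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.'
    '</mixed-citation></ref>'
)

mangled = TEMPLATE.format(
    atitle=ARTICLE + " Stuttgarter Beiträge zur Naturkunde.",
    source="Serie A (Biologie)")
fixed = TEMPLATE.format(
    atitle=ARTICLE,
    source="Stuttgarter Beiträge zur Naturkunde. Serie A (Biologie)")


def ref_to_openurl_params(ref_xml):
    """Map a <ref> element to OpenURL 0.1-style key/value pairs."""
    citation = ET.fromstring(ref_xml).find("mixed-citation")

    def text(path):
        node = citation.find(path)
        return "".join(node.itertext()).strip() if node is not None else ""

    return {
        "genre": "article",
        "aulast": text("person-group/name/surname"),  # first author only
        "atitle": text("article-title"),
        "title": text("source"),  # journal name
        "volume": text("volume"),
        "spage": text("fpage"),
        "epage": text("lpage"),
        "date": text("year"),
    }


for label, xml in (("mangled", mangled), ("fixed", fixed)):
    params = ref_to_openurl_params(xml)
    print(label, "journal:", params["title"])
    print(urlencode(params))
```

With the mangled markup the journal sent to the resolver is just "Serie A (Biologie)", which is unlikely to match anything; with the corrected markup the full journal name goes through.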

This is where the example of the journal eLife comes in. They deposit article XML in GitHub where anyone can grab it and mess with it. Let's imagine we did the same for ZooKeys, created a GitHub repository for the XML, and then edited it in cases where the article metadata is clearly broken. A viewer like mine could then fetch the XML, not from ZooKeys, but from GitHub, and thus take advantage of any corrections made.
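A viewer could implement this as "try the corrected copy first, fall back to the publisher". The Python sketch below shows one way that might look. Everything specific in it is hypothetical: no such GitHub repository exists yet, so `CORRECTED_URL`, the file naming scheme, and the function names are placeholders I have made up for illustration.

```python
from urllib.request import urlopen

# Hypothetical location of community-corrected XML; not a real repository.
CORRECTED_URL = (
    "https://raw.githubusercontent.com/example/zookeys-xml/master/{name}.xml"
)


def http_get(url):
    """Fetch a URL and return the response body as bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_article_xml(doi, publisher_url, fetch=http_get):
    """Return (url, xml_bytes), preferring the corrected copy on GitHub."""
    # Assumption: files in the repository are named after the DOI,
    # with "/" replaced by "_".
    name = doi.replace("/", "_")
    for url in (CORRECTED_URL.format(name=name), publisher_url):
        try:
            return url, fetch(url)
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses,
            # so a missing file or network failure moves on to the next source.
            continue
    raise LookupError("no XML found for " + doi)
```

Passing the `fetch` function in as a parameter keeps the fallback logic separate from the network code, so it can be exercised without touching either server.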

We could imagine this as part of a broader workflow that would also incorporate other sources of articles, such as BHL. We could envisage workflows that take BHL scans, convert them to editable XML, then repurpose that content (see BHL to PDF workflow for a sketch). I like the idea that there is considerable overlap between the most recent publishing ventures (such as eLife and ZooKeys) and the goal of bringing biodiversity legacy literature to life.