Saturday, August 23, 2008

Reasons text mining will fail. I. UTM Grid References and GenBank accession numbers

OMG. Playing with extracting identifiers from text, I have a regular expression for GenBank accession numbers that looks something like this:
((A[A-Z])[0-9]{6}|U[0-9]{5}|(D[A-Z])[0-9]{6}|(E[A-Z])[0-9]{6}|(NC_)[0-9]{6})
OK, it won't get everything, but what is more worrying are the things it will pick up that aren't GenBank accession numbers.
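To make the pattern concrete, here is a minimal Python sketch of that kind of extraction. The function name `find_accession_candidates`, the word-boundary anchors (`\b`) and the sample sentence are my own assumptions for illustration, and the accession-like strings in the sample are made up rather than taken from any paper.

```python
import re

# Roughly the pattern described above: two-letter prefixes starting with
# A, D or E followed by six digits, U followed by five digits, and NC_
# followed by six digits. The \b anchors are an added assumption so that
# we don't match inside longer tokens.
ACCESSION_RE = re.compile(
    r"\b(?:A[A-Z][0-9]{6}|U[0-9]{5}|D[A-Z][0-9]{6}|E[A-Z][0-9]{6}|NC_[0-9]{6})\b"
)


def find_accession_candidates(text):
    """Return every substring that merely *looks like* an accession number."""
    return ACCESSION_RE.findall(text)


# Made-up sample text, for illustration only.
sample = "Sequences AB123456 and U12345 were compared with NC_001234."
print(find_accession_candidates(sample))
```

Note that the function can only report candidates: nothing in the pattern knows whether a matching string really is a sequence identifier.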

For example, I ran Robert Mesibov's 2005 paper "The millipede genus Lissodesmus Chamberlin, 1920 (Diplopoda: Polydesmida: Dalodesmidae) from Tasmania and Victoria, with descriptions of a new genus and 24 new species" [PDF here] through a script, and out came loads of GenBank accession numbers ... which is a worry as there aren't any sequences in this paper.

Turns out, Mesibov uses UTM grid references to describe localities, and these look just like GenBank accessions. There is a nice web site here which describes how UTM grid references are determined in Tasmania (from which the image below is taken).

Not all the "accession numbers" in Mesibov (2005) exist in GenBank, but some do. For example, grid reference DQ402119 (41°26'31''S 146°17'02''E) is also a sequence, DQ402119, and, you guessed it, it's not from a millipede. So, I need to be a little bit careful in extracting identifiers from text.
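One way of being "a little bit careful" might be to look at what follows each candidate. The sketch below is only a hypothetical heuristic, not something I've tested on real papers: it assumes grid references are immediately followed by a latitude/longitude in degrees and minutes, as in the DQ402119 example, and the names `CANDIDATE_RE`, `COORD_RE` and `classify_candidates` are made up for illustration.

```python
import re

# Anything shaped like two capital letters plus six digits. This matches
# both GenBank-style accessions and UTM-style grid references.
CANDIDATE_RE = re.compile(r"\b[A-Z]{2}[0-9]{6}\b")

# Hypothetical context check: optional whitespace and bracket, then
# something like 41°26' (degrees and minutes).
COORD_RE = re.compile(r"\s*\(?\s*[0-9]{1,3}°[0-9]{1,2}'")


def classify_candidates(text):
    """Pair each candidate with True if coordinates follow it directly."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        # Only inspect a short window of text after the candidate.
        tail = text[match.end():match.end() + 20]
        looks_like_grid_ref = COORD_RE.match(tail) is not None
        results.append((match.group(), looks_like_grid_ref))
    return results


text = ("collected at DQ402119 (41°26'31''S 146°17'02''E); "
        "compare sequence DQ402119 in GenBank")
print(classify_candidates(text))
```

A check like this would only catch grid references that happen to be written with coordinates alongside them, so it narrows the problem rather than solving it.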
