Wednesday, August 10, 2016

On asking for access to data

In between complaining about the lack of open data in biodiversity (especially taxonomy), and scraping data from various web sites to build stuff I'm interested in, I occasionally end up having interesting conversations with the people whose data I've been scraping, cleaning, cross-linking, and otherwise messing with.

Yesterday I had one of those conversations at Kew Gardens. Kew is a large institution that is adjusting to a reduced budget, a changing ditigal landscape, and a rethinking of it's science priorities. Much of Kew's data has not been easily accessible to the outside world, but this is changing. Part of the reason for this is that Defra, which part-funds Kew, is itself opening up (see Ellen Broad's fascinating post Lasers, hedgehogs and the rise of the Age of Yoghurt: reflections on #OpenDefra).

During this conversation I was asked "Why didn't you just ask for the data instead of scraping it? We would most likely have given it to you." My response to this was "well, you might have said no". In my experience saying "no" is easy because it is almost always the less risky approach. And I want a world where we don't have to ask for data, in the same way that we don't ask to get source code for open source software, and we don't ask to download genomic data from GenBank. We just do it and, hopefully, do cool things with it. Just as importantly, if things don't work out and we fail to make cool things, we haven't wasted time negotiating access for something that ultimately didn't work out. The time I lose is simply the time I've spent playing with the data, not any time negotiating access. The more obstacles you put in front of people playing with your data, the fewer innovative uses of that data you're likely to get.

But it was pointed out to me that a consequence of just going ahead and getting the data anyway is that it doesn't necessarily help people within an organisation make the case for being open. The more requests for access to data that are made, the easier it might be to say "people want this data, lets work to make it open". Put another way, by getting the data I want regardless, I sidestep the challenge of convincing people to open up their data. It solves my problem (I want the data now) but doesn't solve it for the wider community (enabling everyone to have access).

I think this is a fair point, but I'm going to try and wiggle away from it. From a purely selfish perspective, my time is limited, there are only so many things I can do, and making the political case for opening up specific data sets is not something I really want to be doing. In a sense, I'm more interested in what happens when the data is open. In other words, let's assume the battle for open has been won, what do we then? So, I'm essentially operating as if the data is already open because I'm betting that it will be at some point in time.

Without wishing to be too self-serving, I think there are ways that treating closed data as effectively open can help make the case that the data should (genuinely) open. For example, one argument for being open is that people will come along and do cool things with the data. In my case, "cool" means cross linking taxonomic names with the primary literature, eventually to original decsriptions and fundamental data about the organisms tagged with the taxonomic names (you may feel that this stretches the definitoon of "cool" somewhat). But adding value to data is hard work, and takes time (in some cases I've invested years in cleaning and linking the data). The benefits from being open may take time, especially if the data is messy, or relatively niche so that few people are prepared to invest the time necessary to do the work.

Some data, such as the examples given in Lasers, hedgehogs and the rise of the Age of Yoghurt: reflections on #OpenDefra will likely be snapped up and give rise to nice visualisations, but a lot of data won't. So, imagine that you're making the case for data to be open, and one of your arguments is "people will do cool things with it", eventually you win that argument, the data is opened up... and nothing happens. Wouldn't it be better if once the data is open, those of us who have been beavering away with "illicit" copies of the data can come out of the woodwork and say "by the way, here are some cool things we've been doing with that data"? OK, this is a fairly self-serving argument, but my point is that while internal arguments about being open are going on I have three choices:

  1. Wait until you open the data (which stops me doing the work I want to do)
  2. Help make the case for being open (which means I engage in politics, an area in which I have zero aptitude)
  3. Assume you will be open eventually, and do the work I want to do so that when you're open I can share that work with you, and everyone else
Call me selfish, but I choose option 3.