Open data sets in science

I have a question to challenge all my colleagues working with research data in Computer Science: when was the last time you could replicate a previous study by other authors?

For different reasons, over the past few months I have found myself diving into the rich collection of previous research work in several areas: Wikipedia studies, libre software engineering, social media and social network analysis, to name a few. Many of you probably already know my inborn bias towards quantitative research (and towards multidisciplinary research methods). So it may sound totally unsurprising that most of the publications I was reviewing included empirical experiments on different datasets gathered from a wide variety of sources, target systems and virtual communities. As I scrolled through the pages, I realized, once again, what a huge proportion of research work cannot be replicated in an easy way. This is still a sad lesson to be learned, considering that, today, most of us researchers work with digital data, and bits can be duplicated or sent to the other side of the world at negligible cost.

I already commented in my first post on the curious study conducted by my colleague Gregorio Robles about the replicability of research works published at MSR. For those of you unfamiliar with the MSR series, it is a working conference (formerly a workshop) devoted to the art of “Mining Software Repositories”. It is also co-located with ICSE, a preeminent conference on software engineering, so it attracts the top-notch specialists in this area. One would expect that a scientific conference focused on such an empirical, hands-on activity would encourage (and even demand) access to all the datasets and tools used in previous experiments, in order to i) better understand the insights of different methods and practical solutions to problems in this area, and ii) make life easier for other researchers willing to build on top of existing methods, tools and results.

Far from this, the conclusions of the replicability study were quite disappointing. Of the 171 papers published in the 6 previous editions of MSR, the most frequent case (64 papers) was a study that uses publicly available data sources but offers access neither to the processed dataset (the results) nor to the tools/scripts needed to repeat the study. Even more worrisome is a trend discovered in these publications: as time goes by, the number of papers with publicly available processed datasets has been decreasing. In other words, the situation is getting worse.

But even the original sources, the publicly available digital datasets that nurture empirical research in Computer Science (and many other areas), may become scarce. Sometimes, privacy concerns are put on the table to support a data-locking approach, overlooking methods to anonymize information (useful at least for certain studies). Other times, we witness how social media platforms like Twitter unilaterally change their Terms of Service, making it more difficult to conduct research that uses their (public?) data stream, and forcing very useful third-party services to stop serving the research community. Suddenly, the company wants to monetize the analysis of this rich data stream through well-known resellers.

In the Wikipedia article about Science you can find a great quotation attributed to Richard Feynman:

…there is an expanding frontier of ignorance… things must be learned only to be unlearned again or, more likely, to be corrected.


Photo by Jerry Daykin CC-BY 2.0

I believe that classic researchers, living in a world without digital artifacts and cutting-edge communication networks, understood this much better than we do. In Physics, Chemistry or the Natural Sciences, most of the time theories had to be tested against experiments, and results had to be validated by other independent scholars before becoming widely accepted. And not because of a naive suspicion that colleagues had committed errors or imprecisions: it is a matter of learning from previous work, and augmenting it in a more efficient way. I'm usually amused by the stories of colleagues or students who found themselves struggling with empirical studies, sometimes involving huge datasets. “You know, I spent much more time retrieving data, organizing it, dealing with incorrect or nonsense values, and preparing it for analysis than actually performing the analysis and interpreting the results!” I smile, then nod. “But research papers usually tell nothing about that!” Many people will say that's because of “space limits” in your report. Uh!

For example, the Wikimedia Foundation is now undertaking an ambitious plan to revamp the Meta:Research wiki, which currently conveys a lot of information about past, present and future studies and research initiatives on Wikimedia projects and datasets. One of the goals is to organize and facilitate open access to the original and processed datasets produced by researchers in this field, so that the next time you want to replicate or build on top of previous research results you don't hit a brick wall.

Next time you come back from CHI, CSCW, ICWSM, ICSE, MSR, OSS, WikiSym, or any other relevant conference, ask yourself these questions about the best research works you found: Could I replicate this work? Could I extend it in an easy way, without starting the whole process from scratch? Where can I find the data sources, the datasets and the tools presented in these papers? Just as we find it natural to archive research publications, why don't we find it just as natural to archive and give free access to research data and artifacts (processed datasets and code)? And why can't we also measure research impact by the number of times other researchers reuse data and artifacts that are already available?

I'm sorry, but I think Computer Science scholars cannot simply dismiss this big issue. We should remember a good old saying from school: the best way to learn something is by doing it. Let's make it easier for others to learn from our work, and to improve it.