Reproducible research

As a researcher focused on quantitative analyses of on-line communities, I need to keep up to date with the field. I have to read papers and articles written by colleagues and other scholars on related topics. I have to search for new methods and algorithms to cut execution times and finish before the next deadline. I have to evaluate new tools that let me create new graphs or compute new analyses. And I have to review many papers for different conferences presenting results in this area. In this context, I’m still surprised to find the same problem, over and over again.

When I started to study Wikipedia, four years ago, I was puzzled by the lack of reproducibility in most (but not all) of the papers and analyses I could find at that time. No source code available. Few implementation details. Little discussion of how to set up a similar environment and replicate the analysis. If you were lucky, you could access some evaluation version of a cool new tool, only to discover that it was severely limited. Forget about the code. Try to do it yourself, if you can. That’s why, from the very beginning, one of the main goals of my PhD was to publish an alternative, open source software tool to analyze any language version of Wikipedia. At least I succeeded in publishing all the scripts related to my dissertation. I called the tool WikiXRay (yes, I know I’m not an expert at choosing names…). Unfortunately, as I got busier, I had less time to update it. I admit that I haven’t yet had time to upload some of the very latest scripts. So I eventually became another piece in the big machine.

That’s why I was astonished when, some weeks ago, my workmate and friend Gregorio Robles told us that he had a paper accepted at MSR 2010. The topic was very familiar to me, but somewhat novel in these venues: research reproducibility. In short, Gregorio checked the full list of papers previously published at MSR, looking for reproducibility conditions such as:

  • Analyses based on public data repositories (and still accessible today).
  • Methodology clearly explained.
  • Implemented using open source software tools, available for downloading.
  • Enough documentation and implementation details to obtain the same (or updated) results and graphs as those displayed in the paper.

Unsurprisingly, very few papers matched these conditions. I was very glad to hear that the paper was accepted, since we all need this kind of healthy criticism. I look forward to learning about the comments gathered at the presentation.

Later that night, I was reading the preface of the second edition of Handbook of Statistical Analyses Using R, one of the top reference manuals for statistical computing with GNU-R. There I found a nice summary of this topic, including some references ([Leisch, 2002a], [Leisch, 2002b], [Leisch, 2003], [Leisch and Rossini, 2003]). This supports the idea that GNU-R is becoming a reference platform for reproducible statistical research. In fact, the Sweave package is a landmark contribution in this direction.
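
To make this concrete, here is a minimal sketch of what a Sweave source file (conventionally given an .Rnw extension) might look like. The file is ordinary LaTeX with embedded R code chunks; the chunk names and the data are hypothetical, just to illustrate the mechanism:

    \documentclass{article}
    \begin{document}

    \section*{A small reproducible report}

    % R code chunks are delimited by <<...>>= and @. When the file is
    % processed with Sweave(), the code is executed and both the code
    % and its output are woven into the resulting LaTeX document.
    <<summary, echo=TRUE>>=
    x <- rnorm(100)  # hypothetical data; a real analysis would load a dataset
    mean(x)
    sd(x)
    @

    % \Sexpr{} embeds a computed value inline, so the prose can never
    % drift out of sync with the analysis that produced it.
    The sample mean is \Sexpr{round(mean(x), 2)}.

    % fig=TRUE makes Sweave capture the plot and include it as a figure.
    <<histogram, fig=TRUE, echo=FALSE>>=
    hist(x)
    @

    \end{document}

Running Sweave("report.Rnw") in R produces report.tex, ready to be compiled with LaTeX as usual. Anyone with the source file can regenerate the whole report, numbers and graphs included, from scratch.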

Conclusion: we should reflect on this emerging call for reproducible research. Especially in quantitative research, analyses are of very little help if other people cannot reproduce them, learn how to implement them, and contribute back to improve them. It’s not a matter of trusting the authors of the study. It’s about learning, explaining discoveries to other people, and letting them share the same approaches. Think about the many paramount discoveries in Physics or Chemistry over the past centuries. Do you think anybody would have taken them seriously before others could reproduce the experiments and verify their claims with facts and empirical results?

It would be really absurd if, living in the digital era of endless data sources, FLOSS, on-line collaborative communities with millions of participants, and huge information repositories published by some governments, we were limited to watching a long parade of fancy results and graphs pass by the shop window, unable to test them, touch them, and learn how they work and how to improve them.

[Leisch, 2002a] Leisch, F. “Sweave: Dynamic generation of statistical reports using literate data analysis”, in Compstat 2002 — Proceedings in Computational Statistics, eds. W. Härdle and B. Rönz, Physica Verlag, Heidelberg, pp. 575-580, 2002. ISBN 3-7908-1517-9.

[Leisch, 2002b] Leisch, F. “Sweave, Part I: Mixing R and LaTeX”, R News, 2, 28-31, 2002. URL http://CRAN.R-project.org/doc/Rnews/.

[Leisch, 2003] Leisch, F. “Sweave, Part II: Package vignettes”, R News, 3, 21-24, 2003. URL http://CRAN.R-project.org/doc/Rnews/.

[Leisch and Rossini, 2003] Leisch, F. and Rossini, A. J. “Reproducible statistical research”, Chance, 16, 46-50, 2003.