The private side of public data

No, this is not just a play on words. So far, I’ve collected some background examples to illustrate this apparent contradiction: public data on the Internet may not be so public after all. To be more precise, data privacy rights and the interests of some companies can play a major role in this problem.

As a researcher on open on-line communities (where open means publicly accessible virtual groups), this is a key concern for me. Likewise, it should be important for many other colleagues in this field, and for the myriad companies collecting and mining huge data sets on a daily basis.

Let me show you some examples. On March 26-27, I attended the CPOV conference in Amsterdam, organized by the Institute of Network Cultures. In the same session, Stuart Geiger presented an overview of the influence of Wikipedia bots on the editorial work of the community. One example quickly got my attention: HagermanBot. This bot was created to look for unsigned comments on Wikipedia talk pages, then automatically insert the signature of the corresponding author next to them. The bot raised a strong controversy among users, since many of them thought that the guideline of signing comments on talk pages was just that, a recommendation, not a rule to be enforced. Some could be embarrassed to see their login names displayed next to their comments. Wait a minute: isn’t that information public? Yes, of course it is. Just click on the history tab and you get the revision history page, tracking all comments and their corresponding authors.

In most wikis (including MediaWiki) the goal is accessing and editing content. The identity of authors is secondary. However, sometimes the identity of authors does matter. Not for the readers, but for the meritocratic system of the community. You need to follow the contributions of a certain editor to decide whether she is trustworthy or not. And you may be interested in following her comments on talk pages. In turn, some authors may want to link some comments with their profile, while leaving others unsigned. The problem is that, in a public system with logs, you can almost never achieve that. The difference, though, is how much work you need to retrieve the info. Perhaps forcing users to go through the long trail of history changes is all you need to discourage them. But maybe not.

Another example is the study of FLOSS. Researchers preparing activity reports now and then get some curious requests from very active users. In short, those users are worried about being highlighted as very active contributors. In particular, they are afraid of their hourly efforts being published. They work for private companies and… they’re not supposed to be volunteering on these projects during working hours. Again, the same problem. That information is public. And anyone is able to retrieve it, including the companies for which these developers work. What’s the trick, then? The cost of accessing the info. Wandering through a database dump is very different from reading nicely formatted results in a report.

One of the best-known examples on this matter in recent weeks has been the case of Peter Warden and his Facebook data collection. You can read the full story on this post on Pete’s blog. To put it simply: Pete collected a bunch of data from public Facebook accounts using a crawler. Of course, before that he checked that he was allowed to do so. He planned to use that information to provide innovative assessment services on personal connections, in his new startup Mailana. Suddenly, he realized that the information could also be useful for analysts studying social networks. Thus, after publishing some nice graphs, he announced the release of this archive, containing public info from thousands of Facebook accounts, for research purposes. Then, a Facebook attorney jumped in and forced him to: a) Stop any attempt to publish the dump; b) Destroy all copies (including those already in the hands of some third-party companies).

As Pete explains on his own blog, “Checking Facebook’s robot.txt, they welcome the web crawlers that search engines use to gather their data”. So, he wrote his own crawler. And he was threatened with a lawsuit by Facebook, just for following the common standard regulating “the way the web has worked for the last 16 years since robots.txt was introduced”. Well, my bet is that Facebook discovered that Pete wasn’t doing anything they didn’t permit, but at the same time, they were scared about the possible consequences of a mass of annoyed users learning that their (public) data can be crawled. I wonder how many people have done the very same thing before without telling anybody, and maybe without such honest intentions as Pete’s. Stop for a moment. Think about it. If you upload your data to Facebook (in fact, to any public social network on the Internet) anybody else is able to access it, ultimately. That’s why Facebook has enabled filters. They should have done so long before these privacy concerns were raised. Nevertheless, some info can still be publicly accessed. We should keep that in mind.
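To make the robots.txt check concrete, here is a minimal sketch of how a crawler can test whether a URL is allowed before fetching it, using Python’s standard urllib.robotparser module. The robots.txt content, the crawler name and the URLs below are made up for illustration; they are not Facebook’s actual rules, and this is not Pete’s code. Keep in mind that passing this check only says what the site’s robots.txt requests of crawlers. As the story above shows, it says nothing about what the site’s lawyers will accept.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "research-crawler" and example.org are placeholder names.
    for url in ("https://example.org/public/profile",
                "https://example.org/private/profile"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "research-crawler", url))
```

Against a live site you would normally call set_url() and read() on the parser instead of feeding it a string, but the decision logic is the same.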

Most probably, Facebook’s decision was influenced by the recent controversy over the Buzz service launched by Google on February 9, 2010. The service was automatically activated on all Gmail accounts, apparently skipping the usual beta-testing period. The new service drew a lot of criticism for invading the privacy of users, since they couldn’t control who had access to their most frequently used email contacts. Criticism also reached Facebook, from comments by researcher Danah Boyd, who spoke on this at SXSW. Facebook was also mentioned due to their recent changes to privacy policies. So, the last thing Facebook wanted was more bad press.

But the height of nonsense is found in some corners of the EU legislation on data protection. There, personal data is defined in a complex way, but we can safely assume that any kind of information about a living, identifiable individual is personal data. So vague. Well, for me it’s clear that your full name, national ID or passport number, your postal address, etc. can all be considered personal data. Even cell phone numbers could be considered as such. So far, so good. It may force you to declare the address book in your own cell phone as a personal data archive, but it’s OK. Now we turn to more complicated examples. What about email addresses? Login names? All this info can be found publicly, scattered around mailing lists, forums, wikis, etc. When you compile that info, there are some basic rules you should follow (not legal, but common sense rules of thumb). Example: you shouldn’t publish an endless list of emails, all well-packed and nicely formatted, since you’d be doing the hard work for spammers.

So if you are running a project in the EU collecting information about on-line communities (there are not many of that kind, unfortunately) you’re almost surely forced to declare the existence of the archive, who’s in charge of retrieving the info and who’s responsible for maintaining it. Wikipedia is again an interesting case study, since my colleague emijrp, from the University of Cadiz, pointed me to this interesting paragraph about the data privacy policy in Wikipedia:

User contribution

User contributions are also aggregated and publicly available. User contributions are aggregated according to their registration and login status. Data on user contributions, such as the times at which users edited and the number of edits they have made, are publicly available via user contributions lists, and in aggregated forms published by other users.

Therefore, Wikipedia acknowledges the right of researchers to go through this public information. However, you should also consider the possible side effects of your results. For example, to be polite you should avoid as much as possible highlighting individual users’ patterns linked with their real login names, unless you have checked with them that they don’t have any problem with that. And of course, you should adhere to the European legislation (as much as you can) if you work with personal data from European citizens.


In summary: dealing with public data can be more complex than you might think at first sight. Don’t underestimate the dark side of public data. On the Internet we find a lot of public information. But compiling it and processing it (especially on a massive scale) can be tricky. Not technically, but legally. The debate is reaching a point at which the whole research area on on-line communities could be affected by the next decisions made by the big shots in this business. The question is not what we can do (technically) with these data. Now, the question turns to: what are we allowed to do?

Personally, I would be happy to follow a common set of guidelines to anonymize massive information sets retrieved from social networks, since we’re primarily interested in understanding the activity patterns followed by millions of users collaborating together. We don’t want to uncover who’s doing what. We are concerned about respecting the legitimate privacy rights of participants in these networks. It would be a pity if, in the new era of the Information Society, we could not learn more about the new projects unleashing the power of virtual collaboration worldwide because we failed to meet the expectations on how to manage the private side of public data.
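As an illustration of what such guidelines might include, here is a minimal sketch of one common technique: replacing login names with keyed hashes before computing activity statistics. It is not an agreed standard, and the function names, the sample log and the key are all hypothetical. Note also that this is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping, and a distinctive activity pattern may still give a user away.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(login: str, secret_key: bytes) -> str:
    """Map a login name to a stable pseudonym using HMAC-SHA256.

    The same login and key always give the same pseudonym, so activity
    can still be grouped per user without exposing the login itself.
    """
    digest = hmac.new(secret_key, login.encode("utf-8"), hashlib.sha256)
    # Truncated to 16 hex characters for readability (an arbitrary choice).
    return digest.hexdigest()[:16]


def edits_per_pseudonym(edit_log, secret_key: bytes) -> Counter:
    """Count edits per pseudonymized user from (login, timestamp) pairs."""
    counts = Counter()
    for login, _timestamp in edit_log:
        counts[pseudonymize(login, secret_key)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample data; a real key should be random and kept private.
    sample_log = [
        ("alice", "2010-03-26T10:00"),
        ("bob", "2010-03-26T11:00"),
        ("alice", "2010-03-27T09:30"),
    ]
    for pseudonym, n_edits in edits_per_pseudonym(sample_log, b"demo-key").items():
        print(pseudonym, n_edits)
```

A keyed hash is used instead of a plain one because login names are easy to guess: with an unkeyed hash, anyone could hash a list of known logins and match them against the published pseudonyms.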