Sunday, July 6, 2014

A new Style of News Reporting: Wikileaks and Data-driven Journalism

Update: I uploaded a PDF version to the Social Science Open Access Repository, available under

This article was originally written and published in 2011 in the Open Access journal Cyborg Subjects. While it was included in a book release (Amazon-Link), it is no longer available online on the journal's homepage. I therefore decided to republish it here.

The coverage of Wikileaks’ huge amounts of leaked data was a challenge for newspapers – they had to figure out how to get stories out of extensive and complex data sets and how to present their findings to readers. The result differs significantly from traditional news reporting: it includes illustrations, interactive web applications and reading instructions to make the material accessible. This style of news reporting is called data-driven journalism. The international interest in the leaks combined with collaborative work between newspapers from different countries made it a new trend in current journalism. A key lesson from working with this kind of material is that the way data is collected is essential for the effectiveness of the techniques used. If journalists applied this insight to their own internal data collection process, this form of news reporting could be used on a large scale and become much more common. The coverage of Wikileaks’ material might give a glimpse of what journalism will look like in the future.

A new Style of News Reporting. Wikileaks and Data-driven Journalism
Newspapers are still struggling with the changing media environment that is undermining their traditional business model and are unsure how to make profits online (Freedman 2010). With growing commercialization, journalists tend to use new technology foremost to speed up the news production process rather than to experiment with the new possibilities or enhance quality (Phillips 2010). However, the collaboration with Wikileaks challenged traditional newspapers and forced them to think about new ways of finding and telling stories. They had to work with large and extensive data sets. To take an example, the Afghanistan War Logs consisted of about 92,000 documents written in military jargon (Rogers 2011). The obvious problem is accessibility – both for journalists who want to get a story out of the material and for readers who want to take a closer look at it. Letting journalists go through everything individually would be too time-consuming, and writing about the findings in a traditional manner seemed insufficient for the coverage. The Guardian and The New York Times in particular realized this early on. Tools were used to go through the data and to create visualizations and interactive web applications that made the material accessible for readers. This form of news reporting is called data-driven journalism – and Wikileaks contributed to its development as a trend.

Data-driven Journalism
Scholars and professionals started to discuss data-driven journalism very recently. In April 2010, the European Journalism Center and the University of Amsterdam initiated the one-day event Data-driven journalism: What is there to learn? to define it and discuss possible implications. At this event, Lorenz defined data-driven journalism as “a workflow, where data is the basis for analysis, visualization and – most important – storytelling” (2010: 10). Due to the storytelling aspect, the end product is more than just a visualization of data: it also contextualizes the data and highlights important aspects. Bradshaw (2010) explains this data-driven workflow in more detail and distinguishes four steps: finding the data (1), interrogating data (2), visualizing data (3) and mashing data (4). Finding can involve having expert knowledge, good contacts or technical skills to gather data. The interrogation requires a good understanding of the jargon used and the wider context of the data. Visualization and mashing can involve the work of designers and/or free tools. An example is IBM’s ManyEyes, where users can easily upload and visualize data for free. As Bradshaw points out, these four steps require teamwork: “The reality is that almost no one is doing all of that” (2010). At the end of this workflow, raw data should be accessible for readers. Lorenz describes it as a process of refinement in which raw data is transformed into something meaningful: “As a result the value to the public grows, especially when complex facts are boiled down into a clear story that people can easily understand and remember” (Lorenz 2010: 12).

Data-driven journalism is not something completely new. As Rogers (2010a) shows, it can instead be considered quite old. He describes Florence Nightingale as one of the first data journalists: in the 19th century she already worked with visual presentations of information to tell stories. What really is new, however, is the media environment journalists are working in. Four aspects in particular indicate the growing importance of data-driven journalism:
  • The sheer amount of publicly relevant data available online. Especially in the United States and Britain, huge data sets are available in connection with the open government initiative. The problem here is the same as described above: Having access is not enough without accessibility. In Britain, for example, most government data is released as simple, static PDF files (Stay 2010). Journalists from The Guardian and The New York Times saw the potential and started to fill this gap by offering interactive tools and illustrations to add public value to the data.
  • The existence of free tools to handle this data, like the already mentioned ManyEyes.
  • The possibility to make the data accessible in an interactive way with web applications.
  • The time pressure on journalists, who are always expected to get the story out fast (see Phillips 2010). By giving access to the raw data, it is possible to involve people outside the newsroom in the process of news production with crowdsourcing – the collaborative analysis by volunteers. This can save time and resources for researching.
Obviously, data-driven journalism greatly benefits from the possibilities of new media. Its perception as a trend is therefore not surprising.

The role of Wikileaks for Data-driven Journalism
Is Wikileaks data-driven journalism in itself? Two counter-arguments are that it does not provide visualizations and does not attempt to generate stories out of its materials (only a brief contextualization is given) – both are largely left to established news media or are considered to be done by ‘users’ (see Lovink et al. 2010). With regard to the workflow of data-driven journalism, Wikileaks is doing the first and second step of collecting and interrogating data without going further. A key aspect, the transformation of raw data into something meaningful to add public value, is missing. To what extent Wikileaks can be considered journalistic more generally remains open to debate. It is not a form of data-driven journalism on its own, but it is surely an important actor in the data-driven workflow nonetheless. From this perspective, Wikileaks is a source for data that needs to be ‘refined’ to add public value.

Wikileaks as a data source can be called a driving force of data-driven journalism and has contributed to its development as a trend for three main reasons. First and obviously, to analyze and cover its huge amounts of leaked (raw) data, data-driven journalism techniques are essential both for journalists who want to get a story out and present it to their readership and for readers who can access the material through visualizations and reading instructions. The second reason is that the leaks were interesting for an international audience. The data released by the open government initiatives in the United States and Britain was only interesting for national audiences, and there was no need for foreign newspapers to work with it. Connected to this, the third reason is the collaborative work between newspapers from different countries combined with the simultaneous release date of their coverage. The coverage of the Afghanistan War Logs therefore internationally demonstrated the advantages data-driven journalism can have. However, not all of Wikileaks’ media partners were able to keep up with The Guardian and The New York Times. In Germany, where the open government movement was (and still is) much weaker, Der Spiegel covered the Afghanistan War Logs in a much more ‘traditional’ way, using no interactive illustrations at all and focusing on the print version (Krebs 2010). The experience of working with huge amounts of data in Britain and the United States was clearly an advantage for the coverage and made newspapers from other countries aware of the potential. As a result, almost every media partner followed their example and offered visualizations for the second major leak, the Iraq War Logs. As Simon Rogers from The Guardian states: “Wikileaks didn’t invent data journalism. But it did give newsrooms a reason to adopt it” (Rogers 2011).

Using data-driven journalism on Wikileaks’ materials: What was there to learn?
To be more concrete about how data-driven journalism was used in connection with Wikileaks, let’s take a closer look at the Iraq War Logs and ‘Cablegate’ (focusing on The Guardian as an example).

The War Logs contained 391,832 field reports from soldiers. Since each report describes only a single incident, visualizations are extremely helpful to see patterns and get a bigger picture. Two important characteristics made it relatively easy to automatically separate those logs into categories: the standardized format and the use of a dense military jargon, giving meta-data about date, location, type of incident etc. (Matzat 2010). In other words: The data set was largely machine-readable. The Guardian concentrated on incidents where someone had died and separated them by cause of death, who was killed (for example civilians or hostile forces), time, location etc. (Rogers 2011). Then they used Google Fusion Tables and marked every single death in Google Maps. The map was released alongside key findings from their statistical analysis (Rogers 2010b). This gave an overview of the number of people killed and further information to contextualize it (for example, most of these people were civilians). In addition, The Guardian took all incidents from a single day to create an interactive graphic (Dant et al. 2010). While a timer is running from the first to the last minute of this day, a map shows the location of each incident, gives a description of what happened and counts the total number of deaths. It also offers a link to the original report of each incident. As Lorenz described, abstract numbers were broken down into something meaningful. By visualizing a single day, you can get a better picture of the atmosphere and violence that shines through the logs. Apart from that, the fact that the material was machine-readable not only helped to create visualizations to present the news and make the material accessible for readers. The automatic separation into categories was also used to guide the selection of documents worth reading for the coverage – which can speed up the process of generating stories out of the data set.
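To illustrate why machine-readable reports make this kind of work easier, here is a minimal sketch in Python. It is not The Guardian's actual method: the `Report` fields, the category labels and the sample records are all hypothetical and invented purely for illustration. It only shows the two uses described above, namely aggregating deaths by group for a statistical overview and shortlisting the reports that are probably worth reading first.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Report:
    """One structured field report (all field names are hypothetical)."""
    day: date
    category: str      # type of incident, as given in the report's meta-data
    victim_group: str  # e.g. "civilian" or "hostile"
    deaths: int


def deaths_by_group(reports):
    """Sum the deaths per victim group, skipping reports without deaths."""
    totals = Counter()
    for report in reports:
        if report.deaths > 0:
            totals[report.victim_group] += report.deaths
    return totals


def reports_worth_reading(reports, min_deaths):
    """Shortlist reports with at least min_deaths deaths, deadliest first."""
    shortlist = [r for r in reports if r.deaths >= min_deaths]
    return sorted(shortlist, key=lambda r: r.deaths, reverse=True)


# Invented sample records, NOT taken from the War Logs.
SAMPLE = [
    Report(date(2000, 1, 1), "type-a", "civilian", 3),
    Report(date(2000, 1, 1), "type-b", "hostile", 1),
    Report(date(2000, 1, 2), "type-a", "civilian", 0),
    Report(date(2000, 1, 3), "type-c", "civilian", 2),
]

if __name__ == "__main__":
    print(dict(deaths_by_group(SAMPLE)))
    for report in reports_worth_reading(SAMPLE, min_deaths=2):
        print(report.day, report.category, report.deaths)
```

None of this works for free text: the functions depend entirely on every report carrying the same fields, which is exactly the property the diplomatic cables discussed next do not have.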

Compared to the War Logs, visualizations for ‘Cablegate’ are rare. According to Matzat (2010), this is not only due to the broad geographical reference but mainly to the content of the material. While the War Logs could be categorized and visualized relatively easily due to their clear structure, the diplomatic dispatches (‘cables’) are extensive reports and complex analyses. As Rogers from The Guardian points out, their “reporters ended up with the enormous task of actually going through each cable, reading it and seeing what stories were there” (2011). Still, The Guardian created a static world map showing how many cables come from which locations and how they are classified. This may be useful to get an overview of the material, but without knowing the actual content of the cables it does not give readers better access to it. The fact that 1,083 cables have been sent from London to Washington is not interesting without knowing what is written in them. Seeing the problem, The Guardian also offers a more ‘context-rich’ interactive map. Users can click on a country and get a list of both the original cables from Wikileaks and the articles covering the content of those cables, which is a very useful tool to investigate the material. However, only a small number of cables is available on this map so far, partly due to the material and partly due to the release policy of Wikileaks (not all cables have been released simultaneously; they continue to be released steadily in stages). For this kind of unstructured material, crowdsourcing and alternative web resources for investigating it are still an advantage of data-driven journalism. There are a couple of crowdsourcing projects or search engines for the cable releases, for example CableWiki or CableSearch (see an overview here). These resources can form the base for further visualization attempts in the future.

The coverage of the Iraq War Logs and Cablegate showed that the effectiveness of data-driven journalism techniques depends on the material at hand. For structured and machine-readable data, they are very helpful both for showing journalists where to find a story in the material and for giving readers access through visualizations. For more extensive and unstructured data like the diplomatic cables, visualizations are not as useful and there is no alternative to reading everything individually.

First Precursor of a new Journalism?
With more and more publicly relevant data available online and a further development of visualization techniques, data-driven journalism is at least likely to become a more established form of news reporting. However, it is questionable whether such data will continue to come from Wikileaks. The recent release of the Guantánamo Bay files seems to be “very nearly the final” (Gabbatt 2011) cache of the huge data set the platform supposedly obtained from Bradley Manning. I think people who have access to such files and are willing to leak them are far from the norm. Even if Wikileaks is the initial spark for a ‘leaking culture’ (which can be assumed due to the rise of more specialized and local leaking platforms like Greenleaks), it is unlikely that leaked data with the same impact and size as Cablegate or the Iraq War Logs will be common. Apart from that, the future of open government initiatives is unclear as well – especially after the budgets for this project have been cut in the United States (Yau 2011). If newspapers rely solely on the success of leaks and open government, data-driven journalism may remain a niche form of news reporting.

Therefore, I would argue that the real lesson journalists can learn from the collaboration with Wikileaks is shown by Kayser-Bril et al. (2011). They suggest that media organizations should not wait for the release of other data sets and, instead, further embrace the opportunities of data-driven journalism by becoming ‘trusted data hubs’ themselves. They should not only focus on handling externally produced data sets, but also develop and structure their own, internal database. Even though Kayser-Bril et al. do not refer to Wikileaks, they largely take the experience with its materials into account by stressing that the way data is collected is essential. Basically, all content produced by journalists is already data. What has to be changed is the way this data is collected, making it machine-readable and enabling journalists to quickly analyze large and complex data sets and build stories around them. Every event can be broken down into some fundamental information (latitude, longitude etc.), described in a structured manner and linked to other events in a database. As an example of the possibilities, Kayser-Bril et al. mention the crime page of a newspaper. Instead of just giving a list of articles about crime events, it could be transformed into a web application that plots the events over time, with options to sort the data by time, type of crime and location and to visualize it on a map – similar to The Guardian’s map for the War Logs.
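What such a structured record could look like can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a schema proposed by Kayser-Bril et al.: the `Event` fields, the function names and the sample entries are hypothetical. The point is merely that once every event carries a time, a location and a type, filtering, sorting and map plotting become trivial operations.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """A news event stored as structured data (hypothetical schema)."""
    when: datetime
    latitude: float
    longitude: float
    kind: str        # type of crime, e.g. "burglary"
    article_id: str  # link back to the article covering the event


def select_events(events: Iterable[Event],
                  kind: Optional[str] = None,
                  since: Optional[datetime] = None) -> List[Event]:
    """Filter by type and/or start time, then sort chronologically."""
    chosen = [
        e for e in events
        if (kind is None or e.kind == kind)
        and (since is None or e.when >= since)
    ]
    return sorted(chosen, key=lambda e: e.when)


def map_points(events: Iterable[Event]) -> List[Tuple[float, float, str]]:
    """Reduce events to (latitude, longitude, article_id) map markers."""
    return [(e.latitude, e.longitude, e.article_id) for e in events]


# Invented sample entries for illustration only.
EVENTS = [
    Event(datetime(2000, 1, 3, 22, 15), 10.0, 20.0, "burglary", "article-3"),
    Event(datetime(2000, 1, 1, 9, 0), 10.5, 20.5, "theft", "article-1"),
    Event(datetime(2000, 1, 2, 14, 30), 11.0, 21.0, "burglary", "article-2"),
]

if __name__ == "__main__":
    for event in select_events(EVENTS, kind="burglary"):
        print(event.when.isoformat(), event.article_id)
    print(map_points(select_events(EVENTS, since=datetime(2000, 1, 2))))
```

The markers returned by `map_points` could be handed to any mapping tool, and the `article_id` keeps each data point linked to the story it came from.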

If newspapers adopt these ideas, data-driven journalism will surely be a more common and established form of news reporting that can come into use regardless of leaks or open government. Journalism could benefit on a large scale from the new possibilities for finding, telling and presenting stories demonstrated in the coverage of Wikileaks’ material. As Phillips (2010: 100) and Benson (2010: 192) point out, more important than the capabilities of new technology is the way journalists actually use it. Becoming data hubs could make them aware that they can and should use the new possibilities to improve the quality of news reporting and not only the speed of production. This would be an important step forward, initiated not least by Wikileaks.


Friday, July 4, 2014

#asmc14 afterthoughts: Big Data and Democracy

'Big data' is usually conceived as a way to generate knowledge by analyzing ever larger and 'messier' quantities of data. The rationality behind big data is often associated with centralized control and surveillance: Grab as much data as there is about a phenomenon (if possible, all of it) and analyze it to discover patterns and predict future behavior. This is not something one would easily associate with democratic values or citizen empowerment.

However, from a historical perspective it seems that big data is the latest expression of what can be described as the "two-faced nature of quantifying society". Porter illustrates this two-faced nature when he points out that the notion of "objectivity" is
evidently required for basic justice, honest government, and true knowledge. But an excess of it crushes individual subjects, demeans minority cultures, devalues artistic creativity, and discredits genuine democratic political participation. (1995, p. 3)
Fears over excessive objectivity seem to echo our modern-day critique of big data. We could re-articulate such fears in relation to data by asking: When data is used rhetorically as "that which is given prior to argument" (Rosenberg 2013, p. 36) - as the 'factual' and indisputable basis for debate - where is there room for argument and debate when data is everywhere? On the other hand, Porter's observation also points out that "quantification was important for democratization", as Bernhard Rieder mentioned after his excellent presentation (thanks to him for pointing out Porter's book to me!). Since increased quantification can have negative and positive effects, we should not only criticize big data but also think about the conditions under which 'datafication' - the ubiquitous quantification of social life underpinning big data (van Dijck 2014) - can actually be good for democracy. Of course, this does not mean that critique is not important! My point is that we have to accept the fact that these technologies are here to stay. Thus, thinking about how to overcome the dangers of big data's modern-day practices and rationalities is valuable and important.

Looking at alternative data rationalities

I think a good starting point is to look at alternative approaches or rationalities around data that do not follow the categories and logics of big data. Therefore, I want to point out some presentations from the Social Media and the Transformation of Public Space conference that addressed alternative approaches to datafication:
  • In my own presentation about the Open Data movement (you can get the slides here) I argued that datafication may not only lead to "big data rationalities", but also to a spread of values and practices from the Open Source culture. This idea is based on the observation that Open Data activists take key values and practices from Open Source and apply them to new domains outside the development of software (see also Kelty 2008). For example, 'raw data' is conceived as 'source code' that should be shared openly. For activists, this implies a slightly different role of journalism and a form of political participation that to some degree resembles the 'Bazaar model' of Open Source. Such a spread of Open Source culture could lead to a re-articulation of concepts like journalism, participation and democracy - in ways that may not have seemed possible before.
  • Helen Kennedy's presentation Making Analytics Public: really useful analytics and public engagement (you can find both her abstract and mine here) asked whether and under which conditions (data) analytics can contribute to the public good. She argued that analytics need to become more public themselves in three ways. First, both the data and the analytical tools should be available for the public to use. Second, instead of being proprietary and black-boxed, analytics need to be open to public supervision in order to be scrutinized and debated. I think this point connects to the question whether public social media are a good idea. Moreover, Nick Couldry and Joseph Turow made a similar argument in a recently published article, warning that "the emerging culture of big data" may "erode democracy unless their hidden workings are made public and contested broadly" (2014, p. 1711). Third, Helen argues that analytics should be rethought as a more participatory process, which means that they should not only be instruments in the hands of experts but means that offer new forms of representation "by which publics can come reflexively to know and constitute themselves in new ways". In other words, datafication and analytics can be thought of as means that offer publics new ways of constituting themselves, something that could empower citizens and serve a public good.
  • Lonneke van der Velden's presentation Forensic devices for activism: on how activists use mobile device tracking for the production of public proof (abstract) explored how activists use the ubiquitous tracking of their activities for their own ends. She described InformaCam, a mobile phone application that can be used to store images or videos in two versions: one in which identifying meta-data (time, location etc.) is removed, and one in which it is preserved and in which one can even add information manually. This way, the application gives activists the means to produce public evidence without giving up their anonymity. On the notion of activism, I would also like to add Nafus and Sherman's (2014) study about the Quantified Self Movement. They describe this movement as an alternative big data practice because activists appropriate the techniques and conceptions of big data while at the same time resisting its rationality by emphasizing their status as individuals who do not fit into common categories. In Nafus and Sherman's own words, they "appropriate big data’s attention to granular patterns, but resist the categories that are built into devices and into the market for data" (2014, p. 1791). Resembling Helen's arguments, the Quantified Self Movement asks "what it means to think of data 'as a mirror' and what kinds of reflection, learning, and personal insights might emerge" (Nafus and Sherman 2014, p. 1787).

We need more research

I think more research like this is necessary to explore what types of alternative rationalities around datafication are emerging - outside the 'big data business'. Nick Couldry has called this type of research social analytics. That is
the study of how social actors are themselves using analytics - data measures of all kinds, including those they have developed and customized - to meet their own ends. For example by interpreting the world in new ways. (Couldry 2013, at minute 47:57)
Whether datafication serves businesses and intelligence agencies more than democratic values and citizen empowerment depends on how data and analytics are utilized and distributed. Research on social analytics will help us to find out under which conditions it might be good for democracy.


    • Couldry, Nick (2013, November 21). A Necessary Disenchantment: myth, agency and injustice in the digital age. Public lecture, London School of Economics and Political Science. Retrieved from
    • Couldry, Nick, & Turow, Joseph (2014). Big Data, Big Questions. Advertising, Big Data and the Clearance of the Public Realm: Marketers’ New Approaches to the Content Subsidy. International Journal of Communication, 8, 1710–1726. Retrieved from
    • Kelty, Christopher M. (2008). Two Bits: The Cultural Significance of Free Software. Durham: Duke University Press. Retrieved from
    • Nafus, Dawn, & Sherman, Jamie (2014). Big Data, Big Questions. This One Does Not Go Up To 11: The Quantified Self Movement as an Alternative Big Data Practice. International Journal of Communication, 8, 1784–1794. Retrieved from
    • Porter, Theodore M. (1995). Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. Princeton, N.J: Princeton University Press
    • Rosenberg, Daniel (2013). Data before the Fact. In L. Gitelman (Ed.), “Raw data” is an oxymoron (pp. 15–40). Cambridge, Massachusetts ; London, England: The MIT Press. 
    • Van Dijck, José (2014). Datafication, dataism and dataveillance: Big Data between scientific paradigm and ideology. Surveillance & Society, 12(2), 197–208. Retrieved from

      Sunday, June 22, 2014

      #asmc14 afterthoughts: Thinking about public social media

      Last week I attended the great Social Media and the Transformation of Public Space conference in Amsterdam. It was an exhausting, but very inspiring week! Here, I want to share some of the ideas and impressions while they are still fresh. I want to start with a question that I asked twice in two different Plenary Conversations:

      Why is there no discussion about 'public' social media (in the sense of public broadcasting)?

      I didn't raise this question because I think it is very realistic to have public social media any time soon, or that they would be a solution to all the problems and concerns raised about social media during the conference, but because it just struck me that there is absolutely no discussion about this idea.

      During the conference, many concerns or questions addressed the commercial nature of social media and the business interests and market strategies of its providers. Especially Bernhard Rieder's Keynote about the rise of algorithmic knowing made a strong argument (see his slides here): The real problem is not that Big Data acolytes promoting the power of this new paradigm are wrong, but that they might be right. Then the contrast between commercial provider interests and civic values (which are often evoked in connection with social media, for example in terms like the 'Twitter revolution') becomes even more problematic. In Bernhard's words, the danger lies in the monopolization of knowledge and a "reconfiguration of publicness according to operational goals that are geared toward profit maximization". When algorithms are powerful engines of order that produce new ways of knowing, the values and interests inscribed into them can shape publicness in many ways. It is therefore important to address these values and interests - and when we do so, it is almost unavoidable to take a normative perspective. What kind of 'public' is shaped by these providers? How do we want a 'public' to be? What kind of society do we want to live in? This led me to the question: What can we actually expect from social media when they are provided by companies that rely on advertising? From this perspective, thinking about public social media does not seem far-fetched.

      Why thinking about public social media is valuable

      There were some counter-arguments during the keynote, and I had an argument about it on Twitter with Axel Bruns (who did a fantastic job covering the conference on his blog). Axel is skeptical about public social media because they would probably not be able to attract enough users to become serious competitors for Facebook or other commercial platforms and would therefore remain irrelevant. In his response to my question, keynote speaker Hallvard Moe also argued that public social media are an interesting idea, but he thinks it is unrealistic that they are ever going to be built, especially in a neoliberal setting. Both are valid and good arguments of course. Still, I think it is valuable to think about public social media for at least three reasons:
      1. The idea of public social media infrastructures can help us (as researchers) to think about what social media networks should actually look like when they are not based on commercial interests but on civic values and on the normative frameworks that we frequently refer to in our discussions (like Habermas' public sphere theory). How exactly could we inscribe such values and norms in algorithms and infrastructures that are supposed to support a certain form of publicness?
      2. Even though it is unlikely to happen any time soon, building a public social media infrastructure could have an impact regardless of user numbers. Even if its user numbers are dwarfed by those of Facebook or other platforms, building an actually existing alternative based on civic values could have a serious impact on how social media networks are perceived.* I'm not talking here about our perception as researchers, but about a new level of awareness among people outside academia concerning the issues surrounding commercial social media platforms. Then the mere existence of a public social media infrastructure could already have an impact on commercial providers as well, who would be forced to somehow respond to this new perception.
      3. I think arguments like "that's never going to happen" or "it's not going to be successful" both threaten to foreclose a real discussion and reflect only short-term thinking. As mentioned before, I don't think public social media infrastructures are going to be built any time soon (if ever), and even if that happens they probably won't have an immediate impact. However, I suggest that we should think about this in the long term. And from a long-term perspective, setting the initial spark and starting a real discussion about public social media could be something worthwhile.
      Maybe we should think even more broadly about a public media environment, not only about public social media. Public broadcasting was born in a historically unique media environment. Who knows how and what media we will use twenty, thirty, or a hundred years from now? Thinking in the long term about a public media environment might turn out to be more flexible and successful after all. Or, maybe, public media in the classic sense of public broadcasting is not the solution either, but a more flexible model that includes both public funding and commercial elements?

      * I'm taking inspiration here from Christopher Kelty's book Two Bits, where he argues that one of the reasons Free Software or Open Source became so successful is that it is able to speak to existing forms of power through the creation of alternative infrastructures.

      Creative Commons License
      Except where otherwise noted, this work is licensed under a Creative Commons Attribution 4.0 International License. The Logo was created by anomalous_saga and is licensed under CC BY 2.0.