Harvard University’s Berkman Center for Internet & Society unveiled their Media Cloud research tool today, bringing Semantic Web goodness from Thomson Reuters’ Calais and affordably scalable Cloud oomph from Amazon Web Services together in powering exploration of a fascinating topic.

As the press release notes,

“Media Cloud was conceived by Berkman Fellow Ethan Zuckerman and Berkman Faculty Co-Director Yochai Benkler.  It was inspired by their debate over whether the blogosphere largely echoed traditional media or was instead a source for original news and democratic agenda-setting.

‘While daily newspapers struggle for survival, political, niche and special interest blogs continue to capture consumer interest,’ said Yochai Benkler, Faculty Co-Director of the Berkman Center. ‘In the midst of this upheaval, it is difficult to know where stories begin, who sets the agenda, and how these dramatic changes impact news coverage on the whole.  We created Media Cloud to help researchers and the public get quantitative answers to these challenging questions.'”

The site itself provides more detail, notably,

“Print newspapers are declaring bankruptcy nationwide. High-profile blogs are proliferating. Media companies are exploring new production techniques and business models in a landscape that is increasingly dominated by the Internet. In the midst of this upheaval, it is difficult to know what is actually happening to the shape of our news. Beyond one-off anecdotes or painstaking manual content analysis, there are few ways to examine the emerging news ecosystem.”

By systematically harvesting full-text content from both new and traditional news sources and passing it through Calais for entity extraction, the Berkman team is able to build a coherent and normalised pool of news that they – and others – can begin to interrogate in order to answer pressing questions as to the shifting news landscape. Today’s site is just the beginning, and the team has ambitious plans to increase the number of sources they harvest, to provide ever-richer visualisations on their site, and to expose public APIs to the underlying data in order that third parties can consume it for their own purposes.

Ahead of the launch I spoke with Barak Pridor, CEO of Thomson Reuters’ ClearForest subsidiary, and Berkman Fellow Stephen Schultze. Both stressed the beta nature of the site, and the fact that the team are still working to optimise a number of areas. That aside, I was impressed by some of the early results.

Barak pointed to the role Calais is playing in enabling the Berkman team to conduct comparisons within and between different news sources, extracting key entities (personal names, companies, places, etc) from the stream of text and normalising the various ways in which we refer to them (IBM, International Business Machines, etc).
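To make that normalisation step concrete, here is a minimal sketch in Python. It does not call Calais and is not the Berkman team's code; the alias table and the function names (`ALIASES`, `normalise`, `count_entities`) are my own assumptions, used purely for illustration. It shows how different surface forms of a name might be folded into one canonical entity so that mentions can be counted consistently.

```python
from collections import Counter

# Hypothetical alias table: lowercased surface form -> canonical entity name.
# A real entity-extraction service would supply this mapping; here it is hard-coded.
ALIASES = {
    "ibm": "IBM",
    "i.b.m.": "IBM",
    "international business machines": "IBM",
}


def normalise(mention: str) -> str:
    """Map one extracted mention to its canonical name, if we know one."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(mention.lower().split())
    # Unknown mentions pass through unchanged (apart from trimming).
    return ALIASES.get(key, mention.strip())


def count_entities(mentions):
    """Count mentions per canonical entity."""
    return Counter(normalise(m) for m in mentions)


if __name__ == "__main__":
    extracted = ["IBM", "International Business Machines", "Acme"]
    print(count_entities(extracted))
```

Once every source's mentions pass through the same mapping, counts for one newspaper or blog become directly comparable with counts for another, which is the kind of within-and-between comparison Barak described.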

Stephen discussed some of the background to the project, and was keen to emphasise the platform nature of their activity; although focussed on a destination web site today, the intention is to expose much of the data via a series of APIs that third parties will be able to consume. In the context of recent comparable moves from the New York Times and Britain’s Guardian, this is clearly to be welcomed and will accelerate the innovation around this incomparable pool of content.

Whilst the site does not currently expose the full richness of the data being harvested, it is already possible to see a number of interesting visualisations.


The maps above, for example, compare the coverage devoted to all the world’s countries by three very different news organisations: the BBC, the Wall Street Journal and FOX News. The apparent non-coverage of Iran is surprising, and suggests an algorithm still in need of tweaking. That apart, what stands out is the very similar spread of coverage. European snobbishness about the insularity of US news – and the farcical nature of FOX ‘news’ – may not survive exposure to this data, although we can, of course, reassure ourselves that ‘coverage’ doesn’t always equate to ‘insight,’ ‘analysis’ or ‘truth.’ As more data become available, even those preconceived notions may face an overdue battering.

The system is already capable of addressing more complex questions, and exploring temporal aspects of the ways in which stories break, spread and relate to one another across different media. Stephen shared a number of evolving visualisations that made it possible for me to explore several trends at once, and I look forward to these tools arriving on the site.

The project is directly funded by the Berkman Center at present, and a desire to enrich and stabilise the platform means that significant addition of new news resources may be some months off. I, for one, look forward to seeing more non-US resources such as the UK’s Financial Times and Guardian, or their international equivalents.

The project’s APIs are also being finalised at this point and will be released in due course, along with the source code upon which Media Cloud runs. It will be interesting to see the balance between API access and software download at that point.

Media Cloud represents an excellent example of the uses to which the technologies I cover can be put. It’s not a Semantic Web project. It’s not a Cloud Computing project. It’s an intriguing exploration that puts those technologies (invisibly) to work in helping to get the job done. That’s the way it should be.

And, as Stephen commented when asked what he was most interested to see,

“the most amazing stuff will be what other folks use it for in asking and answering their own questions.”


Thomson Reuters is already supporting this project by contributing Calais and some technical expertise. I wonder whether some of our more enlightened news organisations might like to help the Berkman Center get more data in there, faster? As more newspapers close each week, and as we (supposedly) become ever-more insular in the choices we make about the news we consume, there’s a real need to understand the ways in which the media are changing. Putting many eyes on this data set is one way to ensure that the commentary moving forward is informed and informative.
