As European readers are doubtless aware, the EC has traditionally been a generous funder of research across the EU’s member states, with Digital Libraries, the Semantic Web and more owing much to the largesse of its massive ‘Framework Programme’ funding cycles. We’re currently in the midst of the Seventh Framework Programme, and a few hundred of the academics and technologists hoping to secure some of the millions of euros available for ‘Technologies for Information Management’ have gathered in soggy Luxembourg to hear what they can bid for, to hear how to bid, and to engage in the funding world’s rather bizarre equivalent of Blind Date by pitching their wares to prospective partners.
I’ve been asked to talk about some of the trends and issues around ‘Big Data,’ to provide a context for the technological discussions to follow, and to illustrate some of the ways in which cutting edge implementations of the sort likely to be proposed might solve real problems.
My slides are on SlideShare and embedded above. They begin by suggesting that the use of ‘the language of catastrophe’ in describing the ‘flood’ of information around us perhaps sets everything off on the wrong foot. Yes, there’s a lot of information out there… but there’s not too much, and maybe we shouldn’t be attempting to control all of it anyway…
European Commission-funded projects were amongst the first to make serious attempts to ‘control’ and ‘manage’ the early Web, with some suggestions that we could (and should) catalogue Web pages just like books in a library. Some of those laboriously curated, instantly obsolete, and hopelessly under-representative Web ghettos still exist today; but the mainstream Web has moved far beyond them, embracing more scalable and effective combinations of machine processing and lightweight community recommendation. Even in their heyday, those for whom these resources were created were all too often to be found applying their efforts to routing around these obstructions to the free flow of information across the Web.
As the volume of data available to us grows, it presents massive new opportunities as well as significant technological and social challenges. Twitter is just one example of the rise of the ‘real-time’ Web, and connected devices such as the iPhone and Google’s Android devices fundamentally shift the ways in which we consume and contribute to the ever-accelerating flows of data.
Use, re-use and control of data, too, are increasingly topical issues with which we should be concerned. The rise of licensing frameworks such as CC0 and the Open Data Commons is part of an attempt to reduce ambiguity in the ways that data may be repurposed, and expectations grow daily that data will be available, whether from Government, community groups or the private sector. Privacy, security, provenance and trust all come into play, with a path to be diplomatically steered between those too blasé to recognise the real issues at stake and those too cautious to countenance progress perceived to be at their expense.
Moving the quantities of data involved is becoming a serious challenge, too, and a generation of researchers accustomed to ‘simply’ throwing data into the Cloud for later analysis, sharing or retrieval must increasingly grapple with latency and bandwidth. Physical proximity of data to computation might actually matter again, and more than one of the Cloud’s biggest players has been heard to suggest that physical media might actually be the most cost-effective way to move these mountains of data around the globe.
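A back-of-the-envelope sketch makes the bandwidth point concrete. The Python below is purely illustrative: the function names, the 80% link-efficiency default and any figures fed into it are assumptions made for the sake of the arithmetic, not numbers from the talk or from any Cloud provider.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_terabytes: float,
                  link_megabits_per_s: float,
                  efficiency: float = 0.8) -> float:
    """Estimate the days needed to push a dataset over a network link.

    `efficiency` is an assumed fraction of the nominal link speed that is
    actually achieved; 0.8 is an illustrative guess, not a measurement.
    Terabytes are taken as decimal (10**12 bytes).
    """
    total_bits = data_terabytes * 1e12 * 8
    effective_bits_per_s = link_megabits_per_s * 1e6 * efficiency
    return total_bits / effective_bits_per_s / SECONDS_PER_DAY


def faster_by_courier(data_terabytes: float,
                      link_megabits_per_s: float,
                      courier_days: float,
                      efficiency: float = 0.8) -> bool:
    """Return True if shipping physical media beats the network on time.

    `courier_days` is a hypothetical door-to-door figure, including the
    time to copy data on and off the disks.
    """
    network_days = transfer_days(data_terabytes, link_megabits_per_s, efficiency)
    return courier_days < network_days
```

Under these assumptions, 100 TB over a fully utilised 100 Mbit/s link takes roughly three months, so a hypothetical three-day courier wins comfortably; for a few gigabytes the network wins just as comfortably. The sketch compares elapsed time only and ignores cost, which would need real pricing to assess.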
I look forward to seeing what the nascent projects likely to coalesce over the next 24 hours contribute to understanding and progress with any or all of these… and hope that my presentation plays its part in shaping their thinking, their proposals and their outcomes.