It began, as so many things do these days, with an idle tweet.

On 21 November, Amazon Web Services’ Deepak Singh pointed to a new page describing the company’s ‘Public Data Sets on Amazon Web Services.’

Lidija Davis covered the news for ReadWriteWeb two days later and on 4 December Amazon issued its formal press release, prompting a flurry of coverage from Mike Arrington at TechCrunch, Larry Dignan at ZDNet, Krishnan Subramanian at CloudAve, and many others.

Alongside broader discussion of this move, members of the W3C-backed Linking Open Data project delved into the synergies via their public mailing list, and Linking Open Data enthusiast Kingsley Idehen’s company, OpenLink, issued a press release suggesting ways in which their products might fit within this shifting data landscape.

So what have Amazon done, what does it mean, and how does it ‘bring the Cloud of Data closer’ as the title of this post suggests?

Amazon’s web page describes their offer quite succinctly:

“Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications.”

As Krishnan noted in his post,

“By doing this, Amazon is helping research community save money on storage and bandwidth costs associated with assessing these public data from any EC2 instances they use in their research. When the data in question is in hundreds of terabytes or petabytes, we are talking about huge cost savings here.”

In addition, OpenLink’s press release gives an indication of the efficient manner in which services and data already hosted by Amazon can be plugged together as needed:

“As a vital contribution to the momentum behind the burgeoning Web of Linked Data, [OpenLink's product] Virtuoso provides a simple deployment mechanism for highly integrated knowledge bases emerging from the Linking Open Data community. For example, it is now possible to deploy personal or service-specific renditions of DBpedia within 1.5 hours, compared to an 8 – 22 hour effort when performed from scratch.”
(my links)

By offering free hosting for public data, then, Amazon are doing the wider community a huge service. Much of the data there today is reasonably readily available from other sources, so the biggest immediate benefits are those of speed and cost outlined above by Krishnan and OpenLink. For anyone already using, or thinking of using, Amazon’s Web Services to power their applications, this is yet another reason to consider Amazon.

Harvard Medical School’s Dr. Peter Tonellato was quoted in Amazon’s release, and he is unlikely to be alone:

“Public Data Sets on AWS will enable me and many of my colleagues to collaborate with each other by sharing our commonly used data sets, research environments and tools. We can set up a controlled environment in minutes, run our computational analysis for a couple of hours, and shut down the environment. Our results are completely repeatable. I only pay for the compute time I use, and more importantly I can spend more time focusing on research, not downloading and setting up computational infrastructure.”

The bigger long-term contribution of this Amazon initiative may actually lie with data that are difficult or impossible to find online today. In a previous existence at the Archaeology Data Service (ADS), for example, my colleagues and I were always being contacted by individuals and organisations with data that they wanted to see online; individuals and organisations that lacked the skills, resources or mandate to mount and maintain the data themselves. How many of those organisations will beat a path to Amazon’s door now… and what sort of resource might we see emerge as a result?

However…

Krishnan concludes his post with a reality check, commenting:

“this data stored on AWS servers are useful only if the researchers use Amazon EC2 for their computing needs… even if they could tap into it from external platforms, it doesn’t mean much if these public datasets are accessible using some kind of API from their original source itself.”

In other words, much (most? all?) of the advantage Amazon is offering evaporates if developers then have to pull the hosted data off Amazon’s servers and into their own applications running locally or via a competing Cloud provider such as Google.

Although the way in which it is recognised and monetised is finally shifting, data is still valuable, and Amazon (and others) clearly recognise the benefits of enticing users to entrust data to their offering, whilst (almost) imperceptibly making it that little bit more painful to use the data somewhere else.

Kingsley Idehen is quoted as saying,

“The Web’s potential as a globally distributed information space that plugs into disparate databases has never been in question. What has remained unclear is how a federated Web of linked databases would be delivered in a manner consistent with the Web’s core architecture, without compromising its simplicity.”

It is in moving us toward this open vision that Amazon’s offering (although undoubtedly an important step along the way) is ultimately lacking. For that, we may well require the open and linked approach of Semantic Web offerings from companies such as Talis and Kingsley’s OpenLink. These recognise the futility of expecting all data to migrate to a single service provider, whilst still ensuring that those on the ‘inside’ may gain the benefits of proximity on the network, pre-computation of certain indices, etc. Amazon and its services clearly have a place within that emerging ecosystem, but it is a place that they will need to share with others.

The worthwhile philanthropic aspects of Amazon’s announcement apart, the company is certainly doing its part to evangelise the benefits of moving data to the Cloud, and this is to be wholeheartedly welcomed.

CIOs are recognising the benefits of Cloud-based computation, and their resistance to the loss of control implied by individual cost centres’ embracing of SaaS solutions such as Salesforce is diminishing. The proposition of accessing data in the Cloud, at will, is even more profound, and the benefits to be gained require careful and compelling explanation in the face of inevitable fears regarding issues such as data integrity.

Showing everyone the benefits to be gained in sharing disparate public data sets is one more step along the way to widespread acceptance of the value in easing restrictions over access to more sensitive resources.
