
It began, as so many things do these days, with an idle tweet.
On 21 November, Amazon Web Services’ Deepak Singh pointed to a new page describing the company’s ‘Public Data Sets on Amazon Web Services.’
Lidija Davis covered the news for ReadWriteWeb two days later and on 4 December Amazon issued its formal press release, prompting a flurry of coverage from Mike Arrington at TechCrunch, Larry Dignan at ZDNet, Krishnan Subramanian at CloudAve, and many others.
Alongside broader discussion of this move, members of the W3C-backed Linking Open Data project delved into the synergies via their public mailing list, and Linking Open Data enthusiast Kingsley Idehen’s company issued a press release suggesting ways in which their products might fit within this shifting data landscape.
So what have Amazon done, what does it mean, and how does it ‘bring the Cloud of Data closer’ as the title of this post suggests?
Amazon’s web page describes their offer quite succinctly:
“Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications.”
As Krishnan noted in his post,
“By doing this, Amazon is helping research community save money on storage and bandwidth costs associated with assessing these public data from any EC2 instances they use in their research. When the data in question is in hundreds of terabytes or petabytes, we are talking about huge cost savings here.”
In addition, OpenLink’s press release gives an indication of the efficient manner in which services and data already hosted by Amazon can be plugged together as needed:
“As a vital contribution to the momentum behind the burgeoning Web of Linked Data, [OpenLink's product] Virtuoso provides a simple deployment mechanism for highly integrated knowledge bases emerging from the Linking Open Data community. For example, it is now possible to deploy personal or service-specific renditions of DBpedia within 1.5 hours, compared to an 8 – 22 hour effort when performed from scratch.”
(my links)
By offering free hosting for public data, then, Amazon are doing the wider community a huge service. Much of the data there today is reasonably readily available from other sources, so the biggest immediate benefits are those of speed and cost outlined above by Krishnan and OpenLink. For existing or potential users of Amazon’s Web Services to power their applications, this is yet another reason to consider Amazon.
Harvard Medical School’s Dr. Peter Tonellato was quoted in Amazon’s release, and he is unlikely to be alone:
“Public Data Sets on AWS will enable me and many of my colleagues to collaborate with each other by sharing our commonly used data sets, research environments and tools. We can set up a controlled environment in minutes, run our computational analysis for a couple of hours, and shut down the environment. Our results are completely repeatable. I only pay for the compute time I use, and more importantly I can spend more time focusing on research, not downloading and setting up computational infrastructure.”
The bigger long-term contribution of this Amazon initiative may actually lie with data that are difficult or impossible to find online today. In a previous existence at the Archaeology Data Service (ADS), for example, my colleagues and I were always being contacted by individuals and organisations with data that they wanted to see online; individuals and organisations that lacked the skills, resources or mandate to mount and maintain the data themselves. How many of those organisations will beat a path to Amazon’s door now… and what sort of resource might we see emerge as a result?
However…
Krishnan concludes his post with a reality-check, commenting:
“this data stored on AWS servers are useful only if the researchers use Amazon EC2 for their computing needs… even if they could tap into it from external platforms, it doesn’t mean much if these public datasets are accessible using some kind of API from their original source itself.”
In other words, much (most? all?) of the advantage Amazon is offering evaporates if developers then have to pull the hosted data off Amazon’s servers and into their own applications running locally or via a competing Cloud provider such as Google.
Although the way in which it is recognised and monetised is finally shifting, data is still valuable, and Amazon (and others) clearly recognise the benefits of enticing users to entrust data to their offering, whilst (almost) imperceptibly making it that little bit more painful to use the data somewhere else.
Kingsley Idehen is quoted as saying,
“The Web’s potential as a globally distributed information space that plugs into disparate databases has never been in question. What has remained unclear is how a federated Web of linked databases would be delivered in a manner consistent with the Web’s core architecture, without compromising its simplicity.”
It is in moving us toward this open vision that Amazon’s offering (although undoubtedly an important step along the way) is ultimately lacking. For that, we may well require the open and linked approach of Semantic Web offerings from companies such as Talis and Kingsley’s OpenLink. These recognise the futility of expecting all data to migrate to a single service provider, whilst still ensuring that those on the ‘inside’ may gain the benefits of proximity on the network, pre-computation of certain indices, etc. Amazon and its services clearly have a place within that emerging ecosystem, but it is a place that they will need to share with others.
The worthwhile philanthropic aspects of Amazon’s announcement apart, the company is certainly doing its part to evangelise the benefits of moving data to the Cloud, and this is to be wholeheartedly welcomed.
CIOs are recognising the benefits of Cloud-based computation, and their resistance to the loss of control implied by individual cost centres’ embracing of SaaS solutions such as Salesforce is diminishing. The proposition of accessing data in the Cloud, at will, is even more profound, and the benefits to be gained require careful and compelling explanation in the face of inevitable fears regarding issues such as data integrity.
Showing everyone the benefits to be gained in sharing disparate public data sets is one more step along the way to widespread acceptance of the value in easing restrictions over access to more sensitive resources.
Paul Miller works at the interface between the worlds of Cloud Computing and the Semantic Web, providing the insights that enable you to exploit the next wave as we approach the World Wide Database.
Comments
Having first hand experience with AWS and processing lots of data I have to say it is marvelous (to a startup).
The biggest lacking point when turning around really big datasets is high price for “extra large” instances that you need for that… But that’s business.
Now about LOD datasets. This is wonderful. Maybe the next step they could take is creating a big distributed SPARQL server with those datasets on the “pay-as-you-go” model, so semantic web users don’t need to set up their own instances.
I’d say Amazon’s latest move is a bit similar to ISPs back in the day setting up local mirrors of Linux distributions and shareware sites, so their users had them close. Good move, but that will not bring a long-term differentiator. Services will.
bye
Andraz Tori, Zemanta
Paul,
An important point to note re. Amazon EC2. They currently offer the most flexible solution for the emerging “Data as a Service” (DaaS) frontier.
As noted in some of the commentary above, it’s about who delivers the most cost-effective route to data required by data consumers (typically data analysts). At the current time Amazon, courtesy of the nature of Linux licensing which enables the construction of paid AMIs, leads the pack.
Amazon is basically accelerating the emergence of the Web 3.0 based “citizen analyst” just as Web 2.0 gave us the “citizen journalist”.
Interesting conversation developing for sure
Kingsley
Andraz,
This is exactly what we are doing
See:
1. http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtuosoEC2AMI – General EC2 AMI information
2. http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtEC2AMIDBpediaInstall – DBpedia rendition on EC2 guide
3. http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtEC2AMINeuroCommonsInstall – NeuroCommons rendition on EC2 guide.
The entire LOD is to come; it will be a full cluster edition of Virtuoso, and EC2 users will simply be able to set up a variety of cluster models in line with requirements for redundancy (failsafe) and load balancing (scale out).
EC2 is a great thing for Web startups; it provides them with minimal costs at bootstrap time. You no longer need to depend on VCs data center capital etc.
It is also a great thing for data analysts (as already indicated in the post) by alleviating the distracting costs of DBMS or Knowledgebase installation, configuration, loading, and tuning.
Kingsley
[...] Do I need associated data within the providers environment(ie. google, AWS, salesforce.com, [...]
[...] reach. Apps for Democracy list interesting cases where not only it makes sense to publish public data sets on the cloud but government services that make sense for any citizen. Dan Bricklin envisioned “public IT [...]
[...] interesting are the Applications for Democracy, which go beyond publishing public databases on the internet and serve the entire population of a country. Dan Bricklin predicted “IT infrastructure as a [...]
[...] in December of 2008, I wrote about a new initiative from Amazon to make large sets of public data more accessible. Amazon offered to mount the data for free, and [...]
[...] Amazon Public Data Sets bring the Cloud of Data closer (cloudofdata.com) [...]