Image of Stephen Wolfram via Wikipedia

British-born computer scientist Stephen Wolfram sees ongoing efforts to extend the Internet’s top-level domains (TLDs) beyond the familiar .com, .org, .uk, etc. as an opportunity to raise the profile of machine-readable data. In a blog post published yesterday, he argues that a new .data domain would increase “exposure of data on the internet—and [provide] added impetus for organizations to expose data in a way that can efficiently be found and accessed.” Whilst wholly in favour of Wolfram’s stated aim, I can’t help feeling that his suggested solution is at best unnecessary and at worst a worrying segregation of data from the ‘proper’ web that everyone else will continue to exploit.

Back in June of last year, the body responsible for coordinating the global domain name system approved a plan to permit new top-level domains (the letters after the final dot in an internet address: the .com in cloudofdata.com, the .uk in bbc.co.uk, the .edu in harvard.edu). Until recently, these top-level domains have been tightly controlled, with a small set of generic domains (.edu, .gov, .mil, .org, etc.), a larger set of country domains (.uk, .fi, .nz, etc.) and one or two others such as .eu. From tomorrow, anyone with $185,000 will be able to submit a proposal to create and manage a new top-level domain, and it’s possible that there could eventually be thousands of them. Wolfram is keen to ensure that data doesn’t miss out on the ‘opportunity.’

As Wolfram himself recognises, there is already an awful lot of machine-readable data on the web. Some of it sits embedded within the web pages that humans read, with specially formatted code waiting to be triggered by the calendars, the address books, or the browser plugins of site visitors. Some of it is packaged up in data files, offered for download. And some of it waits inside a database, ready to be delivered in response to an API call or a query typed into a web form.

There is a growing enthusiasm for exposing this data for reuse. Government transparency agendas have driven public sector data sites like data.gov.uk and data.gov. Similarly, efforts such as data.open.ac.uk and data.southampton.ac.uk see universities beginning to consciously collect data sets together and offer them up for reuse. Comparable efforts in the commercial world are harder to point to, but that reticence has nothing whatsoever to do with the lack of a ford.data, boeing.data, ge.data or astrazeneca.data domain!

In some ways, the convention for gathering significant chunks of data on a data.xxx.yyy site echoes Wolfram’s intention, but with a number of advantages. Data without context is far less valuable than data with context. Much of that context may be inferred from the domain in which the data lives, with data delivered from a .gov or .edu (or .gov.uk or .ac.uk) site perhaps interpreted differently to data hosted on .com, .biz, or .xxx. Southampton University, the Open University, and the US Federal Government are able to gather data up and make it available for download via their existing data. sites if they choose. This offers human visitors to their sites a degree of convenience, whilst retaining the power and brand attributes of their existing domain. Gov.data, gov.uk.data, open.ac.uk.data, southampton.ac.uk.data, though? All are messy, in ways that Wolfram’s own wolfram.data would admittedly not be, and all are simply additional registrations that the institutions would have to pay for in order to stop someone else grabbing the domain.

At the end of the day, the machines don’t actually care. The existing data.open.ac.uk-type sites are human conveniences, not machine enablers. The computers, and the software they run, are quite capable of crawling the public web and finding accessible data wherever it lies on a site. There are plenty of reasons to continue embedding little snippets of data inside human readable web pages, regardless of whether you have a data.wolfram.com or a wolfram.data site. Content negotiation is becoming increasingly capable, such that there really is no need for what Wolfram calls a ‘parallel construct to the ordinary web’ at all. A human being arriving at a web site sees human readable content, whilst various software tools would automatically be presented with very different data or functions, optimised to their capabilities and requirements.
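To make the content negotiation point concrete, here is a minimal sketch in Python of how a single web address might decide which representation to send back, based on the Accept header that browsers and software clients include in their HTTP requests. The function names and the list of available formats are my own illustrative assumptions rather than anything taken from a particular site or library, and the matching is deliberately simplified: real servers also weigh how specific each requested type is.

```python
# Hypothetical list of formats one URL could serve; chosen for illustration only.
AVAILABLE = ["text/html", "application/json", "text/turtle"]


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an HTTP Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def matches(pattern: str, media_type: str) -> bool:
    """Check a media type against an Accept pattern such as text/* or */*."""
    if pattern == "*/*":
        return True
    if pattern.endswith("/*"):
        return media_type.startswith(pattern[:-1])
    return pattern == media_type


def negotiate(accept_header: str, available: list[str], default: str = "text/html") -> str:
    """Pick the representation to serve; fall back to the human-readable page."""
    for pattern, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means the client explicitly rejects this type
        for media_type in available:
            if matches(pattern, media_type):
                return media_type
    return default


# A browser-style header gets the human-readable page.
print(negotiate("text/html,application/xhtml+xml,*/*;q=0.8", AVAILABLE))
# A data-hungry client asking for Turtle or JSON gets machine-readable output.
print(negotiate("text/turtle;q=0.9, application/json;q=0.5", AVAILABLE))
```

The same address serves both audiences here, which is the whole point: nothing about the choice of format depends on which domain the resource lives under.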

By all means, let us show the curious some of the existing techniques that work in making data more easily accessible. By all means, let us identify the gaps, the issues, the problems (none of which a new TLD even begins to address). Yes, let us definitely and unambiguously set about “highlighting the exposure of data on the internet—and providing added impetus for organizations to expose data in a way that can efficiently be found and accessed.”

But please, let us not be distracted by the false hope that adding yet another TLD to the babel that ICANN is about to unleash can do anything more than consign data to some online ghetto, wallowing unwanted, unloved and unused as companies and their customers lavish love, attention, and clicks upon the .com domain over on the ‘proper’ web.

Thanks to Raphaël Troncy, whose tweet first drew the story to my attention.