Paul Miller

The Cloud of Data


How Open is ‘Open’ ?


There has been a recent burst of enthusiasm for making raw data produced by and for Government more ‘open,’ and this must surely be welcomed. Long-running grass-roots efforts such as Tom Steinberg’s mySociety and The Guardian’s Free Our Data campaign continue to innovate, but in an environment that is suddenly more receptive to their ideas. Edge-case adoptions of RDFa and other ‘semantic’ specifications, perhaps, are at last moving from being merely the preserve of a few isolated enthusiasts.

Sir Tim Berners-Lee now walks the corridors of power in London and Washington and elected officials (and even the Opposition parties) at least claim to be listening to his call for ‘Raw Data, Now!’ and his talk of Linked Data, URIs, and the rest. How far we have come, but we have much further still to go.

‘Open’ and ‘Transparent’ Government is nothing new. It’s been talked about for a very long time, and there has been some progress. Part of the issue, I think, comes down to interpretations of ‘open.’ Just because it’s possible to download some Government data doesn’t necessarily mean it’s practical for most interested parties to do so.

If a national library puts all of its catalogue online for free, but requires you to query it via an obscure industry protocol, is that ‘open’ ? If they then throttle access so that it would take an inordinately long period of time to ‘copy’ their catalogue, is that ‘open’ ?

If a National Statistics agency makes all of their research freely available, and provides access to thousands of opaquely named csv files by listing them on a web page, is that ‘open’ ?

If a Government department makes all its research reports available online as Microsoft Word files, is that ‘open’ ?

A purist might strenuously assert that none of these are ‘open.’ Most, certainly, are far from ideal… but they still serve a real purpose in making the innards of Government more accountable. How good should be good enough in 2009?

Going the other way, does a Health Authority have to make my medical records visible to the world before it can be called ‘open’ ? It seems almost unthinkable, but extremes of viewpoint do have an annoying habit of quickly becoming that absurd.

The current enthusiasm for ‘Open’ is closely associated with Tim Berners-Lee’s talk of Linked Data and the newly pragmatic Semantic Web, and Berners-Lee provided a short note last week on his current views. Contrast Tim’s discussion of the ways in which Government data should be linkable with the Sunlight Foundation’s attack on the US Federal Government’s transparency flagship, Recovery.gov, for not making any real data available in the first place.

If we can’t even get the existing raw data out of Government as often as we’d like, there’s a long way to go before Berners-Lee’s grander vision can be achieved. He recognises this, of course, writing;

“Government data is being put online to increase accountability, contribute valuable information about the world, and to enable government, the country, and the world to function more efficiently. All of these purposes are served by putting the information on the Web as Linked Data. Start with the ‘low-hanging fruit’. Whatever else, the raw data should be made available as soon as possible. Preferably, it should be put up as Linked Data. As a third priority, it should be linked to other sources. As a lower priority, nice user interfaces should be made to it — if interested communities outside government have not already done it.”

(my emphasis)
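
To make the step from ‘raw data’ to ‘Linked Data’ a little more concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration: the CSV columns, the invented rows, the `example.org` URIs and the `rows_to_ntriples` helper are hypothetical, and no Government publisher is claimed to work this way. It simply shows a row of a spreadsheet becoming statements about a thing that has its own URI.

```python
import csv
import io

# Invented sample data, standing in for a raw CSV file a department might publish.
RAW_CSV = """code,name,population
A1,Exampleton,1200
B2,Sampleford,3400
"""

# Assumed base URIs; a real publisher would choose and maintain its own.
BASE = "http://example.org/id/area/"
VOCAB = "http://example.org/def/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def rows_to_ntriples(text):
    """Turn each CSV row into N-Triples statements about a URI-named area."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "<" + BASE + row["code"] + ">"
        name = row["name"]
        population = row["population"]
        # Plain string literal; names containing quotes would need escaping.
        triples.append(f'{subject} <{VOCAB}name> "{name}" .')
        # Typed literal, so that consumers know the value is a number.
        triples.append(f'{subject} <{VOCAB}population> "{population}"^^<{XSD_INTEGER}> .')
    return triples


if __name__ == "__main__":
    for line in rows_to_ntriples(RAW_CSV):
        print(line)
```

The raw CSV is already useful on its own, which is the point of the first priority in the quote above. The URIs are what allow someone else’s data to point at the same area later on.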

To get much further, and to make that progress sustainable, there’s a requirement for a very real shift in attitudes at the heart of Government. Openness (of data or anything else) shouldn’t be a tactic to distract from worse news elsewhere, or a short lived knee-jerk response to the latest embarrassment. Rather, it should be a deep-seated presumption to underpin policy, systems design and more.

Data from Government should, quite simply, be freely and easily available. As a matter of course, and without prevarication. Unless there is a compelling reason to do otherwise.

For all the talk of ‘open,’ that is very far from being true today. The presumption is ‘closed.’ The mindset is (largely) ‘closed.’ ‘Open’ has to be fought for, and ‘Open’ has to be justified. ‘Open’ has to be championed, endlessly, tirelessly, thanklessly.

The exact opposite should be true. Then (and maybe only then?) Berners-Lee and his colleagues can build something wonderful.


Looking back at the Semantic Technology Conference, and the rest of my week in the Valley


Today’s my first day back at home, following a tour of Silicon Valley and a couple of days in London to present at the ISKO UK Conference.

The main event was this year’s Semantic Technology Conference in San Jose, which I’ve already partially covered via various blog posts over on ZDNet. More on this in a moment.

On the travel front, it was great to chat with JP Rangaswami on the flight out and even greater (no offence, JP!  ;-) ) to recline in comfort on the overnight return journey after a very welcome upgrade from British Airways.

Before and after the conference, I visited several Silicon Valley companies both new and old. Despite evidence of the economic situation at every turn (most shockingly, actually, in the boarding up of two fixtures of many a trip to San Francisco; the Disney Store on Union Square and the big Virgin Megastore on Market) the people I spoke with were full of ideas and enthusiasm for the journey ahead. In so many areas of technology, the edge cases and the cool or interesting concepts are moving towards the mainstream. Mobile, clearly, is everywhere, and there is a growing recognition of the value in structure, linking and data or application portability. We’re even, finally, seeing some viable business models based upon something other than just plastering sites with Ads. Painful as it has been for far too many individuals, this latest economic hiccup appears increasingly likely to have been very good for a tech industry that had become a little too complacent, far too smug, and rather caught up in tech for tech’s sake. The survivors are leaner, meaner, and far more focussed upon understanding the areas in which they bring sustainable value to audiences of a viable size. That’s a good thing, and I can only hope that we remember the lessons we’ve learned when the champagne starts flowing and the car parks of tech startups once again fill with Porsches, Ferraris and Teslas…

It’s a testament to the topicality of the subject matter and the hard work of the team at Wilshire that attendance was actually higher than last year, and the buzz of conversation during the event would suggest that the upward trend shows every sign of continuing. Speaking during a special live episode of the Semantic Web Gang that we recorded on stage just before the end of the conference, Wilshire’s Tony Shaw discussed attendance figures and the breakdown of different audience categories. This podcast should be online shortly. In the midst of this collective tightening of belts, it’s not surprising that the freebie/schwag haul was the lowest I’ve ever seen. I understand why. The children, however, were sorely disappointed and distinctly unimpressed by my attempts at rationalisation. I think they feel that the technology sector owes them schwag.

With a programme that began at 0730 each morning and proceeded through the day with as many as ten parallel sessions, there was something for everyone. Free proper coffee helped to lubricate the corridor conversations, and I probably drank rather too many vanilla lattes as I tended to find myself engaged in those conversations when my panel moderating duties permitted. Keynote sessions each day addressed different aspects of the space, from the Toms (Tague and Gruber) discussing where we’ve been and one aspect of where we might go, through an ultimately disappointing assemblage of Semantic Search’s big hitters to Nova Spivack’s chat with Wolfram Alpha’s Russell Foltz-Smith and the final morning’s (promising, yet disappointingly vague) announcements from the New York Times.

I didn’t see (much) radical innovation, and I didn’t see any great technological leaps forward from last year. However, what I saw again and again was evidence of maturity, adoption and real-world deployment. Ideas that might have seemed a little unrealistic or risqué just a year ago had become an accepted part of the conversation. The Web was more dominant than in previous years. As Ian Davis has often described it, we’ve seen a welcome and perhaps overdue shift from the Semantic (Web) toward (Semantic) Web. Efforts like Freemix package existing capabilities and bring them within reach of the next wave of adopters. Linked Data was on the lips of many, and Mills Davis was amongst those keen to encourage the sector’s serious engagement with the accelerating enthusiasm for Linked Data in Government on both sides of the Atlantic. I agree with him, and look forward to playing my part in ensuring that something real follows the political rhetoric of recent weeks. Although the terminology is different, in many ways it’s the realisation of the vision we set out within the Common Information Environment at the beginning of this decade. I’m trying to line up podcasts with some of those involved inside the US and UK Governments, so watch this space…

Thanks to Tony Shaw, Eric Franzon and the team at Wilshire for once again laying on a great event. Thanks to the panelists on all the panels I moderated for making my job exceedingly easy. And thanks to all of those I met for some great conversation, much of which I’m still digesting.

Roll on next year, when the conference moves out of San Jose to take over the San Francisco Hilton from 21-25 June. See you there!


Thomson Reuters turns to FedEx and DHL to boot-strap their Cloud of Data


Thomson Reuters’ Open Calais team have clearly been busy, with several announcements at the Semantic Technology Conference here in San Jose.

On 15 June the company rolled out version 4.1 of Open Calais, embracing Spanish language content and the notion of ‘social tags;’

OpenCalais is a great semantic data extraction engine. If you write an article about the relative merits of Porsche and BMW at the test track in Leipzig, we’ll diligently identify Porsche and BMW as companies and Leipzig as a geography. We’ll create Linked Data URIs to represent these things and open up access to the Linked Data ecosystem so you can enhance your article with other content assets.

But… sometimes you just want a great description. The kind of tags a human would put on the article. Like “Car racing” or “Automobiles”. The kind of tag that would, for example, be very searchable and therefore …. SEO’able (that is definitely is not a word).

In 4.1 we’re introducing OpenCalais Social Tags. Social Tags is our attempt to emulate how a human might tag the document. Social Tags does some fairly sophisticated analysis of your entire document and maps it to a knowledgebase based on Wikipedia and other assets. From that process we generate Social Tags.

This morning, they followed through with a further pair of announcements. Firstly, CNET has been joined by The Huffington Post, DailyMe and the UK’s Mail Online in integrating Open Calais into their workflow.

One of the more interesting aspects of the earlier CNET announcement was the contribution of data back into the pool;

“CNET joins Thomson Reuters as one of the first commercial media companies to publish core data assets for public, programmatic use on the open semantic Web. CNET will leverage OpenCalais’ connection to the rapidly expanding ‘Linked Data cloud’ to allow its original content — such as tech product reviews on laptops, TVs, smart phones, and digital cameras; news articles and blog posts from its CNET News editorial staff; and parts of its core technology product catalog – to be available for public use.”

It will be interesting to see whether these latest media properties are able and willing to do something similar.

The second element announced today in many ways mirrors Amazon’s recent Import/Export service. Thomson Reuters, too, have recognised that it remains impractical to move large quantities of data over the network, and today announced their ‘Archive Express’ which will process up to 20 million documents off a physical storage device within 24 hours, free of charge.


Sir Tim Berners-Lee talks to the BBC’s Rory Cellan-Jones about Linked Data


There has been some coverage of the recent announcement by UK Prime Minister, Gordon Brown, that Sir Tim Berners-Lee will be helping the UK Government to make more of its data available for use and reuse. Charles Arthur at the Guardian was quick off the mark, and the BBC’s Rory Cellan-Jones followed through with an interesting telephone conversation, the full content of which is now available.

“So that government information is accessible and useful for the widest possible group of people, I [Gordon Brown] have asked Sir Tim Berners-Lee who led the creation of the world wide web, to help us drive the opening up of access to Government data in the web over the coming month.”

The ‘coming month’? Singular??? That’s definitely what the transcript says…

Readers of this blog will be no strangers to Berners-Lee, the Linked Data movement, or the opportunities to apply some of its ideas to mainstream applications in the public and private sectors.

It’s good to see the mainstream attention. Now all we have to do is make the most of opportunities like this (and similar initiatives emerging from the Obama administration) to deliver on the promise.


Tom Gruber talks about Siri, the Virtual Personal Assistant


Tom Gruber will be one of the keynotes at the Semantic Technology Conference in San Jose next week, and there’s a lot of interest in what he’s likely to show. I spoke to Tom yesterday to learn more, and the result has just been released as a podcast.

Well known in the Artificial Intelligence research community, and perhaps better known to the Web world as Founder of RealTravel, Tom is now CTO and co-Founder of Siri.

Siri itself came to notice back in October, when a company previously going by the name Stealth Company announced itself to the world. The intervening months have seen the team hard at work to deliver on their vision, prior to once again coming to wider attention in the past week or two with appearances at All Things Digital and (next week) the Semantic Technology Conference. Interest in the tech media has clearly been piqued, with several posts on TechCrunch, and write-ups in BusinessWeek and the New York Times.

 
Standard Podcast [51:16m]

Production of this podcast was supported by Talis, and show notes will appear on their Nodalities blog shortly.

As Tom’s abstract for next week notes,

“We are beginning to see a new interaction paradigm for the web: the Virtual Personal Assistant (VPA). A VPA is task focused: it helps you get things done. You interact with it in natural language, in a conversation. It gets to know you, acts on your behalf, and gets better with time. The VPA paradigm builds on the information and services of the web, with new technical challenges of semantic intent understanding, context awareness, service delegation, and mass personalization.

Siri is a virtual personal assistant for the mobile Internet. Although just in its infancy, Siri can help with some common tasks that human assistants do, such as booking a restaurant, getting tickets to a show, and inviting a friend.”

Siri’s first incarnation will be in the form of an iPhone app, which Tom describes in the podcast.

Have a listen, and see whether you think a VPA will become an essential part of your mobile life anytime soon…


TripIt – adding structure, one journey at a time


TripIt is one of those web applications upon which I have really come to rely. Like Tungle, it sets about reducing the pain of dealing with the admin behind a boring, repetitive, frustrating yet necessary part of my work.

For Tungle, as I’ve said before, that task is meeting scheduling. For TripIt, it’s organising and tracking the various elements of my travel arrangements. Not only does TripIt reduce the hassle, but it does this in an understated fashion that doesn’t impact adversely upon my workflow; and in the process it adds value so that I end up with less pain, less wasted time… and more valuable information.

I was delighted to have the opportunity to speak recently with TripIt co-founder and VP of Engineering, Andy Denmark. The result has just been released as one of my podcasts.

 
Standard Podcast [40:00m]

Production of this podcast was supported by Talis, and show notes are available on their Nodalities blog.

By simply forwarding all those automated booking confirmation messages from airlines, hotels, rail companies and car rental sites to TripIt, the site builds an itinerary and makes it available for synchronisation to your calendar. It does all that in less time than it would take to enter the details yourself, whilst also storing a copy online, making it available for sharing with your network via LinkedIn, a blog plugin, etc, and automatically adding additional information such as the weather forecast at your destination, directions from the car rental location to your hotel, and more.

A TripIt itinerary displayed on the iPhone

Personally, I find that the biggest advantage is simply building the itinerary and getting it into iCal quickly, accurately and painlessly. It’s also useful to be able to share flight arrival times, hotel phone numbers etc with family, should they ever need them.

TripIt recently introduced a premium service with some additional features, which I have yet to try.

Behind the scenes, TripIt is drawing upon a wealth of structured data scattered across the Web. It is also doing a lot, internally, to add structure to the free text of those booking emails, and sometimes it is more successful at this than others.
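
As a rough illustration of what ‘adding structure to free text’ can mean, here is a small Python sketch. It is not TripIt’s method, which I haven’t seen; the email layout, the field labels, the flight number and the `extract_itinerary` helper are all invented for the example. It also hints at why results vary, since a supplier that words its confirmation differently would simply not match.

```python
import re

# Hypothetical confirmation email; real messages vary widely between suppliers.
EMAIL = """Dear traveller,
Your booking is confirmed.
Flight: XX123
From: LHR
To: SFO
Departs: 2009-06-14 10:45
"""

# Match lines of the form "Label: value" for the labels we care about.
FIELD = re.compile(r"^(Flight|From|To|Departs):\s*(.+)$", re.MULTILINE)


def extract_itinerary(text):
    """Pull labelled fields out of free text into a simple dictionary."""
    return {label.lower(): value.strip() for label, value in FIELD.findall(text)}


if __name__ == "__main__":
    print(extract_itinerary(EMAIL))
```

Once the details are in a structure like this, pushing them into a calendar or looking up the weather at the destination becomes a straightforward lookup rather than a reading exercise.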

With Yahoo! SearchMonkey and recent announcements from Google likely to drive an explosion in more structured data on the Web, TripIt perhaps shows us a small glimpse of what might become commonplace; dedicated vertical apps mining our online presence to enrich, add value, and make our lives easier in small but important ways. I’d argue that building the next generation of these applications will be even easier, as increased public scrutiny leads to cleaner, richer data with which to work, and as ever more APIs from the Web’s Platform companies leave application builders increasingly able to stand, with ease, upon the shoulders of those giants. I look forward to finding out.


Sun moves their Cloud forward at CommunityOne


Sun Microsystems used the CommunityOne East event in New York City this past March to unveil their Cloud Computing offering. I spoke with the company’s Juan Carlos Soto recently, to learn more.

Today, David Douglas (Senior VP, Cloud Computing) opened CommunityOne West in San Francisco discussing ‘Communities, Open Source Platforms, and Clouds.’ I joined the live webcast to see what he had to say.

Dave Douglas kicks off, talking to the importance of ‘community’. He stresses the underlying value of open – source code, protocols, formats, ideas.

“‘Open’ lowers barriers to adoption and innovation.”

A lot of the ideas he’s highlighting are similar to Tim O’Reilly’s call to ‘do stuff that matters;’ but oddly Dave doesn’t mention this.

Lew Tucker, Sun’s Cloud CTO, gets up on stage to talk about Sun’s Cloud Computing with Dave. Their opening gambit is around the on-demand nature of the Cloud, with its ability to pull up (and shut down) Cloud resources on demand, with a credit card. Lew argues that the Cloud doesn’t create lock-in, as it’s based upon open software such as Apache, Solaris and Linux.

Sun’s Storage Service, announced in March, is still on track to be available this summer… so no surprise unveiling from the stage today.

Lew shows some demonstrations of the Sun Compute and Storage Services, building upon those we saw in March to manage resources in the data centre via GUI.

Dave mentioned that ‘several thousand’ Sun staff currently use the Sun Cloud internally, every day, “in Open Office” and elsewhere. Is this ‘just’ Cloud-based file storage, or something more?

On an intriguing mix of laptops, other examples from Sun Partners include Vertica and webappVM. The examples definitely leaned towards the sysadmin and developer crowd, and I look forward to seeing some user-facing apps down the line. Dave cites ‘dozens and dozens’ of partners, as their logos flash up on screen behind him.

Lew suggests that the Cloud introduces a change from ‘Download -> Install -> Config’ to ‘Deploy,’ with the implication that this will always be easier.

Turning to Security, Lew points to a new ‘secure hardened VM for OpenSolaris,’ available on Amazon S3 today. The Center for Internet Security has assessed this new VM and verified it as secure.

Eric Baldeschwieler from Yahoo! gets up on stage, to talk about the ways in which Apache Hadoop is being used at Yahoo! – and their use of the Sun Cloud.

I look forward to hearing more, face to face, during June’s Semantic Technology and Cloud Computing tour around Silicon Valley; Menlo Park is already on my itinerary, along with sojourns to San Jose, San Francisco and Sunnyvale. Anyone else got things they want to show me, June 14-21?


John Wilbanks talks about Creative Commons, Data, Science and more


My latest podcast is with John Wilbanks, the VP at Creative Commons with responsibility for their Science Commons project.

John has a varied background that includes founding a bio-informatics startup, Harvard’s Berkman Center, the World Wide Web Consortium and the US Congress.

In his current role at Science Commons, he is working to ensure that the outputs of publicly funded science become more available; both for other scientists to use, and for the wider public. The successes of the Open Access movement have led to greater visibility for scientific papers, but the data upon which those papers depend still tends to be difficult to locate.

 
Standard Podcast [47:48m]

We discuss initiatives at Science Commons and elsewhere, and consider some of the barriers to a change in approach.

Production of this podcast was supported by Talis, and show notes are available on their Xiphos blog.


Discussing business search with Robin Johnson, CEO of FT Search

The supply of vertical search solutions tailored to particular business niches remains a lucrative and important area, even in these days of Google’s apparently unstoppable growth in generic search market share. Many of the products and companies involved are almost invisible to the general web user, either surfacing only inside the firewalls of large enterprise customers or styled to appear a seamless part of the navigation experience on an e-Commerce site.

FT Search, part of Pearson’s Financial Times Group, recently released the public beta of an interesting new search engine aimed squarely at anyone — “even a CEO” (!) — interested in unearthing information on companies and the external factors affecting them.

Newssift.com brings data from diverse sources together with technology components from established players such as Endeca, Nstein, Lexalytics and ReelTwo to offer an interesting and potentially powerful navigational experience.

I spoke with Robin Johnson, CEO of FT Search, yesterday to hear more about newssift’s capabilities and their intentions for its future development.

 
Standard Podcast [45:44m]

Production of this podcast was supported by Talis, and show notes are available on their Nodalities blog.


May’s Semantic Web Gang talks Wolfram Alpha and Google


I mentioned the Semantic Web Gang podcast last week, in the context of our upcoming Live appearance at the Semantic Technology Conference in San Jose next month.

This month’s show was recorded yesterday, and is now available. During the conversation, Gang members dig into the two hot stories of the moment; the launch of Wolfram Alpha and Google’s apparent embracing of semantics with their ‘Rich Snippets.’

 
Standard Podcast [43:34m]

Have a listen, and see what you think.

The Semantic Web Gang podcasts are sponsored by Talis. Show notes are available.
