Day 2, and after yesterday’s tutorials the conference is really getting going.
Here’s a stream of consciousness from the morning’s keynotes at this sold-out event.
Conference chair Edd Dumbill is introducing things, talking about William Smith’s nineteenth-century map of geological strata in the British Isles, the rise of industrialisation, and the move to towns. Edd suggests that a similar set of inflections is happening today in the world of data; ‘the start of something big.’
“In the same way that the industrial revolution changed what it meant to be human, the data revolution is changing what it means to be alive.”
The first of this morning’s keynotes; Hilary Mason from link shortener bit.ly.
Data and the people who work with data; “The state of the data union is strong.” Data scientists have an identity – a place to rally around – with Strata.
We have accomplished much, begging, borrowing and stealing from lots of domains. We have the tools. We have the capacity to spin up infrastructure in the Cloud. We have the algorithms to explore data, and to learn from it.
The most important thing we have now that we didn’t have before… is momentum. People are paying attention.
There are still challenges though. Timeliness of data is an issue, especially in real-time. We need to develop systems that can do robust analysis against a moving stream of data. We need to be able to store data in ways that let us operate on it in real-time. Hadoop… amazing ‘because I can run a query and get the result back before I forget why I submitted the query in the first place.’ We need training. We need imagination, not more ad optimisation networks. We have a real opportunity to do something better.
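To make the ‘moving stream’ point concrete, here is a minimal sketch of one common approach, a sliding time window that counts recent events (clicks, say) without keeping the whole stream around. This isn’t anything Hilary showed; the class name, the 60-second window, and the assumption that timestamps arrive in non-decreasing order are all mine, purely for illustration.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen within the last `window_seconds` seconds.

    Hypothetical helper for illustration; assumes timestamps (in seconds)
    are added in non-decreasing order.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.events = deque()  # timestamps still inside the window, oldest first

    def _evict(self, now: float) -> None:
        # Drop every timestamp that has fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()

    def add(self, timestamp: float) -> None:
        # Record a new event, then discard anything that is now too old.
        self.events.append(timestamp)
        self._evict(timestamp)

    def count(self, now: float) -> int:
        # Number of events in the window ending at `now`.
        self._evict(now)
        return len(self.events)


counter = SlidingWindowCounter(window_seconds=60)
for click_time in [0, 10, 30, 65]:
    counter.add(click_time)
print(counter.count(now=65))
```

Memory use is bounded by the number of events inside the window rather than by the length of the stream, which is the property that matters when the data never stops arriving.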
Opportunities (expressed in the context of bit.ly); bit.ly gets lots of data from people shrinking web links. They learn a lot about people; what they like, what they want, what they’re doing. bit.ly also gets rich segmentation data; location, context, etc. bit.ly sees global data, for example clicks on bit.ly links from Egyptian domains.
Now that we have all this data, it offers a window on to the world. What can we do with it? Make the world a better place? What would you do with all of this data?
Next up, James Powell from Thomson Reuters to talk about privacy and behavioural data in B2B contexts. Thomson Reuters gathers large amounts of global data, and filters it for customers. Time and context key; 700,000 updates a second through financial systems, 5,000,000 documents per day served through Open Calais, etc. Thomson Reuters interested in ways to filter information better.
Need to think about B2B implications of behavioural data, especially as we sell/exchange increasing volumes of data with partners. Consumers are reasonably comfortable with giving up some personal data in return for a ‘better’ product (Amazon recommendations, etc), but that probably doesn’t scale to the enterprise. For example, Open Calais customers submitting large numbers of dummy queries to obfuscate what they’re really looking for…
Key problem that needs to be addressed is ambiguity; many systems in this space still rely upon implicit assumptions, whilst the enterprise is used to explicit contracts. Tension – or recipe for disaster?
Keys to success – need to treat behavioural data differently/better, and avoid the mistake of simply continuing consumer trends.
Next, Mark Madsen from Third Nature, talking about ‘the Mythology of Big Data.’
Lots of assumptions underlying conversations about Big Data. ‘Every technology carries within itself the seeds of its own destruction.’ Code is a commodity; things that a lot of people have built profitable careers around have started to move down-market. Libraries, packages, etc make it easier for third parties to stitch things together rather than start from scratch.
The central myth underlying Big Data that’s erupted over the past 18-24 months; the myth of the gold rush. Everyone wants to be a data scientist. But just like the gold rush, success takes capital. It takes corporate engagement, and infrastructure. The ‘myth tells us you can go it alone… and you can’t.’
1950s-60s – data as product. 1970s-80s – data as byproduct. 1990s-2000s – data as asset. 2010- data as substrate (data as the basis for competition). ‘The real data revolution is in business structure and processes and how they use information.’
Using Big Data; the point isn’t necessarily about ‘Big.’ Much valuable data inside an enterprise is only GB or TB in size. We get tied up in ‘big’ way too much. It’s not really about data either; it’s about applying data. Without an application, it’s trivia.
Next, Amazon CTO Werner Vogels. An overview of how Amazon Web Services look at the data processing being done on their infrastructure by customers… Government, Finance, Commerce, Pharma… all making use of tools. Plugging The Fourth Paradigm book from Microsoft Research (which is very good).
Vogels – big data is big data when your data sets become so large that you have to innovate to manage them. Customers view big data as collection and curation of data for competitive advantage… with the presumption that bigger is better. For recommendations etc, that is probably true.
There are a number of categories of data, where quality is far more important than quantity.
In the past, data tended to be collected to answer questions. Now, the trend is towards collecting as much as possible before developing the questions you want answered, and the algorithms you will need to use for the analysis.
To do this, you should not be worried by data storage, data processing, etc – which is why you should embrace the scalable Cloud.
Data analysis pipeline; collect – store – organise – analyse – share.
AWS Import/Export – “you shouldn’t underestimate the bandwidth of a FedEx box.” Indeed.
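A quick back-of-the-envelope sketch shows why the FedEx line lands. The figures here are my own assumptions for illustration (10 TB of disks in the box, 24 hours in transit, a 100 Mbit/s upload link), not numbers from Vogels’ talk.

```python
def effective_mbit_per_s(terabytes: float, transit_hours: float) -> float:
    """Throughput of shipping `terabytes` of disks that take `transit_hours` to arrive."""
    bits = terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = transit_hours * 3600
    return bits / seconds / 1e6          # bits per second -> Mbit/s


def upload_hours(terabytes: float, link_mbit_per_s: float) -> float:
    """Hours needed to push the same data over a network link instead."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_mbit_per_s * 1e6)
    return seconds / 3600


# Assumed, illustrative figures: 10 TB, overnight shipping, 100 Mbit/s link.
print(f"Box: {effective_mbit_per_s(10, 24):.0f} Mbit/s effective")
print(f"Link: {upload_hours(10, 100):.0f} hours to upload")
```

Under those assumptions the box works out at a little over 900 Mbit/s sustained, while the upload takes more than nine days. Change the inputs and the balance shifts, but the latency-versus-bandwidth trade-off is the point.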
“This is Day 1 for Cloud infrastructure.”
Next up, Microsoft’s Zane Adam talking about data marketplaces. Windows Azure DataMarket; Data as a Service, free or at cost. One stop shop for data (one of many one stop shops, unfortunately!) DataMarket is interesting… but this is far too much of a product pitch for the keynote track.
90 days since launch – 5,000+ subscriptions, 3 million transactions to date. Given Microsoft’s presence and reach, aren’t those figures a bit low?
“There’s a lot of data out there… but it’s not all good.” A Data Marketplace gives customers access to good data. Does it? Do Microsoft vet every fact in a submitted data set? What would a single bad data set do to the marketplace’s brand recognition?
