[Image: a data visualization of Wikipedia. Image via Wikipedia.]

‘Big Data’ is currently capturing the imagination, attracting hype, investment and ambitious startups in almost equal measure. Kim and Eric Norlin’s excellent Defrag and Glue events have gained big-name company, with O’Reilly‘s Strata and GigaOM‘s Structure both set to arrive in the first quarter of 2011. Venture firms like IA Ventures have emerged, specifically targeted at finding, funding, and profiting from the big Big Data idea. Giants of the web from Yahoo! and Amazon to Twitter and Facebook solve their own Big Data problems in very different ways, contributing valuable code and experience to the community whilst simultaneously diluting focus and adding to the cacophony.

Flippantly reckoned by many to be ‘anything that requires more than a single machine to run,’ the Big Data reality remains somewhat harder to pin down. To those seeking routine business insight, that mammoth Excel spreadsheet they laboriously query overnight at the end of each month might quite justifiably be thought of as ‘Big.’ At the other end of the scale, data wizards scorn anything that doesn’t require a room full of servers, a mountain of empty pizza boxes, and the careful construction of a bespoke data ingest, management and querying system atop the most bare-bones version of the Linux kernel they can find. Somewhere between the two, a growing mass of cheaply gathered data holds out the promise of invaluable insight. Remote sensors, web clickstreams, social graph interactions, purchaser (and non-purchaser) behaviours. All these, and more, have much to tell planners, builders, makers, sellers, and buyers. If only we could formulate the right questions. If only we could devise the right sampling strategies. If only we had big enough machines to ask lots of questions using lots of sampling strategies. If only we had big enough machines to not bother sampling at all.
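Until those bigger machines arrive, sampling remains the pragmatic route. As a minimal sketch of one classic strategy – reservoir sampling, which keeps a uniform sample of a stream far too large to hold in memory – consider the following Python; the clickstream log named in the comment is hypothetical:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing entry with probability k/(i+1), which keeps
            # every item seen so far equally likely to survive in the sample.
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# e.g. a fair thousand-event sample of a clickstream log too big to load whole:
# events = (line for line in open('clicks.log'))
# subset = reservoir_sample(events, 1000)
```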

On the hardware side of things, even humble domestic laptops typically ship with at least two cores these days; two separate little computers ready to do the data processor’s bidding. Four, eight, sixteen and more cores are not far behind, but mainstream software products typically fail to exploit anything more than a single core. Push Excel as hard as you like, and it won’t do more than take one of your computer’s multiple cores to the max. On that 12-core Mac Pro you persuaded the boss to buy, only one core will be hard at work on your data. Twitter, Mail, YouTube, and ripping DVDs will each give another core a little light exercise, whilst the rest sit idly by, waiting for the arrival of operating systems and applications capable of exploiting multi-core power. The same is true as jobs grow and move to run across multiple machines, whether under your desk, in your data centre, or out in the Cloud. Those big datasets need to be carved up and shared amongst the available computers before any analysis takes place. You’re typically not accessing a ‘big computer in the Cloud’ at all… but lots of relatively small (commodity) computers, and it takes careful planning and smart software to manage the division and recombination of those jobs in a cost-effective manner. Projects such as Joseph Hellerstein‘s Berkeley Orders of Magnitude (BOOM) begin to demonstrate some of the potential for working natively with multiple processors, but there’s a long way to go before those advances reach the mainstream.
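The mechanics of that division and recombination can be sketched even on one multi-core machine. A minimal example using Python’s standard multiprocessing module, with a simple sum standing in for the real analysis, carves a dataset into one strided chunk per core, fans the work out, and recombines the partial results:

```python
from multiprocessing import Pool

def summarise(chunk):
    # Stand-in for real per-chunk analysis; here, just a sum.
    return sum(chunk)

if __name__ == '__main__':
    n, workers = 10_000_000, 4
    # Carve the dataset up: one strided slice per core.
    chunks = [range(i, n, workers) for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(summarise, chunks)  # fan out across the cores
    print(sum(partials))                        # recombine: 49999995000000
```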

Hadoop, Cassandra, MapReduce, Dynamo, Voldemort. These, and more, are solutions developed by the likes of Yahoo!, Facebook, Google, Amazon and LinkedIn to tackle the influx of data that each faced – and for which each had failed to find an existing solution. Hadoop, with the addition of Cloudera‘s commercial polish, is rapidly emerging as the front runner for an off-the-shelf Big Data solution, but all of these tools remain rather narrow in their abilities. Find the type of data or the nature of the query for which each of these was built and its performance will be unbeatable, but we are a very long way from Big Data’s equivalent of the jack-of-all-trades SQL-powered relational database of old.
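The model underneath several of those systems – MapReduce most explicitly – reduces, at its simplest, to two functions. A toy, single-machine rendering in Python (real frameworks shard both phases across many machines, which is the entire point):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit a (key, value) pair for every word: the 'map' step.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Group by key and combine the values: the 'reduce' step.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ['big data big hype', 'big promise']
mapped = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(mapped))   # {'big': 3, 'data': 1, 'hype': 1, 'promise': 1}
```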

And there, for many enterprises, lies the problem. Useful Google searches require the crawler, index and UI to do a relatively small number of essentially similar tasks, very quickly, very cost-effectively, and at massive scale. Focus on that finite set of problems, and you build a solution that delivers the experience we’ve all come to know. Each type of data manipulation or analysis requires a different tool, differently optimised, with the inevitable result that a typically diverse organisation may require a plethora of Big Data tools to get its work done. Or it might just continue to muddle along with Oracle or MySQL, churning inefficiently through data analysis jobs for interminable stretches of time. These relational database tools are understood, they are mature, and they get the job done. Except in the most data-intensive industries, they have a market presence that will be difficult to disrupt.

The Big Data space is seeing remarkable innovation, but there is a long way to go before it can be lifted out of the domain of the technically proficient specialist and placed on desktops across the organisation. As IA Ventures’ Brad Gillespie notes, “Excel is where the world’s data lives… [and] Big Data has to get to that place… so that a CMO can leverage it directly.”

And in all of this ferment of innovation, to return to the title of the post, it strikes me that Big Data is becoming disconnected from the fabric of the Web itself. Oh, much of the data certainly comes from the Web, and a lot of it might even be queried on the Web after processing. But, somewhere along the line, the linkedness of the Web has either been forgotten or ignored. That rich set of connections, interconnections and associations has been reduced to a table, an index, or a (large) set of key-value pairs. And in the process, something fundamental has gone away.
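To make that concrete, here is a toy sketch of a hypothetical handful of linked pages flattened into the kind of key-value pairs a typical Big Data store holds. Outbound links survive as one cheap lookup apiece, but the inverse question – what links *to* a page? – now requires a scan of everything:

```python
# Hypothetical pages and their outbound links, as the Web sees them...
links = {
    'home': ['about', 'products'],
    'about': ['home'],
    'products': ['pricing'],
    'pricing': [],
}

# ...flattened into key-value pairs, as a Big Data store might hold them.
kv = {page: ','.join(targets) for page, targets in links.items()}

# "What does 'home' link to?" is a single cheap lookup:
print(kv['home'].split(','))   # ['about', 'products']

# "What links *to* 'home'?" means scanning every key – the inverse
# relationships were never stored, so the linkedness is gone.
inbound = [page for page, value in kv.items() if 'home' in value.split(',')]
print(inbound)                 # ['about']
```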

This is enough for now, though. Looking more closely at different Big Data approaches and exploring the potential for re-introducing the Web must wait for future posts.