There have always, it seems, been people for whom attribution and citation really matter. Some of them passionately engage in arguments that last months or years, debating the merits of comma placement in written citations for the work of others. Bizarre, right?

But, as we all become increasingly dependent upon data sourced from third parties, aspects of this rather esoteric pastime are beginning to matter to a far broader audience. Products, recommendations, decisions and entire businesses are being constructed on top of data sourced from trusted partners, from new data brokers, from crowdsourced communities, or simply plucked from across the open web. Without an understanding of where that data came from, and how it was collected, interpreted or maintained, all of those products, recommendations, decisions and businesses stand upon very shaky foundations indeed.

Data attribution is increasingly important, but it will be essential to make sure that the rules, tools and norms which emerge are both lightweight and pragmatic. Now is not the time to get heavy-handed and pedantic about where the comma goes.

Former colleague Leigh Dodds recently offered a useful discussion of the rationale behind data attribution. Early on, he describes the related (and, often, sloppily interchanged) notions of attribution and citation:

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

This distinction is important in some circumstances, but it can also be useful to consider a simpler, more selfish, but ultimately more scalable justification. Attribution (and citation) of data quite simply provides an audit trail, enabling you, your bosses, your investors, or your customers, to know more about the data upon which actions are based.
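As a rough illustration of how lightweight such an audit trail can be, the sketch below pairs an attribution (who to credit) with a citation (where the data lives) in a single record. The `SourceRef` class, its field names and the example publisher and URL are all hypothetical, invented for this sketch rather than taken from Leigh's post or from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    """A pointer back to one principal source (hypothetical structure)."""

    creator: str  # attribution: who gets the credit
    url: str      # citation: where the data itself can be found

    def credit_line(self) -> str:
        # Free-form wording; the point is the pointer, not the punctuation.
        return f"Data from {self.creator} ({self.url})"


# Assumed example values, for illustration only.
ref = SourceRef(
    creator="Example Statistics Office",
    url="https://example.org/datasets/population",
)
print(ref.credit_line())
```

Anything that carries these two pieces of information, in whatever wording or layout, is enough for a reader to follow the trail back.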

Creators want credit, consumers want to trust

Much of the serious consideration of attribution comes from a relatively small cadre of data owners or creators, who (understandably, perhaps) want credit for their hard work. Perhaps they want to prove use in order to secure future funding or advancement, or perhaps they simply want to track where their data ends up. Through a series of licenses, contracts and Terms & Conditions statements, these creators have done much to codify the ways that data should be referred to. Leigh discusses some of the licensing terms in his post, but it’s probably fair to say that none of them has really caught on outside a few rather narrowly scoped groups of co-dependent developers and data providers.

Data owners’ requirements for credit run the gamut, from loosely phrased requests for a link back to their website, all the way past this rather excessive example quoted by Leigh, to lengthy tomes of draconian legalese:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

All too often, that sort of self-defeating prescription is enough to send prospective users back to Google, in search of a less demanding alternative.

For consumers of data, or for those wondering where the data behind a product or decision came from, things are rather simpler. On the whole, those on this side of the divide are simply looking for a pointer which enables them to learn more. Is my company’s multi-million dollar change of direction based upon detailed data from stable governments, credible banks and respected analysts, or did the person responsible use some numbers they found on their friend’s blog?

Carefully crafted rules regarding attribution’s wording, placement, colour, size and typeface are an irrelevance, probably deserving to be ignored or ridiculed.

Far better, and far more likely to succeed, to simply encourage users and re-users of data to sensibly point back (however they like) to their principal sources.

Remembering all the ancestors is a bit daft

Data set A is modified and added to in order to create data set A1. Data set A1 is modified and added to in order to create data set A2. Data set B is modified and added to in order to create data set B1. Data set B1 is modified and added to in order to create data set B2. Data set C modifies and extends data sets A2 and B2. It seems reasonable to acknowledge the contribution made to C by A2 and B2, but some would argue (loudly) that A, B, A1 and B1 also need to be acknowledged in C. This is one aspect of ‘attribution stacking’, and attribution stacking is, quite simply, stupid.

If I am the creator of data set C, I am selecting A2 and B2 because they are the right data sets for my purpose. That selection will be based upon a range of criteria, including the scope and coverage of the data. The selection will also be based upon my impression of the brands responsible for A2 and B2, and that impression (implicitly or explicitly) will include some awareness of the processes they use to select, validate and manage the data they use. It’s for them to carefully select, validate and provide attribution for A1 and B1, not for me. And it’s A1 and B1’s job to do the same for A and B, not me.
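The difference between the two approaches can be sketched in a few lines of Python. The lineage table mirrors the A, A1, A2, B, B1, B2 and C example above; the function names `direct_attribution` and `stacked_attribution` are hypothetical labels chosen for this illustration, not terms from any license or tool.

```python
# Direct sources for each data set, mirroring the example in the text.
DIRECT_SOURCES = {
    "A": [],
    "B": [],
    "A1": ["A"],
    "A2": ["A1"],
    "B1": ["B"],
    "B2": ["B1"],
    "C": ["A2", "B2"],
}


def direct_attribution(name):
    """Lightweight approach: credit only the sources you actually chose."""
    return sorted(DIRECT_SOURCES[name])


def stacked_attribution(name):
    """Attribution stacking: credit every ancestor, however distant."""
    seen = set()
    pending = list(DIRECT_SOURCES[name])
    while pending:
        current = pending.pop()
        if current not in seen:
            seen.add(current)
            # Keep walking back through this source's own sources.
            pending.extend(DIRECT_SOURCES[current])
    return sorted(seen)


print(direct_attribution("C"))   # ['A2', 'B2']
print(stacked_attribution("C"))  # ['A', 'A1', 'A2', 'B', 'B1', 'B2']
```

Even in this toy lineage, stacking triples the number of credits that C has to carry, and the list grows with every further generation. With direct attribution, anyone who wants the full ancestry can still recover it by following each pointer back one step at a time.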

Things get even worse in some open source data projects, where all the individual contributors expect to be acknowledged. Inside the project (and on its website, etc), that’s fine and sensible. Outside, though? It’s ridiculous. So if data set A were created by individuals Aa, Ab, Ac, Ad and all their friends right up to Az, under some licenses there would be an expectation that every single one of those individuals be acknowledged by name in any mention of data sets A, A1, A2 or C. A massive administrative burden for any downstream users of the data set, and of no real benefit to anyone whatsoever. This desire for glory really does need to be challenged, if it is not to stifle free and fair downstream use and reuse of the data.

Within the project building A, it may be vital to know that user Aa is a bit sloppy, or that user Ad has a nasty habit of making the data say what she thinks it should rather than what it actually does. But it is the responsibility of the project behind A to put processes and procedures in place to address these issues, and to ensure that all of its participants receive appropriate credit within the project for their contribution. By the time we reach A1 or A2, though, those internal details no longer matter. A1 chose to use A because those processes exist. After an initial evaluation of those processes and their implementation, A1 can, and should, simply trust them, rather than endlessly second-guessing them.

Tracking and Trust are different

Ultimately, the motivations of data creators and data re-users are very different. The processes and procedures put in place by creators and owners in search of kudos or statistics may actively obstruct the use and reuse that they profess to want. Complex forms of attribution, aligned to heavy-handed enforcement of infringements, do nothing to encourage a far broader community of use to emerge.

By attempting to count — and manage — the small number of uses today, data creators are stifling growth that otherwise is ready to explode. A perfect example of the saying (which may not translate beyond Britain’s shores!) of ‘biting off your nose to spite your face.’ Think about it… 😉

Leigh ends his post with:

Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

Other than broadening it from ‘Open Data’ to just ‘data,’ I couldn’t agree more. But let’s keep it lightweight, simple, and pragmatic.

Note: or perhaps this post should have been called “When you stand on a giant’s shoulders, it’s a good idea to say thank you.”

Image of Eduardo Paolozzi’s sculpture of Sir Isaac Newton by Flickr user ‘monkeywing’