Here comes data: The end of theory?

This is the image that begins an article in Wired claiming that the “data deluge” will eclipse the need for theory.

A few years ago I heard about a new data server from Microsoft called TerraServer. It was so named because it both held a terabyte of data (then the largest database attached to the internet) and served imagery of the earth (terra). Microsoft wanted to test a 1 TB data server and chose earth imagery because geospatial data is so bulky that it offered a quick way to reach a terabyte.

As the illustration shows, you can now buy a 1 TB portable hard drive for a few hundred dollars. After the terabyte, the next measure of data is the petabyte. Are we in the age of the petabyte?

The Wired article, by editor in chief Chris Anderson, is garnering quite a bit of attention across the internet, and suggests that the proliferation of data makes obsolete the need to understand what any of it means. We can just use the data itself:

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

This is obviously a direct challenge to the way we think now, especially for those of us who deal in theory on a daily basis. But although the piece is headlined the “end of theory,” the article is actually more focused on the end of the traditional scientific model:

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete…

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot…

The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

“Correlation is enough.” We don’t need causal mechanisms, or rather we don’t need to understand them in order to deal with phenomena. “Theory” (recognizing that this is a polyvalent term, covering scientific theory, critical theory, literary theory, and theory in the social sciences, history, and physics) is superseded by observation.

But is this kind of thinking really new? After all, pattern-seeking in datasets has informed geovisualization since the 1980s, and it relies on something called “abduction.” Abduction is a mode of inference alongside induction and deduction, introduced into modern logic by Charles Sanders Peirce, a philosopher, over a hundred years ago. Roughly, it means reasoning from patterns found in data to hypotheses that might explain them.

Statistical correlations might be enough to see when or how, but would they be enough to understand why? This has been a critique of disciplines like psychology and economics for a while now: they are driven by “significant” correlations, but at the end of the day you don’t know what to do with those correlations because they are particular. If we see a statistical fluctuation in the prevalence of something in the nineteenth century, it’s hard to know what to make of it without attempting to explain it.
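To make that critique concrete, here is a minimal sketch (my own toy illustration in Python, not anything from Anderson’s piece): it generates a pile of completely unrelated random variables, mines every pair for correlations, and counts how many look “significant.” The sample sizes and threshold are arbitrary assumptions for the example.

    # Toy illustration: mining purely random data for "significant" correlations.
    # Because the variables are independent by construction, every strong
    # correlation found here is a chance pattern with nothing behind it to explain.
    import numpy as np

    rng = np.random.default_rng(0)
    n_vars, n_obs = 200, 50                      # 200 unrelated variables, 50 observations each
    data = rng.normal(size=(n_vars, n_obs))

    corr = np.corrcoef(data)                     # all pairwise Pearson correlations
    upper = corr[np.triu_indices(n_vars, k=1)]   # each pair counted once

    threshold = 0.4                              # a level many fields would call "notable"
    hits = int(np.sum(np.abs(upper) > threshold))
    print(f"{hits} of {upper.size} pairs exceed |r| > {threshold} by chance alone")

Dozens of pairs clear the bar, which is the point: a petabyte of correlations tells you that something co-varies, not why, and without an attempt at explanation you don’t know which of those patterns are worth anything.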

And a trite, but telling, critique of Anderson’s piece is that it is itself a theory: an attempt to understand something by putting it in context and explaining it.

Recall that in his book Literary Theory, Terry Eagleton observed that “without theory we wouldn’t know what to count as data.” That’s probably still a good way to put it.

I think Anderson is right as far as he goes: data is certainly proliferating, and finding patterns in it is important (or rather, finding useful and important information in a timely fashion is important). As data moves into the so-called “cloud” (i.e., off individual laptops and jump drives and onto the internet or its successors), it will become more shareable and networked.

But this is not incompatible with our continued attempts to explain, account for, understand, put in context, critique, or model.

5 Responses

  1. the next unit of measure is the giga(tonne), that is, in the sense that the USA emits 5.5 gigatonnes of carbon into the atmosphere per year…. correlation then doesn’t even enter into the implications of such numbers… the world simply dissolves in the acid.

    sdv

  2. Like you wrote, there is nothing particularly new in the idea that correlation is enough… a cursory look at the literature on social network analysis would show that much. But then, there are still studies that deal in the specifics, and sometimes, regressions and rigorous statistical analysis only serve as starting points for more particular and specific studies. Neither is a substitute for the other.

  3. Rather, the end of data and information. Today we can leave data to computers and information to (wiki…) archives, and instead enjoy a good espresso and a nice discussion about love, philosophy, resistance, or the new salsa moves.

  4. The smiley was a mistake, just a “)”

  5. Great to see a reference to Peirce’s abduction here on the Foucault Blog, and I thought folks might appreciate my perspective on Chris Anderson’s recent technological utopia in The End of Theory and Carr’s dystopia in Is Google Making Us Stupid. Hope you enjoy my “Signs of the Singularity and Why Chris Anderson and Nicholas Carr Won’t Make the Next Cut” here on the Phaneron …

    http://phaneron.rickmurphy.org/?p=26
