This is the image that begins an article in Wired claiming that the “data deluge” will eclipse the need for theory.
A few years ago I heard about a new data server from Microsoft called TerraServer. It was so named because it both held a terabyte of data (the largest store then attached to the internet) and served imagery of the earth (terra). Microsoft wanted to test a 1 TB data server, and earth imagery was chosen because geospatial data is so bulky that it offered a quick way to reach a terabyte.
As the illustration shows, you can now buy a 1 TB portable hard drive for a few hundred dollars. After the terabyte, the next measure of data is the petabyte. Are we in the age of the petabyte?
The Wired article, by editor-in-chief Chris Anderson, is garnering quite a bit of attention across the internet. It suggests that the proliferation of data renders obsolete the need to understand what any of it means; we can just use the data itself:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
This is obviously a direct challenge to the way we think now, especially for those of us who deal in theory on a daily basis. But although the piece is headlined "The End of Theory," the article is actually more focused on the end of the traditional scientific model:
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete…
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot…
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
“Correlation is enough.” We don’t need causal mechanisms, or rather we don’t need to understand them in order to deal with phenomena. “Theory” (recognizing that this is a polyvalent term, covering scientific theory, critical theory, literary theory, and theory in the social sciences, history, and physics) is superseded by observation.
But is this kind of thinking really new? After all, pattern-seeking in datasets has informed geovisualization since the 1980s, and it relies on something called “abduction.” Abduction is a third mode of inference alongside induction and deduction, introduced into modern logic by Charles Sanders Peirce, a philosopher, over 100 years ago. Roughly, it means reasoning from an observed pattern to a hypothesis that would explain it.
Statistical correlations might be enough to see when or how, but would they be enough to understand why? This has been a critique of disciplines like psychology and economics for a while now: they are driven by “significant” correlations, but at the end of the day you don’t know what to do with those correlations because they are particular to the dataset at hand. If we see a statistical fluctuation in the prevalence of something in the nineteenth century, it’s hard to know what might be made of that without attempting to explain it.
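The pitfall here can be sketched in a few lines of code. This is a hypothetical illustration, not anything from Anderson's article: two independent random walks share no causal mechanism whatsoever, yet they frequently exhibit a sizable Pearson correlation, which is exactly why "correlation is enough" can mislead.

```python
# A minimal sketch (illustrative only): two independent random walks
# have no causal connection, yet often show nontrivial correlation.
import random

random.seed(42)  # arbitrary seed for reproducibility

def random_walk(n):
    """Cumulative sum of n independent +/-1 steps."""
    walk, position = [], 0
    for _ in range(n):
        position += random.choice([-1, 1])
        walk.append(position)
    return walk

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

a = random_walk(500)
b = random_walk(500)
print(f"correlation between two unrelated walks: {pearson(a, b):.2f}")
```

Run it a few times with different seeds and the unrelated series will sometimes correlate quite strongly; the numbers "speak for themselves" only if you already have a theory of what they could be saying.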
And a trite, but telling, critique of Anderson’s piece is that it is itself a theory: an attempt to understand something, put it in context, and explain it.
Recall Terry Eagleton’s observation in his book Literary Theory that “without theory we wouldn’t know what to count as data.” That’s probably still a good way to put it.
I think Anderson is right as far as he goes: data is certainly proliferating, and finding patterns in it is important (or rather, finding useful and important information in a timely fashion is important). As data moves into the so-called “cloud” (i.e., off individual laptops and jump drives and onto the internet or its successors), it will become more shareable and networked.
But this is not incompatible with our continued attempts to explain, account for, understand, put in context, critique, or model.