Big Data – Small Minds
I travel a lot. I am now what the Transportation Security Administration (TSA) refers to as a “Trusted Traveler.” One of the main benefits of being a trusted traveler is that when an airport has a designated TSA pre-check security line, I can pass through airport security without taking my shoes off. I would not have expected, say, five years ago that I would consider myself lucky to be allowed to keep my shoes on when making my way to my gate, but such is the state of our civilization.
I mention my good fortune and trustworthy status because today I was denied entrance to the pre-check screening area. After I handed my boarding pass to the TSA agent with my usual smug, conspiratorial grin, she scanned it, avoided eye contact, and asked me to move to the general security screening area. “Why?” I tried to sound nonchalant. “It’s randomized” was her explanation. “This has never happened before” was all that I could muster. She decided that repetition would help it sink in: “It’s randomized, sir.”
As an explanation, “it’s randomized” feels completely unsatisfying. Even an inane explanation would have been more comforting: “after a recent analysis of travelers flying from Dallas to New York, we have become suspicious of men who hand us their boarding pass with their left hand.” Because my flight was delayed, I had plenty of time to wait in the regular security line. I also had a lot of time to ponder my reaction to the TSA agent’s explanation.
In addition to living in the age of shoe removal at airports, we also live in the age of “big data.” We have been using the term “big data” for several years now to describe the mounting bits of information available to companies that “mine” the data with computer programs designed to find patterns. All of a sudden, the fact that I’m a 55-year-old, white, married man with three daughters who spent two minutes looking at black loafers on Zappos.com is now a data point. It’s flattering and a little creepy. What I do online is only a “data point” if someone cares enough to direct a computer program to gather up my activity. If no one cares to pay attention, what I do is not a data point; it’s just what I do.
The patterns uncovered by data miners are intended to explain and predict. While the ability to distinguish unusual activity from usual activity can be incredibly useful, presuming that identified patterns explain things is no less troublesome just because the data set is larger and more diverse. In fact, we’re more likely to be duped by data mining conclusions simply because data miners work with lots of data.
Carl Richards, in a recent New York Times post, referenced a paper written by Professor David J. Leinweber of the California Institute of Technology. Leinweber demonstrated that between the years 1983 and 1993 there was a 95% positive relationship between stock market prices and butter production in Bangladesh. The relationship rises to 99% if you throw in butter production in the United States plus sheep production in the United States and Bangladesh, plus cheese production in the United States. In the conclusion of his paper, Leinweber warned, “If you look at 100 regressions that are significant at a level of 95%, five of them are there just by chance. Look at 100,000 models at 95% significance, and 5,000 are false positives.”
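For readers who like to see the arithmetic, Leinweber’s warning can be sketched in a few lines of Python. This is only an illustration, not anything taken from his paper: the function names, the use of pure random noise, the choice of 11 data points (one per year from 1983 to 1993), and the cutoff of 0.602 (the standard two-tailed 5% critical value for a correlation on 11 points) are all my assumptions.

```python
import random


def expected_false_positives(n_models, alpha=0.05):
    """Number of 'significant' results expected from chance alone."""
    return n_models * alpha


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def count_chance_hits(n_models, n_points=11, r_critical=0.602, seed=0):
    """Count how many pure-noise series 'significantly' track a noise target.

    Assumption: r_critical=0.602 is the two-tailed 5% cutoff for 11 points.
    Nothing here is real market or butter data; every series is random.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_models):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson_r(target, candidate)) >= r_critical:
            hits += 1
    return hits


if __name__ == "__main__":
    print(expected_false_positives(100))      # Leinweber's "five of them"
    print(expected_false_positives(100_000))  # his "5,000 false positives"
    print(count_chance_hits(10_000))          # roughly 5% of 10,000, by luck alone
```

Every series in the simulation is noise, yet about one in twenty still clears the bar. Test enough candidates and something will always look like Bangladeshi butter.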
Apart from the dangers of confusing chance with something statistically more suggestive than chance, we also need to consider the philosophical assumptions at work in all this data mining. The economist Larry Summers, in a New York Times Sunday Magazine article, was quoted as saying, “people apply patterns to random data.” Upon reflection, I’m tempted to ask, aren’t all data random until we unleash our attention on them? Without human interpretation, what does it mean to call something a pattern or to call something random?
We make very practical and important decisions because we can count on some sets of conditions being used to predict other sets of conditions. Weather forecasts come to mind. Most of the time, when the barometric pressure does x, clouds do y. The dangerous leap is the one that starts with the correspondence between two things and ends with an explanation that we take to mean something distinct from our interpretations. A red sky in the morning doesn’t explain storms any more than smoke rising from a volcano means that the gods are angry. Barometric pressure is simply the paradigmatic explanation du jour.
The TSA agent didn’t feel she owed me an explanation. In the end, I think she gave me something more profound than she intended. I have come to see “it’s randomized” as perhaps the most comforting and appropriate explanation of all.