Big Data: What does “Big” mean?

Google “Big Data”; you’ll see talk of terabytes, petabytes, exabytes. Yes, searching and summarizing massive amounts of data is a challenge, but there are tougher challenges on the path to – as Lauren Littlefield put it recently – “amazing insights”. Let’s call another challenge “complexity meets the unknown”.

We recently worked with a 42-megabyte data set to find a quantitative model that identified Alzheimer’s patients. That’s right: 42 megabytes, about 4% of a gigabyte, 0.004% of a terabyte. Tiny, right? Not so fast.

The data set covered 327 people, but for each person there were 11,213 measurements of substances in the blood. We found that 14 of those substances mattered in identifying Alzheimer’s patients. Even if you somehow knew in advance that 14 was the right number, how would you find which 14 out of the 11,213? It happens that there are about 5.6 × 10^45 possible subsets of 14 drawn from 11,213. If 1,000 computers each tested 1 billion sets of 14 variables every second, it would take about 1.8 × 10^26 years to test them all. In this case, 42 megabytes clearly is Big Data.
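To see where those numbers come from, here is a quick back-of-the-envelope check in Python; the 11,213 measurements, 14 selected variables, 1,000 machines, and 1 billion tests per second per machine are simply the figures from the paragraph above, not new assumptions.

    from math import comb

    n_candidates = 11_213                      # blood measurements per person
    k_selected = 14                            # variables that turned out to matter

    subsets = comb(n_candidates, k_selected)   # number of distinct 14-variable subsets
    print(f"{subsets:.1e} possible subsets")                   # ~5.6e45

    tests_per_second = 1_000 * 1_000_000_000                   # 1,000 computers, 1 billion tests/sec each
    seconds_per_year = 60 * 60 * 24 * 365
    years = subsets / (tests_per_second * seconds_per_year)
    print(f"{years:.1e} years to test them all")               # ~1.8e26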

There’s an even bigger unknown: Does the right quantitative model involve interactions between the variables? Finding which interactions are useful, and what math governs them, multiplies the number of possible solutions by billions or more.
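Here is a rough sense of how quickly interactions compound the problem, continuing the Alzheimer’s example. Even before deciding what math governs an interaction, merely enumerating candidate interaction terms explodes; the four-operator set below (multiply, divide, add, logical AND) is an arbitrary assumption for illustration, not an actual operator library.

    from math import comb

    n = 11_213          # candidate blood measurements from the example above
    operators = 4       # assumed operators: multiply, divide, add, logical AND

    pairwise = comb(n, 2) * operators           # two-variable interaction terms
    three_way = comb(n, 3) * operators ** 2     # three-variable terms, two operators applied

    print(f"{pairwise:.1e} pairwise terms")     # ~2.5e8
    print(f"{three_way:.1e} three-way terms")   # ~3.8e12

Pairwise interactions alone add hundreds of millions of candidate terms; three-way interactions push the count into the trillions, and that is before fitting any coefficients.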

Think about Capital One partnering with General Motors: How could Capital One use transaction data, demographics, and other variables to find people with a high propensity to buy a Cadillac versus a BMW? With hundreds of transactions per person, credit information, bill-payment timing, responses to prior offers, and so on, the number of variables can easily run into the thousands. But what if the best formula to rank customers by likelihood of a Cadillac purchase happened to be something like the following?

Michigan upbringing and college football tickets purchased in the last year, times the natural log of income, divided by number of children.

Good luck with regression or any form of traditional statistics. Even though there are only four variables in this hypothetical solution, the math combines multiplication, division, Boolean logic, and a non-linear operation. But this is how the world works; we’ve seen similar dynamics in marketing, revenue forecasting, prognostic medicine, and futures trading. The bottom line: we can’t force-fit consumer behavior into fixed math models simply because that’s what everyone is accustomed to doing.
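For concreteness, here is what that hypothetical ranking formula might look like as a scoring function. Everything in this sketch is invented to match the made-up example above: the field names, the sample figures, and the guard against dividing by zero for childless customers are assumptions, not a real model.

    from math import log

    def cadillac_propensity(person):
        # Boolean gate: Michigan upbringing AND college football tickets bought last year
        if not (person["raised_in_michigan"]
                and person["college_football_tickets_last_year"] > 0):
            return 0.0
        # Non-linear term (natural log of income) divided by number of children;
        # the floor of 1 is an assumption to avoid dividing by zero
        return log(person["income"]) / max(person["number_of_children"], 1)

    customer = {
        "raised_in_michigan": True,
        "college_football_tickets_last_year": 2,
        "income": 140_000,
        "number_of_children": 3,
    }
    print(cadillac_propensity(customer))    # ~3.95

No amount of coefficient fitting in a linear model will recover a structure like that; you have to search over the form of the formula itself.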

The reason to collect and use Big Data is to produce insights that are new and valuable, not merely to confirm what we already know. If we’ve done a good job collecting the facts, we’ve solved one part of the problem. Next we need to identify the right questions to ask. Here’s where data integration, domain experts, insightful visualizations, and customized dashboards are indispensable.

Once we know the right questions, what’s left is to deal with every aspect of “Big”: size, complexity, and unknown relationships. That’s how we find the amazing insights that help us spend marketing dollars on the right people at the right time, for the best outcomes.

———

Patrick Lilley is CEO of Emerald Logic, a pioneer in complex analytics. The company’s FACET (FAst Collective Evolution Technology) software combines principles and math from biology, engineering, and particle physics to answer intractable questions with high human and economic impact.