The technology world is abuzz with Big Data. Harvard Business Review dedicated its October 2012 issue to it. Advancements in Big Data have been coupled with the (re)emergence of Data Science applied to these large data sets. I can say with 50% confidence that 50% of statisticians are laughing all the way to the bank now. In the same HBR issue, data scientist was proclaimed to be the hottest job of the 21st century, no less. Here are some quotes from the media:
"Information is the oil of the 21st century, and analytics is the combustion engine." (Gartner)
"Data are becoming the new raw material of business." (Craig Mundie, in The Economist)
These developments are extremely exciting, and the possibilities emerging are fantastic. But there is a skeptic in me who takes big proclamations like these with a grain of salt. I happen to be a big fan of Gartner's Hype Cycles, and what we have here is most definitely nearing the Peak of Inflated Expectations. What follows, of course, is the Trough of Disillusionment. There are a few basic things that, if kept in mind, can make a big data project more productive and hopefully avoid a lot of that disillusionment.
There are some key aspects of Big Data that need careful examination:
- Big Data implies we have all the data (n=all)
- Inherent biases in the data
- The theory-free hypothesis, which claims the data will speak for itself
- Self-fulfilling prophecies and their implications for the data
Big Data Implies We Have Everything
Big Data, for many people, seems to mean that all that can be measured is being measured and recorded; from a statistics perspective, there is no need to sample. I would contest that it is nearly impossible to capture everything that can be captured, even in the age of big data. Furthermore, making this assumption can lead to erroneous analysis. Taking elections as an example, if we assume that everyone is represented because we counted all the votes, we still would not know what the 42.5% of the US population who did not vote wants. By capturing the interactions customers have with your website or your social channels, you will miss people who are not engaged in this way. Taking action based only on insights gleaned from incomplete data may alienate those other customers further. This problem is exacerbated by the next issue: inherent biases in data.
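To make the point concrete, here is a toy simulation of this gap. Every number in it is invented for illustration: a customer base where only the engaged 30% ever show up in "all the data we have", and where the engaged happen to be happier than the rest.

```python
import random

random.seed(42)

# Toy population: 100k customers; 30% interact with the website or
# social channels ("captured"), 70% do not. The two groups differ in
# satisfaction, so the captured data is biased even though it is
# "all the data" that exists.
population = []
for _ in range(100_000):
    engaged = random.random() < 0.30
    # Engaged customers happen to be happier in this invented example.
    satisfaction = random.gauss(7.5 if engaged else 5.5, 1.0)
    population.append((engaged, satisfaction))

captured = [s for e, s in population if e]  # the "n = all" dataset we actually have
true_mean = sum(s for _, s in population) / len(population)
captured_mean = sum(captured) / len(captured)

print(f"true mean satisfaction: {true_mean:.2f}")
print(f"captured-data mean:     {captured_mean:.2f}")
```

The captured data reports a mean near 7.5 while the whole customer base sits near 6.1; acting on the captured number alone optimizes for the engaged minority.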
Inherent Biases in Data
Typically, we assume that data represents the actual population correctly. However, there can be a significant mismatch between the demographics of the people using a brand and those represented in the measurements. The classic case of not recognizing these biases is the 1936 US Presidential Election: Literary Digest had a huge dataset (for the time) and was completely off in its predictions, while Gallup had a much smaller but far more representative set of data and made better predictions. Pew's Social Media Update 2013 tells us that Twitter's user base skews toward 18-29 year olds, African Americans, and urban or suburban residents. Further, there is significant overlap between the demographics of Instagram and Twitter, with Hispanics more active on Instagram. On Facebook, women are heavier users than men, users on average have some college education, and the platform is growing amongst older users. Are these the demographics of your brand's customers? If you derive insights from these platforms without correcting for the biases in the demographics of the user population, the insights Big Data gives you will be misleading.
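One standard correction for this kind of skew is post-stratification: weight each group's responses by the ratio of its share in your target population to its share in the sample. A minimal sketch, with all demographic shares and approval rates invented for illustration:

```python
# Post-stratification sketch: reweight sampled opinions so each age
# group counts in proportion to the target population, not the sample.
# All numbers are illustrative, not real platform demographics.

sample_share = {"18-29": 0.50, "30-49": 0.30, "50+": 0.20}      # share in our data
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}  # share of customers

# Weight = how much each respondent in a group should count.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}

# Per-group approval rates observed in the (biased) sample.
approval = {"18-29": 0.80, "30-49": 0.55, "50+": 0.40}

naive = sum(sample_share[g] * approval[g] for g in approval)
corrected = sum(sample_share[g] * weights[g] * approval[g] for g in approval)

print(f"naive estimate:    {naive:.4f}")
print(f"weighted estimate: {corrected:.4f}")
```

Because the young, enthusiastic group is overrepresented in the sample, the naive estimate (0.645) overstates the population-level approval (0.5325) by more than ten points.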
Data Will Speak For Itself
This hypothesis was popularized by the success of Google Flu Trends, where correlation and causality were mixed up. The spread of flu was estimated from trends in what users were searching for (correlation), but why certain people in certain areas were searching for flu was never ascertained (causality). This resulted in Flu Trends losing its relevance recently. While Google works on its algorithm, it is a cautionary tale for others jumping too quickly to insights without proper validation. The right approach should be:
- Formulate a hypothesis based on the correlations found in data
- Establish test and control groups
- Vary some parameters and test the hypothesis for the bounds within which it holds
- Derive insights and plan accordingly
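The test-and-control step above can be sketched with a simple two-proportion z-test on conversion rates. The counts below are hypothetical, and the z-test is just one reasonable choice of significance test, not a prescription from the text:

```python
from math import erf, sqrt

# Hypothetical experiment: conversions out of visitors,
# control group vs. test group (the varied parameter).
control_conv, control_n = 120, 2400   # 5.0% baseline
test_conv, test_n = 165, 2400         # ~6.9% with the change

p1 = control_conv / control_n
p2 = test_conv / test_n

# Pooled proportion and standard error under the null hypothesis
# that both groups share the same true rate.
p_pool = (control_conv + test_conv) / (control_n + test_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / test_n))
z = (p2 - p1) / se

# Two-sided p-value from the normal CDF.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(f"z = {z:.2f}, p = {p_value:.4f}")
```

If the p-value clears your significance threshold, the correlation has survived a controlled test and is safer to plan around; if not, it stays a hypothesis.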
Self-Fulfilling Prophecies
The last point I would make is about self-fulfilling prophecies. Suppose you did some analysis based on the suggestions above and found that people are searching for a specific type of product on your site. Logically, a call to action, a promoted search result, or similar tactics could be applied. But now a feedback loop has been established: because the product is promoted more, it ends up being viewed more. This is a self-fulfilling prophecy. The data collected needs to be normalized, and more sophisticated models need to be created to account for this. Further, it is important not to lose sight of everyone who did not go through the promoted path, to see if trends are emerging there. Rinse and repeat!
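One simple form of the normalization mentioned above is to compare click-through rates rather than raw view counts, since promotion inflates impressions and impressions inflate views. A toy sketch, with every number invented:

```python
# Raw views make the promoted item look dominant, but that is partly
# the feedback loop: promotion drives impressions, impressions drive
# views. Dividing by impressions (CTR) gives a fairer comparison.
# All figures below are invented for illustration.

items = {
    # name: (impressions, clicks)
    "promoted_widget": (50_000, 1500),
    "organic_gadget": (4_000, 200),
}

for name, (impressions, clicks) in items.items():
    ctr = clicks / impressions
    print(f"{name}: {clicks} clicks, CTR = {ctr:.1%}")
```

By raw clicks the promoted item wins 1500 to 200, but per impression the organic item converts better (5% vs. 3%), which is exactly the trend the feedback loop would have hidden.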