In the past couple of years, there has been a lot of media coverage about Big Data in general and Hadoop in particular. This blog post aims to provide an easy introduction to Hadoop from a marketer’s point of view and also present a few potential use cases as thought starters.
What is Hadoop?
Hadoop is an open-source platform that was developed at Yahoo! and contributed to the Apache Software Foundation. The platform provides two core services:
- A distributed file system (HDFS) that stores very large files by spreading them across many computers
- A distributed computing engine (YARN) that runs processing tasks across those same computers, with MapReduce being the best-known compute engine running on top of YARN
The Hadoop ecosystem is made up of many tools built on top of these two components. They range from ingesting massive amounts of data from different systems (Flume, Sqoop), to familiar data manipulation interfaces (Pig, Hive, HBase), to advanced analytics and machine learning (H2O, Mahout). Together, these tools allow processing to be performed on data sets ranging from a few hundred megabytes to many terabytes.
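To make the MapReduce idea a little more concrete, here is a minimal sketch of the classic word-count job written for Hadoop Streaming, which lets plain scripts act as the map and reduce steps. It is only an illustration: the script names are arbitrary, and a real job would be submitted with the Hadoop Streaming jar against files already sitting in HDFS.

```python
#!/usr/bin/env python
# mapper.py -- the "map" step: emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word.lower())
```

```python
#!/usr/bin/env python
# reducer.py -- the "reduce" step: Hadoop delivers the mapper output sorted by word,
# so we can sum the counts for each word as we stream through the input.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Hadoop takes care of splitting the input across machines, running many copies of the mapper in parallel, sorting the intermediate pairs, and feeding them to the reducers. That division of labour is the whole point of the platform.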
A brief note about the history of how Hadoop came to be: the core tenets of its technical underpinnings were invented at Google, which needed new ways of storing and processing huge amounts of data. HDFS has its origins in the Google File System, and the MapReduce distributed computing engine has its origins in Google's MapReduce. I bring this up because Hadoop is built by the brain trust of some of the best engineering teams around today. Further, Hadoop is backed commercially by large corporations like Cloudera, Hortonworks, MapR, Intel and Microsoft, and the list grows every day. Use this when your IT guy says,
But it’s open source….
Do I Have a Big Data Problem?
That is the question, isn’t it? If the current media hype is to be believed, everyone has a Big Data problem. However, let me offer my perspective. I think of tools like Hadoop in the big data ecosystem as just that: tools. These tools are great at some things and not so great at others. Here are the situations where, when I am solving a problem, I consider reaching for the Hadoop ecosystem:
A Lot of Different Data Formats
This happens quite a lot for me. Just recently, I got a 100 MB file of media runs for advertising campaigns and another equally large file of all the leads the call centre received from those campaigns. Of course, the data had a lot of errors in it, and it was too large for Excel to handle. The two files were generated by different departments in the organization, and they had to be joined together to calculate the effectiveness of media spend and to do some predictive modelling on what was producing the best qualified leads. Pushing this data into Hadoop, joining it, and preparing it for analysis took less than an hour.
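As a rough illustration of what that join looked like, here is a sketch using PySpark, one of several engines that run on top of YARN (Pig or Hive would work just as well). The HDFS paths and column names are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("media-vs-leads").getOrCreate()

# Hypothetical HDFS paths and column names -- adjust for the real feeds.
media = spark.read.csv("hdfs:///marketing/media_runs.csv", header=True, inferSchema=True)
leads = spark.read.csv("hdfs:///marketing/call_centre_leads.csv", header=True, inferSchema=True)

# Join the two departmental files on a shared campaign id, then compare
# media spend against the qualified leads each campaign produced.
joined = media.join(leads, on="campaign_id", how="left")
summary = (joined.groupBy("campaign_id")
                 .agg(F.sum("spend").alias("total_spend"),
                      F.sum("is_qualified").alias("qualified_leads")))

summary.write.mode("overwrite").csv("hdfs:///marketing/campaign_effectiveness", header=True)
```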
There are many such cases in marketing. You get a feed of social media messages, web analytics from Google Analytics, call data from the call centre, transactions from stores, e-commerce transactions from the website, customer data from a CRM system, and so on. Building a data warehouse to bring these together is often an expensive, time- and resource-consuming activity. Further, these warehouses are rigid, and it is incredibly hard to discover new relationships as time goes on. A Hadoop-based model, where the key technical terms are schema-on-read and extract-load-transform (ELT), lets you load the raw data first and ask new questions of it at any time in the future.
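The schema-on-read idea is easiest to see in code: the raw files land in HDFS exactly as they arrive (extract-load), and a structure is only imposed when you query them (transform). Here is a small sketch, again in PySpark, with hypothetical paths and columns and assuming dates arrive as yyyy-MM-dd.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema-on-read: the web analytics export was dumped into HDFS as-is;
# the structure below is only applied now, at query time.
# Column names and the date format are assumptions for the example.
schema = StructType([
    StructField("session_id", StringType()),
    StructField("visit_date", DateType()),
    StructField("campaign", StringType()),
    StructField("revenue", DoubleType()),
])

visits = spark.read.csv("hdfs:///raw/web_analytics/", schema=schema, header=True)
visits.groupBy("campaign").agg(F.sum("revenue").alias("revenue")).show()
```

If a new question comes up next quarter, you simply read the same raw files with a different schema or join them to a new feed; nothing has to be re-modelled up front.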
A Lot of Data
I was running an analysis for a restaurant chain, going through 400 million transactions to understand customer behaviour. Excel certainly wouldn’t work here. This is a clear case for a big data solution.
Traditionally, this is the type of analysis marketers wouldn’t be able to perform. With the Hadoop ecosystem, and some of the connectors mentioned in the next section, it is now fair game. You can run segmentation on a full year’s worth of customer transactions and do RFM (recency, frequency, monetary) calculations to identify your most valuable customers. You can join in data from the call centre to determine which segments are the most expensive to serve. The possibilities here are enormous.
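For a sense of what an RFM calculation looks like at this scale, here is a sketch in PySpark over a hypothetical transaction feed with customer_id, txn_date and amount columns: recency is days since the last purchase, frequency is the number of transactions, and monetary is total spend.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rfm").getOrCreate()

# Hypothetical feed: one row per purchase with customer_id, txn_date, amount.
txns = spark.read.csv("hdfs:///restaurant/transactions/", header=True, inferSchema=True)

rfm = (txns.groupBy("customer_id")
           .agg(F.datediff(F.current_date(), F.max(F.to_date("txn_date"))).alias("recency_days"),
                F.count("*").alias("frequency"),
                F.sum("amount").alias("monetary")))

# Most valuable customers: recent, frequent, high spend.
rfm.orderBy(F.desc("monetary")).show(20)
```

The same rfm table can then be joined to call centre data to see which of those segments cost the most to serve.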
Great, But Sounds Technical
It is hard to escape technology when working in digital marketing. That doesn’t mean you need to know how to code a MapReduce job, but knowing the ecosystem allows intelligent conversations between marketing and IT folks. To make your life simpler, there are some fantastic connectors, like:
- Excel connector: a connector published by Hortonworks that lets you work in Excel while Hadoop does the querying in the background. Microsoft also provides a similar connector.
- Tableau: a popular BI and visualization tool that provides a guide on how to connect to a Hadoop system. You can then run all your analytics in Tableau.
Similar tools and connectors are available for other BI and visualization platforms. You should also check out tools such as Platfora, Actian, Pivotal and SAS, amongst others.
What’s the next problem you want to solve?