Posted by James Kobielus on June 6, 2011
Enterprises have options. One of the questions I asked the firms I interviewed as Hadoop case studies for my upcoming Forrester report is whether they considered using the tried and true approach of a petabyte-scale enterprise data warehouse (EDW). It’s not a stretch, unless you are a Hadoop bigot and have willfully ignored the commercial platforms that already offer shared-nothing massively parallel processing for in-database advanced analytics and high-performance data management. If you need to brush up, check out my recent Forrester Wave™ for EDW platforms.
Many of the case study companies did in fact consider an EDW like those from Teradata and Oracle. But they chose to build out their Big Data initiatives on Hadoop for many good reasons. Most of those are the same reasons any user adopts any open-source platform: By using Apache Hadoop, they could avoid paying expensive software licenses; give themselves the flexibility to modify source code to meet their evolving needs; and avail themselves of leading-edge innovations coming from the worldwide Hadoop community.
But the basic fact is that Hadoop is not a radically new approach to extremely scalable data analytics. A high-end EDW can do most of what Hadoop does, offering all the core features — including petabyte scale-out, in-database analytics, mixed-workload support, cloud-based deployment, and complex data sources — that characterize most real-world Hadoop deployments. And the open-source Apache Hadoop code base, by its devotees’ own admission, still lacks such critical features as the real-time integration and robust high availability you find in EDWs everywhere.
So: Apart from being an open-source community with broad industry momentum, what is Hadoop good for that you can’t get elsewhere? The answer to that is a mouthful — but a powerful one. Essentially, Hadoop is vendor-agnostic in-database analytics in the cloud, leveraging an open, comprehensive, extensible framework for building complex advanced analytics and data management functions for deployment into cloud computing architectures. At the heart of that framework is MapReduce, a general-purpose parallel programming model for developing statistical analysis, predictive modeling, data mining, natural-language processing, sentiment analysis, machine learning, and other advanced analytics. Another linchpin of Hadoop, Pig, provides a versatile language for building data integration processing logic.
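To make the MapReduce model concrete, here is a minimal, self-contained sketch of its three phases — map, shuffle, reduce — applied to the classic word-count problem. This is an illustrative simulation in plain Python, not Hadoop API code; all function names here are hypothetical:

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit a (key, value) pair for every word in an input line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values seen for one key.
    return (key, sum(values))

def run_job(records):
    pairs = [p for r in records for p in map_phase(r)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["big data big analytics", "big insights"])
print(counts["big"])  # "big" appears three times across both lines
```

In a real Hadoop job the map and reduce functions run in parallel across the cluster and the shuffle moves data between nodes, but the programming contract is exactly this simple, which is what makes the model so broadly applicable.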
The bottom line is that Hadoop is the future of the cloud EDW, and its footprint in companies’ core EDW architectures is likely to keep growing throughout this decade. The roles that Hadoop is likely to assume in your EDW strategy correspond to the dominant applications it’s being used for here and now, as a petabyte-scalable:
- Staging layer. The most frequent application I hear for Hadoop is in support of the “T” in ETL (extract, transform, load), but where the data being extracted and transformed are in myriad unstructured, semistructured, and structured formats and where the loading may be to a traditional EDW or — more commonly — to terabyte-scale analytical data marts where predictive modelers and other data scientists work their magic.
- Event analytics layer. Another key use for Hadoop is in doing petabyte-scale log processing of event data for call detail record analysis, behavioral analysis, social network analysis, IT optimization, clickstream sessionization, antifraud, signal intelligence, and security incident and event management. There are more traditional EDW architectures that support these applications, but the industry push is toward Hadoop and other “NoSQL” approaches, such as graph databases, to handle high-volume event processing in batch and real time.
- Content analytics layer. This is an area where Hadoop’s comprehensive MapReduce modeling layer is its key competitive differentiator. Next best action, customer experience optimization, and social media analytics are every CRM professional’s hottest hot-button projects. They absolutely depend on plowing through streaming petabytes of customer intelligence from Twitter, Facebook, and the like and integrating it all with segmentation, churn, and other traditional data-mining models. MapReduce provides the abstraction layer for integrating content analytics with these more traditional forms of advanced analytics, and Hadoop is the platform for MapReduce modeling.
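Clickstream sessionization, mentioned under the event analytics layer above, shows how naturally these workloads fit the MapReduce shape: group click events by user (the shuffle) and split each user’s stream into sessions wherever the gap between clicks exceeds a timeout (the reduce). A minimal sketch, with hypothetical field layout and an assumed 30-minute timeout:

```python
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT = 30 * 60  # assumed 30-minute inactivity threshold, in seconds

def sessionize(events):
    """Split (user_id, timestamp) click events into per-user sessions.

    Grouping by user plays the role of the shuffle; the per-user
    splitting loop plays the role of the reduce.
    """
    sessions = []
    events = sorted(events, key=itemgetter(0, 1))
    for user, clicks in groupby(events, key=itemgetter(0)):
        current, last_ts = [], None
        for _, ts in clicks:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                sessions.append((user, current))  # gap too long: close session
                current = []
            current.append(ts)
            last_ts = ts
        sessions.append((user, current))
    return sessions

clicks = [("u1", 0), ("u1", 100), ("u1", 4000), ("u2", 50)]
print(len(sessionize(clicks)))  # u1 splits into two sessions, u2 has one
```

At petabyte scale the same logic runs as a Hadoop job, with users partitioned across reducers; the per-user computation is unchanged.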
Hadoop is not a religion, any more than traditional EDWs are a religion — although some people seem to think of the latter as such, aligning themselves with this or that EDW architectural school. To promote itself to enterprise prime time, the Hadoop industry needs to focus on what its up-and-coming approach does better than EDWs or does best within the context of a traditional EDW architecture.