Hadoop: Future Of Enterprise Data Warehousing? Are You Kidding?

I kid you not.

What’s clear is that Hadoop has already established its initial footprint in the enterprise data warehousing (EDW) arena: as a petabyte-scalable staging cloud for unstructured content and embedded execution of advanced analytics. As noted in a recent blog post, this is in fact the dominant use case for which Hadoop has been deployed in production environments.

Yes, traditional (Hadoop-less) EDWs can in fact address this specific use case reasonably well — from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it’s just a matter of time — one to two years, tops — before all EDW vendors bring Hadoop into the heart of their architectures. For those EDW vendors who haven’t yet committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hands.

Where the next-generation EDW is concerned, the petabyte staging cloud is merely Hadoop’s initial footprint. Enterprises are moving rapidly toward the EDW as the hub for all advanced analytics. Forrester strongly expects vendors to incorporate the core Hadoop technologies — especially MapReduce, the Hadoop Distributed File System (HDFS), Hive, and Pig — into their architectures. Again, the impressive growth of MapReduce as a lingua franca for predictive modeling, data mining, and content analytics will practically compel EDW vendors to optimize their platforms for MapReduce, alongside high-performance support for SAS, SPSS, R, and other statistical modeling languages and formats. We see clear signs that this is already happening, as with EMC Greenplum’s recent announcement of a Hadoop product family and indications from some of that company’s competitors that they have similar near-term road maps.

Please do not interpret this as Forrester forecasting the demise of traditional EDWs built on relational, columnar, dimensional, and other approaches for storing, manipulating, and managing data. All of your investments in pre-Hadoop EDWs, data marts, data hubs, operational data stores, and the like are reasonably safe from obsolescence. The reality is that the EDW is evolving into a virtualized cloud ecosystem in which all of these database architectures can and will coexist in a pluggable “Big Data” storage layer. That layer will span HDFS, HBase (Hadoop’s columnar database), Cassandra (a sibling Apache project that supports peer-to-peer persistence for complex event processing and other real-time applications), graph databases, and other “NoSQL” platforms, all behind an abstraction layer with MapReduce as its focus.

That trend is also clear, and I’m glad to have said as much in a Forrester report we published almost two years ago. At that time, in the context of an in-database analytics discussion, I stated that within the next several years most EDW and advanced analytics vendors would incorporate MapReduce and Hadoop support into their architectures, enabling standards-based development of advanced analytics models with flexible in-database pushdown optimization in the cloud.

I also took the analysis to the next evolutionary step, identifying the industry roadmap for embedding Hadoop/MapReduce into the larger paradigm that we now call “Big Data.” This paradigm involves embedding a more comprehensive range of application functions and logic — both analytical and transactional — into the virtualized cloud EDW. Essentially, the cloud EDW will become the core “application server” for the next generation of use cases — such as next best action — that require tight integration of historical, real-time, and predictive analytics.

Within the Big Data cosmos, Hadoop/MapReduce will be a key development framework, but not the only one. These specifications will form part of a broader, but still largely undefined, service-oriented virtualization architecture for inline analytics. Under this paradigm, developers will create inline analytic models that deploy to a dizzying range of clouds, event streams, file systems, databases, complex event processing platforms, business process management systems, and information-as-a-service environments.

At Forrester, we see these requirements coming directly from CTOs and other senior decision-makers in large organizations who are driving convergence of investments across all of these formerly separate technology domains. Vendors are racing to address this convergence in their product portfolios.

No kidding. Hadoop is the core platform for Big Data, and it’s a core convergence focus for enterprise application, analytics, and middleware vendors everywhere.

Comments

MapReduce or Hadoop?

James, are you really saying that MapReduce is in the next gen DW? Hadoop seems like a framework that isn't necessary if the EDW or platform can run MapReduce and integrate with the appropriate applications/filesystems...yes?

Yes, MapReduce supported by next-gen EDW

Yes, MapReduce is supported by next-gen EDWs, such as those from Teradata's Aster Data and EMC Greenplum. MapReduce is an abstraction layer over various data storage/persistence layers, including the Hadoop ones (HDFS, HBase), Cassandra, RDBMSs, etc., all of which will be components in the virtualized storage layer of the next-gen EDW. Other abstraction layers (to support unified access, query, manipulation, etc.) include SQL, SOA, and REST. Hadoop's various layers have varying degrees of adoption now and will certainly be adopted to different degrees by EDW vendors, with some (especially MapReduce, HDFS, Pig, and Hive) enjoying the broadest adoption, while others will remain options or be ignored altogether.
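
(To make the MapReduce layer concrete, here is a minimal sketch of the canonical word-count job using the standard Hadoop MapReduce Java API. The input and output paths are placeholders you would point at HDFS directories; nothing here is specific to any vendor's platform.)

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The point of the abstraction is that the Mapper/Reducer pair knows nothing about the underlying storage engine; swap in a different input format and the same processing model runs over HBase tables or other stores.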

Ian? Ian who?

I'm not sure who this Ian is that you're referring to. The original post was from me. I agree with the points you're making, but that's a confusing blog reply.

To check if I understood you correctly

James, thanks for a very informative post... if I understood you (and most of the concepts of big data, cloud & Hadoop components) correctly, the following seem likely... please comment.

1) Traditional DW providers will flex / radically change their architecture to enable Hadoop-based (HDFS, HBase specifically) storage for unstructured as well as structured data.
a. Hadoop (HBase?) for structured data (as an alternative option to RDBMS-based structured data storage) may be particularly relevant to organizations that are pushing / thinking of pushing even structured data into the cloud (private / public). Do you agree?

b. I am also saying the above (Hadoop-based storage for structured data) based on my understanding that RDBMS-based DWs concentrate data storage on a single / small set of server H/W, whereas the cloud, by its very distributed nature, would support Hadoop-based storage even for structured data.

c. Related question: Are cloud based RDBMS data stores prevalent? Do they even lend themselves to be implemented on the cloud?

2) Other Hadoop components (Hive & HiveQL, specifically) integrated into traditional EDW providers’ architecture will enable faster querying of multi-structured data (especially when data of all sorts is Hadooped(!) on the cloud)

3) In the meantime, bridge solutions like MapReduce-enabled SQL for querying, reporting & analytics (like that enabled by SQL-MapReduce from Aster Data) will aim to enhance the performance of core RDBMS platforms in analyzing big data (especially when such data has to be brought into cluster-based on-premise implementations)
a. Related Question: Is Aster Data SQL-MapReduce fundamentally still an RDBMS in the way it stores & manages data (the only change I see is how queries can leverage MapReduce-based User Defined Functions)?

4) Bridge solutions like Hadoop adapters (again, like that from Aster Data for its SQL-MapReduce) will extend the SQL querying ease of core RDBMS-based platforms to big data + also take advantage of Hadoop-style storage (or probably just the ETL features of Hadoop components) for big data

a. Related Question: When data is moved by the Aster Data Hadoop Adapter into its SQL-MapReduce, does the data get replicated in relational structure? If that’s the case, doesn’t that add overhead (more so when data is big) in exchange for convenience?

Thanks,
Santhanakrishnan(Santhana)

Great questions...a few quick responses

Santhana:

Excellent response to my blog post from June. You understood the main thrust of my post quite well.

Here are a few quick responses to your questions:

QUESTION: Hadoop (HBase?) for structured data (as an alternative option to RDBMS-based structured data storage) may be particularly relevant to organizations that are pushing / thinking of pushing even structured data into the cloud (private / public). Do you agree?

• KOBIELUS RESPONSE: Yes, columnar databases such as HBase, arranging data in tabular formats, are well suited to handling structured data. But the same must be said for row-oriented RDBMSs as well. And both types of databases are found in real-world Hadoop deployments in private and public clouds. For example, HBase is found in about a third of the early-adopter Hadoop case studies I’ll be publishing in the next several weeks, but several of them also use an RDBMS such as MySQL (under a MapReduce abstraction layer), and of course Teradata/Aster Data’s SQL-MR abstraction layer integrates with that vendor’s row-oriented nCluster MPP EDW. The range of data storage/persistence approaches in the Hadoop world, under the MapReduce layer, is wide and growing, including HDFS (file-based), HBase (columnar), Cassandra (real-time distributed), RDBMS (MySQL, nCluster, etc.), and NoSQL (apparently, Oracle’s upcoming Big Data Appliance will run MapReduce over a key-value store database).
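
(For illustration, here is a minimal sketch of writing and reading one structured record through the HBase Java client API, 1.x-style. The "orders" table and its "d" column family are hypothetical and would need to be created beforehand; this is a sketch of the pattern, not any vendor's deployment.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStructuredRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("orders"))) { // hypothetical table

      // Write one "structured" record: a row key plus named columns in a family.
      Put put = new Put(Bytes.toBytes("order#1001"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("customer"), Bytes.toBytes("acme"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("99.95"));
      table.put(put);

      // Read it back by key -- no SQL, and no fixed schema beyond the column family.
      Result result = table.get(new Get(Bytes.toBytes("order#1001")));
      String customer = Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("customer")));
      System.out.println("customer = " + customer);
    }
  }
}
```

The design point: HBase persists these cells on HDFS and partitions the table across the cluster, so the same "tabular" record scales out without a rigid relational schema.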

QUESTION: Are cloud based RDBMS data stores prevalent? Do they even lend themselves to be implemented on the cloud?

• KOBIELUS RESPONSE: Yes. For example, SQL Azure is a cloud-based RDBMS store. So is the new Google Cloud SQL, which is a scalable, hosted MySQL database environment. So, obviously, is Oracle 11g, which can be hosted in Amazon clouds. Almost every database architecture, traditional and new, has taken up residence in clouds (public and private), and these architectures will compete for particular niche roles in the virtualized, heterogeneous, multi-database cloud world.

QUESTION: Is Aster Data SQL-MapReduce fundamentally still an RDBMS in the way it stores & manages data (the only change I see is how queries can leverage MapReduce-based User Defined Functions)?

• KOBIELUS RESPONSE: Yes, underlying Teradata/Aster Data’s MapReduce platform is an MPP RDBMS, nCluster, which stores data in rows and supports all SQL and SQL-MR functions.

QUESTION: When data is moved by the Aster Data Hadoop Adapter into its SQL-MapReduce, does data get replicated in relational structure? If that’s the case, doesn’t that add overheads (more so when data is big), in exchange for convenience?

• KOBIELUS RESPONSE: That’s a low-level technical question you should ask Teradata/Aster Data directly.

Jim

Thanks Jim

Hello Jim,

Thank you very much for a very structured response (I was a bit hesitant as I fired too many questions at you).

Really served to up my understanding several notches (especially me being a non/quasi-technical consultant).

Planning to read more of you + share this article of yours with my fellow practitioners through my blog as well.

Santhana

My pleasure to help you

Santhana:

I'm glad you found my feedback useful. Have a good one.

Jim

Data Model for Big Data EDW solution

Hi,

As we move toward HDFS/MapReduce in the big data landscape, will the philosophy of dimensional modeling still hold good in a Hadoop-based EDW solution? What should be the right data modeling practice, or how do we implement dimensional modeling in Hadoop? If you can throw some light on the data modeling side, it would be helpful.

Thanks,
Ritesh

HDFS & dimensional modeling are separate

Ritesh:

HDFS is a file system that doesn't impose a schema, dimensional or otherwise, on the data aggregated and analyzed there. If you deploy HDFS behind HBase, or behind an RDBMS, columnar database, or dimensional/OLAP cube, you'll still need to model the database schemas in those other databases. If you deploy HDFS behind an associative DBMS of the sort that Boris Evelson examined in this report (http://www.forrester.com/rb/Research/dawning_of_age_of_bi_dbms/q/id/5885...), you can dispense with database modeling, dimensional or otherwise, altogether.
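
(For illustration: one common way to project a dimensional model onto data already sitting in HDFS is schema-on-read via Hive external tables, queried here through the HiveServer2 JDBC interface. The endpoint, table definitions, and HDFS paths below are placeholder assumptions for the sketch, not anything from the original exchange.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveStarSchemaSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
    // Host, port, and database are placeholders for your cluster.
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // Schema-on-read: project a fact table onto raw delimited files in HDFS.
    stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS sales_fact ("
        + " sale_date STRING, product_id INT, amount DOUBLE)"
        + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
        + " LOCATION '/data/sales'"); // hypothetical HDFS path

    // ...and a dimension table onto another set of files.
    stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS product_dim ("
        + " product_id INT, category STRING)"
        + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
        + " LOCATION '/data/products'");

    // A familiar star-join query; Hive compiles it down to MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT d.category, SUM(f.amount) AS revenue"
        + " FROM sales_fact f JOIN product_dim d ON f.product_id = d.product_id"
        + " GROUP BY d.category");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
    }
    conn.close();
  }
}
```

Whether a star schema is the right target shape on Hadoop is a separate modeling question, but mechanically nothing stops you from imposing one on HDFS files at read time this way.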

Jim

"KOBIELUS RESPONSE: Yes. For

"KOBIELUS RESPONSE: Yes. For example, SQL Azure is a cloud-based RDBMS store. So is the new Google Cloud SQL, which is a scalable, hosted MySQL database environment. So, obviously, is Oracle 11g, which can be hosted in Amazon clouds. Almost every database architecture, traditional and new, has taken up residence in clouds (public and private), and these architectures will compete for particular niche roles in the virtualized, heterogeneous, multi-database cloud world."

First let me thank you for such a wonderful article.

I have a question on one of your answers to Santhana. The relational databases you mentioned are traditionally OLTP databases, which were not really created to exploit a parallel architecture like Hadoop. But when it comes to DWH databases like Teradata, Netezza, etc., their architecture enables them to directly exploit their underlying hardware/software parallelism: their parsing engines know how to route the reads and writes from/into the database. So what kind of advantage do these OLTP(?) databases offer when they are mounted on a cloud? Perhaps the new breed of in-memory (IMDB) databases can perform very well in the cloud?

Another basic question: how are data transfers into/out of the cloud handled when it comes to DWH on the cloud? Wouldn’t that cause additional data transfer overhead?