Posted by James Kobielus on June 7, 2011
Problems don’t care how you solve them. The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal.
When people speak of “Big Data,” they’re referring to problems that can best be addressed by amassing massive data sets and using advanced analytics to produce “Eureka!” moments. The issue of what approach — Hadoop cloud, enterprise data warehouse (EDW), or otherwise — gets us to those moments is secondary.
It’s no accident that Big Data mania has also stimulated a vogue in “data scientists.” Many of the core applications of Hadoop are scientific problems in linguistics, medicine, astronomy, genetics, psychology, physics, chemistry, mathematics, and artificial intelligence. In fact, Yahoo’s scientists not only played a predominant role in developing Hadoop but, as exploratory problem-solvers, remain active participants in Yahoo’s efforts to evolve Hadoop into an even more powerful scientific cloud platform.
The problems that are best suited to Hadoop and other Big Data platforms are scientific in nature. What they have in common is a need for analytical platforms and tools that can rapidly scale out to the petabyte level and support the following core features:
- Detailed, interactive, multivariate statistical analysis
- Aggregation, correlation, and analysis of historical and current data
- Modeling and simulation, what-if analysis, and forecasting of alternate future states
- Semantic mining of unstructured data, streaming information, and multimedia
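To make the second feature concrete, here is a minimal sketch of aggregating historical and current records and computing a pairwise correlation matrix across metrics. The data, metric names, and shapes are all hypothetical, chosen only to illustrate the operation; any real Big Data platform would run this at petabyte scale, not in-memory NumPy.

```python
# Illustrative sketch: aggregation and correlation of historical and current
# data. All data and column meanings below are invented for the example.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observations: rows are daily records, columns are metrics.
historical = rng.normal(size=(365, 3))   # e.g., last year's records
current = rng.normal(size=(30, 3))       # e.g., this month's records

# Aggregate: combine the two eras, then summarize each metric.
combined = np.vstack([historical, current])
means = combined.mean(axis=0)

# Correlate: pairwise correlation matrix across the three metrics.
corr = np.corrcoef(combined, rowvar=False)

print("per-metric means:", np.round(means, 3))
print("correlation matrix:\n", np.round(corr, 3))
```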
“Scientific” doesn’t always mean theoretical. Essentially, any complex research, engineering, supply chain, marketing, or other problem is suitable for Big Data. The common denominator is the need — in Hadoop, the EDW, or some other platform — to:
- Iterate predictive models more rapidly. If you can build and score models in shorter cycles with a steady stream of fresh observational data, you can refine models more rapidly and converge on the solutions to your toughest problems.
- Run models of increasing complexity. If you can execute models with ever more computing-intensive algorithms and more variables against larger, more complex, historically deeper data sets, you can tackle more challenging problems and improve the precision, accuracy, and effectiveness of your models.
- Deliver model-driven decisions to more business processes. If you can leverage the analytic platform’s greater horsepower against a higher volume of concurrent jobs, queries, and sessions, you can accelerate the results of model execution and spread decision support benefits across a wider range of users, applications, and business processes. This is the heart of Big Data’s potential to help make model-driven next best action pervasive in both customer-facing and back-office processes.
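The first point above — refining a model in short cycles as fresh observational data arrives — can be sketched in a few lines. The linear model, learning rate, and synthetic data here are illustrative assumptions, not any particular product’s method; the point is only the build–score–refine loop.

```python
# A minimal sketch of iterative model refinement: each cycle scores the
# current model against a fresh batch of observations, then updates it.
# The "true" relationship and all parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])   # hidden relationship the model tries to learn
w = np.zeros(2)                  # current model coefficients
lr = 0.1                         # learning rate for each refinement step

for cycle in range(40):          # each cycle = one batch of fresh data
    X = rng.normal(size=(50, 2))                       # new observations
    y = X @ true_w + rng.normal(scale=0.1, size=50)    # noisy outcomes

    # Score the current model on the fresh batch, then refine it.
    err = X @ w - y
    grad = X.T @ err / len(y)
    w -= lr * grad

print("estimated coefficients:", np.round(w, 2))  # converges toward [2, -1]
```

Shorter cycles with fresh data mean the loop body runs more often, which is exactly the “converge on solutions faster” claim in the bullet.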
Hadoop is on a fast track to becoming the world’s pre-eminent scientific analytic platform. The open-source nature of Hadoop plays nicely into the collaborative nature of modern scientific endeavors, and the abundance of MapReduce models that have been written for Hadoop is the lifeblood of many disciplines. As it becomes ubiquitous, Hadoop will become the global “laboratory” from which springs a steady stream of big new insights that thrive on extremely scalable analytics.
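For readers new to Hadoop’s programming model, this pure-Python sketch mimics the map → shuffle → reduce phases on word count, the canonical MapReduce example. A real Hadoop job would implement Mapper and Reducer classes in Java and run distributed across a cluster; this only illustrates the shape of the model.

```python
# Pure-Python illustration of the MapReduce model (not actual Hadoop code).
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit (key, value) pairs: one (word, 1) per word in the input split.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a final result.
    return (key, sum(values))

documents = ["big data big insights", "big analytics"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 1, 'insights': 1, 'analytics': 1}
```

Because each map task sees only its own split and each reduce task sees only its own keys, both phases parallelize across a cluster — which is what lets the model scale out to the petabyte level discussed above.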