Saturday, May 21, 2011

OLAP Over Hadoop

In the last few years Hadoop has really come forward as a massively scalable distributed computing platform. Most of us are aware that it uses Map Reduce Jobs to perform computation over Big Data which is mostly unstructured. Of course such a platform cannot be compared with a relational database storing structured data with defined schema. While Hadoop allows you to perform Deep analytics with complex computations, when it comes to performing multidimensional analytics over data Hadoop seems lagging. You might argue that Hadoop was not even built for such uses. But when the users start putting their historical data in Hadoop they also start expecting multidimensional analytics over it in real time. Here “real time” is really important.

Some of you might think that you can define OLAP friendly Warehousing Star Schema using Hive for your data in Hadoop and use a ROLAP tool. But there comes the catch. Even on the partially aggregated data the ROLAP queries will be too slow to make it real time OLAP. As Hive structures the data at read time, the fixed initial time taken for each Hive query makes Hadoop really unusable for real time multidimensional analytics.

The only options left to you are either you aggregate the data in Hadoop and bring the partially aggregated data in an RDBMS. Thus you can use any standard OLAP tool to connect to your RDBMS and perform Multidimensional analytics using ROLAP or MOLAP. While ROLAP will directly fire the queries against the Database, MOLAP will further summarize and aggregate the multidimensional data in the form of cuboids for a cube.

The other option is you use a MOLAP tool that can compute the aggregates for the data in Hadoop and get the computed cube locally. This will allow you to do a really real time OLAP. Moreover if the aggregates can be performed in Hadoop itself that will really make cube computations scalabale and fast.

There can be a big fight over the point that Hadoop is not a DBMS but when Hadoop reaches to users and organizations who look to use it just because it is a buzzword, they expect almost anything out of it that a DBMS can do. You should see such solutions growing in the near future.