Saturday, May 21, 2011

OLAP Over Hadoop

In the last few years Hadoop has emerged as a massively scalable distributed computing platform. Most of us know that it uses MapReduce jobs to perform computation over Big Data, most of which is unstructured. Such a platform cannot be directly compared with a relational database that stores structured data with a defined schema. While Hadoop lets you perform deep analytics with complex computations, it lags when it comes to multidimensional analytics. You might argue that Hadoop was never built for such uses. But once users start putting their historical data in Hadoop, they also start expecting multidimensional analytics over it in real time. Here, “real time” is the important part.

Some of you might think that you can define an OLAP-friendly warehousing star schema with Hive for your data in Hadoop and use a ROLAP tool on top of it. But there is a catch. Even on partially aggregated data, ROLAP queries will be too slow for real-time OLAP. Hive applies structure to the data at read time, and every Hive query carries a fixed startup cost, which makes Hadoop practically unusable for real-time multidimensional analytics.

That leaves two options. The first is to aggregate the data in Hadoop and load the partially aggregated data into an RDBMS. You can then connect any standard OLAP tool to the RDBMS and perform multidimensional analytics using ROLAP or MOLAP. ROLAP fires its queries directly against the database, whereas MOLAP further summarizes and aggregates the multidimensional data into the cuboids of a cube.
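To make the cuboid idea concrete, here is a minimal Python sketch. It assumes a tiny, partially aggregated fact table with hypothetical dimensions (region, product, year) and a single sales measure; the names and data are made up for illustration. It is not how any particular OLAP tool stores a cube. It only shows that a MOLAP-style cube amounts to one group-by per subset of the dimensions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical partially aggregated rows, as they might arrive from Hadoop:
# (region, product, year, sales)
DIMENSIONS = ("region", "product", "year")
ROWS = [
    ("EU", "laptop", 2010, 120.0),
    ("EU", "phone", 2010, 80.0),
    ("US", "laptop", 2010, 200.0),
    ("US", "laptop", 2011, 50.0),
]


def build_cube(rows, dimensions):
    """Return {cuboid: {cell_key: total}} for every subset of dimensions."""
    cube = {}
    indexes = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(indexes, size):
            cuboid = tuple(dimensions[i] for i in subset)
            cells = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in subset)
                cells[key] += row[-1]  # the last column is the measure
            cube[cuboid] = dict(cells)
    return cube


cube = build_cube(ROWS, DIMENSIONS)
```

With three dimensions this produces 2^3 = 8 cuboids. A roll-up query then becomes a dictionary lookup such as cube[("region",)][("EU",)], which is why MOLAP answers quickly and also why the cube grows fast as dimensions are added.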

The other option is to use a MOLAP tool that computes the aggregates over the data in Hadoop and brings the computed cube to a local store. This allows truly real-time OLAP. Moreover, if the aggregation can be performed in Hadoop itself, cube computation becomes scalable and fast.
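One way to picture aggregation done inside Hadoop is as a map step that emits a cell for every roll-up a record belongs to, followed by a reduce step that sums each cell. The sketch below simulates both steps in a single Python process. The function names (map_record, reduce_cells) and the sample records are hypothetical, and a real job would run the same logic through Hadoop's MapReduce machinery rather than in memory.

```python
from collections import defaultdict
from itertools import combinations


def map_record(record):
    """Mapper: emit one (cell, measure) pair per cuboid the record belongs to."""
    dims, measure = record[:-1], record[-1]
    positions = range(len(dims))
    for size in range(len(dims) + 1):
        for subset in combinations(positions, size):
            # "*" marks a dimension that has been rolled up (aggregated away).
            cell = tuple(dims[i] if i in subset else "*" for i in positions)
            yield cell, measure


def reduce_cells(pairs):
    """Reducer: sum the measures emitted for each cell."""
    totals = defaultdict(float)
    for cell, measure in pairs:
        totals[cell] += measure
    return dict(totals)


# Hypothetical input records: (region, product, sales)
records = [
    ("EU", "laptop", 120.0),
    ("EU", "phone", 80.0),
    ("US", "laptop", 200.0),
]
pairs = (pair for record in records for pair in map_record(record))
local_cube = reduce_cells(pairs)  # small enough to keep and query locally
```

The reduced output is the cube itself, so the OLAP tool only has to fetch it and serve lookups such as local_cube[("EU", "*")] without touching Hadoop at query time.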

There can be a long argument over the point that Hadoop is not a DBMS. But when Hadoop reaches users and organizations who adopt it just because it is a buzzword, they expect it to do almost anything a DBMS can do. Expect to see such solutions growing in the near future.


4 comments:

  1. This comment has been removed by the author.

  2. Reading this, it looks like you advocate MOLAP for "real-time" querying over multidimensional Big Data. But I doubt that a full MOLAP approach will work, mainly because:
    1. A fully pre-computed cube will result in an enormous amount of data that can't be stored locally on commodity hardware.
    2. If we opt to store it on a distributed architecture, it will defeat the "real-time" purpose because of slow query times for operations like drill-down and roll-up.

    I believe that a better "real-time" approach would be to store data in-memory with partial pre-computation.

    I came across an interesting case study of Druid (Distributed In-Memory OLAP Data Store) that is based on the following principle:

    Partial Aggregates + In-Memory + Indexes => Fast Queries

  3. Definitely. But most of the popular OLAP engines today generate pretty small cubes. I doubt anyone generates a full cube. Most of them generate a highly compressed form of the data with many indexes over it to return very fast query results.

    Of course, in practice the raw data in Hadoop will get summarized to quite a degree, resulting in a comparatively small cube.

  4. This comment has been removed by the author.
