Big Data Analytics

Sunday, December 4, 2011

Server Virtualization

In my last post I talked about cloud computing. Those who read it will know that a cloud provides computing resources on the fly. These can be software resources such as a CRM, email or BI solution, or hardware resources such as storage. They can also be complete machines/servers with everything set up, which is what Amazon EC2 does. For you as a user this means you can now get a server machine of your choice on demand. You can say that you want a machine with 2 processors, 360 GB of storage and 4 GB of RAM. You can also say that you want Linux with MySQL, or Windows with Oracle and Tomcat installed.

Obviously it is not possible to have somebody waiting for each request who then sets up the machine by hand, installs the required software and provisions it. The process needs to be automated and on the fly. So how can that be done?

One way is to keep a disk image ready which includes the standard operating system and some standard software. Remember Norton Ghost images? Ghost is disk-cloning software that lets you create an image of a disk which you can restore onto another blank machine. This type of provisioning is called bare metal provisioning. When you ask for a machine, the image is loaded onto a physical machine and that’s it.

But would this approach work for cloud computing? I would say no. Why? Many reasons.
First of all, it is slow. Imagine a Ghost image being retrieved and restored every time. And what do we do when the user says the machine is no longer needed? Do we format the disk?
On top of that, if we start giving physical machines to users on the fly, just imagine how many physical machines it would need. Too many, resulting in too much power consumption, too much heat and too much physical space.

Also, we don’t know what the user’s software might do to the machine. The system might crash; one erroneous program might halt the whole machine, as there is no controlling entity.
Above all, this is not the best utilization of resources either. Most of the time a powerful machine given to a user might be sitting idle, or at least not be used to its full capacity. If there were a way to share a machine among many users, the resources would be better utilized. So what is the solution?

Why not have multiple machines on a single machine and allocate these smaller machines to the users? This is called server virtualization, or just virtualization.

Virtualization lets you run multiple virtual machines on a single physical machine, with each virtual machine sharing the resources of that one physical machine. Different virtual machines can run different operating systems and applications on the same physical machine. So what is a virtual machine?

A virtual machine is a software container that can run its own operating system on a physical machine. Just like a physical machine, a virtual machine has its own (virtual) processor, storage, memory and NIC. Typically these virtual resources are implemented as a software layer over the actual physical resources. For example, a virtual processor might map to a core of a physical processor, and virtual storage or memory might map to a part of the physical storage or memory. Normally the physical machine is called the host and the virtual machine is called the guest. Neither the software running on the virtual machine nor its users can really tell it apart from a physical machine. Physically, a virtual machine is a set of files on disk: metadata files and data files. Collectively you can call those files an image, and when we say that we are launching a virtual machine, what we are doing is launching a virtual instance from such an image.
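To make the host/guest mapping concrete, here is a minimal sketch in Python. The class names, the sizes and the "no overcommit" placement rule are all assumptions made for illustration; real hypervisors schedule and often overcommit resources in far more sophisticated ways.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """A guest: a named slice of the host's cores, memory and disk."""
    name: str
    cores: int
    memory_gb: int
    disk_gb: int


@dataclass
class Host:
    """A physical machine whose resources are shared among guests."""
    cores: int
    memory_gb: int
    disk_gb: int
    guests: list = field(default_factory=list)

    def free(self):
        """Return (cores, memory_gb, disk_gb) not yet given to any guest."""
        return (
            self.cores - sum(g.cores for g in self.guests),
            self.memory_gb - sum(g.memory_gb for g in self.guests),
            self.disk_gb - sum(g.disk_gb for g in self.guests),
        )

    def launch(self, vm):
        """Place a guest on this host only if every requested resource fits."""
        free_cores, free_mem, free_disk = self.free()
        if vm.cores > free_cores or vm.memory_gb > free_mem or vm.disk_gb > free_disk:
            return False
        self.guests.append(vm)
        return True


# Hypothetical sizes: one 8-core host shared by several smaller guests.
host = Host(cores=8, memory_gb=32, disk_gb=1000)
placed_a = host.launch(VirtualMachine("guest-a", cores=2, memory_gb=4, disk_gb=360))
placed_b = host.launch(VirtualMachine("guest-b", cores=4, memory_gb=16, disk_gb=500))
placed_c = host.launch(VirtualMachine("guest-c", cores=4, memory_gb=8, disk_gb=100))
```

In this toy run the first two guests fit, while the third is refused because only two cores remain free on the host.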

Here you need to understand an important benefit of virtualization: it separates the OS from the hardware. Virtualization decouples the OS and applications from the physical hardware, and your virtual machine now sits on top of that hardware. Without virtualization, moving your server to better hardware could take almost 24 hours. You take a backup of your data, set up a new machine, install the same OS and applications, and then restore the backed-up data. With virtualization it is as simple as copying the image and launching an instance out of it on the desired hardware. This happens through virtualization software.

Let us try to understand virtualization software. There are two types:

1. Client based
Here you install an OS over your hardware, and over the OS you install a client (this arrangement is often called a hosted or Type 2 hypervisor). You can then install multiple virtual machines/OSs over that client.

Examples:
Oracle (formerly Sun) VirtualBox
VMware Fusion for Mac

2. Hypervisor based

Typically in this case you have two components:

Hypervisor or Virtual Machine Monitor:

This is just like an OS that you install directly over the hardware. So you don’t install any other OS first; you put this component directly on the hardware. In the VMware world this hypervisor is called ESXi. The hypervisor doesn’t do much alone.

Management Software:

For creating VMs you use another piece of software, called management software, which in the VMware world is vSphere. You install vSphere on a machine which is connected to the server/hypervisor over the network, and you administer your server from that machine. Through vSphere you can create VMs. You specify the hard disk and RAM per your requirements and then install the file system and OS on it. Through the management software you can also transfer this image to any hardware which has the hypervisor installed. VMware even offers a feature called vMotion, which enables live migration of running virtual machines from one physical server to another with zero downtime.

Nowadays you also get P2V tools that allow physical-to-virtual migration. They can literally create a virtual machine image from a physical machine.
Examples:
VMware Converter

Acknowledgements:

http://www.youtube.com/watch?v=QYzJl0Zrc4M&feature=related

by Eli The Computer Guy

Sunday, October 9, 2011

Cloud Computing

Today, the condition of the economy is such that almost every organization wants to cut costs and increase profit. For non-IT organizations like banks and insurance companies, one area to focus on for cutting costs is IT. Of course they understand the significance of IT in automating their business functions. But that’s not their primary business, and they would definitely prefer to focus on their core business goals rather than on IT or software development.

Furthermore, they are also realizing that maintaining an in-house IT infrastructure along with the software is an expensive affair. Hardware has its own cost, which is recurring: as soon as an employee joins, you have to get a new machine. Software, be it custom or third party, adds to the cost. Maintaining it is also a headache. And it’s not only about cost; it’s about time too. Every time an employee joins, you have to find a new machine and install the software on it.

What you want is to make these services available to your employees quickly and cost-effectively. The solution?

Look to the cloud...

Instead of getting a powerful machine and installing a suite of software on it, you'd buy a basic machine with a simple application on it. That application would allow employees to log into a Web-based service which hosts all the resources needed by the employees.

The service provides all the software needed by the employee.
This is called cloud computing, and it is gaining more and more popularity because of its lower cost and quick provisioning.

Let’s try to understand the basic concept of cloud and cloud computing further. I have been doing a lot of googling trying to find the real definition of cloud, and after a year of study I arrived at the best definition of the cloud, which I would like to share with you.

“Cloud is the cloudiest term which is used by anybody in any way.”

Almost everybody on earth seems to have a different understanding of the cloud. So finally I thought of understanding the meaning of all the definitions and deriving my own. I will give that in a moment. First, let’s understand some basic concepts.

Let’s begin with a cluster. A cluster is normally a small group of computers connected over a LAN. You typically deploy your web applications across a cluster to get performance and availability, as there is no single point of failure. This is what you would generally set up yourself on your own premises.

Then comes a grid, which in simple terms is a bigger cluster. It involves many interconnected computers which are loosely coupled and geographically dispersed. Here the network connecting them can also be the Internet. Typically you use a grid for complex processing jobs, such as processing satellite signals. In web hosting terms, you might deploy a portal for thousands of users on a grid. If you have a big enough budget you can build a grid for your own needs; otherwise there may be a grid available to the public, and you rent it or a part of it.

When computing resources like a grid, or a grid with some required pre-installed software, are made available to the public like a metered utility, it is called utility computing.

Now some of you might be getting furious at me, wondering what cloud computing is then. Cloud computing is a little more. It uses a grid as infrastructure and also provides computing over it as a utility. What differentiates it from grid computing and utility computing is the on-demand, automated provisioning of computing resources over a network as an abstract service. So I define cloud computing as:

“On-Demand, Automated Provisioning of the Computing Resources over a Network as an Abstract Service.”

Let’s try to understand this definition word by word.

By computing resources I mean both hardware and software.

By On-Demand, Automated Provisioning over a Network as an Abstract Service I mean that you ask for these resources and you get them in real time, and there is an automated system which makes this provisioning possible. Moreover, all the resources are provisioned as a web-based service over a network, typically the Internet, and the service is abstract, which means you never know how it has been provided. You don’t know what’s happening inside the infrastructure to make it possible. The internal details are always hidden.
So, for example, when you get a resource you don’t need to worry about its maintenance or management. You also don’t know the physical location of the resources.

“The grid used for cloud computing is called a cloud.”

Clouds typically follow a pay-per-use or pay-as-you-go model, which means you pay only for what you use. This is interesting. As I mentioned, cloud computing provisions resources on demand, so when you want a resource, you just grab it, use it and then release it, and pay for the usage. So if you use it you pay, and if you don’t use it you don’t pay, as opposed to grids, where you pay even if you don’t use them.
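The difference between paying for usage and paying for capacity is simple arithmetic. The sketch below uses a made-up hourly rate and a made-up usage figure purely for illustration; they are not real prices from any provider.

```python
def pay_per_use_cost(hours_used, rate_per_hour):
    """Cloud style: you are billed only for the hours you actually used."""
    return hours_used * rate_per_hour


def reserved_cost(hours_in_period, rate_per_hour):
    """Rented-capacity style: you are billed for the whole period, used or not."""
    return hours_in_period * rate_per_hour


HOURS_IN_MONTH = 30 * 24   # 720 hours
RATE_PER_HOUR = 0.5        # hypothetical price per machine-hour
hours_used = 100           # hypothetical usage in that month

metered = pay_per_use_cost(hours_used, RATE_PER_HOUR)
fixed = reserved_cost(HOURS_IN_MONTH, RATE_PER_HOUR)
print(f"pay-per-use: {metered}, always-on: {fixed}")
```

With these assumed numbers the metered bill covers 100 machine-hours, while the always-on arrangement charges for all 720 hours in the month.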

Clouds are commonly considered to be of two primary types, public and private, which can also be combined into a hybrid:

Public clouds
These are available to all, organizations and individual users alike.

Private clouds
These are used within an organization, and the organization’s IT people manage them.

Hybrid Clouds & Cloud bursting
Cloud bursting is an application deployment model in which an application uses both a public and a private cloud. Normally the application runs in a private cloud or data centre, but when the demand for computing capacity spikes, it bursts into the public cloud. The cloud formed by the combination of a private and a public cloud is called a hybrid cloud. The advantage of such a hybrid cloud deployment is that an organization only pays for extra compute resources when they are needed.
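Cloud bursting can be sketched as a simple overflow rule. The capacity figure, the demand series and the function name below are hypothetical; a real deployment would also have to deal with data locality, latency and security when work leaves the private cloud.

```python
def place_demand(demand, private_capacity):
    """Split demand (in compute units) between private and public cloud.

    The private cloud is filled first; only the overflow bursts to public.
    """
    on_private = min(demand, private_capacity)
    on_public = max(0, demand - private_capacity)
    return on_private, on_public


PRIVATE_CAPACITY = 100                 # hypothetical units the private cloud can serve
hourly_demand = [40, 80, 130, 90]      # hypothetical demand for four consecutive hours

placements = [place_demand(d, PRIVATE_CAPACITY) for d in hourly_demand]
public_units_billed = sum(public for _, public in placements)
print(placements, public_units_billed)
```

Only the third hour exceeds the private capacity, so that is the only hour in which public cloud resources are used and paid for.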

Sunday, July 3, 2011

Big Data stretching the scope of BI

Sometimes I wonder at how “Business Intelligence” is moving today. Experts in the field are trying their best to stretch the scope of this domain as much as possible. When it comes to storing the data at the back end, we are trying to move as far backward as possible. When it comes to displaying the information, we are moving as far forward as possible.

Remember the days when you used to store data in flat files, possibly on your local system? Then you faced problems of data management and size. You moved further backwards and started to store the data in an RDBMS set up on remote machines and distributed across multiple nodes. Now you see even bigger data, and you find it even more difficult to manage on your own premises. Guess what: now you decide to go further backwards still, possibly off your premises. You start looking for big data storage options located remotely, either in the form of data centres or as clouds. Clouds sound like an attractive option, as we can find many cheaper offerings that provide gigantic storage space with big data processing frameworks already set up; Amazon Elastic MapReduce is a good example. Even if we want to use some other commercial solution for big data processing, setting it up on the cloud should not be a big problem. The safety and security challenges associated with clouds still remain, though. We can still argue for hours about whether moving to clouds is a good strategic decision, just as people do in the enterprise today. Leaving all these arguments and challenges behind, clouds are gaining more and more popularity day by day. So don’t be surprised if your kid modifies his or her understanding of the cloud. Gone are the days when clouds were found only in the sky.

While on one side we see data moving further backwards, on the other side we see information moving further forward. Earlier you used to get information in the form of reports on paper. Somebody prepared the reports for you, got them printed and finally brought them to you. Then you started getting reports on your computer screen by connecting a thick desktop-based viewer on your terminal to the reporting solution. Then you got ad hoc analytics over browsers, which allowed you to play with your data from any location over the web. Now you want real-time, interactive ad hoc analytics on handheld devices: mobiles and tablets. It’s amazing to see BI solutions in the market today that allow you to do real-time ad hoc analytics on your iPad over big data stored in some cloud. It feels great to see the important yet horrible big data appearing in the form of really pretty charts, widgets and dashboards, and that too on devices like iPads. So now you don’t need to worry. Go wherever you want; you are still never far from making important strategic decisions instantly.

Saturday, May 21, 2011

OLAP Over Hadoop

In the last few years Hadoop has really come forward as a massively scalable distributed computing platform. Most of us are aware that it uses MapReduce jobs to perform computation over big data, which is mostly unstructured. Of course such a platform cannot be compared with a relational database storing structured data with a defined schema. While Hadoop allows you to perform deep analytics with complex computations, it seems to lag when it comes to multidimensional analytics over data. You might argue that Hadoop was not even built for such uses. But when users start putting their historical data in Hadoop, they also start expecting multidimensional analytics over it in real time. Here “real time” is really important.

Some of you might think that you can define an OLAP-friendly warehousing star schema using Hive for your data in Hadoop and use a ROLAP tool. But there comes the catch. Even on partially aggregated data, the ROLAP queries will be too slow for real-time OLAP. Because Hive structures the data at read time and every Hive query carries a fixed start-up cost, Hadoop is effectively unusable for real-time multidimensional analytics.

You are left with two options. The first is to aggregate the data in Hadoop and bring the partially aggregated data into an RDBMS. You can then use any standard OLAP tool to connect to your RDBMS and perform multidimensional analytics using ROLAP or MOLAP. While ROLAP will fire queries directly against the database, MOLAP will further summarize and aggregate the multidimensional data in the form of cuboids for a cube.
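To show what "cuboids for a cube" means, here is a small Python sketch that pre-computes one aggregate table for every subset of the dimensions. The fact rows, dimension names and the `build_cube` helper are invented for illustration; MOLAP engines use far more efficient storage and computation strategies than this brute-force loop.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: three dimensions and one measure.
facts = [
    {"region": "East", "product": "A", "year": 2010, "sales": 10},
    {"region": "East", "product": "B", "year": 2010, "sales": 5},
    {"region": "West", "product": "A", "year": 2011, "sales": 7},
    {"region": "West", "product": "A", "year": 2010, "sales": 3},
]
dimensions = ("region", "product", "year")


def build_cube(rows, dims, measure):
    """Return {dimension subset: {key tuple: summed measure}} for all subsets."""
    cube = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                cuboid[key] += row[measure]
            cube[group] = dict(cuboid)
    return cube


cube = build_cube(facts, dimensions, "sales")
print(cube[("region",)])            # sales summed per region
print(cube[("region", "product")])  # sales summed per region and product
```

With three dimensions there are eight cuboids, from the grand total (no dimensions) up to the full-detail one. Answering a slice-and-dice question then becomes a lookup rather than a scan of the raw data.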

The other option is to use a MOLAP tool that can compute the aggregates for the data in Hadoop and fetch the computed cube locally. This will allow you to do truly real-time OLAP. Moreover, if the aggregates can be computed in Hadoop itself, that will make cube computations scalable and fast.

There can be a big fight over the point that Hadoop is not a DBMS, but when Hadoop reaches users and organizations who adopt it just because it is a buzzword, they expect it to do almost anything that a DBMS can do. You should see such solutions growing in the near future.

Sunday, October 17, 2010

Business Insights from Big Data

Data Data Everywhere, but the right analytics needs to be there.

Big Data - The Data Deluge

In today’s world, almost every enterprise is seeing an explosion of data, with a huge amount of digital data generated daily. Almost every growing organization wants to automate most of its business processes and is using IT to support every conceivable business function. This results in a huge amount of data being generated in the form of transactions and interactions. The web has become an important interface for interactions with suppliers and customers, generating a huge amount of data in the form of emails and the like. Besides this, there is a huge amount of data emitted automatically in the form of logs, such as network logs and web server logs.

Telecom service providers get a huge amount of data in the form of conversations and call data records (CDRs). Social networking sites get terabytes of data every day in the form of tweets, blogs, comments, photos and videos. Facebook generates 4 TB of compressed data every day. Web companies like these also get a huge amount of clickstream data generated daily. Hospitals have data about patients and their diseases, as well as the data generated by various medical devices. Sensors in production machinery generate large volumes of event data within seconds. Almost every sector, such as transport and finance, is seeing a tsunami of data.

Such a huge amount of data needs to be stored for various reasons. Sometimes compliance requirements demand that more historical data be stored. Sometimes organizations want to store, process and analyse this data for intelligent decision making to gain a competitive advantage. For example, analyzing CDR data can help a service provider understand its quality of service and make the necessary improvements. A credit card company can analyze customer transactions for fraud detection. Server logs can be analyzed for fault detection. Web logs can help understand user navigation patterns. Customer emails can help understand customer behavior and interests, and sometimes problems with the products as well.

The important question that arises at this point is how we store and process such a huge amount of data, most of which is semi-structured or unstructured.

Big Data Storage & Processing

Let’s look at the purpose-built storage options that allow you to store and process big data in a scalable, fault-tolerant and efficient manner. This has been the most innovative sector of the business intelligence industry: database vendors, both new and old, have shipped a number of new products for big data storage and processing in the last few years. A lot of progress has also been made on open source platforms. Here is a high-level categorization of these products.

The first category includes massively parallel processing (MPP) data warehouses, which are designed to store huge amounts of structured data across a cluster of servers and perform parallel computations over it. Most of these solutions follow a shared-nothing architecture, which means that every node has its own dedicated disk, memory and processor. All the nodes are connected via high-speed networks. As they are designed to hold structured data, you would generally use an ETL tool to extract the structure from the data and populate these data stores with the structured data.
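The shared-nothing idea can be illustrated with a toy Python sketch: rows are partitioned across nodes, each node aggregates only its own rows, and a coordinator merges the partial results. The row layout, the modulo partitioning rule and the function names are assumptions for this example, and the "nodes" here run one after another in a single process rather than in parallel on separate servers.

```python
from collections import Counter

# Hypothetical structured rows, as an ETL step might have produced them.
rows = [
    {"customer_id": 1, "region": "East", "amount": 10},
    {"customer_id": 2, "region": "West", "amount": 20},
    {"customer_id": 3, "region": "East", "amount": 5},
    {"customer_id": 4, "region": "West", "amount": 1},
    {"customer_id": 5, "region": "East", "amount": 2},
]


def partition(all_rows, node_count):
    """Assign each row to exactly one node, so no node shares data with another."""
    nodes = [[] for _ in range(node_count)]
    for row in all_rows:
        nodes[row["customer_id"] % node_count].append(row)
    return nodes


def local_sum(node_rows):
    """Work done independently on one node: sum amount per region."""
    totals = Counter()
    for row in node_rows:
        totals[row["region"]] += row["amount"]
    return totals


def merge(partials):
    """Coordinator step: add up the per-node partial sums."""
    result = Counter()
    for partial in partials:
        result.update(partial)
    return dict(result)


nodes = partition(rows, node_count=3)
totals = merge(local_sum(node_rows) for node_rows in nodes)
print(totals)
```

The result is the same as a single-machine GROUP BY would give, but each node only ever touched its own slice of the data.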

These MPP Data Warehouses include:

MPP Databases — these are generally the distributed systems designed to run on a cluster of commodity servers.

Examples: Aster nCluster, Greenplum, DATAllegro, IBM DB2, Kognitio WX2, Teradata etc

Appliances — a purpose-built machine with preconfigured MPP hardware and software designed for analytical processing.

Examples: Oracle Optimized Warehouse, Teradata machines, Netezza Performance Server and Sun’s Data Warehousing Appliance

Columnar Databases — they store data in columns instead of rows, allowing greater compression and faster query performance.

Examples: Sybase IQ, Vertica, InfoBright Data Warehouse, ParAccel

Most of them provide SQL and user-defined functions (UDFs) to process the data.
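Why a columnar layout compresses well and answers aggregate queries quickly can be seen in a few lines of Python. The sample rows are invented, and run-length encoding is used here only as one simple example of a compression scheme; actual columnar databases combine several encodings.

```python
from itertools import groupby

# Hypothetical rows: (date, country, amount), as a row store would hold them.
rows = [
    ("2010-10-17", "IN", 120),
    ("2010-10-17", "IN", 80),
    ("2010-10-17", "US", 200),
    ("2010-10-18", "US", 50),
]

# Column layout: one list per column instead of one tuple per row.
dates, countries, amounts = (list(column) for column in zip(*rows))


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


encoded_dates = run_length_encode(dates)
encoded_countries = run_length_encode(countries)

# An aggregate over one column reads only that column.
total_amount = sum(amounts)
print(encoded_dates, encoded_countries, total_amount)
```

Because values within one column tend to repeat, four dates shrink to two pairs here, and summing the amounts never touches the date or country data at all.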

Another category includes distributed file system based platforms like Hadoop, which allow us to store huge amounts of unstructured data and perform MapReduce computations on it over clusters built of commodity hardware.
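For readers who have not seen MapReduce before, here is the classic word count written as plain Python functions. This is only a single-process sketch of the programming model; it does not use Hadoop's API, and the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data about data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a real cluster the map calls run on the nodes that hold the data blocks and the reduce calls run in parallel per key, which is what makes the model scale.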

Real Time Deep Analytics from Unstructured data

One of the biggest challenges while dealing with Big Data Analytics is Unstructured Data. As we saw earlier, most of the big data generated is semi structured or unstructured. Structured data is inherently relational and record oriented with a defined schema which makes it easy to query and analyze. However to analyze unstructured data first you need to extract structure from it.
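As a small illustration of "extracting structure", the sketch below turns one web server log line into a record with named, typed fields. The log line is invented and the regular expression assumes an Apache-style common log format; other log formats would need a different pattern.

```python
import re

# Assumed layout: host, two ignored fields, [timestamp], "METHOD path protocol", status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


line = '10.0.0.1 - - [17/Oct/2010:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_line(line)
print(record)
```

Once every line has been turned into a record like this, the data is record oriented and can be loaded into a table and queried.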

The problem is that the process of structuring the data can itself be very complex, considering the huge amount of data. Sometimes the computations required to structure the data are complex, for example entity extraction from natural language text. Sometimes the data is generated faster than your ETL tool can structure it. Moreover, sometimes you don’t even know what the structure of the data should be. You know that the big unstructured data you have collected holds a lot of value, but you don’t know where that value lies, so it is difficult to structure the data at the time of collection and loading. Instead, you want to delay structuring the data until you actually understand the exact analytics needs.

Another challenge is to carry out complex computations over big data. Sometimes your analysis will involve querying the data for simple summaries and statistics, or multidimensional analytics over big data. Sometimes, however, you want to perform complex computations to carry out deep analytics. You might want your system to mine your data and extract knowledge from it, so that you are not only aware of what has happened or is happening but can also predict what will happen in the future.

Moreover, you always want to keep the latency of analytical queries, that is, the time required to process this huge amount of data, as low as possible. You want to reduce days to hours, hours to minutes and minutes to seconds; you want near real-time analytics. So on one side your data is continuously increasing and on the other you want to reduce the processing time, two contradictory goals.

Approaches for Big Data Analytics

In general, for big data analytics you will need a BI tool over one of the storage options that we discussed earlier. The BI tool provides a visual interface to query the data and extract information and knowledge from it so as to make intelligent decisions. Let us look at the possible approaches one by one:

Direct Analytics over MPP DW

The first approach for big data analytics is using a BI tool directly over any of the MPP DWs. Generally these DWs allow a BI tool to connect to them using a JDBC or ODBC interface, with SQL as the means of getting data for analytics. For any analytical request by the user, the BI tool sends SQL queries to the DW. The DW executes the queries in parallel across the cluster and returns the data to the BI tool for further analytics.

Some of these DWs also allow you to write MapReduce UDFs that can be used within SQL to perform procedural computations over big data in parallel. This is also called in-database analytics: the BI tool does not need to take the data out of the DW to perform complex computations on it; instead, the computations are performed as UDFs inside the database.
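The idea of pushing a computation into the database as a UDF can be tried out on a laptop. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for an MPP DW: SQLite is a single-node embedded database and does nothing in parallel. The table, the data and the `risk_band` scoring rule are all hypothetical.

```python
import sqlite3


def risk_band(amount):
    """Hypothetical procedural rule that plain SQL would express less naturally."""
    if amount >= 1000:
        return "high"
    if amount >= 100:
        return "medium"
    return "low"


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL queries can call it by name.
conn.create_function("risk_band", 1, risk_band)

conn.execute("CREATE TABLE transactions (card TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("c1", 25.0), ("c1", 1500.0), ("c2", 300.0), ("c2", 40.0)],
)

# The client only sends SQL; the UDF runs inside the database engine.
result = conn.execute(
    "SELECT risk_band(amount) AS band, COUNT(*) "
    "FROM transactions GROUP BY band ORDER BY band"
).fetchall()
conn.close()
print(result)
```

The client receives only the small summarized result, not the underlying rows, which is the point of in-database analytics.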

An important point to note here is that the data needs to be structured before a BI tool can do analytics over it. Either you use an ETL tool to extract the structure from it, or you load the unstructured data into a column and use in-database computations in the form of MR functions to structure it.

Generally, if the data is structured then this might prove to be a good approach, as an MPP database enjoys all the performance enhancement techniques of the relational world, such as indexing, aggregations, compression, materialized views and result caching. However, the cost of such a solution is on the higher side, which is worth considering.

Indirect Analytics over Hadoop

Another interesting approach that might suit you is analytics over Hadoop data, not directly but indirectly: first process, transform and structure the data inside Hadoop, then export the structured data into an RDBMS. The BI tool works with the RDBMS to provide the analytics.

Generally one would go for such an approach when the generated data is huge and unstructured, the computations required to derive structure from it are complex and time consuming, and it is possible to partially process and summarize it before doing the actual analytics. In such cases the huge amount of unstructured or semi-structured data can be stored in Hadoop. MR jobs take care of structuring and summarizing it, and the result can then easily be put into any standard RDBMS over which a BI tool can work.

Please note that if the structured and summarized data is still too big to go in an RDBMS then this RDBMS can be replaced by an MPP DW as well. If an RDBMS is used here then with a moderate cost it can provide you real time analytics over your data.
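The indirect pipeline can be sketched end to end in a few lines. Here a plain Python function stands in for the MR job that structures and summarizes raw events, and the built-in sqlite3 module stands in for the RDBMS that the BI tool queries. The event format, table name and function name are assumptions for this example.

```python
import sqlite3
from collections import Counter

# Hypothetical raw events as they might sit in Hadoop: "day path status".
raw_events = [
    "2010-10-17 /home 200",
    "2010-10-17 /home 200",
    "2010-10-17 /cart 500",
    "2010-10-18 /home 200",
]


def summarize(lines):
    """Stand-in for the MR job: structure each line and count hits per (day, path)."""
    counts = Counter()
    for line in lines:
        day, path, _status = line.split()
        counts[(day, path)] += 1
    return [(day, path, hits) for (day, path), hits in sorted(counts.items())]


summary = summarize(raw_events)

# Export step: load the small structured summary into the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (day TEXT, path TEXT, hits INTEGER)")
conn.executemany("INSERT INTO page_hits VALUES (?, ?, ?)", summary)

# The BI tool now issues ordinary SQL against the summarized table.
hits_by_path = conn.execute(
    "SELECT path, SUM(hits) FROM page_hits GROUP BY path ORDER BY path"
).fetchall()
conn.close()
print(hits_by_path)
```

The interactive queries run against the small summary table, so they stay fast no matter how large the raw event store grows.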

Direct Analytics over Hadoop

The last approach is performing analytics directly over Hadoop. In this case all the queries that a BI tool wants to execute against the data are executed as MR jobs over the big unstructured data placed in Hadoop. The complication with this approach is how a BI tool connects to Hadoop, as MR jobs are the only way to process the data there. However, components of the Hadoop ecosystem such as Hive and Pig allow one to work with Hadoop through high-level interfaces. Hive allows you to define a structured meta layer over Hadoop and supports a SQL-like query language called HiveQL. It also implements a JDBC interface that a BI tool can easily use to connect to it. Hive is also extensible enough to allow custom UDFs to work on the data and SerDe classes to structure the data at run time.

Such an approach has a low cost, but it is expected to be a high-latency approach for analytics over big unstructured data, as the data has to be transformed and structured at run time. The good thing, however, is that nobody needs to worry about the data schema and modeling until the analytics needs are clear.

As opposed to the other approaches, here the data is structured at read time rather than write time. So if one has big unstructured data and batch analysis suffices, this is a good solution. One also enjoys the scalability and fault tolerance of Hadoop, and that on a cluster of commodity servers which need not be homogeneous.
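Structuring at read time can be shown with a toy example: the raw lines are stored untouched, and a schema is applied only when a query runs. The `key=value` event format, the field names and the `read_with_schema` helper are invented for illustration and are not how Hive's SerDe classes are actually written.

```python
# Raw events are stored exactly as they arrived; no schema was fixed at write time.
raw_store = [
    "ts=2010-10-17T10:00 user=anna action=login",
    "ts=2010-10-17T10:05 user=ben action=purchase amount=30",
    "ts=2010-10-17T10:07 user=anna action=purchase amount=12",
]


def read_with_schema(lines, fields):
    """Apply a schema while reading: pick the requested fields out of each line."""
    for line in lines:
        pairs = dict(token.split("=", 1) for token in line.split())
        # Fields missing from a line come back as None instead of failing the load.
        yield {name: pairs.get(name) for name in fields}


# The schema is chosen by the query, after the data has already been collected.
purchases = [
    row
    for row in read_with_schema(raw_store, ("user", "action", "amount"))
    if row["action"] == "purchase"
]
revenue = sum(int(row["amount"]) for row in purchases)
print(purchases, revenue)
```

A different question tomorrow can read the same raw lines with a different set of fields, at the price of re-parsing the data on every query.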

Which approach should I go with?

The biggest question that would come to everybody’s mind after reading this blog is which approach he or she should go with.

So as you can see if you want a highly scalable, fault tolerant and low cost solution that allows you to do complex analytics over unstructured data then you might opt to go with Direct Analytics over Hadoop.

If you are looking for an easy to use solution that allows you to do complex and near real time analytics over huge structured data with minimal IT efforts then you might opt to go with Direct Analytics over MPP DW.

Finally if you are looking for a solution that provides you the flexibility in terms of the structure of the data and allows you to do real time analytics over data then you might opt to go with Indirect analytics over Hadoop.

References

URLs:

http://www.asterdata.com/
http://www.cloudera.com/
http://www.impetus.com/
http://www.intellicus.com/

Slides:

http://www.scribd.com/doc/39239619/Business-Insights-From-Big-Data

http://www.slideshare.net/intellicus/business-insightsfrombigdata-5431012

Documents:

Beyond Reporting: Requirements for Large-Scale Analytics
By Wayne W. Eckerson Director, TDWI Research The Data Warehousing Institute

Author: Ankit Khandelwal - Manager, Intellicus Technologies