ABIS Infor - 2013-04

Tools for "Big Data"

Peter Vanroose (ABIS) - 10 April, 2013

Abstract

Have you been collecting massive volumes of "corporate" data which are now waiting to be analysed, and did our earlier article about NoSQL whet your appetite? In that case you probably want to get started with Big Data software that is easy to install (and freely available). What follows is a first step in that direction, before you really take the "big" step!

Sketching the context

Apache, a volunteer organisation promoting "Open Source" software, is increasingly profiling itself as the "umbrella" for several important Open Source projects, including a number of "Big Data" tools.

The best-known, but no longer the only, project of the "Apache Software Foundation" is the "Apache web server" (http server) software. More recently, OpenOffice also became an Apache brand, after Sun (OpenOffice's original "host") was absorbed into Oracle, which then took this Open Source alternative to Microsoft Office in a direction that was too commercial.

The de facto standard "base layer" software for Big Data is Hadoop (1), an Apache product. Lots of Big Data software has been built on top of it, including commercial packages; but there are also several good open source ones, all under the Apache umbrella, of which I want to briefly discuss the following four: HBase, Pig, ZooKeeper, and Cassandra.

Hadoop

As a framework for data-intensive distributed applications, Hadoop is based on the so-called MapReduce model, in which an algorithm is subdivided into lots of small pieces of work that can be executed in parallel. This happens on the various computers (the nodes) of a cluster. For storing and exchanging data within the cluster, HDFS is used: the Hadoop Distributed File System.
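
To give an idea of what this looks like in practice: below is a minimal sketch of the classic "word count" job, written against Hadoop's Java MapReduce API (as it looked in the Hadoop 1.x era). The class and variable names are mine, and the input and output HDFS directories are passed as (hypothetical) command line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // "map" step: emit (word, 1) for every word in this node's part of the input
  public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // "reduce" step: sum, per word, all counts emitted by the mappers
  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally, per node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Hadoop runs one map task per HDFS block of the input, in parallel across the nodes, and transparently "shuffles" all pairs with the same word to the same reducer.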

The framework transparently takes care of restarting jobs when one of the nodes fails. As a consequence, everything can safely run on so-called "commodity hardware": cheap (Intel) processors, possibly even (very) old computers. But typically thousands of those per cluster! Hadoop is thus the "glue" that holds the cluster together.

HDFS actually also provides data redundancy: all data is stored multiple times, on different nodes, and transparently replicated, in order to avoid problems with data availability when a node drops out or crashes. Additionally, HDFS tries to make sure that data storage happens as close as possible to the application: on the same node, or at least in the same "rack". This model is especially suited for read-only applications; massive data modifications require much more inter-node data traffic, which then quickly becomes the bottleneck, certainly when using "commodity" network hardware.
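
For completeness, here is a small sketch of how an application talks to HDFS through the Java FileSystem API. The file path is of course hypothetical, and the cluster connection details are assumed to come from the standard Hadoop configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // connection details (NameNode address etc.) come from the Hadoop config files
    FileSystem fs = FileSystem.get(new Configuration());

    // write a small file; HDFS transparently replicates its blocks
    // (3 copies by default) on different nodes of the cluster
    Path file = new Path("/user/demo/hello.txt");  // hypothetical path
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, HDFS");
    out.close();

    // ask HDFS how many times this file is replicated
    short copies = fs.getFileStatus(file).getReplication();
    System.out.println(file + " is stored " + copies + " times");
  }
}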

Nice to know: the name "Hadoop" was chosen in 2005 by its designer Doug Cutting, after his young son's toy elephant. This explains why the logo of Hadoop is indeed a toy elephant (1).

HBase

Hadoop (or actually: HDFS) is able to store massive quantities of data in "flat files" (which can of course be structured; often JSON). But at some point a "real" database could be useful. There are plenty of relational options, but in a distributed context like Hadoop with HDFS it makes more sense to have a database that also operates in a distributed way.

HBase (2) profiles itself as the Hadoop database: distributed, scalable, modular, versioned, and robust against failure of data nodes (since HBase builds on top of HDFS). HBase uses so-called column storage. This is ideal for very large tables: billions of rows with possibly millions of columns. But HBase is not relational: it does not guarantee transactional "ACID"ity (transactions which would be atomic, consistent, isolated, and durable): rollbacks are not possible with HBase; each and every action is a transaction in itself. In the context of Big Data this is common, though, and moreover good for performance.

The API of HBase consists of several Java classes, which can hence be called from within a MapReduce job. HBase (and de facto also Hadoop) essentially assumes that the user implements everything in the Java programming language.
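
As an illustration, here is a minimal sketch of storing and retrieving a single cell through this Java API (as it looked around HBase 0.94). The table name "webtable" and the column family "content" are made-up names for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // the table (with its column family "content") must already exist,
    // e.g. created beforehand in the HBase shell
    HTable table = new HTable(conf, "webtable");

    // store one cell: row key + column family + column qualifier -> value;
    // this single put is, in itself, the complete "transaction"
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
    table.put(put);

    // read the cell back
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
    System.out.println(Bytes.toString(value));
    table.close();
  }
}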

Pig

Hadoop (and HBase) provide a programmatic API, and hence expect the user to access the data at a relatively low level: no "higher-order" 4th-generation language in the style of SQL is provided, not even with HBase.

Pig (3) tries to bridge this gap (as do alternatives like Hive): it provides an SQL-like interface which the Pig compiler translates into a parallelised implementation, suitable for execution as a MapReduce algorithm in Hadoop. Here is an example of a Pig Latin "program" (adapted from (4)), equivalent to the following SQL query:
SELECT word, COUNT(*) FROM wordlist GROUP BY word ORDER BY 2 DESC;

-- first read in the data from one or more HDFS files; then:
word_groups = GROUP wordlist BY word;
word_count = FOREACH word_groups GENERATE COUNT(wordlist) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
-- next write out the word list (with counts) into an HDFS file ...

ZooKeeper

A distributed system like Hadoop with HDFS internally stores the "status" of the different nodes in the cluster: which part of the MapReduce algorithm is running on which node, where which data resides, which nodes are not responding, ...

To keep track of similar node information at a higher (application) level, and of configuration choices for the nodes, ZooKeeper (5) can be used. Remarkable about this configuration and synchronisation software is the fact that it stores its own configuration information in a distributed fashion (with redundancy)! So failing nodes need not be a problem for ZooKeeper.

As an extra bonus, the ZooKeeper interface is very user-friendly: applications use the (Java) API to delegate everything related to group management, availability, etc., thus keeping this logic outside of the actual application logic.
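
As a small illustration, the following sketch registers a worker in the cluster with an "ephemeral" znode, which ZooKeeper automatically removes when the client disappears. The znode names and the address "zkhost:2181" are of course made up.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // connect to the ZooKeeper ensemble (session timeout: 3 seconds)
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) { /* react to state changes */ }
    });

    // make sure the persistent parent znode exists
    if (zk.exists("/workers", false) == null)
      zk.create("/workers", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // register this node with an EPHEMERAL znode: ZooKeeper removes it
    // automatically when our session ends, so failing nodes are detected
    zk.create("/workers/worker-1", "192.168.0.1".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // any client can now list the currently live workers
    for (String worker : zk.getChildren("/workers", false))
      System.out.println("live worker: " + worker);
    zk.close();
  }
}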

Cassandra

As already explained, HBase provides a column-based "NoSQL" database on top of Hadoop. Cassandra (6) has a similar goal: providing a distributed NoSQL database. But as opposed to HBase it is a key-value store. Since this is a more natural data structure than a column store, Cassandra is more popular than HBase. Cassandra also provides a command line interface (CLI) with a syntax somewhat resembling that of Pig (or actually more that of XQuery).
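
By way of illustration, here is a minimal sketch of talking to Cassandra from Java, using the (separately available) DataStax Java driver and its SQL-like query language CQL rather than the CLI; note that this driver is my choice for the example, not something prescribed by Cassandra itself. The keyspace "demo" and the table "users" are made-up names and are assumed to exist.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
  public static void main(String[] args) {
    // connect to one node; the driver discovers the rest of the cluster
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("demo");  // assumes keyspace "demo" exists

    // store and retrieve one row, keyed by "id" (table "users" is made up)
    session.execute("INSERT INTO users (id, name) VALUES ('jsmith', 'John Smith')");
    ResultSet rows = session.execute("SELECT name FROM users WHERE id = 'jsmith'");
    for (Row row : rows)
      System.out.println(row.getString("name"));
    cluster.close();
  }
}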

Cassandra was originally developed by Facebook, but has since also positioned itself under Apache's wings.

Conclusion

Ready to get started with Big Data? Then you will most likely want to start with at least the tools mentioned above. Still not feeling completely comfortable? In that case, maybe have a look at the courses we offer in this context:
www.abis.be/html/enDWCalendar.html .

References:

  1. http://hadoop.apache.org/
  2. http://hbase.apache.org/
  3. http://pig.apache.org/
  4. http://en.wikipedia.org/wiki/Pig_(programming_tool)
  5. http://zookeeper.apache.org/
  6. http://cassandra.apache.org/