Big data showdown: Cassandra vs. HBase

Published April 2, 2014

Rick Grehan

In this brave new world of big data, a database technology called “Bigtable” would seem to be worth considering — particularly if that technology is the creation of engineers at Google, a company that should know a thing or two about managing large quantities of data. If you believe that, two Apache database projects — Cassandra and HBase — have you covered.

Bigtable was originally described in a 2006 Google research publication. Interestingly, that paper doesn’t describe Bigtable as a database, but as a “sparse, distributed, persistent multi-dimensional sorted map” designed to store petabytes of data and run on commodity hardware. Rows are uniquely indexed, and Bigtable uses the row keys to partition data for distribution around the cluster. Columns can be defined within rows on the fly, making Bigtable for the most part schema-less.
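To make the "map" description concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how Bigtable, Cassandra, or HBase is implemented: the class name `SparseTable`, its methods, and the sample row and column names are all hypothetical, and the range-based `partition` method is an assumed stand-in for how row keys might be used to spread rows across a cluster.

```python
from bisect import bisect_right, insort


class SparseTable:
    """Toy model of a Bigtable-style map: (row, column, timestamp) -> value.

    Hypothetical illustration only; nothing here is a real Bigtable API.
    """

    def __init__(self):
        self._rows = {}      # row key -> {column name -> {timestamp -> value}}
        self._row_keys = []  # row keys kept in sorted order

    def put(self, row, column, value, timestamp):
        # Columns are created on first write; no schema is declared up front.
        if row not in self._rows:
            self._rows[row] = {}
            insort(self._row_keys, row)
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        # Return the most recent version of a cell, or None if it is absent.
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def columns(self, row):
        # Only columns that were actually written exist for a row (sparseness).
        return sorted(self._rows.get(row, {}))

    def partition(self, split_keys):
        # Assumed scheme: group rows into contiguous key ranges delimited by
        # the sorted split keys, one range per (imaginary) server.
        ranges = [[] for _ in range(len(split_keys) + 1)]
        for key in self._row_keys:
            ranges[bisect_right(split_keys, key)].append(key)
        return ranges


table = SparseTable()
table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
table.put("org.apache/hbase", "anchor:home", "HBase", timestamp=1)
```

In this sketch, the two rows hold entirely different columns, and reading `contents:html` for the first row returns the newer of its two versions. Calling `table.partition(["n"])` places the two rows in separate key ranges, which is the sense in which row keys can drive distribution.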

Cassandra and HBase have borrowed much from the original Bigtable definition. In fact, whereas Cassandra descends from both Bigtable and Amazon’s Dynamo, HBase describes itself as an “open source Bigtable implementation.” As such, the two share many characteristics, but there are also important differences.

Born for big data

Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL.