I recently read about Cassandra concepts and internals to understand how it works and why it is well suited to handling large volumes of data. It is an interesting and complex subject, and I have only scratched the surface so far.
Cassandra is an open source, scalable, and highly available “NoSQL” distributed database management system from Apache. It is classified under the column-family NoSQL category. It was initially developed at Facebook, then open-sourced and handed over to the Apache Software Foundation. Its core design draws on Amazon’s Dynamo (for distribution and replication) and Google’s Bigtable (for the data model).
Its support for dynamic columns and distributed counters addresses a major problem: most statistics can be aggregated as the data arrives, rather than being computed with map/reduce at a later stage.
Another nice property of Cassandra is that, given enough RAM, it can keep a large share of its data in cache.
Cassandra Data Model
The Cassandra data model consists of a keyspace (analogous to a database), column families (analogous to tables in the relational model), keys and columns. Here’s what the basic Cassandra table (also known as a column family) structure looks like:
Figure 1-1: Structure of a super column family in Cassandra
Don’t think of a relational table
Instead, think of a nested, sorted map data structure.
The following relational model analogy is often used to introduce Cassandra to newcomers:
Figure 1-2: Relational vs. Cassandra Model
This analogy helps make the transition from the relational to non-relational world. But don’t use this analogy while designing Cassandra column families. Instead, think of the Cassandra column family as a map of a map: an outer map keyed by a row key, and an inner map keyed by a column key. Both maps are sorted.
SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
A nested sorted map is a more accurate analogy than a relational table, and will help you make the right decisions about your Cassandra data model.
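To make the map-of-maps picture concrete, here is a minimal in-memory sketch in Java. It is only an illustration of the analogy, not how Cassandra stores data internally; the class name ColumnFamilySketch, its methods, and the sample row keys are all hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: an in-memory stand-in for a column family,
// modelled as SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>.
public class ColumnFamilySketch {
    // Outer map: row key -> row. Inner map: column key -> column value.
    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    /** Writes one column, creating the row on first use (no fixed set of columns). */
    public void put(String rowKey, String columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    /** Point lookup by row key and column key; null if either is absent. */
    public String get(String rowKey, String columnKey) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(columnKey);
    }

    /** Returns the whole row (columns sorted by key), or an empty map. */
    public SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilySketch users = new ColumnFamilySketch();
        // Rows in the same column family need not share the same columns.
        users.put("user:42", "name", "Ada");
        users.put("user:42", "email", "ada@example.com");
        users.put("user:7", "name", "Linus");
        users.put("user:7", "twitter", "@linus");

        System.out.println(users.get("user:42", "email")); // ada@example.com
        System.out.println(users.row("user:42").keySet()); // [email, name]
    }
}
```

Note how each row carries its own set of column keys, and how the columns come back sorted by key regardless of insertion order.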
Figure 1-3: Cassandra Data Model
- A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we can use row keys and column keys to do efficient lookups and range scans.
- The number of column keys is unbounded. In other words, you can have wide rows.
- A key can itself hold a value. In other words, you can have a valueless column.
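Note that row keys are kept in sorted order across the cluster only when an order-preserving partitioner is used. With the default hash-based partitioner, it is more accurate to think of the outer map as unsorted:
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>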
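The properties listed above (range scans over sorted column keys, wide rows, and valueless columns) can be sketched on a single row using a plain Java TreeMap. This is a hedged illustration of the idea only; the class WideRowSketch, the helper names, and the timestamp-style column keys are assumptions made for the example, not part of any Cassandra API.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: one wide row modelled as SortedMap<ColumnKey, ColumnValue>.
public class WideRowSketch {

    /** Column slice: all columns whose keys fall in the range [from, to). */
    public static SortedMap<String, String> slice(
            SortedMap<String, String> row, String from, String to) {
        return row.subMap(from, to);
    }

    /** Builds a sample row of user events; a real wide row could hold far more columns. */
    public static SortedMap<String, String> sampleTimeline() {
        SortedMap<String, String> row = new TreeMap<>();
        // Valueless columns: the column key (timestamp + event) carries all
        // the information, so the value is left empty.
        row.put("2013-02-02T21:15:login", "");
        row.put("2013-01-05T10:00:login", "");
        row.put("2013-03-11T07:45:logout", "");
        row.put("2013-01-19T08:30:purchase", "");
        return row;
    }

    public static void main(String[] args) {
        // Because column keys are sorted, selecting one month is a range scan
        // rather than a filter over every column.
        SortedMap<String, String> january = slice(sampleTimeline(), "2013-01", "2013-02");
        System.out.println(january.keySet());
        // [2013-01-05T10:00:login, 2013-01-19T08:30:purchase]
    }
}
```

Choosing column keys that sort in the order you want to read them (here, ISO-style timestamps) is what makes the slice cheap.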
It’s important to think carefully about your data and your technology choices, and sometimes it can be difficult to do that in a data vacuum. Cassandra, Hive, and Hadoop are often considered good tools for tackling many of these data challenges.
Your mileage may vary, but feel free to ask us questions in the comments!