What is Cassandra and How Does It Help with Big Data?

As businesses increasingly need to ingest and analyze petabytes of data at very high transaction rates, they need a robust way to manage that data efficiently.
Apache Cassandra can be an excellent fit when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data. Its feature set includes:
• Big data scalability – NoSQL database architected for big data
• Peer-to-peer – all nodes are the same
• Elastic scalability – easy online scale-out with automatic data balancing/sharding
• No single point of failure – continuously available
• Location independence – read and write data from and to any location
• Dynamic schema – more flexible than relational database management systems
• Tunable data consistency – can be decided on a per-operation basis
• Data compression – saves storage and increases performance
• Cloud-ready – helps realize the benefits of cloud computing
• SQL-like language (CQL) – easy to use
• Memory efficient – can remove the need for separate memory caching software (e.g. memcached)

Cassandra’s Data Model

Cassandra provides a structured key-value store with tunable consistency. Each key maps to multiple values, which are grouped into column families. This places Cassandra between a column-oriented DBMS and a row-oriented store, making it a hybrid data management system.
An instance of Cassandra has a keyspace, which is made up of one or more column families. The column families are fixed when a Cassandra database is created, but columns can be added to a column family at any time. Columns can also be added only to specified keys, so different keys can have different numbers of columns in any given family. The values from a column family for each key are stored together.
A column is a construct with a name, a value, and a user-defined timestamp. A column family can hold a very large number of columns, and the number can vary from key to key.
Each column family contains one of two structures: columns or super columns. A super column is a construct with a name and an effectively unbounded number of columns associated with it. The number of super columns in a column family is likewise effectively unbounded and can vary per key. In other respects, super columns exhibit the same characteristics as columns.
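The layout described above can be pictured as nested maps. The following is only a minimal in-memory sketch in Python, not Cassandra's storage format or API: the `keyspace` dictionary, the `put` helper, and the sample column families and rows are all hypothetical names chosen for illustration.

```python
import time

# Hypothetical in-memory model: keyspace -> column family -> row key -> columns.
# Each column is stored as name -> (value, timestamp).
keyspace = {"users": {}, "orders": {}}  # column families are fixed up front


def put(column_family, row_key, name, value, timestamp=None):
    """Add or overwrite one column for one row key; the newest timestamp wins."""
    if column_family not in keyspace:
        raise KeyError(f"unknown column family: {column_family}")
    ts = time.time_ns() if timestamp is None else timestamp
    row = keyspace[column_family].setdefault(row_key, {})
    current = row.get(name)
    if current is None or ts >= current[1]:
        row[name] = (value, ts)


put("users", "alice", "email", "alice@example.com", timestamp=1)
put("users", "alice", "city", "Austin", timestamp=1)
put("users", "bob", "email", "bob@example.com", timestamp=1)  # bob has no "city"
put("users", "alice", "city", "Boston", timestamp=2)   # newer write replaces older
put("users", "alice", "city", "Chicago", timestamp=0)  # stale write is ignored

print(sorted(keyspace["users"]["alice"]))  # ['city', 'email']
print(sorted(keyspace["users"]["bob"]))    # ['email']
```

Note how the two row keys in the same column family end up with different sets of columns, and how the timestamp decides which value of a column survives.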

How Cassandra reads and writes data onto its nodes

• Cassandra distributes data among its nodes transparently to the users. Any node can accept any request (read, write, or delete) and route it to the correct node even if the data is not stored in that node.
• Cassandra handles replica creation and management transparently.
• Tunable consistency: When storing and reading data, users can choose the desired consistency level for each operation.
• Cassandra provides very fast writes, which are actually faster than its reads; it can transfer about 80-360 MB/sec per node. It achieves this using two techniques:
o Cassandra keeps most of the data in memory at the responsible node; updates are applied in memory and then written to persistent storage.
o Unless a write requests full consistency, Cassandra writes the data to enough nodes to satisfy the request without resolving inconsistencies at write time; it resolves them at the first read.
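The behaviour in the list above (routing a request to the nodes that own a key, acknowledging a write once enough replicas respond, and fixing stale replicas on read) can be sketched as a toy simulation. This is a simplified illustration under stated assumptions, not Cassandra's implementation: the `Node` and `Cluster` classes, the MD5-based token, and the `required_acks` / `required_responses` parameters are hypothetical, and the sketch leaves out on-disk storage and many other details of a real cluster.

```python
import hashlib
from bisect import bisect_right


class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}   # key -> (value, timestamp); stands in for in-memory storage
        self.up = True


class Cluster:
    def __init__(self, names, replication_factor=3):
        self.rf = replication_factor
        # Place each node on a hash ring at a token derived from its name.
        self.ring = sorted(((self._token(n), Node(n)) for n in names),
                           key=lambda pair: pair[0])

    @staticmethod
    def _token(text):
        return int(hashlib.md5(text.encode()).hexdigest(), 16)

    def replicas(self, key):
        """Walk clockwise from the key's token and take the next rf nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, self._token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]

    def write(self, key, value, timestamp, required_acks):
        """Send the write to every live replica; fail if too few acknowledge."""
        acks = 0
        for node in self.replicas(key):
            if node.up:
                if node.data.get(key, (None, -1))[1] <= timestamp:
                    node.data[key] = (value, timestamp)
                acks += 1
        if acks < required_acks:
            raise RuntimeError(f"only {acks} of {required_acks} replicas acknowledged")

    def read(self, key, required_responses):
        """Return the newest value seen and repair stale replicas that replied."""
        live = [n for n in self.replicas(key) if n.up][:required_responses]
        if len(live) < required_responses:
            raise RuntimeError("not enough replicas available")
        newest = max((n.data.get(key, (None, -1)) for n in live), key=lambda c: c[1])
        if newest[1] < 0:
            return None                      # no replica has the key
        for node in live:
            if node.data.get(key) != newest:
                node.data[key] = newest      # inconsistency resolved at read time
        return newest[0]


cluster = Cluster(["n1", "n2", "n3", "n4"], replication_factor=3)
owners = cluster.replicas("user:42")
owners[0].up = False                         # one replica is down during the write
cluster.write("user:42", "v1", timestamp=1, required_acks=2)
owners[0].up = True                          # it returns without the update
print(cluster.read("user:42", required_responses=3))  # v1
print(owners[0].data["user:42"])             # ('v1', 1)
```

In this sketch the write succeeds with two of three replicas, and the replica that missed it is only brought up to date when a later read touches it.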

Advantages Of Cassandra

• Minimal Administration.
• No Single Point of Failure.
• Scales Horizontally.
• Writes are durable.
• Consistency is tunable as needed on reads and writes.
• Schema is flexible and can be updated live.
• Handles failures gracefully.
• Replication is easy and rack-aware.
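The trade-off behind tunable consistency comes down to simple arithmetic: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, every read must include at least one replica that saw the latest write. A small sketch, with hypothetical helper names `quorum` and `overlaps`:

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(replication_factor, write_acks, read_responses):
    """True when every read set must share a replica with every write set."""
    return write_acks + read_responses > replication_factor


rf = 3
q = quorum(rf)
print(q)                    # 2
print(overlaps(rf, q, q))   # True: quorum writes plus quorum reads
print(overlaps(rf, 1, 1))   # False: fastest, but a read may miss the latest write
print(overlaps(rf, rf, 1))  # True: write to all replicas, read from one
```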

Deepthi Achanta is a Hadoop-Big Data developer with Bodhtree. Bodhtree specializes in BI and Big Data solutions, providing analytics consulting, implementation, data cleansing, and maintenance services.
