This blog walks through a checklist for adopting Cassandra: understanding your queries, the type and size of your data, performance, operating costs and disaster recovery. It also covers the do’s and don’ts of Cassandra.
Know your queries beforehand
In a relational DBMS, you design your data model and create tables based on the use case. Depending on the complexity of your analytic queries, you may also create views. The queries you run against these tables or views can then evolve over time.
In Cassandra, you need to know your queries beforehand and build the database accordingly. There is no fixed database structure that you need to adhere to. Because tables are built around queries, the same data is often stored more than once, and Cassandra is designed to handle that redundancy.
Size of Data
Consider whether your existing DBMS is able to fulfill your data needs and, if not, whether Cassandra really justifies a new way of envisioning your data.
PostgreSQL and MySQL can easily handle tables hundreds of gigabytes in size. A typical Cassandra data node can handle about 1TB of data, but the practical limit depends on the workload rather than on data size alone. A node holding more than 1TB with relatively little RAM will do fine if there are no random reads; on the other hand, a node holding less data will see latency increase if it serves a high rate of operations.
Type of Data to store
Cassandra has an advantage over relational systems when it comes to data types. It is very good at handling collection data types.
Setting a value in a collection in Cassandra is effectively an O(1) operation, because Cassandra writes to a commit log and a memtable and returns immediately. In SQL, on the other hand, a collection is typically modeled as a separate table keyed by columns such as SetID and ElementName, and an insert into such a table is an O(log(n)) operation, where n is the number of rows in that table.
Cassandra also naturally accommodates sequential data: on write, data is sorted and written sequentially to disk, and on read it is located first by key and then by range, which makes range reads efficient.
Performance
Cassandra extends the concept of eventual consistency by offering tunable consistency. For any given read or write operation, the client application decides how consistent the requested data must be. Consistency levels in Cassandra can be configured to manage availability versus data accuracy.
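As a rough illustration of tunable consistency, the sketch below computes how many replicas must acknowledge a request at a few common consistency levels and applies the usual rule of thumb that a read is guaranteed to overlap the latest write when read replicas + write replicas > replication factor. The function names are hypothetical helpers written for this post, not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that form a QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge a request at the given consistency level."""
    level = level.upper()
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_sees_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when the read set and the write set must share at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    for read_cl, write_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        print(read_cl, write_cl, read_sees_latest_write(read_cl, write_cl, 3))
```

Lower levels such as ONE favor availability and latency, while QUORUM or ALL favor accuracy at the cost of needing more replicas to be reachable.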
Cassandra writes data first to the commit log and a memtable and returns immediately. The only disk I/O on the write path is a sequential append to the commit log, which results in very high write throughput. When a memtable fills up it is flushed to disk as an SSTable, and compaction later merges SSTables in the background. Reads, by contrast, may suffer depending on several factors such as heavy workloads and disk I/O.
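The write path can be pictured with a toy in-memory model. This is only a sketch under simplifying assumptions (a single node, no tombstones, a tiny memtable limit), and the class name TinyWritePath is made up for illustration; it is not how Cassandra is implemented internally.

```python
class TinyWritePath:
    """Toy model: commit log append + memtable update, flush to SSTables, compaction."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # immutable sorted runs, oldest first

    def write(self, key, value):
        # A write touches only the commit log and the memtable, then returns.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable is written out as one sorted, immutable SSTable.
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.commit_log.clear()  # flushed data no longer needs replay

    def compact(self):
        # Merge all SSTables into one; newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []

    def read(self, key):
        # Reads check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

The model also hints at why reads cost more than writes: a read may have to look through several SSTables until compaction merges them.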
In a system under heavy workload, if you need data to be read consistently immediately after it is written, you need an ACID-capable system.
Operating costs
Cassandra assumes data redundancy by design. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPS, or network) when you are thinking of using Cassandra. To get the most efficient reads, you often need to duplicate data, and a higher replication factor stores more copies of it. A Cassandra cluster is therefore scaled by adding more nodes, which may increase operating costs.
On-premises, there is no difference between a multi-node Cassandra cluster and an RDBMS with multiple read replicas in terms of operational and administrative costs. In the cloud, however, there are many managed services for RDBMSs, such as Amazon RDS, which make it easier to add and modify servers, RAM and cores, and to handle maintenance. Until now Cassandra has had only one such tool, OpsCenter, which can do basic administrative tasks; its enterprise functionality is provided only in the licensed version, which is quite expensive but required for monitoring production clusters.
Disaster Recovery requirement
Cassandra has a peer-to-peer architecture, and since there is no master-slave arrangement there is no single point of failure. Even if some of the nodes are down, the data remains available as long as enough replicas, as determined by the replication factor, are still up. This is very difficult to achieve with a traditional relational system.
Do’s and Don’ts with Cassandra
- Model the database so that data is distributed uniformly around the cluster.
Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So the key to spreading data evenly is this: pick a good primary key. For example:
CREATE TABLE person (person_id int primary key, fname text, lname text, dateofbirth timestamp, email text, phone text );
Here, person_id is chosen as the primary key because it uniquely identifies a row, whereas fname could easily have the same value in more than one row; the same is true of the other columns. person_id is also the partition key, which decides which partition the row resides in. Hence you should choose your primary key wisely.
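To see why a high-cardinality key such as person_id spreads rows evenly while a column such as fname does not, here is a minimal sketch that hashes partition keys onto nodes. It is a simplification under stated assumptions: MD5 stands in for Cassandra's actual partitioner, nodes are chosen by a simple modulo rather than by token ranges, and the function names are hypothetical.

```python
import hashlib
from collections import Counter


def node_for(partition_key, num_nodes):
    """Map a partition key to a node index via a hash (MD5 as a stand-in)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return token % num_nodes


def rows_per_node(partition_keys, num_nodes):
    """Count how many rows land on each node for the given partition keys."""
    return Counter(node_for(key, num_nodes) for key in partition_keys)


if __name__ == "__main__":
    # Unique ids spread across all nodes; two repeated names pile up on few nodes.
    print("person_id:", dict(rows_per_node(range(10_000), 4)))
    print("fname:    ", dict(rows_per_node(["John", "Mary"] * 5_000, 4)))
```

With only two distinct fname values, all 10,000 rows fall into at most two partitions, no matter how many nodes the cluster has.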
- Design each table so that a query against it gets its result from only one partition.
In practice, this generally means you will use roughly one table per query pattern. If you need to support multiple query patterns, you usually need more than one table.
For example, in a user database, if we want to get the full details of a user by looking up either the username or the email, it is best to use two tables:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
CREATE TABLE users_by_email (
email text PRIMARY KEY,
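The one-table-per-query-pattern idea can be sketched in plain Python with two dictionaries standing in for the two tables. This is only an analogy: the class UserDirectory and the full_name field are hypothetical, and a real application would issue writes to both Cassandra tables through a driver.

```python
class UserDirectory:
    """Two lookup structures holding the same user data, one per query pattern."""

    def __init__(self):
        self.users_by_username = {}  # plays the role of the users_by_username table
        self.users_by_email = {}     # plays the role of the users_by_email table

    def add_user(self, username, email, **details):
        # The record is deliberately duplicated: each "table" gets its own copy.
        record = {"username": username, "email": email, **details}
        self.users_by_username[username] = dict(record)
        self.users_by_email[email] = dict(record)

    def find_by_username(self, username):
        # Single-key lookup, analogous to reading one partition.
        return self.users_by_username.get(username)

    def find_by_email(self, email):
        return self.users_by_email.get(email)


if __name__ == "__main__":
    directory = UserDirectory()
    directory.add_user("jdoe", "jdoe@example.com", full_name="J. Doe")
    print(directory.find_by_username("jdoe"))
    print(directory.find_by_email("jdoe@example.com"))
```

Each lookup touches exactly one key, at the cost of writing the data twice, which is the trade-off the two-table design accepts.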
- Prefer SSDs, and keep your data directory and commit log directory on different physical disks, or at least on different partitions.
SSDs, although a little more expensive, are preferable to spinning disks because they provide extremely low-latency response times for random reads while supplying ample sequential write performance for compaction operations. Commit log writes in Cassandra are append-only (sequential I/O) and are used for data recovery. The data directory, on the other hand, mostly sees random reads. These I/O patterns may conflict, which is why DataStax recommends keeping the two separate.
- Provision 16-64GB of RAM for a production environment, with 8GB as the minimum; a development environment should have at least 4GB of RAM.
The larger the RAM, the fewer the page loads, the larger the cache, the lower the disk I/O and the fewer the SSTable flushes, all of which result in higher read throughput.
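- Run a complete repair of every token range around your ring at least once within gc_grace_seconds, for better read consistency and so that tombstones are not purged before every replica has seen the corresponding delete.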
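A quick way to sanity-check a repair schedule is to compare the worst-case gap between two completed repairs with gc_grace_seconds. The sketch below assumes the default gc_grace_seconds of 864000 (10 days); the function name and its parameters are hypothetical and only illustrate the arithmetic.

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default: 10 days

SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def repair_schedule_is_safe(repair_interval_days, repair_duration_hours,
                            gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True if a repair started every N days still finishes within gc_grace_seconds."""
    worst_case_gap = (repair_interval_days * SECONDS_PER_DAY
                      + repair_duration_hours * SECONDS_PER_HOUR)
    return worst_case_gap <= gc_grace_seconds


if __name__ == "__main__":
    # Weekly repair taking 12 hours, checked against the default grace period.
    print(repair_schedule_is_safe(7, 12))
    # Repair every 10 days taking 6 hours overshoots the default grace period.
    print(repair_schedule_is_safe(10, 6))
```

If the check fails, either repair more often or raise gc_grace_seconds on the affected tables.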
- To add multiple nodes to a Cassandra (C*) cluster simultaneously, turn off consistent range movements before you add them by starting Cassandra with the -Dcassandra.consistent.rangemovement=false property. Once the nodes have been added, set it back to true.
- Use the leveled compaction strategy when read performance is of primary importance and you are using SSDs.
- Do not assign the counter type to a column that serves as the primary key.
Cassandra does not allow a counter column to be part of the primary key, and a counter column cannot be indexed either. Hence counter columns should not be considered for the primary key or for a secondary index.
- Do not let a column value approach 2GB.
2GB is the hard limit for a single column value, and a partition can hold at most 2 billion cells. A large blob therefore has to be split into chunks stored across multiple columns or rows, and reading it back touches a large number of cells or partitions, which hurts read performance.
- Do not keep compression activated if CPU capacity is limited and RAM cannot be increased any further.
Deactivating compression can cost you read and write performance, however, so as far as possible try to increase RAM first to meet your performance requirements.
- Do not use virtual nodes (vnodes) in a DSE Hadoop datacenter; use single-token-per-node instead.
DataStax Enterprise turns off virtual nodes (vnodes) by default because using vnodes causes a sharp increase in the Hadoop task scheduling latency. This increase is due to the number of Hadoop splits, which cannot be lower than the number of vnodes in the analytics datacenter.
You can use vnodes for any Cassandra-only cluster, a Cassandra-only datacenter, a Spark datacenter, or a Search-only datacenter in a mixed Hadoop/Search/Cassandra deployment.
- Do not create a secondary index on high-cardinality columns (for example, fields with more than 100 states).
If you create a secondary index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. It is usually better to maintain a separate table as a form of index instead of using the Cassandra built-in index.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values.
- Do not use the row cache unless your use case is highly read-intensive.
The row cache is similar to a traditional cache like Memcached. When a row is accessed, the entire row is pulled into memory, merging from multiple SSTables if necessary, and cached, so that further reads against that row can be satisfied without any more disk I/O.
Do not use the row cache unless you are very sure you will need to read the complete row at once and frequently. Inappropriate use of the row cache can put heavy pressure on memory and may lead to Cassandra failures.
Use cases best suited for Cassandra
- Product Catalog/Playlists
- Fraud Detection
- IOT/Sensor data
Use-Cases unsuitable for Cassandra
- Reporting cases