
Migrating Cassandra to Kubernetes: Features and Solutions

We regularly encounter the Apache Cassandra database and the need to operate it as part of a Kubernetes-based infrastructure. In this piece, we will share our view of the necessary steps, the criteria to consider, and the existing solutions (including a review of operators) for migrating Cassandra to K8s.

So what is Cassandra? It’s a distributed storage system designed to manage large amounts of data while providing high availability without a single point of failure. The project hardly needs a long introduction, so I will only list the features of Cassandra that are relevant in the context of this article:

  • Cassandra is written in Java.
  • The topology of Cassandra includes several levels:
      • Node – a single deployed Cassandra instance;
      • Rack – a group of Cassandra instances, grouped by some attribute, located in one data center;
      • Datacenter – the set of all Cassandra instances located in one data center;
      • Cluster – the set of all data centers.
  • Cassandra uses an IP address to identify a node.
  • Cassandra stores part of the data in RAM for fast write and read operations.

Now, on to the actual migration to Kubernetes.

Check-list for migration

When migrating Cassandra to Kubernetes, we expect the move to make it more manageable. What will it take, and what will help along the way?

1. Storage for data.

As already mentioned, Cassandra keeps part of its data in RAM, in the Memtable. The rest of the data is stored on disk as SSTables. To this we should add the Commit Log, a record of all transactions, which is also written to disk.

In Kubernetes, we can use PersistentVolumes to store this data. Thanks to well-developed storage mechanisms, working with data in Kubernetes gets easier every year.

We will assign each Cassandra pod its own PersistentVolume.
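To make this more concrete, below is a minimal sketch of a StatefulSet with volumeClaimTemplates, which is a common way to give every Cassandra pod its own PersistentVolume. The image tag, storage size, and storageClassName are assumptions for illustration only, not a production-ready manifest:

# Minimal sketch: a StatefulSet provisions one PVC (and thus one PersistentVolume) per pod.
# The image tag, storage size and storageClassName are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra            # assumes a headless Service named "cassandra"
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:3.11     # assumed version
          volumeMounts:
            - name: data
              mountPath: /var/lib/cassandra   # SSTables and the Commit Log live here
  volumeClaimTemplates:             # one PersistentVolumeClaim per pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-storage       # assumed StorageClass
        resources:
          requests:
            storage: 100Gi

A StatefulSet also gives each pod a stable name and, together with the headless Service, a stable network identity, which matters because Cassandra identifies nodes by address.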

It’s important to note that Cassandra itself handles data replication, offering built-in mechanisms for this. Therefore, if you are building a Cassandra cluster with a large number of nodes, there is no need to use distributed storage systems like Ceph or GlusterFS. In that case, it makes sense to store data on the node's local disk using local persistent volumes or hostPath mounts.
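For the local-disk approach, here is a hedged sketch of what a local PersistentVolume pinned to a specific node could look like. The node name, path, capacity and StorageClass are placeholders, and the StorageClass is assumed to use the no-provisioner / WaitForFirstConsumer pattern:

# Sketch of a local PersistentVolume bound to a specific node; names and sizes are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cassandra-data-worker-1
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage   # assumed StorageClass with no dynamic provisioner
  local:
    path: /mnt/disks/cassandra      # directory on the node's local disk
  nodeAffinity:                     # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-1          # hypothetical node name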

Things are different if you want a separate development environment for each feature branch. In that case, the right approach is to run a single Cassandra node and store its data in distributed storage; that is where the aforementioned Ceph and GlusterFS come in. Then the developer can be confident that they will not lose test data even if one of the Kubernetes cluster nodes is lost.
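In that scenario, the single node's data could live on a PVC backed by a distributed StorageClass. Here is a hedged sketch; the namespace and the rook-ceph-block class name are assumptions that depend on how Ceph (or GlusterFS) is provisioned in your cluster:

# Sketch: PVC for a single-node dev Cassandra backed by distributed storage.
# The namespace and storageClassName are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cassandra-dev-data
  namespace: feature-branch-env     # hypothetical per-branch namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block # assumed Ceph-backed StorageClass
  resources:
    requests:
      storage: 20Gi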

2. Monitoring

For monitoring in Kubernetes, Prometheus is practically the only choice (we discussed this in detail in a related report). How is Cassandra doing with metrics exporters for Prometheus? And, just as importantly, with Grafana dashboards to match?

There are two main options: the JMX Exporter and the Cassandra Exporter. We chose the former because:

  • The JMX Exporter is actively developed, while the Cassandra Exporter has failed to gain proper community support and still does not support most Cassandra versions.
  • It can be run as a javaagent by adding the -javaagent flag (a sketch follows this list).
  • There is a decent Grafana dashboard for it, which is incompatible with the Cassandra Exporter.
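As a rough illustration of the javaagent approach, here is a container spec fragment that passes the agent via JVM_EXTRA_OPTS, which cassandra-env.sh normally appends to the JVM options. The jar path, config path and port are assumptions, and the agent jar with its config is expected to be baked into the image or mounted from a ConfigMap:

# Fragment of a pod spec: attaching the JMX Exporter to Cassandra as a javaagent.
# Jar path, config path and port are assumptions.
containers:
  - name: cassandra
    image: cassandra:3.11
    env:
      - name: JVM_EXTRA_OPTS        # appended to JVM_OPTS by cassandra-env.sh
        value: "-javaagent:/opt/jmx_prometheus_javaagent.jar=8080:/opt/jmx-exporter.yaml"
    ports:
      - name: metrics
        containerPort: 8080         # Prometheus scrapes this port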
