Production Readiness Checklist

This checklist covers the key steps for preparing a Cassandra cluster for production workloads. Use it as a guide before going live and as an ongoing reference.

Quick Checklist

Critical (Must Have)

  • [ ] Minimum 3 nodes per datacenter
  • [ ] Replication factor ≥ 3
  • [ ] Authentication enabled
  • [ ] NetworkTopologyStrategy for all keyspaces
  • [ ] SSDs for data storage
  • [ ] Monitoring configured
  • [ ] Backup strategy implemented
  • [ ] Repair schedule configured

Important (Should Have)

  • [ ] TLS encryption enabled
  • [ ] Role-based access control
  • [ ] JVM tuned for workload
  • [ ] OS-level tuning applied
  • [ ] Alerting configured
  • [ ] Runbooks documented
  • [ ] Disaster recovery tested

Hardware Requirements

Minimum Production Specifications

Component   Minimum       Recommended    Notes
CPU         8 cores       16+ cores      More cores = more concurrent operations
RAM         16 GB         32-64 GB       Half for JVM heap, half for OS cache
Storage     500 GB SSD    1-4 TB NVMe    SSDs mandatory for production
Network     1 Gbps        10 Gbps        Low latency critical
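
Before installing Cassandra, a candidate node's core count, memory, and disk type can be confirmed with standard Linux tools; the mount point below assumes the data directory layout used later in this checklist.

# Check CPU core count
nproc

# Check installed RAM
free -g

# Confirm the data disk is an SSD (ROTA=0) and check its size
lsblk -d -o NAME,ROTA,SIZE,MODEL

# Check which filesystem backs the Cassandra data directory
df -h /var/lib/cassandra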

Storage Guidelines

# Verify SSD performance
fio --name=randwrite --ioengine=libaio --iodepth=32 --rw=randwrite \
    --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 \
    --group_reporting --filename=/var/lib/cassandra/test

Expected minimums:

  • Random write IOPS: > 10,000
  • Sequential write: > 200 MB/s
  • Latency (p99): < 1ms

Disk Layout Recommendation

/                       # OS (50-100 GB)
/var/lib/cassandra/     # Cassandra data
├── data/              # SSTables (largest)
├── commitlog/         # Commit log (fast SSD, separate if possible)
├── saved_caches/      # Caches (small)
└── hints/             # Hints (small)
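
If the commit log is placed on its own device, the mount point must exist and be owned by the cassandra user before cassandra.yaml is pointed at it; a minimal sketch, where /dev/nvme1n1 is a hypothetical device name used for illustration.

# Mount a dedicated device for the commit log (device name is an example)
sudo mkdir -p /var/lib/cassandra/commitlog
sudo mount /dev/nvme1n1 /var/lib/cassandra/commitlog

# Ensure the cassandra user owns the whole data tree
sudo chown -R cassandra:cassandra /var/lib/cassandra

# cassandra.yaml then references these paths:
#   data_file_directories:
#       - /var/lib/cassandra/data
#   commitlog_directory: /var/lib/cassandra/commitlog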

Configuration Checklist

cassandra.yaml Critical Settings

# Cluster settings
cluster_name: 'ProductionCluster'  # Cannot change after data written
num_tokens: 16                      # Recommended for new clusters (default in 4.0+)

# Network
listen_address: <private-ip>        # Internal communication
rpc_address: <private-ip>           # Client connections
broadcast_rpc_address: <public-ip>  # Only if behind NAT

# Snitch (always use GossipingPropertyFileSnitch for production)
endpoint_snitch: GossipingPropertyFileSnitch

# Authentication (MUST enable for production)
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
role_manager: CassandraRoleManager

# Performance
concurrent_reads: 32                # 16 × number of drives
concurrent_writes: 32               # 8 × number of cores
concurrent_counter_writes: 32

# Compaction
compaction_throughput_mb_per_sec: 64   # Increase for faster compaction
concurrent_compactors: 2               # Default: min(cores, disk count)

# Commit log
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Timeouts
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
request_timeout_in_ms: 10000

# Hinted handoff
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000  # 3 hours

# Memory
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048
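
Changes to cassandra.yaml only take effect after a restart, so apply them one node at a time and confirm each node comes back healthy before moving to the next; a simple sequence for that check:

# Restart Cassandra and watch the system log for errors during startup
sudo systemctl restart cassandra
sudo tail -f /var/log/cassandra/system.log

# Once startup completes, confirm the node reports Up/Normal and sane settings
nodetool status
nodetool info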

JVM Settings

Edit jvm11-server.options (or jvm-server.options):

# Heap size: 50% of RAM, max 31GB (to use compressed pointers)
-Xms16G
-Xmx16G

# G1GC settings (recommended for Cassandra 4+)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:InitiatingHeapOccupancyPercent=70

# GC logging
-Xlog:gc*:file=/var/log/cassandra/gc.log:time,uptime:filecount=10,filesize=10M
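
After restarting with the new options, the heap size and GC behaviour of the running JVM can be sanity-checked from the shell:

# Confirm the heap the JVM actually picked up
nodetool info | grep -i heap

# Watch GC utilisation for the Cassandra process, sampling every 5 seconds
jstat -gcutil $(pgrep -f CassandraDaemon) 5000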

cassandra-rackdc.properties

dc=dc1
rack=rack1
# prefer_local=true  # Enable for multi-DC
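
These values only take effect after a restart; the datacenter and rack the node advertises can then be confirmed from the ring view:

# Datacenter and rack reported by this node
nodetool info | grep -iE 'data center|rack'

# Datacenter and rack layout of the whole cluster
nodetool status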

Security Configuration

1. Enable Authentication

# cassandra.yaml
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
role_manager: CassandraRoleManager

After enabling, change default credentials:

-- Connect with default credentials (cassandra/cassandra)
cqlsh -u cassandra -p cassandra

-- Create admin user
CREATE ROLE admin WITH PASSWORD = 'strong_password_here'
    AND SUPERUSER = true
    AND LOGIN = true;

-- Create application user
CREATE ROLE app_user WITH PASSWORD = 'app_password_here'
    AND LOGIN = true;

-- Grant permissions
GRANT ALL PERMISSIONS ON KEYSPACE myapp TO app_user;

-- Disable default cassandra user (after verifying admin works!)
ALTER ROLE cassandra WITH PASSWORD = 'new_random_password' AND LOGIN = false;
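
Before locking down the default account, confirm the new roles actually work; a minimal check using the roles created above (the passwords are the placeholders from the statements above):

# Log in as the new admin role and list all roles
cqlsh -u admin -p 'strong_password_here' -e 'LIST ROLES;'

# Confirm the application user has the expected grants
cqlsh -u admin -p 'strong_password_here' -e 'LIST ALL PERMISSIONS OF app_user;'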

2. Enable TLS Encryption

Generate certificates:

# Generate keystore for each node
keytool -genkeypair -alias node1 \
    -keyalg RSA -keysize 2048 \
    -dname "CN=node1.cassandra.local" \
    -validity 365 \
    -keystore /etc/cassandra/conf/.keystore \
    -storepass cassandra \
    -keypass cassandra

# Export certificate
keytool -export -alias node1 \
    -keystore /etc/cassandra/conf/.keystore \
    -file node1.cer \
    -storepass cassandra

# Import to truststore (repeat for each node's cert)
keytool -import -alias node1 \
    -file node1.cer \
    -keystore /etc/cassandra/conf/.truststore \
    -storepass cassandra -noprompt
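
Before distributing the stores to each node, it is worth confirming they contain what is expected:

# List the node's private key entry in the keystore
keytool -list -keystore /etc/cassandra/conf/.keystore -storepass cassandra

# List the trusted certificates (one per node) in the truststore
keytool -list -keystore /etc/cassandra/conf/.truststore -storepass cassandra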

Configure TLS in cassandra.yaml:

# Client-to-node encryption
client_encryption_options:
    enabled: true
    optional: false
    keystore: /etc/cassandra/conf/.keystore
    keystore_password: cassandra
    truststore: /etc/cassandra/conf/.truststore
    truststore_password: cassandra
    require_client_auth: false
    protocol: TLS
    cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA]

# Node-to-node encryption
server_encryption_options:
    internode_encryption: all
    keystore: /etc/cassandra/conf/.keystore
    keystore_password: cassandra
    truststore: /etc/cassandra/conf/.truststore
    truststore_password: cassandra
    require_client_auth: true
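
Once the nodes have been restarted with encryption enabled, the TLS handshake can be checked from any host with OpenSSL; the node address is a placeholder, and 9042 assumes TLS is served on the standard native port (use 9142 if native_transport_port_ssl is configured).

# Confirm the node presents a certificate on the CQL port
openssl s_client -connect 10.0.0.1:9042 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates

# Confirm inter-node TLS is active on port 7001
openssl s_client -connect 10.0.0.1:7001 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates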

3. Network Security

Required firewall ports:

Port   Purpose             Access
7000   Inter-node          Cluster only
7001   Inter-node (TLS)    Cluster only
9042   CQL clients         Application servers
9142   CQL clients (TLS)   Application servers
7199   JMX                 Monitoring only

# UFW example (Ubuntu)
sudo ufw allow from 10.0.0.0/8 to any port 7000
sudo ufw allow from 10.0.0.0/8 to any port 7001
sudo ufw allow from 10.0.0.0/8 to any port 9042
sudo ufw allow from 10.0.1.0/24 to any port 7199  # Monitoring subnet only

OS-Level Tuning

sysctl Settings

Create /etc/sysctl.d/99-cassandra.conf:

# Virtual memory
vm.max_map_count = 1048575
vm.swappiness = 1
vm.dirty_ratio = 80
vm.dirty_background_ratio = 5

# Network
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_slow_start_after_idle = 0

Apply:

sudo sysctl -p /etc/sysctl.d/99-cassandra.conf

Limits Configuration

Create /etc/security/limits.d/cassandra.conf:

cassandra soft memlock unlimited
cassandra hard memlock unlimited
cassandra soft nofile 1048576
cassandra hard nofile 1048576
cassandra soft nproc 32768
cassandra hard nproc 32768
cassandra soft as unlimited
cassandra hard as unlimited

Disable Swap

# Disable swap permanently
sudo swapoff -a
sudo sed -i '/swap/d' /etc/fstab

Disable Transparent Huge Pages

Create /etc/rc.local or systemd service:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
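
Because /etc/rc.local is not present on all modern distributions, a small systemd unit is a more reliable way to apply this at boot; a minimal sketch (the unit name disable-thp.service is arbitrary):

# Create a one-shot unit that disables THP before Cassandra starts
sudo tee /etc/systemd/system/disable-thp.service > /dev/null <<'EOF'
[Unit]
Description=Disable Transparent Huge Pages
Before=cassandra.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp.service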

Monitoring Setup

Key Metrics to Monitor

Metric                Warning   Critical   Notes
Heap usage            > 70%     > 85%      GC pressure
Write latency (p99)   > 10ms    > 100ms    Disk/compaction issues
Read latency (p99)    > 50ms    > 500ms    Disk/data model issues
Pending compactions   > 20      > 50       Compaction falling behind
Dropped messages      > 0       > 100      Timeout/overload
Disk usage            > 60%     > 80%      Plan for capacity
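
Most of these metrics are exposed over JMX for the monitoring system, but they can also be spot-checked on any node with nodetool:

# Thread pool statistics, including dropped messages
nodetool tpstats

# Pending and active compactions
nodetool compactionstats

# Heap usage and general node state
nodetool info

# Recent GC pause statistics
nodetool gcstats

# Per-table read/write latency statistics
nodetool tablestats my_keyspace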

Enable JMX

# cassandra-env.sh
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<node-ip>"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=7199"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
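
With authentication enabled, remote JMX clients (including nodetool run from another host) must supply the credentials from the JMX password file configured in cassandra-env.sh (commonly /etc/cassandra/jmxremote.password, an assumption here); a quick connectivity check with placeholder credentials:

# Verify authenticated remote JMX access
nodetool -h <node-ip> -p 7199 -u <jmx_user> -pw <jmx_password> status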

AxonOps - Purpose-built for Cassandra


Backup Strategy

Snapshot Backups

# Take snapshot of all keyspaces
nodetool snapshot -t backup_$(date +%Y%m%d)

# Take snapshot of specific keyspace
nodetool snapshot -t daily_backup my_keyspace

# List snapshots
nodetool listsnapshots

# Clear old snapshots
nodetool clearsnapshot -t old_backup_name

Backup Schedule

Type          Frequency    Retention   Notes
Snapshot      Daily        7 days      Full backup
Incremental   Hourly       24 hours    Between snapshots
Commitlog     Continuous   24 hours    Point-in-time recovery

Offsite Backup

# Example: Sync snapshots to S3
aws s3 sync /var/lib/cassandra/data/my_keyspace/my_table-*/snapshots/daily_backup \
    s3://my-cassandra-backups/$(date +%Y%m%d)/
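
The daily snapshot row of the schedule above can be automated with a per-node cron job that snapshots the keyspace, ships the snapshot offsite, and clears the local copy; a minimal sketch, with the keyspace name and bucket as placeholders:

#!/bin/bash
# Daily snapshot + offsite sync (run from cron, e.g. 0 2 * * *)
set -euo pipefail

TAG="backup_$(date +%Y%m%d)"
KEYSPACE="my_keyspace"              # keyspace to back up (placeholder)
BUCKET="s3://my-cassandra-backups"  # destination bucket (placeholder)

# 1. Snapshot every table in the keyspace
nodetool snapshot -t "$TAG" "$KEYSPACE"

# 2. Ship each table's snapshot directory to S3, keyed by host and date
for dir in /var/lib/cassandra/data/"$KEYSPACE"/*/snapshots/"$TAG"; do
    table_dir=$(basename "$(dirname "$(dirname "$dir")")")   # e.g. my_table-<uuid>
    aws s3 sync "$dir" "$BUCKET/$(hostname)/$TAG/$table_dir/"
done

# 3. Remove the local snapshot once it is safely offsite
nodetool clearsnapshot -t "$TAG" "$KEYSPACE"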

Repair Schedule

Configure Regular Repairs

Repairs should complete within gc_grace_seconds (default 10 days):

# Run repair on a keyspace (one node at a time)
nodetool repair -pr my_keyspace

# Full repair (all ranges, rarely needed)
nodetool repair -full my_keyspace

Cluster Size   Repair Frequency      Parallelism
3 nodes        Weekly                Sequential
6-12 nodes     Twice weekly          2 parallel
12+ nodes      Daily (incremental)   3+ parallel
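
On a small cluster the weekly cadence above can be met with a staggered cron entry on each node, so that only one node repairs at a time; a minimal sketch (the day-of-week offset and log path are illustrative):

# /etc/cron.d/cassandra-repair on node1 (stagger node2/node3 onto other days)
# Primary-range repair every Sunday at 02:00
0 2 * * 0 cassandra nodetool repair -pr my_keyspace >> /var/log/cassandra/repair.log 2>&1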

Automated Repair Tools

  • AxonOps Repair - Automated scheduling
  • Cassandra Reaper - Open-source repair scheduler

Keyspace Configuration

Always Use NetworkTopologyStrategy

-- WRONG: SimpleStrategy (do not use in production)
CREATE KEYSPACE bad_example WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
};

-- CORRECT: NetworkTopologyStrategy
CREATE KEYSPACE good_example WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3
};

-- Multi-datacenter
CREATE KEYSPACE multi_dc_example WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
};

Migrate Existing Keyspaces

ALTER KEYSPACE my_keyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3
};

-- Then run repair to ensure data is replicated correctly
-- nodetool repair my_keyspace

Pre-Launch Verification

1. Cluster Health

# All nodes UN (Up/Normal)
nodetool status

# Schema agreement
nodetool describecluster

# No pending compactions backlog
nodetool compactionstats

2. Performance Baseline

# Run stress test
cassandra-stress write n=1000000 -rate threads=50 -node 10.0.0.1

# Review latencies
cassandra-stress read n=1000000 -rate threads=50 -node 10.0.0.1

3. Failover Test

# Stop one node
sudo systemctl stop cassandra

# Verify cluster still operates (RF=3, CL=QUORUM)
cqlsh 10.0.0.2
SELECT * FROM system.local;

# Start node back
sudo systemctl start cassandra

# Verify it rejoins
nodetool status

4. Backup/Restore Test

# Take snapshot
nodetool snapshot -t test_backup my_keyspace

# Simulate data loss (on test cluster only!)
# cqlsh: TRUNCATE my_table;

# Restore from snapshot
# (copy snapshot files back to data directory)

# Verify data
cqlsh: SELECT COUNT(*) FROM my_table;
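
The restore step can be made concrete with nodetool refresh, which loads SSTables copied into a table's live data directory without a restart; a minimal sketch, assuming the test_backup snapshot above and a single my_table-<uuid> directory:

# Copy the snapshot's SSTables back into the live table directory
sudo cp /var/lib/cassandra/data/my_keyspace/my_table-*/snapshots/test_backup/* \
    /var/lib/cassandra/data/my_keyspace/my_table-*/
sudo chown -R cassandra:cassandra /var/lib/cassandra/data/my_keyspace

# Tell Cassandra to pick up the copied SSTables
nodetool refresh my_keyspace my_table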

Documentation Requirements

Ensure documentation exists for:

  • [ ] Network topology diagram
  • [ ] Node inventory (IPs, DCs, racks)
  • [ ] Keyspace replication settings
  • [ ] Backup procedures and schedules
  • [ ] Restore procedures (tested!)
  • [ ] Scaling procedures
  • [ ] On-call runbooks
  • [ ] Contact information for support

Next Steps

After completing this checklist:

  1. Operations Guide - Day-to-day procedures
  2. Monitoring Setup - Detailed monitoring configuration
  3. Troubleshooting - Common issues and solutions
  4. Performance Tuning - Optimize for the workload