Backup and Restore Overview¶
Why Backups Are Necessary¶
A common misconception persists in distributed database operations: "Cassandra replicates data across nodes, so backups are unnecessary." This reasoning conflates two fundamentally different concepts—high availability and data protection.
Replication provides high availability: if one node fails, other replicas serve requests without interruption. However, replication faithfully propagates all changes to all replicas, including destructive ones. When an application bug deletes data, that deletion replicates. When an operator runs DROP TABLE, the table disappears from all nodes simultaneously. Replication ensures consistency—it ensures all replicas reflect the same state, whether that state is correct or catastrophically wrong.
Backups provide data protection: the ability to recover to a known good state after data loss, corruption, or disaster. Backups exist outside the replication system, immune to changes propagating through the cluster.
Disaster Scenarios¶
Understanding the range of potential disasters helps define appropriate backup strategies. Disasters fall into three broad categories: infrastructure failures, human errors, and external threats.
Infrastructure Failures¶
Single Node Failure
The most common failure mode. Causes include:
- Disk failure (SSDs have finite write endurance; HDDs have mechanical failures)
- Memory errors (ECC can correct some; others cause crashes)
- Power supply failure
- Motherboard or CPU failure
- Operating system corruption
- Network interface failure (node appears down to cluster)
With RF=3, single node failures are non-events for availability—other replicas serve requests. However, the failed node's data must be rebuilt, either from backup or by streaming from other replicas.
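The mechanics of rebuilding a failed node are covered in the restore documentation, but a minimal sketch of the replica-streaming path looks like the following. The IP address, config path, and service name are illustrative and vary by installation:

```bash
# Sketch: replace a dead node by streaming its token ranges from surviving replicas.
# 10.0.0.5 is a placeholder for the failed node's address; cassandra-env.sh location varies.
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"' \
  >> /etc/cassandra/cassandra-env.sh

sudo systemctl start cassandra   # replacement node bootstraps and streams the dead node's data

# Monitor streaming progress from any live node:
nodetool netstats
```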
Rack Failure
Multiple nodes fail simultaneously due to shared infrastructure:
- Top-of-rack switch failure (all nodes in rack lose network)
- PDU (Power Distribution Unit) failure (all nodes lose power)
- Cooling failure in rack zone (thermal shutdown)
- Shared storage failure (if using SAN/NAS)
- Cable tray damage (fire, water, physical impact)
Rack-aware replication (NetworkTopologyStrategy with rack configuration) ensures replicas span racks. A rack failure should not cause data unavailability, but rebuilding an entire rack strains cluster resources.
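Rack awareness comes from a rack-aware snitch (for example, GossipingPropertyFileSnitch) combined with NetworkTopologyStrategy. A minimal sketch, with illustrative keyspace and datacenter names:

```bash
# Sketch: NetworkTopologyStrategy places the 3 replicas per DC on distinct racks where possible.
cqlsh -e "CREATE KEYSPACE app_data
          WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
```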
Datacenter Failure
Complete loss of a datacenter:
- Power grid failure affecting the facility
- Network connectivity loss (upstream provider failure, fiber cut)
- Natural disasters (earthquake, flood, hurricane, fire)
- Cooling system failure (facility-wide thermal event)
- Building access restrictions (civil unrest, pandemic, legal action)
Multi-datacenter deployments with NetworkTopologyStrategy survive DC failures. Single-DC deployments face complete outage until the datacenter recovers or data is restored elsewhere.
Region Failure
Geographic-scale events affecting multiple datacenters:
- Regional power grid failure
- Major natural disaster (earthquake affecting multiple facilities)
- Regional network backbone failure
- Political or regulatory action affecting a jurisdiction
Multi-region deployments provide protection, but most organizations run Cassandra within a single region for latency reasons.
Human Errors¶
Human error causes more data loss incidents than hardware failures. Unlike hardware failures, human errors typically affect all replicas simultaneously—replication provides no protection.
Accidental Data Deletion
-- Intended: Delete inactive users from staging
DELETE FROM staging.users WHERE active = false;
-- Actual: Connected to production
DELETE FROM production.users WHERE active = false;
The deletion replicates to all nodes. By the time the error is discovered, all replicas consistently reflect the data loss.
Accidental Schema Changes
-- Intended: Drop unused table in development
DROP TABLE dev_keyspace.temp_analytics;
-- Actual: Dropped production table
DROP TABLE prod_keyspace.analytics;
Schema changes are immediate and cluster-wide. The auto_snapshot feature (enabled by default) creates a snapshot before DROP operations, providing a recovery path—but only if the operator knows it exists and acts before the snapshot is cleared.
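A hedged sketch of that recovery path follows. Table and keyspace names are illustrative, the snapshot tag format varies by Cassandra version, and the table directory UUIDs are placeholders:

```bash
# Sketch: recovering a dropped table from its auto_snapshot.
nodetool listsnapshots | grep analytics      # locate the snapshot taken at DROP time

# 1. Recreate the table from the saved schema (e.g. cqlsh -f schema.cql or the original DDL).
# 2. Copy the snapshot's SSTables into the new table directory and load them:
cp /var/lib/cassandra/data/prod_keyspace/analytics-<old_uuid>/snapshots/<tag>/* \
   /var/lib/cassandra/data/prod_keyspace/analytics-<new_uuid>/
nodetool refresh prod_keyspace analytics
```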
Destructive Maintenance Operations
# Intended: Clean up old snapshots on staging node
nodetool clearsnapshot -t old_backup
# Actual: Ran on production, cleared critical backup
# Intended: Remove test keyspace data
rm -rf /var/lib/cassandra/data/test_keyspace
# Actual: Typo removed production data
rm -rf /var/lib/cassandra/data/prod_keyspace
Application Bugs
- Code deployed to production with incorrect DELETE or UPDATE logic
- Migration scripts with bugs affecting production data
- Race conditions causing data corruption
- Serialization bugs writing malformed data
Application-level corruption is particularly insidious: the bad data replicates normally, and the problem may not be detected until significant damage has occurred.
Configuration Errors
- Incorrect gc_grace_seconds causing premature tombstone removal
- Wrong replication factor leaving data under-replicated
- Misconfigured compaction causing data loss during cleanup
- Authentication changes locking out all users
External Threats¶
Ransomware and Malware
Malicious software encrypting or deleting database files. Ransomware specifically targets backup systems to prevent recovery, making off-site, air-gapped backups essential.
Security Breaches
Attackers with database access may:
- Delete data to cover tracks
- Corrupt data as sabotage
- Exfiltrate and then delete data
- Modify data for fraud
Insider Threats
Malicious actions by employees or contractors with legitimate access:
- Deliberate data destruction (disgruntled employee)
- Data theft followed by deletion
- Sabotage during organizational disputes
Backup and Recovery Theory¶
Enterprise backup strategies are defined by measurable objectives that align technical capabilities with business requirements.
Recovery Point Objective (RPO)¶
RPO defines the maximum acceptable data loss, measured in time.
If RPO is 4 hours, the organization accepts losing up to 4 hours of data in a disaster. This drives backup frequency:
| RPO Target | Backup Strategy Required |
|---|---|
| 24 hours | Daily snapshots |
| 4 hours | Snapshots every 4 hours, or continuous incremental |
| 1 hour | Hourly snapshots or incremental backup |
| 15 minutes | Continuous incremental with frequent sync |
| Near-zero | Commit log archiving (PITR capability) |
| Zero | Synchronous replication to secondary site |
Determining RPO:
Business stakeholders must answer: "If we lose the last N hours of data, what is the business impact?"
Considerations include:
- Transaction value (financial systems may require near-zero RPO)
- Data recreation cost (can lost data be re-entered or regenerated?)
- Regulatory requirements (some industries mandate specific retention)
- Customer impact (SLA commitments, reputation damage)
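As a concrete example, a 4-hour RPO from the table above can be met by scheduling a snapshot on every node every 4 hours. A minimal cron sketch, with an illustrative keyspace name and tag format:

```bash
# /etc/cron.d/cassandra-snapshot (sketch): snapshot every 4 hours to satisfy a 4-hour RPO.
# The % characters are escaped because cron treats unescaped % as a newline.
0 */4 * * * cassandra /usr/bin/nodetool snapshot -t "auto-$(date +\%Y\%m\%d\%H\%M)" my_keyspace
```

The snapshot only protects data until it is copied off the node, so a companion job should ship the snapshot directory to remote storage and prune old tags.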
Recovery Time Objective (RTO)¶
RTO defines the maximum acceptable downtime, measured in time.
If RTO is 2 hours, the system must be operational within 2 hours of a disaster declaration. This drives recovery infrastructure:
| RTO Target | Infrastructure Required |
|---|---|
| Days | Off-site tape storage, manual recovery |
| Hours | Remote disk backup, documented procedures |
| 1 hour | Hot standby or rapid restore capability |
| Minutes | Active-active multi-DC, automated failover |
| Seconds | Synchronous replication, instant failover |
Factors affecting actual recovery time:
| Factor | Impact on Recovery Time |
|---|---|
| Backup location | Remote storage adds transfer time |
| Data volume | 10TB takes longer to restore than 100GB |
| Network bandwidth | Limits data transfer rate |
| Restore method | sstableloader slower than direct file copy |
| Cluster size | More nodes = more work, but parallelizable |
| Staff availability | Off-hours incidents take longer |
| Documentation quality | Poor runbooks slow recovery |
| Testing frequency | Untested procedures fail under pressure |
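The data-volume and bandwidth factors in the table above can be sanity-checked with simple arithmetic before committing to an RTO. A back-of-the-envelope sketch with illustrative numbers:

```bash
# Rough transfer time for a restore: data volume divided by effective bandwidth.
# Illustrative figures: 10 TB pulled over a 1 Gbit/s link at ~80% utilisation.
DATA_GB=10000
EFFECTIVE_GBPS=0.8
HOURS=$(echo "$DATA_GB * 8 / $EFFECTIVE_GBPS / 3600" | bc -l)
printf 'Transfer alone: %.1f hours\n' "$HOURS"   # ~27.8 hours, before repair or validation
```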
The RTO/RPO Trade-off:
Shorter RTO and RPO require greater investment in infrastructure, tooling, and operational processes. Organizations must balance protection level against cost:
Cost
↑
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│╱
└─────────────────────→ RPO/RTO (shorter)
Approaching zero RPO/RTO requires exponentially increasing investment.
Operational Acceptance Testing (OAT)¶
A backup that has never been tested is not a backup.
OAT validates that backup and recovery procedures work as designed, under realistic conditions, and within required time constraints.
OAT Components for Backup/Restore:
| Test Type | Description | Frequency |
|---|---|---|
| Backup verification | Confirm backups complete successfully | Daily (automated) |
| Integrity check | Validate backup files are not corrupted | Weekly (automated) |
| Partial restore | Restore single table to staging | Monthly |
| Full restore | Restore entire cluster to DR site | Quarterly |
| Disaster simulation | Unannounced DR exercise with time measurement | Annually |
What OAT Should Validate:
- Backup completeness: All required data is captured
- Backup integrity: Files are not corrupted and can be read
- Restore procedure: Documented steps actually work
- Recovery time: Actual time meets RTO requirement
- Data correctness: Restored data matches expected state
- Application functionality: Applications work with restored data
- Staff capability: Team can execute procedures under pressure
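The backup-integrity check above is straightforward to automate. A minimal sketch, assuming a checksum manifest (checksums.sha256) was written alongside the SSTables at backup time with `sha256sum * > checksums.sha256`; the backup path is illustrative:

```bash
# Sketch: weekly integrity check for an off-node backup copy.
BACKUP_DIR=/backups/prod_keyspace/2024-06-01
cd "$BACKUP_DIR" || exit 1
if sha256sum --quiet -c checksums.sha256; then
    echo "backup integrity OK"
else
    echo "backup integrity FAILED for $BACKUP_DIR" >&2
    exit 1
fi
```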
Common OAT Failures:
| Failure Mode | Cause | Prevention |
|---|---|---|
| Backup files corrupted | Storage failure, incomplete transfer | Checksums, verification |
| Restore procedure fails | Undocumented dependencies, environment changes | Regular testing |
| RTO exceeded | Underestimated data volume, slow network | Realistic testing |
| Wrong data restored | Incorrect backup selected, timestamp confusion | Clear naming, automation |
| Missing schema | Schema not included in backup | Include schema in every backup |
| Application incompatibility | Schema drift, version mismatch | End-to-end testing |
Business Continuity Planning¶
Backup and restore is one component of broader business continuity:
| Component | Purpose |
|---|---|
| Backup & Restore | Recover data after loss |
| Disaster Recovery (DR) | Recover systems after site failure |
| High Availability (HA) | Prevent outages through redundancy |
| Business Continuity (BC) | Maintain business operations during disruption |
These components complement each other:
┌─────────────────────────────────────┐
│ Business Continuity │
│ ┌───────────────────────────────┐ │
│ │ Disaster Recovery │ │
│ │ ┌─────────────────────────┐ │ │
│ │ │ High Availability │ │ │
│ │ │ ┌───────────────────┐ │ │ │
│ │ │ │ Backup/Restore │ │ │ │
│ │ │ └───────────────────┘ │ │ │
│ │ └─────────────────────────┘ │ │
│ └───────────────────────────────┘ │
└─────────────────────────────────────┘
Replication vs Backup¶
Understanding what replication does and does not protect against:
| Failure Type | Replication Protects? | Backups Protect? |
|---|---|---|
| Single node failure | Yes | Yes |
| Multiple node failures (within RF) | Yes | Yes |
| Simultaneous failures exceeding RF | No | Yes |
| Rack failure (with rack-aware placement) | Yes | Yes |
| Datacenter failure (with multi-DC) | Yes | Yes |
| DROP TABLE or DROP KEYSPACE | No | Yes |
| TRUNCATE command | No | Yes |
| Accidental DELETE statements | No | Yes |
| Application bug corrupting data | No | Yes |
| Malicious insider deleting data | No | Yes |
| Ransomware encryption | No | Yes (if offline) |
| Regulatory data retention | No | Yes |
| Point-in-time audit requirements | No | Yes |
The gc_grace_seconds Constraint¶
Even with backups, restoration has a time limit determined by gc_grace_seconds (default: 10 days).
Cassandra uses tombstones (deletion markers) rather than immediately removing data. Tombstones propagate to all replicas, ensuring deletions are consistent. After gc_grace_seconds, tombstones are eligible for removal during compaction.
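The current value can be checked per table in the system schema. Keyspace and table names below are illustrative:

```bash
# Checking a table's gc_grace_seconds (default 864000 seconds = 10 days):
cqlsh -e "SELECT gc_grace_seconds FROM system_schema.tables
          WHERE keyspace_name = 'prod_keyspace' AND table_name = 'analytics';"
```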
The resurrection problem:
Timeline:
Day 0: Full backup taken (contains Row X)
Day 3: Row X deleted (tombstone created)
Day 11: Tombstone expires, removed by compaction
Day 15: Restore Day 0 backup to one node
State after restore:
- Restored node: Has Row X (from backup)
- Other nodes: No Row X, no tombstone (tombstone was removed)
Result:
- Anti-entropy repair sees Row X on restored node
- No tombstone exists to indicate deletion
- Row X replicates back to other nodes
- Deleted data "resurrects"
Implications:
| Backup Age | Restore Scope | Safe? |
|---|---|---|
| < gc_grace_seconds | Single node | Yes |
| < gc_grace_seconds | Full cluster | Yes |
| > gc_grace_seconds | Single node | No (resurrection risk) |
| > gc_grace_seconds | Full cluster | Yes (all nodes same state) |
Backup Components¶
A complete Cassandra backup includes:
| Component | Description | Required | Notes |
|---|---|---|---|
| SSTables | Immutable data files | Yes | The actual data |
| Schema | Keyspace and table definitions | Yes | Must restore before data |
| Commit logs | Write-ahead log | For PITR | Enables point-in-time recovery |
| Configuration | cassandra.yaml, JVM settings | Recommended | Cluster settings, tuning |
| Topology | Token assignments, DC/rack layout | Recommended | For disaster recovery |
SSTables¶
SSTables are immutable—once written, they never change. This immutability makes them ideal for backup:
- No risk of partial writes or mid-file corruption during backup
- Can be safely copied while Cassandra is running (after flush)
- Hard links enable instant, zero-space local snapshots
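The hard-link behaviour is easy to observe directly. A short sketch, with an illustrative keyspace name and default data directory:

```bash
# Sketch: a snapshot is a flush followed by hard links, not file copies.
nodetool snapshot -t pre_change my_keyspace
ls -li /var/lib/cassandra/data/my_keyspace/*/snapshots/pre_change/
# The inode numbers match the live SSTables, so the snapshot consumes no extra space
# until compaction removes or rewrites the original files.
```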
Schema¶
The schema must be restored before data. Without table definitions, SSTables cannot be loaded.
# Export complete schema
cqlsh -e "DESC SCHEMA" > schema.cql
# Include with every backup
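On the restore side, the same file is replayed before any SSTables are loaded; the hostname here is illustrative:

```bash
# Sketch: recreate the schema on the restore target first.
cqlsh restore-node.example.com -f schema.cql
```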
Commit Logs¶
Commit logs enable point-in-time recovery (PITR). Combined with a base snapshot, archived commit logs can restore to any point in time:
|──────────|──────────────────────────────|
Snapshot                              Failure
           <──────── Commit logs ────────>

Recovery = restore snapshot + replay commit logs to the target time
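Archiving is configured in commitlog_archiving.properties. A minimal sketch, assuming an on-host archive directory at /backup/commitlog_archive; Cassandra substitutes %path and %name with the segment path and file name:

```properties
# conf/commitlog_archiving.properties (sketch; paths are illustrative)
archive_command=/bin/cp %path /backup/commitlog_archive/%name

# For a restore, point the node at the archive and the target timestamp before starting it:
restore_command=/bin/cp -f %from %to
restore_directories=/backup/commitlog_archive
restore_point_in_time=2024:06:01 12:00:00
```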
Backup Methods Summary¶
| Method | Type | RPO | Complexity | Use Case |
|---|---|---|---|---|
| Snapshots | Full point-in-time | Hours-days | Low | Primary backup method |
| Incremental | Changed SSTables | Hours | Medium | Reduce storage between snapshots |
| Commit log archiving | Continuous | Minutes | High | Point-in-time recovery |
See Backup Procedures for implementation details.
Restore Scenarios Summary¶
| Scenario | Complexity | Typical Approach |
|---|---|---|
| Single table, single node | Low | Copy files + nodetool refresh |
| Single node failure | Medium | Rebuild from replicas or restore + repair |
| Rack failure | Medium | Restore nodes + repair |
| Datacenter failure | High | Restore all DC nodes + cross-DC repair |
| Point-in-time recovery | High | Snapshot + commit log replay |
| Migration to new cluster | Medium | sstableloader |
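A sketch of the two common data-loading paths from the table above, with illustrative keyspace, table, path, and host names:

```bash
# Same cluster and topology: copy snapshot SSTables into the live table directory, then load them.
cp /backups/prod_keyspace/analytics/* /var/lib/cassandra/data/prod_keyspace/analytics-*/
nodetool refresh prod_keyspace analytics

# Different cluster or changed topology: stream the SSTables through the ring instead.
sstableloader -d seed1.example.com /backups/prod_keyspace/analytics/
```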
See Restore Procedures for detailed procedures.
Managed Backup with AxonOps¶
Implementing enterprise-grade backup and restore requires significant operational investment:
- Scheduling and orchestration across all nodes
- Off-site storage with appropriate retention
- Monitoring and alerting for backup failures
- Regular restore testing and validation
- Documentation and runbook maintenance
AxonOps provides a fully managed backup solution:
- Automated scheduling with configurable retention policies
- Remote storage integration (S3, GCS, Azure Blob)
- Point-in-time recovery with commit log archiving
- Backup monitoring and alerting
- One-click restore through the dashboard
- Compliance reporting for audit requirements
See AxonOps Backup for configuration and usage.
Next Steps¶
- Backup Procedures - Snapshots, incremental backups, commit log archiving
- Restore Guide - Failure scenarios and recovery approaches