Filesystem Selection for Cassandra¶
Filesystem choice significantly impacts Cassandra performance, particularly for commit log operations. This guide covers filesystem architectures, their interaction with Cassandra's I/O patterns, and recommendations based on workload characteristics.
Filesystem Architecture Comparison¶
ext4 Architecture¶
ext4 (fourth extended filesystem) is the default Linux filesystem, evolved from ext3 with significant architectural improvements for modern storage.
On-Disk Structure¶
Block groups: ext4 divides the filesystem into fixed-size block groups (typically 128MB with 4K blocks). Each group contains:
- Group descriptor: Metadata about the group (free blocks, free inodes, checksums)
- Block bitmap: Tracks allocated/free blocks within the group
- Inode bitmap: Tracks allocated/free inodes within the group
- Inode table: Fixed-size array of inodes for files in this group
- Data blocks: Actual file content
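The resulting layout can be inspected directly with dumpe2fs (a quick illustration; /dev/sda1 is a placeholder device):
# Superblock summary: block size, blocks per group, inodes per group
dumpe2fs -h /dev/sda1
# Per-group detail: bitmap, inode table, and free block/inode counts for group 0
dumpe2fs /dev/sda1 | grep -A 6 "Group 0:"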
Extent-Based Allocation¶
ext4 replaced ext3's indirect block mapping with extents, dramatically improving large file performance.
| Aspect | ext3 (Indirect Blocks) | ext4 (Extents) |
|---|---|---|
| 1GB file mapping | ~256K block pointers | ~8 extents typical |
| Sequential read | Multiple metadata lookups | Single extent lookup |
| Metadata overhead | O(n) with file size | O(log n) with extent tree |
| Maximum file size | 2TB | 16TB (theoretical 1EB) |
Extent structure (12 bytes each):
struct ext4_extent {
__le32 ee_block; // First logical block
__le16 ee_len; // Number of blocks (max 32768 = 128MB)
__le16 ee_start_hi; // High 16 bits of physical block
__le32 ee_start_lo; // Low 32 bits of physical block
};
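To see how a file maps onto extents in practice, filefrag prints the extent list; a large SSTable written sequentially typically resolves to only a handful of extents (the path reuses the example pattern from the monitoring section below):
# Show the logical-to-physical extent mapping for an SSTable data component
filefrag -v /var/lib/cassandra/data/keyspace/table-*/nb-*-big-Data.db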
Journaling Subsystem¶
ext4's journal (jbd2) provides crash consistency through write-ahead logging.
Journal modes and write ordering:
| Mode | Write Sequence | Crash Behavior |
|---|---|---|
| `journal` | Data → Journal, Metadata → Journal, Commit → Checkpoint | Data and metadata always consistent |
| `ordered` | Data → Disk, Metadata → Journal, Commit → Checkpoint | Data written before metadata; no stale data exposure |
| `writeback` | Metadata → Journal, Data → Disk (unordered) | Possible stale data exposure after crash |
# Check current journal mode
tune2fs -l /dev/sda1 | grep "Default mount options"
# Mount with writeback mode (higher performance, reduced consistency)
mount -o data=writeback /dev/sda1 /var/lib/cassandra/data
# Check journal size and location
dumpe2fs /dev/sda1 | grep -i journal
Delayed Allocation¶
ext4 delays block allocation until writeback, enabling smarter placement decisions.
Benefits for Cassandra:
- SSTable flushes benefit from contiguous allocation
- Reduces fragmentation for sequential write workloads
- Improves extent coalescing for large files
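A simple way to observe the effect (an illustrative sketch; the test file is throwaway): write a file through the page cache, force it out, and count the extents it received once writeback performed the allocation.
# Buffered 1GB write; blocks are allocated at writeback/fsync time
dd if=/dev/zero of=/var/lib/cassandra/data/dalloc-test bs=1M count=1024 conv=fsync
# With delayed allocation this usually reports very few extents
filefrag /var/lib/cassandra/data/dalloc-test
rm /var/lib/cassandra/data/dalloc-test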
XFS Architecture¶
XFS is a high-performance 64-bit journaling filesystem designed for parallel I/O and extreme scalability.
Allocation Groups¶
XFS's defining feature is its division into independent allocation groups (AGs), enabling true parallel operations.
Allocation group structure:
| Component | Purpose | Structure |
|---|---|---|
| AG Superblock | AG metadata, free space summary | Fixed location in each AG |
| AG Free List | Free block tracking | Two B+ trees (by block number, by size) |
| Inode B+ Tree | Inode allocation tracking | B+ tree indexed by inode number |
| Free Inode B+ Tree | Available inodes | B+ tree for fast inode allocation |
Parallelism implications:
# Check AG count and size
xfs_info /var/lib/cassandra/data
# Example output:
# meta-data=... agcount=16, agsize=65536000 blks
# Each AG can handle independent I/O operations
For Cassandra with multiple concurrent compactions and flushes, different operations can target different AGs simultaneously without lock contention.
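The AG count is fixed at format time. mkfs.xfs picks a sensible default, but it can be set explicitly when formatting a device expected to serve many concurrent flush and compaction streams (an optional, illustrative example; the device name is a placeholder):
# Format with an explicit allocation group count (defaults are usually fine)
mkfs.xfs -f -d agcount=16 /dev/sdb1
# Confirm the resulting geometry
xfs_info /dev/sdb1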
B+ Tree Structures¶
XFS uses B+ trees extensively for metadata management, providing O(log n) operations even at massive scale.
B+ trees in XFS:
| Tree | Purpose | Location |
|---|---|---|
| Inode B+ tree | Maps inode numbers to disk locations | Per-AG |
| Free space by block | Finds free space at specific location | Per-AG |
| Free space by size | Finds free space of specific size | Per-AG |
| Extent map | Maps file logical blocks to physical | Per-inode (data fork) |
| Directory entries | Maps names to inodes | Per-directory |
Delayed Allocation and Speculative Preallocation¶
XFS implements aggressive delayed allocation with speculative preallocation for streaming writes.
Speculative preallocation behavior:
| File Size | Preallocation | Rationale |
|---|---|---|
| < 64KB | 64KB | Small file optimization |
| 64KB - 1MB | 2x current size | Growing file pattern |
| > 1MB | Fixed increment | Prevent excessive preallocation |
For Cassandra SSTables that grow to hundreds of MB or GB, this results in large contiguous extents.
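Preallocation shows up as allocated space exceeding the apparent file size while a file is still open for writing (an illustrative check; the path follows the example pattern used elsewhere in this guide):
# Apparent size as reported by the directory entry
ls -lh /var/lib/cassandra/data/keyspace/table-*/nb-*-big-Data.db
# Allocated blocks; may exceed the apparent size while preallocation is held
du -h /var/lib/cassandra/data/keyspace/table-*/nb-*-big-Data.db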
Log (Journal) Structure¶
XFS uses a metadata-only log with unique features for high performance.
Log configuration impact:
# Standard formatting (recommended - uses optimal defaults)
mkfs.xfs -f /dev/sdb1
# Verify filesystem parameters
xfs_info /dev/sdb1
XFS automatically calculates optimal log size based on filesystem size. Manual tuning is rarely necessary for modern storage.
ZFS Architecture¶
ZFS is a combined filesystem and volume manager with advanced features like checksumming, snapshots, and built-in RAID. While popular for general-purpose storage, it presents specific challenges for Cassandra workloads.
Core Architecture¶
Key architectural components:
| Component | Function | Cassandra Impact |
|---|---|---|
| ZPL | POSIX filesystem interface | Standard file operations |
| DMU | Object-based data management with CoW | Write amplification |
| ARC | Adaptive Replacement Cache (RAM) | Memory competition with JVM |
| L2ARC | SSD-based secondary cache | Additional I/O overhead |
| ZIL | ZFS Intent Log (write log) | Commit log latency |
| SLOG | Separate ZIL device | Can improve sync writes |
Copy-on-Write (CoW) and Write Amplification¶
ZFS never overwrites data in place. Every write creates new blocks, then updates metadata pointers.
Write amplification analysis:
| Operation | Traditional FS | ZFS CoW |
|---|---|---|
| 4KB write | 1 write + journal | 3+ writes (data + metadata chain) |
| SSTable write | Sequential + metadata | CoW tree updates |
| Compaction | Read + write | Read + CoW writes + old block retention |
For Cassandra's write-heavy workloads, this CoW overhead compounds with Cassandra's own write amplification from compaction.
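A rough way to observe the amplification on a running system (a sketch, assuming a pool named cassandra_pool): compare the bytes the pool physically writes against how much data Cassandra actually added over the same window.
# Bytes written by the pool and each member device over 5-second intervals
zpool iostat -v cassandra_pool 5
# Compare against growth of the data directory over the same window
du -sh /var/lib/cassandra/data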
ZFS Intent Log (ZIL) and Sync Writes¶
The ZIL provides durability for synchronous writes, critical for Cassandra's commit log.
ZIL considerations for Cassandra:
| Configuration | Behavior | Recommendation |
|---|---|---|
| Default ZIL | On pool disks | Adequate for most workloads |
| SLOG (separate log device) | Dedicated fast device for ZIL | Recommended if using ZFS for commit logs |
| sync=disabled | No synchronous writes | Never use - violates durability |
# Check ZIL configuration
zpool status -v
# Add SLOG device (mirror recommended)
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
# Check ZIL statistics
zpool iostat -v tank 1
ARC Memory Competition¶
ZFS's ARC (Adaptive Replacement Cache) uses RAM for caching, competing directly with Cassandra's JVM heap and OS page cache.
Memory configuration:
# Limit ARC size in /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592 # 8GB max
# Or dynamically
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
# Verify ARC usage
arc_summary # or arcstat
Memory planning for Cassandra on ZFS:
| Component | Typical Allocation | Notes |
|---|---|---|
| JVM Heap | 8-31GB | Standard Cassandra sizing |
| Off-Heap | 50-100% of heap | Memtables, bloom filters, compression |
| ARC | 8-16GB max | Must limit explicitly |
| OS/kernel | 4-8GB | Page cache, kernel structures |
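As a worked example for a hypothetical 64GB node: 24GB heap + 16GB off-heap + 6GB OS leaves room to cap ARC at 8GB while keeping headroom for the page cache; the limit is expressed in bytes.
# 8GiB = 8 * 1024^3 bytes
echo $((8 * 1024 * 1024 * 1024))   # prints 8589934592, the value used for zfs_arc_max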
Checksumming and Data Integrity¶
ZFS provides end-to-end checksumming, detecting silent data corruption (bit rot).
Cassandra already provides checksums at the SSTable level, making ZFS checksumming redundant for data files. However, ZFS checksums protect:
- Commit logs (Cassandra checksums optional)
- System files
- Any corruption in storage layer
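Periodic scrubs are what turn those checksums into detected (and, with redundancy, repaired) errors; they are typically scheduled via cron or a systemd timer (pool name is illustrative):
# Walk the entire pool and verify every checksum
zpool scrub cassandra_pool
# Review progress and any checksum errors found
zpool status cassandra_pool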
ZFS for Cassandra: Challenges Summary¶
| Challenge | Impact | Mitigation |
|---|---|---|
| Write amplification | 2-3x more disk writes | Use SSDs, accept overhead |
| CoW fragmentation | Random read patterns over time | Regular scrub, accept degradation |
| ARC memory | Reduces memory for Cassandra | Limit ARC explicitly |
| Sync write latency | Commit log performance | Add SLOG device |
| Snapshots retain blocks | Disk space after compaction | Manage snapshot lifecycle |
| Complexity | Tuning required | Expertise needed |
When ZFS Might Make Sense¶
Despite the challenges, ZFS can be appropriate in specific scenarios:
Potentially suitable:
- Development/test environments: Snapshots enable rapid environment cloning
- Operational simplicity priority: Single storage stack with built-in features
- Data integrity requirements: Regulatory or compliance needs for checksumming
- Existing ZFS infrastructure: Team expertise and tooling already in place
Not recommended:
- Performance-critical production: Write amplification impacts throughput
- Cost-sensitive deployments: Requires more disk I/O capacity
- Memory-constrained systems: ARC competition with JVM
ZFS Tuning for Cassandra¶
If using ZFS, apply these optimizations:
# Create pool with appropriate settings
zpool create -o ashift=12 cassandra_pool mirror /dev/sda /dev/sdb
# Create dataset with Cassandra-optimized settings
zfs create -o recordsize=64K \
-o compression=lz4 \
-o atime=off \
-o xattr=sa \
-o primarycache=metadata \
cassandra_pool/data
# For commit logs (if separate dataset)
zfs create -o recordsize=8K \
-o compression=off \
-o sync=standard \
-o primarycache=metadata \
cassandra_pool/commitlog
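The effective properties can be confirmed afterwards with zfs get (dataset names as created above):
# Verify the properties actually applied to each dataset
zfs get recordsize,compression,atime,primarycache,logbias cassandra_pool/data cassandra_pool/commitlog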
Key tuning parameters:
| Parameter | Data Directory | Commit Log | Rationale |
|---|---|---|---|
| `recordsize` | 64K-128K | 8K | Match Cassandra I/O patterns |
| `compression` | lz4 | off | Data already compressed in SSTables |
| `atime` | off | off | Eliminate access time updates |
| `primarycache` | metadata | metadata | Let Cassandra manage data caching |
| `sync` | standard | standard | Maintain durability |
| `logbias` | throughput | latency | Optimize for workload type |
# Limit ARC (critical for Cassandra)
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
# Tune transaction group timing
echo "options zfs zfs_txg_timeout=5" >> /etc/modprobe.d/zfs.conf
# Settings in /etc/modprobe.d take effect after a reboot or zfs module reload;
# zfs_arc_max can also be changed at runtime via /sys/module/zfs/parameters as shown above
ext3 (Legacy) - Why to Avoid¶
ext3 lacks architectural features critical for Cassandra performance.
Fundamental limitations:
| Aspect | ext3 Behavior | Impact on Cassandra |
|---|---|---|
| Block mapping | Indirect blocks (up to triple) | O(n) metadata overhead for large SSTables |
| Allocation | Immediate allocation | No optimization for write patterns |
| Maximum extent | N/A (block-based) | Every block tracked individually |
| Directory scaling | Linear search (without htree) | Slow with many SSTables |
| Journal | Full data journaling common | 2x write amplification |
Migration path:
# Check if filesystem is ext3
tune2fs -l /dev/sda1 | grep "Filesystem features"
# ext3: has_journal (no extents)
# ext4: has_journal extents
# Convert ext3 to ext4 (offline)
tune2fs -O extents,uninit_bg,dir_index /dev/sda1
e2fsck -fD /dev/sda1
Architecture Comparison Summary¶
| Feature | ext4 | XFS | ZFS |
|---|---|---|---|
| Block groups / AGs | Fixed 128MB groups | Configurable AGs (up to 1TB each) | Variable record size |
| Parallel allocation | Limited | Excellent (per-AG) | Good (per-vdev) |
| Extent size | Max 128MB | Max ~8GB (with 4K blocks) | Variable blocks |
| Write behavior | In-place | In-place | Copy-on-write |
| Journal | Data + metadata modes | Metadata only | Intent log (ZIL) |
| Delayed allocation | Yes | Yes (aggressive) | Yes (TXG batching) |
| Built-in checksums | Metadata only | Metadata only | All data + metadata |
| Snapshots | No | No | Yes (instant, CoW) |
| Online shrink | No (offline only) | No | No |
| Maximum file size | 16TB | 8EB | 16EB |
| Maximum filesystem | 1EB | 8EB | 256 ZB |
| Memory overhead | Minimal | Minimal | ARC cache (significant) |
| Write amplification | Low | Low | High (2-3x) |
| Cassandra suitability | Excellent | Excellent | Limited |
Cassandra I/O Patterns¶
Understanding Cassandra's I/O behavior is essential for filesystem selection.
Commit Log I/O¶
The commit log provides durability by persisting writes before acknowledgment.
I/O characteristics:
| Mode | Implementation | Filesystem Interaction |
|---|---|---|
| Uncompressed | `MappedByteBuffer` | Memory-mapped files; OS kernel manages page flushing; highly sensitive to filesystem I/O handling |
| Compressed | `FileChannel` + `ByteBuffer` | Explicit write calls; more predictable I/O pattern; less filesystem-dependent |
The difference in implementation explains why filesystem choice significantly impacts uncompressed commit log performance but has minimal effect on compressed workloads.
Data Directory I/O¶
SSTable operations have different characteristics:
| Operation | I/O Pattern | Filesystem Impact |
|---|---|---|
| Memtable flush | Large sequential writes | Extent allocation efficiency |
| Compaction | Sequential read + sequential write | Sustained throughput |
| Read path | Random reads (with caching) | Block allocation locality |
Data directory workloads benefit from XFS's large file handling and parallel I/O capabilities.
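Before committing to a filesystem, the flush/compaction pattern can be approximated with fio on each candidate (an illustrative sketch; sizes and paths are arbitrary):
# Large sequential write with periodic fsync, similar in shape to a memtable flush
fio --name=flush-sim --directory=/var/lib/cassandra/data \
    --rw=write --bs=1M --size=4G --ioengine=libaio --fsync=256 --numjobs=1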
Performance Characteristics¶
Commit Log: ext4 vs XFS¶
Benchmark studies reveal significant performance differences for uncompressed commit logs, particularly on certain Linux distributions.
Observed behavior (CentOS/RHEL):
| Filesystem | Throughput | Mean Latency | Notes |
|---|---|---|---|
| ext4 | ~21k ops/sec | ~2ms | Superior MappedByteBuffer handling |
| XFS | ~14k ops/sec | ~4ms | Higher latency with memory-mapped I/O |
This represents roughly a one-third drop in throughput (ext4 sustains about 50% more ops/sec) and a 2x latency increase for XFS with uncompressed commit logs on CentOS systems.
Root cause analysis:
The performance difference stems from how each filesystem handles memory-mapped file synchronization:
- ext4: More efficient write coalescing and page cache management for mmap operations
- XFS: Allocation group locking and extent management add overhead for frequent small mmap flushes
Distribution variance:
| Distribution | ext4 vs XFS Gap |
|---|---|
| CentOS/RHEL | Significant (50%+ throughput difference) |
| Ubuntu | Minimal (kernel I/O path differences) |
The kernel version and distribution-specific patches affect filesystem behavior. Ubuntu's kernel includes different I/O scheduling and writeback tuning that reduces the performance gap.
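To gather comparable numbers on your own hardware, a write-heavy test can be run with cassandra-stress against each candidate commit log filesystem (a rough sketch; operation count and thread count are illustrative):
# Write-heavy workload that exercises the commit log path
cassandra-stress write n=1000000 -rate threads=200
# Repeat with the commit log on ext4 vs XFS, and with/without commitlog compression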
Commit Log Compression as Equalizer¶
Enabling commit log compression eliminates the filesystem performance gap:
| Configuration | ext4 | XFS |
|---|---|---|
| Uncompressed | ~21k ops/sec | ~14k ops/sec |
| Compressed | ~24k ops/sec | ~24k ops/sec |
Compression switches from MappedByteBuffer to FileChannel I/O, which both filesystems handle similarly.
# cassandra.yaml
commitlog_compression:
- class_name: LZ4Compressor
Data Directory Performance¶
For data directories, XFS generally provides advantages:
| Aspect | ext4 | XFS |
|---|---|---|
| Large SSTable writes | Good | Better (optimized extent allocation) |
| Parallel compaction | Good | Better (allocation groups) |
| Many small files | Better | Good |
| File deletion | Faster | Slower (extent cleanup) |
Recommendations¶
Decision Matrix¶
| Use Case | Recommended | Alternative | Avoid |
|---|---|---|---|
| Data directories | XFS | ext4 | ZFS (write amplification) |
| Commit logs (uncompressed) | ext4 | XFS (with commitlog compression enabled) | ZFS |
| Commit logs (compressed) | XFS or ext4 | - | ZFS |
| Mixed (single mount) | XFS (with commitlog compression) | ext4 | ZFS |
| Dev/test with snapshots | ZFS | XFS/ext4 + LVM | - |
Commitlog Compression
"Commitlog compression" refers to Cassandra's built-in compression configured in cassandra.yaml, not filesystem-level compression. XFS does not support filesystem compression. Enabling commitlog compression switches Cassandra from MappedByteBuffer to FileChannel I/O, which performs equally well on both ext4 and XFS.
Production Recommendations¶
Preferred configuration (separate mounts):
/dev/nvme0n1p1 /var/lib/cassandra/data xfs defaults,noatime 0 2
/dev/nvme1n1p1 /var/lib/cassandra/commitlog ext4 defaults,noatime 0 2
With commit log compression enabled in cassandra.yaml.
ZFS in Production
ZFS is not recommended for production Cassandra deployments due to:
- Write amplification: CoW semantics cause 2-3x write overhead
- Memory contention: ARC competes with JVM heap
- Compaction interaction: CoW + Cassandra compaction compounds write amplification
If organizational requirements mandate ZFS, see the ZFS Tuning section for optimization guidance.
Alternative (single filesystem):
If using a single filesystem for both data and commit logs:
- Use XFS for overall performance
- Enable commit log compression to avoid MappedByteBuffer performance penalty
# cassandra.yaml - required when using XFS for commit logs
commitlog_compression:
- class_name: LZ4Compressor
Filesystem by Distribution¶
| Distribution | Data Directory | Commit Log |
|---|---|---|
| RHEL/CentOS/Rocky | XFS | ext4, or XFS with commitlog compression |
| Ubuntu/Debian | XFS | Either (minimal difference) |
| Amazon Linux | XFS | ext4, or XFS with commitlog compression |
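Because the gap is kernel- and distribution-dependent, confirm what the node is actually running before deciding:
# Identify the distribution and kernel version
cat /etc/os-release
uname -r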
Mount Options¶
XFS Mount Options¶
# /etc/fstab
/dev/sdb1 /var/lib/cassandra/data xfs defaults,noatime,nodiratime,allocsize=512m 0 2
| Option | Purpose |
|---|---|
| `noatime` | Disable access time updates; reduces write overhead |
| `nodiratime` | Disable directory access time updates |
| `allocsize=512m` | Larger allocation size for streaming writes (optional) |
ext4 Mount Options¶
# /etc/fstab - balanced configuration
/dev/sdc1 /var/lib/cassandra/commitlog ext4 defaults,noatime,nodiratime 0 2
# /etc/fstab - maximum performance (reduced durability)
/dev/sdc1 /var/lib/cassandra/commitlog ext4 defaults,noatime,nodiratime,data=writeback,barrier=0 0 2
| Option | Purpose | Trade-off |
|---|---|---|
| `noatime` | Disable access time updates | None |
| `nodiratime` | Disable directory access time updates | None |
| `data=writeback` | Journal metadata only | Reduced durability on crash |
| `barrier=0` | Disable write barriers | Reduced durability; requires battery-backed cache |
Write Barriers
Disabling write barriers (barrier=0) should only be used with battery-backed RAID controllers or when commit log durability is handled at another layer. Data corruption is possible on power failure without hardware write caching protection.
Verification¶
# Check mounted filesystem options
mount | grep cassandra
# Check XFS allocation groups
xfs_info /var/lib/cassandra/data
# Check ext4 features
tune2fs -l /dev/sdc1 | grep -E "features|Default mount"
Formatting Recommendations¶
XFS Formatting¶
# Standard formatting (recommended - uses optimal defaults)
mkfs.xfs -f /dev/sdb1
# Verify filesystem parameters
xfs_info /dev/sdb1
XFS automatically calculates optimal allocation group count and log size based on device characteristics. Manual tuning is rarely necessary for modern NVMe/SSD storage.
ext4 Formatting¶
# Standard formatting
mkfs.ext4 /dev/sdc1
# Optimized: larger inode size
mkfs.ext4 -I 512 /dev/sdc1
# Additionally disable the journal for a dedicated commit log device, only if
# crash durability is handled by replication
mkfs.ext4 -I 512 -O ^has_journal /dev/sdc1
Monitoring Filesystem Performance¶
I/O Statistics¶
# Real-time I/O monitoring
iostat -xz 1
# Key metrics to watch
# - await: Average I/O wait time (ms)
# - %util: Device utilization
# - avgqu-sz (aqu-sz in newer sysstat): Average queue depth
Filesystem-Specific Monitoring¶
# XFS fragmentation
xfs_db -c frag -r /dev/sdb1
# ext4 fragmentation
e4defrag -c /var/lib/cassandra/commitlog
# Check for extent fragmentation in large files
filefrag /var/lib/cassandra/data/keyspace/table-*/nb-*-big-Data.db
Summary¶
| Recommendation | Rationale |
|---|---|
| Use XFS for data directories | Optimized for large files, parallel I/O, delayed allocation |
| Use ext4 for commit logs OR enable compression | ext4 has better MappedByteBuffer handling |
| Avoid ZFS for production | Write amplification and memory contention impact performance |
| Always use `noatime` | Eliminates unnecessary write overhead |
| Enable Cassandra commitlog compression if using XFS | Neutralizes MappedByteBuffer performance difference |
| Match filesystem choice to distribution | RHEL-based systems show larger ext4/XFS gaps |
| If ZFS required, limit ARC and tune recordsize | Minimize performance impact with proper configuration |
Related Documentation¶
- OS Tuning Overview - Complete OS tuning guide
- Commit Log Architecture - Commit log internals
- Hardware Recommendations - Storage hardware selection
- SSTable Architecture - Data file structure