Filesystem Selection for Kafka

Filesystem choice impacts Kafka's throughput and latency, particularly for high-volume deployments. This guide covers Kafka's I/O architecture, filesystem recommendations, and optimization strategies.


Kafka I/O Architecture

Log Segment Structure

Kafka stores all data as append-only log segments on disk.

File types per segment:

| File | Purpose | I/O Pattern |
| --- | --- | --- |
| .log | Message data | Sequential write, sequential/random read |
| .index | Offset to position mapping | Sparse writes, memory-mapped reads |
| .timeindex | Timestamp to offset mapping | Sparse writes, memory-mapped reads |
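
On disk, each partition is a directory named <topic>-<partition>, and segment files are named by the offset of their first message. A minimal sketch of that layout is shown below; the topic name orders, partition 0, and the log directory path are illustrative:

# List segment files for one partition (illustrative paths)
ls -lh /kafka/data1/orders-0/
# 00000000000000000000.log        <- message data for the first segment
# 00000000000000000000.index      <- offset -> file position (sparse)
# 00000000000000000000.timeindex  <- timestamp -> offset (sparse)
# 00000000000000120000.log        <- next segment, named by its base offset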

Write Path

Kafka's write path is optimized for sequential I/O.

Key characteristics:

  • Writes append to active segment only (sequential)
  • Data written to OS page cache, not directly to disk
  • Configurable fsync behavior (log.flush.interval.messages, log.flush.interval.ms)
  • Default: rely on OS page cache and replication for durability
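
Flush behavior is controlled with the log.flush.* broker settings (with per-topic overrides available). The values below are purely illustrative, not a recommendation; the defaults leave both effectively disabled so that durability comes from the page cache plus replication:

# server.properties - illustrative values only
log.flush.interval.messages=10000   # fsync a partition's log after this many messages
log.flush.interval.ms=1000          # ...or after this many milliseconds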

Read Path and Zero-Copy

Kafka uses zero-copy transfers (the sendfile() system call) for consumer reads, avoiding extra copies through user space.

Zero-copy requirements:

  • Data must already be in the page cache; otherwise it is first read from disk
  • The filesystem must support sendfile(); all modern Linux filesystems do
  • TLS/SSL disables zero-copy, since data must be encrypted in user space
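
One way to confirm zero-copy in practice is to watch for sendfile() system calls from the broker process while a consumer is fetching. This is a diagnostic sketch, not required setup; <broker-pid> is a placeholder and strace needs appropriate privileges:

# Observe sendfile() usage by the broker (plaintext listener)
strace -f -e trace=sendfile -p <broker-pid>
# With TLS listeners these calls disappear, because data is copied into
# user space for encryption before being written to the socket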

Page Cache Dependency

Kafka relies heavily on the OS page cache for performance.

Page cache sizing:

| Workload | Recommended Page Cache | Rationale |
| --- | --- | --- |
| Real-time consumers | > active data size | Recent data served from cache |
| Catch-up consumers | As large as possible | Reduce disk reads for historical data |
| Mixed workloads | 50-80% of RAM | Balance between heap and cache |
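
As a rough sanity check of whether the cache covers the hot data set, compare the size of recently written segments against the current page cache. The log-dir glob and the one-hour "hot" window below are assumptions to adapt to your retention and consumer lag (GNU find assumed):

# Total bytes of .log segments modified in the last hour vs. page cache size
hot_bytes=$(find /kafka/data* -name '*.log' -mmin -60 -printf '%s\n' | awk '{s+=$1} END {print s+0}')
cached_kb=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo "hot segment data: $((hot_bytes / 1024 / 1024)) MB"
echo "page cache:       $((cached_kb / 1024)) MB"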

Filesystem Comparison

XFS

XFS is the recommended filesystem for Kafka log directories.

Advantages for Kafka:

| Feature | Benefit for Kafka |
| --- | --- |
| Allocation groups | Parallel writes across multiple log directories |
| Extent-based allocation | Efficient for large sequential segment files |
| Delayed allocation | Better block placement for append workloads |
| Preallocation | Reduces fragmentation during segment growth |
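
To see extent-based allocation in practice, xfs_bmap (part of xfsprogs) prints the extent map of a file; a well-laid-out segment shows a small number of large extents. The segment path below is illustrative:

# Inspect the extent layout of a segment file
xfs_bmap -v /kafka/data1/orders-0/00000000000000000000.log
# A few large extents indicates good sequential layout; hundreds of small
# extents indicate fragmentation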

ext4

ext4 is a reasonable alternative, particularly for smaller deployments.

Comparison with XFS:

| Aspect | ext4 | XFS |
| --- | --- | --- |
| Sequential writes | Good | Excellent |
| Parallel I/O | Limited (single lock) | Excellent (per-AG) |
| Large files | Good (up to 16TB) | Excellent (up to 8EB) |
| Many directories | Good | Better (B+ tree dirs) |
| Extent coalescing | Good | Better |

When ext4 is acceptable:

  • Single log directory (no parallelism benefit from XFS)
  • Smaller deployments (< 1TB per broker)
  • Familiarity and existing tooling

ZFS

ZFS presents similar challenges for Kafka as it does for Cassandra.

Issues with Kafka workloads:

| Challenge | Impact on Kafka |
| --- | --- |
| Copy-on-write | Write amplification for append workloads |
| ARC memory | Competes with page cache |
| Checksumming | CPU overhead (Kafka has CRC32 already) |
| CoW fragmentation | Degrades sequential read performance over time |

If ZFS is required:

# Create dataset with Kafka-optimized settings
zfs create -o recordsize=128K \
           -o compression=off \
           -o atime=off \
           -o xattr=sa \
           -o primarycache=metadata \
           -o logbias=throughput \
           tank/kafka

# Limit ARC to leave room for page cache
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf  # 4GB

Filesystem Comparison Summary

| Feature | XFS | ext4 | ZFS |
| --- | --- | --- | --- |
| Sequential write | Excellent | Good | Poor (CoW) |
| Parallel I/O | Excellent | Limited | Good |
| Page cache friendly | Yes | Yes | Competes (ARC) |
| Zero-copy support | Yes | Yes | Yes |
| Write amplification | Low | Low | High (2-3x) |
| Kafka suitability | Excellent | Good | Poor |

Configuration

Mount Options

XFS (recommended):

# /etc/fstab
/dev/nvme0n1p1 /kafka/data xfs defaults,noatime,nodiratime 0 2

| Option | Purpose |
| --- | --- |
| noatime | Disable file access-time updates (significant at high throughput) |
| nodiratime | Disable directory access-time updates |

ext4:

# /etc/fstab
/dev/nvme0n1p1 /kafka/data ext4 defaults,noatime,nodiratime 0 2

Formatting Recommendations

XFS:

# Standard formatting (recommended - uses optimal defaults)
mkfs.xfs -f /dev/nvme0n1p1

# Verify filesystem parameters
xfs_info /dev/nvme0n1p1

XFS automatically calculates optimal allocation group count and log size based on device characteristics. Manual tuning is rarely necessary for modern NVMe/SSD storage.

ext4:

# Standard formatting
mkfs.ext4 /dev/nvme0n1p1

Multiple Log Directories

Kafka supports multiple log directories for parallelism and capacity.

# server.properties
log.dirs=/kafka/data1,/kafka/data2,/kafka/data3,/kafka/data4

Benefits:

  • Distributes I/O across multiple disks
  • XFS allocation groups provide additional parallelism per disk
  • Partitions distributed round-robin across directories

Best practices:

  • One filesystem per physical disk (no RAID 0 across directories)
  • Use XFS for each mount point
  • Equal-sized disks for balanced distribution
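
To verify how partitions actually landed across the configured directories, the kafka-log-dirs tool shipped with Kafka can describe per-directory placement and size. The listener address and broker id below are assumptions for a default single-broker setup:

# Describe log-dir contents for broker 1 (output is JSON)
bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 \
    --describe --broker-list 1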

Page Cache Optimization

Kernel Parameters

# /etc/sysctl.conf

# Dirty page ratios
vm.dirty_ratio = 80              # Max dirty pages before blocking writes
vm.dirty_background_ratio = 5    # Start background writeback

# Alternative: absolute values for large memory systems
# vm.dirty_bytes = 2147483648           # 2GB max dirty
# vm.dirty_background_bytes = 536870912  # 512MB background threshold

# Swappiness
vm.swappiness = 1                # Minimize swapping (0 can cause OOM)

# Page cache pressure
vm.vfs_cache_pressure = 50       # Retain page cache over dentries/inodes

Parameter explanations:

| Parameter | Recommended | Rationale |
| --- | --- | --- |
| vm.dirty_ratio | 80 | Allow large dirty cache before blocking producers |
| vm.dirty_background_ratio | 5 | Start flushing early to avoid write stalls |
| vm.swappiness | 1 | Keep data in RAM, minimal swap |
| vm.vfs_cache_pressure | 50 | Favor page cache retention |

I/O Scheduler

# For NVMe/SSD (recommended)
echo none > /sys/block/nvme0n1/queue/scheduler

# For SATA SSD (named "mq-deadline" on blk-mq kernels, 5.0 and later)
echo mq-deadline > /sys/block/sda/queue/scheduler

# Persistent configuration (/etc/udev/rules.d/60-kafka.rules)
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"

Read-Ahead

# Check current setting
blockdev --getra /dev/nvme0n1

# Set read-ahead (in 512-byte sectors)
# 0 for NVMe (no benefit)
# 256-4096 for HDDs
blockdev --setra 0 /dev/nvme0n1    # NVMe
blockdev --setra 4096 /dev/sda     # HDD (2MB read-ahead)

Recommendations

Decision Matrix

| Deployment | Recommended | Notes |
| --- | --- | --- |
| Production (any scale) | XFS | Best sequential write performance |
| Small/Dev | ext4 or XFS | Either acceptable |
| Multiple disks | XFS per disk | Leverage allocation group parallelism |
| Existing ZFS infrastructure | ZFS with tuning | See ZFS section for required optimizations |

Configuration Summary

Optimal Kafka filesystem setup:

# Format each disk with XFS
mkfs.xfs -f /dev/nvme0n1
mkfs.xfs -f /dev/nvme1n1

# Mount with optimal options
mount -o noatime,nodiratime /dev/nvme0n1 /kafka/data1
mount -o noatime,nodiratime /dev/nvme1n1 /kafka/data2

# Configure Kafka
# server.properties
log.dirs=/kafka/data1,/kafka/data2

Kernel tuning:

# /etc/sysctl.conf
vm.dirty_ratio = 80
vm.dirty_background_ratio = 5
vm.swappiness = 1
vm.vfs_cache_pressure = 50

# Apply
sysctl -p

What to Avoid

| Configuration | Issue |
| --- | --- |
| ZFS for production | Write amplification, ARC competition |
| RAID 0 across log.dirs | A single disk failure loses all data on the broker |
| Small page cache | Increases disk I/O; zero-copy reads must first be served from disk |
| High swappiness | JVM heap pages swapped out, causing severe latency |
| atime enabled | Unnecessary metadata write overhead |

Monitoring

Filesystem Metrics

# Disk I/O statistics
iostat -xz 1

# Key metrics:
# - %util: Device utilization (< 80% target)
# - await: Average I/O wait time
# - w/s: Writes per second

Page Cache Statistics

# Page cache usage
free -h
# Cached column shows page cache size

# Detailed memory info
grep -E "Cached|Dirty|Writeback" /proc/meminfo

# Per-file cache status (requires the vmtouch utility)
vmtouch /kafka/data1/topic-*/*.log

Kafka Log Directory Health

# Check disk usage per log directory
du -sh /kafka/data*

# Check segment distribution
find /kafka/data* -name "*.log" | wc -l

# Verify XFS health
xfs_info /kafka/data1
xfs_repair -n /dev/nvme0n1  # Dry-run check (run against an unmounted filesystem)

Summary

| Recommendation | Rationale |
| --- | --- |
| Use XFS for all log directories | Optimized for sequential writes; allocation groups enable parallelism |
| Avoid ZFS for production | CoW overhead; ARC competes with page cache |
| Configure multiple log.dirs on separate disks | Distributes I/O, increases throughput |
| Tune page cache parameters | Kafka depends on the page cache for performance |
| Use the noatime mount option | Eliminates unnecessary metadata writes |
| Leave the majority of RAM for the page cache | Enables zero-copy serving, reduces disk I/O |