Filesystem Selection for Kafka¶
Filesystem choice impacts Kafka's throughput and latency, particularly for high-volume deployments. This guide covers Kafka's I/O architecture, filesystem recommendations, and optimization strategies.
Kafka I/O Architecture¶
Log Segment Structure¶
Kafka stores all data as append-only log segments on disk.
File types per segment:
| File | Purpose | I/O Pattern |
|---|---|---|
.log |
Message data | Sequential write, sequential/random read |
.index |
Offset to position mapping | Sparse writes, memory-mapped reads |
.timeindex |
Timestamp to offset mapping | Sparse writes, memory-mapped reads |
Write Path¶
Kafka's write path is optimized for sequential I/O.
Key characteristics:
- Writes append to active segment only (sequential)
- Data written to OS page cache, not directly to disk
- Configurable fsync behavior (
log.flush.interval.messages,log.flush.interval.ms) - Default: rely on OS page cache and replication for durability
Read Path and Zero-Copy¶
Kafka uses zero-copy (sendfile) for consumer reads, bypassing user space entirely.
Zero-copy requirements:
- Data must be in page cache (or will be read from disk)
- Filesystem must support sendfile() - all modern Linux filesystems do
- TLS/SSL disables zero-copy (data must be encrypted in user space)
Page Cache Dependency¶
Kafka relies heavily on the OS page cache for performance.
Page cache sizing:
| Workload | Recommended Page Cache | Rationale |
|---|---|---|
| Real-time consumers | > active data size | Recent data served from cache |
| Catch-up consumers | As large as possible | Reduce disk reads for historical data |
| Mixed workloads | 50-80% of RAM | Balance between heap and cache |
Filesystem Comparison¶
XFS (Recommended)¶
XFS is the recommended filesystem for Kafka log directories.
Advantages for Kafka:
| Feature | Benefit for Kafka |
|---|---|
| Allocation groups | Parallel writes across multiple log directories |
| Extent-based allocation | Efficient for large sequential segment files |
| Delayed allocation | Better block placement for append workloads |
| Preallocation | Reduces fragmentation during segment growth |
XFS behavior with Kafka workloads:
ext4¶
ext4 is a reasonable alternative, particularly for smaller deployments.
Comparison with XFS:
| Aspect | ext4 | XFS |
|---|---|---|
| Sequential writes | Good | Excellent |
| Parallel I/O | Limited (single lock) | Excellent (per-AG) |
| Large files | Good (up to 16TB) | Excellent (up to 8EB) |
| Many directories | Good | Better (B+ tree dirs) |
| Extent coalescing | Good | Better |
When ext4 is acceptable:
- Single log directory (no parallelism benefit from XFS)
- Smaller deployments (< 1TB per broker)
- Familiarity and existing tooling
ZFS (Not Recommended)¶
ZFS presents similar challenges for Kafka as for Cassandra.
Issues with Kafka workloads:
| Challenge | Impact on Kafka |
|---|---|
| Copy-on-write | Write amplification for append workloads |
| ARC memory | Competes with page cache |
| Checksumming | CPU overhead (Kafka has CRC32 already) |
| CoW fragmentation | Degrades sequential read performance over time |
If ZFS is required:
# Create dataset with Kafka-optimized settings
zfs create -o recordsize=128K \
-o compression=off \
-o atime=off \
-o xattr=sa \
-o primarycache=metadata \
-o logbias=throughput \
tank/kafka
# Limit ARC to leave room for page cache
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf # 4GB
Filesystem Comparison Summary¶
| Feature | XFS | ext4 | ZFS |
|---|---|---|---|
| Sequential write | Excellent | Good | Poor (CoW) |
| Parallel I/O | Excellent | Limited | Good |
| Page cache friendly | Yes | Yes | Competes (ARC) |
| Zero-copy support | Yes | Yes | Yes |
| Write amplification | Low | Low | High (2-3x) |
| Kafka suitability | Excellent | Good | Poor |
Configuration¶
Mount Options¶
XFS (recommended):
# /etc/fstab
/dev/nvme0n1p1 /kafka/data xfs defaults,noatime,nodiratime 0 2
| Option | Purpose |
|---|---|
noatime |
Disable access time updates (significant for high-throughput) |
nodiratime |
Disable directory access time updates |
ext4:
# /etc/fstab
/dev/nvme0n1p1 /kafka/data ext4 defaults,noatime,nodiratime 0 2
Formatting Recommendations¶
XFS:
# Standard formatting (recommended - uses optimal defaults)
mkfs.xfs -f /dev/nvme0n1p1
# Verify filesystem parameters
xfs_info /dev/nvme0n1p1
XFS automatically calculates optimal allocation group count and log size based on device characteristics. Manual tuning is rarely necessary for modern NVMe/SSD storage.
ext4:
# Standard formatting
mkfs.ext4 /dev/nvme0n1p1
Multiple Log Directories¶
Kafka supports multiple log directories for parallelism and capacity.
# server.properties
log.dirs=/kafka/data1,/kafka/data2,/kafka/data3,/kafka/data4
Benefits:
- Distributes I/O across multiple disks
- XFS allocation groups provide additional parallelism per disk
- Partitions distributed round-robin across directories
Best practices:
- One filesystem per physical disk (no RAID 0 across directories)
- Use XFS for each mount point
- Equal-sized disks for balanced distribution
Page Cache Optimization¶
Kernel Parameters¶
# /etc/sysctl.conf
# Dirty page ratios
vm.dirty_ratio = 80 # Max dirty pages before blocking writes
vm.dirty_background_ratio = 5 # Start background writeback
# Alternative: absolute values for large memory systems
# vm.dirty_bytes = 2147483648 # 2GB max dirty
# vm.dirty_background_bytes = 536870912 # 512MB background threshold
# Swappiness
vm.swappiness = 1 # Minimize swapping (0 can cause OOM)
# Page cache pressure
vm.vfs_cache_pressure = 50 # Retain page cache over dentries/inodes
Parameter explanations:
| Parameter | Recommended | Rationale |
|---|---|---|
vm.dirty_ratio |
80 | Allow large dirty cache before blocking producers |
vm.dirty_background_ratio |
5 | Start flushing early to avoid write stalls |
vm.swappiness |
1 | Keep data in RAM, minimal swap |
vm.vfs_cache_pressure |
50 | Favor page cache retention |
I/O Scheduler¶
# For NVMe/SSD (recommended)
echo none > /sys/block/nvme0n1/queue/scheduler
# For SATA SSD
echo deadline > /sys/block/sda/queue/scheduler
# Persistent configuration (/etc/udev/rules.d/60-kafka.rules)
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline"
Read-Ahead¶
# Check current setting
blockdev --getra /dev/nvme0n1
# Set read-ahead (in 512-byte sectors)
# 0 for NVMe (no benefit)
# 256-4096 for HDDs
blockdev --setra 0 /dev/nvme0n1 # NVMe
blockdev --setra 4096 /dev/sda # HDD (2MB read-ahead)
Recommendations¶
Decision Matrix¶
| Deployment | Recommended | Notes |
|---|---|---|
| Production (any scale) | XFS | Best sequential write performance |
| Small/Dev | ext4 or XFS | Either acceptable |
| Multiple disks | XFS per disk | Leverage allocation group parallelism |
| Existing ZFS infrastructure | ZFS with tuning | See ZFS section for required optimizations |
Configuration Summary¶
Optimal Kafka filesystem setup:
# Format each disk with XFS
mkfs.xfs -f /dev/nvme0n1
mkfs.xfs -f /dev/nvme1n1
# Mount with optimal options
mount -o noatime,nodiratime /dev/nvme0n1 /kafka/data1
mount -o noatime,nodiratime /dev/nvme1n1 /kafka/data2
# Configure Kafka
# server.properties
log.dirs=/kafka/data1,/kafka/data2
Kernel tuning:
# /etc/sysctl.conf
vm.dirty_ratio = 80
vm.dirty_background_ratio = 5
vm.swappiness = 1
vm.vfs_cache_pressure = 50
# Apply
sysctl -p
What to Avoid¶
| Configuration | Issue |
|---|---|
| ZFS for production | Write amplification, ARC competition |
| RAID 0 across log.dirs | Single disk failure loses all data |
| Small page cache | Increases disk I/O, disables effective zero-copy |
| High swappiness | JVM and page cache evicted to swap |
| atime enabled | Unnecessary write overhead |
Monitoring¶
Filesystem Metrics¶
# Disk I/O statistics
iostat -xz 1
# Key metrics:
# - %util: Device utilization (< 80% target)
# - await: Average I/O wait time
# - w/s: Writes per second
Page Cache Statistics¶
# Page cache usage
free -h
# Cached column shows page cache size
# Detailed memory info
cat /proc/meminfo | grep -E "Cached|Dirty|Writeback"
# Per-file cache status
vmtouch /kafka/data1/topic-*/*.log
Kafka Log Directory Health¶
# Check disk usage per log directory
du -sh /kafka/data*
# Check segment distribution
find /kafka/data* -name "*.log" | wc -l
# Verify XFS health
xfs_info /kafka/data1
xfs_repair -n /dev/nvme0n1 # Dry-run check
Summary¶
| Recommendation | Rationale |
|---|---|
| Use XFS for all log directories | Optimized for sequential writes, allocation groups enable parallelism |
| Avoid ZFS for production | CoW overhead, ARC competes with page cache |
| Configure multiple log.dirs on separate disks | Distributes I/O, increases throughput |
| Tune page cache parameters | Kafka depends on page cache for performance |
| Use noatime mount option | Eliminates unnecessary metadata writes |
| Leave majority of RAM for page cache | Enables zero-copy, reduces disk I/O |
Related Documentation¶
- Performance Overview - Complete performance tuning guide
- Capacity Planning - Sizing recommendations
- Broker Configuration - Broker settings
- Monitoring - Metrics and alerting