Dead Letter Queues¶
A Dead Letter Queue (DLQ) is a separate topic that stores messages which cannot be processed successfully. Rather than blocking the consumer, losing the message, or retrying indefinitely, failed messages are moved to the DLQ for later analysis and reprocessing.
Kafka does not provide built-in DLQ functionality—it is an application-level pattern that must be implemented by producers or consumers. This follows Kafka's "dumb broker, smart consumer" architecture where error handling responsibility lies with client applications rather than the broker.
Kafka DLQ vs Traditional Message Brokers¶
Traditional message brokers (JMS, RabbitMQ, IBM MQ) provide built-in DLQ functionality, typically routing messages based on:
- TTL expiration - message exceeded time-to-live
- Delivery failures - maximum delivery attempts exceeded
- Queue capacity - destination queue full
Kafka DLQs serve a different purpose. Since Kafka retains messages regardless of consumption and consumers control their own offsets, DLQs in Kafka primarily address:
- Invalid message format - deserialization failures, schema mismatches
- Bad message content - validation errors, missing required fields
- Processing failures - business logic exceptions, dependency errors
This distinction is important: Kafka DLQs are about message quality, not delivery mechanics.
The Problem: Poison Messages¶
A poison message is a message that repeatedly causes consumer processing to fail. Without proper handling, a single poison message can halt an entire consumer group: the consumer fails, restarts, re-reads the same offset, and fails again.
Common Causes of Poison Messages¶
| Cause | Description |
|---|---|
| Schema mismatch | Producer schema incompatible with consumer's deserializer |
| Corrupt data | Malformed JSON, invalid Avro, truncated payload |
| Business validation | Data fails domain validation (negative price, invalid date) |
| Missing dependencies | Referenced entity doesn't exist in database |
| Transient failures | Database timeout, network error (may succeed on retry) |
| Code bugs | Consumer code throws exception for certain data patterns |
DLQ Pattern¶
The DLQ pattern isolates failed messages so healthy messages continue processing.
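The core of the pattern can be sketched as a small wrapper around the processing step. This is a minimal, client-agnostic sketch: `handler` and `dlq_send` are hypothetical stand-ins for your processing logic and a Kafka producer's send to the DLQ topic.

```python
def process_with_dlq(message, handler, dlq_send, max_retries=3):
    """Try to process a message; route it to the DLQ after exhausting retries.

    `handler` raises on failure; `dlq_send` publishes the failed message
    (plus error context) to the DLQ topic. Returns True if processed,
    False if routed to the DLQ.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            handler(message)
            return True
        except Exception as exc:  # real code would classify errors first
            last_error = exc
    # Retries exhausted: isolate the message instead of blocking the partition
    dlq_send(message, error=last_error, attempts=max_retries)
    return False
```

Either outcome lets the consumer commit the offset and move on, which is exactly the fault isolation the benefits table below describes.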
DLQ Benefits¶
| Benefit | Description |
|---|---|
| Fault isolation | One bad message doesn't block partition processing |
| No data loss | Failed messages preserved for analysis and reprocessing |
| Visibility | DLQ depth indicates system health issues |
| Debugging | Failed messages available for root cause analysis |
| Controlled retry | Messages can be reprocessed after fixes deployed |
DLQ Architecture¶
Basic DLQ Topology¶
In the basic topology, the consumer reads from the source topic and produces any message that fails processing to a companion DLQ topic; a separate consumer or operator tooling reads the DLQ for analysis and replay.
Multi-Stage DLQ (Retry Tiers)¶
For transient failures, implement multiple retry stages before the final DLQ: failed messages flow through tiered retry topics with increasing delays (for example, a one-minute tier, then a ten-minute tier) and reach the DLQ only after every tier is exhausted.
Retry Timing Strategies¶
| Strategy | Implementation | Use Case |
|---|---|---|
| Immediate retry | Consumer retries N times in-memory | Transient network glitches |
| Delayed retry | Separate retry topics with consumer pause | Rate limiting, backpressure |
| Exponential backoff | Increasing delays (1s, 5s, 30s, 5m) | External service recovery |
| Scheduled retry | Time-windowed reprocessing | Batch reconciliation |
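The exponential backoff row can be expressed as a capped delay schedule. The multiplier and cap below are illustrative assumptions, not values from this document; production systems usually also add jitter.

```python
def backoff_schedule(attempts, base=1.0, factor=5.0, cap=300.0):
    """Return the delay in seconds before each retry attempt.

    Delays grow geometrically (base * factor**n) and are capped so a
    long-failing message never waits unboundedly between retries.
    """
    return [min(base * factor ** n, cap) for n in range(attempts)]
```

For example, `backoff_schedule(4)` yields delays of 1s, 5s, 25s, and 125s, roughly matching the 1s/5s/30s/5m progression in the table.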
DLQ Topic Strategy¶
Organizations must choose a DLQ topic strategy: a dedicated DLQ per source topic, a single unified DLQ, or a per-service DLQ in between.
| Strategy | Approach | Trade-offs |
|---|---|---|
| Per-topic DLQ | orders.dlq, payments.dlq, users.dlq | Targeted analysis, clear ownership; more topics to manage |
| Unified DLQ | Single application.dlq for all topics | Simpler operations, single dashboard; harder root cause analysis |
| Per-service DLQ | order-service.dlq handles multiple input topics | Balanced approach; requires header-based routing |
Most production deployments use per-topic DLQs for clear ownership and targeted alerting.
DLQ Storage Options¶
DLQ messages can remain in Kafka or be moved to external storage for long-term retention and analysis.
| Storage | Best For | Considerations |
|---|---|---|
| Kafka topic | Standard use cases, automated reprocessing | Set appropriate retention; monitor disk usage |
| S3/GCS archive | Compliance, long-term retention | Batch reprocessing; requires ETL tooling |
| Database (PostgreSQL) | Manual review workflows, complex remediation | Enables UI/CLI tooling; additional infrastructure |
DLQ Message Structure¶
A DLQ message should contain the original message plus metadata for debugging and reprocessing.
Standard DLQ Headers¶
| Header | Purpose |
|---|---|
| dlq.original.topic | Source topic name |
| dlq.original.partition | Source partition |
| dlq.original.offset | Source offset (for replay tracking) |
| dlq.original.timestamp | Original message timestamp |
| dlq.original.key | Original message key (if key changed) |
| dlq.error.message | Exception message |
| dlq.error.class | Exception class name |
| dlq.error.stacktrace | Stack trace (optional, can be large) |
| dlq.retry.count | Number of processing attempts |
| dlq.failed.timestamp | When message was sent to DLQ |
| dlq.consumer.group | Consumer group that failed |
| dlq.consumer.instance | Specific consumer instance |
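Assembling these headers from a failed record can be sketched as below. The `record` object is a stand-in mirroring a Kafka consumer record's attributes; values are encoded to bytes because Kafka headers are byte-valued.

```python
import time

def build_dlq_headers(record, error, retry_count, group_id, instance_id):
    """Assemble the standard DLQ headers for a failed record.

    `record` is any object exposing topic/partition/offset/timestamp/key,
    mirroring a Kafka consumer record. The optional stack trace header is
    omitted here since it can be large.
    """
    headers = {
        "dlq.original.topic": record.topic,
        "dlq.original.partition": str(record.partition),
        "dlq.original.offset": str(record.offset),
        "dlq.original.timestamp": str(record.timestamp),
        "dlq.original.key": record.key or "",
        "dlq.error.message": str(error),
        "dlq.error.class": type(error).__name__,
        "dlq.retry.count": str(retry_count),
        "dlq.failed.timestamp": str(int(time.time() * 1000)),
        "dlq.consumer.group": group_id,
        "dlq.consumer.instance": instance_id,
    }
    # Kafka header values are bytes
    return {k: v.encode("utf-8") for k, v in headers.items()}
```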
Error Classification¶
Not all errors should go to the DLQ. Classify errors to determine the appropriate handling strategy.
Error Categories¶
| Category | Examples | Strategy |
|---|---|---|
| Transient | Connection timeout, rate limit, lock contention | Retry with backoff |
| Recoverable | Schema mismatch, validation failure, missing reference | DLQ → fix → reprocess |
| Corrupt | Invalid encoding, truncated message, wrong topic | DLQ → investigate → discard |
| Poison | Causes crash/OOM, infinite loop trigger | DLQ → immediate alert |
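These categories map naturally onto a classification function. The exception-to-category mapping below is illustrative; real systems would inspect client- and library-specific exception types.

```python
from enum import Enum

class ErrorCategory(Enum):
    TRANSIENT = "retry_with_backoff"
    RECOVERABLE = "dlq_then_reprocess"
    CORRUPT = "dlq_then_investigate"

# Illustrative mapping; tune per application and client library.
TRANSIENT_TYPES = (TimeoutError, ConnectionError)
CORRUPT_TYPES = (UnicodeDecodeError, ValueError)

def classify(error):
    """Decide the handling strategy from the exception type alone."""
    if isinstance(error, TRANSIENT_TYPES):
        return ErrorCategory.TRANSIENT
    if isinstance(error, CORRUPT_TYPES):
        return ErrorCategory.CORRUPT
    # Default: preserve for fix-and-reprocess rather than discard
    return ErrorCategory.RECOVERABLE
```

A dispatcher would then retry transient errors with backoff and send the other categories to the DLQ with the appropriate headers.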
DLQ Topic Configuration¶
DLQ topics should be configured for durability and long retention since messages may need reprocessing weeks later.
Recommended Settings¶
```properties
# DLQ topic configuration
cleanup.policy=delete
retention.ms=2592000000       # 30 days (longer than main topics)
retention.bytes=-1            # No size limit
min.insync.replicas=2         # Durability
replication.factor=3          # Durability
compression.type=producer     # Preserve original compression
```
Partitioning Strategy¶
| Strategy | Approach | Trade-off |
|---|---|---|
| Single partition | All DLQ messages in one partition | Simple, ordered review; limited throughput |
| Match source | Same partition count as source topic | Preserves key locality; complex reprocessing |
| By error type | Partition by error category | Easy triage; custom partitioner needed |
| Round-robin | Default partitioning | Balanced load; no ordering |
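Partitioning by error type needs the custom partitioner the table mentions. A sketch of the core mapping, using a stable hash rather than Python's per-process `hash()` so the assignment survives restarts (the scheme itself is an assumption, not prescribed by Kafka):

```python
import hashlib

def partition_for_error(error_class: str, num_partitions: int) -> int:
    """Deterministically map an error class name to a DLQ partition.

    md5 is used only as a stable, well-distributed hash, not for security.
    All messages failing with the same exception class land together,
    which simplifies triage.
    """
    digest = hashlib.md5(error_class.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```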
DLQ vs Alternatives¶
Comparison with Other Patterns¶
| Approach | Blocking | Data Loss | Complexity | Best For |
|---|---|---|---|---|
| Retry in-place | Yes | No | Low | Transient errors only |
| Skip and log | No | Yes | Low | Non-critical data |
| DLQ | No | No | Medium | Production systems |
| Parking lot | No | No | High | Compliance, finance |
Preventing DLQ Messages¶
The best DLQ strategy is minimizing messages that reach it. Schema Registry provides producer-side validation that catches bad messages before they enter Kafka.
Prevention Strategies¶
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Schema Registry | Enforce Avro/Protobuf/JSON Schema at producer | Catches format errors before Kafka |
| Producer validation | Validate business rules before send | Catches domain errors at source |
| Contract testing | Verify producer/consumer compatibility in CI | Catches schema drift before deployment |
| Input sanitization | Clean/normalize data at ingestion boundary | Reduces malformed data |
Prevention reduces DLQ volume but cannot eliminate it entirely—runtime failures, dependency issues, and edge cases still require DLQ handling.
When to Use DLQs¶
Good Candidates¶
- Order processing where every message must be accounted for
- Financial transactions requiring audit trails
- Event sourcing where event loss corrupts state
- Multi-tenant systems where one tenant's bad data shouldn't affect others
- Integration pipelines where upstream data quality varies
Poor Candidates¶
- Metrics/telemetry where occasional loss is acceptable
- Cache invalidation events (stale cache self-corrects)
- Heartbeats/health checks
- High-volume logs where DLQ would be overwhelmed
DLQ Anti-Patterns¶
Anti-Pattern: DLQ as Primary Error Handling¶
Routing every failure straight to the DLQ with no retries turns transient errors into manual work and floods the DLQ with messages that would have succeeded on a second attempt. Treat the DLQ as the last resort after retries are exhausted, not the first response.
Common Anti-Patterns¶
| Anti-Pattern | Problem | Solution |
|---|---|---|
| No retry before DLQ | Transient errors flood DLQ | Implement retry with backoff |
| DLQ without monitoring | Silent failures accumulate | Alert on DLQ depth |
| No reprocessing plan | DLQ becomes data graveyard | Build reprocessing tooling |
| Infinite retry | Poison messages never reach DLQ | Set max retry limit |
| Losing error context | Can't debug failures | Include error metadata in headers |
| Same retention as source | DLQ expires before review | Longer DLQ retention |
| DLQ for backpressure | Using DLQ to handle load spikes | Scale consumers or use quotas |
| Connection errors to DLQ | Network timeouts sent to DLQ | Retry in application; fix connectivity |
| No DLQ ownership | Nobody reviews DLQ messages | Assign data owners, not just infrastructure |
| Ignoring DLQ entirely | Messages accumulate indefinitely | Process or archive with defined SLA |
DLQ Ownership¶
Effective DLQ management requires clear ownership and defined processes.
Responsibility Model¶
| Role | Responsibility |
|---|---|
| Data owner | Review failed messages, determine if data fix needed |
| Development team | Fix code bugs causing failures, deploy fixes |
| Operations | Monitor DLQ depth, trigger alerts, manage retention |
| Platform team | Provide reprocessing tooling, maintain DLQ infrastructure |
Process Considerations¶
- SLA for review - define maximum time messages can remain in DLQ unreviewed
- Escalation path - who gets notified when DLQ depth exceeds thresholds
- Reprocessing authority - who can trigger replay of DLQ messages
- Discard policy - criteria for permanently discarding unrecoverable messages
Related Documentation¶
- Error Handling Implementation - Code patterns and implementation
- DLQ Operations - Monitoring, alerting, and reprocessing
- Schema Registry - Producer-side validation to prevent bad messages
- Delivery Semantics - At-least-once and exactly-once processing
- Consumer Error Handling - Consumer-side error strategies