Kafka Connect¶
Kafka Connect is the integration layer of the Kafka ecosystem, providing a standardized framework for moving data between Kafka and external systems.
Configuration-Driven Integration¶
Kafka Connect enables enterprise data integration without writing code. Instead of developing custom producers and consumers for each system, organizations deploy integrations through JSON configuration.
From Code to Configuration¶
Traditional approach (custom code for each integration):
// Hundreds of lines per integration
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build();
consumer.subscribe(List.of("events"));
while (true) {
    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
    // Batching logic
    // Parquet conversion
    // S3 multipart upload
    // Offset management
    // Error handling
    // Retry logic
    // Monitoring
}
Kafka Connect approach (configuration only):
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "data-lake",
    "s3.region": "us-east-1",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "10000"
  }
}
Deploy via REST API:
curl -X POST -H "Content-Type: application/json" \
  --data @s3-sink.json \
  http://connect:8083/connectors
Why This Matters for Enterprises¶
| Aspect | Custom Code | Kafka Connect |
|---|---|---|
| Time to deploy | Weeks to months | Hours to days |
| Skills required | Kafka expertise + target system expertise | Configuration knowledge |
| Maintenance burden | Full ownership of code | Connector upgrades only |
| Consistency | Varies by developer | Standardized across all integrations |
| Risk | Untested code in production | Battle-tested connectors |
| Scaling | Code changes may be needed | Configuration change (tasks.max) |
For organizations with dozens of integration requirements, the configuration-driven approach transforms data integration from a development problem into an operations problem—dramatically reducing time-to-value and ongoing maintenance costs.
Modify Integrations Without Deployments¶
Connector behavior can be modified at runtime via REST API:
# Update configuration (tasks restart with the new settings; no redeployment)
curl -X PUT -H "Content-Type: application/json" \
  --data '{"connector.class": "...", "flush.size": "50000"}' \
  http://connect:8083/connectors/s3-sink/config
# Pause connector
curl -X PUT http://connect:8083/connectors/s3-sink/pause
# Resume connector
curl -X PUT http://connect:8083/connectors/s3-sink/resume
# Check status
curl http://connect:8083/connectors/s3-sink/status
No code deployments. No container rebuilds. No CI/CD pipelines for configuration changes.
The Integration Challenge¶
Modern enterprises operate dozens to hundreds of data systems: relational databases, NoSQL stores, search engines, cloud storage, data warehouses, SaaS applications, and legacy systems. Without a standardized integration approach, each connection requires custom code.
Custom Integration Problems¶
| Problem | Impact |
|---|---|
| Duplicated effort | Every integration reimplements common patterns |
| Inconsistent quality | Each integration has different error handling, monitoring |
| Operational burden | Each integration is a separate system to deploy and monitor |
| Schema coupling | Producers and consumers must coordinate on formats |
| Scalability challenges | Each integration must solve parallelism independently |
Kafka Connect Solution¶
Kafka Connect provides a standardized framework where connectors handle system-specific logic and the framework handles operational concerns.
What Kafka Connect Provides¶
| Capability | Description |
|---|---|
| 200+ pre-built connectors | Database CDC, cloud storage, search engines, messaging systems |
| Standardized operations | Same deployment, monitoring, and management for all connectors |
| Distributed mode | Automatic task distribution and fault tolerance |
| Offset tracking | Framework manages position in source/sink systems |
| Schema integration | Automatic serialization via Schema Registry |
| Single Message Transforms | Lightweight transformations without custom code |
| Dead letter queues | Standardized error handling |
| Exactly-once delivery | Supported for compatible connectors (Kafka 3.3+) |
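The Single Message Transforms and dead letter queue rows are pure configuration. A minimal sketch, reusing the earlier s3-sink example with other connector settings abbreviated: the transform alias addSource, the injected field, and the dead letter queue topic name are illustrative choices, while the errors.* and transforms.* keys are standard framework settings:
# Add a static field to every record and route failed records to a DLQ topic
curl -X PUT -H "Content-Type: application/json" \
  --data '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq-s3-sink",
    "errors.deadletterqueue.context.headers.enable": "true",
    "transforms": "addSource",
    "transforms.addSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.addSource.static.field": "source_system",
    "transforms.addSource.static.value": "events-pipeline"
  }' \
  http://connect:8083/connectors/s3-sink/config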
Connector Ecosystem¶
The Kafka Connect ecosystem includes connectors for virtually every common data system.
Connector Categories¶
| Category | Connectors | Primary Use Case |
|---|---|---|
| Event Sources | HTTP/REST, MQTT, File/Syslog, JMS/MQ | Streaming events into Kafka |
| Cloud Storage Sinks | S3, GCS, Azure Blob, HDFS | Data lake ingestion |
| Database Sinks | Cassandra, JDBC, Elasticsearch, OpenSearch | Persistent storage and search |
| Data Warehouse Sinks | Snowflake, BigQuery, Redshift, Databricks | Analytics pipelines |
Why Kafka Connect Matters¶
Kafka Connect is often the most valuable component of a Kafka deployment. Several factors explain this:
1. Connector Quality¶
Production connectors implement complex logic that is difficult to replicate:
| Connector | Complexity Handled |
|---|---|
| HTTP Source | Pagination, rate limiting, authentication, retry logic |
| S3 Sink | Partitioning, rotation, exactly-once with idempotent writes |
| Cassandra Sink | Batching, retry policies, consistency levels, schema mapping |
| MQTT Source | QoS handling, reconnection, message ordering, topic mapping |
2. Operational Standardization¶
| Aspect | Without Connect | With Connect |
|---|---|---|
| Deployment | Different for each integration | Same for all connectors |
| Monitoring | Custom metrics per integration | Standard JMX metrics |
| Scaling | Custom logic | Add workers, increase tasks |
| Failure handling | Custom implementation | Dead letter queues, retry policies |
| Configuration | Code changes | REST API, JSON configuration |
3. Schema Registry Integration¶
Connectors integrate with Schema Registry automatically: the configured key and value converters serialize records and register schemas, so no connector-specific serialization code is needed.
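A minimal sketch of a per-connector converter override, assuming the Confluent Avro converter and a Schema Registry reachable at http://schema-registry:8081 (both the converter class and the URL depend on your serialization format and environment; other connector settings are abbreviated):
# Override the worker defaults for this connector only
curl -X PUT -H "Content-Type: application/json" \
  --data '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }' \
  http://connect:8083/connectors/s3-sink/config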
Source vs Sink Connectors¶
Source Connectors¶
Source connectors read from external systems and write to Kafka topics.
| Source Type | How It Works | Examples |
|---|---|---|
| File-based | Watches for new files or streams | File Source, Syslog |
| API-based | Polls REST/GraphQL APIs | HTTP Source |
| Message-based | Bridges messaging systems | JMS, MQTT, SQS |
| Stream-based | Connects streaming platforms | Kinesis, Event Hubs |
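A hedged sketch of a message-based source, assuming the Confluent MQTT source connector; the connector class, broker URI, and topic names shown here are illustrative and will differ for other MQTT connectors and brokers:
curl -X POST -H "Content-Type: application/json" \
  --data '{
    "name": "mqtt-source",
    "config": {
      "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
      "mqtt.server.uri": "tcp://mqtt-broker:1883",
      "mqtt.topics": "sensors/#",
      "kafka.topic": "sensor-events",
      "tasks.max": "1"
    }
  }' \
  http://connect:8083/connectors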
Sink Connectors¶
Sink connectors read from Kafka topics and write to external systems.
| Sink Type | How It Works | Examples |
|---|---|---|
| Storage | Writes files in batches | S3, GCS, HDFS |
| Database | Upserts or inserts records | Cassandra, JDBC |
| Search | Indexes documents | Elasticsearch, OpenSearch |
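A comparable sketch for the search category, assuming the Confluent Elasticsearch sink connector; the connection URL is a placeholder for your cluster:
curl -X POST -H "Content-Type: application/json" \
  --data '{
    "name": "es-sink",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "topics": "events",
      "connection.url": "http://elasticsearch:9200",
      "key.ignore": "true",
      "tasks.max": "2"
    }
  }' \
  http://connect:8083/connectors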
Deployment Modes¶
Standalone Mode¶
Single worker process—suitable for development and simple use cases.
Distributed Mode¶
Multiple worker processes forming a cluster—required for production.
| Aspect | Standalone | Distributed |
|---|---|---|
| Fault tolerance | None | Automatic task redistribution |
| Scaling | Vertical only | Horizontal (add workers) |
| Offset storage | Local file | Kafka topics |
| Use case | Development, simple integrations | Production |
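A brief sketch of launching each mode, assuming the scripts and sample worker property files that ship with an Apache Kafka distribution; s3-sink.properties stands in for a properties-format version of the JSON config shown earlier:
# Standalone: one worker; connector configs are passed as properties files
bin/connect-standalone.sh config/connect-standalone.properties s3-sink.properties

# Distributed: start one worker per host, then manage connectors via the REST API
bin/connect-distributed.sh config/connect-distributed.properties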
Tasks and Parallelism¶
Connectors are divided into tasks for parallelism. Each task processes a subset of the data.
Task Distribution¶
| Connector Type | Task Parallelism |
|---|---|
| File Source | One task per file or directory |
| HTTP Source | One task per endpoint (typically) |
| S3 Sink | Tasks share topic partitions |
| Cassandra Sink | Tasks share topic partitions |
The tasks.max configuration controls maximum parallelism. Actual task count depends on the connector's ability to parallelize the work.
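A short sketch, reusing the earlier s3-sink example: raising tasks.max is a configuration change followed by a status check. Note that PUT replaces the entire configuration, so the full settings (abbreviated here) must be resubmitted:
# Raise the parallelism ceiling
curl -X PUT -H "Content-Type: application/json" \
  --data '{"connector.class": "io.confluent.connect.s3.S3SinkConnector", "topics": "events", "tasks.max": "8"}' \
  http://connect:8083/connectors/s3-sink/config

# Confirm how many tasks are actually running
curl http://connect:8083/connectors/s3-sink/status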
Connect vs Custom Code¶
When to Use Connect¶
| Scenario | Recommendation |
|---|---|
| Cloud storage ingestion | Use S3/GCS connector—handles partitioning, rotation |
| Persistent storage to Cassandra | Use Cassandra Sink—handles batching, consistency levels |
| Search indexing | Use Elasticsearch connector—handles bulk API, backpressure |
| Legacy messaging bridge | Use JMS/MQ connector—handles protocol translation |
When Custom Code May Be Better¶
| Scenario | Why |
|---|---|
| Complex business logic in transformation | Kafka Streams provides full programming model |
| Non-standard protocols or APIs | May require custom producer/consumer |
| Real-time with sub-millisecond latency | Direct producer may have less overhead |
| Highly specialized integration | No suitable connector exists |
The general guidance: prefer Connect when a suitable connector exists. Connectors encode years of production experience and edge case handling.
Related Documentation¶
- Connector Ecosystem - Comprehensive connector catalog
- Cloud Storage - S3, GCS, data lake patterns
- Kafka Connect Guide - Technical reference
- Build vs Buy - Decision framework