Schema Registry¶
Schema Registry provides centralized schema management for Apache Kafka, ensuring data compatibility between producers and consumers.
What is a Schema?¶
A schema is a formal definition of data structure—it specifies the fields, their types, and constraints that data must conform to.
| Aspect | Description |
|---|---|
| Fields | Named elements that make up the data (e.g., id, name, timestamp) |
| Types | Data type of each field (e.g., string, integer, boolean, array) |
| Constraints | Rules fields must follow (e.g., required, nullable, valid range) |
| Relationships | How nested structures and references work |
Schema vs Schemaless¶
| Approach | Characteristics |
|---|---|
| Schema-based | Structure defined upfront; validated at write time; compatible evolution enforced |
| Schemaless | Flexible structure; validated at read time (if at all); evolution is implicit |
Kafka itself is schemaless—brokers store bytes without understanding their structure. Schema Registry adds schema enforcement on top of Kafka.
Example: User Record Schema¶
```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```
This Avro schema specifies:
- `id` must be a 64-bit integer
- `name` must be a string
- `email` is optional (nullable with a null default)
- `created_at` is a timestamp represented as milliseconds
Any data that does not conform to this structure is rejected at serialization time.
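As a concrete illustration, the sketch below builds a record against this schema with the Confluent Avro serializer and sends it to a `users` topic. The broker address, registry URL, and topic name are assumptions for the example; the point is that a record missing `name`, or carrying the wrong type for `id`, fails inside the serializer before anything reaches Kafka.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class UserProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
        props.put("schema.registry.url", "http://localhost:8081"); // assumption: local registry
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // The User schema from above (id, name, optional email, created_at).
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null},"
            + "{\"name\":\"created_at\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 42L);
        user.put("name", "Jane Doe");
        user.put("created_at", System.currentTimeMillis());
        // "email" can be omitted: it is nullable with a null default.

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Skipping "name", or putting a String into "id", throws during serialization,
            // so non-conforming data never reaches the topic.
            producer.send(new ProducerRecord<>("users", "42", user));
        }
    }
}
```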
Why Schema Management Matters¶
Kafka topics are schema-agnostic—the broker stores bytes without understanding their structure. This flexibility becomes problematic when multiple applications produce and consume the same topics.
Problems Without Schema Management¶
| Problem | Impact |
|---|---|
| Format inconsistency | Producers use different field names, types, or structures |
| Silent breakage | Schema changes break consumers without warning |
| No contract enforcement | Documentation becomes stale; no runtime validation |
| Difficult evolution | Cannot safely add or remove fields |
| Debugging complexity | Hard to understand data format across systems |
Schema Registry Architecture¶
Schema Registry is a separate service that stores schemas in a Kafka topic (_schemas) and provides a REST API for schema operations.
Wire Format¶
Messages produced with Schema Registry include a schema ID prefix:
| Magic Byte (1) | Schema ID (4) | Payload (variable) |
|---|---|---|
| 0x00 | 00 00 00 01 | [serialized data] |
| Component | Size | Description |
|---|---|---|
| Magic byte | 1 byte | Always 0x00 (indicates Schema Registry format) |
| Schema ID | 4 bytes | Big-endian integer identifying the schema |
| Payload | Variable | Serialized data (Avro, Protobuf, or JSON) |
This format allows consumers to deserialize messages without prior knowledge of the schema—they retrieve the schema from the registry using the embedded ID.
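A minimal sketch of reading that prefix from a raw record value, using plain Java NIO (the class and method names here are illustrative, not part of any client library):

```java
import java.nio.ByteBuffer;

public final class WireFormatSketch {
    /** Returns the schema ID embedded in a Schema Registry wire-format value. */
    public static int schemaIdOf(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);    // ByteBuffer reads big-endian by default
        byte magic = buf.get();                     // byte 0: magic byte
        if (magic != 0x00) {
            throw new IllegalArgumentException("Not Schema Registry wire format: magic=" + magic);
        }
        int schemaId = buf.getInt();                // bytes 1-4: big-endian schema ID
        // Everything left in buf (buf.remaining() bytes) is the serialized payload.
        return schemaId;
    }
}
```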
Schema Formats¶
Schema Registry supports three serialization formats:
Apache Avro¶
Avro is the most widely used format with Kafka due to its compact binary encoding and rich schema evolution support.
```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```
| Characteristic | Avro |
|---|---|
| Encoding | Binary (compact) |
| Schema required | For both serialization and deserialization |
| Evolution support | Excellent (defaults, unions) |
| Language support | Java, Python, C#, Go, others |
| Typical use case | Data pipelines, Kafka Connect |
Protocol Buffers (Protobuf)¶
Protobuf offers efficient binary encoding with strong typing and is popular in gRPC-based systems.
```protobuf
syntax = "proto3";

message User {
  int64 id = 1;
  string name = 2;
  optional string email = 3;
}
```
| Characteristic | Protobuf |
|---|---|
| Encoding | Binary (compact) |
| Schema required | For both serialization and deserialization |
| Evolution support | Good (field numbers, optional) |
| Language support | Excellent (official support for many languages) |
| Typical use case | Microservices, systems already using gRPC |
JSON Schema¶
JSON Schema validates JSON documents and is useful when human readability is required.
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"},
    "email": {"type": "string"}
  },
  "required": ["id", "name"]
}
```
| Characteristic | JSON Schema |
|---|---|
| Encoding | JSON (human-readable) |
| Schema required | Only for validation |
| Evolution support | Limited |
| Language support | Universal (JSON everywhere) |
| Typical use case | APIs, debugging, human inspection |
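Which format a client uses is selected by its serializer class. A brief sketch of the value serializer setting for each format, assuming the Confluent serializer libraries (the Avro setting also appears in the Configuration section below):

```properties
# Avro
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer

# Protobuf
value.serializer=io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer

# JSON Schema
value.serializer=io.confluent.kafka.serializers.json.KafkaJsonSchemaSerializer
```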
Format Comparison¶
| Aspect | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Message size | Small | Small | Large |
| Serialization speed | Fast | Fast | Moderate |
| Human readable | No | No | Yes |
| Schema evolution | Excellent | Good | Limited |
| Kafka ecosystem support | Excellent | Good | Good |
Compatibility¶
Schema Registry enforces compatibility rules when schemas evolve. Compatibility ensures that schema changes do not break existing producers or consumers.
Compatibility Modes¶
| Mode | Rule | Upgrade Order |
|---|---|---|
| BACKWARD | New schema can read data written with old schema | Consumers first |
| BACKWARD_TRANSITIVE | New schema can read data from all previous versions | Consumers first |
| FORWARD | Old schema can read data written with new schema | Producers first |
| FORWARD_TRANSITIVE | All previous schemas can read new data | Producers first |
| FULL | Both backward and forward compatible | Any order |
| FULL_TRANSITIVE | Full compatibility with all versions | Any order |
| NONE | No compatibility checking | Not recommended |
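The compatibility mode can be set globally or per subject through the `/config` REST resource described later. For example, a sketch of requiring forward compatibility for the `users-value` subject (registry host as used elsewhere on this page):

```bash
curl -X PUT \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility": "FORWARD"}' \
  http://schema-registry:8081/config/users-value
```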
Safe Schema Changes¶
| Change Type | BACKWARD | FORWARD | FULL |
|---|---|---|---|
| Add field with a default | ✅ | ✅ | ✅ |
| Remove field with a default | ✅ | ✅ | ✅ |
| Add required field (no default) | ❌ | ✅ | ❌ |
| Remove required field (no default) | ✅ | ❌ | ❌ |
| Add optional field (Avro union with null, default null) | ✅ | ✅ | ✅ |
| Remove optional field (Avro union with null, default null) | ✅ | ✅ | ✅ |
These rules reflect Avro schema resolution; Protobuf and JSON Schema follow similar but not identical rules.
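For example, evolving the User schema from earlier by adding a nullable `phone` field with a null default falls into the "add optional field" row and is accepted under BACKWARD, FORWARD, and FULL (the field name is purely illustrative):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "phone", "type": ["null", "string"], "default": null}
  ]
}
```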
Subjects and Naming¶
Schemas are organized into subjects. The subject naming strategy determines how schemas are associated with topics.
Naming Strategies¶
| Strategy | Subject Name | Use Case |
|---|---|---|
| TopicNameStrategy | `<topic>-key`, `<topic>-value` | One schema per topic (default) |
| RecordNameStrategy | `<record-namespace>.<record-name>` | Multiple record types per topic |
| TopicRecordNameStrategy | `<topic>-<record-namespace>.<record-name>` | Multiple types with topic isolation |
TopicNameStrategy (Default)¶
```text
Topic: orders
Key subject: orders-key
Value subject: orders-value
```
Most deployments use TopicNameStrategy—each topic has one schema for keys and one for values.
RecordNameStrategy¶
```text
Topic: events
Records: com.example.OrderCreated, com.example.OrderShipped
Subjects: com.example.OrderCreated, com.example.OrderShipped
```
RecordNameStrategy allows multiple record types in a single topic, useful for event sourcing where different event types share a topic.
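The strategy is configured on the client serializers. A sketch of switching a producer's value subjects to RecordNameStrategy, assuming the Confluent serializer configuration keys:

```properties
# Default is TopicNameStrategy; RecordNameStrategy derives the subject from the record name
value.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy
```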
Schema Evolution¶
Schema evolution allows schemas to change over time while maintaining compatibility with existing data.
Evolution Best Practices¶
| Practice | Rationale |
|---|---|
| Use optional fields | Allow safe addition and removal |
| Provide defaults | Enable backward compatibility |
| Never change field types | Type changes break compatibility |
| Never reuse field names | Previous data may have different semantics |
| Add to end of records | Maintains wire compatibility |
| Use aliases for renaming | Provides backward compatibility for renamed fields |
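As an example of the last practice, an Avro reader schema can list the old name of a renamed field in `aliases`, so data written under the previous name still resolves (the rename below is illustrative):

```json
{"name": "email_address", "aliases": ["email"], "type": ["null", "string"], "default": null}
```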
Configuration¶
Producer Configuration¶
```properties
# Schema Registry URL
schema.registry.url=http://schema-registry:8081

# Key serializer (for Avro keys)
key.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer

# Value serializer (for Avro values)
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer

# Auto-register schemas (default: true)
auto.register.schemas=true

# Use latest schema version (default: false)
use.latest.version=false
```
Consumer Configuration¶
```properties
# Schema Registry URL
schema.registry.url=http://schema-registry:8081

# Key deserializer
key.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer

# Value deserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer

# Return specific record types (vs GenericRecord)
specific.avro.reader=true
```
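A minimal consumer sketch using these settings with the generic Avro API; the broker address, group ID, and topic name are assumptions for the example:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class UserConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
        props.put("group.id", "user-readers");                     // assumption: example group
        props.put("schema.registry.url", "http://localhost:8081"); // assumption: local registry
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        // With specific.avro.reader=true and a generated User class on the classpath,
        // values arrive as User instances; this sketch sticks to GenericRecord.

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("users"));
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, GenericRecord> record : records) {
                // The deserializer fetched the writer schema from the registry via the
                // embedded schema ID, so fields can be read by name.
                System.out.println(record.value().get("id") + " " + record.value().get("name"));
            }
        }
    }
}
```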
Schema Registry Server¶
Key server configurations:
| Property | Default | Description |
|---|---|---|
| `kafkastore.bootstrap.servers` | - | Kafka bootstrap servers for the `_schemas` topic |
| `kafkastore.topic` | `_schemas` | Topic for schema storage |
| `kafkastore.topic.replication.factor` | 3 | Replication factor for the schemas topic |
| `compatibility.level` | BACKWARD | Default compatibility mode |
| `mode.mutability` | true | Allow the registry mode to be changed via the REST API |
REST API¶
Schema Registry provides a REST API for schema operations.
Register a Schema¶
```bash
curl -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"}]}"}' \
  http://schema-registry:8081/subjects/users-value/versions
```

Response:

```json
{"id": 1}
```
Get Latest Schema¶
```bash
curl http://schema-registry:8081/subjects/users-value/versions/latest
```
Check Compatibility¶
```bash
curl -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{...}"}' \
  http://schema-registry:8081/compatibility/subjects/users-value/versions/latest
```
Common Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| `/subjects` | GET | List all subjects |
| `/subjects/{subject}/versions` | GET | List versions for a subject |
| `/subjects/{subject}/versions` | POST | Register a new schema |
| `/subjects/{subject}/versions/{version}` | GET | Get a specific version |
| `/schemas/ids/{id}` | GET | Get a schema by global ID |
| `/config` | GET/PUT | Global compatibility config |
| `/config/{subject}` | GET/PUT | Subject-level compatibility |
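For example, the schema registered earlier (which returned `{"id": 1}`) can be fetched back by its global ID:

```bash
curl http://schema-registry:8081/schemas/ids/1
```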
High Availability¶
For production deployments, Schema Registry should be deployed with high availability.
Deployment Recommendations¶
| Aspect | Recommendation |
|---|---|
| Instance count | Minimum 2, typically 3 for high availability |
| Leader election | Uses Kafka consumer group protocol |
| Read scaling | Add replicas for read throughput |
| Write scaling | Single primary (not horizontally scalable for writes) |
| Schemas topic | Replication factor ≥ 3, min.insync.replicas = 2 |
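Schema Registry creates the `_schemas` topic on first start, but it can also be created ahead of time to pin down the settings above. A sketch using the standard Kafka CLI, assuming the conventional single-partition, compacted layout for this topic:

```bash
# Hypothetical pre-creation of the schemas topic with the recommended durability settings
kafka-topics.sh --bootstrap-server kafka:9092 --create \
  --topic _schemas \
  --partitions 1 \
  --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.insync.replicas=2
```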
Related Documentation¶
- Why Schemas - Detailed motivation for schema management
- Schema Formats - Avro, Protobuf, JSON Schema guides
- Compatibility - Compatibility rules and evolution
- Schema Evolution - Safe schema evolution practices
- Operations - Operating Schema Registry in production