Schema Registry

Schema Registry provides centralized schema management for Apache Kafka, ensuring data compatibility between producers and consumers.


What is a Schema?

A schema is a formal definition of data structure—it specifies the fields, their types, and constraints that data must conform to.

| Aspect | Description |
|---|---|
| Fields | Named elements that make up the data (e.g., id, name, timestamp) |
| Types | Data type of each field (e.g., string, integer, boolean, array) |
| Constraints | Rules fields must follow (e.g., required, nullable, valid range) |
| Relationships | How nested structures and references work |

Schema vs Schemaless

| Approach | Characteristics |
|---|---|
| Schema-based | Structure defined upfront; validated at write time; compatible evolution enforced |
| Schemaless | Flexible structure; validated at read time (if at all); evolution is implicit |

Kafka itself is schemaless—brokers store bytes without understanding their structure. Schema Registry adds schema enforcement on top of Kafka.

Example: User Record Schema

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}

This Avro schema specifies:

  • id must be a 64-bit integer
  • name must be a string
  • email is optional (nullable with null default)
  • created_at is a timestamp represented as milliseconds

Any data that does not conform to this structure is rejected at serialization time.
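
To make the rejection concrete, here is a minimal Java sketch using Avro's GenericRecordBuilder; the schema string is abbreviated (created_at omitted), and build() throws before anything is sent if a required field is left unset:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;

public class UserRecordExample {
    public static void main(String[] args) {
        // Parse the User schema shown above (abbreviated for brevity)
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // email is not set, so it falls back to its null default
        GenericData.Record user = new GenericRecordBuilder(schema)
            .set("id", 42L)
            .set("name", "Ada")
            .build();

        // Leaving out a required field such as "name" would instead make
        // build() throw an AvroRuntimeException
        System.out.println(user);
    }
}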


Why Schema Management Matters

Kafka topics are schema-agnostic: because brokers never inspect message contents, any producer can write any bytes to a topic. This flexibility becomes problematic when multiple applications produce and consume the same topics.

Problems Without Schema Management

| Problem | Impact |
|---|---|
| Format inconsistency | Producers use different field names, types, or structures |
| Silent breakage | Schema changes break consumers without warning |
| No contract enforcement | Documentation becomes stale; no runtime validation |
| Difficult evolution | Cannot safely add or remove fields |
| Debugging complexity | Hard to understand data format across systems |

Schema Registry Architecture

Schema Registry is a separate service that stores schemas in a Kafka topic (_schemas) and provides a REST API for schema operations.

Wire Format

Messages produced with Schema Registry include a schema ID prefix:

| Magic Byte (1) | Schema ID (4) | Payload (variable) |
|     0x00       |   00 00 00 01 |   [serialized data] |

| Component | Size | Description |
|---|---|---|
| Magic byte | 1 byte | Always 0x00 (indicates Schema Registry wire format) |
| Schema ID | 4 bytes | Big-endian integer identifying the schema |
| Payload | Variable | Serialized data (Avro, Protobuf, or JSON) |

This format allows consumers to deserialize messages without prior knowledge of the schema—they retrieve the schema from the registry using the embedded ID.
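
A minimal Java sketch of parsing this prefix, assuming the raw message bytes are already in hand; ByteBuffer reads big-endian by default, matching the wire format:

import java.nio.ByteBuffer;
import java.util.Arrays;

public class WireFormatParser {
    public static int schemaIdOf(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);   // big-endian by default
        byte magic = buffer.get();
        if (magic != 0x00) {
            throw new IllegalArgumentException("Not Schema Registry wire format: " + magic);
        }
        int schemaId = buffer.getInt();                 // 4-byte schema ID
        byte[] payload = Arrays.copyOfRange(message, 5, message.length);
        // A real deserializer would now fetch the schema for schemaId from the
        // registry (GET /schemas/ids/{id}) and decode payload with it
        return schemaId;
    }
}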


Schema Formats

Schema Registry supports three serialization formats:

Apache Avro

Avro is the most widely used format with Kafka due to its compact binary encoding and rich schema evolution support.

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

| Characteristic | Avro |
|---|---|
| Encoding | Binary (compact) |
| Schema required | For both serialization and deserialization |
| Evolution support | Excellent (defaults, unions) |
| Language support | Java, Python, C#, Go, others |
| Typical use case | Data pipelines, Kafka Connect |

Protocol Buffers (Protobuf)

Protobuf offers efficient binary encoding with strong typing and is popular in gRPC-based systems.

syntax = "proto3";

message User {
  int64 id = 1;
  string name = 2;
  optional string email = 3;
}

| Characteristic | Protobuf |
|---|---|
| Encoding | Binary (compact) |
| Schema required | For both serialization and deserialization |
| Evolution support | Good (field numbers, optional) |
| Language support | Excellent (official support for many languages) |
| Typical use case | Microservices, systems already using gRPC |
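
The producer configuration mirrors the Avro setup shown later, swapping in Confluent's Protobuf serializer:

value.serializer=io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer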

JSON Schema

JSON Schema validates JSON documents and is useful when human readability is required.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"},
    "email": {"type": "string"}
  },
  "required": ["id", "name"]
}

| Characteristic | JSON Schema |
|---|---|
| Encoding | JSON (human-readable) |
| Schema required | Only for validation |
| Evolution support | Limited |
| Language support | Universal (JSON everywhere) |
| Typical use case | APIs, debugging, human inspection |
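
Similarly, JSON Schema messages use Confluent's JSON Schema serializer:

value.serializer=io.confluent.kafka.serializers.json.KafkaJsonSchemaSerializer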

Format Comparison

| Aspect | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Message size | Small | Small | Large |
| Serialization speed | Fast | Fast | Moderate |
| Human readable | No | No | Yes |
| Schema evolution | Excellent | Good | Limited |
| Kafka ecosystem support | Excellent | Good | Good |

See the Schema Formats Guide for a deeper treatment of each format.


Compatibility

Schema Registry enforces compatibility rules when schemas evolve. Compatibility ensures that schema changes do not break existing producers or consumers.

Compatibility Modes

| Mode | Rule | Upgrade Order |
|---|---|---|
| BACKWARD | New schema can read data written with the old schema | Consumers first |
| BACKWARD_TRANSITIVE | New schema can read data from all previous versions | Consumers first |
| FORWARD | Old schema can read data written with the new schema | Producers first |
| FORWARD_TRANSITIVE | All previous schemas can read new data | Producers first |
| FULL | Both backward and forward compatible | Any order |
| FULL_TRANSITIVE | Full compatibility with all versions | Any order |
| NONE | No compatibility checking | Not recommended |
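
As an illustration, under BACKWARD compatibility the User value schema below would be accepted as a new version: the added country field (a hypothetical example) carries a default, so a consumer on the new schema can still read data written before the field existed:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "country", "type": "string", "default": "unknown"}
  ]
}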

Safe Schema Changes

| Change Type | BACKWARD | FORWARD | FULL |
|---|---|---|---|
| Add optional field with default | ✓ | ✓ | ✓ |
| Remove optional field with default | ✓ | ✓ | ✓ |
| Add required field | ✗ | ✓ | ✗ |
| Remove required field | ✓ | ✗ | ✗ |
| Add optional field (Avro union with null) | ✓ | ✓ | ✓ |
| Remove optional field (Avro union with null) | ✓ | ✓ | ✓ |

See the Compatibility Guide for the full rules.


Subjects and Naming

Schemas are organized into subjects. The subject naming strategy determines how schemas are associated with topics.

Naming Strategies

| Strategy | Subject Name | Use Case |
|---|---|---|
| TopicNameStrategy | <topic>-key, <topic>-value | One schema per topic (default) |
| RecordNameStrategy | <record-namespace>.<record-name> | Multiple record types per topic |
| TopicRecordNameStrategy | <topic>-<record-namespace>.<record-name> | Multiple types with topic isolation |

TopicNameStrategy (Default)

Topic: orders
Key subject: orders-key
Value subject: orders-value

Most deployments use TopicNameStrategy—each topic has one schema for keys and one for values.

RecordNameStrategy

Topic: events
Records: com.example.OrderCreated, com.example.OrderShipped
Subjects: com.example.OrderCreated, com.example.OrderShipped

RecordNameStrategy allows multiple record types in a single topic, useful for event sourcing where different event types share a topic.
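
The strategy is configured on the serializer. For example, to switch value subjects to RecordNameStrategy (keys have an analogous key.subject.name.strategy setting):

value.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy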


Schema Evolution

Schema evolution allows schemas to change over time while maintaining compatibility with existing data.

Evolution Best Practices

| Practice | Rationale |
|---|---|
| Use optional fields | Allow safe addition and removal |
| Provide defaults | Enable backward compatibility |
| Never change field types | Type changes break compatibility |
| Never reuse field names | Previous data may have different semantics |
| Add to end of records | Maintains wire compatibility |
| Use aliases for renaming | Provides backward compatibility for renamed fields (see the example below) |
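
For instance, if the name field were renamed to full_name (a hypothetical rename), an Avro alias lets readers of the new schema match data written with the old one:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "full_name", "type": "string", "aliases": ["name"]}
  ]
}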

See the Schema Evolution Guide for more detail.


Configuration

Producer Configuration

# Schema Registry URL
schema.registry.url=http://schema-registry:8081

# Key serializer (for Avro keys)
key.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer

# Value serializer (for Avro values)
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer

# Auto-register schemas (default: true)
auto.register.schemas=true

# Use latest schema version (default: false)
use.latest.version=false
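
Put together in Java, a producer using these settings might look like the following sketch; the topic name, bootstrap servers, and inline schema are placeholder assumptions:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("schema.registry.url", "http://schema-registry:8081");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericData.Record user = new GenericRecordBuilder(schema)
            .set("id", 1L).set("name", "Ada").build();

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // The serializer registers (or looks up) the schema, then prepends
            // the wire-format header before the Avro payload
            producer.send(new ProducerRecord<>("users", "1", user));
        }
    }
}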

Consumer Configuration

# Schema Registry URL
schema.registry.url=http://schema-registry:8081

# Key deserializer
key.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer

# Value deserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer

# Return specific record type (vs GenericRecord)
specific.avro.reader=true
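
A matching consumer sketch, here reading GenericRecord values (specific.avro.reader left unset), with placeholder topic and servers:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AvroConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "user-readers");
        props.put("schema.registry.url", "http://schema-registry:8081");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("users"));
            while (true) {
                ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, GenericRecord> record : records) {
                    // The schema was fetched by its embedded ID during deserialization
                    System.out.println(record.value().get("name"));
                }
            }
        }
    }
}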

Schema Registry Server

Key server configurations:

| Property | Default | Description |
|---|---|---|
| kafkastore.bootstrap.servers | - | Kafka bootstrap servers for the _schemas topic |
| kafkastore.topic | _schemas | Topic for schema storage |
| kafkastore.topic.replication.factor | 3 | Replication factor for the schemas topic |
| schema.compatibility.level | BACKWARD | Default compatibility mode |
| mode.mutability | true | Allow changing the registry mode (e.g., READONLY, IMPORT) |
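
A minimal server properties file reflecting these settings might look like the following sketch (hostnames are placeholders):

# schema-registry.properties (minimal sketch)
listeners=http://0.0.0.0:8081
kafkastore.bootstrap.servers=PLAINTEXT://kafka-1:9092,PLAINTEXT://kafka-2:9092
kafkastore.topic=_schemas
kafkastore.topic.replication.factor=3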

REST API

Schema Registry provides a REST API for schema operations.

Register a Schema

curl -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"}]}"}' \
  http://schema-registry:8081/subjects/users-value/versions

Response:

{"id": 1}

Get Latest Schema

curl http://schema-registry:8081/subjects/users-value/versions/latest

Check Compatibility

curl -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{...}"}' \
  http://schema-registry:8081/compatibility/subjects/users-value/versions/latest
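
Response:

{"is_compatible": true}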

Common Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /subjects | GET | List all subjects |
| /subjects/{subject}/versions | GET | List versions for a subject |
| /subjects/{subject}/versions | POST | Register a new schema |
| /subjects/{subject}/versions/{version} | GET | Get a specific version |
| /schemas/ids/{id} | GET | Get a schema by global ID |
| /config | GET/PUT | Global compatibility config |
| /config/{subject} | GET/PUT | Subject-level compatibility |

High Availability

For production deployments, Schema Registry should be deployed with high availability.

Deployment Recommendations

| Aspect | Recommendation |
|---|---|
| Instance count | Minimum 2, typically 3 for high availability |
| Leader election | Uses the Kafka consumer group protocol |
| Read scaling | Add replicas for read throughput |
| Write scaling | Single primary (writes are not horizontally scalable) |
| Schemas topic | Replication factor ≥ 3, min.insync.replicas = 2 |