Whenever I start planning a new message-driven architecture that spans several software teams, I quickly realize how many moving parts there are.

To keep my head clear, I’ve put together a personal checklist of things I always try to cover.

It’s not meant to be a formal rulebook, but more of a practical reminder of what usually makes the difference between smooth collaboration and endless headaches.

Figure 1. Message Driven Architecture Article Cover Image

1. Establish Shared Foundations


Before diving into design or tooling, it’s crucial to get everyone aligned on the basics.

Different teams often use the same words to mean slightly different things — and that’s a recipe for confusion when you start exchanging messages at scale.

  • Define a common vocabulary for terms like message, event, and command.

    Make sure everyone agrees on what these terms mean. For example, is an "event" something that already happened (immutable fact), or can it also represent an intention? These nuances matter a lot once teams start consuming each other’s data.

    Common Definitions
    Message

    A generic unit of data sent from one system to another. Doesn’t imply intent (like a command) or temporal context (like an event).

    Event

    A fact about something that already happened, immutable and time-oriented. Past tense: OrderPlaced, UserRegistered. Cannot be changed - it’s a record of history. Usually has one producer, many consumers (pub/sub).

    Command

    A request to perform an action in the future. Imperative: PlaceOrder, DeactivateUser. Has a clear target / intended receiver. The recipient may accept or reject the command.

    Producer / Publisher

    The system or service that creates and sends a message into a channel, topic, or queue. Example: An Order Service publishing an OrderPlaced event.

    Consumer / Subscriber

    The system or service that receives and processes a message. Example: A Billing Service subscribing to OrderPlaced to create an invoice.

    Topic / Subject / Channel

    A named logical destination for messages, often in pub/sub systems. Producers send messages to a topic; consumers subscribe to it.

    Queue

    A messaging destination where messages are delivered to one consumer (work distribution). Often ensures FIFO order and durability.

    Event Stream

    A continuous, append-only sequence of events (like Kafka topics). Consumers can replay from any point in time.

    Schema

    The structured definition of a message payload (e.g., Avro, Protobuf, JSON Schema). Defines field names, types, and constraints.

    Schema Registry

    A centralized repository for managing message schemas and ensuring compatibility across versions.

    Dead Letter Queue (DLQ)

    A special queue for messages that could not be processed successfully, even after retries. Prevents data loss while isolating problematic messages.

    Idempotency

    The property that handling the same message multiple times has the same effect as handling it once. Critical for at-least-once delivery.

    Correlation ID

    A unique identifier carried in each message to trace a request across systems and tie related messages together.

  • Decide on a canonical message model[1] (shared schema format vs. bounded contexts with translation).

    Some organizations enforce a single shared schema format (e.g., Avro[2], Protobuf[3], JSON Schema) to ease tooling and validation. Others allow each bounded context to have its own format and use translation/adapters at the boundaries. Whichever path you choose, make it explicit and consistent.

  • Assign contract ownership for each message type (usually producer team).

    Messages are APIs. Someone must own and maintain them. Typically, the team that produces the message owns its schema and meaning. This avoids the "no man’s land" problem where a schema evolves without accountability.
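To make the vocabulary above concrete, here is a minimal sketch (in Python, with invented class names — your own stack and naming will differ) of a message envelope that distinguishes events from commands and always carries a correlation ID:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Message:
    """Generic envelope: implies neither intent (command) nor temporal context (event)."""
    payload: dict[str, Any]
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass(frozen=True)
class Event(Message):
    """An immutable fact, named in past tense, e.g. OrderPlaced."""
    name: str = ""
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass(frozen=True)
class Command(Message):
    """A request to act in the future, imperative, e.g. PlaceOrder."""
    name: str = ""
    target: str = ""  # the intended receiver; it may accept or reject the command


event = Event(payload={"orderId": "o-1"}, name="OrderPlaced")
```

Note how the envelope bakes the glossary into code: the event is immutable (frozen), timestamped in UTC, and traceable via its correlation ID.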

2. Model Domains Before Topics


It’s tempting to jump straight into designing message topics and queues — but that often leads to a system shaped by technical convenience rather than real business needs.

Instead, start with the domain: understand the boundaries of responsibility, the language each team uses, and the natural events that occur in the business.

  • Apply Domain-Driven Design (DDD) to identify bounded contexts.

    Each bounded context defines a clear area of responsibility with its own models and language. Messaging across these boundaries should feel natural, not forced. For example, the "Orders" context publishes events about orders, but it doesn’t speak in terms of "Invoices" or "Shipments" — those belong to other contexts.

  • Create a context map with team ownership, message flows, and business meaning.

    A visual map helps everyone see where responsibilities lie and how information flows. This avoids accidental overlap between teams and clarifies where translation between models is required.

    More details about the first steps in creating context maps (and using the C4 model) are available here: Domain-driven Design: A Practitioner’s Guide - Context Map
  • Align message topics with business events, not just technical needs.

    Don’t design topics around CRUD operations or database tables. Instead, focus on meaningful business events (e.g., OrderPlaced, PaymentFailed). These stand the test of time and are easier for both humans and systems to reason about.

3. Define Message Contracts as APIs


Once teams start exchanging messages, those messages effectively become APIs.

If they change unexpectedly, they can break consumers in subtle and costly ways.

Treating message contracts with the same care as service APIs helps keep the system stable and predictable.

  • Treat each message schema like a public API.

    Changes to a schema should go through the same rigor as changes to a service API: reviews, documentation, and clear communication. Think of downstream teams as your "API consumers."

  • Implement versioning rules (additive changes for backward compatibility).

    Breaking changes (like removing fields or changing their meaning) can wreak havoc. Adopt a clear strategy: only allow additive changes for backward compatibility, and if a breaking change is unavoidable, introduce a new version instead of altering the old one.

  • Use schema registry for storing & validating definitions.

    A central registry ensures that all producers and consumers rely on the same schema definitions. It also enables automated compatibility checks during CI/CD pipelines, preventing surprises in production[4].

  • Document semantic meaning for fields.

    It’s not enough to know a field is an integer or string — teams need to understand what it represents. Is amount a gross or net value? Is status an enum with well-defined states? Ambiguity in semantics is one of the fastest ways to create miscommunication between systems.

  • Be Precise About Field Semantics

    A message schema isn’t just a collection of fields — every field needs a precise definition to avoid misinterpretation across teams. Ambiguity is one of the most common sources of integration bugs in message-driven architectures.

    Common Pitfalls
    • Date and Time Fields

      • Define whether ranges are open (start ≤ t < end) or closed (start ≤ t ≤ end).

      • Always specify the time zone (UTC is strongly recommended).

      • Clarify if fields represent an instant in time (e.g., 2023-11-26T10:00Z) or a business date (e.g., 2023-11-26).

    • Null or Missing Values

      • Does null mean "no value known"?

      • Does it mean "inherit from parent object"?

      • Or does it mean "the result is empty / not applicable"? Document this explicitly for each nullable field.

    • Enumerations and Codes

      • Define allowed values and what each one means.

      • Avoid overloading "magic values" like 0 or empty strings for special cases.

    • Identifiers

      • Clarify uniqueness and scope: Is an orderId unique globally, per tenant, or per system?

      • Define lifecycle: Can IDs be reused? Do they persist forever?

    • Optional vs. Required

      • Mark fields explicitly as mandatory or optional.

      • For optional ones, specify default behavior when omitted.

  • Consider using AsyncAPI

    Just like OpenAPI has become the standard for describing REST APIs, AsyncAPI is emerging as the standard for event-driven and message-driven systems.

    It allows you to:

    • Define message channels, topics, and queues in a machine-readable way.

    • Document message payloads and schemas (Avro, JSON Schema, Protobuf, etc.).

    • Capture metadata like delivery guarantees, correlation IDs, and bindings to specific brokers (Kafka, RabbitMQ, MQTT, etc.).

    • Generate documentation, code stubs, and tests automatically from the spec.

More information can be found on the AsyncAPI Website

AsyncAPI Example

The following AsyncAPI specification allows teams to auto-generate documentation, mock servers or client libraries.

asyncapi-example.yaml
asyncapi: '2.6.0'
info:
  title: hasCode.com Order Service Events
  version: '1.0.0'
  description: |
    Events emitted by the Order Service.
servers:
  production:
    url: kafka.hascode.com:9092
    protocol: kafka
channels:
  order.created:
    description: Event published when a new order is placed.
    publish:
      message:
        name: OrderCreated
        title: Order Created Event
        summary: Notifies subscribers that a new order was placed.
        contentType: application/json
        payload:
          type: object
          required:
            - orderId
            - customerId
            - createdAt
          properties:
            orderId:
              type: string
              description: Unique identifier of the order.
            customerId:
              type: string
              description: Customer who placed the order.
            createdAt:
              type: string
              format: date-time
              description: UTC timestamp when the order was created.
            totalAmount:
              type: number
              format: float
              description: Total order amount in EUR.

4. Align on Delivery & Reliability Guarantees


Different messages have different criticality, and mismatched expectations can cause serious failures.

One team might assume that every message is delivered exactly once, while another designs for at-least-once with idempotency.

These gaps are dangerous — align early.

  • For each message type, define delivery semantics (at-most-once, at-least-once, exactly-once).

    Choose the right guarantee for the business need. Not every use case requires exactly-once, but where it does, teams need to design carefully.

    Delivery Semantics
    At-Most-Once

    A message is delivered zero or one time. If delivery fails, the message is lost and will not be retried.

    At-Least-Once

    A message is delivered one or more times. Duplicates are possible, so consumers must be idempotent (handle the same message multiple times without side effects).

    Exactly-Once

    A message is delivered once and only once. In practice, this is the hardest to achieve at scale, since it requires careful coordination between producers, brokers, and consumers. Many platforms simulate this via idempotency + deduplication strategies.

    "Exactly-once" often comes at the cost of complexity and performance. Many real-world systems accept "at-least-once" delivery with idempotent processing as the pragmatic default.
  • Clarify ordering requirements (per-key, global, none).

    If consumers depend on message order, make sure the producers and infrastructure support it. Otherwise, design messages to be order-independent.

    Ordering Requirements

    In distributed systems, message ordering is not guaranteed by default. If consumers implicitly rely on order, but producers or infrastructure don’t provide it, you’ll get subtle and hard-to-debug issues. That’s why it’s essential to clarify ordering expectations explicitly.

    • Global Ordering All messages are delivered in the exact order they were produced, system-wide. Pros: Simplicity for consumers, deterministic behavior. Cons: Low throughput and high contention — doesn’t scale well in distributed environments. Use Case: Rare, usually only for small systems or logs where strict global order is essential.

    • Per-Key Ordering Messages for the same key (e.g., customer ID, account ID, order ID) are guaranteed to arrive in order, but different keys may be interleaved. Pros: Scales well while still preserving logical correctness. Cons: Requires clear definition of partitioning key. Use Case: Order processing, account balance updates, workflows tied to a specific entity.

    • No Ordering Messages may arrive in any order, and consumers must be resilient to out-of-order delivery. Pros: Maximum throughput and flexibility. Cons: Consumers must handle reordering or accept non-deterministic behavior. Use Case: Analytics, metrics aggregation, or scenarios where sequence doesn’t matter.

    If you do need ordering, be explicit about the scope (global vs. per-key). For scalability, per-key ordering is the pragmatic default. When possible, design messages to be order-independent, reducing system coupling and complexity.
  • Agree on retention & replay policies.

    Retention and Replay Policies

    Not all events are created equal. Some are short-lived signals (like "user is typing") that lose meaning after a few seconds. Others are critical facts (like PaymentReceived) that might need to be replayed months later to rebuild state, audit behavior, or feed analytics systems. Without clear policies, teams can make different assumptions — leading to data loss, bloated storage, or mismatched expectations.

    • Short-Lived / Ephemeral Messages Designed to expire quickly because their value diminishes over time. Example: "User is online," "Typing indicator," or telemetry pings. Policy: Retention of seconds to minutes; no replay needed.

    • Business-Critical Events Represent important, immutable facts about the business. These often need to be stored and replayable. Example: OrderPlaced, PaymentReceived, InvoiceGenerated. Policy: Retention of months or indefinitely, depending on compliance and recovery needs.

    • Replayable Streams Some topics act as an event log, enabling consumers to replay events from the beginning (or any point in time) to rebuild state or catch up. Example: Kafka topics with compacted or long-lived retention. Policy: Define whether replay is always possible, and how far back (e.g., 7 days, 6 months, forever).

    • Retention vs. Storage Costs Long retention improves resiliency and analytics options, but increases storage and operational overhead. Strike a balance based on business requirements.

    • Explicit Agreements Document retention and replay expectations per topic/queue. A producer’s "fire and forget" event may be a consumer’s "must retain for 7 years" fact — resolving these conflicts early avoids pain later.

    Make retention & replay policies part of your contract design, not just infrastructure defaults. Treat them as shared agreements between teams, not hidden assumptions.

    Some events need to be replayed for analytics or rebuilding state. Others can expire quickly. Explicit policies avoid mismatched assumptions.

  • Define idempotency expectations clearly.

    If a consumer might receive the same message twice, make sure the team knows and builds in safeguards. Silent assumptions about uniqueness are a recipe for bugs.

    Idempotency Expectations

    In message-driven architectures, duplicates happen. Brokers may redeliver messages, retries may be triggered by timeouts, and network partitions can confuse producers and consumers. To avoid nasty surprises, every team must agree on how idempotency is handled.

    • What is Idempotency? An operation is idempotent if running it multiple times has the same effect as running it once. Example: Setting order.status = SHIPPED is idempotent. Counter-example: Incrementing order.shipmentCount++ is not idempotent — duplicates change the outcome.

    • Where to Enforce Idempotency?

      • Producer-side: Assign stable, unique IDs to each message (e.g., orderId, transactionId, or a dedicated messageId).

      • Consumer-side: Deduplicate based on these IDs (e.g., "process this paymentId only once").

      • Storage layer: Use database constraints (unique keys) or upserts to ensure safe writes.

    • Idempotency Keys A unique identifier that ties retries and duplicates back to the same logical operation. This must be included in the message contract. Example: A paymentId ensures you don’t charge a credit card twice.

    • Team Alignment Clarify who guarantees idempotency:

      • Is it the producer’s responsibility to never send duplicates? (Rare in practice.)

      • Or is it the consumer’s job to handle them safely? (Most common approach.)

    • Testing Idempotency Contract tests should explicitly validate that consumers handle duplicate messages without side effects.

    In practice, assume "at-least-once" delivery semantics and design all consumers to be idempotent by default. It’s cheaper than fixing double-payments in production.
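A consumer-side deduplication sketch, assuming at-least-once delivery and a stable messageId in every message. In production the seen-set would be a database unique constraint or a key-value store with a TTL, not an in-memory set:

```python
class IdempotentConsumer:
    """Deduplicates on a stable message ID so redeliveries cause no extra side effects."""

    def __init__(self):
        self._seen: set[str] = set()  # in production: DB unique key or Redis SETNX
        self.shipment_count = 0

    def handle(self, message: dict) -> bool:
        """Process a message once; return False when a duplicate is detected."""
        message_id = message["messageId"]
        if message_id in self._seen:
            return False              # duplicate delivery: skip all side effects
        self._seen.add(message_id)
        self.shipment_count += 1      # the non-idempotent side effect we protect
        return True


consumer = IdempotentConsumer()
consumer.handle({"messageId": "m-1"})
consumer.handle({"messageId": "m-1"})  # broker redelivers the same message
```

Even though the message arrives twice, the counter only moves once — exactly the property the checklist item above asks every team to guarantee.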

5. Decouple via Event Streams


Message-driven architectures shine when producers and consumers remain loosely coupled.

If the system starts to feel like tightly bound request/response interactions, it’s time to pause and rethink.

  • Prefer pub/sub over point-to-point integration.

    A message should have multiple potential consumers without the producer even knowing who they are.

  • Ensure consumers handle irrelevant or extra messages gracefully[4].

    Consumers should ignore what they don’t understand, so producers can evolve without breaking everything downstream.

    Or, to quote the robustness principle (Postel’s law) that Martin Fowler’s Tolerant Reader pattern builds on:

    Be conservative in what you do, be liberal in what you accept from others
  • Avoid overly synchronous patterns across systems.

    A messaging architecture that relies on real-time responses quickly turns into a distributed monolith. Design for asynchronous processing whenever possible.
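The Tolerant Reader idea can be illustrated with a small sketch (field names here are illustrative): the consumer reads only the fields it needs and silently ignores anything a newer producer version might add:

```python
def handle_order_placed(payload: dict) -> str:
    """Tolerant reader: consume only the fields this service needs and
    ignore everything else, so the producer can evolve its schema freely."""
    order_id = payload["orderId"]              # required by this consumer
    currency = payload.get("currency", "EUR")  # optional, with a safe default
    return f"invoice for {order_id} in {currency}"


# A newer producer version adds fields this consumer has never seen:
result = handle_order_placed({"orderId": "o-1", "giftWrap": True, "channel": "mobile"})
```

Because the handler never iterates over or validates unknown keys, additive schema changes upstream cannot break it.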

6. Enforce Contracts with Automation


Even with good intentions, humans make mistakes.

Automation is the safety net that ensures contract violations are caught before they hit production.

  • Validate schemas in CI/CD against registry.

    Every build and deployment should check schema compatibility automatically[4].

  • Implement consumer-driven contract tests.

    Let consumers define what they expect from a message. Producers can run those tests before releasing changes, preventing breaking updates.

  • Detect and prevent breaking changes automatically.

    Don’t rely on manual reviews alone — enforce rules so that incompatible changes fail fast.
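As an illustration of automated breaking-change detection, here is a deliberately simplified compatibility check — real registries such as Confluent Schema Registry or AsyncAPI tooling do this far more thoroughly. It flags removed fields and newly required fields between two schema versions:

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Flag schema changes that break existing consumers or in-flight messages:
    removed fields, and optional fields that suddenly become required."""
    problems = [f"removed field: {name}"
                for name in old["fields"] if name not in new["fields"]]
    problems += [f"newly required field: {name}"
                 for name in new["required"] if name not in old["required"]]
    return problems


v1 = {"fields": ["orderId", "createdAt"], "required": ["orderId"]}
v2 = {"fields": ["orderId", "createdAt", "totalAmount"], "required": ["orderId"]}

additive = breaking_changes(v1, v2)  # adding an optional field is compatible
breaking = breaking_changes(v2, v1)  # dropping totalAmount is not
```

Wired into CI as a gate, a check like this makes the "additive changes only" rule from section 3 enforceable rather than aspirational.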

7. Make Observability First-Class


Messages flow across team and system boundaries, which makes tracing problems harder.

Without visibility, debugging becomes guesswork. Observability should be designed in, not bolted on later.

  • Centralize logs, metrics, and traces of message flows.

    Teams need a shared view to understand where messages go, how long they take, and where they fail.

    OpenTelemetry and Jaeger

    Message-driven systems are harder to debug than synchronous ones — a single message may flow through 10+ services before surfacing as a business outcome. This makes distributed tracing essential.

    • OpenTelemetry (OTel)

      • An open standard for collecting traces, metrics, and logs.

      • Supported by most programming languages and messaging frameworks.

      • Lets you propagate trace IDs or correlation IDs in message headers.

    More details can be found in the OpenTelemetry Documentation

    • Jaeger

      • A popular open-source tracing system.

      • Works seamlessly with OpenTelemetry.

      • Lets you visualize an entire message flow across services, including timing, retries, and failures.

    More information about Jaeger can be found on the Jaeger Website

    Adopt OpenTelemetry for instrumentation and use a backend like Jaeger (or alternatives like Zipkin, Honeycomb, Grafana Tempo) to actually see how messages travel through your system.
  • Include correlation IDs in all messages.

    A simple ID that follows the message across systems can be a lifesaver when trying to reconstruct the path of an event.

  • Define and monitor dead-letter queue strategy.

    Messages that can’t be processed should never just disappear. A clear process for handling them ensures no data is silently lost.
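Correlation-ID propagation is simple enough to sketch in a few lines. The header name correlationId is an assumption here — use whatever convention your brokers and tracing setup agree on:

```python
import uuid


def with_correlation(headers: dict) -> dict:
    """Propagate an incoming correlation ID, or start a new one at the system edge.
    The same ID should be attached to every outgoing message and every log line."""
    return {**headers,
            "correlationId": headers.get("correlationId", str(uuid.uuid4()))}


outgoing = with_correlation({"correlationId": "req-42"})  # reused across hops
fresh = with_correlation({})                              # a new trace begins here
```

The rule is always the same: reuse the ID if it exists, mint one only at the edge. That keeps a single ID threaded through every service a message touches.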

8. Govern Lightly but Consistently


Strong governance keeps things aligned, but too much process slows teams down.

The trick is to strike the right balance: enough structure to stay coherent, enough freedom to move fast.

  • Form a cross-team architecture guild or working group.

    Give teams a space to align on decisions, share experiences, and raise concerns without imposing rigid top-down rules.

  • Maintain design guidelines and checklists for messaging.

    Document the shared rules of the road so new teams and developers can get up to speed quickly.

  • Review actual message flows regularly to detect drift.

    Architecture on paper often drifts from reality. Regular reviews help catch inconsistencies early and keep the system healthy.