Kafka Replication is not a Backup
November 13, 2025

In the world of Event-Driven Architectures (EDA), Apache Kafka is the gold standard for managing real-time data streams. But as this data becomes more critical, protecting it becomes paramount.

A dangerous misconception we constantly see is confusing replication with backup. This confusion comes in two forms:

  1. Believing partition replication (for high availability) is a backup.
  2. Believing geo-replication (like MirrorMaker, for disaster recovery) is a backup.

Neither is true.

Partition replication, geo-replication, and backup all involve copying data, but they solve three different problems. Understanding the difference is the key to building an architecture that is not just highly available, but truly resilient and recoverable.

Let's break it down.

Level 1: Redundancy (For High Availability)

This is the most basic form of data protection in Kafka, built directly into its architecture. When you create a topic, you set a replication factor (e.g., RF=3).

This means that for every partition, Kafka maintains three copies, placing them on different servers (brokers) within the same cluster. One copy is the "leader", which handles writes and, by default, reads; the others are "followers", which continuously copy the leader.

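  • What it does: If a single broker crashes, an in-sync follower is automatically promoted to leader, usually within seconds.
  • What it protects against: Broker failure. It keeps your cluster online through the loss of a broker, with at most a brief pause while clients switch to the new leader.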
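The failover described above can be sketched with a toy model. This is not Kafka's actual election code: the `Partition` class and `broker_failed` method are hypothetical names used only for illustration, and the model simply assumes the first surviving in-sync replica becomes the new leader.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition: a leader plus its in-sync replicas (ISR)."""
    leader: int                                   # broker id of the current leader
    isr: list[int] = field(default_factory=list)  # in-sync broker ids, leader included

    def broker_failed(self, broker_id: int) -> None:
        """Drop the failed broker from the ISR and elect a new leader if needed."""
        if broker_id in self.isr:
            self.isr.remove(broker_id)
        if broker_id == self.leader:
            if not self.isr:
                raise RuntimeError("no in-sync replica left: partition is offline")
            self.leader = self.isr[0]  # promote the first surviving in-sync follower


# RF=3: replicas on brokers 1, 2 and 3, with broker 1 as leader.
partition = Partition(leader=1, isr=[1, 2, 3])
partition.broker_failed(1)
print(partition.leader, partition.isr)  # 2 [2, 3]
```

Note that the partition stays available only while at least one in-sync replica survives, which is why a replication factor of 1 provides no redundancy at all.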

This is for High Availability (HA). But what if your entire data center goes down?

Level 2: Geo-Replication (For Disaster Recovery)

This is where tools like Kafka MirrorMaker come in. Geo-replication continuously copies data from your primary Kafka cluster to a second cluster, often in a different data center or cloud region.

  • What it does: It creates a "warm" or "hot" standby cluster that can take over if your entire primary cluster fails.
  • What it protects against: Full cluster or data center failure.

This is for Disaster Recovery (DR). It gives you a way to recover your service in a different location.
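One detail worth keeping in mind is that MirrorMaker copies data asynchronously, so the standby can trail the primary by some number of records. The sketch below is a toy model of that behavior, not MirrorMaker's implementation; the `mirror` function, the `batch_size` parameter, and the record names are all assumptions made for illustration.

```python
def mirror(source: list[str], target: list[str], batch_size: int) -> None:
    """Copy up to batch_size not-yet-mirrored records from source to target."""
    start = len(target)  # in this model the target always holds a prefix of the source
    target.extend(source[start:start + batch_size])


primary = ["order-1", "order-2", "order-3", "order-4", "order-5"]
standby: list[str] = []

mirror(primary, standby, batch_size=3)  # one mirroring pass completes
# The primary site is lost before the next pass runs.
lost = primary[len(standby):]

print(standby)  # ['order-1', 'order-2', 'order-3']
print(lost)     # ['order-4', 'order-5']
```

In this model the records that had not yet been mirrored are the ones at risk during a failover, so replication lag is a number worth monitoring on a DR setup.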

The "Gotcha": The Fatal Flaw of All Replication

Here is the most critical point: All forms of replication faithfully copy your mistakes.

Whether it's intra-cluster partition replication or inter-cluster geo-replication with MirrorMaker, the copy has one job: to mirror its source as faithfully as possible.

  • If a bug in your producer writes corrupted data... it's replicated.
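  • If you accidentally run a command to delete a topic... every replica in that cluster is deleted with it. (Whether a mirror cluster keeps its copy depends on the tool and how it is configured.)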
  • If your data retention policy is 7 days... your data is deleted from all replicas after 7 days.

Replication protects you from hardware or site failure. It offers zero protection from data failure (corruption, accidental deletion, or malicious attacks).
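A toy model makes the point concrete. The functions below are hypothetical stand-ins, not Kafka APIs: `replicate` represents any mechanism that copies a log, and `apply_retention` approximates a time-based retention policy by keeping only the newest records.

```python
def replicate(source_log: list[str]) -> list[str]:
    """A replica's only job: end up with exactly what its source has."""
    return list(source_log)


def apply_retention(log: list[str], keep_last: int) -> list[str]:
    """Stand-in for time-based retention: keep only the newest records."""
    return log[-keep_last:]


leader = ["ok-1", "ok-2"]
leader.append("CORRUPT")                 # a buggy producer writes bad data
follower = replicate(leader)             # the bad record is copied faithfully
print(follower)                          # ['ok-1', 'ok-2', 'CORRUPT']

# The same retention policy applies to every replica, so old data expires everywhere.
leader = apply_retention(leader, keep_last=1)
follower = apply_retention(follower, keep_last=1)
print(leader, follower)                  # ['CORRUPT'] ['CORRUPT']
```

After both steps, every copy agrees with every other copy, and none of them contains the data you actually want back.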

Level 3: Backup (For Data Recovery)

This is the final, crucial piece of the puzzle, and it answers a completely different question. A backup is not about keeping the system online; it's about being able to restore your data to a known good state.

A backup is a separate, versioned, and often external copy of your data that you can use to recover from a scenario where your live data itself is unusable.

  • What it does: It takes a snapshot of your data at a specific moment in time and stores it independently.
  • What it protects against: Data corruption, accidental deletion, bugs, and ransomware.
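The idea of restoring to a known good state can be illustrated with a minimal sketch. It assumes a backup is simply a list of records with timestamps; the `Record` type and `restore_to` function are hypothetical names and do not describe how any particular backup product works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    timestamp: int  # when the record was written, e.g. epoch seconds
    value: str


def restore_to(backup: list[Record], point_in_time: int) -> list[Record]:
    """Return the backed-up records written at or before point_in_time."""
    return [r for r in backup if r.timestamp <= point_in_time]


backup = [
    Record(100, "ok-1"),
    Record(200, "ok-2"),
    Record(300, "CORRUPT"),  # written after a buggy deploy at t=250
]

restored = restore_to(backup, point_in_time=250)
print([r.value for r in restored])  # ['ok-1', 'ok-2']
```

The key property is that the backup lives outside the cluster and keeps history, so you can choose a point before the mistake happened.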

How to Handle Backups in Kafka with Kannika Armory

Kafka does not have a built-in "backup" button. This is a critical gap that teams try to fill with custom scripts or basic S3 connectors, which are often brittle and hard to restore from.

This is exactly why we built Kannika Armory.

Armory is a dedicated, enterprise-grade backup and recovery solution built for Kafka. It's not another replication tool; it's a true backup system.

  • Point-in-Time Recovery: Armory allows you to restore topics to their exact state at a specific moment in time, for example just before a data corruption bug was deployed.
  • Policy-Based Backups: Automatically back up critical topics (and their schemas) to durable, long-term storage, independent of your cluster's retention policies.
  • Simplified Restoration: When a disaster (like data corruption) strikes, Armory gives you a simple process to restore the correct data, slashing your recovery time (RTO) from days to minutes.

Comparing the Levels

🚀 Intra-Cluster Redundancy (Partition Replication)

Intra-Cluster Redundancy is your first line of defense, focused purely on High Availability. Using Kafka's built-in partition replication, it protects you from a single broker or server failing. It's designed to answer the question, "Can my service survive a hardware crash right now?"

🌍 Geo-Replication (e.g., MirrorMaker)

Geo-Replication (using a tool like MirrorMaker) takes this a step further, focusing on Disaster Recovery. Through inter-cluster replication, it maintains a standby cluster in a separate data center to protect against a full site failure. It answers, "If this whole data center goes down, do I have a live copy somewhere else?"

🛡️ Backup (Point-in-Time Recovery)

Backup (using a solution like Kannika Armory) serves a completely different and critical purpose: Data Restoration. Its mechanism is an external, point-in-time copy of your data. This is the only method that protects you from data corruption, accidental deletion, or bugs. It allows you to answer the most important question: "Can I get my data back from last week, before the bad bug was deployed?"

How to Choose Your Strategy (And Why Backup is Non-Negotiable)

Every serious Kafka deployment starts with Level 1 (Redundancy). That is non-negotiable for high availability.

The next choice is strategic. For many companies, running a full parallel cluster with Level 2 (Geo-Replication) is financially impractical. If your business does not have a strict regulatory requirement for near-instant cross-data-center failover, this expense may not be justified.

However, every business needs to recover from data corruption, bugs, or accidental deletion.

This is why, for many organizations, the most practical and cost-effective strategy is to combine Level 1 (Redundancy) with Level 3 (Backup). This approach provides a complete, robust solution:

  • Level 1 protects you from hardware failure.
  • Level 3 protects you from data loss.

This combination delivers true resilience without the 2x cost of a full geo-replicated cluster.

At Kannika, we build systems that are not just fault-tolerant but truly recoverable. Don't mistake a robust replication strategy for a complete data recovery plan. Your business data is too valuable for that.

Author: Wout Florin