April 20, 2026

Every Kafka engineer has seen a monitoring dashboard go green and felt comfortable. High replication factor, multi-zone cluster, healthy brokers, no lag. Everything's fine. Then something happens that wasn't in the runbook, and the dashboard stays green while the data underneath it disappears.

This post walks through one of the scenarios that causes that outcome, and what it takes to actually recover. If you've ever wondered whether your replicated Kafka setup would survive an operational mistake, the answer depends on details most teams haven't examined.

The setup

Start with a realistic configuration. A Kafka cluster on a managed provider, running a replication factor of three across multiple availability zones. Your producers are writing events, your consumers are reading them, and the cluster has been stable for months. Monitoring shows zero under-replicated partitions, normal consumer lag, and no broker alerts. By every metric your operations team tracks, this cluster is safe.

Now introduce the kind of incident that happens in production environments every week. It doesn't need to be dramatic. It can be a fat-finger incident where an engineer runs a topic deletion command against the wrong cluster. It can be an infrastructure-as-code pipeline that picks up a tag change on a topic definition and decides the right course of action is to recreate the topic. The definition in Git looks correct, the name matches, the configuration matches. The pipeline applies the change, Kafka receives the instruction, and the topic gets recreated.

A recreated topic is, from the broker's perspective, a brand new topic. Same name, same partition count, same replication factor, zero data. The records that existed on the previous version of that topic are gone. And because replicas are operationally coupled to the topic they mirror, the recreation propagates to every replica in real time. Your multi-zone setup isn't a safety net here. It's a delivery mechanism for the mistake.

Why this is worse than it looks

When a topic gets recreated, your monitoring doesn't necessarily alert. Under-replicated partitions stay at zero because the topic is healthy from the broker's perspective. Consumer lag metrics reset because there's nothing to consume. The cluster health check passes. The dashboard stays green.
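One signal that does survive a recreation: since Kafka 2.8 (KIP-516), every topic carries a unique topic ID that changes when the topic is deleted and recreated, even under the same name. A minimal sketch of a detector, assuming you periodically snapshot topic name-to-ID mappings with your own admin tooling (kafka-topics.sh --describe prints the Topic ID); the snapshot values below are made up for illustration:

```python
def detect_recreations(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return topic names whose topic ID changed since the baseline snapshot.

    Both arguments map topic name -> topic ID (the UUID Kafka assigns per
    KIP-516). A changed ID under the same name means the topic was deleted
    and recreated -- the data is gone even though the name, partition
    count, and health metrics all look unchanged.
    """
    return [
        name
        for name, topic_id in current.items()
        if name in baseline and baseline[name] != topic_id
    ]


# Snapshots as they might be captured before and after an incident
# (IDs are illustrative, not real UUIDs).
before = {"orders": "b9f3-aaaa", "payments": "11aa-bbbb"}
after = {"orders": "b9f3-aaaa", "payments": "7c2d-cccc"}  # payments recreated

print(detect_recreations(before, after))  # prints ['payments']
```

A check like this catches the recreation itself, rather than waiting for a downstream report to come out wrong.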

Meanwhile, your downstream consumers start seeing an empty topic. Depending on how your applications handle unexpected state, this might surface as missing records, stalled processing, or data integrity errors that show up hours or days later when someone notices a report is wrong. Tracing the root cause back to a topic recreation is difficult because the recreation itself wasn't flagged as an incident. 

If you try to recover at this point, you'll discover that Kafka doesn't give you much to work with. There's no undo button. The deleted data doesn't sit in a recycle bin for 30 days. The replicas are already empty because the delete propagated. Your options are to restore from a backup, if you have one, or to reconstruct the data from upstream sources, if you can.

What the recovery actually looks like

This is where having a real backup matters, and where the difference between replication and backup becomes operational rather than theoretical.

A Kannika backup runs in the background while your cluster is healthy. It reads records as they're produced and writes them to storage that is operationally decoupled from your Kafka cluster. Operationally decoupled means the backup storage isn't reachable through the same credentials, network, or control plane as your live cluster. When the topic gets recreated on production, the backup is untouched because it isn't part of the production system.

Recovery starts with a restore definition. That's a YAML file describing what to restore, where to restore it to, and how. You apply the definition, and Kannika kicks off the restore job. From the Kannika UI, you can watch the restore in progress. The job pulls records from the backup and writes them back into your Kafka cluster in the original partition order, with the original timestamps preserved.
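For a sense of the shape of such a definition, a hypothetical sketch follows. The field names here are illustrative only, not Kannika's actual schema; consult the Kannika documentation for the real format:

```yaml
# Hypothetical sketch -- field names are illustrative,
# not Kannika's actual schema.
kind: Restore
metadata:
  name: restore-orders-after-incident
spec:
  source:
    backup: nightly-orders-backup    # which backup to read from
  target:
    cluster: production              # where to write the records
    topic: orders                    # restore into the original topic
  options:
    preserveTimestamps: true         # keep original record timestamps
    preservePartitioning: true       # keep original partition assignment
```

The point is the operational model: the restore is data you apply, not a script you write under pressure.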

When the restore finishes, the topic is back in the state it was in before the incident. Consumers that had committed offsets before the deletion can pick up where they left off. Consumers that needed a full replay have access to the full record history. From the application side, the incident becomes a time window during which the topic was empty, rather than permanent data loss.

The operational details that matter

A few things distinguish a real backup from a copy of your data:

The backup needs to be operationally decoupled. If your backup storage is reachable from the same credentials or automation that touches your production cluster, it isn't decoupled. An incident on production can propagate to the backup. Kannika stores its data outside the Kafka control plane specifically to avoid this.

The restore needs to be a defined operation, not a custom script. You don't want to be writing code during an incident. With Kannika, the restore is a declarative YAML definition that you apply, not a workflow you build. The UI gives you visibility into the progress, but the actual recovery logic lives in Kannika.

The restored data needs to match the original state. That includes partition assignment, offsets, and timestamps. A restore that puts records back in the wrong partition or renames the topic is not a clean recovery. Kannika preserves the original structure so that consumers and downstream systems see the topic in the state they expect.
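That last property can be verified programmatically after a restore. A minimal sketch, assuming you can dump per-record (partition, timestamp, key) tuples from both the backup and the restored topic with your own tooling; the comparison logic is the only thing shown here:

```python
from typing import NamedTuple


class Record(NamedTuple):
    partition: int
    timestamp: int  # epoch millis, as stored on the Kafka record
    key: bytes


def verify_restore(original: list[Record], restored: list[Record]) -> bool:
    """Check that restored records match the original partition
    assignment, per-partition ordering, and timestamps.

    Records are grouped by partition and compared in append order;
    a clean restore reproduces the exact same sequence in each
    partition.
    """
    def by_partition(records: list[Record]) -> dict[int, list[Record]]:
        grouped: dict[int, list[Record]] = {}
        for r in records:
            grouped.setdefault(r.partition, []).append(r)
        return grouped

    return by_partition(original) == by_partition(restored)


orig = [Record(0, 1700000000000, b"a"), Record(1, 1700000000005, b"b")]
assert verify_restore(orig, list(orig))
# A record landing in the wrong partition fails the check:
assert not verify_restore(orig, [Record(1, 1700000000000, b"a"),
                                 Record(1, 1700000000005, b"b")])
```

A comparison like this is cheap to run against a sample of partitions and gives downstream teams concrete evidence that the recovery was clean.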

Try it yourself

The Kannika sandbox lets you explore the basics of how a backup and restore work, without needing to set up a Kafka cluster of your own. Spin one up at kannika.io to get a feel for the UI and the restore flow. If you want to walk through your own disaster scenario or set this up against a real environment, get in touch and we can arrange a deeper session.

Bryan De Smaele
Author