Healthcheck Was Normal, but Notifications Were Not Safe Yet

After Hangangjari launched, two push notifications went out right after I thought an incident had ended.

The Hangangjari API was responding again. Healthcheck had returned to normal. K3s pods were up again, and workers were running. Normally, after seeing that much, it is tempting to say recovery is complete.

But the notifications created right after recovery were not safe. The parking worker directly compared the last snapshot before the incident with the first snapshot after recovery, and interpreted the difference as a new change. As a result, two parking-threshold notifications were sent to real users.

The push outbox itself was not the thing that failed. APNs delivery, user-setting matching, and suppression rules each behaved as expected at their own stage. The problem was earlier.

Could the previous snapshot really be treated as the “immediate previous state”?

Events remained after the incident

The first thing to become unstable was the operating node.

Hangangjari’s backend runs on K3s on a small server at home. The API, workers, Postgres, and Redis all run on the same operating node. When the operating disk nearly filled up, K3s reported DiskPressure.

This was not an incident caused by a service-code deployment. Backup data from outside the service filled the same filesystem, and the effect reached the environment running the operating pods. In Kubernetes, DiskPressure is not just a storage-space issue. If pods are evicted or cannot start, the API, DB, Redis, and workers can all become unstable together.

The first recovery steps were relatively clear: free disk space, restore K3s state, and check the healthcheck. Afterward, I added operating-disk free space and DiskPressure to the monitoring alerts and changed the operating disk to keep only recent backups.
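A free-space alert of this kind can be sketched in a few lines. This is a minimal sketch rather than the monitoring that actually runs on the node: the function names and the 20% threshold are assumptions chosen for illustration, and the only library call used is the standard library's shutil.disk_usage.

```python
import shutil

# Hypothetical threshold: alert while there is still room to act,
# before the node itself starts reporting DiskPressure.
MIN_FREE_RATIO = 0.20


def free_ratio(path: str) -> float:
    """Return the fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total


def needs_disk_alert(ratio: float, min_free_ratio: float = MIN_FREE_RATIO) -> bool:
    """True when free space has fallen below the alert threshold."""
    return ratio < min_free_ratio


if __name__ == "__main__":
    ratio = free_ratio("/")
    print(f"free={ratio:.1%} alert={needs_disk_alert(ratio)}")
```

The threshold should sit above whatever eviction threshold the node uses, so the alert arrives before pods are evicted.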

But recovery did not end when the worker restarted.

While the worker was stopped, the last snapshot still remained in Postgres. When the worker ran again, a new snapshot arrived too. Each was a valid parking state. So the existing logic compared them.

That comparison produced the two notifications.

sequenceDiagram
  autonumber
  participant Worker as Parking worker
  participant DB as Postgres
  participant Fact as Fact writer
  participant Push as Push pipeline
  participant User as User

  Worker->>DB: Store pre-incident snapshot (AVAILABLE 169)
  Note over Worker,DB: Parking worker stopped during K3s DiskPressure
  Note over Worker,DB: Parking changes in this gap were not collected
  Worker->>DB: Store post-recovery snapshot (FULL 0)
  Worker->>DB: Fetch previous snapshot for comparison
  DB-->>Worker: Return pre-incident snapshot
  Worker->>Fact: Compare previous and current values
  Note right of Fact: Age of previous value was not checked
  Fact->>Push: Create parking_space_threshold fact
  Push->>User: Send 2 push notifications

In the normal collection flow, this structure is correct. If remaining spaces were 40 a moment ago and 20 now, the user may want to know about that change.

But this comparison was not “a moment ago.” It was two values separated by a long incident gap. The system trusted an old value as the immediate previous value and described the first post-recovery state as a change that had just happened.

Healthcheck does not guarantee event quality

A normal healthcheck means the system can receive requests. It says roughly that an external request reached the API process and the process responded. DB and Redis readiness have to be checked separately through readiness probes. Whether a worker is safely creating fresh events is yet another question.

That signal is necessary, but it was not enough for this problem.

When a user opens the app, the latest parking state after recovery should be shown. Cache should also be updated. Showing the post-recovery value is more honest than showing the pre-incident value.

Push is different.

Push reacts to changes, not the current state itself. A change comes from comparing two points in time. If those two points are too far apart, even a fresh current value cannot be described as “just changed.”

So recovery needed a different decision for each step.

Step	Post-recovery choice
Store parking state	Continue
Update Redis cache	Continue
App and widget response	Use latest state
Create push fact	Stop if previous snapshot is old
Enqueue outbox item	Stop if the fact is already expired

The same snapshot can be valid for read responses and invalid for push change detection. That difference had to exist in code.

Old previous values are not compared

I added a freshness guard before comparing parking snapshots.

Hangangjari parking status is collected at a short interval. I already had a stale window comfortably longer than the collection interval, and I used that same value to detect recovery gaps.

If the difference between observed_at on the previous snapshot and current snapshot exceeds the stale window, the current snapshot is stored but no push fact is created.

current.observed_at - previous.observed_at > stale window

When this condition triggers, the system records previous_snapshot_stale and stops.

Tests pin down the boundary. A difference within the stale window is treated as normal collection delay and can create a fact. A difference beyond the window is treated as a recovery gap and does not. The incident pattern, a direct comparison between an old snapshot and the post-recovery snapshot, now stops at this guard.
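The guard and its boundary can be sketched as follows. This is an illustration under stated assumptions, not the worker's actual code: the Snapshot type, the decide_fact function, the 10-minute window, and every reason string except previous_snapshot_stale are hypothetical, and the real threshold comparison is reduced to a simple status change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed value; the text only says the window is longer than the collection interval.
STALE_WINDOW = timedelta(minutes=10)


@dataclass(frozen=True)
class Snapshot:
    lot_id: str
    status: str
    available: int
    observed_at: datetime


def previous_is_stale(previous: Snapshot, current: Snapshot,
                      stale_window: timedelta = STALE_WINDOW) -> bool:
    """True when the gap between the two observations exceeds the stale window."""
    return current.observed_at - previous.observed_at > stale_window


def decide_fact(previous: Optional[Snapshot], current: Snapshot) -> str:
    """Return 'create_fact' or the reason no fact is created.

    The current snapshot is stored either way; this only decides about push.
    """
    if previous is None:
        return "no_previous_snapshot"      # hypothetical reason name
    if previous_is_stale(previous, current):
        return "previous_snapshot_stale"   # reason name used in the text
    if previous.status == current.status:
        return "no_change"                 # hypothetical reason name
    return "create_fact"


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary timestamp
    before = Snapshot("lot-a", "AVAILABLE", 169, t0)
    after_gap = Snapshot("lot-a", "FULL", 0, t0 + timedelta(hours=3))
    print(decide_fact(before, after_gap))
```

A gap exactly equal to the window still counts as normal delay here, because the condition is a strict greater-than.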

The important point is that the data is not thrown away.

The current post-recovery parking state remains in DB. Cache is updated. The app and widgets read the latest value. Only when that value meets an old previous value does the system prevent it from becoming a push fact.

This distinction mattered during recovery. If the current state is hidden, users see older information. But if the system sends push by comparing against an old previous value, it pretends to know a change it did not observe.

I added one more guard before the outbox

Blocking only the fact-creation stage was not enough.

Push is not sent immediately. The system creates a fact, checks user settings, puts it in the outbox, and a dispatcher sends it to APNs. Time passes between those stages. If workers are backed up or processing a backlog after restart, an event that was valid when the fact was created may already be old before entering the outbox.

So parking facts now use the same stale window for expires_at. The candidate builder checks expiration before creating an outbox entry.

flowchart TB
  Current["Collect new snapshot"] --> Store["Store in DB / update Redis cache"]
  Current --> Age{"Does previous snapshot exceed<br/>the stale window?"}
  Age -->|Yes| StopA["Stop fact creation"]
  Age -->|No| Fact["Create push fact"]
  Fact --> Expired{"Has fact expired?"}
  Expired -->|Yes| StopB["Stop before enqueueing outbox"]
  Expired -->|No| Outbox["Enqueue outbox item"]

The first guard prevents comparing current values with old previous values. The second guard prevents already-late facts from entering the send queue. They look similar, but they block different failure points.
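The second guard can be sketched the same way. The PushFact type and both function names are hypothetical, and deriving expires_at as the observation time plus the stale window is an assumption; the text only says the fact uses the same stale window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PushFact:
    kind: str
    observed_at: datetime
    expires_at: datetime


def build_fact(kind: str, observed_at: datetime, stale_window: timedelta) -> PushFact:
    """Create a fact that expires one stale window after it was observed (assumed rule)."""
    return PushFact(kind=kind, observed_at=observed_at,
                    expires_at=observed_at + stale_window)


def should_enqueue(fact: PushFact, now: datetime) -> bool:
    """Only facts that have not yet expired may become outbox items."""
    return now < fact.expires_at


if __name__ == "__main__":
    window = timedelta(minutes=10)  # assumed value
    observed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary timestamp
    fact = build_fact("parking_space_threshold", observed, window)
    print(should_enqueue(fact, observed + timedelta(minutes=3)))
    print(should_enqueue(fact, observed + timedelta(minutes=30)))
```

Checking expiration against the clock at enqueue time, rather than at fact creation, is what lets this guard catch facts that aged while the worker was processing a backlog.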

Suppression reasons were also separated. The system distinguishes stopping fact creation because the previous snapshot was old from stopping a fact before the outbox because it had already expired.

That reason code matters in operations. If push count drops, the cause is unclear from the count alone. I need to distinguish old previous values being cut off, expired facts being dropped before outbox, and user settings suppressing delivery.
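One way to make that distinction visible is to count suppressions per reason. The sketch below is illustrative: only previous_snapshot_stale is a name taken from this post, and the other two reason names and the count_by_reason helper are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class SuppressionReason(str, Enum):
    PREVIOUS_SNAPSHOT_STALE = "previous_snapshot_stale"  # name used in this post
    FACT_EXPIRED = "fact_expired_before_outbox"          # hypothetical name
    USER_SETTING = "user_setting_suppressed"             # hypothetical name


def count_by_reason(reasons: Iterable[SuppressionReason]) -> Dict[str, int]:
    """Count suppressions per reason, including reasons that never occurred."""
    counts = Counter(reason.value for reason in reasons)
    return {reason.value: counts.get(reason.value, 0) for reason in SuppressionReason}


if __name__ == "__main__":
    observed = [
        SuppressionReason.PREVIOUS_SNAPSHOT_STALE,
        SuppressionReason.PREVIOUS_SNAPSHOT_STALE,
        SuppressionReason.USER_SETTING,
    ]
    for name, count in count_by_reason(observed).items():
        print(f"{name}: {count}")
```

Reporting zero counts explicitly makes it easier to tell "this guard never fired" from "this guard is not being recorded".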

Some notifications were intentionally given up

This fix is not a choice to send every possible change as a notification. It is the opposite.

During the incident, a parking lot may really have changed from available to full. If it is still full after recovery, it is tempting to notify. But the system does not know when that change happened. It cannot look only at the last pre-incident value and first post-recovery value and say, “it just became full.”

So I gave up some notifications.

The app screen shows the latest state. Users can see the refresh time and current state and decide. Push is different. It pulls a user’s attention outside the app and can affect movement decisions. It is better not to send a notification that pretends to know an unknown change.

This fix was not “send fewer notifications.”

It was a decision to send only notifications for changes the system actually observed after recovery.

Recovery checks must include push

Before this, recovery checks were mostly runtime-state checks.

Does the API respond?
Are pods up again?
Is the worker running again?
Are DB and Redis connected?
Does deployed state match the desired revision?

Now I also check one more thing.

Are newly created post-recovery events safe to send to users?

Healthcheck alone cannot answer that. It requires looking at the first collection time after worker restart, source freshness, suppression reasons, and outbox growth together. After this fix, I also checked that no additional parking-threshold outbox entries of the same type were created.
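Part of that check can be written down as a small function. This is a rough sketch of the idea, not a tool that exists in the project: the RecoverySignals fields and the wording of each finding are assumptions, and suppression reasons are left out to keep it short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class RecoverySignals:
    worker_restarted_at: datetime
    first_collection_at: Optional[datetime]  # None if nothing has been collected yet
    newest_observed_at: Optional[datetime]   # newest source timestamp seen
    new_threshold_outbox_items: int          # parking-threshold outbox entries since restart


def push_safety_findings(signals: RecoverySignals, now: datetime,
                         stale_window: timedelta) -> List[str]:
    """Return the open questions to resolve before calling recovery complete."""
    findings: List[str] = []
    if (signals.first_collection_at is None
            or signals.first_collection_at < signals.worker_restarted_at):
        findings.append("no collection since worker restart")
    if (signals.newest_observed_at is None
            or now - signals.newest_observed_at > stale_window):
        findings.append("source data is not fresh")
    if signals.new_threshold_outbox_items > 0:
        findings.append("parking-threshold outbox entries created since restart; review them")
    return findings


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary timestamp
    signals = RecoverySignals(
        worker_restarted_at=now - timedelta(minutes=5),
        first_collection_at=now - timedelta(minutes=4),
        newest_observed_at=now - timedelta(minutes=1),
        new_threshold_outbox_items=0,
    )
    print(push_safety_findings(signals, now, timedelta(minutes=10)))
```

An empty list does not prove the events are safe. It only means none of these particular signals looks wrong.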

K3s DiskPressure also stopped being only an infrastructure metric. Disk pressure can lead to worker downtime, worker downtime can lead to stale snapshot comparisons, and those comparisons can become user push notifications.

Recovery was not complete when processes came back. It was not complete when healthcheck returned normal responses. It had to include whether the data and events produced after recovery were safe to send to users.

After this incident, Hangangjari’s recovery standard changed a little.

I do not only check whether the server is alive. I also check what the revived server starts saying to users again.