The Order I Check When Data Looks Stale

Hangangjari’s backend runs on a K3s cluster on a small server at home. Suppose a user opens the app and the parking information looks stale. Opening the code is the wrong first step. I first need to separate three questions: whether the request reached the server, whether the API is alive, and whether the worker fetched fresh values.

Requests to the user-facing API reach the K3s ingress through the Cloudflare edge and a Cloudflare Tunnel. The origin server is not exposed directly, and the path users take is kept separate from the path I use to operate the system.

A mini PC that runs an API called by an App Store app is a real server. Users do not need to know whether the server is at home or in a cloud VM. If the data is slow or stale, the app becomes hard to trust.

First I check whether user requests reach the API

The first check is the route a user request takes before reaching the API. A blocked route, a single failing feature inside the API, and stale data are different problems.

flowchart LR
  Client["iOS app / widget"] --> Edge["Cloudflare edge"]
  Edge --> Tunnel["Cloudflare Tunnel"]
  Tunnel --> Ingress["K3s ingress"]
  Ingress --> API["FastAPI service"]
  API --> Redis["Redis"]
  API --> PG["Postgres"]
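
The three cases above can be told apart with two probes. This is a minimal sketch, not Hangangjari's actual smoke test: `ProbeResult`, `probe`, `classify`, and the returned labels are hypothetical names, and it assumes the API has a health endpoint that answers 200 when the service is alive.

```python
from dataclasses import dataclass
from typing import Optional
import urllib.error
import urllib.request


@dataclass
class ProbeResult:
    reachable: bool        # did any HTTP response come back at all?
    status: Optional[int]  # HTTP status code, if a response arrived


def probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Send one GET and record whether and how the server answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ProbeResult(True, resp.status)
    except urllib.error.HTTPError as exc:
        # Something answered, just not with a success status.
        return ProbeResult(True, exc.code)
    except OSError:
        # DNS failure, refused connection, or timeout: nothing answered.
        return ProbeResult(False, None)


def classify(health: ProbeResult, feature: ProbeResult) -> str:
    """Map two probe results onto the three problem classes."""
    if not (health.reachable and health.status == 200):
        # The route or the runtime is the first suspect.
        return "route-or-runtime"
    if not feature.reachable or feature.status is None or feature.status >= 500:
        # The API is alive, but one feature inside it is failing.
        return "feature"
    # Both answered, so the remaining question is how fresh the data is.
    return "check-freshness"
```

In use, `probe` would be called once against the health URL and once against a single feature URL, and both results passed to `classify`. A non-200 answer on the health probe may come from the edge rather than from the API itself, which is why that branch points at the route and runtime instead of the code.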

The brand site, support page, and privacy page are separated from the API. Static web pages and app APIs fail in different ways. Even under the same app name, separate runtimes make problems easier to find.

flowchart TB
  Site["hangangjari.app<br/>brand / support / privacy"] --> Pages["Static site"]
  APIName["API domain"] --> Edge["Cloudflare edge"]
  Edge --> Tunnel["Cloudflare Tunnel"]
  Tunnel --> Runtime["K3s cluster"]

The tunnel routes traffic through the edge network without exposing the origin directly. That makes it possible to separate the user entry path from the path used to maintain the server.

This separation also helps during incidents. If the user path and the operator control plane are mixed together, the recovery order becomes blurry.

I needed a check order, not a component inventory

Inside K3s there are the API, workers, stateful stores, ingress, and metrics/logging components. It is more useful to know which part to check first when stale values appear than to memorize every name.

Component	Role
FastAPI API	App API read by the iOS app and widgets
Parking worker	Collects parking status
Outing worker	Collects events, notices, facilities, and realtime context
Forecast worker	Builds forecasts and warms home summary
Push worker	Builds candidates, checks send rules, and dispatches the outbox
Postgres/PostGIS	Reference data, history, and audit records
Redis	Cache, latest status, and supporting indexes
K3s ingress	User API routing
Prometheus/Grafana/logs	Metrics, logs, and dashboards

flowchart TB
  Desired["Server state store<br/>desired state"] --> GitOps["GitOps sync"]
  GitOps --> API["API deployment"]
  GitOps --> Workers["Worker deployment"]
  GitOps --> State["Postgres / Redis"]
  GitOps --> Ingress["Ingress"]
  API --> Signals["Metrics / logs"]
  Workers --> Signals
  State --> Signals

A mini PC does not reduce what needs to be checked. If users depend on an API, it needs a deployment path, backups, metrics/logs, and a recovery path before it can be treated as a real server.

Dashboards should reduce cause candidates

A dashboard is not decoration. It should reduce where I need to look when a user reports a problem.

flowchart TD
  Checks["Things to check"] --> APIQ["API<br/>status / latency / error rate"]
  Checks --> WorkerQ["Workers<br/>last success / rows / freshness"]
  Checks --> DataQ["Data<br/>cache hit/miss / forecast runs / outbox backlog"]
  Checks --> InfraQ["K3s<br/>ingress / GitOps state / backup state"]
  APIQ --> Triage["See which part is unstable"]
  WorkerQ --> Triage
  DataQ --> Triage
  InfraQ --> Triage

The signals I want to see first in Hangangjari are:

Is the API alive?
Are protected routes being accepted and rejected correctly?
Is latency spiking on a specific endpoint?
Are workers continuously collecting fresh data?
Is the last success time by source too old?
Is Redis cache abnormally empty?
Is the forecast worker creating recent runs?
Is the push outbox building up?
Are DB backup and restore drills healthy?

For metrics collection, the application exposes internal signals and the collector pulls them periodically. Hangangjari’s API and workers were adjusted to emit the signals above.
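
As a rough illustration of one such signal, the sketch below renders "last success age by source" in the Prometheus text exposition format and lists the sources that are too old. It uses only the standard library. The metric name `worker_last_success_age_seconds`, the source names, and the thresholds are assumptions for the example, not the names Hangangjari uses.

```python
import time
from typing import Dict, List, Mapping, Optional


def render_freshness_metrics(
    last_success: Mapping[str, float], now: Optional[float] = None
) -> str:
    """Render one gauge line per source, as text a collector can scrape."""
    now = time.time() if now is None else now
    lines = [
        "# HELP worker_last_success_age_seconds Seconds since the last successful run.",
        "# TYPE worker_last_success_age_seconds gauge",
    ]
    for source in sorted(last_success):
        age = max(0.0, now - last_success[source])
        lines.append(
            f'worker_last_success_age_seconds{{source="{source}"}} {age:.0f}'
        )
    return "\n".join(lines) + "\n"


def stale_sources(
    last_success: Mapping[str, float],
    max_age: Mapping[str, float],
    now: Optional[float] = None,
) -> List[str]:
    """Return the sources whose last success is older than their threshold."""
    now = time.time() if now is None else now
    return sorted(
        source
        for source, ts in last_success.items()
        # A source with no configured threshold is never flagged here.
        if now - ts > max_age.get(source, float("inf"))
    )


# Hypothetical timestamps (Unix seconds) and thresholds.
seen: Dict[str, float] = {"parking": 880.0, "forecast": 400.0}
limits: Dict[str, float] = {"parking": 300.0, "forecast": 300.0}
print(render_freshness_metrics(seen, now=1000.0))
print(stale_sources(seen, limits, now=1000.0))
```

Keeping a threshold per source matters because a parking feed and a forecast run are not expected to refresh at the same rate.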

When a report comes in, I check state before code

When a problem appears, I do not immediately suspect the code. First I check where the failure happened.

flowchart TD
  Alert["User report or smoke failure"] --> Public{"User API healthy?"}
  Public -->|No| Runtime{"Ingress and K3s healthy?"}
  Runtime -->|Yes| Edge["Check edge and tunnel"]
  Runtime -->|No| Pods["Check pods and GitOps state"]
  Public -->|Yes| Feature{"Only one API feature failing?"}
  Feature -->|Yes| Fresh{"Freshness degraded?"}
  Fresh -->|Yes| Worker["Check workers and ingestion runs"]
  Fresh -->|No| Cache["Check cache/display response and DB query"]
  Feature -->|No| Client["Check client cache or widget snapshot"]

With this order, I can separate an API outage, a source failure, stale cache, and a client looking at an old snapshot. Running on a small server did not make the user-facing failures smaller.
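
The decision order in the flowchart can also be written as a small function, which makes each branch easy to test. This is a sketch that mirrors the diagram. The function name `triage` and its boolean inputs are hypothetical, and in practice each input would come from a dashboard or a smoke check rather than being passed by hand.

```python
def triage(
    api_healthy: bool,
    runtime_healthy: bool,
    single_feature_failing: bool,
    freshness_degraded: bool,
) -> str:
    """Return the next thing to check, following the flowchart's order."""
    if not api_healthy:
        # The user API is down: decide between the outer path and the cluster.
        if runtime_healthy:
            return "check edge and tunnel"
        return "check pods and GitOps state"
    if not single_feature_failing:
        # The API looks fine, so the stale view may live on the device.
        return "check client cache or widget snapshot"
    if freshness_degraded:
        # The feature is failing because its source data stopped updating.
        return "check workers and ingestion runs"
    # Data is fresh upstream, so the problem is in how it is read or served.
    return "check cache/display response and DB query"


# Example: the API answers, one feature is off, and its data is old.
print(triage(True, True, True, True))
```

Inputs that a branch does not reach are ignored, which matches the diagram: when the user API is down, freshness is not consulted at all.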

I separated the data that must be recovered

Backups do not end at “a file exists.” A backup is only a backup if it can be restored.
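
One way to make that rule checkable is to treat a backup as usable only when both a recent backup and a recent successful restore drill exist. The sketch below assumes hypothetical names (`BackupState`, `backup_is_usable`) and hypothetical limits of one day for backups and thirty days for drills.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class BackupState:
    last_backup_at: Optional[datetime]
    last_restore_drill_at: Optional[datetime]
    last_restore_drill_ok: bool


def backup_is_usable(
    state: BackupState,
    now: datetime,
    max_backup_age: timedelta = timedelta(days=1),   # assumed limit
    max_drill_age: timedelta = timedelta(days=30),   # assumed limit
) -> bool:
    """A backup counts only if it is recent and a restore was proven recently."""
    if state.last_backup_at is None or now - state.last_backup_at > max_backup_age:
        return False
    if state.last_restore_drill_at is None or not state.last_restore_drill_ok:
        # A file exists, but nobody has shown that it can be restored.
        return False
    return now - state.last_restore_drill_at <= max_drill_age


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
state = BackupState(
    last_backup_at=now - timedelta(hours=6),
    last_restore_drill_at=None,
    last_restore_drill_ok=False,
)
print(backup_is_usable(state, now))  # a fresh file with no drill is not enough
```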

In Hangangjari, DB backup state and restore viability are part of the check surface. Much public data can be collected again. But user settings, push subscriptions, app events, and forecast/backtest history are hard to recover if they disappear.

So I need to separate data that can be rebuilt from data that must be preserved. An empty Redis cache and damaged Postgres reference data are not the same incident. Restore drills are how that difference becomes concrete.
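
The split between rebuildable and must-preserve data can be written down as a small lookup, so an incident starts with the right recovery path. This is only a sketch: the dataset keys are hypothetical labels for the categories named above, not real table or key names, and unknown datasets default to the cautious side.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Recovery(Enum):
    REBUILD = "rebuild"  # can be collected or recomputed again
    RESTORE = "restore"  # has to come back from a backup


# Hypothetical labels for the categories described in the text.
DATASETS: Dict[str, Recovery] = {
    "redis_cache": Recovery.REBUILD,
    "public_source_data": Recovery.REBUILD,
    "user_settings": Recovery.RESTORE,
    "push_subscriptions": Recovery.RESTORE,
    "app_events": Recovery.RESTORE,
    "forecast_backtest_history": Recovery.RESTORE,
}


def needs_backup(damaged: Iterable[str]) -> List[str]:
    """Return the damaged datasets that cannot simply be rebuilt."""
    return sorted(
        name
        for name in damaged
        # Anything not classified yet is treated as must-preserve.
        if DATASETS.get(name, Recovery.RESTORE) is Recovery.RESTORE
    )


print(needs_backup(["redis_cache"]))                    # nothing to restore
print(needs_backup(["redis_cache", "user_settings"]))   # one dataset to restore
```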

I looked at the shape of failure, not server size

What I want this post to leave behind is a check order, not a list of runtime components. When users see stale data, I need to be able to check the request path, API, worker, cache, and client snapshot one at a time, in order.

The biggest change in my thinking while organizing Hangangjari’s server side was that server size and operational responsibility are not proportional. Once users connect to it, even a small server has to explain slowness, failure, and recovery.