Workers and CronJobs Running Outside the API

The Hangangjari backend runs user-facing APIs separately from background workers. They share code, but their jobs are different.

The split exists to shield the API from slow or failing external data. When a user opens the app, the API has to answer immediately. Public-data collection, forecast calculation, and push candidate selection can all be slow, can all fail, and each needs its own interval.

At first, adding a scheduler to the single backend process also looked viable. Small apps often start that way. Once the service is in operation, however, it quickly becomes clear that user requests and collection jobs have different needs. The API must answer short requests immediately. Workers must retry failures and leave traces of what happened.

Hangangjari split workers not to increase scale, but to contain failure. If parking status checks briefly fail, the API should still return the last checked value. If the push outbox backs up, home-summary should not slow down with it.

Work prepared outside the API

There are currently four always-running worker types. They prepare values before any user waits on an API call.

flowchart TB
  Scheduler["Per-worker APScheduler"] --> Parking["worker-parking\nStatus polling"]
  Scheduler --> Outing["worker-outing\nEvents / notices / city context / translation"]
  Scheduler --> Forecast["worker-forecast\nForecast generation / first-screen warming / metric rollup"]
  Scheduler --> Push["worker-push\nCandidates / preference indexes / outbox / maintenance"]

  Parking --> PG["Postgres"]
  Parking --> Redis["Redis"]
  Outing --> PG
  Forecast --> PG
  Forecast --> Redis
  Push --> PG
  Push --> Redis
  Push --> APNs["APNs"]

Occasional reference-data alignment and validation jobs are treated separately as CronJobs. If constant polling and reference-data sync live in the same picture, it becomes unclear which jobs run frequently and which run rarely.

flowchart LR
  Cron["K8s CronJob"] --> Master["Parking reference-data sync"]
  Cron --> Facility["Facility sync"]
  Cron --> Transit["Transit dataset sync"]
  Cron --> Backtest["Forecast backtest"]
  Master --> PG
  Facility --> PG
  Transit --> PG
  Backtest --> PG

One detail can be confusing. The name worker-parking may suggest that parking reference-data (master) sync and status polling both run in the always-on scheduler. In the current code, the always-on scheduler registers only status polling. Parking master sync is split out into its own job entrypoint with a K8s CronJob schedule.

Intervals and responsibilities are separated

In code, work is split like this.

Worker	Main jobs
worker-parking	status polling
worker-outing	event sync, notice sync, realtime context sync, translation sync
worker-forecast	forecast generation, home-summary precompute, metric rollup
worker-push	candidate build, preference index rebuild, outbox drain, maintenance
K8s CronJob	reference-data sync, facility sync, dataset sync, backtest

This keeps one failure from pulling down other screens. If parking collection briefly fails, the API can return the last checked state. If push delivery is delayed, the parking screen’s read path stays separate.
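
The "last checked state" behavior can be sketched as follows. This is a minimal illustration, not the project's code: `ParkingStatus`, `LatestStatusStore`, `poll_once`, and `get_parking_status` are hypothetical names, and an in-memory dict stands in for the real Redis and Postgres storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ParkingStatus:
    lot_id: str
    available: int
    checked_at: datetime  # when the worker last confirmed this value


class LatestStatusStore:
    """In-memory stand-in for the stored latest status (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Dict[str, ParkingStatus] = {}

    def write(self, status: ParkingStatus) -> None:
        self._latest[status.lot_id] = status

    def read(self, lot_id: str) -> Optional[ParkingStatus]:
        return self._latest.get(lot_id)


def poll_once(store: LatestStatusStore, fetch: Callable[[], ParkingStatus]) -> bool:
    """Worker side: a failed fetch leaves the stored value untouched."""
    try:
        status = fetch()
    except Exception:
        return False  # a real worker would also record the failed run
    store.write(status)
    return True


def get_parking_status(store: LatestStatusStore, lot_id: str) -> Optional[ParkingStatus]:
    """API side: reads stored state only and never calls the external source."""
    return store.read(lot_id)
```

Because the API function has no code path to the external source, a source timeout can only make `checked_at` older. It cannot make the request slower.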

Names are also the first clue during incident response. If parking collection is unstable, check freshness first. If notification jobs are backing up, check queues and delivery records. When every job lives in one worker, it takes longer to know where to start.

Duplicate work is blocked

A simplified parking status check looks like this.

sequenceDiagram
  autonumber
  participant Scheduler as Scheduler
  participant Job as Status polling job
  participant Lock as DB lock
  participant Source as Parking API
  participant PG as Postgres
  participant Redis as Status cache
  participant Fact as Notification fact

  Scheduler->>Job: Run at scheduled interval
  Job->>Lock: Try duplicate-run lock
  alt Lock acquired
    Job->>Source: Request source parking status
    Source-->>Job: Return source rows
    Job->>PG: Store status snapshot and collection record
    Job->>Redis: Refresh latest parking status cache
    Job->>Fact: Record notification candidate facts if changed
  else Duplicate execution
    Job-->>Scheduler: Skip this run
  end

The key is locking and run records. If the same job runs twice at once, it can produce duplicate rows, duplicate notifications, and cache races. I did not rely on scheduler settings alone.

Storage-level locks prevent the same job from being processed concurrently.
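
A minimal sketch of a storage-level duplicate-run lock, under stated assumptions: the source does not say which locking mechanism Hangangjari uses, so this example uses a lock table with a primary key, and SQLite stands in for Postgres so the snippet is self-contained. The table name `job_locks` and the functions `job_lock` and `run_status_polling` are hypothetical. On Postgres, the same idea is often expressed with `pg_try_advisory_lock`.

```python
import sqlite3
from contextlib import contextmanager
from typing import Callable, Iterator


def init_lock_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_locks ("
        " job_name TEXT PRIMARY KEY,"
        " owner TEXT NOT NULL)"
    )
    conn.commit()


@contextmanager
def job_lock(conn: sqlite3.Connection, job_name: str, owner: str) -> Iterator[bool]:
    """Yield True if this run owns the lock, False if another run holds it."""
    try:
        # The primary key makes the insert fail when a lock row already exists.
        conn.execute(
            "INSERT INTO job_locks (job_name, owner) VALUES (?, ?)",
            (job_name, owner),
        )
        conn.commit()
        acquired = True
    except sqlite3.IntegrityError:
        conn.rollback()
        acquired = False
    try:
        yield acquired
    finally:
        if acquired:
            # Release only the row this run inserted.
            conn.execute(
                "DELETE FROM job_locks WHERE job_name = ? AND owner = ?",
                (job_name, owner),
            )
            conn.commit()


def run_status_polling(conn: sqlite3.Connection, owner: str, poll: Callable[[], None]) -> str:
    with job_lock(conn, "parking-status-polling", owner) as acquired:
        if not acquired:
            return "skipped"  # another run is in progress, so skip this one
        poll()
        return "ran"
```

A lock row like this survives a crashed process, so a real version would also need an expiry or heartbeat. That is left out here to keep the sketch short.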

Each data source has a written refresh plan

Hangangjari has a schedule catalog for data sources. It records the source, owning job category, refresh interval, and stale threshold.

This file serves two purposes: it is an operator-facing document, and it defines the freshness the app can honestly claim to users. As more sources are added, the answer to “how often does this data refresh?” has to be tracked in code and documentation together.

In a public-data app, how often data is checked is directly tied to what freshness the product can claim. A value checked every 30 seconds and a value checked once a day should not both be described as the same kind of “latest information.” With a catalog, API freshness display and operational alerts refer to the same numbers.
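
One way such a catalog could look in code is sketched below. The four fields come from the description above. The class name `SourceSchedule`, the `CATALOG` entries, and every number in them are illustrative assumptions, not the project's real values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class SourceSchedule:
    source: str
    job_category: str            # e.g. "scheduler" or "cronjob"
    refresh_interval: timedelta  # how often the job is expected to run
    stale_threshold: timedelta   # age after which the value counts as stale


# Hypothetical entries; the real sources and numbers live in the project's catalog.
CATALOG: Dict[str, SourceSchedule] = {
    "parking_status": SourceSchedule(
        "parking_status", "scheduler", timedelta(seconds=30), timedelta(minutes=5)
    ),
    "facility": SourceSchedule(
        "facility", "cronjob", timedelta(days=1), timedelta(days=3)
    ),
}


def freshness_label(source: str, last_success_at: Optional[datetime], now: datetime) -> str:
    """Classify a source using the same threshold the catalog documents."""
    entry = CATALOG[source]
    if last_success_at is None:
        return "unknown"
    return "stale" if now - last_success_at > entry.stale_threshold else "fresh"
```

If both the API response and the alerting rule call a function like `freshness_label`, the user-facing label and the operator alert cannot drift apart, because they read the same threshold.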

flowchart LR
  Catalog["Collection schedule catalog"] --> SchedulerJobs["Always-on scheduler jobs"]
  Catalog --> CronJobs["Operational CronJobs"]
  SchedulerJobs --> Ingestion["Ingestion use cases"]
  CronJobs --> Ingestion
  Ingestion --> Runs["ingestion_runs\nsuccess / row_count / schema_hash"]
  Runs --> Health["Source status and freshness"]
  Health --> API["Outing/home-summary freshness"]
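
The run record in the diagram can be sketched as follows. The `success`, `row_count`, and `schema_hash` fields come from the diagram. The `error` field, the function names, and the choice to hash sorted field names with SHA-256 are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


@dataclass(frozen=True)
class IngestionRun:
    source: str
    success: bool
    row_count: int
    schema_hash: Optional[str]
    error: Optional[str] = None  # assumption: not listed in the original diagram


def schema_hash(rows: List[Row]) -> str:
    """Hash the sorted set of field names so a changed payload shape is visible."""
    fields = sorted({key for row in rows for key in row})
    return hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()


def run_ingestion(source: str, fetch: Callable[[], List[Row]]) -> IngestionRun:
    """Run one fetch and always return a record, whether it succeeded or failed."""
    try:
        rows = fetch()
    except Exception as exc:
        return IngestionRun(source, False, 0, None, error=repr(exc))
    return IngestionRun(source, True, len(rows), schema_hash(rows))
```

A hash that changes between two successful runs is a cheap signal that the upstream source added, removed, or renamed a field, even when the row count looks normal.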

Rules for building jobs

Do not put ingestion in the API startup hook.
Make jobs idempotent where possible.
Record success, failure, row count, and schema hash for each source.
Prevent duplicate execution with scheduler settings and storage locks.
Track push delivery through outbox claims and delivery attempts.
Surface source failures as stale freshness for that area, not as a whole-app outage.
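
The outbox rule above can be sketched as follows. This is an assumption-heavy illustration: the table names `push_outbox` and `delivery_attempts`, their columns, and the status values are hypothetical, and SQLite stands in for Postgres. On Postgres, a claim step like this is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED`.

```python
import sqlite3
from typing import Optional, Tuple


def init_outbox(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE push_outbox (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            claimed_by TEXT
        );
        CREATE TABLE delivery_attempts (
            outbox_id INTEGER NOT NULL,
            worker TEXT NOT NULL,
            ok INTEGER NOT NULL
        );
        """
    )


def claim_next(conn: sqlite3.Connection, worker: str) -> Optional[Tuple[int, str]]:
    """Claim the oldest pending message, or return None if nothing was claimed."""
    row = conn.execute(
        "SELECT id, payload FROM push_outbox WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status check in WHERE means only one worker can win the claim.
    cur = conn.execute(
        "UPDATE push_outbox SET status = 'claimed', claimed_by = ?"
        " WHERE id = ? AND status = 'pending'",
        (worker, row[0]),
    )
    conn.commit()
    return (row[0], row[1]) if cur.rowcount == 1 else None


def record_attempt(conn: sqlite3.Connection, outbox_id: int, worker: str, ok: bool) -> None:
    """Log every attempt; a failed one returns the message to 'pending'."""
    conn.execute(
        "INSERT INTO delivery_attempts (outbox_id, worker, ok) VALUES (?, ?, ?)",
        (outbox_id, worker, int(ok)),
    )
    conn.execute(
        "UPDATE push_outbox SET status = ?, claimed_by = ? WHERE id = ?",
        ("sent" if ok else "pending", worker if ok else None, outbox_id),
    )
    conn.commit()
```

Keeping attempts in their own table means a message that failed twice and then succeeded still shows all three attempts, which is the record to check when notification jobs back up.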

What remained after separating API and workers

Even in a small service, splitting workers early makes operation easier. In a public-data app especially, external data sources can become slow or fail at any time.

The API returns the last checked value quickly. Workers keep checking for new values. Keeping the two in separate execution lifecycles proved more stable.