source catalog and ingestion run

I separated collection plans from execution results to explain freshness and failures by source

In a public-data app, collection work does not end at “call the API.” I need to know which source should run when, what to show when it fails, whether row counts suddenly changed, and when the last success happened.

Hangangjari handles this by separating the collection plan from collection records.

They look similar, but their roles differ. The collection plan, or source catalog, answers “what should run and when?” The collection record, or ingestion run, answers “how did the thing that just ran finish?” Both are needed to explain which source became unstable and when.

I wrote what to collect and when in one place

The collection plan contains the list of ingestion jobs. Instead of scattering intervals and roles across the codebase, it lets the system read source ID, job ID, owner, role, interval, and stale threshold from one place.

flowchart LR
  Catalog["Collection plan"] --> Scheduler["Worker scheduler"]
  Catalog --> Cron["Kubernetes CronJob"]
  Scheduler --> Jobs["Parking / outing / forecast jobs"]
  Cron --> Batch["Reference-data and backtest jobs"]
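
As a rough illustration, the collection plan can be thought of as a list of plain records that both runners read. This is a minimal sketch, not Hangangjari's actual code: the field names (`interval_seconds`, `stale_after_seconds`, `runner`), the sample IDs, and the numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical field names; the article lists only the concepts.
    source_id: str
    job_id: str
    owner: str
    role: str
    interval_seconds: int
    stale_after_seconds: int
    runner: str  # assumed values: "scheduler" or "cronjob"


# Sample entries with made-up intervals and thresholds.
CATALOG = [
    CatalogEntry("parking-status", "poll-parking-status", "ingest",
                 "realtime", 300, 900, "scheduler"),
    CatalogEntry("parking-master", "sync-parking-master", "ingest",
                 "reference", 86400, 172800, "cronjob"),
]


def entries_for(runner: str) -> list[CatalogEntry]:
    """Return the catalog entries a given runner is responsible for."""
    return [entry for entry in CATALOG if entry.runner == runner]
```

With a shape like this, the worker scheduler and the CronJob definitions would both read the same list instead of each keeping its own copy of the intervals.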

The list includes sources of these shapes.

| Category | Example |
| --- | --- |
| parking master | Sync parking-lot reference data |
| parking status | Poll realtime parking values |
| outing facility | Collect facility data |
| outing event | Collect event data |
| outing notice | Collect notices and controls |
| realtime context | Collect crowding, weather, and traffic context |
| forecast | Build forecasts and backtests |

I looked first at what each source affects on screen, not where it runs. I needed to distinguish which sources support visible values, which jobs affect freshness, and which jobs can be retried without directly appearing on the user screen.

Push is intentionally excluded here. The push worker is also background work, but this list covers work that reads external sources and changes the freshness of data shown on screen. Notification candidate creation and outbox dispatch belong to delivery, not collection. Putting both in the same list can look convenient on one screen, but it makes incident causes harder to read.

I also recorded how collection finished

An ingestion run records how a collection job finished.

flowchart LR
  Fetch["Fetch source"] --> Parse["Parse"]
  Parse --> Normalize["Normalize"]
  Normalize --> Upsert["Upsert domain rows"]
  Upsert --> Run["Ingestion-run record"]
  Run --> Health["Source health"]
  Health --> API["API freshness/source state"]

When Hangangjari’s outing ingestion finishes, whether it succeeded or failed, it records:

  • Source ID.
  • Start and end time.
  • Success or failure.
  • Number of rows read.
  • Status distribution.
  • Schema hash.
  • Failure error summary.

These values are needed both by operators and by the app screen. They reveal parser failures to operators, and they let the API derive source-check results such as “fresh,” “stale,” and “unavailable.”
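
The recorded fields could be captured with a small wrapper around the collection call. The sketch below is only an assumption about shape: `IngestionRun`, `run_and_record`, and the `collect` callable (which is assumed to return the parsed rows plus a schema hash) are hypothetical names, not the project's real API.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IngestionRun:
    # Hypothetical record mirroring the list above.
    source_id: str
    started_at: datetime
    finished_at: datetime
    succeeded: bool
    rows_read: int = 0
    status_distribution: dict = field(default_factory=dict)
    schema_hash: Optional[str] = None
    error_summary: Optional[str] = None


def run_and_record(source_id, collect):
    """Run collect() and return a record for success and failure alike.

    collect is assumed to return (rows, schema_hash), where rows is a
    list of dicts that may carry a "status" key.
    """
    started = datetime.now(timezone.utc)
    try:
        rows, schema_hash = collect()
    except Exception as exc:
        # A failed run is still recorded, with a short error summary.
        return IngestionRun(source_id, started, datetime.now(timezone.utc),
                            False,
                            error_summary=f"{type(exc).__name__}: {exc}")
    distribution = dict(Counter(row.get("status", "unknown") for row in rows))
    return IngestionRun(source_id, started, datetime.now(timezone.utc),
                        True, len(rows), distribution, schema_hash)
```

The point of the wrapper is that a failure produces a row too. Without that, a broken source simply leaves a gap in the history, and a gap cannot be told apart from a job that never ran.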

These values are hard to add later. Without run records from the beginning, there is no way to answer “was this source also unstable last week?” after an incident. So I stored not only result rows, but also the act of collecting them.

“No data” and “collection failed” are different

The most dangerous simplification in a public-data screen is showing every failure as “none.”

Zero events is different from a failed event source. No facility is different from a facility parser producing empty rows. Stale realtime context is different from a disabled source.

A source catalog and ingestion runs preserve this difference.

| State | Meaning | Signal to users |
| --- | --- | --- |
| fresh | Recent successful data exists | Usable as reference |
| partial | Only some sources are available | Limited reference |
| stale | Last success is old | On-site difference possible |
| unavailable | Collection failed or source is unavailable | Distinct from none |
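
One way to keep “none” and “failed” apart in a response is to never encode a failure as an empty list. The function below is a simplified sketch under assumed names (`event_section`, `source_state`, `items`, `reason`); it ignores the stale and partial states and only shows the empty-versus-unavailable split.

```python
from typing import Optional


def event_section(last_run_succeeded: Optional[bool], rows: list) -> dict:
    """Build a response section that separates 'zero events' from 'failed'.

    last_run_succeeded is None when the source has never been collected.
    """
    if last_run_succeeded is None:
        return {"source_state": "unavailable", "items": None,
                "reason": "not_collected_yet"}
    if not last_run_succeeded:
        # items is None, not [], so the client cannot mistake it for "none".
        return {"source_state": "unavailable", "items": None,
                "reason": "collection_failed"}
    # A successful run with zero rows is a real, trustworthy "no events".
    return {"source_state": "fresh", "items": rows, "reason": None}
```

A client can then render “no events today” only when `items` is an empty list, and show an unavailable notice when `items` is missing.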

I also watch row counts and shape changes

Public-data sources can change quietly. Field names can change, HTML shape can change, or rows for a specific park can disappear.

That is why collection results need schema hash and row count. These values distinguish “there are fewer events today” from “the parser only read half of them.”

Status distribution gives the same kind of signal. If every facility value suddenly becomes unknown, or if crowding distribution is abnormally skewed, the source or parser may be the problem.
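
These three signals can be checked with very little code. The following is a sketch under assumptions: the hash is taken over the sorted set of field names, the 50% drop ratio is an arbitrary example threshold, and `schema_hash` and `shape_warnings` are hypothetical helper names.

```python
import hashlib
import json


def schema_hash(rows):
    """Hash the sorted set of field names seen across all rows."""
    fields = sorted({key for row in rows for key in row})
    return hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()


def shape_warnings(prev_row_count, prev_hash, rows,
                   unknown_label="unknown", drop_ratio=0.5):
    """Compare this run with the previous one and list suspicious changes."""
    warnings = []
    if prev_hash is not None and schema_hash(rows) != prev_hash:
        warnings.append("schema_changed")
    # Assumed rule: fewer than half of the previous rows is suspicious.
    if prev_row_count > 0 and len(rows) < prev_row_count * drop_ratio:
        warnings.append("row_count_dropped")
    if rows and all(row.get("status", unknown_label) == unknown_label
                    for row in rows):
        warnings.append("all_status_unknown")
    return warnings
```

None of these warnings proves the parser is broken. They only mark a run as worth a look before its rows are trusted as “fewer events today.”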

Screen guidance follows ingestion results

Source-check results are not only for operators. They are ingredients for the app and widgets to express “how much can I trust this?”

The Hangangjari API classifies sources as fresh, stale, or unavailable based on the last success time and stale threshold per source. This information explains sources and refresh state on screen, and it becomes warning text and unavailable reasons in widgets.
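
The classification itself can be very small. This sketch assumes that a source with no recorded success counts as unavailable; the function name `classify_source` and its parameters are illustrative, not the real API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_source(last_success_at: Optional[datetime],
                    stale_after: timedelta,
                    now: datetime) -> str:
    """Classify a source from its last success time and stale threshold."""
    if last_success_at is None:
        # Assumption: no successful run on record means unavailable.
        return "unavailable"
    if now - last_success_at > stale_after:
        return "stale"
    return "fresh"


# Usage with fixed times so the result does not depend on the clock.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
state = classify_source(now - timedelta(hours=1), timedelta(minutes=15), now)
```

Passing `now` in as a parameter keeps the function deterministic, which makes it easy to use the same rule in the API, in widgets, and in tests.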

Source-check results must connect to the UI. The stale state on an operations dashboard and the “information may not be current” message in the app are two expressions of the same fact. If that connection breaks, the server knows about a problem while the app pretends not to.

erDiagram
  DATA_SOURCES ||--o{ INGESTION_RUNS : records
  DATA_SOURCES ||--o{ OUTING_SIGNALS : publishes
  DATA_SOURCES ||--o{ OUTING_FACILITIES : supplies
  INGESTION_RUNS }o--|| SOURCE_HEALTH : derives

The ingestion process became part of the screen

The source catalog is the list of what to collect and when. The ingestion run is the record of how collection finished. Row count, status distribution, and schema hash became minimum signals for checking whether a parser read correctly.

If the API does not separate “zero rows” from “collection failed,” users see the same empty screen. Because source-check results feed both freshness presentation and alerts, a public-data app needs to retain records of the ingestion process as much as raw source data.

The source catalog and ingestion run were not tables only for operators. The user-facing “information may not be current” message also comes from the same records. In that sense, the ingestion process became part of the screen.
