source catalog and ingestion run
I separated collection plans from execution results to explain freshness and failures by source
In a public-data app, collection work does not end at “call the API.” I need to know which source should run when, what to show when it fails, whether row counts suddenly changed, and when the last success happened.
Hangangjari handles this by separating the collection plan from collection records.
They look similar, but their roles differ. The collection plan, or source catalog, answers “what should run and when?” The collection record, or ingestion run, answers “how did the thing that just ran finish?” Both are needed to explain which source became unstable and when.
I wrote what to collect and when in one place
The collection plan contains the list of ingestion jobs. Instead of scattering intervals and roles across the codebase, it lets the system read source ID, job ID, owner, role, interval, and stale threshold from one place.
flowchart LR
Catalog["Collection plan"] --> Scheduler["Worker scheduler"]
Catalog --> Cron["Kubernetes CronJob"]
Scheduler --> Jobs["Parking / outing / forecast jobs"]
Cron --> Batch["Reference-data and backtest jobs"]
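A minimal sketch of what such a catalog could look like in code. The field names follow the list above, but the concrete IDs, intervals, thresholds, and the reading of "owner" as "which runner executes the job" are my assumptions for illustration, not Hangangjari's actual values.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CatalogEntry:
    source_id: str
    job_id: str
    owner: str               # assumed meaning: which runner executes the job
    role: str                # what the job feeds on screen
    interval: timedelta      # how often the job should run
    stale_after: timedelta   # how old the last success may be before it is stale


# Hypothetical entries; real IDs, intervals, and thresholds will differ.
CATALOG = [
    CatalogEntry("parking_status", "poll-parking-status", "worker",
                 "realtime parking values",
                 timedelta(minutes=5), timedelta(minutes=15)),
    CatalogEntry("parking_master", "sync-parking-master", "cronjob",
                 "parking-lot reference data",
                 timedelta(days=1), timedelta(days=3)),
]


def jobs_for(owner: str) -> list[CatalogEntry]:
    """Return the catalog entries that a given runner is responsible for."""
    return [entry for entry in CATALOG if entry.owner == owner]
```

With one list like this, the worker scheduler and the CronJob definitions can both be derived from the same entries instead of each hard-coding its own intervals.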
The list includes sources of these shapes.
| Category | Example |
|---|---|
| parking master | Sync parking-lot reference data |
| parking status | Poll realtime parking values |
| outing facility | Collect facility data |
| outing event | Collect event data |
| outing notice | Collect notices and controls |
| realtime context | Collect crowding, weather, and traffic context |
| forecast | Build forecasts and backtests |
I looked first at what each source affects on screen, not where it runs. I needed to distinguish which sources support visible values, which jobs affect freshness, and which jobs can be retried without directly appearing on the user screen.
Push is intentionally excluded here. The push worker is also background work, but this list covers work that reads external sources and changes the freshness of data shown on screen. Notification candidate creation and outbox dispatch are closer to sending. Putting both in the same list can look convenient on one screen, but it makes incident causes harder to read.
I also recorded how collection finished
An ingestion run records how a collection job finished.
flowchart LR
Fetch["Fetch source"] --> Parse["Parse"]
Parse --> Normalize["Normalize"]
Normalize --> Upsert["Upsert domain rows"]
Upsert --> Run["Ingestion-run record"]
Run --> Health["Source health"]
Health --> API["API freshness/source state"]
Each time Hangangjari’s outing ingestion runs, whether it succeeds or fails, it records:
- Source ID.
- Start and end time.
- Success or failure.
- Number of rows read.
- Status distribution.
- Schema hash.
- Failure error summary.
These values are needed both by operators and by the app screen. They reveal parser failures to operators, and they let the API produce source-check results such as “fresh,” “stale,” and “unavailable.”
These values are hard to add later. Without run records from the beginning, there is no way to answer “was this source also unstable last week?” after an incident. So I stored not only result rows, but also the act of collecting them.
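As a rough sketch, the run record and the wrapper that writes it could look like the following. Everything here is hypothetical: the `IngestionRun` shape, the `run_ingestion` helper, and the choice to hash only the sorted field names are assumptions, not the project's actual schema.

```python
import hashlib
import json
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class IngestionRun:
    source_id: str
    started_at: datetime
    finished_at: datetime
    succeeded: bool
    row_count: int = 0
    status_distribution: dict = field(default_factory=dict)
    schema_hash: Optional[str] = None
    error_summary: Optional[str] = None


def schema_hash(rows: list[dict]) -> str:
    """Hash the sorted set of field names seen across rows (values are ignored)."""
    names = sorted({key for row in rows for key in row})
    return hashlib.sha256(json.dumps(names).encode("utf-8")).hexdigest()


def run_ingestion(source_id: str,
                  fetch_rows: Callable[[], list[dict]]) -> IngestionRun:
    """Run one collection and always return a record, even on failure."""
    started = datetime.now(timezone.utc)
    try:
        rows = fetch_rows()
    except Exception as exc:
        # Keep a short summary so the failure stays explainable later.
        return IngestionRun(source_id, started, datetime.now(timezone.utc), False,
                            error_summary=f"{type(exc).__name__}: {exc}"[:200])
    return IngestionRun(
        source_id, started, datetime.now(timezone.utc), True,
        row_count=len(rows),
        status_distribution=dict(
            Counter(row.get("status", "unknown") for row in rows)),
        schema_hash=schema_hash(rows),
    )
```

The important property is that a failed fetch still produces a record. That is what makes "was this source also unstable last week?" answerable afterwards.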
“No data” and “collection failed” are different
The most dangerous simplification in a public-data screen is showing every failure as “none.”
Zero events is different from a failed event source. No facility is different from a facility parser producing empty rows. Stale realtime context is different from a disabled source.
A source catalog and ingestion runs preserve this difference.
| State | Meaning | Signal to users |
|---|---|---|
| fresh | Recent successful data exists | Usable as reference |
| partial | Only some sources are available | Limited reference |
| stale | Last success is old | May differ from on-site conditions |
| unavailable | Collection failed or source is unavailable | Shown as unavailable, not as “none” |
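To make the difference concrete, here is a small sketch of how a screen could pick its text from the source state and the rows together. The function name and the message wording are made up for illustration; only the four state names come from the table above.

```python
def event_section_text(state: str, events: list[str]) -> str:
    """Choose display text so that "zero events" never looks like a failure."""
    if state == "unavailable":
        # Collection failed: do not claim there are no events.
        return "Event information is unavailable right now."

    if events:
        text = f"{len(events)} event(s) scheduled."
    else:
        text = "No events are scheduled."

    if state == "stale":
        text += " Information may not be current."
    elif state == "partial":
        text += " Some sources could not be checked."
    return text
```

An empty list with a fresh source and an empty list with an unavailable source now produce different text, which is the whole point of keeping the state next to the rows.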
I also watch row counts and shape changes
Public-data sources can change quietly. Field names can change, HTML shape can change, or rows for a specific park can disappear.
That is why collection results need schema hash and row count. These values distinguish “there are fewer events today” from “the parser only read half of them.”
Status distribution gives the same kind of signal. If every facility value suddenly becomes unknown, or if crowding distribution is abnormally skewed, the source or parser may be the problem.
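A hedged sketch of how two consecutive run records could be compared for these signals. The dictionary keys, the 50% row-count ratio, and the 90% "unknown" share are arbitrary assumptions chosen for the example; real thresholds would need tuning per source.

```python
def drift_warnings(previous: dict, current: dict,
                   min_ratio: float = 0.5,
                   max_unknown_share: float = 0.9) -> list[str]:
    """Compare two successful runs and list signs that the source or parser changed."""
    warnings = []

    # A different schema hash means the set of fields changed.
    if previous["schema_hash"] != current["schema_hash"]:
        warnings.append("schema hash changed")

    # A sharp drop suggests the parser read only part of the data.
    if previous["row_count"] > 0 and \
            current["row_count"] < previous["row_count"] * min_ratio:
        warnings.append("row count dropped sharply")

    # Almost everything being "unknown" points at the source or parser.
    distribution = current["status_distribution"]
    total = sum(distribution.values())
    if total > 0 and distribution.get("unknown", 0) / total >= max_unknown_share:
        warnings.append("most statuses are unknown")

    return warnings
```

A warning here does not prove the parser is broken. It only flags the run as worth a look before the numbers are trusted.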
Screen guidance follows ingestion results
Source-check results are not only for operators. They are ingredients for the app and widgets to express “how much can I trust this?”
The Hangangjari API classifies sources as fresh, stale, or unavailable based on the last success time and stale threshold per source. This information explains sources and refresh state on screen, and it becomes warning text and unavailable reasons in widgets.
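The classification can be sketched roughly as below. This is a simplification under one stated assumption: a source with no recorded success is treated as unavailable. The actual rule likely also looks at the latest failure, which this sketch does not model.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_source(last_success_at: Optional[datetime],
                    stale_after: timedelta,
                    now: datetime) -> str:
    """Classify one source from its last success time and stale threshold."""
    if last_success_at is None:
        # Assumption: no recorded success means the source is unavailable.
        return "unavailable"
    if now - last_success_at > stale_after:
        return "stale"
    return "fresh"


# Usage with a fixed clock so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
state = classify_source(now - timedelta(minutes=30), timedelta(minutes=15), now)
```

Passing `now` in explicitly keeps the function deterministic, so the same rule can back both the operations dashboard and the app message without the two drifting apart.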
Source-check results must connect to the UI. The stale state on an operations dashboard and the “information may not be current” message in the app are two expressions of the same fact. If that connection breaks, the server knows about a problem while the app pretends not to.
erDiagram
DATA_SOURCES ||--o{ INGESTION_RUNS : records
DATA_SOURCES ||--o{ OUTING_SIGNALS : publishes
DATA_SOURCES ||--o{ OUTING_FACILITIES : supplies
INGESTION_RUNS }o--|| SOURCE_HEALTH : derives
The ingestion process became part of the screen
The source catalog is the list of what to collect and when. The ingestion run is the record of how collection finished. Row count, status distribution, and schema hash became minimum signals for checking whether a parser read correctly.
If the API does not separate “zero rows” from “collection failed,” users see the same empty screen. Because source-check results feed both freshness presentation and alerts, a public-data app needs to retain records of the ingestion process as much as raw source data.
The source catalog and ingestion run were not tables only for operators. The user-facing “information may not be current” message also comes from the same records. In that sense, the ingestion process became part of the screen.