A Parser That Separates No Data From Collection Failure

Parsing in a public-data app looks easy. Call a URL, read JSON or HTML, and the job seems done.

In a real app, this is where trust begins to split. The system has to know whether a response is truly empty, whether the HTML or JSON shape changed, whether a park name arrived under a different label, and whether the last successful fetch is too old. If the app does not notice a change in response shape (schema drift), users can see wrong information.

The issue I encountered most often during development was that “there is no data” and “the app could not read the data” look similar from the outside. Some days have zero event rows. Other days look empty because the parser failed after an official page changed.

If both are stored as the same value, the app quietly creates the wrong screen.

Hangangjari’s parser is therefore more than a converter. It takes into account the nature of each official source, how often the source changes, schema drift, park mapping, the source URL, and whether the value is safe to show on screen.

First decide which sources can appear on screen

Not all public data is treated with the same weight. Sources are first graded.

Grade	Meaning	How the app uses it
Official API	Documented public API	Can be a primary source
Official web/AJAX	JSON/HTML publicly called by an official website	Primary or supporting source
Official HTML/RSS	HTML/RSS from an official page	Used after parser stability checks
Curated static	Human-reviewed static data	Facility, mapping, and correction data
Discovery only	Supporting signal for finding missing data	Not directly exposed in user responses
Do not use	Login, personalization, payment, vehicle number, captcha, private API	Not used

This distinction is tied to wording. To say “confirmed from an official source,” the app has to know which source is the baseline. A Discovery only source is closer to a tool for finding gaps than to a value shown to users.
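The grading table can be expressed as a small gate in code. The following is a minimal sketch, not Hangangjari's actual implementation: the enum values, the `USER_FACING` set, and the `can_expose` function are hypothetical names, and the rule for each grade is an assumption read from the table above.

```python
from enum import Enum


class SourceGrade(Enum):
    """Source grades from the table above (identifier names are illustrative)."""
    OFFICIAL_API = "official_api"
    OFFICIAL_WEB_AJAX = "official_web_ajax"
    OFFICIAL_HTML_RSS = "official_html_rss"
    CURATED_STATIC = "curated_static"
    DISCOVERY_ONLY = "discovery_only"
    DO_NOT_USE = "do_not_use"


# Assumption: these grades may feed values that reach a user-facing response.
USER_FACING = {
    SourceGrade.OFFICIAL_API,
    SourceGrade.OFFICIAL_WEB_AJAX,
    SourceGrade.OFFICIAL_HTML_RSS,
    SourceGrade.CURATED_STATIC,
}


def can_expose(grade: SourceGrade, parser_stable: bool = True) -> bool:
    """Return True if a value from this grade may appear in a user response."""
    if grade is SourceGrade.OFFICIAL_HTML_RSS:
        # HTML/RSS sources are used only after parser stability checks.
        return parser_stable
    # Discovery-only and do-not-use grades never reach the screen.
    return grade in USER_FACING
```

Keeping the gate in one place means a Discovery only source can still drive gap detection without any code path accidentally returning its values to users.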

This boundary is conservative. Pulling from more places can make a screen look richer. But if the information can change where a user goes, the app must be able to explain where it came from and when it was checked. For parking and control information especially, source and observation time matter as much as the value itself.

Collection schedules are readable outside code

Hangangjari has a schedule catalog for collection intervals. Collection jobs are grouped by source ID and execution style instead of scattered cron strings in code.

The current categories are:

Category	Example	Execution style
Parking master	Parking reference-data sync	Kubernetes CronJob
Parking status	Real-time parking status	worker scheduler
Outing facility	Convenience and park facilities	Kubernetes CronJob
Outing event	Event information	worker scheduler
Outing notice	Notices and controls	worker scheduler
Realtime context	Seoul real-time city data	worker scheduler
Transit dataset	Transit and access helper data	Kubernetes CronJob
Forecast generation	Forecast creation	worker scheduler
Forecast backtest	Forecast validation	Kubernetes CronJob

Where a job runs matters less than which source has the problem. The system needs to track which job failed, which source is stale, and which parser produced row-count drift.

The split between worker scheduler and Kubernetes CronJob is not merely a matter of taste. Sources that need short, repeated checks run inside workers. Master, facility, and validation jobs that only need to run daily or every few hours are easier to operate as CronJobs. Even within “collection,” intervals and failure impact differ.
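A schedule catalog of this kind might look like the sketch below. Everything in it is an assumption for illustration: the `ScheduleEntry` fields, the source IDs, and especially the interval values are hypothetical and are not taken from Hangangjari's real catalog.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Runner(Enum):
    WORKER_SCHEDULER = "worker_scheduler"
    KUBERNETES_CRONJOB = "kubernetes_cronjob"


@dataclass(frozen=True)
class ScheduleEntry:
    source_id: str         # hypothetical identifier
    category: str          # category name from the table above
    runner: Runner         # where the job executes
    interval_seconds: int  # hypothetical interval, for illustration only


CATALOG: List[ScheduleEntry] = [
    ScheduleEntry("parking_master", "Parking master", Runner.KUBERNETES_CRONJOB, 24 * 3600),
    ScheduleEntry("parking_status", "Parking status", Runner.WORKER_SCHEDULER, 300),
    ScheduleEntry("outing_notice", "Outing notice", Runner.WORKER_SCHEDULER, 1800),
    ScheduleEntry("forecast_backtest", "Forecast backtest", Runner.KUBERNETES_CRONJOB, 6 * 3600),
]


def entries_for(runner: Runner) -> List[ScheduleEntry]:
    """List the catalog entries that a given execution style is responsible for."""
    return [entry for entry in CATALOG if entry.runner is runner]


def has_unique_source_ids(catalog: List[ScheduleEntry]) -> bool:
    """Failures are tracked per source, so each source_id should appear once."""
    ids = [entry.source_id for entry in catalog]
    return len(ids) == len(set(ids))
```

Because each entry carries a source ID, a failed run can be attributed to a source regardless of whether a worker or a CronJob executed it.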

Turning source data into screen data

flowchart LR
  Catalog["Collection schedule catalog"] --> Fetch["Fetch via source adapter"]
  Fetch --> Parse["Parser<br/>schema_hash<br/>parser version"]
  Parse --> Normalize["Domain normalization"]
  Normalize --> Validate["Validation<br/>required fields<br/>park mapping<br/>time"]
  Validate --> Upsert["Postgres upsert"]
  Upsert --> Runs["ingestion_runs<br/>status<br/>row count<br/>error"]
  Upsert --> ReadModel["API screen-ready response"]
  Runs --> Health["Source status"]

Raw source data goes through five steps before becoming a screen value.

1. The catalog provides each source’s contract and execution interval.
2. The adapter fetches official or public data.
3. The parser turns raw data into candidate values the app can handle.
4. The normalizer aligns parks, time, status, and URLs to the app model.
5. The validator and repository write data to the DB and leave an ingestion run.

This separation avoids hiding failures. Collection failure is different from “no data.” If a state cannot be shown safely, it should appear as unavailable or stale.
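On the read side, the difference between “no data” and “could not read” can be made explicit as a screen state. The sketch below is one possible shape under stated assumptions: `ViewState`, `LastSuccess`, `view_state`, and the `max_age` threshold are hypothetical names, and the input is assumed to be the most recent successful run for a source, or `None` if there has never been one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class ViewState(Enum):
    OK = "ok"                    # fresh success with rows
    EMPTY = "empty"              # fresh success with zero rows: truly no data
    STALE = "stale"              # last success is older than allowed
    UNAVAILABLE = "unavailable"  # no successful run to rely on


@dataclass(frozen=True)
class LastSuccess:
    row_count: int
    finished_at: datetime


def view_state(last_success: Optional[LastSuccess],
               now: datetime,
               max_age: timedelta) -> ViewState:
    """Decide what the screen may claim about a source."""
    if last_success is None:
        # Nothing was ever read successfully, so "no events" would be a lie.
        return ViewState.UNAVAILABLE
    if now - last_success.finished_at > max_age:
        # Recent runs failed or did not happen; old data must not look current.
        return ViewState.STALE
    if last_success.row_count == 0:
        return ViewState.EMPTY
    return ViewState.OK
```

Only the `EMPTY` state justifies copy such as “no events today.” A parser failure after a page change never produces a new success, so the source drifts toward `STALE` instead of silently showing an empty list.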

Park names and times are normalized for the screen

External data rarely arrives in the shape the app wants. Hangangjari normalizes these fields separately.

Item	Reason
Park mapping	Official filter names, park names, coordinates, and keywords can differ
Time	Start, end, registered, modified, and collected times must be separated
Status	Scheduled, ongoing, ended, canceled, and unknown are mapped into app states
Freshness	Old successful data must not look current
Source URL	Users should be able to verify the source
Raw payload	Needed for debugging, but not returned directly in app responses

Park mapping needed particular care. Hangang parks look familiar, but each source describes them slightly differently. Labels such as “Jamwon,” “Banpo,” and “Banpo/Jamwon” do not carry the same meaning in every context.

The app has to decide first whether to treat them as one place or separate places.

This was not solved by automation alone. String similarity or coordinate distance can split a place that looks obvious to a person, or merge different places too aggressively.

So park mapping became part of the app’s place model, not just a parser helper.
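One way to keep that decision in the place model is a curated alias table rather than fuzzy matching. The sketch below is an assumption-heavy illustration: the park IDs, the alias entries, and the choice to map “Banpo/Jamwon” to both parks are hypothetical and only show what an explicit decision could look like.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical, human-reviewed alias table: source label -> park IDs.
# Mapping "banpo/jamwon" to two parks is an illustrative choice, not a known rule.
PARK_ALIASES: Dict[str, Tuple[str, ...]] = {
    "banpo": ("banpo",),
    "jamwon": ("jamwon",),
    "banpo/jamwon": ("banpo", "jamwon"),
}


def normalize_label(label: str) -> str:
    """Lowercase, collapse whitespace, and remove spaces around slashes."""
    collapsed = " ".join(label.strip().lower().split())
    return re.sub(r"\s*/\s*", "/", collapsed)


def resolve_parks(label: str) -> Optional[Tuple[str, ...]]:
    """Return curated park IDs, or None when the label is not in the table.

    Unknown labels are deliberately not guessed with string similarity;
    returning None lets validation flag them for human review.
    """
    return PARK_ALIASES.get(normalize_label(label))
```

An unmapped label then shows up as a validation failure to review, instead of being merged into the nearest-looking park.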

Collection results are data too

If only fetched results are stored in the DB, there is little to inspect when something goes wrong. The collection process itself has to become data.
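On the write side, this means every attempt leaves a record whether or not rows were produced. A minimal sketch follows, assuming a hypothetical `IngestionRun` record and a `run_ingestion` wrapper; the field names are illustrative and are not the actual `ingestion_runs` columns.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class IngestionRun:
    source_id: str
    status: str                # "success" or "failed"
    row_count: Optional[int]   # None when the parser never produced rows
    error: Optional[str]


def run_ingestion(source_id: str,
                  fetch_and_parse: Callable[[], List[dict]]) -> IngestionRun:
    """Run one collection attempt and always return a record of it."""
    try:
        rows = fetch_and_parse()
    except Exception as exc:  # broad on purpose: any failure must be recorded
        return IngestionRun(source_id, "failed", None, f"{type(exc).__name__}: {exc}")
    # Zero rows after a clean parse is a success with row_count 0, not a failure.
    return IngestionRun(source_id, "success", len(rows), None)
```

The key distinction is `row_count=0` versus `row_count=None`: the first says the source was read and had nothing, the second says the source was never read.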

erDiagram
  DATA_SOURCES ||--o{ INGESTION_RUNS : reports
  DATA_SOURCES ||--o{ OUTING_SIGNALS : publishes
  DATA_SOURCES ||--o{ OUTING_FACILITIES : provides
  PARKS ||--o{ OUTING_FACILITIES : contains
  OUTING_SIGNALS ||--o{ OUTING_SIGNAL_PARK_LINKS : maps
  PARKS ||--o{ OUTING_SIGNAL_PARK_LINKS : receives

ingestion_runs is the window into where collection failed.

When did it succeed?
How many rows were read?
Did the response shape hash change?
Did the status distribution suddenly change?
Did row-count drift appear?
What error caused failure?

These values distinguish “the app is slow” from “the source response changed.” They also make it possible to check whether fixing a parser actually improved collection.
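Two of these checks, the response shape hash and row-count drift, can be sketched as follows. This is one possible approach and not necessarily how Hangangjari computes `schema_hash`: the `shape_of`, `schema_hash`, and `row_count_drift` functions and the 50% tolerance are assumptions for illustration.

```python
import hashlib
import json


def shape_of(value):
    """Reduce a JSON value to its structure: keys and type names, no data."""
    if isinstance(value, dict):
        return {key: shape_of(item) for key, item in sorted(value.items())}
    if isinstance(value, list):
        # A list is described by the distinct shapes of its elements,
        # so the number of rows does not affect the result.
        shapes = {json.dumps(shape_of(item), sort_keys=True) for item in value}
        return ["list", sorted(shapes)]
    return type(value).__name__


def schema_hash(payload) -> str:
    """Hash the structure of a parsed JSON payload with SHA-256."""
    canonical = json.dumps(shape_of(payload), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def row_count_drift(baseline: int, current: int, tolerance: float = 0.5) -> bool:
    """Flag a run whose row count moved more than `tolerance` from the baseline."""
    if baseline <= 0:
        # No meaningful baseline to compare against.
        return False
    return abs(current - baseline) / baseline > tolerance
```

In this sketch an empty list hashes differently from a populated one, so a caller would likely compare hashes only between runs that returned rows. Otherwise a legitimately empty day would be reported as a shape change.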

Translation is separated from collection failure

Localization is not just a translation-file problem. If official data is Korean-first, the app has to decide when to translate event names, notices, and facility names, and when to preserve the original.

Hangangjari separates fetched source text from display copy. The translation cache is a value for screen rendering; it does not replace the original source. Translation failure also must not become collection failure.
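That separation can be sketched as a lookup that falls back to the original text. The names here (`DisplayText`, `display_text`, and a cache keyed by original text and target locale) are hypothetical; the real cache structure is not described in this article.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DisplayText:
    text: str
    locale: str
    is_translated: bool


def display_text(original: str,
                 source_locale: str,
                 target_locale: str,
                 cache: Dict[Tuple[str, str], str]) -> DisplayText:
    """Pick display copy without ever modifying or discarding the original."""
    if target_locale == source_locale:
        return DisplayText(original, source_locale, False)
    translated = cache.get((original, target_locale))
    if translated:
        return DisplayText(translated, target_locale, True)
    # Cache miss or failed translation: show the preserved original instead.
    # Collection status is untouched; only the display copy degrades.
    return DisplayText(original, source_locale, False)
```

Because the function only reads the cache, a translation outage changes what is displayed but never changes what was collected or how the ingestion run is recorded.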

“Official data” is not enough

The parser is the gate that makes screen values trustworthy. If the app says “based on official data” without deciding which source it trusts, zero rows and collection failure can collapse into the same empty screen.

So collection success, row count, schema hash, and freshness are stored as inspectable data. Raw payloads are not returned directly to users, and park mapping plus localized display are separated from source preservation.

The biggest lesson from parser work was that “official data” alone does not create a trustworthy screen. The app has to define which sources it trusts, how failures are recorded, and how zero rows are explained before even one line on the screen becomes credible.