A Parser That Separates No Data From Collection Failure
Official source grades, schema drift, and ingestion runs explain why a screen is empty
Parsing in a public-data app looks easy. Call a URL, read JSON or HTML, and the job seems done.
In a real app, this is where trust begins to split. The system has to know whether a response is truly empty, whether the HTML or JSON shape changed, whether a park name arrived under a different label, and whether the last successful fetch is too old. If the app does not notice a change in response shape (schema drift), users can see wrong information.
The issue I encountered most often during development was that “there is no data” and “the app could not read the data” look similar from the outside. Some days have zero event rows. Other days look empty because the parser failed after an official page changed.
If both are stored as the same value, the app quietly creates the wrong screen.
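One way to keep the two cases apart is to give them different values at the parser boundary. The following is a minimal sketch, not Hangangjari's actual code: `ParseOutcome`, `parse_events`, and the `events` payload key are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class OutcomeKind(Enum):
    ROWS = "rows"      # parsed successfully, at least one row
    EMPTY = "empty"    # parsed successfully, the source really has zero rows
    FAILED = "failed"  # the response could not be read or understood


@dataclass
class ParseOutcome:
    kind: OutcomeKind
    rows: list = field(default_factory=list)
    error: Optional[str] = None


def parse_events(body: str) -> ParseOutcome:
    """Parse a hypothetical JSON payload shaped like {"events": [...]}."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        return ParseOutcome(OutcomeKind.FAILED, error=f"invalid JSON: {exc}")

    # A missing or mistyped key means the shape changed, not that the day is empty.
    if not isinstance(payload, dict) or not isinstance(payload.get("events"), list):
        return ParseOutcome(OutcomeKind.FAILED, error="expected key 'events' holding a list")

    rows = payload["events"]
    if not rows:
        return ParseOutcome(OutcomeKind.EMPTY)
    return ParseOutcome(OutcomeKind.ROWS, rows=rows)
```

With this shape, `{"events": []}` is stored as `EMPTY`, while a payload whose key was renamed is stored as `FAILED` with an error message, so the two can no longer collapse into one value.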
Hangangjari’s parser is not only a converter. It considers the character of the official source, how often it changes, schema drift, park mapping, source URL, and whether the value is safe to show on screen.
First decide which sources can appear on screen
Not all public data is treated with the same weight. Sources are first graded.
| Grade | Meaning | How the app uses it |
|---|---|---|
| Official API | Documented public API | Can be a primary source |
| Official web/AJAX | JSON/HTML publicly called by an official website | Primary or supporting source |
| Official HTML/RSS | HTML/RSS from an official page | Used after parser stability checks |
| Curated static | Human-reviewed static data | Facility, mapping, and correction data |
| Discovery only | Supporting signal for finding missing data | Not directly exposed in user responses |
| Do not use | Login, personalization, payment, vehicle number, captcha, private API | Not used |
This distinction is tied to wording. To say “confirmed from an official source,” the app has to know which source is the baseline. A Discovery only source is closer to a tool for finding gaps than to a value shown to users.
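The grades in the table can be modeled as an enum with explicit rules about exposure and wording. This is a sketch under assumptions: the identifiers are invented for illustration, and the choices that curated static data may appear on screen and that HTML/RSS needs a `parser_stable` flag are one reading of the table, not confirmed behavior.

```python
from enum import Enum


class SourceGrade(Enum):
    OFFICIAL_API = "official_api"
    OFFICIAL_WEB_AJAX = "official_web_ajax"
    OFFICIAL_HTML_RSS = "official_html_rss"
    CURATED_STATIC = "curated_static"
    DISCOVERY_ONLY = "discovery_only"
    DO_NOT_USE = "do_not_use"


# Grades that can back "confirmed from an official source" wording.
OFFICIAL_GRADES = {
    SourceGrade.OFFICIAL_API,
    SourceGrade.OFFICIAL_WEB_AJAX,
    SourceGrade.OFFICIAL_HTML_RSS,
}

# Assumption: curated static data may also be shown, but without official wording.
USER_FACING_GRADES = OFFICIAL_GRADES | {SourceGrade.CURATED_STATIC}


def can_show_to_users(grade: SourceGrade, parser_stable: bool = True) -> bool:
    """Return True if values from this grade may appear in a user response."""
    if grade is SourceGrade.OFFICIAL_HTML_RSS and not parser_stable:
        # HTML/RSS is only used after parser stability checks pass.
        return False
    return grade in USER_FACING_GRADES


def can_claim_official(grade: SourceGrade) -> bool:
    """Return True if the app may describe the value as officially sourced."""
    return grade in OFFICIAL_GRADES
```

Keeping the two checks separate reflects the point about wording: a value can be safe to show without being something the app may call official.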
This boundary is conservative. Pulling from more places can make a screen look richer. But if the information can change where a user goes, the app must be able to explain where it came from and when it was checked. For parking and control information especially, source and observation time matter as much as the value itself.
Collection schedules are readable outside code
Hangangjari has a schedule catalog for collection intervals. Collection jobs are grouped by source ID and execution style instead of scattered cron strings in code.
The current categories are:
| Category | Example | Execution style |
|---|---|---|
| Parking master | Parking reference-data sync | Kubernetes CronJob |
| Parking status | Real-time parking status | worker scheduler |
| Outing facility | Convenience and park facilities | Kubernetes CronJob |
| Outing event | Event information | worker scheduler |
| Outing notice | Notices and controls | worker scheduler |
| Realtime context | Seoul real-time city data | worker scheduler |
| Transit dataset | Transit and access helper data | Kubernetes CronJob |
| Forecast generation | Forecast creation | worker scheduler |
| Forecast backtest | Forecast validation | Kubernetes CronJob |
Where a job runs matters less than which source has the problem. The system needs to track which job failed, which source is stale, and which parser produced row-count drift.
Still, the choice between a worker scheduler and a Kubernetes CronJob is not merely a matter of taste. Sources that need short, repeated checks run inside workers. Master, facility, and validation jobs that are fine daily or every few hours are easier to operate as CronJobs. Even within “collection,” intervals and failure impact differ.
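A schedule catalog of this kind can be plain data that both humans and the scheduler read. The sketch below assumes a Python representation; the field names, source IDs, and especially the interval values are placeholders, since the article does not state the real ones.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Runner(Enum):
    WORKER_SCHEDULER = "worker_scheduler"
    KUBERNETES_CRONJOB = "kubernetes_cronjob"


@dataclass(frozen=True)
class ScheduleEntry:
    source_id: str       # hypothetical identifier
    category: str        # category name from the table
    runner: Runner
    interval: timedelta  # placeholder value, not the real interval


CATALOG = [
    ScheduleEntry("parking_master", "Parking master", Runner.KUBERNETES_CRONJOB, timedelta(days=1)),
    ScheduleEntry("parking_status", "Parking status", Runner.WORKER_SCHEDULER, timedelta(minutes=5)),
    ScheduleEntry("outing_notice", "Outing notice", Runner.WORKER_SCHEDULER, timedelta(minutes=30)),
]


def entries_for(runner: Runner) -> list:
    """Return the catalog entries that a given execution style should run."""
    return [entry for entry in CATALOG if entry.runner is runner]


def find_entry(source_id: str) -> ScheduleEntry:
    """Look up a source's contract by ID; raises KeyError for unknown sources."""
    for entry in CATALOG:
        if entry.source_id == source_id:
            return entry
    raise KeyError(source_id)
```

Because every entry is keyed by source ID, a failed job can be traced back to one source and one expected interval without reading cron strings.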
Turning source data into screen data
flowchart LR
Catalog["Collection schedule catalog"] --> Fetch["Fetch via source adapter"]
Fetch --> Parse["Parser<br/>schema_hash<br/>parser version"]
Parse --> Normalize["Domain normalization"]
Normalize --> Validate["Validation<br/>required fields<br/>park mapping<br/>time"]
Validate --> Upsert["Postgres upsert"]
Upsert --> Runs["ingestion_runs<br/>status<br/>row count<br/>error"]
Upsert --> ReadModel["API screen-ready response"]
Runs --> Health["Source status"]
Raw source data goes through five steps before becoming a screen value.
- The catalog provides each source’s contract and execution interval.
- The adapter fetches official or public data.
- The parser turns raw data into candidate values the app can handle.
- The normalizer aligns parks, time, status, and URLs to the app model.
- The validator and repository write data to the DB and leave an ingestion run.
This separation avoids hiding failures. Collection failure is different from “no data.” If a state cannot be shown safely, it should appear as unavailable or stale.
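The rule that an unsafe state should surface as unavailable or stale can be written as a small decision function over the last successful run. This is an illustrative sketch: `ScreenState`, `resolve_screen_state`, and the `max_age` threshold are assumed names and parameters, not the app's actual API.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class ScreenState(Enum):
    OK = "ok"                    # fresh data with rows
    EMPTY = "empty"              # fresh, successful, and genuinely zero rows
    STALE = "stale"              # last success is older than allowed
    UNAVAILABLE = "unavailable"  # no successful collection to rely on


def resolve_screen_state(
    last_success_at: Optional[datetime],
    last_success_rows: int,
    max_age: timedelta,
    now: datetime,
) -> ScreenState:
    """Decide what the screen may claim, based on the last successful run."""
    if last_success_at is None:
        return ScreenState.UNAVAILABLE
    if now - last_success_at > max_age:
        # Old successful data must not look current, even if it has rows.
        return ScreenState.STALE
    if last_success_rows == 0:
        return ScreenState.EMPTY
    return ScreenState.OK


# Usage: a run that succeeded three hours ago, checked against a one-hour limit.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
state = resolve_screen_state(now - timedelta(hours=3), 12, timedelta(hours=1), now)
```

Only the `EMPTY` state lets the screen say there is nothing to show; every other non-OK state tells the user the app cannot vouch for the value.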
Park names and times are normalized for the screen
External data rarely arrives in the shape the app wants. Hangangjari normalizes these fields separately.
| Item | Reason |
|---|---|
| Park mapping | Official filter names, park names, coordinates, and keywords can differ |
| Time | Start, end, registered, modified, and collected times must be separated |
| Status | Scheduled, ongoing, ended, canceled, and unknown are mapped into app states |
| Freshness | Old successful data must not look current |
| Source URL | Users should be able to verify the source |
| Raw payload | Needed for debugging, but not returned directly in app responses |
Park mapping needed particular care. Hangang parks look familiar, but each source describes them slightly differently. Labels such as “Jamwon,” “Banpo,” and “Banpo/Jamwon” do not mean the same thing in every context.
The app has to decide first whether to treat them as one place or separate places.
This was not solved by automation alone. String similarity or coordinate distance can split a place that looks obvious to a person, or merge different places too aggressively.
So park mapping became part of the app’s place model, not just a parser helper.
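One way to make that decision explicit is a human-reviewed alias table instead of fuzzy matching. The sketch below is hypothetical: the park IDs and the table contents are invented, and linking “Banpo/Jamwon” to both parks is only an example of a decision a reviewer might record.

```python
from typing import Optional

# Hypothetical curated alias table: normalized source label -> app park IDs.
# Each entry is a reviewed decision, not the output of string similarity.
PARK_ALIASES = {
    "banpo": ("banpo",),
    "jamwon": ("jamwon",),
    "banpo/jamwon": ("banpo", "jamwon"),  # illustrative choice: link to both
}


def normalize_label(label: str) -> str:
    """Lowercase the label and trim whitespace around each slash-separated part."""
    return "/".join(part.strip() for part in label.casefold().split("/"))


def map_parks(label: str) -> Optional[tuple]:
    """Return the park IDs for a source label, or None if it needs review."""
    return PARK_ALIASES.get(normalize_label(label))


# Usage: unknown labels return None so they can be queued for review.
known = map_parks(" Banpo / Jamwon ")
unknown = map_parks("Banpo Hangang Park")
```

Returning `None` for an unrecognized label, rather than guessing the nearest match, keeps a wrong merge or split from reaching the screen silently.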
Collection results are data too
If only fetched results are stored in the DB, there is little to inspect when something goes wrong. The collection process itself has to become data.
erDiagram
DATA_SOURCES ||--o{ INGESTION_RUNS : reports
DATA_SOURCES ||--o{ OUTING_SIGNALS : publishes
DATA_SOURCES ||--o{ OUTING_FACILITIES : provides
PARKS ||--o{ OUTING_FACILITIES : contains
OUTING_SIGNALS ||--o{ OUTING_SIGNAL_PARK_LINKS : maps
PARKS ||--o{ OUTING_SIGNAL_PARK_LINKS : receives
ingestion_runs is the window into where collection failed.
- When did it succeed?
- How many rows were read?
- Did the response shape hash change?
- Did the status distribution suddenly change?
- Did row-count drift appear?
- What error caused failure?
These values distinguish “the app is slow” from “the source response changed.” They also make it possible to check whether fixing a parser actually improved collection.
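Two of those questions, whether the response shape hash changed and whether row-count drift appeared, can be sketched as follows. This is one possible approach under assumptions: hashing only keys and value types, the `IngestionRun` fields, and the 50% drift threshold are all illustrative choices, not the app's documented behavior.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def shape_of(value):
    """Reduce a JSON value to its structure: keys and type names, no data."""
    if isinstance(value, dict):
        return {key: shape_of(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Describe a list by the distinct shapes of its items.
        shapes = {json.dumps(shape_of(item), sort_keys=True) for item in value}
        return ["list", sorted(shapes)]
    return type(value).__name__


def schema_hash(payload) -> str:
    """Hash the structure so that changed values alone do not change the hash."""
    canonical = json.dumps(shape_of(payload), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class IngestionRun:
    source_id: str
    status: str  # "success" or "failed"
    row_count: int
    schema_hash: Optional[str]
    error: Optional[str] = None


def drift_flags(prev: IngestionRun, cur: IngestionRun, max_ratio: float = 0.5) -> list:
    """Compare two runs of one source and name any suspicious changes."""
    flags = []
    if prev.schema_hash and cur.schema_hash and prev.schema_hash != cur.schema_hash:
        flags.append("schema_hash_changed")
    if prev.row_count > 0:
        change = abs(cur.row_count - prev.row_count) / prev.row_count
        if change > max_ratio:
            flags.append("row_count_drift")
    return flags
```

Comparing each run against the previous one also gives a simple check after a parser fix: the flags should disappear on the next successful run.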
Translation is separated from collection failure
Localization is not just a translation-file problem. If official data is Korean-first, the app has to decide when to translate event names, notices, and facility names, and when to preserve the original.
Hangangjari separates fetched source text from display copy. The translation cache is a value for screen rendering; it does not replace the original source. Translation failure also must not become collection failure.
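The separation between source text and display copy can be sketched as a lookup that always keeps the original and falls back to it when no translation exists. The names `DisplayText` and `display_text`, and the in-memory dict standing in for the translation cache, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisplayText:
    text: str         # what the screen renders
    original: str     # the fetched source text, always preserved
    translated: bool  # True only when a cached translation was used


def display_text(original: str, locale: str, source_locale: str, cache: dict) -> DisplayText:
    """Pick display copy without ever replacing or losing the source text."""
    if locale == source_locale:
        return DisplayText(original, original, False)
    cached = cache.get((original, locale))
    if cached:
        return DisplayText(cached, original, True)
    # A missing translation is not a collection failure: show the original.
    return DisplayText(original, original, False)


# Usage with a hypothetical cache keyed by (source text, target locale).
cache = {("한강 축제", "en"): "Hangang Festival"}
hit = display_text("한강 축제", "en", "ko", cache)
miss = display_text("공원 안내", "en", "ko", cache)
```

Because the function never raises on a cache miss, a translation outage degrades the display copy only and leaves the ingestion result untouched.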
“Official data” is not enough
The parser is the gate that makes screen values trustworthy. If the app says “based on official data” without deciding which source it trusts, zero rows and collection failure can collapse into the same empty screen.
So collection success, row count, schema hash, and freshness are stored as inspectable data. Raw payloads are not returned directly to users, and park mapping plus localized display are separated from source preservation.
The biggest lesson from parser work was that “official data” alone does not create a trustworthy screen. The app has to define which sources it trusts, how failures are recorded, and how zero rows are explained before even one line on the screen becomes credible.