A Parser That Separates No Data From Collection Failure

Official source grades, schema drift, and ingestion runs explain why a screen is empty

Parsing in a public-data app looks easy. Call a URL, read JSON or HTML, and the job seems done.

In a real app, this is where trust begins to split. The system has to know whether a response is truly empty, whether the HTML or JSON shape changed, whether a park name arrived under a different spelling, and whether the last successful fetch is too old. If the app fails to notice a change in response shape, known as schema drift, users can end up seeing wrong information.

The issue I encountered most often during development was that “there is no data” and “the app could not read the data” look similar from the outside. Some days have zero event rows. Other days look empty because the parser failed after an official page changed.

If both are stored as the same value, the app quietly creates the wrong screen.
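One way to keep the two cases apart is to make the parser return an explicit outcome instead of a bare list. The sketch below is only an illustration: the `{"events": [...]}` envelope, `FetchOutcome`, `ParseResult`, and `parse_events` are hypothetical names, not Hangangjari's actual code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FetchOutcome(Enum):
    OK = "ok"          # rows were parsed
    EMPTY = "empty"    # the source answered and really had zero rows
    FAILED = "failed"  # the payload could not be read as expected


@dataclass(frozen=True)
class ParseResult:
    outcome: FetchOutcome
    rows: tuple = ()
    error: Optional[str] = None


def parse_events(payload: object) -> ParseResult:
    """Classify a decoded JSON payload instead of returning a bare list."""
    # Assumed contract: the source wraps its rows in {"events": [...]}.
    if not isinstance(payload, dict) or "events" not in payload:
        return ParseResult(FetchOutcome.FAILED, error="missing 'events' key")
    events = payload["events"]
    if not isinstance(events, list):
        return ParseResult(FetchOutcome.FAILED, error="'events' is not a list")
    if not events:
        return ParseResult(FetchOutcome.EMPTY)
    return ParseResult(FetchOutcome.OK, rows=tuple(events))
```

With this shape, `{"events": []}` is stored as an empty day, while a payload whose envelope changed is stored as a failure, so the two never share a value.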

Hangangjari’s parser is more than a converter. It accounts for the nature of each official source, how often the source changes, schema drift, park mapping, the source URL, and whether a value is safe to show on screen.

First decide which sources can appear on screen

Not all public data is treated with the same weight. Each source is graded first.

| Grade | Meaning | How the app uses it |
| --- | --- | --- |
| Official API | Documented public API | Can be a primary source |
| Official web/AJAX | JSON/HTML publicly called by an official website | Primary or supporting source |
| Official HTML/RSS | HTML/RSS from an official page | Used after parser stability checks |
| Curated static | Human-reviewed static data | Facility, mapping, and correction data |
| Discovery only | Supporting signal for finding missing data | Not directly exposed in user responses |
| Do not use | Login, personalization, payment, vehicle number, captcha, private API | Not used |

This distinction is tied to wording. To say “confirmed from an official source,” the app has to know which source is the baseline. A Discovery only source is closer to a tool for finding gaps than to a value shown to users.
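The grades can be encoded so that display rules are checked in code rather than remembered. This is a minimal sketch under assumptions: the enum values, `can_display`, `can_claim_official`, and the `parser_stable` flag are illustrative, and treating curated static data as displayable is my reading of the grade table, not a confirmed rule.

```python
from enum import Enum


class SourceGrade(Enum):
    OFFICIAL_API = "official_api"
    OFFICIAL_WEB_AJAX = "official_web_ajax"
    OFFICIAL_HTML_RSS = "official_html_rss"
    CURATED_STATIC = "curated_static"
    DISCOVERY_ONLY = "discovery_only"
    DO_NOT_USE = "do_not_use"


# Grades whose values may reach a user-facing response without extra checks.
ALWAYS_DISPLAYABLE = frozenset({
    SourceGrade.OFFICIAL_API,
    SourceGrade.OFFICIAL_WEB_AJAX,
    SourceGrade.CURATED_STATIC,
})

# Grades that justify the wording "confirmed from an official source".
OFFICIAL = frozenset({
    SourceGrade.OFFICIAL_API,
    SourceGrade.OFFICIAL_WEB_AJAX,
    SourceGrade.OFFICIAL_HTML_RSS,
})


def can_display(grade: SourceGrade, parser_stable: bool = False) -> bool:
    """HTML/RSS sources are shown only after a parser stability check passed."""
    if grade is SourceGrade.OFFICIAL_HTML_RSS:
        return parser_stable
    return grade in ALWAYS_DISPLAYABLE


def can_claim_official(grade: SourceGrade) -> bool:
    return grade in OFFICIAL
```

A Discovery only source fails both checks, which matches its role as a gap-finding tool rather than a displayed value.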

This boundary is conservative. Pulling from more places can make a screen look richer. But if the information can change where a user goes, the app must be able to explain where it came from and when it was checked. For parking and control information especially, source and observation time matter as much as the value itself.

Collection schedules are readable outside code

Hangangjari has a schedule catalog for collection intervals. Collection jobs are grouped by source ID and execution style instead of scattered cron strings in code.

The current categories are:

| Category | Example | Execution style |
| --- | --- | --- |
| Parking master | Parking reference-data sync | Kubernetes CronJob |
| Parking status | Real-time parking status | worker scheduler |
| Outing facility | Convenience and park facilities | Kubernetes CronJob |
| Outing event | Event information | worker scheduler |
| Outing notice | Notices and controls | worker scheduler |
| Realtime context | Seoul real-time city data | worker scheduler |
| Transit dataset | Transit and access helper data | CronJob |
| Forecast generation | Forecast creation | worker scheduler |
| Forecast backtest | Forecast validation | CronJob |

Where a job runs matters less than which source has the problem. The system needs to track which job failed, which source is stale, and which parser produced row-count drift.

The split between worker scheduler and Kubernetes CronJob is not merely a matter of taste. Sources that need short, repeated checks run inside workers. Master, facility, and validation jobs that only need to run daily or every few hours are easier to operate as CronJobs. Even within “collection,” intervals and failure impact differ.
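A schedule catalog of this kind could be as small as a list of typed entries. The sketch below is hypothetical: the source IDs, the intervals, and the `ScheduleEntry` and `Runner` names are invented for illustration and do not reflect the real catalog's values.

```python
from dataclasses import dataclass
from enum import Enum


class Runner(Enum):
    WORKER_SCHEDULER = "worker_scheduler"
    CRONJOB = "kubernetes_cronjob"


@dataclass(frozen=True)
class ScheduleEntry:
    source_id: str
    category: str
    runner: Runner
    interval_seconds: int


# Every ID and interval here is a made-up example value.
CATALOG = (
    ScheduleEntry("parking-master", "Parking master", Runner.CRONJOB, 24 * 3600),
    ScheduleEntry("parking-status", "Parking status", Runner.WORKER_SCHEDULER, 300),
    ScheduleEntry("outing-notice", "Outing notice", Runner.WORKER_SCHEDULER, 900),
    ScheduleEntry("forecast-backtest", "Forecast backtest", Runner.CRONJOB, 24 * 3600),
)


def entries_for(runner: Runner) -> list[ScheduleEntry]:
    """Return the catalog entries that a given runner is responsible for."""
    return [entry for entry in CATALOG if entry.runner is runner]
```

Because each entry carries its source ID, a failed or stale job can be reported by source rather than by where it happened to run.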

Turning source data into screen data

flowchart LR
  Catalog["Collection schedule catalog"] --> Fetch["Fetch via source adapter"]
  Fetch --> Parse["Parser<br/>schema_hash<br/>parser version"]
  Parse --> Normalize["Domain normalization"]
  Normalize --> Validate["Validation<br/>required fields<br/>park mapping<br/>time"]
  Validate --> Upsert["Postgres upsert"]
  Upsert --> Runs["ingestion_runs<br/>status<br/>row count<br/>error"]
  Upsert --> ReadModel["API screen-ready response"]
  Runs --> Health["Source status"]

Raw source data goes through five steps before becoming a screen value.

  1. The catalog provides each source’s contract and execution interval.
  2. The adapter fetches official or public data.
  3. The parser turns raw data into candidate values the app can handle.
  4. The normalizer aligns parks, time, status, and URLs to the app model.
  5. The validator and repository write data to the DB and leave an ingestion run.

This separation avoids hiding failures. Collection failure is different from “no data.” If a state cannot be shown safely, it should appear as unavailable or stale.
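The steps above can be sketched as one function that always leaves a run record, whatever happens. This is a simplified illustration, not the actual pipeline: normalization is folded into `parse`, the status strings are assumptions, and treating "rows arrived but none validated" as a failure is a design choice I am assuming here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IngestionRun:
    source_id: str
    status: str            # "success", "empty", or "failed"
    row_count: int = 0
    error: Optional[str] = None


def run_ingestion(
    source_id: str,
    fetch: Callable[[], object],
    parse: Callable[[object], list],
    validate: Callable[[dict], bool],
    upsert: Callable[[list], None],
) -> IngestionRun:
    """Run fetch -> parse -> validate -> upsert and always return a run record."""
    try:
        raw = fetch()
        candidates = parse(raw)
        valid = [row for row in candidates if validate(row)]
        if candidates and not valid:
            # Rows arrived but none passed validation: a failure, not "no data".
            return IngestionRun(source_id, "failed", error="all rows failed validation")
        upsert(valid)
    except Exception as exc:  # sketch only; real code would catch narrower errors
        return IngestionRun(source_id, "failed", error=f"{type(exc).__name__}: {exc}")
    status = "success" if valid else "empty"
    return IngestionRun(source_id, status, row_count=len(valid))
```

The read side can then show "unavailable" or "stale" when the latest run is `failed`, and reserve the empty state for runs that finished with status `empty`.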

Park names and times are normalized for the screen

External data rarely arrives in the shape the app wants. Hangangjari normalizes these fields separately.

| Item | Reason |
| --- | --- |
| Park mapping | Official filter names, park names, coordinates, and keywords can differ |
| Time | Start, end, registered, modified, and collected times must be separated |
| Status | Scheduled, ongoing, ended, canceled, and unknown are mapped into app states |
| Freshness | Old successful data must not look current |
| Source URL | Users should be able to verify the source |
| Raw payload | Needed for debugging, but not returned directly in app responses |
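The Freshness item is the easiest of these to show in code. The following is a minimal sketch, assuming timezone-aware timestamps and a per-source maximum age; the `freshness` function and its three labels are illustrative names, not the app's actual states.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness(
    last_success_at: Optional[datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> str:
    """Return "fresh", "stale", or "unavailable" for a source."""
    if last_success_at is None:
        return "unavailable"  # the source has never been collected successfully
    now = now or datetime.now(timezone.utc)
    if now - last_success_at > max_age:
        return "stale"        # the last success is older than the allowed age
    return "fresh"
```

A real-time parking source would likely get a much shorter `max_age` than a facility list, which is why the limit is a parameter rather than a constant.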

Park mapping needed particular care. Hangang parks look familiar, but each source describes them slightly differently. Labels such as “Jamwon,” “Banpo,” and “Banpo/Jamwon” carry different meanings depending on the source.

The app has to decide first whether to treat them as one place or separate places.

This was not solved by automation alone. String similarity or coordinate distance can split a place that looks obvious to a person, or merge different places too aggressively.

So park mapping became part of the app’s place model, not just a parser helper.
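One way to keep that decision explicit is a reviewed alias table keyed by source, with no fuzzy fallback. This is a sketch under assumptions: the source IDs, park IDs, and the `PARK_ALIASES` and `map_parks` names are hypothetical, and the real place model may look quite different.

```python
from typing import Optional

# Hypothetical curated alias table: (source_id, raw label) -> park IDs.
# A combined label maps to both parks because a person decided it should.
PARK_ALIASES: dict[tuple[str, str], tuple[str, ...]] = {
    ("events", "banpo"): ("banpo",),
    ("events", "jamwon"): ("jamwon",),
    ("notices", "banpo/jamwon"): ("banpo", "jamwon"),
}


def map_parks(source_id: str, raw_label: str) -> Optional[tuple[str, ...]]:
    """Look up a raw label; return None when no reviewed mapping exists."""
    key = (source_id, raw_label.strip().casefold())
    return PARK_ALIASES.get(key)
```

Returning `None` for an unknown label forces the caller to record a mapping gap instead of guessing, which keeps string similarity out of the display path.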

Collection results are data too

If only fetched results are stored in the DB, there is little to inspect when something goes wrong. The collection process itself has to become data.

erDiagram
  DATA_SOURCES ||--o{ INGESTION_RUNS : reports
  DATA_SOURCES ||--o{ OUTING_SIGNALS : publishes
  DATA_SOURCES ||--o{ OUTING_FACILITIES : provides
  PARKS ||--o{ OUTING_FACILITIES : contains
  OUTING_SIGNALS ||--o{ OUTING_SIGNAL_PARK_LINKS : maps
  PARKS ||--o{ OUTING_SIGNAL_PARK_LINKS : receives

ingestion_runs is the window into where collection failed. It answers questions such as:

  • When did it succeed?
  • How many rows were read?
  • Did the response shape hash change?
  • Did the status distribution suddenly change?
  • Did row-count drift appear?
  • What error caused failure?

These values distinguish “the app is slow” from “the source response changed.” They also make it possible to check whether fixing a parser actually improved collection.
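A response shape hash and a row-count drift check can both be computed with the standard library. The sketch below is one possible approach, not the project's actual algorithm: `shape_of`, `schema_hash`, `row_count_drift`, and the 50% tolerance are assumptions, and nullable fields would need extra handling because a null and a string produce different shapes.

```python
import hashlib
import json
from typing import Optional


def shape_of(value: object) -> object:
    """Reduce a decoded JSON value to its structure: keys and type names only."""
    if isinstance(value, dict):
        return {key: shape_of(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Collapse a list to the distinct shapes of its items.
        shapes = {json.dumps(shape_of(item), sort_keys=True) for item in value}
        return sorted(shapes)
    return type(value).__name__


def schema_hash(rows: list) -> Optional[str]:
    """Hash the structure of parsed rows; None when there is nothing to inspect."""
    if not rows:
        return None
    canonical = json.dumps(shape_of(rows), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def row_count_drift(previous: int, current: int, tolerance: float = 0.5) -> bool:
    """Flag a run whose row count moved more than `tolerance` from the last run."""
    if previous == 0:
        return False  # no baseline to compare against
    return abs(current - previous) / previous > tolerance
```

Two runs with different values but the same keys and types produce the same hash, while a renamed field changes it. Returning `None` for zero rows keeps an empty day from looking like a shape change.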

Translation is separated from collection failure

Localization is not just a translation-file problem. If official data is Korean-first, the app has to decide when to translate event names, notices, and facility names, and when to preserve the original.

Hangangjari separates fetched source text from display copy. The translation cache is a value for screen rendering; it does not replace the original source. Translation failure also must not become collection failure.
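The separation can be illustrated with a record that stores the original text and treats translations as a display-only cache. This is a hypothetical sketch: `SourceText`, `display_text`, `cache_translation`, and the `translate` callable are illustrative names, not the app's real types.

```python
from dataclasses import dataclass, field


@dataclass
class SourceText:
    original: str              # text exactly as collected; never overwritten
    language: str = "ko"
    translations: dict[str, str] = field(default_factory=dict)  # display cache only


def display_text(text: SourceText, locale: str) -> str:
    """Prefer a cached translation; otherwise fall back to the original."""
    if locale == text.language:
        return text.original
    return text.translations.get(locale) or text.original


def cache_translation(text: SourceText, locale: str, translate) -> None:
    """Try to translate; a failure leaves the record and its collection status as-is."""
    try:
        text.translations[locale] = translate(text.original, locale)
    except Exception:
        pass  # sketch only; real code would log the error and retry later
```

Because the translator is called outside the ingestion path, an outage there only means the screen shows the original text for a while.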

“Official data” is not enough

The parser is the gate that makes screen values trustworthy. If the app says “based on official data” without deciding which source it trusts, zero rows and collection failure can collapse into the same empty screen.

So collection success, row count, schema hash, and freshness are stored as inspectable data. Raw payloads are not returned directly to users, and park mapping plus localized display are separated from source preservation.

The biggest lesson from parser work was that “official data” alone does not create a trustworthy screen. The app has to define which sources it trusts, how failures are recorded, and how zero rows are explained before even one line on the screen becomes credible.
