Enrichment pipeline & data plays

Every phase below maps to real scripts and JSON artifacts in the repo. Visual enrichment is what you see in the UI (badges, flows, charts); data opportunities are angles you can still productize or research.

Phases (chronological story)

Not every run executes every sub-phase; build_comprehensive.py is the final join that turns partial progress into one row per horse.

Production ingest
p0

auction / listing scrapers → horses2

Raw lots: name, sire, price, source_key, initial pedigree. This is the commercial spine every later phase hangs on.

production_horses.json
Phase 2 — Progeny cache
p2

enrichment/phase2_progeny.py

Paginates HorseTelex sire pages; builds offspring index (name → damsire, jumpscore). Powers damsire recovery and jumpscore via progeny inference.

horsetelex_progeny_cache.json
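
The offspring index from Phase 2 could be sketched roughly as follows. This is a minimal illustration, not the code in phase2_progeny.py: the record keys `name`, `damsire`, and `jumpscore`, the helper names, and the name normalization are all assumptions, and the real layout of horsetelex_progeny_cache.json may differ.

```python
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so lookups tolerate formatting drift."""
    return " ".join(name.split()).casefold()


def build_offspring_index(records: Iterable[dict]) -> dict:
    """Map normalized offspring name -> {"damsire", "jumpscore"}.

    Assumption: each scraped record has a 'name' and may carry
    'damsire' and 'jumpscore'. Later records only fill missing fields.
    """
    index: dict = {}
    for rec in records:
        name = rec.get("name")
        if not name:
            continue
        entry = index.setdefault(
            normalize(name), {"damsire": None, "jumpscore": None}
        )
        for field in ("damsire", "jumpscore"):
            if entry[field] is None and rec.get(field) is not None:
                entry[field] = rec[field]
    return index


def recover_damsire(index: dict, horse_name: str) -> Optional[str]:
    """Look up a horse by name and return its damsire if the index has one."""
    entry = index.get(normalize(horse_name))
    return entry["damsire"] if entry else None
```

Keying on a normalized name keeps the lookup cheap, at the cost of possible collisions between namesakes; a birth year would be the natural tie-breaker.
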
Phase 3a — Studbook resolution
p3a

phase3_multisource.py --phase 3a

Studbook-based recovery (SBS, KWPN, BWP style): birth year, UELN, and pedigree hints, collected before the HorseTelex (HT) retry.

phase3a_studbook_results.json
Phase 3b — HorseTelex retry
p3b

phase3_multisource.py --phase 3b

Re-matches HorseTelex (HT) using the recovered year of birth (YOB); fills ht_id, jumpscore, and riders when possible.

enrichment_results.json (updates)
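
To illustrate why a recovered birth year helps the retry, here is a hedged sketch of candidate selection. The function `pick_ht_match` and the candidate keys `name`, `yob`, and `ht_id` are hypothetical; the actual matching logic in phase3_multisource.py is not shown on this page and may be more lenient or more strict.

```python
from typing import Optional


def _norm(name: str) -> str:
    """Case-fold and collapse whitespace before comparing names."""
    return " ".join(name.split()).casefold()


def pick_ht_match(
    name: str, yob: Optional[int], candidates: list
) -> Optional[dict]:
    """Return the single unambiguous candidate, or None.

    Assumption: each candidate is a dict with 'name', 'yob', 'ht_id'.
    """
    matches = [c for c in candidates if _norm(c.get("name", "")) == _norm(name)]
    if yob is not None:
        # A recovered birth year narrows namesakes down to one horse.
        matches = [c for c in matches if c.get("yob") == yob]
    return matches[0] if len(matches) == 1 else None
```

Without a birth year, two namesakes stay ambiguous and the sketch returns None; with the year from Phase 3a, the same search can resolve to one ht_id.
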
Phase 3c — Competition (FEI + Hippomundo)
p3c

phase3_fei_competition.py, phase3c_hippomundo*.py, phase4_*

FEI: Playwright search, heights → computed jumpscore mapping. Hippomundo: results pages, pagination, rate limits. Family expansion in phase4_hippomundo_family.

phase3_fei_results.json, phase3c_hippomundo_results.json, phase4 caches
Phase 3d — Price estimation cascade
p3d

phase3_multisource.py --phase 3d

Tiered heuristics (studbook baselines, sex/age) that make no additional HTTP requests; the output feeds price_tier / estimate on each horse.

phase3d_price_estimates.json
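
A tiered cascade of this kind might look like the sketch below. Everything numeric here is a placeholder chosen for illustration, not the project's real baselines, and the output key `tier_used` is hypothetical; how the result maps onto price_tier is not specified on this page. Age adjustment is omitted for brevity.

```python
# Placeholder numbers for illustration only; not the project's real baselines.
STUDBOOK_BASELINE = {"KWPN": 20000.0, "BWP": 16000.0}
GLOBAL_BASELINE = 12000.0
SEX_FACTOR = {"stallion": 1.25, "mare": 1.0, "gelding": 0.75}


def estimate_price(horse: dict) -> dict:
    """Return {'estimate', 'tier_used'} from the first tier that applies.

    Pure in-memory lookups: no network calls are made.
    """
    # Tier 1: an observed sale price beats any heuristic.
    if horse.get("sale_price") is not None:
        return {"estimate": float(horse["sale_price"]), "tier_used": "observed"}

    # Tier 2: studbook baseline; Tier 3: global fallback.
    base = STUDBOOK_BASELINE.get(horse.get("studbook"))
    tier = "studbook"
    if base is None:
        base, tier = GLOBAL_BASELINE, "global"

    # Sex adjustment applies to both heuristic tiers; unknown sex is neutral.
    factor = SEX_FACTOR.get(horse.get("sex"), 1.0)
    return {"estimate": base * factor, "tier_used": tier}
```

Recording which tier produced the number lets downstream consumers treat a global-fallback estimate with more caution than a studbook-based one.
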
Phase 3e — Embryo predictions (optional)
p3e

phase3_multisource.py --phase 3e

Hooks embryo scoring into the same enrichment universe for Insights / ML summaries.

embryo-related artifacts when run
WBFSH rankings
wbfsh

enrichment/wbfsh_scraper.py + indexes

Stallion ranking and breeder; an FEI-index fallback fills pedigree gaps when HorseTelex (HT) data is thin.

wbfsh_matches.json, wbfsh_fei_index.json
Comprehensive merge
merge

build_comprehensive.py

Applies per-field precedence (HT > prod > Hippo > WBFSH / progeny), records data_sources[] and jumpscore_source, and emits a single row per horse for the API and UI.

horses_comprehensive.json, horses_consolidated.json
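
The per-field precedence in the merge step can be sketched as below. This is an assumption-laden illustration rather than the logic of build_comprehensive.py: the source labels (`ht`, `prod`, `hippo`, `wbfsh`, `progeny`) are hypothetical, and because the stated order puts WBFSH and progeny at the same rank, the sketch arbitrarily tries WBFSH first.

```python
from typing import Any

# Highest priority first, following HT > prod > Hippo > WBFSH / progeny.
# Assumption: the labels and the WBFSH-before-progeny order are illustrative.
PRECEDENCE = ["ht", "prod", "hippo", "wbfsh", "progeny"]


def merge_horse(per_source: dict) -> dict:
    """Merge one horse's per-source records into a single row.

    For each field, the first non-null value in precedence order wins.
    """
    merged: dict[str, Any] = {}
    used: list[str] = []
    for source in PRECEDENCE:
        record = per_source.get(source) or {}
        contributed = False
        for field, value in record.items():
            if value is None or field in merged:
                continue  # empty here, or already filled by a better source
            merged[field] = value
            contributed = True
            if field == "jumpscore":
                merged["jumpscore_source"] = source
        if contributed:
            used.append(source)
    # Only sources that actually supplied a field are listed.
    merged["data_sources"] = used
    return merged
```

For example, a horse with a name from HT, a price from production ingest, and a jumpscore only in the progeny cache would end up with `jumpscore_source` set to `"progeny"` and three entries in `data_sources`.
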
Serve & visualize
serve

enrich_api.py + Next.js frontend

FastAPI loads consolidated JSON; dashboard shows coverage, lineage, React Flow pipelines, Recharts ML lab.

/stats, /methodology, /ml/summary, /horse/{id}
Visual enrichment (today)

Beyond raw columns: badges for data_sources, 15-hue theme tokens (nav accents, spectrum hairline, mesh background), horse detail lineage card, methodology & pipeline timelines, Insights charts (verdict pie, SHAP bars, model R²). React Flow graphs explain ingest vs ML paths.

Next visuals we could ship

  • Heatmap: studbook × median sale_price (needs aggregation endpoint).
  • Scatter: jumpscore vs log(price) with source hue (client-side from export or future /analytics).
  • Geo: FEI country choropleth when fei_country coverage is high enough.
  • Pedigree mini-graph for a horse (sire/dam/damsire nodes) with link-outs.
  • Sparklines when longitudinal price or height series exist.
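
The aggregation behind the proposed studbook heatmap could be as small as the sketch below. It assumes each consolidated row carries `studbook` and `sale_price` keys (sale_price appears above; `studbook` as a field name is an assumption), and the function name is hypothetical. No endpoint is implied; this only shows the grouping.

```python
from collections import defaultdict
from statistics import median


def median_price_by_studbook(horses: list) -> dict:
    """Group rows by studbook and return count plus median sale_price.

    Rows missing either field are skipped rather than treated as zero.
    """
    buckets = defaultdict(list)
    for horse in horses:
        studbook = horse.get("studbook")
        price = horse.get("sale_price")
        if studbook and price is not None:
            buckets[studbook].append(price)
    return {
        studbook: {"n": len(prices), "median_sale_price": median(prices)}
        for studbook, prices in sorted(buckets.items())
    }
```

Returning the count alongside the median lets the UI grey out cells that rest on only a handful of sales.
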

See also Methodology, Insights, and horse Search for lineage cards.

Unique data angles

Sport vs commerce gap

Horses with sale_price but no jumpscore / FEI tell you ‘paper vs arena’ — good for buyer-risk flags.
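
As a rough sketch, the flag could be a single predicate over a consolidated row. The function name is hypothetical, and treating the label `"fei"` in data_sources as evidence of FEI coverage is an assumption about how sources are recorded.

```python
def is_paper_only(horse: dict) -> bool:
    """True when a horse has a sale price but no sport evidence."""
    has_price = horse.get("sale_price") is not None
    has_jumpscore = horse.get("jumpscore") is not None
    # Assumption: FEI coverage shows up as the label "fei" in data_sources.
    has_fei = "fei" in (horse.get("data_sources") or [])
    return has_price and not (has_jumpscore or has_fei)
```
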

Triple-system horses

Use source_overlaps (HT+FEI+Hippomundo) as a gold cohort for benchmarking model confidence.

Jumpscore provenance mix

Stratify models by jumpscore_source (HT vs progeny vs FEI-derived) to measure proxy noise.
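
One way to read "proxy noise" is prediction error per provenance stratum. The sketch below assumes each row has `sale_price`, `jumpscore_source`, and a hypothetical `predicted_price` field holding a model output; none of this is confirmed as how the project evaluates its models.

```python
from collections import defaultdict
from statistics import mean


def error_by_jumpscore_source(rows: list) -> dict:
    """Mean absolute price error per jumpscore_source stratum.

    Rows without both an actual and a predicted price are skipped.
    """
    errors = defaultdict(list)
    for row in rows:
        actual = row.get("sale_price")
        predicted = row.get("predicted_price")  # hypothetical field
        if actual is None or predicted is None:
            continue
        source = row.get("jumpscore_source") or "none"
        errors[source].append(abs(actual - predicted))
    return {
        source: {"n": len(errs), "mae": mean(errs)}
        for source, errs in errors.items()
    }
```

If the progeny-inferred stratum shows a clearly larger error than the HT stratum, that gap is a first estimate of what the proxy costs.
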

Sire × damsire pricing

Cross-tab mean/median price — BLUP-style features already feed price_predictor; expose as a public chart later.

Registry + height story

FEI max_height vs studbook: who jumps above their paperwork?

Embryo vs foal listings

Filter names/metadata for embryo lots; compare predicted bands from embryo_predictions.json when mounted.