Enrichment pipeline & data plays
Every phase below maps to real scripts and JSON artifacts in the repo. Visual enrichment is what you see in the UI (badges, flows, charts); data opportunities are angles you can still productize or research.
Phases (chronological story)
Not every run executes every sub-phase; build_comprehensive.py is the final join that turns partial progress into one row per horse.
auction / listing scrapers → horses2
Raw lots: name, sire, price, source_key, initial pedigree. This is the commercial spine every later phase hangs on.
enrichment/phase2_progeny.py
Paginates HorseTelex sire pages; builds offspring index (name → damsire, jumpscore). Powers damsire recovery and jumpscore via progeny inference.
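The offspring index described above can be sketched as follows. This is a minimal illustration, not the code in phase2_progeny.py: the record keys (name, damsire, jumpscore), the normalize helper, and the first-record-wins rule are all assumptions made for the example.

```python
from typing import Optional


def normalize(name: str) -> str:
    """Case- and whitespace-insensitive lookup key (assumed normalization)."""
    return " ".join(name.lower().split())


def build_progeny_index(records: list[dict]) -> dict[str, dict]:
    """Map normalized offspring name -> {"damsire": ..., "jumpscore": ...}."""
    index: dict[str, dict] = {}
    for rec in records:
        key = normalize(rec["name"])
        # Keep the first record seen for a name; later duplicates are ignored.
        index.setdefault(
            key,
            {"damsire": rec.get("damsire"), "jumpscore": rec.get("jumpscore")},
        )
    return index


def recover_damsire(horse: dict, index: dict[str, dict]) -> Optional[str]:
    """Return the horse's own damsire if present, else the indexed one."""
    if horse.get("damsire"):
        return horse["damsire"]
    hit = index.get(normalize(horse["name"]))
    return hit["damsire"] if hit else None
```

Matching on name alone can collide across unrelated horses; a real index would likely need sire or birth year in the key as well.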
phase3_multisource.py --phase 3a
Studbook recovery in the SBS/KWPN/BWP style: birth year, UELN, and pedigree hints, collected before the HorseTelex (HT) retry.
phase3_multisource.py --phase 3b
Re-matches against HT using the recovered year of birth (YOB); fills ht_id, jumpscore, and riders when possible.
phase3_fei_competition.py, phase3c_hippomundo*.py, phase4_*
FEI: Playwright-driven search; competition heights are mapped to a computed jumpscore. Hippomundo: results pages, with pagination and rate limits. Family expansion happens in phase4_hippomundo_family.
phase3_multisource.py --phase 3d
Tiered heuristics (studbook baselines, sex/age) that make no additional HTTP requests; they feed price_tier / estimate on horses.
phase3_multisource.py --phase 3e
Hooks embryo scoring into the same enrichment universe for Insights / ML summaries.
enrichment/wbfsh_scraper.py + indexes
Stallion ranking and breeder data; an FEI-index fallback fills pedigree gaps when HT coverage is thin.
build_comprehensive.py
Applies per-field precedence (HT > prod > Hippo > WBFSH / progeny), records data_sources[] and jumpscore_source, and emits a single row per horse for the API and UI.
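A per-field precedence join of this kind can be sketched as below. The source labels in PRECEDENCE, the function name merge_horse, and the shape of the candidates dict are assumptions for illustration; ranking WBFSH ahead of progeny within the last tier is also a guess, since the text lists them together.

```python
# Source labels are hypothetical stand-ins for HT > prod > Hippo > WBFSH / progeny.
PRECEDENCE = ["ht", "prod", "hippo", "wbfsh", "progeny"]


def merge_horse(candidates: dict[str, dict], fields: list[str]) -> dict:
    """Build one consolidated row from per-source partial records.

    candidates maps a source label to the fields that source recovered.
    For each field, the first source in PRECEDENCE with a non-None value wins.
    """
    row: dict = {"data_sources": []}
    for field in fields:
        row[field] = None
        for source in PRECEDENCE:
            value = candidates.get(source, {}).get(field)
            if value is None:
                continue
            row[field] = value
            # Track every source that contributed at least one field.
            if source not in row["data_sources"]:
                row["data_sources"].append(source)
            if field == "jumpscore":
                row["jumpscore_source"] = source
            break
    return row
```

Resolving precedence per field rather than per record lets a lower-ranked source fill a gap (for example a jumpscore) without overriding fields that a higher-ranked source already provides.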
enrich_api.py + Next.js frontend
FastAPI loads consolidated JSON; dashboard shows coverage, lineage, React Flow pipelines, Recharts ML lab.
Visual enrichment
Beyond raw columns, the UI adds badges for data_sources, 15-hue theme tokens (nav accents, spectrum hairline, mesh background), a horse detail lineage card, methodology and pipeline timelines, and Insights charts (verdict pie, SHAP bars, model R²). React Flow graphs explain the ingest and ML paths.
Next visuals we could ship
- Heatmap: studbook × median sale_price (needs aggregation endpoint).
- Scatter: jumpscore vs log(price) with source hue (client-side from export or future /analytics).
- Geo: FEI country choropleth when fei_country coverage is high enough.
- Pedigree mini-graph for a horse (sire/dam/damsire nodes) with link-outs.
- Sparklines when longitudinal price or height series exist.
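The first two ideas in the list reduce to small aggregations over the consolidated rows. The sketch below assumes each row is a dict with studbook, sale_price, jumpscore, and jumpscore_source keys; the function names are hypothetical and no such endpoint is confirmed to exist yet.

```python
import math
from collections import defaultdict
from statistics import median


def median_price_by(rows: list[dict], key: str = "studbook") -> dict:
    """Median sale_price per group, skipping rows without a price or group."""
    groups = defaultdict(list)
    for row in rows:
        price = row.get("sale_price")
        group = row.get(key)
        if price is not None and group:
            groups[group].append(price)
    return {group: median(prices) for group, prices in groups.items()}


def scatter_points(rows: list[dict]) -> list[dict]:
    """Points for jumpscore vs log(price), coloured by jumpscore_source."""
    return [
        {
            "x": row["jumpscore"],
            "y": math.log(row["sale_price"]),  # natural log; price must be > 0
            "hue": row.get("jumpscore_source"),
        }
        for row in rows
        if row.get("jumpscore") is not None and (row.get("sale_price") or 0) > 0
    ]
```

A second heatmap axis (for example sale year) could be added by grouping on a tuple key instead of a single field.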
See also Methodology, Insights, and horse Search for lineage cards.
Data opportunities
Sport vs commerce gap
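Horses with a sale_price but no jumpscore or FEI record expose the ‘paper vs arena’ gap: they have sold commercially without a documented sport record, which makes them good candidates for buyer-risk flags.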
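A flag for this cohort could look like the sketch below. Treating fei_country and max_height as the markers of an FEI record is an assumption; the consolidated rows may expose FEI presence under a different key.

```python
def is_paper_only(horse: dict, fei_keys: tuple = ("fei_country", "max_height")) -> bool:
    """True when a horse has a sale price but no sport evidence.

    Sport evidence here means a jumpscore or any non-empty FEI field
    (the FEI key names are assumed, not taken from the real schema).
    """
    has_price = horse.get("sale_price") is not None
    has_sport = horse.get("jumpscore") is not None or any(
        horse.get(key) is not None for key in fei_keys
    )
    return has_price and not has_sport


def paper_only_cohort(rows: list[dict]) -> list[dict]:
    """Rows that qualify for a buyer-risk flag under the rule above."""
    return [row for row in rows if is_paper_only(row)]
```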
Triple-system horses
Use source_overlaps (HT+FEI+Hippomundo) as a gold cohort for benchmarking model confidence.
Jumpscore provenance mix
Stratify models by jumpscore_source (HT vs progeny vs FEI-derived) to measure proxy noise.
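One simple way to quantify proxy noise is to compare prediction error across the strata. The sketch below assumes each row carries a hypothetical predicted_price next to sale_price; mean absolute error is used as an illustrative metric, not necessarily the one the models report.

```python
from collections import defaultdict
from statistics import mean


def mae_by_source(rows: list[dict]) -> dict:
    """Mean absolute price error per jumpscore_source stratum.

    Rows missing either price are skipped; rows without a
    jumpscore_source are grouped under the label "none".
    """
    errors = defaultdict(list)
    for row in rows:
        actual = row.get("sale_price")
        predicted = row.get("predicted_price")  # hypothetical field name
        if actual is None or predicted is None:
            continue
        stratum = row.get("jumpscore_source") or "none"
        errors[stratum].append(abs(actual - predicted))
    return {stratum: mean(errs) for stratum, errs in errors.items()}
```

If error is markedly higher for progeny-derived or FEI-derived jumpscores than for HT ones, that difference is a rough estimate of how much noise the proxy introduces.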
Sire × damsire pricing
Cross-tab mean/median price — BLUP-style features already feed price_predictor; expose as a public chart later.
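The cross-tab itself is a grouped aggregation over (sire, damsire) pairs. The sketch below assumes rows with sire, damsire, and sale_price keys; the min_count threshold is an added assumption to keep single-sale pairs from dominating a chart, and it is unrelated to how price_predictor builds its features.

```python
from collections import defaultdict
from statistics import mean, median


def sire_damsire_crosstab(rows: list[dict], min_count: int = 2) -> dict:
    """Mean and median sale_price per (sire, damsire) pair.

    Pairs with fewer than min_count priced sales are dropped.
    """
    cells = defaultdict(list)
    for row in rows:
        sire, damsire = row.get("sire"), row.get("damsire")
        price = row.get("sale_price")
        if sire and damsire and price is not None:
            cells[(sire, damsire)].append(price)
    return {
        pair: {"n": len(prices), "mean": mean(prices), "median": median(prices)}
        for pair, prices in cells.items()
        if len(prices) >= min_count
    }
```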
Registry + height story
FEI max_height vs studbook: who jumps above their paperwork?
Embryo vs foal listings
Filter names/metadata for embryo lots; compare predicted bands from embryo_predictions.json when mounted.