Methodology & data contract

Sale horses from production scrapers are merged with HorseTelex, FEI, Hippomundo, WBFSH, and progeny-cache signals. Field values follow a strict precedence order per build_comprehensive.py.

32,964 horses · 1,392 w/ jumpscore · 14 triple-match

Interactive pipeline

Pan, zoom, and use the minimap. Each source is a real subsystem in this repo; the merge node reflects build_comprehensive.py precedence, not a black-box ETL.

Live corpus: 32,964 horses · ~14% HorseTelex · ~4% FEI ID.

Pan and zoom the canvas. Orange edges show live ingestion into the merge step; the minimap helps on smaller screens.

Source lenses

Every source answers a different question: commercial fact, breeding database, sport outcomes, or ranking priors.

production

Base listing: identity, sale price, source auction, initial pedigree.

horsetelex

Primary sport-breeding match: pedigree refinement, jumpscore, HT links.

fei

Competition records: FEI ID, heights, counts, computed jumpscore when applicable.

hippomundo

Results pages: Hippomundo ID, optional sire/dam, results presence.

wbfsh

Stallion rankings and breeder; fallback for missing pedigree/registry fields.

progeny_cache

Sire offspring index: damsire recovery, jumpscore via progeny list.

Field-level precedence

The first trustworthy, non-empty value wins. This keeps a single row per horse without hiding conflicts: check data_sources on each record to see which systems contributed.

sire

1. horsetelex · 2. production · 3. hippomundo · 4. wbfsh

dam

1. horsetelex · 2. production · 3. hippomundo

damsire

1. horsetelex · 2. production · 3. hippomundo · 4. progeny_cache · 5. wbfsh

birth year

1. horsetelex · 2. hippomundo · 3. production · 4. fei

sex

1. fei · 2. production

color

1. fei · 2. production

studbook

1. horsetelex · 2. fei · 3. production

ueln

1. horsetelex · 2. production

fei id

1. horsetelex · 2. fei

jumpscore

1. horsetelex · 2. progeny_cache · 3. fei_competition
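
The precedence lists above can be read as a lookup table walked in order. The following is a minimal sketch of that idea, not the code in build_comprehensive.py: the function names, the input shape (source name mapped to raw fields), and the field_origin key are assumptions made for illustration, and the "trustworthy" check is reduced to a simple non-empty test.

```python
from typing import Any

# Precedence copied from the field list above (subset shown for brevity).
PRECEDENCE: dict[str, list[str]] = {
    "sire": ["horsetelex", "production", "hippomundo", "wbfsh"],
    "dam": ["horsetelex", "production", "hippomundo"],
    "birth_year": ["horsetelex", "hippomundo", "production", "fei"],
    "sex": ["fei", "production"],
}


def is_empty(value: Any) -> bool:
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def resolve_fields(candidates: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Pick the first non-empty value per field, in precedence order.

    `candidates` maps a source name to that source's raw fields
    (hypothetical shape, assumed for this sketch).
    """
    merged: dict[str, Any] = {}
    field_origin: dict[str, str] = {}
    for field, sources in PRECEDENCE.items():
        for source in sources:
            value = candidates.get(source, {}).get(field)
            if not is_empty(value):
                merged[field] = value
                field_origin[field] = source
                break
    # Keep every contributing source so conflicts stay auditable.
    merged["data_sources"] = sorted(candidates)
    merged["field_origin"] = field_origin  # hypothetical helper key
    return merged


row = resolve_fields({
    "production": {"sire": "Sire A", "sex": "mare", "birth_year": 2015},
    "horsetelex": {"sire": "", "dam": "Dam B", "birth_year": 2016},
})
print(row["sire"], row["birth_year"], row["data_sources"])
```

Here the blank HorseTelex sire falls through to the production value, while the HorseTelex birth year overrides the production one because it ranks higher for that field.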

Jumpscore provenance

The score is unified, but the origin is explicit — essential for audits and for models that need source-aware features later.

horsetelex

Score taken from matched HorseTelex record.

progeny_cache

Score inferred from sire progeny listing.

fei_competition

Score computed from FEI competition history.

Scraping & automation posture

High-level only — rate limits, Playwright lifecycles, and cache files live in Python modules; nothing here exposes infra secrets.

  • Hippomundo: browser automation loads sport results pages, paginates, extracts structured results.
  • Progeny: HorseTelex sire pages fetched with pagination to build offspring cache.
  • Rate limits and backoff are enforced; no credentials belong in this API.
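
As a rough illustration of the pagination-plus-backoff posture described in the bullets above, the sketch below walks numbered pages, pauses between requests, and doubles the wait after each failure. It is an assumption-laden outline, not the repo's scraper: fetch_page is a hypothetical callable standing in for whatever loads one page of results, and the default intervals and retry counts are placeholders.

```python
import time
from typing import Callable, Iterator


def fetch_all_pages(
    fetch_page: Callable[[int], list[dict]],  # hypothetical page loader
    max_pages: int = 50,
    min_interval: float = 1.0,
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield rows page by page, with a fixed pause and exponential backoff."""
    for page in range(1, max_pages + 1):
        delay = min_interval
        for attempt in range(max_retries + 1):
            try:
                rows = fetch_page(page)
                break
            except Exception:  # a real scraper should catch narrower errors
                if attempt == max_retries:
                    raise
                sleep(delay)
                delay *= 2  # back off: wait twice as long before the next try
        if not rows:
            return  # an empty page is treated as the end of pagination
        yield from rows
        sleep(min_interval)  # polite pause between successive pages
```

Passing sleep as a parameter keeps the function testable without real waiting; in production the default time.sleep applies.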

Interpretation playbook

Market

Use sale_price vs price_estimate to spot listing premia; always segment by source (currency normalization already lives in ML code).
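
One way to look at listing premia per segment is the median relative gap between sale_price and price_estimate. The sketch below assumes prices are already currency-normalized and that each record carries a `source` key naming its auction; that key name and the function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def premium_by_source(rows: list[dict]) -> dict[str, float]:
    """Median of (sale_price / price_estimate - 1), grouped by source."""
    ratios: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        price = row.get("sale_price")
        estimate = row.get("price_estimate")
        if not price or not estimate:
            continue  # skip rows missing either figure (or with zero values)
        # "source" is an assumed key name for the originating auction.
        ratios[row.get("source", "unknown")].append(price / estimate - 1.0)
    return {source: median(values) for source, values in ratios.items()}
```

A positive value means horses from that source tend to sell above the estimate; the median is used here so a few extreme lots do not dominate a segment.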

Sport

Treat jumpscore_source as a reliability knob: direct HT > progeny inference > FEI-derived.
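
The ordering above can be encoded as a rank so that downstream code sorts or filters by provenance. This is a small sketch under the assumption that each record stores its origin in jumpscore_source using the three labels listed under Jumpscore provenance; only the ordering comes from the text, and no numeric reliability weights are implied.

```python
# Lower rank = more reliable, following the ordering stated above.
RELIABILITY_RANK = {"horsetelex": 0, "progeny_cache": 1, "fei_competition": 2}


def reliability_rank(record: dict) -> int:
    """Rank a record by jumpscore_source; unknown or missing sources sort last."""
    return RELIABILITY_RANK.get(record.get("jumpscore_source"), len(RELIABILITY_RANK))


def most_reliable_first(records: list[dict]) -> list[dict]:
    """Return records ordered from most to least reliable jumpscore origin."""
    return sorted(records, key=reliability_rank)
```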

Quality

Dashboard Coverage and source_overlaps tell you how often multi-system agreement exists — the backbone for future graph features.
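
For readers who want to reproduce an overlap figure from raw records, the sketch below counts horses per combination of contributing sources and reports the share backed by two or more systems. It assumes data_sources is a list of source names on each record; the dashboard's actual source_overlaps structure may differ, and the function names are illustrative.

```python
from collections import Counter


def source_overlaps(records: list[dict]) -> Counter:
    """Count records per distinct combination of contributing sources."""
    combos: Counter = Counter()
    for record in records:
        sources = tuple(sorted(set(record.get("data_sources", []))))
        combos[sources] += 1
    return combos


def multi_source_share(records: list[dict]) -> float:
    """Fraction of records confirmed by at least two sources."""
    if not records:
        return 0.0
    multi = sum(1 for r in records if len(set(r.get("data_sources", []))) >= 2)
    return multi / len(records)
```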