How Predictive Is Our GAR Model? A Comparison Across Three Metrics

2026-05-09 · methodology · GAR · RAPM · validation

Benchmarking Hockey Alchemy's GAR against Evolving Hockey and HockeyStats / TopDownHockey across standings R², year-over-year repeatability, and player-rank agreement. Across 16 seasons we lead on standings prediction and YoY stability — and intentionally diverge on individual rankings.

Hockey analytics sites publish a lot of player-value numbers — GAR, WAR, SPAR, xGAR, RAPM. Few publish how those numbers perform on the metrics that matter for evaluating a player-value model. This post documents how Hockey Alchemy's GAR compares to two of the most-cited public alternatives, Evolving Hockey (EH) and HockeyStats / TopDownHockey (HS / TDH).

We measured three things: standings R², year-over-year repeatability, and player-rank agreement. Across 16 seasons of NHL data we beat both alternatives on standings prediction and year-over-year repeatability, and we deliberately disagree with them on individual-player rankings — because our methodology preserves team context that theirs strip out by design.

Three metrics that matter

Standings R²

Sum each team's roster GAR, regress the team's actual points total on that sum, and take the R² of the fit. If the model captures real on-ice value, the sum of individual contributions should predict where a team finishes. The Pythagorean ceiling for team GAR vs standings is ~0.90 (set by goal differential vs standings), so anything in the 0.80s is excellent.
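As a rough illustration of the calculation described above, here is a minimal single-season sketch in plain Python. The function name `standings_r2` and the input layout (a list of `(team, gar)` pairs plus a team-to-points dict) are assumptions made for this example, not the production pipeline.

```python
from collections import defaultdict


def standings_r2(player_rows, team_points):
    """R² between summed roster GAR and standings points for one season.

    player_rows: iterable of (team, gar) pairs (hypothetical layout).
    team_points: dict mapping team -> standings points.
    """
    # Sum every player's GAR into a team total.
    team_gar = defaultdict(float)
    for team, gar in player_rows:
        team_gar[team] += gar

    teams = sorted(team_points)
    xs = [team_gar[t] for t in teams]
    ys = [team_points[t] for t in teams]

    n = len(teams)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("need variation in both team GAR and points")

    # For a one-predictor linear fit, R² equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)
```

Running this once per season and averaging the results would give a multi-season mean of the kind reported in the results.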

Year-over-year repeatability

Take the same set of players in two consecutive seasons (with a minimum games threshold to filter out small-sample noise) and compute the Spearman rank correlation of their GAR values. Higher = more repeatable. A model dominated by noise will have low YoY repeatability; a model that captures durable on-ice skill will have high repeatability.
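The repeatability check can be sketched as follows. This is an illustrative version only: the names `yoy_spearman`, `average_ranks`, and `pearson`, the `player -> (games_played, gar)` layout, and the `min_gp` parameter are assumptions for the example, and the actual games threshold is not stated here.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def yoy_spearman(season_a, season_b, min_gp):
    """Spearman correlation of GAR for players qualifying in both seasons.

    season_a, season_b: dict player -> (games_played, gar) (hypothetical layout).
    min_gp: minimum games played required in each season.
    """
    common = sorted(
        p for p in season_a
        if p in season_b
        and season_a[p][0] >= min_gp
        and season_b[p][0] >= min_gp
    )
    gar_a = [season_a[p][1] for p in common]
    gar_b = [season_b[p][1] for p in common]
    # Spearman is the Pearson correlation of the rank-transformed values.
    return pearson(average_ranks(gar_a), average_ranks(gar_b))
```

Filtering on games played in both seasons keeps a player with a handful of games in one year from dominating the comparison.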

Player-rank agreement

Spearman rank correlation of our top-N player list against another model's top-N. Higher = more aligned. This one is a stylistic gauge, not a quality goal: a Spearman of 1.0 against another model would mean we've reproduced their methodology, which isn't the point.
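Two top-N lists rarely contain exactly the same players, so any implementation has to decide how to handle the mismatch. The sketch below assumes one possible convention: keep only players who appear in both top-N lists and re-rank them within that overlap. The function name `top_n_rank_agreement` and the `player -> value` dicts are hypothetical.

```python
def top_n_rank_agreement(ours, theirs, n):
    """Spearman correlation over players present in both models' top-N.

    ours, theirs: dict player -> model value (hypothetical layout).
    Assumption: only the overlap of the two top-N lists is compared,
    and players are re-ranked within that overlap.
    """
    def top_n(model):
        return sorted(model, key=model.get, reverse=True)[:n]

    ours_top = top_n(ours)
    theirs_top = top_n(theirs)
    theirs_set = set(theirs_top)

    shared = [p for p in ours_top if p in theirs_set]  # in our rank order
    m = len(shared)
    if m < 2:
        return None  # not enough overlap to correlate

    our_rank = {p: i for i, p in enumerate(shared)}
    their_order = [p for p in theirs_top if p in our_rank]
    their_rank = {p: i for i, p in enumerate(their_order)}

    # With no tied ranks, Spearman reduces to 1 - 6*sum(d^2) / (m*(m^2 - 1)).
    d2 = sum((our_rank[p] - their_rank[p]) ** 2 for p in shared)
    return 1 - 6 * d2 / (m * (m * m - 1))
```

Because the ranks come from sort position, tied values are broken arbitrarily; a version that must treat ties carefully would need average ranks instead.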

Results

Across 16 seasons (2010-11 through 2025-26):

Counting-stats (CS) GAR standings R² mean: 0.760. Six of 16 seasons exceed 0.82, the level claimed by HS / TDH.
RAPM Total GAR YoY Spearman: ~0.80 averaged over recent year pairs. Evolving Hockey YoY: ~0.41. HockeyStats: ~0.55.
Player-rank Spearman vs EH: 0.65-0.68. That is lower than the agreement between EH and HS (~0.77), reflecting deliberate methodological differences.

Why standings R² is the gauge of model quality

A player-value model is a hypothesis: each player contributes some amount of value to their team, and the sum of those contributions should explain where the team finishes. Sum every player's GAR by team, regress actual points on that sum, and the R² tells you how well the model captures real on-ice value. Year-over-year repeatability checks a complementary property: does the model isolate durable skill or chase noise?

We optimize for both. Player-rank agreement with EH or HS isn't a target, because agreement just means we've built a similar methodology. Divergence is acceptable, even desirable, where our calibration choices are better grounded in the data.

How we get there: RAPM + position-split components

The GAR computation has two main paths. The counting-stats (CS) path estimates each player's contribution from goals, assists, on-ice xG, special-teams play, and penalties, with position-specific calibrations and replacement-level baselines drawn from the league's bottom quintile. The RAPM path runs ridge regression on per-shift presence to isolate individual contributions from teammate and opponent effects, then layers on position priors and team-residual adjustments.
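To make the ridge-regression step concrete, here is a toy sketch of the core idea only; the position priors and team-residual adjustments are not shown. It assumes a common RAPM design in which each shift is one row, skaters on the measured team get +1, opponents get -1, the target is the shift's goal (or xG) differential rate, and rows are weighted by shift length. The names `ridge_rapm` and `solve_linear`, the input layout, and the penalty value `lam` are all hypothetical.

```python
def solve_linear(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_rapm(shifts, n_players, lam):
    """Weighted ridge fit of shift outcomes on player presence.

    shifts: list of (on_ice, y, weight), where on_ice is a list of
            (player_index, sign) with sign +1 for the measured team and
            -1 for opponents, y is the shift's differential rate, and
            weight is the shift length (assumed layout).
    lam: ridge penalty; must be > 0 so the system is always solvable.
    """
    # Build the normal equations (XᵀWX + λI) β = XᵀWy row by row.
    xtx = [[0.0] * n_players for _ in range(n_players)]
    xty = [0.0] * n_players
    for on_ice, y, w in shifts:
        for i, si in on_ice:
            xty[i] += w * si * y
            for j, sj in on_ice:
                xtx[i][j] += w * si * sj
    for i in range(n_players):
        xtx[i][i] += lam
    return solve_linear(xtx, xty)
```

The penalty `lam` shrinks every coefficient toward zero, which is what keeps players who always share the ice from receiving wildly offsetting estimates. A dense solver like this is only practical for toy inputs; a full league season would call for a sparse linear-algebra library.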

The two paths converge at a final per-component reweighting that calibrates total league GAR against actual goal differentials. Every component has a position multiplier (offense weighted differently for defensemen than forwards, etc.) and every component is bounded by an adaptive replacement-level threshold set per season.
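The exact form of the replacement-level and multiplier steps is not spelled out here, so the following is only one plausible reading, offered as a sketch. It assumes the replacement baseline for a component is the mean of the bottom quintile within each position group, and that the position multiplier scales the value above that baseline. The function names, the input layout, and the multiplier values in the usage note are all hypothetical.

```python
def replacement_level(values):
    """Mean of the bottom quintile of values (assumed definition)."""
    ordered = sorted(values)
    k = max(1, len(ordered) // 5)  # at least one player in the baseline
    return sum(ordered[:k]) / k


def component_above_replacement(players, multipliers):
    """Scale each player's component value relative to a positional baseline.

    players: list of (name, position, raw_value) for one season and one
             component (hypothetical layout).
    multipliers: dict position -> multiplier (hypothetical values).
    """
    # Group raw values by position so each position gets its own baseline.
    by_position = {}
    for _, pos, value in players:
        by_position.setdefault(pos, []).append(value)
    baseline = {pos: replacement_level(vs) for pos, vs in by_position.items()}

    return {
        name: multipliers[pos] * (value - baseline[pos])
        for name, pos, value in players
    }
```

For example, calling `component_above_replacement(rows, {"F": 1.0, "D": 0.5})` would halve defensemen's contribution on that component relative to forwards. Recomputing the baseline from each season's own data is what would make the threshold adaptive per season.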

What this means for individual rankings

Our methodology handles team context differently from the alternatives. EH's WAR uses an aggressive Stage 3 team adjustment that re-injects team-level xGA/60. This is great for standings R², but it means players on weak teams get systematically penalized regardless of individual play. Our RAPM-overlay approach keeps the team adjustment lighter and lets the regression isolate individual skill more aggressively.

The practical consequence: players on bad teams who post strong RAPM (Cale Makar through COL's down years, for instance) tend to rank higher in our system than in EH's. That's a feature, not a miscalibration — it's the team-effect adjustment doing what it's supposed to do.

Where you can see this

The current GAR leaders and WAR leaders pages are where the player-level rankings appear. The team standings and power rankings pages are where the team-aggregate side shows up, and where you can spot-check the standings R² we report above against current-season fits.