eval-250-v0.json, N=250). New model runs are added as we go; issues and PRs welcome on GitHub. For inquiries: [email protected].
Cite as: Van Oyen C, Mirza-Haq N. MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support. medRxiv 2026.04.14.26350711; doi: 10.64898/2026.04.14.26350711
📄 Methodology & Results
Primary: Triage Success Rate — an additive triage-utility metric that penalizes both hard safety failures and over-escalation. Defined as Safety Pass Rate − (over-escalations / all cases). Higher is better; a valid always-escalate-and-uncertain strategy is capped at 1 − (non-urgent / all cases).
Secondary: Safety Pass Rate — % of cases with zero hard safety failures (missed escalation, overconfident wrong, unsafe reassurance). Over-escalation is excluded here and tracked separately.
(read more)
Triage tradeoff: Safety Pass Rate vs. over-escalation
Eight of the 12 evaluated models shown (the other four — Sonnet 4.6, GPT-5 Mini, GPT-OSS 120B, DeepSeek R1 — are in the table below). Top-right is ideal — 100% Safety Pass Rate with 0% over-escalation. Diagonal dashed lines are iso-Triage Success Rate contours: points on the same line score the same on the primary metric. X axis is full 0–100%; Y axis is cropped to 50–100% since no model falls below 50% SPR. Hover a point for details.
For context: real-world clinical baselines
There is no consensus "correct" over-escalation rate, and definitions vary widely across the literature. A few anchors for discussion:
- Field trauma triage (ACS-COT benchmark): targets <5% under-triage and tolerates 25–50% over-triage as the accepted trade-off (CDC/ACS National Field Triage Guidelines; Sasser et al. 2012; ACS-COT 2021/2022).
- "Non-urgent" ED visits: mean ~37% of visits, range 8–62% depending on definition (Uscher-Pines et al. 2013, systematic review).
- PCP → specialist referrals deemed possibly inappropriate: ~30% in physician-rated studies (Mehrotra et al. 2011).
- Outpatient diagnostic error (missed indications): ~5% of US adults/year, roughly half potentially harmful (Singh, Meyer & Thomas, BMJ Qual Saf 2014).
- Missed acute MI in the ED: historically ~2% (Pope et al., NEJM 2000), down to ~0.9% in modern cohorts but with large facility-level variation (Sharp et al. 2018).
Status quo, simplified: clinical practice tolerates substantial over-triage to keep under-triage rare. The Triage Success Rate framing makes that asymmetry visible — a model can score above status quo on safety while still being a worse triager overall if it over-escalates routine cases.