🏥 MedSafe-Dx Leaderboard

Safety-First Clinical Diagnostic Decision Support Evaluation

📄 Methodology & Results 📝 Preprint 💻 GitHub 🏠 README
MedSafe-Dx v0 is a safety-focused benchmark from Cortico Health Technologies for evaluating LLMs in clinical diagnostic decision support. Results on this leaderboard are reproducible from a frozen seed (eval-250-v0.json, N=250). New model runs are added as we go; issues and PRs are welcome on GitHub. For inquiries: [email protected].

Cite as: Van Oyen C, Mirza-Haq N. MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support. medRxiv 2026.04.14.26350711; doi: 10.64898/2026.04.14.26350711


Primary metric: Safety Pass Rate, the percentage of cases with zero safety failures (higher is better).
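As a minimal sketch of how this metric could be computed, assume each evaluated case has already been reduced to a count of safety failures. The function name and input shape below are illustrative assumptions, not the benchmark's actual harness:

```python
from typing import Sequence


def safety_pass_rate(failures_per_case: Sequence[int]) -> float:
    """Return the percentage of cases with zero safety failures.

    `failures_per_case` is a hypothetical input: one integer per case,
    giving the number of safety failures recorded for that case.
    """
    if not failures_per_case:
        raise ValueError("at least one case is required")
    # A case passes only if it has no safety failures at all.
    passed = sum(1 for n in failures_per_case if n == 0)
    return 100.0 * passed / len(failures_per_case)


# Made-up example: 4 cases, 2 of which have no safety failures.
print(safety_pass_rate([0, 0, 1, 2]))  # 50.0
```

Under this definition a single safety failure is enough to fail a case, so the rate cannot be raised by doing well on other aspects of the same case.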

Safety vs. over-escalation tradeoff

Each model is one point. Top-right is ideal: a high Safety Pass Rate (catches urgent cases, avoids unsafe reassurance) with low over-escalation (avoids alarm fatigue on routine cases). The shaded zone marks the sweet spot: Safety Pass Rate ≥ 90% and over-escalation ≤ 60%.
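The shaded-zone test can be sketched as a simple threshold check. The thresholds come from the description above; the function and constant names are hypothetical, and both rates are assumed to be expressed as percentages:

```python
# Thresholds taken from the shaded-zone definition (both in percent).
SAFETY_PASS_MIN = 90.0
OVER_ESCALATION_MAX = 60.0


def in_sweet_spot(safety_pass_rate: float, over_escalation_rate: float) -> bool:
    """Return True if a model's point falls inside the shaded zone.

    Both bounds are inclusive, matching the ">=" and "<=" in the text.
    """
    return (
        safety_pass_rate >= SAFETY_PASS_MIN
        and over_escalation_rate <= OVER_ESCALATION_MAX
    )


# Made-up values for illustration only, not leaderboard results.
print(in_sweet_spot(92.0, 45.0))  # True
print(in_sweet_spot(95.0, 72.0))  # False: over-escalation too high
```

Requiring both conditions at once reflects the tradeoff the plot is meant to show: a model cannot enter the zone by escalating every case, nor by reassuring every case.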