MedSafe-Dx Case Breakdown Analysis

This report breaks down model performance by escalation requirement, case severity, and symptom count.

Overall Performance by Model

| Model | N | Safety Pass | Missed Esc | Overconf Wrong | Unsafe Reassure | Top-1 | Top-3 |
|---|---|---|---|---|---|---|---|
| anthropic-claude-haiku-4.5 | 250 | 95.6% | 4.4% | 0.0% | 0.0% | 39.2% | 58.4% |
| anthropic-claude-sonnet-4.5 | 250 | 87.2% | 7.6% | 2.8% | 3.2% | 58.4% | 77.2% |
| deepseek-deepseek-chat-v3-0324 | 250 | 85.2% | 7.2% | 4.0% | 4.0% | 40.8% | 60.0% |
| google-gemini-2.0-flash | 250 | 89.6% | 10.4% | 0.0% | 0.0% | 34.4% | 53.6% |
| google-gemini-3-pro-preview | 250 | 72.0% | 20.0% | 4.0% | 4.0% | 49.2% | 60.8% |
| openai-gpt-4.1 | 250 | 87.6% | 5.6% | 4.8% | 2.0% | 53.2% | 73.2% |
| openai-gpt-4o-mini | 250 | 92.4% | 6.0% | 0.4% | 1.2% | 28.8% | 47.6% |
| openai-gpt-5-chat | 250 | 94.0% | 3.2% | 2.4% | 0.4% | 53.2% | 73.2% |
| openai-gpt-5-mini | 250 | 90.0% | 10.0% | 0.0% | 0.0% | 55.2% | 68.4% |
| openai-gpt-5.2 | 250 | 97.6% | 2.0% | 0.4% | 0.0% | 52.4% | 68.4% |
| openai-gpt-oss-120b | 250 | 85.2% | 7.2% | 6.4% | 1.6% | 47.2% | 67.2% |

Performance by Escalation Requirement (All Models)

| Stratum | N | Safety Pass | Missed Esc | Top-1 | Top-3 |
|---|---|---|---|---|---|
| Requires Escalation | 1716 | 85.4% | 12.2% | 46.0% | 62.4% |
| No Escalation | 1034 | 94.3% | 0.0% | 47.5% | 67.6% |
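
The pooled rows above combine all 11 models, so their safety pass rates should agree with the per-model table once weighted by N. The sketch below is one way to check that, using only figures copied from this report. The function name `pooled_rate` is hypothetical, and because the inputs are rounded percentages the two results can only be expected to match to within rounding.

```python
# Safety pass rates (%) per model, copied from the overall table (N = 250 each).
per_model_pass = [95.6, 87.2, 85.2, 89.6, 72.0, 87.6, 92.4, 94.0, 90.0, 97.6, 85.2]

# (N, safety pass %) per stratum, copied from the escalation table.
strata = {
    "Requires Escalation": (1716, 85.4),
    "No Escalation": (1034, 94.3),
}


def pooled_rate(groups):
    """Return the N-weighted average rate across groups of (n, rate)."""
    total_n = sum(n for n, _ in groups.values())
    return sum(n * rate for n, rate in groups.values()) / total_n


pooled = pooled_rate(strata)
# Every model has the same N, so a plain mean equals the N-weighted mean.
mean_of_models = sum(per_model_pass) / len(per_model_pass)

print(f"pooled from strata: {pooled:.2f}%")
print(f"mean of per-model:  {mean_of_models:.2f}%")
```

Both values land near 88.8%, which suggests the stratum table and the per-model table describe the same 2750 evaluations.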

Performance by Case Severity (All Models)

Severity categories:

- Critical: Severity 1-2 (life-threatening conditions)
- Moderate: Severity 3 (serious conditions)
- Mild: Severity 4-5 (less severe conditions)
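
The bucketing above can be sketched as a small lookup. This assumes severity is stored as an integer from 1 to 5; the function name `severity_category` is hypothetical and not taken from the benchmark code.

```python
def severity_category(severity: int) -> str:
    """Map a 1-5 severity level to the category used in this report.

    Assumption: severity is an integer in 1..5, with 1 the most severe.
    """
    if severity in (1, 2):
        return "Critical"  # life-threatening conditions
    if severity == 3:
        return "Moderate"  # serious conditions
    if severity in (4, 5):
        return "Mild"      # less severe conditions
    raise ValueError(f"severity must be in 1..5, got {severity!r}")


# Example: bucket a handful of hypothetical severity labels.
print([severity_category(s) for s in (1, 3, 5)])
```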

| Severity | N | Safety Pass | Missed Esc | Overconf Wrong | Top-1 | Top-3 |
|---|---|---|---|---|---|---|
| Critical | 880 | 94.0% | 4.0% | 2.0% | 53.2% | 70.7% |
| Moderate | 836 | 89.7% | 8.7% | 1.4% | 48.3% | 65.1% |
| Mild | 990 | 83.0% | 10.2% | 3.2% | 39.7% | 57.6% |

Performance by Symptom Count (All Models)

| Symptom Count | N | Safety Pass | Overconf Wrong | Top-1 | Top-3 |
|---|---|---|---|---|---|
| Low | 935 | 90.5% | 1.2% | 54.4% | 73.5% |
| Medium | 990 | 84.3% | 3.7% | 34.0% | 50.1% |
| High | 825 | 92.1% | 1.8% | 52.6% | 71.2% |

Performance on Escalation-Required Cases (by Model)

| Model | N | Safety Pass | Missed Esc | Top-1 | Top-3 |
|---|---|---|---|---|---|
| anthropic-claude-haiku-4.5 | 156 | 92.9% | 7.1% | 38.5% | 60.3% |
| anthropic-claude-sonnet-4.5 | 156 | 85.3% | 12.2% | 59.6% | 75.0% |
| deepseek-deepseek-chat-v3-0324 | 156 | 84.6% | 11.5% | 34.0% | 50.6% |
| google-gemini-2.0-flash | 156 | 83.3% | 16.7% | 29.5% | 48.1% |
| google-gemini-3-pro-preview | 156 | 64.1% | 32.1% | 49.4% | 59.6% |
| openai-gpt-4.1 | 156 | 85.9% | 9.0% | 48.1% | 67.9% |
| openai-gpt-4o-mini | 156 | 89.7% | 9.6% | 24.4% | 41.7% |
| openai-gpt-5-chat | 156 | 92.3% | 5.1% | 51.3% | 69.2% |
| openai-gpt-5-mini | 156 | 84.0% | 16.0% | 57.7% | 70.5% |
| openai-gpt-5.2 | 156 | 96.2% | 3.2% | 59.6% | 70.5% |
| openai-gpt-oss-120b | 156 | 81.4% | 11.5% | 53.8% | 73.1% |

Performance on Critical Severity Cases (by Model)

| Model | N | Safety Pass | Missed Esc | Top-1 | Top-3 |
|---|---|---|---|---|---|
| anthropic-claude-haiku-4.5 | 80 | 100.0% | 0.0% | 35.0% | 66.2% |
| anthropic-claude-sonnet-4.5 | 80 | 93.8% | 2.5% | 67.5% | 85.0% |
| deepseek-deepseek-chat-v3-0324 | 80 | 95.0% | 5.0% | 37.5% | 58.8% |
| google-gemini-2.0-flash | 80 | 92.5% | 7.5% | 32.5% | 56.2% |
| google-gemini-3-pro-preview | 80 | 85.0% | 15.0% | 65.0% | 78.8% |
| openai-gpt-4.1 | 80 | 92.5% | 1.2% | 53.8% | 72.5% |
| openai-gpt-4o-mini | 80 | 96.2% | 2.5% | 23.8% | 41.2% |
| openai-gpt-5-chat | 80 | 97.5% | 0.0% | 65.0% | 76.2% |
| openai-gpt-5-mini | 80 | 92.5% | 7.5% | 70.0% | 80.0% |
| openai-gpt-5.2 | 80 | 98.8% | 0.0% | 71.2% | 82.5% |
| openai-gpt-oss-120b | 80 | 90.0% | 2.5% | 63.7% | 80.0% |

Statistical Tests

Chi-square tests for significant differences in safety pass rate:
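
A minimal sketch of one such test, assuming a 2x2 comparison of safety pass rate between the two escalation strata. The counts are reconstructed from the rounded percentages in the escalation table, so they are approximate. The function name `chi_square_2x2` is hypothetical, no continuity correction is applied, and this is not necessarily the test the report itself ran.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no Yates correction).

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability of a chi-square variable x equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Approximate pass/fail counts rebuilt from N and the rounded pass rates.
esc_n, no_esc_n = 1716, 1034
esc_pass = round(esc_n * 0.854)        # Requires Escalation, 85.4% pass
no_esc_pass = round(no_esc_n * 0.943)  # No Escalation, 94.3% pass

stat, p_value = chi_square_2x2(
    esc_pass, esc_n - esc_pass,
    no_esc_pass, no_esc_n - no_esc_pass,
)
print(f"chi2 = {stat:.2f}, p = {p_value:.3g}")
```

With these reconstructed counts the statistic is far above the 3.84 critical value for one degree of freedom at the 0.05 level. Note that the pooled rows treat each model's answer to the same case as independent, which a more careful analysis would need to account for.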

Rendered from /app/results/analysis/case_breakdown_tables.md (mtime UTC: 2026-04-22T20:53:37.211961Z, bytes: 6274, sha256: 62120f2d94b9)