MedSafe-Dx Case Breakdown Analysis
This report analyzes model performance across different case types.
| Model |
N |
Safety Pass |
Missed Esc |
Overconf Wrong |
Unsafe Reassure |
Top-1 |
Top-3 |
| anthropic-claude-haiku-4.5 |
250 |
95.6% |
4.4% |
0.0% |
0.0% |
39.2% |
58.4% |
| anthropic-claude-sonnet-4.5 |
250 |
87.2% |
7.6% |
2.8% |
3.2% |
58.4% |
77.2% |
| deepseek-deepseek-chat-v3-0324 |
250 |
85.2% |
7.2% |
4.0% |
4.0% |
40.8% |
60.0% |
| google-gemini-2.0-flash |
250 |
89.6% |
10.4% |
0.0% |
0.0% |
34.4% |
53.6% |
| google-gemini-3-pro-preview |
250 |
72.0% |
20.0% |
4.0% |
4.0% |
49.2% |
60.8% |
| openai-gpt-4.1 |
250 |
87.6% |
5.6% |
4.8% |
2.0% |
53.2% |
73.2% |
| openai-gpt-4o-mini |
250 |
92.4% |
6.0% |
0.4% |
1.2% |
28.8% |
47.6% |
| openai-gpt-5-chat |
250 |
94.0% |
3.2% |
2.4% |
0.4% |
53.2% |
73.2% |
| openai-gpt-5-mini |
250 |
90.0% |
10.0% |
0.0% |
0.0% |
55.2% |
68.4% |
| openai-gpt-5.2 |
250 |
97.6% |
2.0% |
0.4% |
0.0% |
52.4% |
68.4% |
| openai-gpt-oss-120b |
250 |
85.2% |
7.2% |
6.4% |
1.6% |
47.2% |
67.2% |
| Stratum |
N |
Safety Pass |
Missed Esc |
Top-1 |
Top-3 |
| Requires Escalation |
1716 |
85.4% |
12.2% |
46.0% |
62.4% |
| No Escalation |
1034 |
94.3% |
0.0% |
47.5% |
67.6% |
Severity categories:
- Critical: Severity 1-2 (life-threatening conditions)
- Moderate: Severity 3 (serious conditions)
- Mild: Severity 4-5 (less severe conditions)
| Severity |
N |
Safety Pass |
Missed Esc |
Overconf Wrong |
Top-1 |
Top-3 |
| Critical |
880 |
94.0% |
4.0% |
2.0% |
53.2% |
70.7% |
| Moderate |
836 |
89.7% |
8.7% |
1.4% |
48.3% |
65.1% |
| Mild |
990 |
83.0% |
10.2% |
3.2% |
39.7% |
57.6% |
| Symptom Count |
N |
Safety Pass |
Overconf Wrong |
Top-1 |
Top-3 |
| Low |
935 |
90.5% |
1.2% |
54.4% |
73.5% |
| Medium |
990 |
84.3% |
3.7% |
34.0% |
50.1% |
| High |
825 |
92.1% |
1.8% |
52.6% |
71.2% |
| Model |
N |
Safety Pass |
Missed Esc |
Top-1 |
Top-3 |
| anthropic-claude-haiku-4.5 |
156 |
92.9% |
7.1% |
38.5% |
60.3% |
| anthropic-claude-sonnet-4.5 |
156 |
85.3% |
12.2% |
59.6% |
75.0% |
| deepseek-deepseek-chat-v3-0324 |
156 |
84.6% |
11.5% |
34.0% |
50.6% |
| google-gemini-2.0-flash |
156 |
83.3% |
16.7% |
29.5% |
48.1% |
| google-gemini-3-pro-preview |
156 |
64.1% |
32.1% |
49.4% |
59.6% |
| openai-gpt-4.1 |
156 |
85.9% |
9.0% |
48.1% |
67.9% |
| openai-gpt-4o-mini |
156 |
89.7% |
9.6% |
24.4% |
41.7% |
| openai-gpt-5-chat |
156 |
92.3% |
5.1% |
51.3% |
69.2% |
| openai-gpt-5-mini |
156 |
84.0% |
16.0% |
57.7% |
70.5% |
| openai-gpt-5.2 |
156 |
96.2% |
3.2% |
59.6% |
70.5% |
| openai-gpt-oss-120b |
156 |
81.4% |
11.5% |
53.8% |
73.1% |
| Model |
N |
Safety Pass |
Missed Esc |
Top-1 |
Top-3 |
| anthropic-claude-haiku-4.5 |
80 |
100.0% |
0.0% |
35.0% |
66.2% |
| anthropic-claude-sonnet-4.5 |
80 |
93.8% |
2.5% |
67.5% |
85.0% |
| deepseek-deepseek-chat-v3-0324 |
80 |
95.0% |
5.0% |
37.5% |
58.8% |
| google-gemini-2.0-flash |
80 |
92.5% |
7.5% |
32.5% |
56.2% |
| google-gemini-3-pro-preview |
80 |
85.0% |
15.0% |
65.0% |
78.8% |
| openai-gpt-4.1 |
80 |
92.5% |
1.2% |
53.8% |
72.5% |
| openai-gpt-4o-mini |
80 |
96.2% |
2.5% |
23.8% |
41.2% |
| openai-gpt-5-chat |
80 |
97.5% |
0.0% |
65.0% |
76.2% |
| openai-gpt-5-mini |
80 |
92.5% |
7.5% |
70.0% |
80.0% |
| openai-gpt-5.2 |
80 |
98.8% |
0.0% |
71.2% |
82.5% |
| openai-gpt-oss-120b |
80 |
90.0% |
2.5% |
63.7% |
80.0% |
Statistical Tests
Chi-square tests for significant differences in safety pass rate:
- anthropic-claude-haiku-4.5 - Escalation vs No-Escalation: χ²=5.36, p=0.0206 (significant)
- anthropic-claude-haiku-4.5 - Critical vs Mild Severity: χ²=3.74, p=0.0530 (not significant)
- anthropic-claude-sonnet-4.5 - Escalation vs No-Escalation: χ²=0.98, p=0.3224 (not significant)
- anthropic-claude-sonnet-4.5 - Critical vs Mild Severity: χ²=6.54, p=0.0106 (significant)
- deepseek-deepseek-chat-v3-0324 - Escalation vs No-Escalation: χ²=0.02, p=0.8796 (not significant)
- deepseek-deepseek-chat-v3-0324 - Critical vs Mild Severity: χ²=13.96, p=0.0002 (significant)
- google-gemini-2.0-flash - Escalation vs No-Escalation: χ²=15.74, p=0.0001 (significant)
- google-gemini-2.0-flash - Critical vs Mild Severity: χ²=0.97, p=0.3251 (not significant)
- google-gemini-3-pro-preview - Escalation vs No-Escalation: χ²=11.81, p=0.0006 (significant)
- google-gemini-3-pro-preview - Critical vs Mild Severity: χ²=13.83, p=0.0002 (significant)
- openai-gpt-4.1 - Escalation vs No-Escalation: χ²=0.73, p=0.3930 (not significant)
- openai-gpt-4.1 - Critical vs Mild Severity: χ²=6.00, p=0.0143 (significant)
- openai-gpt-4o-mini - Escalation vs No-Escalation: χ²=3.22, p=0.0726 (not significant)
- openai-gpt-4o-mini - Critical vs Mild Severity: χ²=1.66, p=0.1977 (not significant)
- openai-gpt-5-chat - Escalation vs No-Escalation: χ²=1.38, p=0.2394 (not significant)
- openai-gpt-5-chat - Critical vs Mild Severity: χ²=2.08, p=0.1497 (not significant)
- openai-gpt-5-mini - Escalation vs No-Escalation: χ²=15.00, p=0.0001 (significant)
- openai-gpt-5-mini - Critical vs Mild Severity: χ²=0.09, p=0.7621 (not significant)
- openai-gpt-5.2 - Escalation vs No-Escalation: χ²=2.24, p=0.1341 (not significant)
- openai-gpt-5.2 - Critical vs Mild Severity: χ²=0.00, p=1.0000 (not significant)
- openai-gpt-oss-120b - Escalation vs No-Escalation: χ²=3.96, p=0.0466 (significant)
- openai-gpt-oss-120b - Critical vs Mild Severity: χ²=3.75, p=0.0527 (not significant)