RANKINGS
LLM Leaderboard rankings. Compare large language models by GPQA, AIME 2025, SWE-bench, HLE, MMMLU, BrowseComp, and MMMU-Pro benchmark scores.
Showing 1–20 of 110
| # | Model | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | 82.5% | 91.3% | 99.8% | 80.8% | 91.1% | 84.0% | 53.1% | 77.3% |
| 2 | | 80.9% | 94.3% | — | 80.6% | 92.6% | 85.9% | 51.4% | 80.5% |
| 3 | | — | 88.4% | 100.0% | — | — | — | 50.7% | — |
| 4 | | 76.3% | 89.9% | — | 79.6% | 89.3% | 74.7% | 49.0% | 75.6% |
| 5 | | 81.1% | 91.9% | 100.0% | 76.2% | 91.8% | — | 45.8% | 81.0% |
| 6 | | 80.8% | 90.4% | 99.7% | 78.0% | 91.8% | — | 43.5% | 81.2% |
| 7 | | — | 87.5% | 91.7% | — | — | — | 40.0% | — |
| 8 | | — | 92.8% | — | — | — | 82.7% | 39.8% | 81.2% |
| 9 | Baidu ernie-5.0 | — | 85.0% | 87.0% | — | — | — | 39.0% | — |
| 10 | | — | 93.2% | 100.0% | — | — | 77.9% | 36.6% | — |
| 11 | | 77.4% | 92.4% | 100.0% | 80.0% | 89.6% | 65.8% | 34.5% | 79.5% |
| 12 | Alibaba Cloud / Qwen Team qwen3.6-plus | 73.3% | 90.4% | — | 78.8% | 89.5% | — | 28.8% | 78.8% |
| 13 | | — | 88.0% | — | — | — | — | 28.2% | 76.6% |
| 14 | | 68.9% | 85.7% | 94.6% | 74.9% | — | 54.9% | 24.8% | 78.4% |
| 15 | | — | 82.8% | — | — | — | — | 24.3% | 66.1% |
| 16 | | — | 86.4% | 88.0% | 67.2% | — | — | 21.6% | — |
| 17 | | — | 85.7% | 92.0% | — | — | 44.9% | 20.0% | — |
| 18 | | — | 83.0% | 83.0% | 63.2% | — | — | 17.8% | — |
| 19 | | — | 82.3% | 91.1% | — | — | — | 16.7% | — |
| 20 | | — | 86.9% | — | — | 88.9% | — | 16.0% | 76.8% |