DeepSeek reasoning model with a 128K-token context window. Uses chain-of-thought reasoning. Exceptional math (96/100) and reasoning (96/100) scores. Open-source competitor to OpenAI's o1.
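For illustration, a minimal sketch of surfacing the model's chain-of-thought, assuming the OpenAI-compatible DeepSeek endpoint, the `deepseek-reasoner` model name, and a `reasoning_content` response field; hosted variants may differ.

```python
# Minimal sketch (assumptions noted above): query DeepSeek R1 and read its
# chain-of-thought separately from the final answer.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

message = resp.choices[0].message
print("chain of thought:", message.reasoning_content)  # intermediate reasoning trace
print("final answer:", message.content)
```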
LMArena human preference ranking
Epoch Capabilities Index score
Normalized ECI score (0-100)
Maximum input token capacity
Total compute used during training
Code generation, understanding, and debugging
Mathematical reasoning and problem solving
Multi-step logical reasoning
Overall reasoning and task completion ability
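Taken together, the fields above can be modeled as one record per model. The sketch below is a hypothetical Python schema; the field names and their mapping to the descriptions are assumptions, not taken from this dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelRecord:
    """Hypothetical per-model record for the fields described above."""
    name: str
    lmarena_rank: Optional[int] = None             # LMArena human preference ranking
    eci_score: Optional[float] = None              # Epoch Capabilities Index score
    eci_normalized: Optional[float] = None         # Normalized ECI score (0-100)
    context_window_tokens: Optional[int] = None    # Maximum input token capacity
    training_compute_flop: Optional[float] = None  # Total compute used during training
    code: Optional[float] = None          # Code generation, understanding, and debugging
    math: Optional[float] = None          # Mathematical reasoning and problem solving
    reasoning: Optional[float] = None     # Multi-step logical reasoning
    intelligence: Optional[float] = None  # Overall reasoning and task completion ability
```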
| Dimension | Score (0-100) |
|---|---|
| Intelligence | 88.0 |
| Reasoning | 77.0 |
| Math | 94.0 |
| Code | 91.0 |
| Overall | 88 |
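How the Overall figure relates to the four dimension scores is not stated here; the snippet below is a minimal sketch assuming a simple rounded mean, which happens to reproduce 88 for these values but is only an assumption.

```python
# Assumption: Overall is the rounded mean of the four dimension scores.
scores = {"Intelligence": 88.0, "Reasoning": 77.0, "Math": 94.0, "Code": 91.0}

mean_score = sum(scores.values()) / len(scores)
print(f"mean = {mean_score:.1f}, rounded = {round(mean_score)}")  # mean = 87.5, rounded = 88
```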
Labs such as OpenAI and Anthropic claim that the scaling of RL-based reasoning training cannot be sustained beyond one to two years due to compute infrastructure limits, suggesting the exceptional 2024-2025 capability growth could slow.
DeepSeek V3 used 10x less training compute than Llama 3 through multi-head latent attention (MLA), mixture-of-experts (MoE) innovations, and multi-token prediction, demonstrating 3x yearly algorithmic efficiency gains.
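As a back-of-envelope check, a compute-reduction factor R achieved over t years corresponds to an annualized algorithmic efficiency gain of R^(1/t); the two-year window used below is an illustrative assumption, not a figure from the source.

```latex
% Annualized algorithmic efficiency gain g from a compute-reduction factor R over t years
\[
  g = R^{1/t}, \qquad
  R = 10,\ t = 2 \;\Rightarrow\; g = 10^{1/2} \approx 3.2\times \text{ per year.}
\]
```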
The best open models runnable on consumer GPUs lag frontier AI by only about a year across the GPQA, MMLU, and LMArena benchmarks, suggesting rapid democratization of capabilities, with regulatory implications.
xAI's Grok Code Fast 1 has surged to the #1 position on OpenRouter with 572.7B tokens processed weekly, more than 3x the second-place model. This dethroned mimo-v2-flash, which dropped from #1 (170.9B tokens) to #9 (77.6B), signaling a major shift toward specialized coding models.
DeepSeek R1 has emerged as the leading open-source model for mathematical reasoning, outperforming many closed-source alternatives on the MATH and GSM8K benchmarks.