Lech Mazur Writing

Writing quality and style evaluation

Frontier

Category: language

EDI: 108.4

Slope: 0.92

Leaderboard

(27 models)

Rank	Model	Score	Stderr
1	GPT-5.2	8.72	—
2	Qwen3-Max-Instruct	8.71	—
3	kimi-k2-thinking (official)	8.69	—
4	o3	8.63	—
5	Gemini 2.5 Pro (Jun 2025)	8.60	—
6	Claude Opus 4.5	8.54	—
7	DeepSeek V3	8.52	—
8	Qwen 3 235B	8.49	—
9	Qwen3-235B-A22B	8.30	—
10	DeepSeek R1	8.30	—
11	GPT-4o	8.18	—
12	Grok 4	8.11	—
13	Claude 3.7 Sonnet	8.11	—
14	QwQ-32B	8.07	—
15	Gemma 3 27B	7.99	—
16	GPT-OSS 120B	7.73	—
17	Grok-3 mini	7.64	—
18	GPT-4.1	7.56	—
19	o4-mini-2025-04-16 medium	7.50	—
20	Gemini 2.0 Flash Thinking Exp	7.38	—
21	Claude 3.5 Haiku	7.35	—
22	Qwen2.5-Max	7.29	—
23	o1	7.02	—
24	Mistral Large	6.90	—
25	gpt-4o-mini-2024-07-18	6.72	—
26	Llama-4-Maverick-17B-128E-Instruct	6.37	—
27	Phi-4	6.26	—

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0

Lech Mazur Writing: Top Score 8.72 - AI Benchmark | NeoSignal