Agent evaluation metrics are shifting from single-attempt success to probabilistic measures. Pass@k (success in at least one of k attempts) and pass^k (success in all k attempts) are becoming standard for measuring both the capability ceiling and the reliability floor.
Pass@k shows ceiling of what agents can do; pass^k shows reliability floor
Jan 9, 2026
SWE-bench Verified now reporting both pass@1 and pass@5 scores
Jan 9, 2026
Human-in-the-loop scoring with checkpoints captures partial progress
Jan 9, 2026
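To make the two definitions concrete, here is a minimal Python sketch of how pass@k and pass^k might be estimated for one task from n recorded attempts, of which c succeeded. The function names `pass_at_k` and `pass_hat_k` are hypothetical, and the combinatorial estimators (drawing k of the n attempts without replacement) are an assumption: they are one common way to compute these metrics, not something the items above specify.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = attempts recorded, c = attempts that succeeded, k = attempts drawn
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k drawn attempts succeeds."""
    _check(n, c, k)
    # 1 minus the chance that all k drawn attempts come from the n - c failures
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k drawn attempts succeed (pass^k)."""
    _check(n, c, k)
    # chance that all k drawn attempts come from the c successes
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 10 attempts on one task, 5 of them successful.
    n, c = 10, 5
    for k in (1, 5):
        print(k, pass_at_k(n, c, k), pass_hat_k(n, c, k))
```

With k = 1 both functions reduce to the plain success rate c / n. As k grows, the pass@k estimate rises toward 1 while the pass^k estimate falls toward 0, which is why the two are read as a ceiling and a floor, respectively.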