71,584 real thumbs-up / thumbs-down conversations from Tencent Yuanbao — the first
user-centric LLM benchmark grounded in genuine user satisfaction signals.
Last updated: 2026-05-21 · Task 1: N=2,000 · Task 2: N=4,900 · Task 3: N=500
| Rank | Model | Good-Acc ↑ | Bad-Acc ↑ | Avg-Acc ↑ |
|---|---|---|---|---|
Avg-Acc = (Good-Acc + Bad-Acc) / 2 — rewards balanced discrimination. Good-Acc: recall on 1,000 liked conversations. Bad-Acc: recall on 1,000 disliked conversations.
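The Task 1 metrics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official scoring code: the function name `judge_accuracies` and the `"good"` / `"bad"` label strings are assumptions made for the example.

```python
from typing import Dict, Sequence


def judge_accuracies(labels: Sequence[str], preds: Sequence[str]) -> Dict[str, float]:
    """Compute Good-Acc, Bad-Acc and Avg-Acc for a judge.

    labels: ground-truth user feedback, each "good" (liked) or "bad" (disliked).
    preds:  the judge's predicted feedback, same encoding (hypothetical format).
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")

    good_total = sum(1 for y in labels if y == "good")
    bad_total = sum(1 for y in labels if y == "bad")
    if good_total == 0 or bad_total == 0:
        raise ValueError("need at least one liked and one disliked conversation")

    # Recall on each class: correct predictions divided by class size.
    good_hits = sum(1 for y, p in zip(labels, preds) if y == "good" and p == "good")
    bad_hits = sum(1 for y, p in zip(labels, preds) if y == "bad" and p == "bad")

    good_acc = good_hits / good_total
    bad_acc = bad_hits / bad_total
    return {
        "good_acc": good_acc,
        "bad_acc": bad_acc,
        "avg_acc": (good_acc + bad_acc) / 2,
    }
```

A judge that labels every conversation "good" would score Good-Acc = 1.0 and Bad-Acc = 0.0, so its Avg-Acc is only 0.5. This is why averaging the two recalls rewards balanced discrimination rather than blanket approval.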
| Rank | Model | Good% ↑ |
|---|---|---|
Good%: fraction of conversations rated above threshold.
| Rank | Model | Recovery Rate ↑ |
|---|---|---|
Recovery Rate: fraction of improved responses rated as satisfying.
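Both Good% (Task 2) and Recovery Rate (Task 3) are simple proportions. The sketch below shows one way to compute them. The score scale, the `threshold` value, and the boolean "satisfying" verdicts are all assumptions for illustration; the page does not specify them.

```python
from typing import Sequence


def good_pct(scores: Sequence[float], threshold: float) -> float:
    """Fraction of conversations whose score is strictly above `threshold`.

    `scores` and `threshold` are hypothetical: the actual rating scale
    and cut-off used by the benchmark are not stated on this page.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s > threshold) / len(scores)


def recovery_rate(satisfying: Sequence[bool]) -> float:
    """Fraction of improved responses that were judged satisfying."""
    if not satisfying:
        raise ValueError("satisfying must not be empty")
    return sum(1 for ok in satisfying if ok) / len(satisfying)
```

For scale: if the rate were computed over all 500 Task 3 items, a Recovery Rate of 12.8% would correspond to 64 satisfying responses.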
UXBench is designed to answer three core research questions: (1) how well automated LLM judges predict real user feedback on an AI-generated response; (2) whether frontier LLMs can generate high-quality responses for failure-prone user queries; and (3) how model capability improvements translate into measurable UX gains.
Unlike benchmarks relying on expert annotation or synthetic prompts, UXBench is mined from 38,331 disliked and 33,253 liked real conversations — capturing what actual users find frustrating or satisfying.
Three complementary tasks: UX Judge (discriminative), UX Eval (GRM judge), and UX Recovery (generative fix) — covering the full quality-assurance cycle.
A human-validated taxonomy of 10 failure dimensions and 8 success patterns covering 8 interaction scenarios and 83 domains, enabling fine-grained capability diagnosis.
Insights from evaluating 26 frontier LLMs on real-user UX data
Every model shows Good-Acc ≫ Bad-Acc, with gaps of 27–88 pp. LLMs are biased toward validating responses — a critical flaw for QA pipelines.
LLM judges exhibit measurable in-group favoritism: same-family models score each other higher. Mixed-vendor ensembles produce more calibrated evaluations.
General-capability rankings show weak correlation (r=0.31) with UXBench rankings. Strong reasoning does not guarantee a good user experience.
Task 3 recovery rates top out at 12.8% (Claude Opus 4.6), revealing that generating genuinely improved responses remains an open challenge even for frontier models.
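The weak correlation between general-capability and UXBench rankings noted above can be checked with a short script. This is a sketch under assumptions: the page reports r=0.31 but does not say which coefficient was used, so the example implements the Pearson coefficient, which equals Spearman's rank correlation when applied to tie-free ranks. The rank lists below are made up for illustration and are not benchmark data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ranks of five models on a general benchmark and on UXBench.
general_rank = [1, 2, 3, 4, 5]
uxbench_rank = [2, 1, 4, 3, 5]
print(round(pearson_r(general_rank, uxbench_rank), 2))
```

A value near 1 would mean the two leaderboards order models almost identically; a value near 0.3 means a model's general-capability rank says little about where it lands on UXBench.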
5-stage auto-labeling pipeline · ~1.5B tokens processed
10 failure dimensions (BAD) · 8 success patterns (GOOD) · human-validated · 8 scenarios · 83 domains
If you use UXBench in your research, please cite:
@misc{hong2026uxbench,
  title         = {UXBench: Benchmarking User Experience in AI Assistants},
  author        = {Mengze Hong and Xia Zeng and Zeyang Lei and Sheng Wang and
                   Chen Jason Zhang and Di Jiang and others},
  year          = {2026},
  eprint        = {2606.09570},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.09570}
}