UXBench: Benchmarking User Experience in AI Assistants

71,584 real thumbs-up / thumbs-down conversations from Tencent Yuanbao — the first
user-centric LLM benchmark grounded in genuine user satisfaction signals.

70K+ Real Conversations · Task 1 UX Judge (Bad): 1,000 · Task 1 UX Judge (Good): 1,000 · Task 2 UX Eval: 4,900 · Task 3 UX Recovery: 500

🏆 Leaderboard

Last updated: 2026-05-21 · Task 1: N=2,000 · Task 2: N=4,900 · Task 3: N=500

Task 1 · UX Judge N=2,000 · 26 frontier models + GRM
Rank Model Good-Acc ↑ Bad-Acc ↑ Avg-Acc ↑

Avg-Acc = (Good-Acc + Bad-Acc) / 2 — rewards balanced discrimination. Good-Acc: recall on 1,000 liked conversations. Bad-Acc: recall on 1,000 disliked conversations.
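The Avg-Acc definition above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's evaluation code: the function names and the "good"/"bad" label strings are assumptions, and the toy verdicts are made up.

```python
def recall(predictions, target):
    """Fraction of predictions equal to `target`.

    Every item passed in is assumed to share `target` as its true label,
    so this fraction is the recall on that class.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(p == target for p in predictions) / len(predictions)


def avg_acc(good_preds, bad_preds):
    """Return (Good-Acc, Bad-Acc, Avg-Acc) for one judge.

    good_preds: judge verdicts on the liked conversations
    bad_preds:  judge verdicts on the disliked conversations
    """
    good_acc = recall(good_preds, "good")
    bad_acc = recall(bad_preds, "bad")
    return good_acc, bad_acc, (good_acc + bad_acc) / 2


# Toy judge (made-up verdicts) that leans heavily toward "good"
good_preds = ["good"] * 9 + ["bad"] * 1
bad_preds = ["good"] * 7 + ["bad"] * 3
good_acc, bad_acc, avg = avg_acc(good_preds, bad_preds)
print(f"Good-Acc={good_acc:.2f} Bad-Acc={bad_acc:.2f} Avg-Acc={avg:.2f}")

# A judge that always answers "good" scores only 0.5 on Avg-Acc
always_good = avg_acc(["good"] * 10, ["good"] * 10)
```

Because the two recalls are averaged with equal weight, a judge that approves everything gets Good-Acc 1.0 and Bad-Acc 0.0, so it cannot exceed 0.5 on Avg-Acc.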

Task 2 · UX Eval N=4,900

Rank Model Good% ↑

Good%: fraction of conversations rated above threshold.
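As a rough sketch of the Good% metric, the snippet below counts the share of scores strictly above a cutoff. The score scale and the threshold value are not given on this page, so the 1-5 scale, the threshold of 3, and the function name `good_rate` are all hypothetical.

```python
def good_rate(scores, threshold):
    """Fraction of scores strictly above `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical judge scores on a made-up 1-5 scale
toy_scores = [4, 2, 5, 3, 1]
rate = good_rate(toy_scores, threshold=3)
print(f"Good% = {rate:.1%}")
```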

Task 3 · UX Recovery N=500

Rank Model Recovery Rate ↑

Recovery Rate: fraction of improved responses rated as satisfying.


📚 About UXBench

UXBench is designed to answer three core research questions: (1) how well automated LLM judges predict real user feedback on an AI-generated response; (2) whether frontier LLMs can generate high-quality responses for failure-prone user queries; and (3) how model capability improvements translate into measurable UX gains.

🎯 Real-User Signal

Unlike benchmarks relying on expert annotation or synthetic prompts, UXBench is mined from 38,331 disliked and 33,253 liked real conversations — capturing what actual users find frustrating or satisfying.

🔮 Multi-Task Coverage

Three complementary tasks: UX Judge (discriminative), UX Eval (GRM judge), and UX Recovery (generative fix) — covering the full quality-assurance cycle.

🥳 Failure & Success Taxonomy

A human-validated taxonomy of 10 failure dimensions and 8 success patterns covering 8 interaction scenarios and 83 domains, enabling fine-grained capability diagnosis.


🔍 Key Findings

Insights from evaluating 26 frontier LLMs on real-user UX data

📊 Systematic Positive Bias

Every model shows Good-Acc ≫ Bad-Acc, with gaps of 27–88 percentage points. LLM judges are biased toward validating responses, a critical flaw for QA pipelines.

🏠 Self-Preference Effect

LLM judges exhibit measurable in-group favoritism: same-family models score each other higher. Mixed-vendor ensembles produce more calibrated evaluations.

🔀 Capability ≠ UX

General-capability rankings show weak correlation (r=0.31) with UXBench rankings. Strong reasoning does not guarantee a good user experience.
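One way to compare two rankings is to correlate the rank positions each model receives. The sketch below computes Pearson's r on rank positions, which equals Spearman's rho when there are no ties. The page does not say which coefficient produced r=0.31, so this choice is an assumption, and the model names and scores are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def rank_positions(scores):
    """Map each model to its rank (1 = highest score; ties broken by name)."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


# Hypothetical scores, not taken from the leaderboard
capability = {"model_a": 90.0, "model_b": 85.0, "model_c": 80.0,
              "model_d": 70.0, "model_e": 60.0}
ux_score = {"model_a": 55.0, "model_b": 70.0, "model_c": 50.0,
            "model_d": 65.0, "model_e": 60.0}

cap_rank = rank_positions(capability)
ux_rank = rank_positions(ux_score)
models = sorted(capability)
r = pearson([cap_rank[m] for m in models], [ux_rank[m] for m in models])
print(f"rank correlation = {r:.2f}")
```

A value near 0 means that a model's position on one ranking says little about its position on the other.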

🔧 Recovery is Hard

Task 3 recovery rates top out at 12.8% (Claude Opus 4.6), revealing that generating genuinely improved responses remains an open challenge even for frontier models.


⚙️ Data Pipeline

5-stage auto-labeling pipeline · ~1.5B tokens processed

📥 Stage 1 · Signal Extraction · 71.5K → 48.6K
🔎 Stage 2 · Pre-filter · Dedup + Quality
⛏️ Stage 3 · Miner Agent · Reason Extraction
⚖️ Stage 4 · Judge Agent · 5-axis Scoring
🧪 Stage 5 · QA Full Scan · 48.6K → 7,400
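The counts quoted on this page can be cross-checked with a little arithmetic. The sketch below uses only figures from the page and treats 48.6K as the rounded value 48,600. The variable names are illustrative. That the three task sets together make up the final 7,400 is an inference from the matching sum, not something the page states.

```python
# Counts quoted on this page
liked, disliked = 33_253, 38_331
raw_total = liked + disliked             # raw conversations
intermediate = 48_600                    # "48.6K", rounded on the page
task_sizes = {"UX Judge": 2_000, "UX Eval": 4_900, "UX Recovery": 500}
final_total = sum(task_sizes.values())   # compare with the 7,400 figure


def retention(kept, total):
    """Share of `total` that survives a filtering step."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


first_cut = retention(intermediate, raw_total)
second_cut = retention(final_total, intermediate)
overall = retention(final_total, raw_total)

print(f"raw total:      {raw_total:,}")
print(f"final total:    {final_total:,}")
print(f"71.5K -> 48.6K: {first_cut:.1%} kept")
print(f"48.6K -> 7,400: {second_cut:.1%} kept")
print(f"overall:        {overall:.1%} kept")
```

The liked and disliked counts add up to the 71,584 headline figure, and the three task sizes add up to 7,400, so roughly one conversation in ten survives the full pipeline.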

📂 Failure & Success Taxonomy

10 failure dimensions (BAD) · 8 success patterns (GOOD) · human-validated · 8 scenarios · 83 domains

❌ Failure Modes (BAD)

📋 Verbosity / Redundancy (34.3%): Response is excessively long, repetitive, or padded with irrelevant content.
🔧 Task Incompleteness (24.8%): Only part of the user's request was fulfilled; key deliverables are missing.
🎯 Intent Misunderstanding (11.5%): Response addresses the wrong question or ignores user intent.
⚠️ Factual Error (11.0%): Incorrect facts, hallucinations, or verifiably false information.
🔎 Information Reliability Issue (10.1%): Outdated, unverifiable, or logically flawed information.
🖼️ Instruction / Format Failure (2.8%): Ignores explicit formatting constraints or structural requirements.
📉 Insufficiently Informative (2.2%): Response too superficial or vague to be practically useful.
💬 Emotional Tone Mismatch (2.0%): Inappropriate tone or lack of empathy in emotionally sensitive contexts.

✅ Success Patterns (GOOD)

Accurate Answering (17.4%): Factually correct, verified, and directly answers the question.
🧠 Knowledge Depth (15.4%): Expert-level insight beyond surface-level answers.
📚 Comprehensive Detail (13.7%): Full coverage of all relevant aspects of the question.
💡 Problem Solving (12.6%): Genuinely resolves the underlying user problem.
🛠️ Practical Guidance / Actionability (12.1%): Concrete, executable steps the user can immediately follow.
Creative Generation (11.6%): High-quality creative content that exceeds user expectations.
🎯 Task Completion (9.0%): Follows all instructions and delivers the requested output in full.
💬 Empathetic Support (8.2%): Emotionally appropriate, warm, and supportive responses.
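For fine-grained diagnosis, the taxonomy can be held as a plain mapping from category to share. The sketch below copies the percentages listed above; the variable and function names are illustrative. Only eight of the ten failure dimensions appear on this page, and their shares sum to 98.7%, so the remaining 1.3% presumably belongs to the two unlisted dimensions.

```python
# Shares (in percent) copied from the lists on this page
FAILURE_SHARES = {
    "Verbosity / Redundancy": 34.3,
    "Task Incompleteness": 24.8,
    "Intent Misunderstanding": 11.5,
    "Factual Error": 11.0,
    "Information Reliability Issue": 10.1,
    "Instruction / Format Failure": 2.8,
    "Insufficiently Informative": 2.2,
    "Emotional Tone Mismatch": 2.0,
}
SUCCESS_SHARES = {
    "Accurate Answering": 17.4,
    "Knowledge Depth": 15.4,
    "Comprehensive Detail": 13.7,
    "Problem Solving": 12.6,
    "Practical Guidance / Actionability": 12.1,
    "Creative Generation": 11.6,
    "Task Completion": 9.0,
    "Empathetic Support": 8.2,
}


def total_share(shares):
    """Sum of all shares, rounded to one decimal to absorb float noise."""
    return round(sum(shares.values()), 1)


def top_k(shares, k):
    """Names of the k categories with the largest shares."""
    return sorted(shares, key=shares.get, reverse=True)[:k]


failure_total = total_share(FAILURE_SHARES)
success_total = total_share(SUCCESS_SHARES)
unlisted = round(100.0 - failure_total, 1)

print("failure total:", failure_total, "| unlisted:", unlisted)
print("success total:", success_total)
print("top failure modes:", top_k(FAILURE_SHARES, 2))
```

The two largest failure modes, verbosity and task incompleteness, together account for more than half of the disliked conversations.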

📝 Citation

If you use UXBench in your research, please cite:

@misc{hong2026uxbench,
  title         = {UXBench: Benchmarking User Experience in AI Assistants},
  author        = {Mengze Hong and Xia Zeng and Zeyang Lei and Sheng Wang and
                   Chen Jason Zhang and Di Jiang and others},
  year          = {2026},
  eprint        = {2606.09570},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.09570}
}