UXBench Leaderboard

📚 About UXBench

UXBench is designed to answer two core research questions: (1) whether frontier LLMs can generate high-quality responses for failure-prone user queries; and (2) how model capability improvements translate into measurable UX gains.

🎯 Real-User Signal

Unlike benchmarks that rely on expert annotation or synthetic prompts, UXBench is mined from 38,331 disliked and 33,253 liked real conversations, capturing what actual users find frustrating or satisfying.

🔮 Multi-Task Coverage

Two complementary tasks, UX Eval (GRM judge) and UX Recovery (generative fix), cover the full quality-assurance cycle.

🥳 Failure & Success Taxonomy

A human-validated taxonomy of 10 failure dimensions and 8 success patterns covering 8 interaction scenarios and 83 domains, enabling fine-grained capability diagnosis.

🔍 Key Findings

Insights from evaluating 26 frontier LLMs on real-user UX data

📊 Systematic Positive Bias

Every model shows Good-Acc ≫ Bad-Acc, with gaps of 27–88 percentage points (pp): judges are far more accurate on liked responses than on disliked ones. LLMs are biased toward validating responses, a critical flaw for QA pipelines.
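The gap described above can be made concrete with a minimal sketch. It assumes each item carries a ground-truth label of "good" (liked) or "bad" (disliked), that the judge outputs one of the same two labels, and that Good-Acc and Bad-Acc are per-class accuracies. The function name `class_accuracies` and the toy data are illustrative assumptions, not part of UXBench's released code.

```python
from typing import Dict, Sequence


def class_accuracies(labels: Sequence[str], predictions: Sequence[str]) -> Dict[str, float]:
    """Return accuracy on 'good' items, accuracy on 'bad' items, and their gap in pp.

    Assumption: labels and predictions only contain the strings 'good' and 'bad'.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    totals = {"good": 0, "bad": 0}
    correct = {"good": 0, "bad": 0}
    for truth, pred in zip(labels, predictions):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1

    if totals["good"] == 0 or totals["bad"] == 0:
        raise ValueError("need at least one 'good' and one 'bad' item")

    good_acc = correct["good"] / totals["good"]
    bad_acc = correct["bad"] / totals["bad"]
    return {
        "good_acc": good_acc,
        "bad_acc": bad_acc,
        "gap_pp": 100.0 * (good_acc - bad_acc),  # gap in percentage points
    }


# Toy data (hypothetical, not benchmark results): a judge that almost always says "good".
toy_labels = ["good"] * 4 + ["bad"] * 4
toy_predictions = ["good"] * 7 + ["bad"]
print(class_accuracies(toy_labels, toy_predictions))
```

A judge that labels nearly everything "good" scores perfectly on liked responses while missing most disliked ones, which is the pattern a large positive `gap_pp` signals.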

🏠 Self-Preference Effect

LLM judges exhibit measurable in-group favoritism: same-family models score each other higher. Mixed-vendor ensembles produce more calibrated evaluations.
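One way a mixed-vendor ensemble could be assembled is sketched below. This is an illustration under stated assumptions, not the procedure used by UXBench: the `Verdict` type, the `ensemble_label` function, the rule of excluding judges from the same vendor as the response, and the tie-break toward "bad" are all hypothetical choices.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence


class Verdict(NamedTuple):
    judge: str   # judge model name (hypothetical)
    vendor: str  # vendor or model family of the judge
    label: str   # "good" or "bad"


def ensemble_label(verdicts: Sequence[Verdict], response_vendor: Optional[str] = None) -> str:
    """Majority vote over judges, skipping judges that share a vendor with the response.

    Ties resolve to "bad" (an assumed conservative choice, given that judges
    tend to over-validate responses).
    """
    eligible = [v for v in verdicts if v.vendor != response_vendor]
    if not eligible:
        raise ValueError("no judge from a different vendor is available")

    counts = Counter(v.label for v in eligible)
    return "good" if counts["good"] > counts["bad"] else "bad"


# Toy verdicts (hypothetical judges and vendors).
votes = [
    Verdict("judge-a", "vendor-1", "good"),
    Verdict("judge-b", "vendor-2", "good"),
    Verdict("judge-c", "vendor-3", "bad"),
]
print(ensemble_label(votes))                              # all three judges vote
print(ensemble_label(votes, response_vendor="vendor-1"))  # judge-a is excluded
```

Dropping the same-vendor judge removes the vote most exposed to in-group favoritism, at the cost of a smaller panel.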

🔀 Capability ≠ UX

General-capability rankings show weak correlation (r=0.31) with UXBench rankings. Strong reasoning does not guarantee a good user experience.
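For readers who want to run a similar comparison on their own scores, a correlation coefficient can be computed as in the sketch below. The source does not state whether r is a Pearson or Spearman coefficient; this example assumes Pearson. Applied to rank positions without ties, the same formula yields Spearman's rho. The function name `pearson_r` and the toy ranks are illustrative assumptions, not benchmark data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy rank positions for five hypothetical models (1 = best).
general_rank = [1, 2, 3, 4, 5]
ux_rank = [2, 1, 4, 3, 5]
print(round(pearson_r(general_rank, ux_rank), 2))  # prints 0.8
```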

🔧 Recovery is Hard

UX Recovery rates top out at 12.8% (Claude Opus 4.6), revealing that generating genuinely improved responses remains an open challenge even for frontier models.

📂 Failure & Success Taxonomy

10 failure dimensions (BAD) · 8 success patterns (GOOD) · human-validated · 8 scenarios · 83 domains

❌ Failure Modes (BAD)

📋 Verbosity / Redundancy (34.3%): Response is excessively long, repetitive, or padded with irrelevant content.

🔧 Task Incompleteness (24.8%): Only part of the user's request was fulfilled; key deliverables are missing.

🎯 Intent Misunderstanding (11.5%): Response addresses the wrong question or ignores user intent.

⚠️ Factual Error (11.0%): Incorrect facts, hallucinations, or verifiably false information.

🔎 Information Reliability Issue (10.1%): Outdated, unverifiable, or logically flawed information.

🖼️ Instruction / Format Failure (2.8%): Ignores explicit formatting constraints or structural requirements.

📉 Insufficiently Informative (2.2%): Response too superficial or vague to be practically useful.

💬 Emotional Tone Mismatch (2.0%): Inappropriate tone or lack of empathy in emotionally sensitive contexts.

✅ Success Patterns (GOOD)

✅ Accurate Answering (17.4%): Factually correct, verified, and directly answers the question.

🧠 Knowledge Depth (15.4%): Expert-level insight beyond surface-level answers.

📚 Comprehensive Detail (13.7%): Full coverage of all relevant aspects of the question.

💡 Problem Solving (12.6%): Genuinely resolves the underlying user problem.

🛠️ Practical Guidance / Actionability (12.1%): Concrete, executable steps the user can immediately follow.

✨ Creative Generation (11.6%): High-quality creative content that exceeds user expectations.

🎯 Task Completion (9.0%): Follows all instructions and delivers the requested output in full.

💬 Empathetic Support (8.2%): Emotionally appropriate, warm, and supportive responses.

📝 Citation

If you use UXBench in your research, please cite:

@misc{hong2026uxbench,
  title         = {UXBench: Benchmarking User Experience in AI Assistants},
  author        = {Mengze Hong and Xia Zeng and Zeyang Lei and Sheng Wang and
                   Chen Jason Zhang and Di Jiang and others},
  year          = {2026},
  eprint        = {2606.09570},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.09570}
}

UXBench: Benchmarking User Experience in AI Assistants

🏆 Leaderboard (last updated: 2026-05-21)

UX Eval: N=4,900

UX Recovery: N=500