How well does our AI judge agree with human evaluators? We conducted a comprehensive study to find out, with some surprising results.
Study Design
We collected 10,000 completed debates from our platform and had each one evaluated by:

- Our AI judge
- Three independent human judges (majority vote determines the winner)
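As a rough illustration of how we score agreement, each debate's human verdict is the majority vote of the three judges, and the AI counts as agreeing when it picks the same winner. The field names below are hypothetical placeholders, not our actual schema:

```python
from collections import Counter

def human_consensus(votes):
    """Majority vote among the three human judges (e.g. 'pro' or 'con')."""
    return Counter(votes).most_common(1)[0][0]

def agreement_rate(debates):
    """Fraction of debates where the AI judge picks the same winner as the human consensus."""
    agree = sum(
        d["ai_winner"] == human_consensus(d["human_votes"])
        for d in debates
    )
    return agree / len(debates)

# Hypothetical record shape (illustrative only):
# {"ai_winner": "pro", "human_votes": ["pro", "pro", "con"], "ai_confidence": 0.92}
```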
Key Findings
Overall Agreement: 87%

Our AI judge agreed with the human consensus in 87% of cases, remarkably close to the 89% agreement rate between individual human judges.
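For context, a human-judge baseline like the 89% figure can be computed as the average pairwise agreement among the three judges. This is one plausible definition, sketched below; the exact metric used in the study may differ:

```python
from itertools import combinations

def pairwise_human_agreement(debates):
    """Average agreement across all pairs of human judges.

    One plausible way to compute a human-judge baseline; an assumption here,
    not necessarily the exact metric used in the study.
    """
    matches = pairs = 0
    for d in debates:
        for a, b in combinations(d["human_votes"], 2):
            matches += (a == b)
            pairs += 1
    return matches / pairs
```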
Where They Disagree

The 13% of cases where AI and humans disagreed fell into interesting patterns:

- Highly technical topics (the AI was less accurate)
- Humor and sarcasm (the AI sometimes missed the point)
- Very close debates (essentially coin flips for both AI and humans)
Confidence Calibration

Interestingly, our AI's confidence scores were well calibrated. When the AI expressed high confidence, agreement with humans was 96%. When confidence was low, agreement dropped to 71%.
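A minimal sketch of the kind of bucketing behind these numbers; the 0.8 threshold and field names are illustrative assumptions:

```python
from collections import Counter

def agreement_by_confidence(debates, threshold=0.8):
    """Group debates into high- and low-confidence buckets and report
    agreement with the human majority in each. Threshold is illustrative."""
    buckets = {"high": [], "low": []}
    for d in debates:
        key = "high" if d["ai_confidence"] >= threshold else "low"
        consensus = Counter(d["human_votes"]).most_common(1)[0][0]
        buckets[key].append(d["ai_winner"] == consensus)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}
```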
Implications
These results suggest our AI judge is suitable for most debates, but we should consider human review for edge cases where AI confidence is low.
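In practice, this could be as simple as a confidence-based routing rule. The threshold below is a placeholder we would tune against the calibration data, not a value from the study:

```python
def route_verdict(ai_winner, ai_confidence, review_threshold=0.8):
    """Accept the AI verdict when confidence is high; otherwise flag the
    debate for human review. The threshold is a tunable placeholder."""
    if ai_confidence >= review_threshold:
        return {"winner": ai_winner, "needs_human_review": False}
    return {"winner": None, "needs_human_review": True}
```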