
Measuring Human-AI Agreement: A Study of 10,000 Debates

How well does our AI judge agree with human evaluators? We conducted a comprehensive study to find out, with some surprising results.

Study Design

We collected 10,000 completed debates from our platform and had each one evaluated by:

- Our AI judge
- Three independent human judges (majority vote determines the winner)
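
For concreteness, here is a minimal sketch of how the human consensus and the AI-human agreement rate can be computed from records like these. The record format and the sample verdicts are hypothetical illustrations, not our production pipeline.

```python
from collections import Counter

# Hypothetical record format, assumed for illustration: each debate pairs the
# AI judge's verdict with the three human judges' verdicts ("pro" or "con").
debates = [
    ("pro", ["pro", "pro", "con"]),
    ("con", ["con", "con", "con"]),
    ("pro", ["con", "con", "pro"]),
]

def human_consensus(human_verdicts):
    """Majority vote of the three independent human judges."""
    return Counter(human_verdicts).most_common(1)[0][0]

# Agreement rate: fraction of debates where the AI verdict matches the consensus.
agreement = sum(ai == human_consensus(humans) for ai, humans in debates) / len(debates)
print(f"AI-human agreement: {agreement:.0%}")
```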

Key Findings

Overall Agreement: 87%

Our AI judge agreed with the human consensus in 87% of cases, remarkably close to the 89% agreement rate between individual human judges.
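
The 89% inter-human baseline is read here as pairwise agreement between individual human judges. The sketch below shows that computation under the same hypothetical record format as above; it is an assumed reading of the metric, not the exact study code.

```python
from itertools import combinations

# Hypothetical human verdicts for three debates (three judges each).
human_votes = [
    ["pro", "pro", "con"],
    ["con", "con", "con"],
    ["pro", "con", "pro"],
]

def pairwise_agreement(votes_per_debate):
    """Fraction of human judge pairs, across all debates, that picked the same winner."""
    matches = pairs = 0
    for votes in votes_per_debate:
        for a, b in combinations(votes, 2):
            pairs += 1
            matches += a == b
    return matches / pairs

print(f"Inter-human agreement: {pairwise_agreement(human_votes):.0%}")
```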

Where They Disagree

The 13% of cases where AI and humans disagreed fell into interesting patterns:

- Highly technical topics (AI less accurate)
- Humor and sarcasm (AI sometimes missed the point)
- Very close debates (essentially coin flips for both)

Confidence Calibration

Interestingly, our AI's confidence scores were well-calibrated. When the AI expressed high confidence, agreement with humans was 96%. When confidence was low, agreement dropped to 71%.
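
A minimal sketch of this calibration check, assuming per-debate records that carry an AI confidence score: bucket debates by confidence and measure agreement with the human consensus within each bucket. The field names and the 0.8 cutoff are illustrative assumptions.

```python
def agreement_by_confidence(records, threshold=0.8):
    """Split records into high/low AI-confidence buckets and report agreement in each."""
    buckets = {"high": [], "low": []}
    for rec in records:
        bucket = "high" if rec["ai_confidence"] >= threshold else "low"
        buckets[bucket].append(rec["ai_verdict"] == rec["human_consensus"])
    return {
        name: (sum(vals) / len(vals) if vals else None)
        for name, vals in buckets.items()
    }

# Example with made-up records: a well-calibrated judge shows markedly higher
# agreement in the "high" bucket (the study found 96% vs. 71%).
records = [
    {"ai_verdict": "pro", "human_consensus": "pro", "ai_confidence": 0.93},
    {"ai_verdict": "con", "human_consensus": "pro", "ai_confidence": 0.55},
]
print(agreement_by_confidence(records))
```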

Implications

These results suggest our AI judge is suitable for most debates, but we should consider human review for edge cases where AI confidence is low.
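
As a rough sketch of what that routing could look like (the confidence threshold here is an assumed placeholder, not a value from the study):

```python
def route_judgment(ai_confidence: float, threshold: float = 0.8) -> str:
    """Accept the AI verdict when confidence is high; otherwise queue for human review."""
    return "auto_accept" if ai_confidence >= threshold else "human_review"

print(route_judgment(0.95))  # auto_accept
print(route_judgment(0.62))  # human_review
```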

Frequently Asked Questions

Can people use AI to answer questions?

Yes. We can't stop people from using AI, so we treat it as a tool in everyone's arsenal.

How does the judging work?

See our blog post on our AI judge, which explains how we test it and how it scores posts.

How do you make sure the AI isn't biased?

See our blog post on bias detection.

Why are you doing this?

To incentivize good thinking and yes, to make money in the process.

What if I want to dispute a judgment?

Email us at support@argyu.com and we'll look into it.
