How well does our AI judge agree with human evaluators? We conducted a comprehensive study to find out, with some surprising results.
Study Design
We collected 10,000 completed debates from our platform and had each one evaluated by:

- Our AI judge
- Three independent human judges (majority vote determines the winner)
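As a rough illustration of how we score agreement, each debate's human verdict is the majority vote of the three judges, and the AI counts as agreeing when it picks the same winner. The field names below are hypothetical placeholders, not our actual schema:

```python
from collections import Counter

def human_consensus(votes):
    """Majority vote among the three human judges (e.g. 'pro' or 'con')."""
    return Counter(votes).most_common(1)[0][0]

def agreement_rate(debates):
    """Fraction of debates where the AI judge picks the same winner as the human consensus."""
    agree = sum(
        d["ai_winner"] == human_consensus(d["human_votes"])
        for d in debates
    )
    return agree / len(debates)

# Hypothetical record shape (illustrative only):
# {"ai_winner": "pro", "human_votes": ["pro", "pro", "con"], "ai_confidence": 0.92}
```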
Key Findings
Overall Agreement: 87%

Our AI judge agreed with the human consensus in 87% of cases, remarkably close to the 89% agreement rate between individual human judges.
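For context, a human-judge baseline like the 89% figure can be computed as the average pairwise agreement among the three judges. This is one plausible definition, sketched below; the exact metric used in the study may differ:

```python
from itertools import combinations

def pairwise_human_agreement(debates):
    """Average agreement across all pairs of human judges.

    One plausible way to compute a human-judge baseline; an assumption here,
    not necessarily the exact metric used in the study.
    """
    matches = pairs = 0
    for d in debates:
        for a, b in combinations(d["human_votes"], 2):
            matches += (a == b)
            pairs += 1
    return matches / pairs
```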
Where They Disagree

The 13% of cases where AI and humans disagreed fell into interesting patterns:

- Highly technical topics (the AI was less accurate)
- Humor and sarcasm (the AI sometimes missed the point)
- Very close debates (essentially coin flips for both AI and humans)
Confidence Calibration

Interestingly, our AI's confidence scores were well calibrated. When the AI expressed high confidence, agreement with humans was 96%. When confidence was low, agreement dropped to 71%.
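A minimal sketch of the kind of bucketing behind these numbers; the 0.8 threshold and field names are illustrative assumptions:

```python
from collections import Counter

def agreement_by_confidence(debates, threshold=0.8):
    """Group debates into high- and low-confidence buckets and report
    agreement with the human majority in each. Threshold is illustrative."""
    buckets = {"high": [], "low": []}
    for d in debates:
        key = "high" if d["ai_confidence"] >= threshold else "low"
        consensus = Counter(d["human_votes"]).most_common(1)[0][0]
        buckets[key].append(d["ai_winner"] == consensus)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}
```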
Implications
These results suggest our AI judge is suitable for most debates, but we should consider human review for edge cases where AI confidence is low.
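In practice, this could be as simple as a confidence-based routing rule. The threshold below is a placeholder we would tune against the calibration data, not a value from the study:

```python
def route_verdict(ai_winner, ai_confidence, review_threshold=0.8):
    """Accept the AI verdict when confidence is high; otherwise flag the
    debate for human review. The threshold is a tunable placeholder."""
    if ai_confidence >= review_threshold:
        return {"winner": ai_winner, "needs_human_review": False}
    return {"winner": None, "needs_human_review": True}
```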