EFFICACY · METHODOLOGY
How accurate is Britwise compared to a real IELTS examiner?
We publish our methodology and calibration data openly. Every claim below is reproducible from our 10,000+ recorded Speaking submissions, cross-graded by certified Cambridge examiners.
±0.5 band vs real examiner
+1.2 average band gain in 30 days
94% completion rate
10,247 calibration samples
1. The calibration study
Between Sept 2025 and Mar 2026 we collected 10,247 Speaking recordings (Part 1 + 2 + 3) from paid users in Vietnam. Each recording was graded twice:
• Britwise AI (GPT-4o + Whisper, scored against the Cambridge Public Band Descriptors).
• A panel of 4 certified IELTS Speaking examiners (avg 9 years' experience).
Both sides graded blind. We compared the overall band and each of the 4 criteria (Fluency, Lexical, Grammar, Pronunciation).
2. Headline results
• 95% of Britwise overall band scores fell within ±0.5 band of the examiner consensus.
• 71% landed within ±0.5 on all 4 criteria simultaneously.
• Disagreement was symmetric: we under- and over-grade in equal measure, with no systematic bias.
• Where Britwise and the examiner panel disagreed by more than 0.5 band, internal review found the AI was correct 41% of the time (the human panel itself sometimes splits).
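The headline figures above are agreement statistics, and a minimal sketch of how such figures could be computed follows. It is an illustration only: the function names and the toy data are hypothetical and are not taken from the Britwise pipeline or the calibration set. It also assumes that "symmetric" disagreement is checked through the mean signed difference, which is one possible check among several.

```python
from statistics import mean

TOLERANCE = 0.5  # band tolerance quoted in the headline results


def within_tolerance_rate(pairs, tol=TOLERANCE):
    """Share of recordings where |AI band - examiner band| <= tol."""
    return sum(abs(ai - human) <= tol for ai, human in pairs) / len(pairs)


def all_criteria_rate(ai_rows, human_rows, tol=TOLERANCE):
    """Share of recordings where every criterion is within tol at once."""
    hits = sum(
        all(abs(a - h) <= tol for a, h in zip(ai, human))
        for ai, human in zip(ai_rows, human_rows)
    )
    return hits / len(ai_rows)


def mean_signed_error(pairs):
    """Mean of (AI - examiner); a value near 0 suggests no systematic bias."""
    return mean(ai - human for ai, human in pairs)


# Hypothetical toy data: (AI overall band, examiner consensus band).
overall = [(6.0, 6.5), (5.5, 5.5), (7.0, 6.0), (6.5, 7.0), (5.0, 5.0)]

# Hypothetical per-criterion bands, ordered as
# (Fluency, Lexical, Grammar, Pronunciation) for each recording.
ai_criteria = [(6.0, 6.0, 5.5, 6.5), (7.0, 6.5, 6.0, 6.0)]
examiner_criteria = [(6.5, 6.0, 6.0, 6.5), (6.0, 6.5, 6.0, 6.0)]

print(within_tolerance_rate(overall))
print(all_criteria_rate(ai_criteria, examiner_criteria))
print(mean_signed_error(overall))
```

A mean signed error close to zero is necessary but not sufficient for symmetry; a fuller check would also compare the counts of over- and under-graded recordings.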
3. Per-criterion accuracy
Fluency & Coherence: within ±0.5 in 92% of cases
Lexical Resource: within ±0.5 in 90% of cases
Grammatical Range & Accuracy: within ±0.5 in 89% of cases
Pronunciation: within ±0.5 in 87% of cases
4. Outcome study — +1.2 band in 30 days
We tracked 312 paid users who completed at least 30 days of regular practice (≥3 sessions/week). On entry, their mean Speaking band was 5.4. After 30 days the mean was 6.6, a mean gain of +1.2. Median gain: +1.0; 90th percentile: +2.0; 10th percentile: +0.5. Comparable studies of traditional one-to-one tutoring (£30/h equivalent) show a gain of about +0.5 band over the same time frame from the same starting level.
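The summary statistics quoted in this section (mean, median, 10th and 90th percentile gain) can be reproduced from paired entry and exit bands. The sketch below uses Python's standard `statistics` module on hypothetical toy data; the function name `summarize_gains`, the sample values and the choice of the "inclusive" quantile method are assumptions for illustration, not the method used in the study.

```python
from statistics import mean, median, quantiles


def summarize_gains(entry_bands, exit_bands):
    """Summarize per-user band gains (exit minus entry)."""
    gains = [after - before for before, after in zip(entry_bands, exit_bands)]
    # n=10 yields the nine decile cut points; the first is the 10th
    # percentile and the last is the 90th percentile.
    deciles = quantiles(gains, n=10, method="inclusive")
    return {
        "mean": mean(gains),
        "median": median(gains),
        "p10": deciles[0],
        "p90": deciles[-1],
    }


# Hypothetical toy data for five users, not taken from the 312-user cohort.
entry_bands = [5.0, 5.5, 5.5, 6.0, 5.0]
exit_bands = [6.0, 6.5, 7.0, 7.0, 6.5]

summary = summarize_gains(entry_bands, exit_bands)
print(summary)
```

Reporting the median and percentiles alongside the mean matters here because a few large gains can pull the mean above what a typical user sees.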
5. What this is NOT
• Not a Cambridge Assessment endorsement. We aim to mirror their rubric but we are independent.
• Not a guarantee. Individual outcomes vary. Our refund policy (+0.5 band in 30 days) underwrites the average outcome, not every case.
• Not the entire test. Listening, Reading and Writing are scored with separate methodologies (Listening/Reading are auto-marked; Writing uses a similar 4-criteria GPT-4o panel).
6. How we keep ourselves honest
Every quarter we re-run the calibration with a fresh batch of 500+ recordings. If agreement falls below our threshold (95% of scores within ±0.5 band of the examiners), we recalibrate prompts and rubric weighting until it recovers. Want to audit the data? Email research@britwise.info with your institution and we'll share the de-identified calibration set under NDA.
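The quarterly check can be pictured as a simple threshold test on a fresh batch. The sketch below is an assumption about how such a check might look, not the actual Britwise tooling: the names `agreement_rate` and `needs_recalibration` are hypothetical, the batches are synthetic, and "drift" is interpreted here as the agreement rate dropping below 95%.

```python
THRESHOLD = 0.95   # required share of scores within tolerance
TOLERANCE = 0.5    # band tolerance
MIN_BATCH = 500    # minimum fresh recordings per quarterly run


def agreement_rate(pairs, tol=TOLERANCE):
    """Share of (AI, examiner) pairs that differ by at most tol."""
    return sum(abs(ai - human) <= tol for ai, human in pairs) / len(pairs)


def needs_recalibration(pairs, threshold=THRESHOLD, tol=TOLERANCE,
                        min_batch=MIN_BATCH):
    """Return True when the batch's agreement rate is below the threshold."""
    if len(pairs) < min_batch:
        raise ValueError("batch too small for the quarterly check")
    return agreement_rate(pairs, tol) < threshold


# Synthetic batches: 480/500 agree (96%) versus 470/500 agree (94%).
healthy = [(6.0, 6.0)] * 480 + [(6.0, 7.0)] * 20
drifted = [(6.0, 6.0)] * 470 + [(6.0, 7.0)] * 30

print(needs_recalibration(healthy))
print(needs_recalibration(drifted))
```

With only 500 recordings, a one-point swing in the agreement rate corresponds to five recordings, so a real check might also want a confidence interval around the rate before triggering recalibration.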
Want to verify on your own students?
Sign up for a free 60-day pilot