Score model outputs against a rubric — and check the judge for bias.
Per-criterion rubric scoring on accuracy, clarity, and completeness, plus pairwise comparisons run in both orderings to catch position bias. 15 scored evaluations across 8 domains are shown below.
Prompt
Using only the context below, answer the question. Context: "Returns are accepted within 30 days of delivery for unused items in original packaging. Refunds are issued to the original payment method within 5 business days of the returned item being received." Question: What is the refund window, and how is the refund paid?
Model output
You can request a refund within 60 days of delivery, and the amount is credited back to your original payment method within 5 business days of us receiving the item.
Judged by anthropic/claude-sonnet-4.6 — the output is scored as-is, never regenerated.
Not grounded in the context: the passage states a 30-day return window, but the answer asserts 60 days — a direct contradiction of the source. The payment-method and 5-business-day details are correct, so this is a confident, half-right answer, which is the dangerous kind.
Fluent, well-structured, and reads with total confidence — precisely why an ungrounded number slips past a human skim.
Addresses both parts of the question (window and payment method), but anchors the headline figure to a value that is not in the provided context.
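The failure above (a 60-day figure where the context says 30) is the kind a cheap pre-check can surface before any judge runs. The following is a minimal sketch, not the judge's actual method: it assumes that any number in the answer should also appear verbatim in the context, and the function name `ungrounded_numbers` is hypothetical.

```python
import re

# Matches integers and simple decimals such as "30" or "0.003".
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def ungrounded_numbers(context: str, answer: str) -> set[str]:
    """Return numbers that appear in the answer but nowhere in the context.

    Assumption: a number is "grounded" only if the same digits appear in the
    context. This ignores paraphrases such as "thirty" or "a month".
    """
    context_numbers = set(NUMBER.findall(context))
    return {n for n in NUMBER.findall(answer) if n not in context_numbers}

context = (
    "Returns are accepted within 30 days of delivery for unused items in "
    "original packaging. Refunds are issued to the original payment method "
    "within 5 business days of the returned item being received."
)
answer = (
    "You can request a refund within 60 days of delivery, and the amount is "
    "credited back to your original payment method within 5 business days "
    "of us receiving the item."
)

print(ungrounded_numbers(context, answer))  # {'60'}
```

A literal match like this only catches numeric contradictions. It complements a rubric judge rather than replacing one, since it cannot tell whether a grounded number is used correctly.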
Evaluations: 15 (across 8 domains)
Average total: 8.36 (mean score, 0–10)
Pairwise checks: 7 (run in both orderings)
Position-bias rate: 29% (flipped on reorder)
Position-bias audit
Every pairwise comparison is scored twice with the outputs swapped. 2 of 7 runs in this set changed their winner on the swap — those verdicts are flagged unreliable. One is shown below.
Bias detected — the verdict flips with order
The judge picked Output A when A was shown first, but Output B when B was shown first. Because the verdict changed with presentation order, the result is unreliable — treat these outputs as effectively tied rather than trusting either single ordering.
Order: A shown first
1 = A · 2 = B
Both faithfully capture all four points (speed, screen, battery, heat) with no distortion.
A is marginally tighter and more direct.
Both include every salient point from the review.
Near-tie; A edges it on concision when shown first.
Order: B shown first
1 = B · 2 = A
Again both are fully accurate summaries.
Shown first, B’s slightly richer phrasing now reads as the more polished option.
Both remain complete.
The verdict flipped to B purely from presentation order — a textbook position-bias case on a genuine near-tie.
Judged by anthropic/claude-sonnet-4.6 · each pair scored in both orderings.
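The audit described above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: the `Judge` signature (returns 1 if the first-shown output wins, 2 if the second-shown wins) and the names `audit_pair` and `position_bias_rate` are hypothetical, and the two toy judges stand in for a real model call.

```python
from typing import Callable, Optional

# Hypothetical judge interface: given (first_shown, second_shown), return
# 1 if the first-shown output wins or 2 if the second-shown output wins.
Judge = Callable[[str, str], int]

def audit_pair(judge: Judge, a: str, b: str) -> dict:
    """Score one pair in both orderings and flag a flipped verdict."""
    winner_a_first = "A" if judge(a, b) == 1 else "B"  # A shown first
    winner_b_first = "B" if judge(b, a) == 1 else "A"  # B shown first
    flipped = winner_a_first != winner_b_first
    # A flipped verdict is treated as a tie rather than trusting either order.
    winner: Optional[str] = None if flipped else winner_a_first
    return {
        "a_first": winner_a_first,
        "b_first": winner_b_first,
        "flipped": flipped,
        "winner": winner,
    }

def position_bias_rate(flips: list[bool]) -> float:
    """Fraction of pairwise checks whose winner changed on the swap."""
    return sum(flips) / len(flips)

# Toy judges for demonstration only.
def always_first(first: str, second: str) -> int:
    return 1  # maximally position-biased: always prefers the first slot

def prefers_longer(first: str, second: str) -> int:
    return 1 if len(first) >= len(second) else 2  # order-independent here

biased = audit_pair(always_first, "summary A", "a longer summary B")
stable = audit_pair(prefers_longer, "summary A", "a longer summary B")
print(biased["flipped"], biased["winner"])  # True None
print(stable["flipped"], stable["winner"])  # False B

# 2 flips out of 7 checks, as reported for this sample set.
rate = position_bias_rate([True, True, False, False, False, False, False])
print(f"{rate:.0%}")  # 29%
```

Returning `None` as the winner on a flip mirrors the guidance above to treat such pairs as effectively tied.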
[Chart: Average score by criterion. Mean across all sample evaluations (0–10).]
[Chart: Score distribution. How many evaluations fall in each total-score band.]
[Chart: Average score by domain. Mean total score per subject area.]
Evaluation runs
| Domain | Prompt | Accuracy | Clarity | Completeness | Total |
|---|---|---|---|---|---|
| Coding | Write a SQL query to find the second-highest salary from an Employee table. | 9.0 | 9.0 | 8.0 | 8.7 |
| Coding | Explain what a race condition is to a junior developer. | 9.0 | 9.0 | 7.0 | 8.3 |
| Medical | What lifestyle changes help lower high blood pressure? | 10.0 | 9.0 | 9.0 | 9.3 |
| Medical | Is it safe to take ibuprofen and paracetamol together? | 8.0 | 9.0 | 6.0 | 7.7 |
| Legal | What is the difference between a copyright and a trademark? | 9.0 | 10.0 | 7.0 | 8.7 |
| Legal | Can my landlord enter my apartment without notice? | 8.0 | 9.0 | 7.0 | 8.0 |
| Reasoning | A model is billed at $0.003 per 1K input tokens. A single request sends 12,000 input tokens. What is the input cost for that request? | 2.0 | 8.0 | 6.0 | 5.3 |
| Reasoning | A classifier has 90% precision and 80% recall. What is its F1 score? | 10.0 | 9.0 | 9.0 | 9.3 |
| Summarization | Summarize this changelog entry in one sentence: "v2.3 adds SSO via SAML, fixes a memory leak in the export worker, and deprecates the legacy /v1 API, which will be removed in v3." | 9.0 | 9.0 | 8.0 | 8.7 |
| Summarization | Condense this into a tweet: a quarterly report showing 12% revenue growth driven by international expansion and a new subscription tier. | 9.0 | 8.0 | 8.0 | 8.3 |
| Factual QA | Which HTTP status code indicates "Too Many Requests"? | 2.0 | 9.0 | 6.0 | 5.7 |
| Factual QA | Is a standard JSON Web Token (JWT) encrypted by default? | 10.0 | 9.0 | 10.0 | 9.7 |
| Customer Support | A customer asks how to reset their password. Write a helpful reply. | 9.0 | 10.0 | 9.0 | 9.3 |
| Customer Support | Respond to a one-star review that says the app keeps crashing on startup. | 8.0 | 9.0 | 8.0 | 8.3 |
| Translation | Translate into Spanish: 'Where is the nearest train station?' | 10.0 | 10.0 | 10.0 | 10.0 |
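The summary figures can be rechecked from the table. The sketch below assumes that each Total is the unweighted mean of the three criterion scores. That formula is inferred from the rows, not stated by the source, though it matches every row. Scores are transcribed from the table, and the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (domain, accuracy, clarity, completeness), transcribed from the table above.
rows = [
    ("Coding", 9, 9, 8), ("Coding", 9, 9, 7),
    ("Medical", 10, 9, 9), ("Medical", 8, 9, 6),
    ("Legal", 9, 10, 7), ("Legal", 8, 9, 7),
    ("Reasoning", 2, 8, 6), ("Reasoning", 10, 9, 9),
    ("Summarization", 9, 9, 8), ("Summarization", 9, 8, 8),
    ("Factual QA", 2, 9, 6), ("Factual QA", 10, 9, 10),
    ("Customer Support", 9, 10, 9), ("Customer Support", 8, 9, 8),
    ("Translation", 10, 10, 10),
]

# Assumption: total = unweighted mean of the three criteria.
totals = [mean(scores) for _, *scores in rows]
average_total = mean(totals)

# Mean per criterion across all evaluations.
criteria = ("accuracy", "clarity", "completeness")
by_criterion = {
    name: mean(row[i + 1] for row in rows) for i, name in enumerate(criteria)
}

# Mean total per domain.
grouped = defaultdict(list)
for (domain, *_), total in zip(rows, totals):
    grouped[domain].append(total)
by_domain = {domain: mean(values) for domain, values in grouped.items()}

print(f"average total: {average_total:.2f}")  # average total: 8.36
for name, value in by_criterion.items():
    print(f"{name}: {value:.2f}")
for domain, value in by_domain.items():
    print(f"{domain}: {value:.1f}")
```

Averaging the unrounded totals gives the reported 8.36. Averaging the one-decimal totals shown in the table gives about 8.35 instead, so the small difference comes from rounding.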