LLM-as-judge · stateless

Score model outputs against a rubric — and check the judge for bias.

Per-criterion rubric scoring on accuracy, clarity, and completeness, plus pairwise comparisons run in both orderings to catch position bias. 15 scored evaluations across 8 domains are below — or run the judge on your own output.

Score a single output · A/B compare with bias check

no database · works with no API key · deployed on Vercel

Specimen: RAG · Grounding

Prompt

Using only the context below, answer the question. Context: "Returns are accepted within 30 days of delivery for unused items in original packaging. Refunds are issued to the original payment method within 5 business days of the returned item being received." Question: What is the refund window, and how is the refund paid?

Model output

You can request a refund within 60 days of delivery, and the amount is credited back to your original payment method within 5 business days of us receiving the item.

Judged by anthropic/claude-sonnet-4.6 — the output is scored as-is, never regenerated.

Rubric verdict

Total score: 6.3/10 (Mid)

Accuracy: 3.0/10

Not grounded in the context: the passage states a 30-day return window, but the answer asserts 60 days — a direct contradiction of the source. The payment-method and 5-business-day details are correct, so this is a confident, half-right answer, which is the dangerous kind.

Clarity: 9.0/10

Fluent, well-structured, and reads with total confidence — precisely why an ungrounded number slips past a human skim.

Completeness: 7.0/10

Addresses both parts of the question (window and payment method), but anchors the headline figure to a value that is not in the provided context.
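
The 6.3 total is consistent with an unweighted mean of the three criterion scores, although the page does not state the formula. The sketch below assumes that formula; the function name `rubric_total` is hypothetical.

```python
def rubric_total(scores: dict[str, float]) -> float:
    """Unweighted mean of the criterion scores, rounded to one decimal.

    Assumption: the page never states how the total is derived; this
    formula simply reproduces the totals it displays.
    """
    values = list(scores.values())
    return round(sum(values) / len(values), 1)


# Criterion scores from the RAG grounding specimen
specimen = {"accuracy": 3.0, "clarity": 9.0, "completeness": 7.0}
print(rubric_total(specimen))  # 6.3
```

Under this assumption a fluent but ungrounded answer still lands mid-band, because clarity and completeness offset the low accuracy score.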

Evaluations: 15, across 8 domains

Average total: 8.36 (mean score, 0–10)

Pairwise checks: 7, run in both orderings

Position-bias rate: 29% (flipped on reorder)

Position-bias audit

Every pairwise comparison is scored twice with the outputs swapped. 2 of 7 runs in this set changed their winner on the swap — those verdicts are flagged unreliable. One is shown below.

Bias detected — the verdict flips with order

The judge picked Output A when A was shown first, but Output B when B was shown first. Because the verdict changed with presentation order, the result is unreliable — treat these outputs as effectively tied rather than trusting either single ordering.

Order: A shown first

1 = A · 2 = B

Accuracy: Tie

Both faithfully capture all four points (speed, screen, battery, heat) with no distortion.

Clarity: Output A

A is marginally tighter and more direct.

Completeness: Tie

Both include every salient point from the review.

Overall: Output A

Near-tie; A edges it on concision when shown first.

Order: B shown first

1 = B · 2 = A

Accuracy: Tie

Again both are fully accurate summaries.

Clarity: Output B

Shown first, B’s slightly richer phrasing now reads as the more polished option.

Completeness: Tie

Both remain complete.

Overall: Output B

The verdict flipped to B purely from presentation order — a textbook position-bias case on a genuine near-tie.

Judged by anthropic/claude-sonnet-4.6 · each pair scored in both orderings.
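
The audit described above can be sketched as follows: score each pair in both orderings, map the positional verdict (1 or 2) back to the output label, and flag the pair when the two labels disagree. This is a minimal sketch, not the page's implementation. The `Judge` signature, `audit_pair`, `position_bias_rate`, and the toy `first_position_judge` are all hypothetical stand-ins for a real model call.

```python
from typing import Callable

# Hypothetical judge interface: takes (first_shown, second_shown) and
# returns 1 or 2 for the preferred position, or 0 for a tie.
Judge = Callable[[str, str], int]


def audit_pair(judge: Judge, a: str, b: str) -> dict:
    """Score one pair in both orderings and map positions back to labels."""
    a_first = {1: "A", 2: "B", 0: "tie"}[judge(a, b)]  # order: A shown first
    b_first = {1: "B", 2: "A", 0: "tie"}[judge(b, a)]  # order: B shown first
    flipped = a_first != b_first
    return {
        "a_first": a_first,
        "b_first": b_first,
        "flipped": flipped,
        # A flipped verdict is treated as an effective tie, not a win.
        "verdict": "tie (unreliable)" if flipped else a_first,
    }


def position_bias_rate(results: list[dict]) -> float:
    """Fraction of audited pairs whose winner changed on the swap."""
    return sum(r["flipped"] for r in results) / len(results)


def first_position_judge(first: str, second: str) -> int:
    """Toy judge that always prefers whatever is shown first."""
    return 1


result = audit_pair(first_position_judge, "summary A", "summary B")
print(result["a_first"], result["b_first"], result["flipped"])  # A B True

# 2 flipped runs out of 7, as reported for this set
runs = [{"flipped": True}] * 2 + [{"flipped": False}] * 5
print(f"{position_bias_rate(runs):.0%}")  # 29%
```

Mapping positions back to labels before comparing is the step that matters: comparing the raw positional answers would report a judge that always answers "1" as perfectly consistent.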

Chart: Average score by criterion · mean across all sample evaluations (0–10)

Chart: Score distribution · how many evaluations fall in each total-score band

Chart: Average score by domain · mean total score per subject area

Evaluation runs


Domain	Prompt	Accuracy	Clarity	Completeness	Total
Coding	Write a SQL query to find the second-highest salary from an Employee table.	9.0	9.0	8.0	8.7
Coding	Explain what a race condition is to a junior developer.	9.0	9.0	7.0	8.3
Medical	What lifestyle changes help lower high blood pressure?	10.0	9.0	9.0	9.3
Medical	Is it safe to take ibuprofen and paracetamol together?	8.0	9.0	6.0	7.7
Legal	What is the difference between a copyright and a trademark?	9.0	10.0	7.0	8.7
Legal	Can my landlord enter my apartment without notice?	8.0	9.0	7.0	8.0
Reasoning	A model is billed at $0.003 per 1K input tokens. A single request sends 12,000 input tokens. What is the input cost for that request?	2.0	8.0	6.0	5.3
Reasoning	A classifier has 90% precision and 80% recall. What is its F1 score?	10.0	9.0	9.0	9.3
Summarization	Summarize this changelog entry in one sentence: "v2.3 adds SSO via SAML, fixes a memory leak in the export worker, and deprecates the legacy /v1 API, which will be removed in v3."	9.0	9.0	8.0	8.7
Summarization	Condense this into a tweet: a quarterly report showing 12% revenue growth driven by international expansion and a new subscription tier.	9.0	8.0	8.0	8.3
Factual QA	Which HTTP status code indicates "Too Many Requests"?	2.0	9.0	6.0	5.7
Factual QA	Is a standard JSON Web Token (JWT) encrypted by default?	10.0	9.0	10.0	9.7
Customer Support	A customer asks how to reset their password. Write a helpful reply.	9.0	10.0	9.0	9.3
Customer Support	Respond to a one-star review that says the app keeps crashing on startup.	8.0	9.0	8.0	8.3
Translation	Translate into Spanish: 'Where is the nearest train station?'	10.0	10.0	10.0	10.0
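
The per-criterion and per-domain averages charted on this page can be recomputed from the table. The sketch below assumes each total is the unweighted mean of its three criterion scores and that averages are taken over unrounded totals. Only the headline 8.36 can be checked against the page, since the chart values themselves are not shown. The helper `avg` and the variable names are hypothetical.

```python
from collections import defaultdict

# (domain, accuracy, clarity, completeness), copied from the table above
ROWS = [
    ("Coding", 9, 9, 8), ("Coding", 9, 9, 7),
    ("Medical", 10, 9, 9), ("Medical", 8, 9, 6),
    ("Legal", 9, 10, 7), ("Legal", 8, 9, 7),
    ("Reasoning", 2, 8, 6), ("Reasoning", 10, 9, 9),
    ("Summarization", 9, 9, 8), ("Summarization", 9, 8, 8),
    ("Factual QA", 2, 9, 6), ("Factual QA", 10, 9, 10),
    ("Customer Support", 9, 10, 9), ("Customer Support", 8, 9, 8),
    ("Translation", 10, 10, 10),
]


def avg(values: list[float]) -> float:
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# Per-row totals (assumed: unweighted mean of the three criteria)
totals = [avg(list(row[1:])) for row in ROWS]
overall = avg(totals)

# Mean of each criterion column across all 15 evaluations
criterion_means = {
    name: avg([row[i] for row in ROWS])
    for i, name in enumerate(("accuracy", "clarity", "completeness"), start=1)
}

# Mean total per domain
by_domain = defaultdict(list)
for row, total in zip(ROWS, totals):
    by_domain[row[0]].append(total)
domain_means = {domain: avg(vals) for domain, vals in by_domain.items()}

print(round(overall, 2))  # 8.36
for name, value in criterion_means.items():
    print(f"{name}: {value:.2f}")
for domain, value in domain_means.items():
    print(f"{domain}: {value:.2f}")
```

The two lowest totals (5.3 and 5.7) both come from rows with an accuracy score of 2.0, which is what pulls the Reasoning and Factual QA domain means below the rest.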