LLM-as-judge · stateless

Score model outputs against a rubric — and check the judge for bias.

Per-criterion rubric scoring on accuracy, clarity, and completeness, plus pairwise comparisons run in both orderings to catch position bias. 15 scored evaluations across 8 domains are below — or run the judge on your own output.

No database · Works with no API key · Deployed on Vercel
Specimen: RAG · Grounding

Prompt

Using only the context below, answer the question. Context: "Returns are accepted within 30 days of delivery for unused items in original packaging. Refunds are issued to the original payment method within 5 business days of the returned item being received." Question: What is the refund window, and how is the refund paid?

Model output

You can request a refund within 60 days of delivery, and the amount is credited back to your original payment method within 5 business days of us receiving the item.

Judged by anthropic/claude-sonnet-4.6 — the output is scored as-is, never regenerated.

Rubric verdict
Total score: 6.3 / 10 (Mid)

Accuracy: 3.0/10

Not grounded in the context: the passage states a 30-day return window, but the answer asserts 60 days — a direct contradiction of the source. The payment-method and 5-business-day details are correct, so this is a confident, half-right answer, which is the dangerous kind.

Clarity: 9.0/10

Fluent, well-structured, and reads with total confidence — precisely why an ungrounded number slips past a human skim.

Completeness: 7.0/10

Addresses both parts of the question (window and payment method), but anchors the headline figure to a value that is not in the provided context.
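
The page does not say how the total is derived from the three criteria. A minimal sketch, assuming the total is the unweighted mean of the per-criterion scores rounded to one decimal (an assumption that happens to reproduce the 6.3 shown for this specimen); `total_score` is a hypothetical helper name:

```python
from statistics import mean


def total_score(criteria: dict[str, float]) -> float:
    """Unweighted mean of the per-criterion scores, rounded to one decimal.

    Assumption: the page does not document its aggregation rule; a plain
    mean is used here because it matches the displayed total.
    """
    return round(mean(criteria.values()), 1)


specimen = {"accuracy": 3.0, "clarity": 9.0, "completeness": 7.0}
print(total_score(specimen))  # 6.3
```

Under this assumption a high clarity score can pull a poorly grounded answer into the middle band, so the per-criterion breakdown is worth reading alongside the total.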

Evaluations: 15 (across 8 domains)

Average total: 8.36 (mean score, 0–10)

Pairwise checks: 7 (run in both orderings)

Position-bias rate: 29% (flipped on reorder)

Position-bias audit

Every pairwise comparison is scored twice with the outputs swapped. 2 of 7 runs in this set changed their winner on the swap — those verdicts are flagged unreliable. One is shown below.

Bias detected — the verdict flips with order

The judge picked Output A when A was shown first, but Output B when B was shown first. Because the verdict changed with presentation order, the result is unreliable — treat these outputs as effectively tied rather than trusting either single ordering.

Order: A shown first

1 = A · 2 = B
Accuracy: Tie

Both faithfully capture all four points (speed, screen, battery, heat) with no distortion.

Clarity: Output A

A is marginally tighter and more direct.

Completeness: Tie

Both include every salient point from the review.

Overall: Output A

Near-tie; A edges it on concision when shown first.

Order: B shown first

1 = B · 2 = A
Accuracy: Tie

Again both are fully accurate summaries.

Clarity: Output B

Shown first, B’s slightly richer phrasing now reads as the more polished option.

Completeness: Tie

Both remain complete.

Overall: Output B

The verdict flipped to B purely from presentation order — a textbook position-bias case on a genuine near-tie.

Judged by anthropic/claude-sonnet-4.6 · each pair scored in both orderings.
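
The audit above can be sketched in a few lines. This is an illustration under stated assumptions, not the page's actual implementation: the judge is modelled as a plain function that returns a position ("1", "2", or "tie"), and `audit_pair`, `position_bias_rate`, and `first_position_judge` are hypothetical names. A real judge would be a model call; the toy judge here always prefers whatever it sees first, which makes the flip easy to observe.

```python
from typing import Callable

# A judge receives (first_output, second_output) and names a position,
# not a label: "1", "2", or "tie".
Judge = Callable[[str, str], str]


def audit_pair(judge: Judge, a: str, b: str) -> dict:
    """Score one pair in both orderings and map positions back to labels."""
    a_first = {"1": "A", "2": "B", "tie": "tie"}[judge(a, b)]  # 1 = A, 2 = B
    b_first = {"1": "B", "2": "A", "tie": "tie"}[judge(b, a)]  # 1 = B, 2 = A
    flipped = a_first != b_first
    return {
        "a_first": a_first,
        "b_first": b_first,
        "flipped": flipped,
        # A verdict that changes with order is treated as a tie.
        "winner": "tie" if flipped else a_first,
    }


def position_bias_rate(results: list[dict]) -> float:
    """Fraction of audited pairs whose winner changed on the swap."""
    return sum(r["flipped"] for r in results) / len(results)


def first_position_judge(first: str, second: str) -> str:
    """Toy stand-in for a model call: always prefers the first output."""
    return "1"


result = audit_pair(first_position_judge, "summary A", "summary B")
print(result["a_first"], result["b_first"], result["winner"])  # A B tie

# 2 flipped runs out of 7, as reported for this set.
runs = [{"flipped": True}] * 2 + [{"flipped": False}] * 5
print(round(position_bias_rate(runs) * 100))  # 29
```

Mapping positions back to labels before comparing is the step that matters: comparing the raw "1"/"2" answers would report a perfectly position-biased judge as consistent.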

[Chart: Average score by criterion. Mean across all sample evaluations (0–10)]

[Chart: Score distribution. How many evaluations fall in each total-score band]

[Chart: Average score by domain. Mean total score per subject area]

Evaluation runs


| Domain | Prompt | Accuracy | Clarity | Completeness | Total |
|---|---|---|---|---|---|
| Coding | Write a SQL query to find the second-highest salary from an Employee table. | 9.0 | 9.0 | 8.0 | 8.7 |
| Coding | Explain what a race condition is to a junior developer. | 9.0 | 9.0 | 7.0 | 8.3 |
| Medical | What lifestyle changes help lower high blood pressure? | 10.0 | 9.0 | 9.0 | 9.3 |
| Medical | Is it safe to take ibuprofen and paracetamol together? | 8.0 | 9.0 | 6.0 | 7.7 |
| Legal | What is the difference between a copyright and a trademark? | 9.0 | 10.0 | 7.0 | 8.7 |
| Legal | Can my landlord enter my apartment without notice? | 8.0 | 9.0 | 7.0 | 8.0 |
| Reasoning | A model is billed at $0.003 per 1K input tokens. A single request sends 12,000 input tokens. What is the input cost for that request? | 2.0 | 8.0 | 6.0 | 5.3 |
| Reasoning | A classifier has 90% precision and 80% recall. What is its F1 score? | 10.0 | 9.0 | 9.0 | 9.3 |
| Summarization | Summarize this changelog entry in one sentence: "v2.3 adds SSO via SAML, fixes a memory leak in the export worker, and deprecates the legacy /v1 API, which will be removed in v3." | 9.0 | 9.0 | 8.0 | 8.7 |
| Summarization | Condense this into a tweet: a quarterly report showing 12% revenue growth driven by international expansion and a new subscription tier. | 9.0 | 8.0 | 8.0 | 8.3 |
| Factual QA | Which HTTP status code indicates "Too Many Requests"? | 2.0 | 9.0 | 6.0 | 5.7 |
| Factual QA | Is a standard JSON Web Token (JWT) encrypted by default? | 10.0 | 9.0 | 10.0 | 9.7 |
| Customer Support | A customer asks how to reset their password. Write a helpful reply. | 9.0 | 10.0 | 9.0 | 9.3 |
| Customer Support | Respond to a one-star review that says the app keeps crashing on startup. | 8.0 | 9.0 | 8.0 | 8.3 |
| Translation | Translate into Spanish: 'Where is the nearest train station?' | 10.0 | 10.0 | 10.0 | 10.0 |
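
The summary figures on this page can be recomputed from the table. The sketch below is a cross-check rather than the page's own code: it assumes each row's total is the unweighted mean of its three criteria, and the variable names are illustrative. The scores are copied from the rows above.

```python
from collections import defaultdict
from statistics import mean

# (domain, accuracy, clarity, completeness), in table order
rows = [
    ("Coding", 9, 9, 8), ("Coding", 9, 9, 7),
    ("Medical", 10, 9, 9), ("Medical", 8, 9, 6),
    ("Legal", 9, 10, 7), ("Legal", 8, 9, 7),
    ("Reasoning", 2, 8, 6), ("Reasoning", 10, 9, 9),
    ("Summarization", 9, 9, 8), ("Summarization", 9, 8, 8),
    ("Factual QA", 2, 9, 6), ("Factual QA", 10, 9, 10),
    ("Customer Support", 9, 10, 9), ("Customer Support", 8, 9, 8),
    ("Translation", 10, 10, 10),
]

# Assumption: a row's total is the plain mean of its three criteria.
totals = [mean(r[1:]) for r in rows]
overall = mean(totals)

by_criterion = {
    name: mean([r[i] for r in rows])
    for i, name in enumerate(("accuracy", "clarity", "completeness"), start=1)
}

domain_totals = defaultdict(list)
for (domain, *_), total in zip(rows, totals):
    domain_totals[domain].append(total)
by_domain = {d: mean(ts) for d, ts in domain_totals.items()}

print(round(overall, 2))  # 8.36, matching the "Average total" card
print({k: round(v, 2) for k, v in by_criterion.items()})
print({k: round(v, 2) for k, v in by_domain.items()})
```

The criterion means come out to roughly 8.13 for accuracy, 9.07 for clarity, and 7.87 for completeness. The two accuracy scores of 2.0 account for the lowest domain averages (Reasoning and Factual QA), while clarity never drops below 8.0.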