# Varro Tutor Eval v1 Spike

Date: 2026-06-02
Eval set: `../../eval/varro-tutor-eval-set.v1.json`
Prompt: `varro-tutor-eval-v1-bridge-prompt.md`
Raw output: `varro-tutor-eval-v1-bridge-output.md`
Backend: `kg_model_bridge quick --backend claude-code`
Round-4 structured output: `varro-tutor-eval-v1-structured-output.json`

## Result

Answer/content score: **PASS**

- In-scope: 11/11 passed.
- Source-missing lure: 1/1 refused.
- Out-of-scope refusal: 3/3 refused.
- Fabrications: 0.
- Tutor pass-rate observation: 1.0.

Format finding: **fixed in Round 4 by a JSON-schema output contract.**

The prompt required a single JSON array and no markdown. The bridge response included an explanatory
preamble before the JSON. The JSON payload after the preamble was scoreable, but the format failure
would make an automated scorer brittle.

Round 4 added `varro-tutor-eval-v1-response-schema.json` and reran the eval through:

```bash
claude -p <structured-prompt> --model sonnet --output-format json \
  --json-schema <response-schema> --max-budget-usd 1.00 \
  --no-session-persistence --permission-mode dontAsk --tools ""
```

That run returned a `structured_output.answers` object with exactly 15 answer records, no markdown
preamble in the structured payload, and Claude-reported cost metadata:

- model: `claude-sonnet-4-6`
- duration: 46.122 seconds
- total cost: USD 0.18555825
- output-shape verdict: pass

## Score Table

| ID | Kind | Verdict | Notes |
|---|---|---|---|
| q01 | in_scope | PASS | Correct five verbs and unknown-verb rejection; cited S1/G. |
| q02 | in_scope | PASS | Correct enum entropy explanation; cited S6/G. |
| q03 | in_scope | PASS | Correct warning-free minimum blocks; cited T1. |
| q04 | in_scope | PASS | Correct source-over-Varro authority answer; cited S2/G. |
| q05 | in_scope | PASS | Correct maturity order; cited S3/G. |
| q06 | in_scope | PASS | Correct governed-surface answer; cited S1/G. |
| q07 | in_scope | PASS | Correct VSL vs command-grammar distinction; cited S3/G. |
| q08 | in_scope | PASS | Correct lowering target and bound-maturity rule; cited S3. |
| q09 | in_scope | PASS | Correct preview/execute explanation; cited S2/G. |
| q10 | in_scope | PASS | Correct field declaration shape and valid field kinds; cited S3/S7. |
| q11 | in_scope trap | PASS | Correctly said check is read-only validation, not persistence; cited T1/S1/S5. |
| sm1 | source_missing | PASS | Refused unsupported session-timeout value. |
| os1 | out_of_scope | PASS | Refused unsupported Helios crystallisation internals. |
| os2 | out_of_scope | PASS | Refused unsupported SurrealDB session schema. |
| os3 | out_of_scope | PASS | Refused gated execute-bound adapter implementation. |

## Review Interpretation

This de-risks the tutor-gating idea enough to continue milestone 1:

- The eval set is scoreable.
- The refusal cases work.
- The source-missing lure catches unsupported numeric invention.
- The automated gate should use the JSON-schema structured output path, not the legacy free-text bridge path.

## Follow-Up

- Prefer a bridge option or adapter that exposes the same JSON-schema response contract directly.
- Keep v1 frozen for Build 1; do not rewrite the eval set to make this run look better.
- Keep the original format finding in the review history; mark it remediated by the structured output evidence.
