AutoRAC system map
How the harness actually works
This is the operational path from official source text to a promoted RAC file. It includes the deterministic CI checks, semantic review gates, import discipline, run ledgers, provenance controls, and the new outer-loop autoresearch gate that only keeps prompt mutations when they beat a separate holdout.
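The outer-loop gate described above can be sketched as a small acceptance rule. This is a minimal illustration, not the harness's actual code: the `Scores` type, the `keep_mutation` function, and the `min_holdout_gain` parameter are hypothetical names, and the holdout values are made up for the example. Only the training scores (73.978 and 99.979) come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical score pair: one training suite, one separate holdout suite."""
    training: float
    holdout: float


def keep_mutation(baseline: Scores, candidate: Scores,
                  min_holdout_gain: float = 0.0) -> bool:
    """Keep a prompt mutation only if training improves AND the holdout clears the bar.

    With min_holdout_gain=0.0 the holdout must merely not regress;
    a positive value requires the candidate to beat the holdout by that margin.
    """
    improves_training = candidate.training > baseline.training
    holdout_ok = (candidate.holdout - baseline.holdout) >= min_holdout_gain
    return improves_training and holdout_ok


# Training scores are the ones reported on this page; holdout scores are invented.
baseline = Scores(training=73.978, holdout=90.0)
solid = Scores(training=99.979, holdout=90.0)    # holdout unchanged
overfit = Scores(training=99.979, holdout=85.0)  # holdout regressed

print(keep_mutation(baseline, solid))    # True
print(keep_mutation(baseline, overfit))  # False
```

Keeping the holdout separate from the training suite is what lets the gate reject a mutation that only overfits the cases it was tuned on.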
Recent proof points
The page is still curated, but these are current system facts rather than generic claims. They are the shortest path to what actually changed this week.
Accepted autoresearch mutation
The first kept outer-loop prompt mutation fixed claimant-or-partner disjunction handling and improved training score from 73.978 to 99.979 without regressing the separate final-review holdout.
55-case UK bulk wave promoted
Wave 20 cleared the real scale gate, then landed 55 new Pension Credit regulation 15 leaves in rac-uk.
Current UK corpus state
The promoted UK repo now has 146 RAC files, companion tests complete, zero embedded scalar violations, and zero numeric-occurrence backlog.
Replace-mode Atlas sync repaired
UK sync no longer depends on append-only recovery. Managed uk/legislation rows are deleted and republished in normal mode.
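The difference between append-only recovery and replace-mode sync can be illustrated with an in-memory SQLite table standing in for the Atlas store. Everything here is an assumption made for the sketch: the `rules` table, its columns, the `replace_sync` function, and the example paths are hypothetical, and the real Atlas schema may look nothing like this.

```python
import sqlite3


def replace_sync(conn: sqlite3.Connection, prefix: str, rows: list[tuple[str, str]]) -> None:
    """Delete every managed row under `prefix`, then republish `rows`, in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "DELETE FROM rules WHERE substr(path, 1, length(?)) = ?",
            (prefix, prefix),
        )
        conn.executemany("INSERT INTO rules (path, body) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (path TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO rules (path, body) VALUES (?, ?)",
    [
        ("uk/legislation/stale-rule", "old"),      # hypothetical managed row
        ("us/statute/other-rule", "untouched"),    # outside the managed prefix
    ],
)
conn.commit()

replace_sync(conn, "uk/legislation/", [("uk/legislation/fresh-rule", "new")])

paths = [row[0] for row in conn.execute("SELECT path FROM rules ORDER BY path")]
print(paths)  # ['uk/legislation/fresh-rule', 'us/statute/other-rule']
```

Running the delete and the republish in a single transaction means a failed publish leaves the previous rows in place instead of a half-empty table.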
Pipeline explorer
Each stage below maps to something concrete in the current harness: code in the repo, files in the eval workspace, and a measurable pass-fail outcome.
Stage 01
Ingest official source text first
AutoRAC starts from real source documents, exact section identifiers, and copied slices. The source layer is never stubbed.
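As a rough sketch of what reading a copied slice could look like, the snippet below parses the regulation 13 sample shown in this stage and pulls out its section identifier, number, and text. The `read_slice` function and the returned dictionary shape are hypothetical. The sketch also assumes the slice carries no XML namespace, as in the sample; real legislation XML often does, and the lookups would then need namespace-qualified names.

```python
import xml.etree.ElementTree as ET

# The regulation 13 sample from this page, copied verbatim.
SOURCE_SLICE = """<section eId="regulation-13">
<num>13.</num>
<content>
<p>If the claimant's benefit is less than 10 pence per week,
the amount shall be rounded up to 10 pence per week.</p>
</content>
</section>"""


def read_slice(xml_text: str) -> dict[str, str]:
    """Extract the exact section identifier and normalized text from a source slice."""
    root = ET.fromstring(xml_text)
    e_id = root.get("eId")
    if not e_id:
        # No stubbing: a slice without an identifier is rejected, not guessed.
        raise ValueError("source slice has no eId attribute")
    num = root.findtext("num", default="").strip()
    # Collapse whitespace so line breaks in the copied slice do not alter the text.
    paragraphs = [" ".join("".join(p.itertext()).split()) for p in root.iter("p")]
    return {"eId": e_id, "num": num, "text": " ".join(paragraphs)}


record = read_slice(SOURCE_SLICE)
print(record["eId"])   # regulation-13
print(record["text"])
```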
<section eId="regulation-13">
<num>13.</num>
<content>
<p>If the claimant's benefit is less than 10 pence per week,
the amount shall be rounded up to 10 pence per week.</p>
</content>
</section>
Failure pattern browser
These are real failure classes from recent UK, US, and Colorado work. Each one maps to a concrete harness or repo validator change. Atlas is where per-rule encoding records and agent logs are meant to be inspected, while autoresearch sits one level up and decides whether prompt-surface changes should be kept at all.
Repeated numbers must all appear as named scalars
failure mode
A source said 55% twice or repeated a threshold amount, but the RAC collapsed it into a single generic helper.
harness response
Numeric occurrence coverage now counts substantive source-number repeats and fails when the RAC under-represents them.
where it is enforced
recent example
UC taper rate and repeated UK benefit thresholds
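The coverage idea in this failure class can be sketched as a count comparison: how often each number appears in the source versus how many named scalars in the RAC encode it. This is an illustration under stated assumptions, not the validator itself. The function names, the `rac_scalars` mapping (scalar name to source literal), and the scalar names are hypothetical, and the sketch treats every numeric literal as substantive, whereas the real check presumably filters some out.

```python
import re
from collections import Counter

# Matches integers, decimals, and percentages such as "10", "0.55", or "55%".
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def source_number_counts(text: str) -> Counter:
    """Count every occurrence of each numeric literal in the source text."""
    return Counter(NUMBER.findall(text))


def under_represented(source_text: str, rac_scalars: dict[str, str]) -> dict[str, int]:
    """Return {literal: missing_count} for numbers the RAC encodes fewer times than the source."""
    needed = source_number_counts(source_text)
    have = Counter(rac_scalars.values())
    return {lit: needed[lit] - have[lit] for lit in needed if have[lit] < needed[lit]}


source = (
    "If the claimant's benefit is less than 10 pence per week, "
    "the amount shall be rounded up to 10 pence per week."
)

# Collapsed encoding: one generic helper for both occurrences of 10.
collapsed = {"ten_pence": "10"}
# Full encoding: one named scalar per source occurrence.
full = {"minimum_threshold_pence": "10", "rounded_up_amount_pence": "10"}

print(under_represented(source, collapsed))  # {'10': 1}
print(under_represented(source, full))       # {}
```

A non-empty result is the failing case: the source repeats a number that the RAC represents fewer times.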
Run ledger and provenance files
The harness now produces enough structured output that this section can eventually be generated directly from exported metadata. For the moment the artifact list is still curated, but it uses concrete examples from the accepted autoresearch run and the latest UK bulk promotion rather than abstract file shapes. The actual per-encoding logs and RAC records still belong in Atlas.
suite-run.json
Live run state, runner metadata, progress counts, and current status. Updated while the suite is running.
{
  "name": "UK wave 20 bulk seed",
  "runner_backend": "codex",
  "status": "completed",
  "cases_total": 55,
  "cases_completed": 55,
  "started_at": "2026-04-08T00:12:44Z"
}
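A consumer of this file might load it and sanity-check the progress counts before trusting the status. The sketch below is one possible reading, using the example ledger above verbatim. The `summarize` function and its consistency rules are assumptions. Only the field names and the `"completed"` status value come from this page; other status values are not documented here.

```python
import json

LEDGER_TEXT = """{
  "name": "UK wave 20 bulk seed",
  "runner_backend": "codex",
  "status": "completed",
  "cases_total": 55,
  "cases_completed": 55,
  "started_at": "2026-04-08T00:12:44Z"
}"""


def summarize(ledger: dict) -> dict:
    """Check that the progress counts are consistent, then report progress as a fraction."""
    total = ledger["cases_total"]
    done = ledger["cases_completed"]
    if not 0 <= done <= total:
        raise ValueError(f"cases_completed={done} is outside 0..{total}")
    if ledger["status"] == "completed" and done != total:
        # Assumed rule: a completed run should have finished every case.
        raise ValueError("status is 'completed' but some cases are not done")
    return {
        "name": ledger["name"],
        "progress": done / total if total else 0.0,
        "finished": ledger["status"] == "completed",
    }


summary = summarize(json.loads(LEDGER_TEXT))
print(summary)
```

Because the file is updated while the suite runs, a reader polling it would see `cases_completed` climb toward `cases_total` before `status` changes.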