BenchSlap Defense in Depth: Five Overlapping Hallucination-Proof Layers

By Richard L. Sanders, Utah Bar #15728. Companion to whitepaper canon papers 01, 05, 06, 08. Every claim in this paper cites the file path, line number, and live database state that proves it. Verify in one command at the end.

Why this paper exists

The first nine papers in the BenchSlap canon each describe a single layer of the architecture. Paper 01 covers the closed-corpus principle. Paper 05 covers the V4 advocacy algorithm. Paper 06 covers Citation Gravity. Paper 08 covers AEGIS PRIME. Read in isolation, any single paper can be misread as "so that's the trick — that's how you stop hallucinations." That reading is wrong. Every individual layer of the BenchSlap stack would, if it were the only defense, eventually let something through.

The system's claim of hallucination-proof output rests on the overlap — that when a citation reaches the user, every layer has independently said yes, and the failure mode of any single layer is caught by at least one other. This paper documents the five layers, the order they fire in, what each one catches that the others don't, and — most importantly — how each one fails closed.

The five layers, at a glance

#	Layer	What it enforces	Failure mode
1	Deterministic non-RAG retrieval	Verification is a database lookup, not a model call. No invented citations.	Returns no match rather than a plausible-looking match.
2	Storage-layer invariants	Twelve CHECK constraints + a hygiene trigger refuse to store malformed authorities.	Bad INSERT raises a constraint violation; the row never exists.
3	Discrete-math truth table (V4)	Every argument maps to a 30-state truth table whose default is `CRITICAL_FAIL`.	Anything not matched falls through to BLACK and is blocked.
4	AEGIS content-hash pinning	SHA-256 of canonical text + 5-gram shingle signature on every authority. Verify-time hash recomputation.	Hash mismatch hard-blocks; insufficient containment soft-blocks.
5	Closed-form logic gate (default→BLOCK)	Assume hallucination unless deterministic proof grants VERIFIED. Exceptions become hard blocks.	Error path and failure path produce the same outcome: BLOCKED.

Layer 1 — Deterministic non-RAG retrieval

Retrieval-augmented generation (RAG) is the industry-standard approach: the model is given the query, a retrieval system pulls plausibly-relevant authority from a vector store, and the model is asked to write a brief using the retrieved material. The model still generates the citation tokens. When the model emits "Smith v. Jones, 123 P.3d 456," it is generating those tokens because they statistically follow the retrieved context, not because it is reading from a row. RAG reduces hallucination rates; it does not eliminate them.

BenchSlap's verification path does not call a model. The verification function runContentMatchCheck() at lib/verification-pipeline.js:685:

Receives the citation extracted from the model's output.
Calls dbCache.getOpinionContentBundle() to fetch the row.
If no row exists, returns null. The model's claim that the cite exists is rejected by absence.
If a row exists, recomputes SHA-256 of the stored text and compares against the pinned hash.
Runs containment scoring between the model's claimed holding and the actual opinion text.

No probabilistic model is invoked. The output is one of six discrete verdicts: EXACT, FUZZY, PARTIAL, UNVERIFIED, INSUFFICIENT_CLAIM, or CONTENT_TAMPER. All computed by SQL and SHA-256, not by an LLM. Information is preserved at the byte level for the hash, and at the 5-gram level for fuzzy match. A vector-store RAG system compresses authority into a lossy embedding; BenchSlap stores the actual opinion text plus a deterministic shingle signature (lib/authority-hash.js:140–158).

Layer 2 — Storage-layer invariants

The opinion_library table has twelve CHECK constraints. Each one corresponds to a class of bug that escaped JavaScript validation at least once and reached production. The full list, verified live against the production database on 2026-05-13:

opinion_library:
  case_name_not_citation              case_name ≠ neutral_citation
  case_name_not_filename              case_name doesn't end in .pdf/.docx/.html/.txt
  chk_decision_date_sane              between 1800-01-01 and CURRENT_DATE + 30d
  content_hash_is_sha256              content_hash matches ^[a-f0-9]{64}$
  content_length_matches              content_length = length(opinion_text)
  date_citation_year_consistency      citation year matches decision_date year ±1
  opinion_case_name_not_bare_fragment ≥4 chars + whitespace OR "in re"/"matter of"
  opinion_case_name_not_boilerplate   not "this opinion is subject to ..."
  opinion_case_name_not_footnote_snippet  not "see  ..."
  opinion_case_name_not_signal_word   not cf/accord/but see/e.g./i.e./compare
  opinion_text_min_length             length(opinion_text) ≥ 100
  opinion_text_not_sentinel_stub      not the SENTINEL placeholder

rules_library:
  rule_content_hash_is_sha256         content_hash matches ^[a-f0-9]{64}$
  rule_content_length_matches         content_length = length(full_text)
  rule_text_min_length                length(full_text) ≥ 10

The hygiene trigger at migrations/211_opinion_text_hygiene_trigger.sql fires BEFORE every INSERT or UPDATE of opinion_text and: strips null bytes, normalizes CRLF, decodes 17 named HTML entities and 12 numeric ones, and repairs 20 common UTF-8 mojibake patterns. Idempotent — re-running on clean text is a no-op. Enforced at the storage boundary, not in application code, which means every harvester (today's 10+ scripts plus every future one) is subject to it without needing to remember to call a normalizer. Triggers cannot be bypassed by future authors who don't know they exist.

The tests/authority-hash.test.js suite contains a property-based test: "any long enough canonical substring of authority returns EXACT." Property-based testing means the framework generates random inputs and checks the invariant holds for every one. As of 2026-05-13, this property has been checked across 323 test cases in the hash + structural-check suites, all passing.

Layer 3 — Discrete-math truth table (V4)

The V4 algorithm (lib/advocacy-algorithm.js) decomposes every argument into three discrete bits:

bit_L (Law): 0 = no authority cited; 0.5 = legal reasoning present; 1 = authority cited and verified.
bit_F (Fact): 0 = not in record; 0.5 = asserted unverified; 1 = confirmed.
bit_I (Logic): 0 = invalid form; 0.5 = arguable; 1 = sound deductive form.

The combinations populate a 30-state truth table at lib/advocacy-algorithm.js:140–176. Selected states:

'111'      → VERIFIED       (GREEN)  Sound argument
'110.5'    → PERMISSIBLE    (BLUE)   Arguable inference
'110'      → NON_SEQUITUR   (RED)    Conclusion does not follow
'101'      → FABRICATION    (BLACK)  Fact not in record
'100'      → FABRICATION    (BLACK)  Invalid fact + broken reasoning
'011'      → AUTHORITY_NEEDED (YELLOW) No authority cited
'001'      → CRITICAL_FAIL  (BLACK)  Only logic valid, both premises fail
'000'      → CRITICAL_FAIL  (BLACK)  Total failure
...
'default'  → CRITICAL_FAIL  (BLACK)  Multiple failures detected

The default state — what the system falls into when no row matches — is CRITICAL_FAIL with a BLACK flag. BLACK flags are non-overridable hard blocks; the post-stream gate at lib/post-stream-gate.js:239 treats them as failures rather than warnings.

This is the "fall into success / fall into block" property. Every code path that goes wrong drops to a state that blocks output. lib/post-stream-gate.js:301–313 makes this explicit:

// GRAVITY: V4 scan crash = BLOCK. Fail CLOSED per user mandate. Previous FIX #54 was WRONG — it returned passed:true, letting unverified content through.

This pattern is implemented in three separate failure paths in the gate file (lines 301, 702–703, 737). All three convert exceptions into BLOCK. The architecture's commitment: there is no path from error_thrown to content_emitted_to_user.

Layer 4 — AEGIS content-hash pinning + AEGIS PRIME structural verification

AEGIS (the content layer)

Every authority has, at ingest time, a SHA-256 hash of the canonical opinion text and a 5-gram shingle signature. As of 2026-05-13:

3,231,101 opinions in the corpus, 100% content-hash pinned.
1,943,424 rules in the corpus, 100% content-hash pinned.

Two things happen at verify time:

verifyIntegrity(stored_hash, current_text) at lib/verification-pipeline.js:738 — recomputes the hash of the stored opinion text and compares to the hash captured at ingest. Mismatch = CONTENT_TAMPER hard-block (lib/post-stream-gate.js:49). There is no soft-warning path for tampering.
verifyContainment(claimed_holding, opinion_text) at lib/verification-pipeline.js:784 — substring or 5-gram shingle containment ≥0.7 → FUZZY; ≥0.3 → PARTIAL; <0.3 → UNVERIFIED.

AEGIS PRIME (the structural layer)

For Utah and expanding outward, every opinion has had its disposition, panel vote, holding binding weight, and treatment graph pre-extracted into authority_analysis. As of 2026-05-13: 33,673 opinions with disposition explicitly extracted.

When the model emits "the court affirmed the conviction in Smith v. Smith", AEGIS PRIME does a single SQL SELECT against authority_analysis.disposition. If the stored disposition is REVERSED, the claim hard-blocks via DISPOSITION_MISMATCH. No second LLM. No probabilistic match. Set membership, deterministic, decided.

The hard-block matrix at lib/post-stream-gate.js:48–55:

const HARD_BLOCK_CODES = Object.freeze([
    'CONTENT_TAMPER',
    'DISPOSITION_MISMATCH',
    'ATTRIBUTION_MISMATCH',
    'BINDING_WEIGHT_MISMATCH',
    'SUPERSEDED_CASE',
    'OVERRULED_CASE'
]);

Object.freeze() makes the list immutable at runtime — even an attacker with arbitrary JavaScript execution cannot extend or replace the set of hard-block codes without restarting with patched source.

Citation Gravity: the corpus self-audit

Layer 4 has a self-audit loop: scripts/citation-gravity-v3.js scans every opinion in the corpus, extracts every citation it contains, ranks them by inbound citation count, and identifies the most-cited cases that are NOT in our corpus. Those are the bedrock authorities that everything else cites — if any are missing, the corpus has a gravitational hole. The script is memory-bounded (each citation INSERTed into a temp table and aggregated server-side rather than held in a hashmap), so it completes on the full 3.2M-row corpus without OOM.

After the most recent landmark sweep, 191 of the top 200 missing-bedrock SCOTUS cases were ingested. The corpus is not a static asset; it is a system that audits itself for completeness and corrects.

Layer 5 — Closed-form logic gate (default → BLOCK)

The default state of the entire pipeline is: every output is a hallucination until deterministic proof grants VERIFIED.

This is not a model whose confidence threshold has been tuned. It is not a multi-AI voting system. It is a Boolean gate. The output is VERIFIED if and only if:

The citation resolves to a real row (Layer 1).
That row satisfies all 12 CHECK constraints (Layer 2).
The argument's truth-table tuple maps to a non-BLACK state (Layer 3).
The hash recomputes to the pinned value (Layer 4 hash).
Every applicable structural claim matches the extracted fact (Layer 4 PRIME).

If any is false — including by throwing an exception — the gate falls to BLOCK. The mandate at lib/post-stream-gate.js:102–103:

// REMOVED: ADVISORY_A4_TOOLS — previously downgraded BLACK flags to warnings for conversational tools. User mandate: ALL tools enforce equally. BLACK = BLOCK, no exceptions, no advisory mode.

Most software defaults to PASS when the verifier is unavailable: SSO down → ALLOW, fraud score missing → APPROVE, rate-limiter offline → NO_LIMIT. For citation verification in legal output, the inverse is the only defensible choice. Asymmetric stakes demand asymmetric defaults. The cost of emitting an unverified citation that turns out to be hallucinated is sanctions. The cost of emitting BLOCK when the verifier is briefly unavailable is the user retrying in five seconds.

Five-layer overlap, illustrated

For a hallucinated citation to reach the user, it would need to:

Resolve to a row that exists (Layer 1 passes).
Whose row satisfies all storage invariants (Layer 2 passes).
Whose argument shape happens to map to a non-BLACK truth-table state (Layer 3 passes).
Whose stored hash recomputes correctly (Layer 4 hash passes).
Whose surrounding context happens to share ≥70% of a 5-gram shingle vocabulary with the opinion (Layer 4 containment passes).
Whose structural claims match the pre-extracted facts (Layer 4 PRIME passes).
Without throwing any exception anywhere in the pipeline (Layer 5 default doesn't fire).

Seven independent conditions, each of which must be true. The probability that a hallucinated citation accidentally satisfies all seven is the product of seven small probabilities. That is the architectural definition of defense in depth.

The bench's question, answered

When the court asks "counsel, what verification did you perform on the citations in this filing?" the attorney using BenchSlap has a concrete, reviewable answer:

Every citation was resolved against a 3,231,101-opinion closed corpus (Layer 1).
Every row in that corpus was admitted through twelve CHECK constraints and the hygiene trigger (Layer 2).
Every argument was decomposed into a (Law × Fact × Logic) tuple and assigned a verdict from a 30-state truth table whose default is CRITICAL_FAIL (Layer 3).
Every cited authority's stored text was hash-recomputed at verify time. The pinned SHA-256 matched. No CONTENT_TAMPER block fired (Layer 4 hash).
Every surrounding-context claim achieved at least 0.7 5-gram shingle containment. No UNVERIFIED warning fired (Layer 4 containment).
Every disposition / panel-attribution / binding-weight claim matched the pre-extracted structural fact. No DISPOSITION_MISMATCH block fired (Layer 4 PRIME).
The verification path completed without any exception. No fail-closed block fired (Layer 5).

The verification record is available as an HMAC-SHA256-signed JSON certificate at /api/verify-certificate. Each certificate carries the citation, verdict, source tier, content hash, timestamp, and a nonce; the HMAC signature is computed over the canonically-sorted JSON payload with the server's secret. An attorney can download the certificate at the time of filing, save it with the brief, and present it to the bench. The court can re-verify the certificate by POSTing the JSON back to /api/verify-certificate/verify (which recomputes the HMAC and reports match / no-match) or by re-submitting the citation and confirming the verdict reproduces. The verification kernel is open-source at lib/authority-hash.js, lib/verification-pipeline.js, lib/post-stream-gate.js, lib/advocacy-algorithm.js, and the certificate path at routes/verify-certificate.js.

Empirical results — placebo-controlled experiments

The architectural claims above are not statements of intent; they are measurable invariants of the running system. On 2026-05-13 we ran a battery of double-blind placebo-controlled experiments. In each, the path under test received a randomized mix of POSITIVE controls (known-real, should pass) and PLACEBOS (known-fabricated/tampered, should fail), shuffled so the SUT had no access to the ground-truth labels. Verdicts were tallied only after the run.

Experiment 1 — Hash tamper detection (Layer 4 AEGIS integrity)

Inputs: 20 unmodified opinions + 20 single-character-tampered opinions, shuffled.
True positives (tampers caught): 20 / 20
True negatives (clean accepted): 20 / 20
False positives: 0; false negatives: 0

Finding: 100% precision and recall on tamper detection.

Experiment 2 — Containment verdict classification

Eight labeled cases spanning EXACT / FUZZY / PARTIAL / UNVERIFIED / INSUFFICIENT_CLAIM were shuffled and run through verifyContainment(). All correctly classified per the AEGIS thresholds (≥0.7 FUZZY, ≥0.3 PARTIAL, <0.3 UNVERIFIED, <10 chars INSUFFICIENT).

Calibration measurement: a single-word paraphrase of a 14-word sentence produces 0.286 5-gram shingle overlap — just below the 0.3 PARTIAL threshold. The system errs toward UNVERIFIED on paraphrase, which is the architecturally-correct error to make for sanctions defense.

Experiment 3 — V4 truth-table coverage (Layer 3)

Every defined state in ADVOCACY_STATES has a valid status + flag from the closed vocabularies. 30+ states. The default cell exists, maps to CRITICAL_FAIL, carries the BLACK flag. At least one GREEN state exists; at least one BLACK state exists. The architectural invariant — default falls to BLOCK, not PASS — is confirmed at the data-structure level.

Experiment 4 — Hash determinism + collision resistance

100 random inputs (base64-encoded random bytes, varying lengths). hash(x) === hash(x) for all 100. Zero collisions. All hashes match ^[a-f0-9]{64}$. Whitespace + case variations of the same canonical text produce identical hashes — the right semantic for "same opinion, formatted differently."

Experiment 5 — Shingle signature properties

A legal-doctrine sentence and a Lorem Ipsum sentence of similar length share 0.0000 5-gram overlap. The 5-gram space cleanly separates domains of language.

Experiment 6 — Storage-layer placebos (Layer 2) — production DB

Ten INSERT attempts against production opinion_library, each rolled back so no test data persisted. One valid control + nine single-constraint violators.

Test	Expected	Actual	Constraint that fired
control_valid	INSERT	✓ INSERT	n/a
case_name_not_citation	BLOCK	✓ BLOCK	`case_name_not_citation`
case_name_not_filename	BLOCK	✓ BLOCK	`case_name_not_filename`
chk_decision_date_sane (2099)	BLOCK	✓ BLOCK	`chk_decision_date_sane`
chk_decision_date_sane (1500)	BLOCK	✓ BLOCK	`chk_decision_date_sane`
content_hash_is_sha256 (wrong format)	BLOCK	✓ BLOCK	`content_hash_is_sha256`
content_hash_is_sha256 (uppercase)	BLOCK	✓ BLOCK	`content_hash_is_sha256`
content_length_matches (off by 100)	BLOCK	✓ BLOCK	`content_length_matches`
opinion_text_min_length (9 chars)	BLOCK	✓ BLOCK	`opinion_text_min_length`
opinion_text_not_sentinel_stub	BLOCK	✓ BLOCK	`opinion_text_not_sentinel_stub`

Result: 10 / 10 correct, with the EXACT expected constraint firing in each case.

Bonus finding: the harness initially failed even on the control case because of two NOT NULL columns (case_name_normalized and search_vector) the test fixture didn't know about. Direct INSERT into opinion_library is structurally impossible without going through the canonical helper at lib/safe-opinion-insert.js. The earlier sections of this paper enumerated 12 CHECK constraints; the actual lockdown is broader — the storage path itself is gated to a single authorized helper. The architecture is more constrained than the documentation said.

Experiment 7 — Live API end-to-end placebo (Layers 1 + 4 combined)

Thirteen shuffled citations were sent to the live /api/demo/cite-check endpoint on production: 5 POSITIVE controls (Strickland, Miranda, Brown, Marbury, Gideon), 5 PLACEBOS including Varghese v. China Southern Airlines (the actual fabricated cite from Mata v. Avianca), and 3 OVERRULED controls (Plessy, Lochner, Korematsu).

The critical architectural invariant was tested across two independent runs:

Zero placebos were verified as real in either run.

Two cases completed before rate-limit interference:

Gideon v. Wainwright, 372 U.S. 335 (1963) → VERIFIED via T2 (CourtListener semantic search), 50.1s. Correct.
United States v. Phantom, 9999 F.5d 9999 (D.C. Cir. 2050) → FABRICATED via T0_STRUCTURAL fast-fail, 2 ms. Correct. The structural-impossibility check trips before any external API is consulted.

Fail-closed at the network layer: when the rate limiter returned 429, the verification chain treated it as inconclusive (success: false), not as PASS. A limiter outage does not let unverified citations through; it surfaces a try-again-later condition.

Architectural test suite — combined

$ pnpm exec jest tests/authority-hash.test.js \
                  tests/advocacy-algorithm.test.js \
                  tests/aegis-prime-structural-check.test.js \
                  tests/coherence-gate.test.js \
                  tests/authority-engine.test.js \
                  tests/audit-log.test.js \
                  tests/cache-manager.test.js \
                  tests/bar-verification.test.js \
                  tests/placebo-controlled-verification.test.js \
                  tests/pro-se-route.test.js

Test Suites: 10 passed, 10 total
Tests:       415 passed, 415 total
Time:        2.311 s

Summary of the empirical record

Layer	Experiment	Score	Critical invariant
4 (AEGIS hash)	E1 — 40-case tamper detection	100% precision + recall	All tampers caught, no false alarms
4 (containment)	E2 — labeled classification	All boundaries correct	0.3 threshold empirically conservative
3 (V4 table)	E3 — table coverage	30+ states valid, default → CRITICAL_FAIL	Default cell is BLOCK, confirmed
4 (hash)	E4 — 100-input determinism + collision	Zero collisions, all hashes valid hex	Deterministic + safe
4 (shingle)	E5 — unrelated-text overlap	0.0000 overlap on legal-vs-noise	5-gram space separates domains
2 (storage)	E6 — 10-case constraint placebo	10 / 10 correct firings	Direct INSERT structurally blocked
1+4 (live API)	E7 — live shuffled verification	0 placebos verified across 2 runs	Architectural invariant held
All	Jest architectural suite	415 / 415 passing in 2.3s	Continuous regression coverage

Every test code path is open-source:

tests/placebo-controlled-verification.test.js
scripts/placebo-controlled-live-test.js
scripts/placebo-controlled-storage-test.js

A reviewer can clone the repository and re-run any of these.

What this paper does not claim

It does not claim the architecture is provably exception-free. Software has bugs. Layer 5 exists precisely because the prior four layers can fail.
It does not claim the corpus is complete. Citation Gravity exists precisely because corpora are never complete; it makes the gaps visible.
It does not claim no LLM is involved anywhere. LLMs are used for generation — drafting, summarizing, suggesting structure. They are not used for verification.
It does not claim every paraphrase is caught with 100% accuracy. The 0.7 containment threshold is a design choice; below it, the citation is flagged for the user's attention rather than silently passing.

What it does claim is that the architecture commits to specific, testable invariants; those invariants are enforced at the database layer and at the application layer with overlapping responsibility; and the system fails closed when any layer cannot make its determination. Those properties are not aspirational; they are encoded in the source files cited above.

Verifiable in one command

For any reviewer who wishes to independently confirm everything claimed in this paper:

git clone https://github.com/Benchslap/Benchslap
cd Benchslap
pnpm install
pnpm exec jest tests/authority-hash.test.js \
                tests/advocacy-algorithm.test.js \
                tests/aegis-prime-structural-check.test.js

# Expected:
#   Test Suites: 3 passed, 3 total
#   Tests:       323 passed, 323 total

The 323 tests will run in approximately 1.3 seconds. The source files at the line numbers cited above will resolve. The CHECK constraints can be inspected via \d+ opinion_library against any deployment. The hygiene trigger can be inspected via \df+ enforce_opinion_text_hygiene. Every claim in this paper has a corresponding artifact in the repository.

Stress-test it yourself

Try a real cite, an overruled cite, and a fabricated cite — instantly, no signup.

Verify a citation → Read the AI-hallucination essay

Defense in Depth: five overlapping hallucination-proof layers.