ADR 0006: Quality bar — invariants as property tests, plus one real retrieval benchmark
Define quality as guarantees proven by property tests, plus one real labeled retrieval benchmark; domain accuracy is punted to the user's data.
Status: accepted · Date: 2026-06-17
Immutable record. Exempt from drift tracking (no covers). Supersede, don't edit.
Part of the 0002 product direction.
Context
All evaluation is currently synthetic self-consistency: the QA golden set (41 cases) scores 1.0 against a hand-aligned synthetic KB; the retrieval A/B harness uses lexical-hash gold labels that, per its own docstring, prove only "the harness math, not real recall"; and no real LLM, real corpus, or real embedding model runs in CI. For a general-purpose library, accuracy inherently depends on the user's data, so a leaderboard score is not RAGSpine's to own.
Decision
Define quality as guarantees, not scores:
- Primary: invariants as property tests. Output is never fabricated, provenance is always present, RESTRICTED content never leaks, and behavior is deterministic. Harden these from a few agent-layer tests into exhaustive property tests. These guarantees are the reason to choose RAGSpine over a dependency-heavy framework.
- Plus one real retrieval benchmark. The single claim that is RAGSpine's own, "hybrid retrieval + listwise rerank beats naive BM25", is an engine property, not a user-data property. Back it with a real (non-synthetic) labeled retrieval benchmark reporting real numbers (Recall@k, MRR), run in a dedicated lane (not necessarily on every CI run).
Domain accuracy benchmarking is explicitly punted to the user's data.
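As a rough illustration of what "invariants as property tests" could look like, the sketch below checks the four guarantees over many randomly generated inputs. It is not RAGSpine's API: `Doc`, `Answer`, `toy_answer`, and `check_invariants` are hypothetical names, and `toy_answer` is a stand-in for the real pipeline. The generator is hand-rolled with the standard library to stay offline and deterministic; a property-testing library such as Hypothesis could fill the same role.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    restricted: bool


@dataclass(frozen=True)
class Answer:
    text: str       # empty string means "abstain"
    sources: tuple  # doc_ids backing the answer


def toy_answer(query, docs, cleared):
    """Hypothetical stand-in for the pipeline under test (assumption)."""
    visible = [d for d in docs if cleared or not d.restricted]
    hits = [d for d in visible if query in d.text.split()]
    if not hits:
        return Answer("", ())  # no evidence: abstain instead of inventing text
    return Answer(" ".join(d.text for d in hits),
                  tuple(d.doc_id for d in hits))


WORDS = ["alpha", "beta", "gamma", "delta"]


def random_case(rng):
    """Generate a random query and a random corpus of 0 to 6 documents."""
    docs = [
        Doc(f"d{i}", " ".join(rng.choices(WORDS, k=3)), rng.random() < 0.5)
        for i in range(rng.randint(0, 6))
    ]
    # "omega" never occurs in any document, so the abstain path is exercised.
    return rng.choice(WORDS + ["omega"]), docs


def check_invariants(trials=500, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the test itself deterministic
    for _ in range(trials):
        query, docs = random_case(rng)
        by_id = {d.doc_id: d for d in docs}
        ans = toy_answer(query, docs, cleared=False)

        # Determinism: same input, same output.
        assert ans == toy_answer(query, docs, cleared=False)
        # Provenance: an answer has sources exactly when it has text.
        assert (ans.text == "") == (ans.sources == ())
        # RESTRICTED never leaks to an uncleared caller.
        assert all(not by_id[s].restricted for s in ans.sources)
        # Anti-fabrication: every answer word comes from a cited source.
        cited = {w for s in ans.sources for w in by_id[s].text.split()}
        assert set(ans.text.split()) <= cited
    return trials
```

Calling `check_invariants()` either returns the number of trials run or raises `AssertionError` on the first violated invariant, which fits the "regression tripwire" role: it needs no model, corpus, or network.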
Alternatives considered (rejected)
- Real-data accuracy eval as a release gate (option A): real labeled sets + real models across all channels as the bar. Rejected — accuracy depends on the user's data/model, which the library can't own.
- Synthetic CI gate + separately-published full benchmark (option B): rejected as the frame (quality centered on accuracy numbers), but its "publish one real benchmark" slice is adopted for the retrieval claim only.
Consequences
- "done / 1.0" means invariants proven, not accuracy on synthetic data.
- The retrieval benchmark gates promotion of the real vector backend out of experimental (0005).
- Keeps the fast, offline, deterministic CI loop intact as a regression tripwire.
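For the retrieval benchmark, Recall@k and MRR can be computed along the lines of the sketch below. The function names and the tiny rankings are illustrative assumptions only: the data is made up to exercise the arithmetic and is not a benchmark result. A real run would load labeled relevance judgments and the rankings produced by each retriever.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, gold, k):
    """Average both metrics over every labeled query.

    runs: {query: ranked list of doc ids}
    gold: {query: set of relevant doc ids}
    """
    queries = sorted(gold)
    n = len(queries)
    return {
        "recall@k": sum(recall_at_k(runs[q], gold[q], k) for q in queries) / n,
        "mrr": sum(reciprocal_rank(runs[q], gold[q]) for q in queries) / n,
    }


# Made-up toy data (assumption), only to show the calculation.
gold = {"q1": {"a", "b"}, "q2": {"c"}}
bm25_runs = {"q1": ["x", "a", "y", "b"], "q2": ["y", "z", "c"]}
hybrid_runs = {"q1": ["a", "b", "x"], "q2": ["c", "y"]}

bm25_scores = evaluate(bm25_runs, gold, k=2)
hybrid_scores = evaluate(hybrid_runs, gold, k=2)
```

Comparing `hybrid_scores` against `bm25_scores` on the same labeled queries is the shape of evidence the "beats naive BM25" claim needs; any pass threshold for promotion out of experimental is left open here.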
ADR 0005: Lean core — quarantine dormant capability as experimental/extras
Ship only what runs in the default offline path; move OCR, real semantic vector, and the dual-channel verifier into clearly-labeled experimental modules/extras.
ADR 0007: Multilingual — architect for five, ship two; the rest as locale packs
Thread a locale seam through every layer; ship and test Chinese + English in core; Japanese / Italian / German ship as community locale packs.