Extraction
Documents → a frozen StyledGrid IR. Style- and color-aware xlsx/pptx/pdf extractors, per-page PDF routing, a versioned color-semantics registry, and dual-channel cross-checking into a review queue.
The extraction domain turns office documents into a frozen, style-aware intermediate
representation (the StyledGrid IR) — not "just text splitting." Every extractor is
deterministic-first, color- and style-aware, and either produces facts directly or emits a
StyledGrid for downstream ingestion. Heavy parsers (Docling, PaddleOCR) sit behind
Protocol seams and are lazy-imported, so the core runs offline.
The package lives at src/ragspine/extraction/. Its per-domain contract is
src/ragspine/extraction/CLAUDE.md. The IR (ir.py) is described in code as the most stable
interface in the whole project — extractors converge on it, and everything downstream (color
semantics, ingestion, review, eval) consumes it.
Layout
The StyledGrid IR
ir.py defines two dataclasses. A StyledGrid is one worksheet or one table page; its
cells are a sparse map from cell_ref to StyledCell.
StyledCell
A single style-aware cell.
StyledCell.rgb_tag_key() returns the color-clustering key: None when cf_affected,
otherwise resolved_rgb. Cells inside conditional-formatting regions are deliberately
excluded from color semantics because their fill cannot be trusted.
StyledGrid
Key methods: get(cell_ref), iter_cells(), add_warning(message), and
cells_by_rgb() — which groups reliably-colored cells by resolved_rgb, skipping
None and cf_affected cells. See the StyledGrid IR concept.
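The two dataclasses can be pictured with a minimal sketch. It is not the contents of ir.py: it carries only the fields this page names (cell_ref, value, resolved_rgb, cf_affected), and the sheet and warnings attributes and all defaults are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class StyledCell:
    # Only the fields mentioned on this page; the real dataclass may carry more.
    cell_ref: str
    value: Any = None
    resolved_rgb: str | None = None
    cf_affected: bool = False

    def rgb_tag_key(self) -> str | None:
        # A fill inside a conditional-formatting region cannot be trusted.
        return None if self.cf_affected else self.resolved_rgb


@dataclass
class StyledGrid:
    sheet: str  # assumed attribute name
    cells: dict[str, StyledCell] = field(default_factory=dict)  # sparse map
    warnings: list[str] = field(default_factory=list)  # assumed attribute name

    def get(self, cell_ref: str) -> StyledCell | None:
        return self.cells.get(cell_ref)

    def iter_cells(self) -> Iterator[StyledCell]:
        return iter(self.cells.values())

    def add_warning(self, message: str) -> None:
        self.warnings.append(message)

    def cells_by_rgb(self) -> dict[str, list[StyledCell]]:
        # Group only reliably-colored cells: skip None and cf_affected.
        groups: dict[str, list[StyledCell]] = {}
        for cell in self.iter_cells():
            key = cell.rgb_tag_key()
            if key is not None:
                groups.setdefault(key, []).append(cell)
        return groups


grid = StyledGrid(sheet="Summary")
for cell in (
    StyledCell("A1", "Revenue", resolved_rgb="FFC000"),
    StyledCell("B1", 120.5, resolved_rgb="FFC000"),
    StyledCell("C1", 98.0, resolved_rgb="FF0000", cf_affected=True),
    StyledCell("D1", "n/a"),
):
    grid.cells[cell.cell_ref] = cell

by_rgb = grid.cells_by_rgb()
```

In this sketch C1 is left out of the grouping even though it has a resolved fill, because the fill sits under conditional formatting.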
Extractors
Two extraction targets exist by design:
- Fact extractors (xlsx_extractor, pptx_extractor) produce Fact objects directly for known schemas (e.g. a 5-year summary table): zero hallucination, zero LLM.
- Styled extractors (*_styled_extractor, pdf_*) produce StyledGrid IR for the general ingestion path, preserving color and style.
xlsx_styled_extractor.extract_grids(path) returns one StyledGrid per worksheet. It
resolves OOXML theme + tint to a real RRGGBB value (resolve_theme_color), expands
merged cells, preserves number formats, and detects conditional-formatting regions —
marking those cells cf_affected=True and adding a grid warning so the color layer skips
them. compute_file_hash(path) returns the sha256 used for version lineage (and is reused
by the PDF router).
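The tint step of theme color resolution can be sketched as below. This is not the body of resolve_theme_color, which is not shown on this page; apply_tint is a hypothetical helper that follows the luminance adjustment commonly described for the SpreadsheetML tint attribute (a negative tint darkens, a positive tint lightens), using the standard library's colorsys module.

```python
import colorsys


def apply_tint(rgb_hex: str, tint: float) -> str:
    """Apply a tint in [-1.0, 1.0] to an RRGGBB color via HLS luminance."""
    r, g, b = (int(rgb_hex[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    if tint < 0:
        l = l * (1.0 + tint)         # negative tint scales luminance down
    else:
        l = l * (1.0 - tint) + tint  # positive tint moves luminance toward white
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return "".join(f"{round(c * 255):02X}" for c in (r, g, b))


mid_gray_from_black = apply_tint("000000", 0.5)
mid_gray_from_white = apply_tint("FFFFFF", -0.5)
unchanged_red = apply_tint("FF0000", 0.0)
```

A resolver along these lines would first look up the theme index in the workbook's color scheme to get the base RRGGBB value, then apply the tint.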
The simpler xlsx_extractor.extract_facts(path) -> tuple[list[Fact], list[str]] maps a
known summary schema straight to Fact objects (metric names down column A, period headers
across row 1), defaulting channel="TOTAL" and unit="USD_M".
```python
from ragspine.extraction.extractors.xlsx_styled_extractor import extract_grids

grids = extract_grids("report.xlsx")  # list[StyledGrid], one per sheet
for cell in grids[0].iter_cells():
    print(cell.cell_ref, cell.value, cell.resolved_rgb)
```

Two coexisting modules. pptx_extractor.extract_facts(path) reads native tables and
native chart data (from chart XML, never images) into Fact objects — zero OCR, zero
LLM. The newer pptx_styled_extractor adds two paths:
- extract_grids(path): native tables → StyledGrid (sheet 'slide{N}_table{M}', cell_ref 'R{row}C{col}'), resolving fill color via the slide theme color scheme.
- extract_note_fragments(path) -> list[NoteFragment]: textbox and speaker-notes fragments that contain a digit, sorted by slide, for the narrative layer.
A NoteFragment carries slide_no, source_kind ("textbox" / "notes"), locator
(e.g. 'slide2/notes'), text, and glossary_hits. Its stamp constant is
EXTRACTOR_VERSION = "pptx_styled_v0".
pdf_digital_extractor.extract_grids(path) extracts every table of a digital PDF (one
StyledGrid per table) by wrapping Docling — which is lazy-imported inside the function
body, never at module top. resolved_rgb is always None for this channel. Scanned,
unreadable, or table-less PDFs return [] with no exception and no OCR. Docling is
configured with do_ocr=False, do_table_structure=True.
This module also defines the GridExtractor seam — see below.
pdf_scanned_extractor.extract_grids(path, backend, *, min_confidence=0.85, queue=None)
renders pages to PNG (pypdfium2, RENDER_DPI = 200), calls the injected
OcrBackend.recognize, and builds one StyledGrid per recognized table. Low-confidence
cells (confidence < min_confidence) still enter the grid but add a grid warning, and —
if a queue is supplied — are enqueued for review with reason "low_confidence_ocr" and
priority=30.
The neutral result types are OcrCell (row, col, text, confidence), OcrTable,
and OcrPageResult. The real backend PaddleOcrVlBackend (PaddleOCR PPStructureV3,
GPU) sits behind a gpu pytest marker; the module-agnostic logic is tested offline with a
fake backend. Stamp: EXTRACTOR_VERSION = "pdf_scanned_paddleocrvl_v0".
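The low-confidence rule can be illustrated with a small offline sketch. OcrCell mirrors the four fields listed above; triage_cells, the R{row}C{col} reference scheme for this channel, and the list used as a queue are all assumptions, since the real review-queue API is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrCell:
    row: int
    col: int
    text: str
    confidence: float


def triage_cells(cells, min_confidence=0.85, queue=None):
    """Return (grid_cells, warnings); optionally enqueue low-confidence cells."""
    grid_cells, warnings = {}, []
    for cell in cells:
        ref = f"R{cell.row}C{cell.col}"  # hypothetical cell_ref scheme
        grid_cells[ref] = cell.text      # low-confidence cells still enter the grid
        if cell.confidence < min_confidence:
            warnings.append(f"low OCR confidence {cell.confidence:.2f} at {ref}")
            if queue is not None:
                queue.append(
                    {"cell_ref": ref, "reason": "low_confidence_ocr", "priority": 30}
                )
    return grid_cells, warnings


review_queue = []  # stand-in for the real review queue
cells = [OcrCell(1, 1, "Revenue", 0.99), OcrCell(1, 2, "1,2O4", 0.61)]
grid_cells, warnings = triage_cells(cells, queue=review_queue)
```

Calling triage_cells without a queue would still record the warning but enqueue nothing, which matches the optional queue parameter described above.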
PDF routing — per-page triage
Before extraction, a PDF is triaged page by page. routing/pdf_router.route(path) returns
a RoutingDecision carrying the file verdict, a PageInfo per page, and — for mixed
files — a channel_plan mapping each page number to a pipeline name.
classify_page(page, page_no) derives a per-page kind from two signals — extractable
text char count and image coverage — against TEXT_MIN_CHARS = 50 and
IMG_COVER_SCAN = 0.55:
| chars | image cover | kind |
|---|---|---|
| ≥ 50 | < 0.55 | digital |
| ≥ 50 | ≥ 0.55 | ocr_scan |
| < 50 | ≥ 0.55 | img_scan |
| < 50 | < 0.55 | low_text |
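The table translates directly into a two-signal decision. The sketch below takes the two signals as plain numbers; classify_signals is a hypothetical helper, and how the real classify_page measures character count and image coverage from a page object is not covered here.

```python
TEXT_MIN_CHARS = 50
IMG_COVER_SCAN = 0.55


def classify_signals(char_count: int, image_cover: float) -> str:
    """Map (extractable chars, image coverage) to a page kind per the table."""
    has_text = char_count >= TEXT_MIN_CHARS
    image_heavy = image_cover >= IMG_COVER_SCAN
    if has_text:
        return "ocr_scan" if image_heavy else "digital"
    return "img_scan" if image_heavy else "low_text"


kinds = [
    classify_signals(1200, 0.05),  # text page with a small logo
    classify_signals(900, 0.95),   # full-page image with a text layer
    classify_signals(0, 0.98),     # bare scan
    classify_signals(12, 0.10),    # nearly empty page
]
```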
route() aggregates pages into a file verdict (digital / scanned / ocr_scan /
mixed / unreadable) at a 90% threshold, reads the producer/creator metadata into
origin_meta, and sets ask_for_pptx=True when the producer looks like a PowerPoint /
Keynote / Impress export (so a caller can request the native source instead). Encrypted or
corrupt files return verdict="unreadable" with error set — never an exception.
A page routed digital goes to the digital_extractor pipeline; every other kind (scan / ocr /
low-text) goes to the scanned_extractor. The router only decides — the matching extractor
still does the work.
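For a mixed file, the page-to-pipeline mapping could be built as in the sketch below. build_channel_plan and pages_per_pipeline are hypothetical helpers; only the two pipeline names and the rule "digital goes to the digital extractor, everything else to the scanned extractor" come from the text above.

```python
def build_channel_plan(page_kinds: dict[int, str]) -> dict[int, str]:
    """Map each page number to a pipeline name based on its kind."""
    return {
        page_no: "digital_extractor" if kind == "digital" else "scanned_extractor"
        for page_no, kind in page_kinds.items()
    }


def pages_per_pipeline(plan: dict[int, str]) -> dict[str, list[int]]:
    """Invert the plan so each extractor can be called once with its pages."""
    grouped: dict[str, list[int]] = {}
    for page_no in sorted(plan):
        grouped.setdefault(plan[page_no], []).append(page_no)
    return grouped


page_kinds = {1: "digital", 2: "digital", 3: "img_scan", 4: "low_text", 5: "ocr_scan"}
plan = build_channel_plan(page_kinds)
grouped = pages_per_pipeline(plan)
```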
Color semantics — clustering, legend, versioned registry
color/color_semantics.py is the L2 controlled-inference layer: it maps cell fill
colors to business meaning, but only after a human confirms the mapping. The pipeline:
1. Cluster: cluster_colors(grid) -> list[ColorCluster] groups reliably-colored cells by RGB, sorted by (-count, rgb).
2. Detect legend: detect_legend(grid) -> list[LegendEntry] finds a color-block cell adjacent to a text label and produces color→meaning drafts.
3. Confirm: drafts enter the MappingRegistry and stay status="draft" until an SME confirms them. Confirming a new version supersedes (never deletes) the prior active one.
4. Apply: apply_mapping(grid, mapping) -> dict[str, dict[str, str]] returns {cell_ref: {tag_key: tag_value}}. If the mapping is not active, it returns {} and adds a grid warning, so an unconfirmed mapping can never silently tag a fact.
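The clustering order and the "inactive mapping tags nothing" rule can be sketched on plain dictionaries. The function names, the cell_colors input shape, and the mapping layout (status, tag_key, colors) are assumptions for illustration; the real functions operate on StyledGrid and the mapping objects held in the registry.

```python
from collections import Counter


def cluster_by_rgb(cell_colors):
    """Count cells per RGB and sort by (-count, rgb), skipping untrusted cells."""
    counts = Counter(rgb for rgb in cell_colors.values() if rgb is not None)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def apply_mapping_sketch(cell_colors, mapping, warnings):
    """Return {cell_ref: {tag_key: tag_value}}, or {} if the mapping is not active."""
    if mapping.get("status") != "active":
        warnings.append("mapping not active; no tags applied")
        return {}
    tags = {}
    for ref, rgb in cell_colors.items():
        if rgb in mapping["colors"]:
            tags[ref] = {mapping["tag_key"]: mapping["colors"][rgb]}
    return tags


# None stands for a cell with no fill or a cf_affected cell.
cell_colors = {"A1": "FFC000", "A2": "FFC000", "B1": "00B050", "B2": "FF0000", "C1": None}
clusters = cluster_by_rgb(cell_colors)

draft = {"status": "draft", "tag_key": "scenario", "colors": {"FFC000": "forecast"}}
active = {**draft, "status": "active"}

warnings = []
tags_from_draft = apply_mapping_sketch(cell_colors, draft, warnings)
tags_from_active = apply_mapping_sketch(cell_colors, active, warnings)
```

Ties in count fall back to the RGB string, which keeps the cluster order stable across runs.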
MappingRegistry is an independent sqlite store (color_mapping table, PK
(scope, version)). Its API: register_draft(mapping) (auto-increments version per scope),
confirm(scope, version, actor, note=None), reject(...), and get_active(scope).
Facts reference a confirmed mapping by its mapping_version, so the lineage survives
revisions. See color semantics.
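A registry with these properties could look like the sketch below, built on the standard library's sqlite3 module. Only the table name, the (scope, version) primary key, and the draft/active/supersede behavior come from the text; the column set, the "superseded" status label, the string payload, and the simplified signatures are assumptions.

```python
import sqlite3


class MappingRegistrySketch:
    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS color_mapping ("
            " scope TEXT NOT NULL, version INTEGER NOT NULL,"
            " status TEXT NOT NULL, payload TEXT NOT NULL,"
            " PRIMARY KEY (scope, version))"
        )

    def register_draft(self, scope, payload):
        # Version auto-increments per scope, starting at 1.
        (version,) = self.conn.execute(
            "SELECT COALESCE(MAX(version), 0) + 1 FROM color_mapping WHERE scope = ?",
            (scope,),
        ).fetchone()
        self.conn.execute(
            "INSERT INTO color_mapping VALUES (?, ?, 'draft', ?)",
            (scope, version, payload),
        )
        return version

    def confirm(self, scope, version, actor, note=None):
        # actor and note are accepted but not persisted in this sketch.
        with self.conn:  # one transaction: supersede the old row, activate the new
            self.conn.execute(
                "UPDATE color_mapping SET status = 'superseded'"
                " WHERE scope = ? AND status = 'active'",
                (scope,),
            )
            self.conn.execute(
                "UPDATE color_mapping SET status = 'active'"
                " WHERE scope = ? AND version = ?",
                (scope, version),
            )

    def get_active(self, scope):
        return self.conn.execute(
            "SELECT version, payload FROM color_mapping"
            " WHERE scope = ? AND status = 'active'",
            (scope,),
        ).fetchone()


reg = MappingRegistrySketch(sqlite3.connect(":memory:"))
v1 = reg.register_draft("report.xlsx", '{"FFC000": "forecast"}')
before_confirm = reg.get_active("report.xlsx")
v2 = reg.register_draft("report.xlsx", '{"FFC000": "budget"}')
reg.confirm("report.xlsx", v1, actor="sme")
reg.confirm("report.xlsx", v2, actor="sme")
statuses = reg.conn.execute(
    "SELECT version, status FROM color_mapping ORDER BY version"
).fetchall()
```

Because superseded rows are kept, a fact stamped with mapping_version 1 can still be traced to the exact mapping it was tagged with.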
Dual-channel verification
verification/dual_channel_verifier.verify(facts_a, facts_b, queue=None, tolerance=0.0)
cross-checks two independent extractions of the same table (the docstring example: Docling
table parsing vs. text-layer reconstruction). Each side is a list of ChannelFact, aligned
by dim_key = (metric_code, entity, period_type, period, channel):
- Agree (same key, values within tolerance) → auto-passed.
- Conflict (same key, values differ) → enqueued with reason "dual_channel_conflict", priority=10.
- Single-channel only (key on one side) → enqueued with reason "single_channel_only", priority=50.
It returns a VerificationResult (agreed, conflicts, only_in_a, only_in_b,
n_auto_passed, n_enqueued). With queue=None it classifies only and enqueues nothing.
Conflicts review sooner than single-only because the priority number is lower. This pure
logic has no Docling dependency.
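The three-way classification can be sketched as a dictionary join on dim_key. This is an illustration, not the verifier's code: ChannelFact here carries only the key fields plus an assumed value field, the result is a plain dict rather than a VerificationResult, and a list stands in for the review queue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelFact:
    metric_code: str
    entity: str
    period_type: str
    period: str
    channel: str
    value: float  # assumed field name

    @property
    def dim_key(self):
        return (self.metric_code, self.entity, self.period_type, self.period, self.channel)


def verify_sketch(facts_a, facts_b, queue=None, tolerance=0.0):
    a = {f.dim_key: f for f in facts_a}
    b = {f.dim_key: f for f in facts_b}
    result = {"agreed": [], "conflicts": [], "only_in_a": [], "only_in_b": []}

    def enqueue(key, reason, priority):
        if queue is not None:  # queue=None means classify only
            queue.append({"dim_key": key, "reason": reason, "priority": priority})

    for key in sorted(a.keys() | b.keys()):
        if key in a and key in b:
            if abs(a[key].value - b[key].value) <= tolerance:
                result["agreed"].append(key)
            else:
                result["conflicts"].append(key)
                enqueue(key, "dual_channel_conflict", 10)
        else:
            result["only_in_a" if key in a else "only_in_b"].append(key)
            enqueue(key, "single_channel_only", 50)
    return result


def fact(metric, value):
    return ChannelFact(metric, "ACME", "FY", "2023", "TOTAL", value)


queue = []
result = verify_sketch(
    [fact("REVENUE", 120.5), fact("EBITDA", 30.0), fact("CAPEX", 12.0)],
    [fact("REVENUE", 120.5), fact("EBITDA", 31.0)],
    queue=queue,
)
review_order = [item["reason"] for item in sorted(queue, key=lambda i: i["priority"])]
```

Sorting the queue by priority puts the EBITDA conflict ahead of the CAPEX single-channel item, which is the ordering the lower priority number is meant to produce.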
Protocol seams
Heavy dependencies are injected through @runtime_checkable Protocols so a parser can be
swapped without touching the ingest call site, and the path is testable offline with a fake.
GridExtractor
Module: pdf_digital_extractor. Members: version: str and extract_grids(path). The default
implementation is DoclingGridExtractor with version = "pdf_digital@1", the value stamped into
each fact's extractor_version. Bump it when the digital parser's output changes.
OcrBackend
Module: pdf_scanned_extractor. Member: recognize(image_bytes, page_no) -> OcrPageResult. The
default real implementation is PaddleOcrVlBackend; tests inject a fake, so PaddleOCR is not needed offline.
GridExtractor.version is part of the contract. It becomes the extractor_version written to
fact lineage, keeping a swapped parser (Docling → pdfplumber / camelot / …) distinguishable in
provenance.
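A seam of this shape can be sketched with typing.Protocol. The Protocol below mirrors the two members named above; FakeGridExtractor, the ingest function, and the dict-based grid and record shapes are hypothetical stand-ins, not the project's ingest call site.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class GridExtractor(Protocol):
    version: str

    def extract_grids(self, path: str) -> list: ...


class FakeGridExtractor:
    """Offline test double: no heavy parser is imported anywhere."""

    version = "fake@1"

    def extract_grids(self, path: str) -> list:
        return [{"sheet": "page1_table1", "cells": {"R1C1": "Revenue"}}]


def ingest(path: str, extractor: GridExtractor) -> list:
    # isinstance on a runtime_checkable Protocol checks member presence only.
    if not isinstance(extractor, GridExtractor):
        raise TypeError("extractor does not satisfy the GridExtractor seam")
    return [
        {"grid": grid, "extractor_version": extractor.version}
        for grid in extractor.extract_grids(path)
    ]


records = ingest("report.pdf", FakeGridExtractor())
```

Swapping in a different parser then means passing another object with its own version string; the call to ingest stays the same and the version travels into each record.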
Invariants this domain upholds
- Deterministic-first, zero hallucination: native tables and chart data are read structurally; OCR/LLM is a fallback behind a seam, never the default.
- Color trust: cf_affected cells and unconfirmed mappings never produce silent tags.
- Version lineage: source_file_hash + extractor_version (+ mapping_version) travel with every extracted value.
- Pluggability: heavy parsers sit behind lazy-imported Protocol seams; the core runs offline.
Related
Package Layout
The deep, domain-grouped src/ragspine map — what lives in each of the nine domains, the dependency direction between them, and how the core stays SDK-free.
Ingestion
IR/text → stores. Structured fact ingestion with a batch manifest ledger, narrative chunk ingestion, and an SME human review-queue state machine — all idempotent.