Ingestion

Ingestion is the front door. A file arrives as whatever the source system produced (often a single large PDF, hundreds to thousands of pages long, with no document boundaries and no internal navigation), and ingestion turns it into the structured, searchable record that the Reader, Search, and AI all operate on. Nothing downstream sees the raw file; every consumer sees only what ingestion produced.

Two invariants govern it:

Original bytes are evidence. The source PDF is preserved bit-identical — never re-encoded, re-layered, watermarked, or “optimized.” Everything ingestion produces is derived alongside the original, never instead of it.
Derived artifacts are regenerable. Document boundaries, categories, and the search index are all reproducible from the original. The original PDF plus the document records are the source of truth; everything else can be rebuilt.
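One way to check the first invariant is to compare a digest of the uploaded bytes with a digest of the stored bytes. The sketch below is illustrative only: the function names are hypothetical, and SHA-256 is an assumed choice rather than something this page specifies.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_bit_identical(uploaded: bytes, stored: bytes) -> bool:
    """True when the stored object matches the upload byte for byte.

    Hypothetical helper: comparing digests lets the check run without
    holding both copies side by side, e.g. against a digest recorded at
    upload time.
    """
    return sha256_hex(uploaded) == sha256_hex(stored)
```

Any re-encoding or "optimization" pass changes the bytes, so a check like this would fail after one, even if the PDF still renders the same.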

The pipeline

Document extraction. The source PDF is split into the documents that make up the file — each with a page range, a type (form, treatment record, exam, lay statement, …), a receipt date, and a coarse category (medical / procedural / other / index). These become the DOC# and CAT# records described in the Reader’s data model.
Storage. The original PDF is written to object storage under the appeal’s prefix (<appealId>/original.pdf), and the document records are written to the single-table store, keyed by appeal ID. From here on, every reader of the file works from these records, not the raw PDF.
Search index. A worker extracts the per-page text layer and writes the compressed sidecar at <appealId>/search-text.jsonl.gz, the index Search runs against. It is built ahead of review, so a file is search-ready before anyone opens it.
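The storage layout and the sidecar step above can be sketched as follows. The two object keys come from this page; everything else is an assumption for illustration, in particular the one-JSON-object-per-page line shape with "page" and "text" fields, and the helper names.

```python
import gzip
import json


def original_key(appeal_id: str) -> str:
    """Object-storage key for the preserved source PDF."""
    return f"{appeal_id}/original.pdf"


def search_text_key(appeal_id: str) -> str:
    """Object-storage key for the compressed search sidecar."""
    return f"{appeal_id}/search-text.jsonl.gz"


def build_sidecar(page_texts: list[str]) -> bytes:
    """Serialize per-page text as gzip-compressed JSON Lines.

    Assumed line shape: {"page": <1-based page number>, "text": <page text>}.
    """
    lines = (
        json.dumps({"page": number, "text": text}) + "\n"
        for number, text in enumerate(page_texts, start=1)
    )
    return gzip.compress("".join(lines).encode("utf-8"))


def read_sidecar(blob: bytes) -> list[dict]:
    """Decode a sidecar back into a list of per-page entries."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line]
```

Because the sidecar is derived entirely from the original PDF, it can be deleted and rebuilt at any time without loss.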

Born-digital vs. scanned

The text-extraction step branches on the document’s origin:

Born-digital PDFs carry a real text layer — extraction is direct, fast, and exact.
Scanned / faxed pages have no text layer; they route through self-hosted OCR to produce one. Where a page has both embedded and OCR text, the better of the two is chosen per page (see Search → providers).

The output shape is identical either way, so everything downstream is indifferent to which path produced the text.
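A minimal sketch of the per-page resolution, assuming a simple scoring rule. This page does not say how "better" is decided, so the alphanumeric-count heuristic below is a stand-in, and the PageText type and resolve_page function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageText:
    page: int
    text: str
    source: str  # "embedded" or "ocr"


def _score(text: str) -> int:
    """Assumed quality heuristic: count of alphanumeric characters."""
    return sum(ch.isalnum() for ch in text)


def resolve_page(page: int, embedded: Optional[str], ocr: Optional[str]) -> PageText:
    """Pick one text per page; the result has the same shape on either path."""
    embedded_text = embedded or ""
    ocr_text = ocr or ""
    # Prefer the embedded layer on ties, since it needs no recognition step.
    if _score(ocr_text) > _score(embedded_text):
        return PageText(page, ocr_text, "ocr")
    return PageText(page, embedded_text, "embedded")
```

Downstream code only reads the text, which is what keeps it indifferent to the path taken; the source field here is kept purely for diagnostics.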

Duplicate identification

Case files routinely carry the same document many times over — a form resubmitted, an exhibit attached to several filings, a record faxed twice. Ingestion identifies these duplicates as it processes the file, comparing documents by content (a page-text fingerprint, not just filename or size) so near-identical copies are recognized even when their metadata differs.

Duplicates are flagged and associated, never discarded — the original bytes are evidence and nothing is removed. The result is a marker the Reader surfaces, so a reviewer can collapse or skip a copy they’ve already read instead of re-reading it across a thousand-page file. Identification happens here at ingestion; the compare-and-tag action lives in the Reader, and the broader grouping of related-but-not-identical documents is part of the AI layer’s consolidation.
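A content fingerprint of the kind described above could look like the sketch below. It is an assumption-laden illustration: the normalization rules, the use of SHA-256, and the names fingerprint and group_duplicates are all hypothetical. It matches copies whose text is identical after normalization; tolerating OCR noise between copies would need a fuzzier technique that is not shown here.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse whitespace and case so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def fingerprint(page_texts: list[str]) -> str:
    """Hash the normalized text of every page, in order."""
    digest = hashlib.sha256()
    for text in page_texts:
        digest.update(normalize(text).encode("utf-8"))
        digest.update(b"\x00")  # page separator, so page boundaries count
    return digest.hexdigest()


def group_duplicates(docs: dict[str, list[str]]) -> list[list[str]]:
    """Group document IDs that share a fingerprint; nothing is discarded."""
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, pages in docs.items():
        groups[fingerprint(pages)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

The function only returns associations between document IDs. Every document stays in place, which is consistent with flagging rather than removing duplicates.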

The record is the contract

The result of ingestion is the same shape regardless of how a file arrived: an appeal, its documents, and the search index. A file can enter through the upload-and-split pipeline, through the Integration API (an integrating system writes the same records directly), or through the document-source port, which pulls it from an external document source. The Reader doesn’t know or care how a file arrived; it only reads the record. That separation is what lets the source be swapped or supplemented (a new origin system, a different splitter, a different OCR engine) without touching anything downstream.
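To make the contract concrete, here is a hedged sketch of the record and of a source port. The document fields mirror the ones listed under document extraction (page range, type, receipt date, category); the class names, the fetch method, and the search_index_key field are hypothetical and do not describe the actual interfaces.

```python
from dataclasses import dataclass
from datetime import date
from typing import Protocol


@dataclass(frozen=True)
class DocumentRecord:
    first_page: int
    last_page: int
    doc_type: str       # form, treatment record, exam, lay statement, ...
    receipt_date: date
    category: str       # medical / procedural / other / index


@dataclass(frozen=True)
class IngestedAppeal:
    appeal_id: str
    documents: tuple[DocumentRecord, ...]
    search_index_key: str


class DocumentSource(Protocol):
    """Hypothetical port: any origin that can yield the same record shape."""

    def fetch(self, appeal_id: str) -> IngestedAppeal: ...


class InMemorySource:
    """Stand-in source used here only to show that origins are swappable."""

    def __init__(self, appeals: dict[str, IngestedAppeal]) -> None:
        self._appeals = appeals

    def fetch(self, appeal_id: str) -> IngestedAppeal:
        return self._appeals[appeal_id]


def pages_covered(source: DocumentSource, appeal_id: str) -> int:
    """A downstream reader: depends on the record, never on the origin."""
    record = source.fetch(appeal_id)
    return sum(d.last_page - d.first_page + 1 for d in record.documents)
```

Swapping InMemorySource for an upload pipeline or an API-backed source would leave pages_covered unchanged, which is the property the section describes.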

Pipeline-owned fields. A few fields are derived from the upload by the pipeline (e.g. the original PDF’s byte size and version). The pipeline owns them; they flow one way — ingestion writes, everything else reads.