Ingestion

Ingestion is the front door. A file arrives as whatever the source system produced, often a single large PDF of hundreds to thousands of pages with no document boundaries and no internal navigation, and ingestion turns it into the structured, searchable record that the Reader, Search, and AI all operate on. Nothing downstream sees the raw file; downstream components see only what ingestion produced.

Two invariants govern it:

  • Original bytes are evidence. The source PDF is preserved bit-identical — never re-encoded, re-layered, watermarked, or “optimized.” Everything ingestion produces is derived alongside the original, never instead of it.
  • Derived artifacts are regenerable. Document boundaries, categories, and the search index are all reproducible from the original. The original PDF plus the document records are the source of truth; everything else can be rebuilt.
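
One way to check the first invariant is to compare a digest of the uploaded bytes with a digest of the stored copy. The sketch below is an illustration only: the source does not say how preservation is verified, and the function names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def is_bit_identical(uploaded: bytes, stored: bytes) -> bool:
    """True only if the stored copy matches the upload byte for byte.

    Assumption: both blobs fit in memory. A production check would
    more likely stream the file in chunks.
    """
    return sha256_hex(uploaded) == sha256_hex(stored)
```

Any re-encoding or "optimization" of the PDF changes its bytes, so the digests would differ and a check of this kind would fail.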

The pipeline

The ingestion pipeline runs in three stages: a source PDF (one large file, no boundaries) passes through document extraction (boundaries, receipt dates, types), then storage (original.pdf in object store plus DOC# / CAT# records), then a search index build (per-page text, extracted and gzipped).

  1. Document extraction. The source PDF is split into the documents that make up the file — each with a page range, a type (form, treatment record, exam, lay statement, …), a receipt date, and a coarse category (medical / procedural / other / index). These become the DOC# and CAT# records described in the Reader’s data model.
  2. Storage. The original PDF is written to object storage under the appeal’s prefix (<appealId>/original.pdf), and the document records are written to the single-table store, keyed by appeal ID. From here on, every reader of the file works from these records, not the raw PDF.
  3. Search index. A worker extracts the per-page text layer and writes the compressed sidecar at <appealId>/search-text.jsonl.gz — the index Search runs against. It’s built ahead of the reviewer, so a file is search-ready before anyone opens it.
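
The three steps can be sketched in a few lines. The object keys come from the text above. Everything else is an assumption made for illustration: the record field names, and the shape of each JSONL line (one object per page with "page" and "text"), are hypothetical and not confirmed by the source.

```python
import gzip
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRecord:
    # Hypothetical field names; the source lists the attributes only.
    appeal_id: str
    first_page: int
    last_page: int
    doc_type: str
    receipt_date: str
    category: str  # medical / procedural / other / index


def original_key(appeal_id: str) -> str:
    """Object-store key for the preserved source PDF."""
    return f"{appeal_id}/original.pdf"


def search_text_key(appeal_id: str) -> str:
    """Object-store key for the compressed search sidecar."""
    return f"{appeal_id}/search-text.jsonl.gz"


def build_search_sidecar(page_texts: list[str]) -> bytes:
    """Write one JSON object per page, newline-delimited, then gzip."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for number, text in enumerate(page_texts, start=1):
            line = json.dumps({"page": number, "text": text})
            gz.write(line.encode("utf-8") + b"\n")
    return buf.getvalue()


def read_search_sidecar(blob: bytes) -> list[dict]:
    """Decompress the sidecar and parse it back into per-page rows."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as gz:
        lines = gz.read().decode("utf-8").split("\n")
    return [json.loads(line) for line in lines if line]
```

Because the sidecar is derived purely from the page text, it can be deleted and rebuilt at any time, which is consistent with the second invariant.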

Born-digital vs. scanned

The text-extraction step branches on the document’s origin:

  • Born-digital PDFs carry a real text layer — extraction is direct, fast, and exact.
  • Scanned or faxed pages have no reliable text layer; they route through self-hosted OCR to produce one, and the better of the embedded and OCR text is chosen per page (see Search → providers).

The output shape is identical either way, so everything downstream is indifferent to which path produced the text.
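
The branch can be illustrated with a per-page resolver. This is a sketch under stated assumptions: the source does not define how "better" is judged, so the quality heuristic (a count of alphanumeric characters) and the min_chars threshold are placeholders, and run_ocr stands in for whatever OCR engine is deployed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PageText:
    page: int
    text: str
    source: str  # "embedded" or "ocr"


def quality(text: str) -> int:
    """Assumed heuristic: number of alphanumeric characters."""
    return sum(ch.isalnum() for ch in text)


def resolve_page_text(
    page: int,
    embedded: str,
    run_ocr: Callable[[int], str],
    min_chars: int = 20,
) -> PageText:
    """Pick the text for one page; the output shape is the same on both paths."""
    # Born-digital path: the embedded layer is good enough, skip OCR.
    if quality(embedded) >= min_chars:
        return PageText(page, embedded, "embedded")
    # Scanned path: run OCR and keep whichever text scores higher.
    ocr_text = run_ocr(page)
    if quality(ocr_text) > quality(embedded):
        return PageText(page, ocr_text, "ocr")
    return PageText(page, embedded, "embedded")
```

Callers receive a PageText either way, so nothing downstream needs to inspect the source field except for diagnostics.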

Duplicate identification

Case files routinely carry the same document many times over — a form resubmitted, an exhibit attached to several filings, a record faxed twice. Ingestion identifies these duplicates as it processes the file, comparing documents by content (a page-text fingerprint, not just filename or size) so near-identical copies are recognized even when their metadata differs.

Duplicates are flagged and associated, never discarded — the original bytes are evidence and nothing is removed. The result is a marker the Reader surfaces, so a reviewer can collapse or skip a copy they’ve already read instead of re-reading it across a thousand-page file. Identification happens here at ingestion; the compare-and-tag action lives in the Reader, and the broader grouping of related-but-not-identical documents is part of the AI layer’s consolidation.
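
A content fingerprint of this kind can be sketched as a hash over normalized page text. The normalization rules and function names below are assumptions, not the system's actual algorithm. Note the limit of the sketch: it groups copies whose text matches after normalization, regardless of filename or size, but copies that differ by OCR noise would need a similarity measure that is not shown here.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(page_texts: list[str]) -> str:
    """Hash the normalized text of every page, in order."""
    digest = hashlib.sha256()
    for text in page_texts:
        digest.update(normalize(text).encode("utf-8"))
        digest.update(b"\x00")  # separator so page boundaries affect the hash
    return digest.hexdigest()


def find_duplicates(docs: dict[str, list[str]]) -> list[list[str]]:
    """Group document IDs that share a fingerprint; nothing is discarded."""
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, pages in docs.items():
        groups[fingerprint(pages)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

The function returns groups of associated IDs rather than a filtered list, which matches the rule that duplicates are flagged and never removed.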

The record is the contract

The result of ingestion is the same shape regardless of how a file arrived: an appeal, its documents, and the search index. A file can enter through the upload-and-split pipeline, through the Integration API (an integrating system writes the same records directly), or through the document-source port, which pulls it from an external document source. The Reader doesn’t know or care how a file arrived; it only reads the record. That separation is what lets the source be swapped or supplemented (a new origin system, a different splitter, a different OCR engine) without touching anything downstream.
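
The separation can be pictured as an interface that every entry path satisfies. The sketch below is hypothetical throughout: the class names, the produce method, and the record fields are invented for illustration and do not describe the real Integration API or document-source port.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class AppealRecord:
    # Hypothetical shape: an appeal, its documents, and a search index flag.
    appeal_id: str
    document_ids: tuple[str, ...]
    has_search_index: bool


class RecordProducer(Protocol):
    """Anything that can turn an incoming file into the shared record."""

    def produce(self, appeal_id: str) -> AppealRecord: ...


class UploadAndSplit:
    """Stand-in for the upload-and-split pipeline."""

    def produce(self, appeal_id: str) -> AppealRecord:
        return AppealRecord(appeal_id, ("d1", "d2"), True)


class IntegrationWriter:
    """Stand-in for an integrating system that supplies records directly."""

    def __init__(self, payload: dict) -> None:
        self.payload = payload

    def produce(self, appeal_id: str) -> AppealRecord:
        return AppealRecord(appeal_id, tuple(self.payload["documents"]), True)


def reader_summary(record: AppealRecord) -> str:
    """The Reader side: depends on the record only, never on the producer."""
    return f"{record.appeal_id}: {len(record.document_ids)} documents"
```

Because reader_summary accepts only an AppealRecord, swapping one producer for another cannot change what the Reader sees as long as the record is the same.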

Pipeline-owned fields. A few fields are derived from the upload by the pipeline (e.g. the original PDF’s byte size and version). The pipeline owns them; they flow one way — ingestion writes, everything else reads.