The credibility of AI-generated answers from document analysis hinges on users' ability to verify source attribution. Current Retrieval-Augmented Generation (RAG) systems provide only coarse-grained citations (e.g., "page 12") without showing the exact visual regions used to generate answers.
This limitation stems from a fundamental tension: Optical Character Recognition (OCR) systems provide precise spatial coordinates but struggle with semantic understanding and reading order, while Large Language Models (LLMs) excel at semantic interpretation but cannot reliably output spatial coordinates.
We present a novel hybrid architecture that preserves OCR's spatial precision while leveraging the semantic capabilities of LLMs. Our approach uses OCR for accurate bounding box extraction, then employs vision-language models to group text into semantically coherent sections, correct reading-order errors, and enhance text quality, all while maintaining exact spatial mappings through identifier-based references rather than coordinate generation.
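To make identifier-based referencing concrete, the sketch below shows one possible realization in Python; the names (`OCRBox`, `build_vlm_prompt`, `resolve_section`) and the prompt wording are illustrative assumptions rather than the system's actual interface. The key point is that the model only ever sees and returns box identifiers, so coordinates come straight from OCR and are never regenerated.

```python
# Minimal sketch of identifier-based spatial mapping (names are assumptions).
# OCR output keeps the geometry; the vision-language model only references
# box identifiers, so spatial precision is never lost to coordinate generation.
from dataclasses import dataclass


@dataclass
class OCRBox:
    box_id: str                                # stable identifier, e.g. "p12_b07"
    page: int
    bbox: tuple[float, float, float, float]    # (x0, y0, x1, y1) in page coordinates
    text: str


def build_vlm_prompt(boxes: list[OCRBox]) -> str:
    """Present OCR text with identifiers so the model can reference, not regenerate, geometry."""
    lines = [f"[{b.box_id}] {b.text}" for b in boxes]
    return (
        "Group the following OCR fragments into semantically coherent sections.\n"
        "Return only the box identifiers, in corrected reading order.\n\n"
        + "\n".join(lines)
    )


def resolve_section(box_ids: list[str], boxes: list[OCRBox]) -> list[OCRBox]:
    """Map identifiers returned by the model back to the exact OCR geometry."""
    index = {b.box_id: b for b in boxes}
    return [index[i] for i in box_ids if i in index]


# Example: suppose the model returned this ordering for one section.
boxes = [
    OCRBox("p12_b07", 12, (50, 80, 540, 110), "3.1 Hybrid Architecture"),
    OCRBox("p12_b08", 12, (50, 120, 540, 300), "We combine OCR geometry with ..."),
]
section_boxes = resolve_section(["p12_b07", "p12_b08"], boxes)
```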
This enables fine-grained visual attribution: users see the precise document regions, down to individual bounding boxes, that contributed to each answer. The system implements a three-tier hierarchical structure (Document → Section → Chunk) in which each retrievable chunk maintains links to all contributing bounding boxes, together with their visual crops.
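The hierarchy can be pictured as a set of simple records. The following is a minimal, assumed data model (field names such as `box_ids` and `crop_paths` are illustrative) showing how each chunk stays linked to the boxes that produced it.

```python
# Sketch of the three-tier hierarchy; field names are assumptions.
# Each retrievable chunk keeps the identifiers of every contributing OCR box,
# so its visual crops can be recovered at answer time.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    text: str                                  # enhanced text used for retrieval
    box_ids: list[str]                         # links back to exact OCR bounding boxes
    crop_paths: list[str] = field(default_factory=list)   # pre-rendered image crops


@dataclass
class Section:
    section_id: str
    title: str
    chunks: list[Chunk] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    sections: list[Section] = field(default_factory=list)

    def boxes_for_chunk(self, chunk_id: str) -> list[str]:
        """Return every bounding-box identifier that contributed to a chunk."""
        for section in self.sections:
            for chunk in section.chunks:
                if chunk.chunk_id == chunk_id:
                    return chunk.box_ids
        return []
```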
When a question is answered, the user receives not only the generated answer but also a navigable carousel of source regions, with bounding boxes numbered in sequence to show exactly where each piece of information originated.
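Continuing the assumed types from the sketches above, the fragment below illustrates how an answer and its numbered source regions might be packaged for such a carousel; the payload layout is hypothetical.

```python
# Illustrative assembly of the attribution payload returned with an answer.
# Retrieved chunks are resolved back to their bounding boxes, which are
# numbered in sequence for the carousel of source-region crops.
def build_attribution(answer: str, retrieved_chunks: list[Chunk],
                      box_index: dict[str, OCRBox]) -> dict:
    regions = []
    seq = 1
    for chunk in retrieved_chunks:
        for box_id in chunk.box_ids:           # identifiers kept in reading order
            box = box_index[box_id]
            regions.append({
                "sequence": seq,               # number displayed on the crop
                "page": box.page,
                "bbox": box.bbox,              # exact OCR coordinates
                "chunk_id": chunk.chunk_id,
            })
            seq += 1
    return {"answer": answer, "source_regions": regions}
```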
This architecture addresses critical needs in high-stakes domains—legal, medical, financial—where answer verification is mandatory. By solving the OCR-LLM integration problem without sacrificing spatial precision, we enable a new class of trustworthy document intelligence systems.