ColPali Architecture
Traditional document parsing techniques rely heavily on extracting plain text, at the cost of overlooking graphical elements. However, complex documents such as technical papers and presentations convey much of their context through visual cues such as tables, images, charts, and layout structure. As such, visual context is vital for extracting, interpreting, and prioritizing information in order to generate contextually accurate responses.
In short, ColPali is an advanced document retrieval model that leverages Vision Language Models to integrate both textual and visual elements for highly accurate and efficient document search.
ColiVara uses ColQwen, an improved model based on ColPali.
Document retrieval technology is central to many applications, either as a standalone ranking system or as part of a complex Retrieval Augmented Generation (RAG) pipeline.
Standard retrieval has two phases: offline indexing and online querying.
To index a standard PDF document, the Offline Indexing process might look like this:
PDF parsers or Optical Character Recognition (OCR) systems extract text from document pages.
Layout detection models segment the document into structured parts.
An optional captioning step provides natural language descriptions for visual elements, making them more compatible with embedding models.
A chunking strategy groups related text passages to maintain semantic coherence.
A text embedding model maps each chunk to a vector meant to represent its semantic meaning.
Finally, the documents, with their generated embeddings, are ready to be queried. A query follows a similar process to be converted into its vector representation. The query's vector is then compared against the documents' vectors to find the most relevant content, from which an answer can be produced.
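The two phases of this standard text pipeline can be sketched in a few lines of Python. This is a minimal illustration, not how any particular system implements it: `chunk_text`, `toy_embed`, `build_index`, and `search` are hypothetical names, the chunker is a naive fixed-size word window, and `toy_embed` is a hashed bag-of-words stand-in for a real text embedding model. Text extraction, layout detection, and captioning are assumed to have already happened.

```python
import hashlib
import math

DIM = 64  # hypothetical vector size for the toy embedder


def chunk_text(text: str, max_words: int = 8) -> list[str]:
    """Split text into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        token = word.strip(".,:;!?()")
        if token:
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
            vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(pages: list[str]) -> list[tuple[str, list[float]]]:
    """Offline indexing: chunk every page and pre-compute one vector per chunk."""
    return [(chunk, toy_embed(chunk)) for page in pages for chunk in chunk_text(page)]


def search(index: list[tuple[str, list[float]]], query: str, top_k: int = 1) -> list[str]:
    """Online querying: embed the query, then rank chunks by cosine similarity."""
    q = toy_embed(query)
    ranked = sorted(
        index,
        key=lambda item: sum(a * b for a, b in zip(q, item[1])),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]


# Already-extracted page text (made-up examples).
pages = [
    "Quarterly revenue grew in the cloud segment.",
    "The retrieval model encodes each page image into patch embeddings.",
]
index = build_index(pages)
results = search(index, "revenue in the cloud segment")
```

Because the vectors are normalized, the dot product in `search` is the cosine similarity. Anything a page conveys only through a chart or its layout never reaches `build_index`, which is the limitation the following paragraphs address.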
While there have been large improvements in text embedding models, practical experiments have shown that the performance bottleneck lies in the ingestion pipeline that processes visually rich documents before they are consumed by an LLM. Additionally, this process is slow, difficult, and prone to propagating errors.
ColPali is a novel approach to document retrieval: it does away with the textual indexing phase and instead generates embeddings directly from images (or screenshots) of document pages.
ColPali was built on previous technologies:
ColPali is an extension of PaliGemma-3B that generates ColBERT-style multi-vector representations of text and images. It streamlines the ingestion pipeline for visually rich documents by using a Vision Language Model to create embeddings directly from the visual elements. This yields a much greater performance boost than optimizing the text embedding model.
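To make "ColBERT-style multi-vector representations" concrete, the sketch below shows the late interaction (MaxSim) scoring that such representations are typically paired with: for each query vector, take the maximum dot product over all of a page's vectors, then sum those maxima. The function names, the two-dimensional vectors, and the page names are all made up for illustration; real query and patch embeddings come from the model and have many more dimensions.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Late interaction: sum, over query vectors, of the best match among page vectors."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: list[list[float]], pages: dict[str, list[list[float]]]) -> list[str]:
    """Score every page against the query and return page names, best first."""
    scores = {name: maxsim_score(query_vecs, vecs) for name, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: one vector per query token, one vector per page patch (values are made up).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_with_chart": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "page_text_only": [[0.6, 0.0], [0.1, 0.3]],
}
ranking = rank_pages(query_vecs, pages)
```

With these toy values, `page_with_chart` scores about 1.7 (0.9 + 0.8) and `page_text_only` about 0.9 (0.6 + 0.3), so the chart page ranks first. Each page keeps one vector per patch instead of being compressed into a single vector, which lets individual query tokens match individual regions of the page.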
ColPali exhibits high efficiency:
High retrieval performance: Using a Vision Language Model creates high-quality contextual embeddings from document images, facilitating quicker retrieval.
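Low querying latency: With the late interaction mechanism, document embeddings are pre-computed offline, so query time involves only encoding the query and scoring it against the stored embeddings with lightweight similarity operations.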
High indexing speed: The indexing process is simpler because document images are processed directly, eliminating the need for complex text extraction and segmentation pipelines.
Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., & Colombo, P. ColPali: Efficient Document Retrieval with Vision Language Models. Illuin Technology, Equall.ai, CentraleSupélec, Paris-Saclay, ETH Zürich. Available at: https://arxiv.org/pdf/2407.01449
An Offline Indexing phase, where all the documents from the corpus are indexed.
An Online Querying phase, where a user query is matched with low latency against the pre-computed document index.