This study presents **ColPali**, a novel **vision-language retrieval architecture** for retrieving information from **visually complex geological documents** such as handwritten reports, stratigraphic charts, and diagrams. Traditional OCR-based and text-only RAG systems struggle to interpret the mixed visual-textual layouts common in geological archives. ColPali addresses these limitations by using **Vision-Language Models (VLMs)** and **Vision Transformers (ViTs)** to generate **multi-vector embeddings** directly from document images, capturing both visual and textual context. Retrieval is similarity-based within a multimodal embedding space, which allows queries to be matched accurately against figures, tables, and numerical data. Experimental results show that ColPali achieves higher **retrieval precision, contextual understanding, and explainability** than conventional methods. It handles noisy and handwritten data effectively and offers **interpretable outputs through visual heatmaps** that support expert validation. By integrating advanced document understanding capabilities into RAG workflows, ColPali introduces a new paradigm for **intelligent document analysis in geoscience** that improves the accessibility and usability of geological knowledge.
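To make the idea of similarity-based retrieval over multi-vector embeddings concrete, here is a minimal sketch in plain Python. The abstract does not state the scoring function, so the sketch assumes a late-interaction "MaxSim" score, which is a common choice for multi-vector retrievers: each query vector is matched to its most similar page vector, and the maxima are summed. The function names (`normalize`, `maxsim_score`, `rank_pages`), the page identifiers, and the two-dimensional toy vectors are all hypothetical; real embeddings would come from the vision-language model and have far more dimensions.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, page_vecs):
    """Assumed late-interaction score: for each query vector, take the highest
    cosine similarity to any page vector, then sum these maxima."""
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in page_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


def rank_pages(query_vecs, pages):
    """Return (page_id, score) pairs sorted from best to worst match."""
    scores = {page_id: maxsim_score(query_vecs, vecs) for page_id, vecs in pages.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): one list of vectors per query token or page patch.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "stratigraphic_chart": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "field_notes": [[1.0, 0.0], [1.0, 0.1]],
}

ranking = rank_pages(query, pages)
for page_id, score in ranking:
    print(f"{page_id}: {score:.3f}")
```

In this toy example the page whose patch vectors cover both query vectors ranks first. Because the score is built from individual query-to-patch matches, those per-patch similarities are the kind of quantity that could plausibly be drawn as a heatmap over the page, although the abstract does not say how its heatmaps are computed.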