
Abstract

This study presents **ColPali**, a novel **vision-language retrieval architecture** designed to enhance information retrieval from **visually complex geological documents** such as handwritten reports, stratigraphic charts, and diagrams. Traditional OCR-based and text-only RAG systems struggle to interpret the mixed visual-textual layouts common in geological archives. ColPali overcomes these limitations by leveraging **Vision-Language Models (VLMs)** and **Vision Transformers (ViTs)** to generate **multi-vector embeddings** directly from document images, capturing both visual and textual context. The system performs similarity-based retrieval within a multimodal embedding space, enabling accurate matching of queries involving figures, tables, and numerical data. Experimental results demonstrate that ColPali achieves superior **retrieval precision, contextual understanding, and explainability** compared to conventional methods. It effectively handles noisy and handwritten data, offering **interpretable outputs through visual heatmaps** that support expert validation. By integrating advanced document understanding capabilities into RAG workflows, ColPali introduces a new paradigm for **intelligent document analysis in geoscience**, significantly improving the accessibility and usability of geological knowledge.
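The multi-vector retrieval described above can be sketched with a late-interaction ("MaxSim") scoring rule of the kind ColPali builds on: each query token embedding is matched against every image-patch embedding of a document page, the best match per token is kept, and the matches are summed. The function names, array shapes, and ranking helper below are illustrative assumptions, not the authors' actual API; embeddings are assumed to be L2-normalized so dot products act as cosine similarities.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one document page.

    query_emb: (n_query_tokens, d) -- one embedding per query token.
    page_emb:  (n_page_patches, d) -- one embedding per image patch.
    Both are assumed L2-normalized.
    """
    # Cosine similarity of every query token against every page patch.
    sim = query_emb @ page_emb.T  # shape (n_query_tokens, n_page_patches)
    # Keep each token's best-matching patch, then sum over tokens.
    return float(sim.max(axis=1).sum())

def rank_pages(query_emb: np.ndarray, page_embs: list) -> list:
    """Return page indices sorted by descending MaxSim score."""
    scores = [maxsim_score(query_emb, p) for p in page_embs]
    return sorted(range(len(page_embs)), key=lambda i: -scores[i])
```

Because the score is a sum of per-token maxima, the patch that maximizes each token's similarity can also be highlighted on the page image, which is the basis for the interpretability heatmaps mentioned in the abstract.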


/content/papers/10.3997/2214-4609.202639035
2026-03-09
2026-02-16

