This study presents an automated pipeline for fossil identification and annotation by leveraging Large Vision Models (LVMs), Vision Transformers (ViTs), and Optical Character Recognition (OCR). The proposed framework processes unstructured fossil images to segment individual fossils, extract textual annotations, and align them with corresponding names and descriptions. Using zero-shot object detection with OWL-ViT, the system achieves robust segmentation without extensive labeled data, while a transformer-based OCR model extracts and refines textual information. The pipeline was tested on diverse paleontological datasets featuring various fossil types and image complexities. Results demonstrate high segmentation accuracy, effective text-to-image mapping, and adaptability across datasets. Additionally, an interactive interface enables human-in-the-loop refinement, enhancing reliability and usability for domain experts. Overall, the study establishes a scalable and intelligent approach for digitizing and organizing paleontological imagery, supporting automated fossil identification and advancing digital fossil documentation.
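The text-to-image mapping step can be illustrated with a minimal sketch. The abstract does not state how labels are aligned with fossils, so the nearest-centre rule below is an assumption, not the authors' method. The `Box` class, the `align_labels` function, and the sample coordinates and label strings are all hypothetical. In a real pipeline the fossil boxes would come from the zero-shot detector and the label boxes from the OCR model.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical structure)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def align_labels(
    fossils: list[Box], labels: list[tuple[str, Box]]
) -> dict[int, list[str]]:
    """Assign each OCR label to the fossil whose box centre is nearest.

    Assumption: spatial proximity is a reasonable proxy for which caption
    belongs to which specimen. Returns a map from fossil index to label texts.
    """
    assignment: dict[int, list[str]] = {i: [] for i in range(len(fossils))}
    if not fossils:
        return assignment
    for text, label_box in labels:
        lx, ly = label_box.center
        nearest = min(
            range(len(fossils)),
            key=lambda i: hypot(
                fossils[i].center[0] - lx, fossils[i].center[1] - ly
            ),
        )
        assignment[nearest].append(text)
    return assignment


# Made-up example: two detected fossils, each with a caption just below it.
fossils = [Box(0, 0, 100, 100), Box(200, 0, 300, 100)]
labels = [
    ("label A", Box(10, 105, 90, 120)),
    ("label B", Box(210, 105, 290, 120)),
]
result = align_labels(fossils, labels)
```

A nearest-centre rule is simple but can misassign captions on densely packed plates, which is one place where the human-in-the-loop interface described above would let an expert correct the mapping.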