Full text loading...
Legacy handwritten reports, often preserved in archive, hold valuable insight for geological knowledge. Despite advancements in Optical Character Recognition (OCR) technologies, which have improved the conversion of handwritten content into digital text, the resulting outputs remain unstructured and incomplete, thereby limiting their applicability to modern analytical tools. By fine-tuning LLMs specifically for geological domains, we can bridge the gap between raw OCR data and structured, actionable insights, enabling seamless access to decades of legacy knowledge. This paper explores innovative methodologies for fine-tuning LLMs to process and structure raw OCR outputs from handwritten geological reports. Our approach started with leveraging the IDEFICS model (Image-aware Decoder Enhanced a la Flamingo with Interleaved Cross-attentions) for OCR, particularly its capabilities in handling image-text. Next, we fine-tuned LLM like model Mistral-7B-Instruct using a carefully selected dataset containing prompts and matching responses. Manual correction was applied to refine the dataset where necessary. This comprehensive approach generates high-quality structured outputs, unlocking the vast potential of geological archives for modern application and analysis. As a result, the research showcases that focused fine-tuning efforts within the geology realm yield considerable advancements in the digitalization and structuration of i handwritten documents.