Geoscience documents pose challenges for automated information extraction due to complex layouts, poor visual quality, and diverse table structures. Traditional workflows rely on separate models for layout analysis, text recognition, and table extraction, leading to error propagation and inconsistent outputs. General-purpose large language models, meanwhile, lack geoscience-specific understanding, limiting their effectiveness on such documents. To address these issues, we fine-tune a lightweight vision-language model that integrates visual and textual understanding. Our model jointly processes text, tables, and figures in a schema-aware manner, preserving natural reading order and contextual coherence. This unified approach reduces maintenance complexity, minimizes error propagation, and improves consistency across geoscience document elements.