Digitizing legacy geological reports is essential for enabling modern analytics, yet most existing Retrieval-Augmented Generation (RAG) pipelines struggle with accuracy, often producing hallucinations or inconsistent answers. In this work, we explore how LangGraph can make these workflows more reliable by adding correction loops and structured state handling. We tested three large language models (Meta LLaMA-3-90B, Anthropic Claude Sonnet, and DeepSeek R1) on geological well data and evaluated them from three perspectives: expert scoring (LLM-as-a-Judge), lexical alignment (TF-IDF with embeddings), and semantic similarity (Word2Vec with embeddings). Our results show that DeepSeek R1 provides the strongest semantic understanding, Claude Sonnet aligns best with expert phrasing, and LLaMA-3 delivers competitive but more variable outcomes. Overall, the LangGraph approach reduced errors, improved consistency, and gave a clearer picture of how different models perform in scientific Q&A. This study shows that LangGraph is not just a research framework but a practical means of making generative AI more dependable in specialized fields such as geoscience, and the methodology can be extended to other industries facing similar digitization challenges.
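The correction loop with structured state that the abstract describes can be sketched in plain Python. This is a minimal illustration of the pattern (explicit state object, generate, validate against retrieved context, retry on failure), not the paper's implementation or the LangGraph API; all functions, strings, and the grounding check are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class QAState:
    """Structured state carried through the correction loop."""
    question: str
    context: str          # retrieved passage from the digitized report
    answer: str = ""
    attempts: int = 0
    history: list = field(default_factory=list)

def generate(state: QAState) -> str:
    # Stand-in for an LLM call: the first attempt returns an unsupported
    # claim; the retry stays grounded in the retrieved context.
    if state.attempts == 0:
        return "The well reached basement granite."   # simulated hallucination
    return "The well encountered sandstone at 2300 m."

def validate(state: QAState) -> bool:
    # Crude grounding check: every word of the answer must appear in the
    # retrieved context, otherwise the loop triggers a regeneration.
    context_words = set(state.context.lower().split())
    answer_words = set(state.answer.lower().replace(".", "").split())
    return answer_words <= context_words

def run(state: QAState, max_retries: int = 2) -> QAState:
    # generate -> validate -> retry, bounded by max_retries
    while state.attempts <= max_retries:
        state.answer = generate(state)
        state.history.append(state.answer)
        state.attempts += 1
        if validate(state):
            break
    return state

state = run(QAState(
    question="What lithology did the well encounter?",
    context="the well encountered sandstone at 2300 m depth",
))
print(state.attempts, state.answer)
```

Here the hallucinated first answer fails the grounding check and the loop regenerates once before accepting a context-supported answer; in the actual pipeline the validation step would be an LLM- or rule-based critic rather than a word-overlap test.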