Training machine learning models on large datasets often reveals more information about the target variable and helps to avoid overfitting. However, these advantages come with challenges such as data noise and redundancy. In this study, we work with a relatively large well log dataset (40 wells from the Cambay Basin) and apply three classes of feature selection methods (filter‐based, wrapper‐based and embedded methods) to obtain the optimal feature set for accurate prediction of sonic logs. We also use boxplot and histogram analysis to remove outliers from the dataset. We then train an XGBoost model with fivefold cross‐validation and a 70:30 split, and use it to predict the sonic log in a blind well. The maximum relevance minimum redundancy method gives the best results, with an R‐squared value of 63% when three of the six features are selected: depth, neutron porosity and bulk density. We demonstrate the statistical significance of the results using one‐way analysis of variance and Tukey's honestly significant difference test. The selection of these features is further validated by established geophysical principles in the form of empirical relationships.
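To illustrate the idea behind maximum relevance minimum redundancy selection, the following is a minimal sketch rather than the procedure used in the study. It assumes absolute Pearson correlation as a stand-in for both relevance and redundancy (the study's actual scoring criterion is not stated here), and it runs on synthetic data. The names `pearson`, `mrmr_select`, `x1`, `x1_dup`, `x2` and `noise` are hypothetical and do not correspond to any real well log.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(vx * vy)


def mrmr_select(features, target, k):
    """Greedy mRMR sketch (assumption: |Pearson r| as the scoring proxy).

    At each step, pick the remaining feature that maximises
    relevance (|r| with the target) minus redundancy
    (mean |r| with the features already selected).
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic, hypothetical data: the target depends on x1 and x2,
# x1_dup is a near copy of x1 (redundant), and noise is irrelevant.
random.seed(0)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x1_dup = [v + random.gauss(0, 0.05) for v in x1]
noise = [random.gauss(0, 1) for _ in range(n)]
target = [2 * a + b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

features = {"x1": x1, "x1_dup": x1_dup, "x2": x2, "noise": noise}
selected = mrmr_select(features, target, k=2)
print(selected)
```

On this synthetic data, the redundancy penalty is expected to keep `x2` and only one of the two near-duplicate columns, which is the behaviour that distinguishes this approach from ranking features by relevance alone.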