Machine learning predictive insight of water pollution and …

Introduction

The Eastern Province of Saudi Arabia faces critical challenges in managing its water resources due to scarcity and pollution. As a region dominated by arid landscapes, the area grapples with limited precipitation and extensive groundwater exploitation for diverse purposes, including agriculture, domestic use, and industry. Addressing these issues requires a comprehensive understanding of water quality dynamics and the development of robust predictive models to guide effective management strategies.

This study integrates Gaussian process regression (GPR), a non-parametric kernel-based learning method, with adaptive neuro-fuzzy inference system (ANFIS) and decision tree (DT) algorithms to predict the water quality index (WQI) and groundwater quality index (GWQI) in the Eastern Province. The research marks the first application of a non-parametric kernel method to groundwater quality pollution index prediction in Saudi Arabia.

The study’s objectives are threefold: (i) to perform feature engineering based on dependency sensitivity analysis to identify the variables that most strongly influence WQI and GWQI, (ii) to develop predictive models using ANFIS, GPR, and DT for both indices, and (iii) to assess how the training/testing partition affects WQI and GWQI predictions, comparing splits of 70%/30%, 60%/40%, and 80%/20%.

By filling a critical gap in water resource management, this research offers significant implications for the prediction of water quality in regions facing similar environmental challenges. Through its innovative methodology and comprehensive analysis, this study contributes to the broader effort of managing and protecting water resources in arid and semi-arid areas.

Understanding Water and Groundwater Quality Indices

The Water Quality Index (WQI) and Groundwater Quality Index (GWQI) are crucial tools for assessing the overall quality of water resources for various purposes, including drinking, irrigation, and industrial use. These indices simplify complex hydrogeological data into a single value, making it easier to interpret and compare water quality across different locations and time periods.

The WQI is calculated by weighing the requirements of the intended water use against the composition and extent of potential impurities, measured through parameters such as pH, dissolved oxygen, biochemical oxygen demand, chemical oxygen demand, total suspended solids, nitrates, phosphates, and heavy metals. The GWQI, on the other hand, focuses on parameters such as pH, total dissolved solids, hardness, chloride, sulfate, nitrate, fluoride, and heavy metals.
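The text does not state which index formula the study uses. As a minimal sketch, the widely used weighted arithmetic formulation can illustrate how several parameters collapse into one value. The function name, the permissible limits, and the sample concentrations below are all hypothetical and chosen only for illustration.

```python
def weighted_arithmetic_wqi(measured, standard, ideal=None):
    """Weighted arithmetic index: sum(w_i * q_i) / sum(w_i).

    q_i = 100 * (C_i - I_i) / (S_i - I_i)   sub-index for parameter i
    w_i = 1 / S_i                            unit weight (stricter limit, larger weight)
    C_i: measured value, S_i: permissible limit, I_i: ideal value (0 unless given).
    """
    ideal = ideal or {}
    weighted_sum = 0.0
    weight_total = 0.0
    for name, concentration in measured.items():
        limit = standard[name]
        ideal_value = ideal.get(name, 0.0)
        q = 100.0 * (concentration - ideal_value) / (limit - ideal_value)
        w = 1.0 / limit
        weighted_sum += w * q
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical sample and limits (illustrative numbers, not study data).
sample = {"TDS": 500.0, "chloride": 125.0, "nitrate": 25.0, "pH": 7.75}
limits = {"TDS": 1000.0, "chloride": 250.0, "nitrate": 50.0, "pH": 8.5}
ideals = {"pH": 7.0}  # pH is rated relative to neutral rather than zero

wqi = weighted_arithmetic_wqi(sample, limits, ideals)
print(f"WQI = {wqi:.1f}")
```

In this formulation a higher value means poorer quality, so every parameter sitting halfway to its limit yields an index halfway to the threshold of 100.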

Leveraging these indices, researchers can classify and present the geographical pattern of water quality, providing invaluable knowledge about the suitability of groundwater for irrigation and the necessity for adequate treatment before application. This data is crucial for directing irrigation techniques and ensuring the responsible use of water resources in agriculture.

Innovative Hybrid Modeling Approach

The present study introduces a hybrid method that combines Gaussian process regression (GPR), a non-parametric kernel-based learner, with adaptive neuro-fuzzy inference system (ANFIS) and decision tree (DT) algorithms to predict WQI and GWQI in the Eastern Province of Saudi Arabia.

Gaussian Process Regression (GPR)

Gaussian Process Regression (GPR) is a non-parametric Bayesian approach to regression that offers a principled way to model complex data without assuming a specific functional form. GPR treats functions as a collection of random variables that have a joint Gaussian distribution, allowing it to model the uncertainty about the function underlying the observed data.
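To make the idea of a joint Gaussian over function values concrete, the following is a minimal sketch of GPR prediction for a single input variable. It assumes a zero-mean prior, a squared-exponential (RBF) kernel, and fixed hyperparameters; the kernel and settings used in the study are not given here, and the training data are hypothetical. It uses only the standard library, so the linear solver is a small hand-written helper rather than a library call.

```python
import math


def rbf_kernel(a, b, length_scale=1.0, signal_variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return signal_variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gpr_predict(x_train, y_train, x_new, noise_variance=1e-6):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    # Covariance of the training targets, with observation noise on the diagonal.
    K = [[rbf_kernel(a, b) + (noise_variance if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    # Covariance between each training input and the new input.
    k_star = [rbf_kernel(a, x_new) for a in x_train]
    alpha = solve_linear(K, y_train)            # K^-1 y
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_linear(K, k_star)                 # K^-1 k_star
    variance = rbf_kernel(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(variance, 0.0)             # clip tiny negative round-off


# Hypothetical one-dimensional training data (not study data).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.0, 0.8, 0.9, 0.1]

mean_near, var_near = gpr_predict(x_train, y_train, 1.0)   # at a training point
mean_far, var_far = gpr_predict(x_train, y_train, 10.0)    # far from all data
print(mean_near, var_near)
print(mean_far, var_far)
```

At a training input the predictive variance collapses toward the noise level, while far from the data it returns to the prior variance. This is the uncertainty behavior the paragraph above describes.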

Adaptive Neuro-Fuzzy Inference System (ANFIS)

The Adaptive Neuro-Fuzzy Inference System (ANFIS) is a neural network methodology well suited to function approximation problems. ANFIS merges fuzzy reasoning with neural network learning algorithms, which enables it to approximate any real continuous function on a compact set and makes it a strong candidate for predicting outcomes through interpretable rules.
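The combination of fuzzy rules and network layers can be sketched as a forward pass through a first-order Sugeno-type ANFIS. This is an illustration only: it assumes Gaussian membership functions and a product rule for firing strengths, all parameter values are hypothetical, and the training step (typically a mix of least squares and gradient descent) is omitted.

```python
import math
from itertools import product


def gaussian_mf(x, center, sigma):
    """Gaussian membership degree of x in a fuzzy set."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def anfis_forward(inputs, mf_params, consequents):
    """Forward pass of a first-order Sugeno ANFIS.

    inputs:      list of input values
    mf_params:   per input, a list of (center, sigma) membership functions
    consequents: per rule, [p0, p1, ..., pn] so that f = p0 + sum(p_i * x_i)
                 Rules are ordered as itertools.product over the membership lists.
    """
    # Layer 1: fuzzify each input against its membership functions.
    memberships = [[gaussian_mf(x, c, s) for (c, s) in mfs]
                   for x, mfs in zip(inputs, mf_params)]
    # Layer 2: firing strength of each rule (product of one membership per input).
    firing = [math.prod(combo) for combo in product(*memberships)]
    # Layer 3: normalize the firing strengths so they sum to 1.
    total = sum(firing)
    normalized = [w / total for w in firing]
    # Layer 4: evaluate each rule's linear consequent.
    rule_outputs = [coeffs[0] + sum(p * x for p, x in zip(coeffs[1:], inputs))
                    for coeffs in consequents]
    # Layer 5: weighted sum of rule outputs.
    return sum(w * f for w, f in zip(normalized, rule_outputs))


# Hypothetical setup: two scaled inputs, two fuzzy sets each ("low", "high").
mf_params = [[(0.0, 0.5), (1.0, 0.5)],
             [(0.0, 0.5), (1.0, 0.5)]]
consequents = [[0.1, 0.2, 0.1],   # rule: input 1 low,  input 2 low
               [0.2, 0.3, 0.4],   # rule: input 1 low,  input 2 high
               [0.3, 0.5, 0.2],   # rule: input 1 high, input 2 low
               [0.5, 0.6, 0.7]]   # rule: input 1 high, input 2 high

prediction = anfis_forward([0.5, 0.2], mf_params, consequents)
print(prediction)
```

Because the normalized firing strengths sum to one, the output is always a convex combination of the individual rule outputs.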

Decision Tree (DT)

Decision tree (DT) based rule induction is a promising approach to data mining, particularly for predicting groundwater contamination sensitivity, even when data are limited and the relationships within the dataset are unclear or intricately nonlinear. A DT recursively splits the data on the features that most reduce entropy, which improves class assignments.
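The entropy-reducing split can be sketched for a single numeric feature as follows. This is a simplified illustration with hypothetical TDS readings and quality classes, not the study's implementation. When the target is a continuous index rather than a class label, regression trees commonly use variance reduction in place of entropy.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split."""
    parent_entropy = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        child_entropy = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
        gain = parent_entropy - child_entropy
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical TDS readings (mg/L) with assigned quality classes.
tds = [300.0, 450.0, 520.0, 900.0, 1200.0, 1500.0]
quality = ["good", "good", "good", "poor", "poor", "poor"]

threshold, gain = best_threshold(tds, quality)
print(threshold, gain)
```

A full tree repeats this search over every feature at each node and recurses into the two resulting subsets until a stopping rule is met.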

By integrating these three machine learning techniques, the present study offers a novel and comprehensive approach to predicting water and groundwater quality indices in the Eastern Province of Saudi Arabia, addressing the region’s critical challenges of water scarcity and pollution.

Feature Engineering and Data Exploration

Prior to model development, the study employed feature engineering based on dependency sensitivity analysis to identify the most influential variables affecting WQI and GWQI. This process involved analyzing the correlation coefficients between the water quality parameters and the indices, revealing the parameters with the strongest relationships.
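A correlation-based screening of this kind can be sketched with the Pearson coefficient, ranking parameters by the absolute strength of their relationship to the index. The function names and the measurements below are hypothetical; the study's dependency sensitivity analysis may include further steps not described here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort features by |r| against the target, strongest first."""
    scores = {name: pearson_r(column, target)
              for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical measurements (illustrative numbers, not study data).
features = {
    "EC": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Ba": [5.0, 4.0, 4.5, 2.0, 1.0],
}
index_values = [10.0, 20.0, 30.0, 40.0, 50.0]

ranked = rank_features(features, index_values)
for name, r in ranked:
    print(f"{name}: r = {r:+.4f}")
```

Ranking by absolute value keeps strongly negative correlations in view alongside strongly positive ones, since both carry predictive information.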

The analysis showed that electrical conductivity (EC) and total dissolved solids (TDS) had very strong positive correlations with both WQI (0.9343) and GWQI (0.9387), indicating that as the mineral content of the water increases, the indices tend to show worse water quality. Similarly, trace elements such as strontium (Sr) and arsenic (As) also exhibited strong positive correlations, suggesting their significant impact on water and groundwater quality.

On the other hand, barium (Ba) showed a negative correlation with both WQI (-0.6199) and GWQI (-0.5547), which could indicate that higher Ba concentrations might be associated with better water quality or could reflect a mathematical artifact of the index calculations. The lower correlation of molybdenum (Mo) with WQI (0.6763) and GWQI (0.6477) compared to other parameters suggests that molybdenum has a lesser but still notable effect on the indices.

This detailed statistical relationship is crucial for refining WQI and GWQI calculations and for prioritizing which water quality parameters should be monitored more closely to assess and manage water pollution effectively.

Model Performance Evaluation

The predictive accuracy of the ANFIS, GPR, and DT models was evaluated using the coefficient of determination (R²), the correlation coefficient (R), the mean square error (MSE), and the root mean square error (RMSE) under three training/testing partitions: 70%/30%, 60%/40%, and 80%/20%.
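The evaluation setup can be sketched as a shuffled train/test split plus the four metrics. This sketch assumes R² is computed as one minus the ratio of residual to total sum of squares; some studies instead report the square of R, and the text does not say which definition was used. The helper names and toy values are hypothetical.

```python
import math
import random


def split(rows, train_fraction, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def regression_metrics(observed, predicted):
    """Return R2, R, MSE, and RMSE for paired observations and predictions."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    covariance = sum((o - mean_obs) * (p - mean_pred)
                     for o, p in zip(observed, predicted))
    spread_pred = sum((p - mean_pred) ** 2 for p in predicted)
    mse = sse / n
    return {
        "R2": 1.0 - sse / sst,
        "R": covariance / math.sqrt(sst * spread_pred),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


# The three partitions compared in the study, applied to placeholder row ids.
rows = list(range(100))
for fraction in (0.7, 0.6, 0.8):
    train, test = split(rows, fraction)
    print(f"{fraction:.0%} train: {len(train)} rows, {len(test)} test rows")

# Toy observed and predicted index values (not study data).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(metrics)
```

Fixing the shuffle seed keeps the partitions reproducible, so differences between models reflect the models rather than the draw.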

Water Quality Index (WQI) Prediction

With the 70/30 training/testing split, the ANFIS-M1 model performed best, reaching an R² of 0.9945 and an RMSE of 0.0401 in the testing phase and outperforming the GPR-M2 and DT-M1 models.

Similarly, in the 60/40 and 80/20 training and testing phases, ANFIS-M1 maintained its superior predictive accuracy, consistently demonstrating higher R² and lower RMSE values compared to the GPR-M2 and DT-M1 models.

Groundwater Quality Index (GWQI) Prediction

For GWQI prediction, the GPR-M1 model showed exceptional testing phase accuracy, with an RMSE of 0.0169, outperforming the ANFIS-M1 and DT-M1 models across the different data partitions.

With the 70/30 training/testing split, GPR-M1 reached an R² of 0.9991 in the testing phase, indicating a very strong fit to the data. In the 60/40 and 80/20 splits, GPR-M1 maintained high R² values, with a slight decrease in performance as the proportion of training data decreased.

The results highlight the critical role of data quality and quantity in training for enhancing model robustness and prediction precision in water quality assessment. The performance generally decreases with a smaller proportion of training data, underscoring the importance of sufficient data for accurate and reliable predictions.

Implications and Future Directions

The innovative hybrid modeling framework developed in this study offers significant implications for the prediction of water and groundwater quality in arid and semi-arid regions facing similar environmental challenges. By integrating advanced machine learning techniques, the approach reduces the need for continuous field sampling, ultimately saving time and resources for policymakers.

The research aligns with the Sustainable Development Goals (SDGs), particularly SDG 6 (Clean Water and Sanitation), by enhancing the ability to monitor and manage water resources effectively. The study’s findings support the guidelines set by the World Health Organization (WHO) and the Environmental Protection Agency (EPA) for water quality assessment and management, providing a tool for early detection of pollution and facilitating proactive management strategies.

Future research should focus on expanding the application of this model to other arid and semi-arid regions globally, considering different hydrogeological contexts and pollution sources. Integrating real-time data from IoT devices into the model could further enhance its predictive capabilities and enable dynamic water quality management. Additionally, assessing the long-term impacts of improved water quality management on public health and ecosystem resilience would provide valuable insights into the broader implications of this research.

By leveraging advanced machine learning techniques, this study offers a robust and innovative approach to addressing the critical water challenges faced by the Eastern Province of Saudi Arabia and similar regions worldwide. Through its comprehensive analysis and predictive models, this research contributes to the global effort of ensuring sustainable water resource management and protecting the environment for present and future generations.
