Introduction:
Air pollution is one of the most pressing environmental issues of our time, affecting both human
health and the planet. In order to mitigate the negative effects of air pollution, it is crucial to
understand and predict the levels and sources of air pollution. This blog presents a study of global
air pollution prediction using machine learning techniques.
Problem Definitions:
1) Develop a classification model to predict the AQI category using a publicly available dataset of
global air pollution factors.
2) Use machine learning to perform a regression task and predict AQI values using a dataset of
global air pollution readings.
3) Use clustering techniques to group locations that have similar air pollution patterns based on
the AQI Value and PM 2.5 AQI Value.
We address each problem with three different models, using supervised machine learning methods for the classification and regression tasks and unsupervised methods for the clustering task.
Data Set Description:
The data set is a global air pollution data set taken from Kaggle. It contains information on air
pollution levels in various locations around the world and has a total of 23463 rows/instances and
12 columns/attributes.
The features in the dataset include 'Country', 'City', 'AQI Value', 'AQI Category', 'CO AQI Value', 'CO AQI Category', 'Ozone AQI Value', 'Ozone AQI Category', 'NO2 AQI Value', 'NO2 AQI Category', 'PM2.5 AQI Value', and 'PM2.5 AQI Category'.
The 'Country' and 'City' columns and the five columns whose names end in 'Category' contain categorical data, and the five columns whose names end in 'Value' contain numerical data. The 'AQI Value' feature represents the Air Quality Index value and the 'AQI Category' feature represents the Air Quality Index category. The other features represent the AQI values and categories for CO, ozone, NO2, and PM2.5. These features are used in the analysis to understand the relationship between air pollution and AQI. To understand the characteristics of the dataset, we calculated various descriptive statistics.
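As a rough illustration of that step, the sketch below computes descriptive statistics with pandas. The four rows are made-up placeholder values, not records from the Kaggle file; only the column names come from the description above. In practice the frame would be loaded from the downloaded CSV instead.

```python
import pandas as pd

# Made-up placeholder rows; only the column names follow the dataset description.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "City": ["a", "b", "c", "d"],
        "AQI Value": [40, 75, 120, 165],
        "AQI Category": ["Good", "Moderate", "Moderate", "Unhealthy"],
        "PM2.5 AQI Value": [38, 75, 120, 165],
    }
)

# Count, mean, std, min, quartiles, and max for the numeric columns only.
numeric_summary = df.describe()

# How often each category label occurs.
category_counts = df["AQI Category"].value_counts()

print(df.shape)
print(numeric_summary)
print(category_counts)
```

Looking at the category counts alongside the numeric summary is a quick way to spot class imbalance before training a classifier.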
Experiment-1: Develop a classification model to predict the AQI category using a publicly available dataset of global air pollution factors.
Model-1: Logistic Regression
In this experiment, we developed a classification model to predict air quality index (AQI) categories from a dataset of global air pollution factors. We used logistic regression, a commonly used machine learning algorithm for classification tasks. During preprocessing, the 'City' and 'Country' columns were dropped, the AQI categories were transformed into binary values, and the data was split into training and testing sets. We trained both multinomial and Gaussian logistic regression models. The Gaussian model outperformed the multinomial model in accuracy, recall, and precision. ROC AUC analysis showed a high ability to distinguish between positive and negative classes, indicating strong classifier performance. In summary, our logistic regression models predicted AQI categories well from the pollution factors, with the Gaussian model showing the better results.
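For readers who want to see the general shape of such a pipeline, here is a minimal sketch using scikit-learn. It is not the code from this study: the data is synthetic (generated with `make_classification` as a stand-in for the pollutant columns), the three-class label is an assumption, and only a plain multinomial logistic regression is shown. The Gaussian variant mentioned above is not reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 numeric features and a 3-level class label (assumption).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling the inputs and raising max_iter are common remedies for
# convergence warnings from the default lbfgs solver.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
```

Wrapping the scaler and the model in one pipeline keeps the scaling statistics from being computed on the test split.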
Model-2: Support Vector Classification
Support Vector Classification (SVC) is a powerful machine learning algorithm for classification tasks, especially when the data isn't linearly separable. SVC finds a hyperplane that optimally separates the classes, and it can handle non-linear decision boundaries by using kernel functions. For our air pollution dataset, where linear separation isn't feasible, SVC proved to be an excellent choice. We trained the model on the same data as our previous model. The SVC achieved an accuracy, F1-score, recall, and precision of 1.0, meaning that every test instance was classified correctly, with no false positives or false negatives. This makes it a very good fit for this classification problem.
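The effect of the kernel can be seen on a toy example. The sketch below is an illustration only, assuming scikit-learn and a synthetic two-moons dataset rather than the air pollution data. It fits one SVC with a linear kernel and one with an RBF kernel on the same split.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a toy dataset that a straight line separates poorly.
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Same data, two kernels: a linear boundary versus a non-linear (RBF) one.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel: {linear_acc:.3f}, RBF kernel: {rbf_acc:.3f}")
```

On data shaped like this, the RBF kernel typically scores higher because it can bend the decision boundary around each moon.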
Model-3: Naive Bayes
Model 3 is a Naive Bayes classifier, a probabilistic algorithm based on Bayes' theorem and the assumption that the features are independent of one another given the class. Before applying this algorithm, the data was standardized using StandardScaler. The model was trained on the pre-processed data, and its performance was evaluated on the test set. The accuracy of 0.95 means that the Naive Bayes model correctly classified 95% of the test samples, showing strong predictive capability on our dataset.
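A minimal sketch of this setup follows, assuming scikit-learn's `GaussianNB` (the source does not say which Naive Bayes variant was used) and synthetic data in place of the real features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three classes (assumption, not the study's data).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Standardize, then fit one Gaussian per feature and class.
nb = make_pipeline(StandardScaler(), GaussianNB())
nb.fit(X_train, y_train)

nb_accuracy = nb.score(X_test, y_test)
# Class probabilities for the first three test rows; each row sums to 1.
probabilities = nb.predict_proba(X_test[:3])

print(f"accuracy: {nb_accuracy:.3f}")
print(probabilities)
```

Because the model outputs class probabilities, it can also be used to flag predictions it is unsure about, not only to assign labels.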
Analysis for the Experiment-1:
Based on the evaluation metrics of the three models, the SVC model performed best, with an
accuracy of 1.0, an F1-score of 1.0, a recall of 1.0, and a precision of 1.0. The logistic regression
model had an AUC of 0.99 and good accuracy; however, it had some convergence issues. The Naive
Bayes model had an accuracy of 0.95.
Depending on the specific use case and requirements of the problem, one may want to choose the
model that offers the best trade-off between performance and computational complexity.
However, SVC performed best in this experiment.
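For reference, the four metrics compared above can be computed from the counts in a binary confusion matrix. The sketch below uses plain Python; the function name and the counts passed to it are hypothetical and are not taken from this experiment.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)

# A classifier with no false positives and no false negatives scores 1.0 everywhere.
perfect = classification_metrics(tp=50, fp=0, fn=0, tn=50)
print(perfect)
```

Precision penalizes false positives and recall penalizes false negatives, so a model reaches 1.0 on all four metrics only when both error counts are zero.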
Experiment-2:
Use a machine learning algorithm to perform a regression task and predict AQI values using a
dataset of global air pollution factors. For this problem, we chose three models: Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor.
Model-1: Linear Regression
Linear Regression proved to be a suitable approach for predicting AQI values from a dataset of global air pollution factors. It is a supervised learning algorithm that models a linear relationship between independent variables (here, the pollution factors) and a dependent variable (here, the AQI value). The preprocessed and cleaned data was used to train the model. Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R2 score were used to assess performance. The model achieved an MAE of 4.82, an MSE of 73.53, an RMSE of 8.58, and an R2 score of about 0.977, meaning that 97.69% of the variance in AQI values is explained by the model. We conclude that the model has strong predictive capability and fits the dataset well. Scatter plots and box plots further illustrated the alignment between actual and predicted values.
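The four regression metrics are simple to compute by hand. The sketch below implements them in plain Python; the function name and the sample values are made up for illustration and are not outputs of the model.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE, R2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2


# Made-up AQI values, for illustration only.
y_true = [50, 80, 120, 160]
y_pred = [55, 78, 110, 165]
mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.4f}")
```

Since RMSE is the square root of MSE, the two reported figures can be checked against each other: the square root of 73.53 agrees with the reported RMSE of 8.58 up to rounding.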
Model-2: Random Forest Regressor
The Random Forest Regressor, an ensemble machine learning algorithm, was trained on the same training data as before. It combines the predictions of many decision trees to improve accuracy and is generally robust to outliers. The model achieved an R2 score of about 0.998, meaning that it explains 99.78% of the variance in the data. This result indicates a very strong fit to the dataset, making the model a reliable choice for this regression task.
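As an illustration of how such a model is typically fitted, here is a short sketch assuming scikit-learn and synthetic regression data. The hyperparameters are library defaults plus a fixed seed, not the settings used in this study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pollutant features and the AQI target (assumption).
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each of the 100 trees is grown on a bootstrap sample; predictions are averaged.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
rf_mse = mean_squared_error(y_test, y_pred)
rf_r2 = r2_score(y_test, y_pred)

# Relative contribution of each feature to the trees' splits; sums to 1.
importances = forest.feature_importances_

print(f"MSE={rf_mse:.2f} R2={rf_r2:.3f}")
print(importances)
```

The feature importances give a first hint of which inputs drive the predictions, which is useful when one pollutant dominates the target.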
Model-3: Gradient Boosting Regressor
The third regression model is a Gradient Boosting Regressor, trained and evaluated on the same dataset as the other two. It yielded a relatively low Mean Squared Error (MSE) of 10.88 and an R2 score of 0.99, indicating excellent performance and a strong fit to the data. Comparing all three models, the Random Forest Regressor emerged as the top performer, with the lowest MSE and the highest R2 score.
This suggests that Random Forest is the most suitable model among the three for this specific regression task. You can find further details and the experiment code
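A comparison of this kind can be sketched as a loop over the three model types. The example assumes scikit-learn and uses the synthetic `make_friedman1` dataset, so the ranking it prints says nothing about the ranking on the air pollution data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (assumption, not the study's data).
X, y = make_friedman1(n_samples=500, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every model on the same split and record test-set MSE and R2.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

# The model with the lowest test MSE on this particular split.
best_model = min(results, key=lambda name: results[name]["mse"])

for name, scores in results.items():
    print(f"{name}: MSE={scores['mse']:.2f} R2={scores['r2']:.3f}")
print("lowest MSE:", best_model)
```

Evaluating all models on one shared train/test split keeps the comparison fair; cross-validation would give a more stable ranking at a higher computational cost.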
Experiment-3:
Using unsupervised machine learning techniques to group locations with similar air
pollution patterns into clusters based on the AQI Value and PM 2.5 AQI Value.
Model-1: K-Means Clustering
With K-Means clustering, we aimed to partition the dataset into meaningful clusters. After preprocessing, we used the elbow method and the Silhouette score to determine the optimal number of clusters, which was found to be 6. We then trained the K-Means model on the data, excluding the 'Country' column, and assigned cluster labels to locations based on the AQI Value and PM 2.5 AQI Value attributes. The within-cluster sum of squares (WCSS) for k=6 was 13.109, and the Silhouette score, which measures how similar each point is to its own cluster compared with the other clusters, reached 0.54. Together, these metrics indicate that the K-Means model with 6 clusters groups the observations into reasonably distinct clusters, offering useful insight into the dataset's underlying structure.
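The elbow and Silhouette checks can be sketched as a loop over candidate values of k. The example assumes scikit-learn, synthetic two-column blob data in place of the two AQI columns, and standard scaling as the preprocessing step. The scaling choice is a guess, since the preprocessing is not detailed here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data standing in for the two AQI columns (assumption).
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

wcss = {}
silhouettes = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss[k] = km.inertia_  # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)

# Elbow method: look for the k after which WCSS stops dropping sharply.
# Silhouette: pick the k with the highest average score.
best_k = max(silhouettes, key=silhouettes.get)

for k in wcss:
    print(f"k={k}: WCSS={wcss[k]:.2f} silhouette={silhouettes[k]:.3f}")
print("best k by silhouette:", best_k)
```

WCSS keeps shrinking as k grows, which is why it is read for an elbow rather than minimized, while the Silhouette score can simply be maximized.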
Model-2: Hierarchical Clustering
In our Hierarchical Clustering model, we adopted a more flexible approach in which the number of clusters does not need to be specified in advance. The algorithm initially treats each data point as its own cluster and then repeatedly merges the two closest clusters. The resulting Silhouette score of 0.52 indicates moderate similarity within clusters and moderate separation between clusters, reflecting the structure of the data.
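The merge procedure can be illustrated with SciPy, which builds the full merge tree first and only needs a cluster count when the tree is cut. This is a sketch on synthetic blobs. Ward linkage and the cut at three clusters are assumptions for the example, not settings reported for this study.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups (assumption).
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=1)

# Build the full merge tree: every point starts as its own cluster and the
# closest pair (by Ward's criterion) is merged at each of the n - 1 steps.
Z = linkage(X, method="ward")

# Cut the tree afterwards to obtain at most 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")

hier_silhouette = silhouette_score(X, labels)
print(Z.shape)
print(f"silhouette: {hier_silhouette:.3f}")
```

Because the tree is built once, different cuts can be compared by Silhouette score without refitting the model.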
Model-3: Gaussian Mixture Model
In Model 3, we used a Gaussian Mixture Model (GMM), a probabilistic approach that assumes the data points are generated from a mixture of Gaussian distributions. The model was trained with a specified number of clusters (n_components=6), but the Silhouette score of 0.16 showed that the data points did not align well with their assigned clusters, suggesting that the clusters were not well defined. This outcome may be due to factors such as an unsuitable cluster count or covariance structure.
Overall, based on Silhouette scores, the K-Means algorithm with 6 clusters was the most effective of the three models, with a Silhouette score of 0.54, indicating better-defined clusters for this dataset.
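One common way to examine the cluster count and covariance structure of a GMM is to compare candidate settings by the Bayesian Information Criterion (BIC). This technique was not part of the study. The sketch below shows it on synthetic blobs with scikit-learn and is an illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption, not the study's data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=2)

# Fit one GMM per (covariance type, component count) pair and record its BIC.
bic = {}
for cov in ("full", "tied", "diag", "spherical"):
    for n in range(2, 7):
        gmm = GaussianMixture(
            n_components=n, covariance_type=cov, random_state=0
        ).fit(X)
        bic[(cov, n)] = gmm.bic(X)

# Lower BIC is better: it balances fit against the number of parameters.
best_cov, best_n = min(bic, key=bic.get)

best_gmm = GaussianMixture(
    n_components=best_n, covariance_type=best_cov, random_state=0
).fit(X)
gmm_labels = best_gmm.predict(X)
gmm_silhouette = silhouette_score(X, gmm_labels)

print(f"best setting: covariance_type={best_cov}, n_components={best_n}")
print(f"silhouette: {gmm_silhouette:.3f}")
```

The Silhouette score of the selected model can then be compared directly with the scores of the other clustering methods.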
Conclusion:
The final conclusion of this study is that machine learning techniques can be effectively applied
to the global air pollution data set to classify AQI categories, predict AQI values, and cluster
locations with similar air pollution patterns. By using supervised and unsupervised algorithms, air
pollution levels in different locations can be predicted accurately and their patterns better understood.
The benefits of using this data set in the study are numerous. By understanding the patterns and
sources of air pollution, policymakers and decision-makers can take informed actions to reduce
the negative impacts of air pollution on human health and the environment. Additionally, the use
of machine learning techniques allows for the analysis of a large amount of data in a quick and
efficient manner, providing insights that may not have been apparent through traditional
methods. Overall, this study demonstrates the potential for machine learning to inform and
improve efforts to address air pollution on a global scale.
Dataset:
Global Air Quality Index. [online] Kaggle. Available at: