Thursday, March 13, 2025

Understanding Customer Churn: A Data-Driven Insight

In the world of business, understanding customer behavior is key to improving retention and reducing churn. By analyzing customer data, companies can gain valuable insights to tailor their strategies and boost customer loyalty. In this blog post, we will explore a customer dataset, analyze key features, and visualize the findings in an easy-to-understand way.

Let's Dive Into the Data

We start with a dataset of 200 customers, each having a unique customer ID, age, transaction count, credit limit, and churn status (whether they have stayed or left). Our goal is to understand the relationship between these features and how they might correlate with customer churn.

Customer Data Overview:

  • Customer Age: Ranges from 18 to 75 years.
  • Transaction Count (Total_Trans_Ct): How many transactions each customer has made.
  • Credit Limit: The customer's available credit.
  • Churn: Whether the customer has churned ("Yes") or stayed ("No").

The first few rows of data look like this:
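A minimal sketch of loading and previewing the data with pandas (the file name and column names here are assumptions, not necessarily the exact ones used in the notebook):

import pandas as pd

df = pd.read_csv("customer_churn.csv")   # hypothetical file name
print(df[["Customer_Age", "Total_Trans_Ct", "Credit_Limit", "Churn"]].head())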

Visualizations With Insights

To make the data analysis more intuitive and easy to digest, we'll use some simple yet powerful visualizations.

1. Age Distribution of Customers

A box plot is a great way to display the distribution of customer ages. It helps us see the spread, central tendency, and any potential outliers.
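A sketch of how such a box plot could be drawn (same assumed file and column names as above):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_churn.csv")   # hypothetical file name
plt.boxplot(df["Customer_Age"])          # assumed column name
plt.title("Distribution of Customer Age")
plt.ylabel("Age (years)")
plt.show()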

This box plot reveals that most customers are in the age range between 25 and 60, with a few outliers towards the younger and older ends.

2. Transactions per Customer

The number of transactions each customer makes is another important factor. A histogram shows how transaction counts are distributed across customers.
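One way to produce such a histogram (assumed file and column names):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_churn.csv")          # hypothetical file name
plt.hist(df["Total_Trans_Ct"], bins=20, edgecolor="black")
plt.title("Transactions per Customer")
plt.xlabel("Transaction count")
plt.ylabel("Number of customers")
plt.show()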


From the histogram, we can observe that most customers make between 50 and 100 transactions, with a few customers making an exceptionally high number of transactions.

3. Credit Limit and Churn Status

The relationship between credit limit and customer churn is interesting. Let's visualize it with a box plot grouped by churn status.
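A sketch of a box plot of credit limit grouped by churn status (assumed file and column names):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_churn.csv")          # hypothetical file name
df.boxplot(column="Credit_Limit", by="Churn")   # assumed column names
plt.title("Credit Limit by Churn Status")
plt.suptitle("")                                # remove pandas' automatic super-title
plt.ylabel("Credit limit")
plt.show()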

The box plot shows that customers who churn tend to have lower credit limits compared to those who stay. This might indicate that customers with lower credit limits are more likely to leave, which could be an area for further investigation.

4. Churn Rate Breakdown

It’s also valuable to break down the churn rate across the dataset. A simple pie chart gives a clear picture of how many customers are staying versus leaving.
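A sketch of the pie chart (assumed file and column names):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_churn.csv")                  # hypothetical file name
df["Churn"].value_counts().plot.pie(autopct="%1.0f%%")  # share of "No" vs "Yes"
plt.title("Churn Rate Breakdown")
plt.ylabel("")
plt.show()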

As seen in the pie chart, approximately 30% of customers have churned, and 70% have stayed with the company.

Key Takeaways

  • Customer Age Distribution: The majority of customers fall between 25 and 60 years of age, with a few younger and older customers.
  • Transactions: Most customers have between 50 and 100 transactions, and a small group exhibits high transaction volumes.
  • Credit Limit: Customers with higher credit limits are less likely to churn, suggesting that offering higher credit limits might help in reducing churn.
  • Churn Rate: About 30% of customers have churned, which calls for targeted strategies to retain them.

What Can Businesses Do with These Insights?

By analyzing customer age, transaction activity, and credit limit, businesses can tailor their retention strategies more effectively. For example, targeting younger customers with fewer transactions for special offers or improving the credit limits of high-risk customers could help in retaining more users.

Conclusion

Data is a powerful tool for understanding customer behavior. By utilizing visualizations like box plots, bar charts, and pie charts, businesses can make sense of complex datasets and identify key trends. Customer churn analysis is just one example of how data can inform decision-making and guide business strategies.


https://github.com/yamini542/Applied-AI/blob/main/customer_churn.ipynb   (Code)





Sunday, September 8, 2024

Create Stunning Animations with Python in Google Colab

 

Introduction:

In this article, we’ll explore how to create simple but impressive animations using Python in Google Colab. Animations can be a powerful way to visualize data and bring your projects to life. We'll use the matplotlib library to animate a sine wave, but the concepts we cover can be applied to various visualizations.

Prerequisites:

Before we dive into the code, make sure you have a basic understanding of Python and familiarity with Google Colab. If you're new to Colab, it’s a cloud-based platform that allows you to run Python code in your browser, making it perfect for creating and sharing code snippets and data analyses.

Step-by-Step Guide

1. Set Up Your Environment

First, ensure you have the necessary libraries installed. matplotlib and numpy are essential for creating and animating plots. In Google Colab, these libraries are typically pre-installed. However, you can install or upgrade them using the following command:
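A minimal version of that command (Colab already ships both libraries, so this step is usually optional):

!pip install --upgrade matplotlib numpy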

2. Import Libraries

Start by importing the required libraries:
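A minimal set of imports for the sine-wave animation described in this post:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation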

3. Initialize the Plot

Set up the figure and axis for the plot:
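One way to set this up, as a sketch:

fig, ax = plt.subplots()
line, = ax.plot([], [], lw=2)   # an empty line that the animation will update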


4. Define the Initialization Function

This function sets the limits of the plot and prepares it for animation:
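A sketch of such an init function (the limits here assume a sine wave over one period):

def init():
    ax.set_xlim(0, 2 * np.pi)   # x covers one full period of the sine wave
    ax.set_ylim(-1.5, 1.5)      # a little headroom above and below ±1
    line.set_data([], [])
    return line,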

5. Define the Update Function

This function updates the data for each frame of the animation:
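A sketch of an update function that shifts the sine wave a little on each frame:

def update(frame):
    x = np.linspace(0, 2 * np.pi, 200)
    y = np.sin(x + frame / 10.0)   # the phase shift grows with the frame number
    line.set_data(x, y)
    return line,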





6. Create the Animation

Use FuncAnimation to create the animation. This function updates the plot with new data points at regular intervals:
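For example (100 frames, a new frame every 50 ms; these numbers are an assumption):

anim = FuncAnimation(fig, update, init_func=init, frames=100, interval=50, blit=True)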


7. Display the Animation

To display the animation in Google Colab, use:
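One common way is to render the animation as JavaScript/HTML:

from IPython.display import HTML
HTML(anim.to_jshtml())   # or HTML(anim.to_html5_video()) if ffmpeg is available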


Here is what the result looks like: 




Tip: We can also download and save the animation file; the code below does that.
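A sketch of saving the animation as a GIF and downloading it from Colab:

anim.save("sine_wave.gif", writer="pillow")   # or an .mp4 with the ffmpeg writer

from google.colab import files
files.download("sine_wave.gif")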






Monday, September 25, 2023

Using Time-Series Models for Heart Rate Prediction

Introduction:

Heart rate prediction plays a pivotal role in healthcare by providing essential insights into a patient's overall well-being. In this blog post, we delve into the significance of heart rate prediction and conduct a comprehensive analysis of three time-series models: ARIMA, SARIMAX, and Exponential Smoothing.

Dataset Description:

Our dataset, derived from a Lifetouch device, encompasses heart rate and respiration rate data, along with SpO2 and pulse data from an oximeter. With approximately four hours of data collected at 1-minute intervals, the dataset consists of 226 entries and 5 columns.



Data Pre-Processing:

In the heart rate prediction task, meticulous data pre-processing was conducted. This included data cleaning, handling missing values, data normalization, and addressing outliers.

Comparison of the Models:

Before delving into model selection, we ensured the dataset's stationarity. Subsequently, we compared the performance of our three chosen models using various metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared, AIC, and BIC. Our rigorous analysis unveiled that the Exponential Smoothing model, particularly the triple exponential smoothing variant, exhibited the most accurate predictions and excelled in all evaluation metrics.
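As an illustration, a minimal sketch of fitting the triple (Holt-Winters) exponential smoothing model with statsmodels might look like this; the file name, column name, and seasonal period are assumptions, not the exact values used in the assignment:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error, mean_absolute_error

df = pd.read_csv("lifetouch_vitals.csv")   # hypothetical file name
series = df["heart_rate"]                  # hypothetical column name
train, test = series[:180], series[180:]   # hold out the tail of the 226 points

# Triple exponential smoothing: level + trend + seasonality
model = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=60)
fit = model.fit()
pred = fit.forecast(len(test))

print("MSE:", mean_squared_error(test, pred))
print("MAE:", mean_absolute_error(test, pred))
print("AIC:", fit.aic)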

Analyzing the Results and Choosing the Best Model:

The selection of the most suitable model is of paramount importance in machine learning tasks. Our extensive analysis unequivocally identifies the Exponential Smoothing model as the optimal choice for heart rate prediction. This model leverages a combination of trend, seasonality, and error components within the dataset to ensure precise heart rate predictions.



Conclusion:

In conclusion, our in-depth exploration of ARIMA, SARIMAX, and Exponential Smoothing models reveals that the Exponential Smoothing model, with its triple exponential smoothing technique, stands out as the ideal choice for heart rate prediction. This model holds immense potential for real-time applications, enabling healthcare professionals to monitor patient health effectively and provide valuable insights.

References:

1. "Time Series Analysis and Its Applications: With R Examples" by Robert H. Shumway and David S. Stoffer

2. "Forecasting: Principles and Practice" by Rob J Hyndman and George Athanasopoulos

3. "Introduction to Time Series and Forecasting" by Peter J. Brockwell and Richard A. Davis

4. "Applied Time Series Analysis for Fisheries and Environmental Science" by Richard D. Methot, Jr.

5. "Time Series Analysis: Forecasting and Control" by George E.P. Box, Gwilym M. Jenkins, and Gregory C. Reinsel.

Code for reference: 

 https://github.com/yamini542/AppliedAI_Assignments/tree/main/Assignment_1_TimeSeries



Sunday, September 24, 2023

A Machine Learning-Driven Study of Global Air Pollution Dataset

 Introduction:

Air pollution is one of the most pressing environmental issues of our time, affecting both human health and the planet. In order to mitigate the negative effects of air pollution, it is crucial to understand and predict the levels and sources of air pollution. This blog presents a study of global air pollution prediction using machine learning techniques. 

Problem Definitions: 

1) Develop a classification model to predict the AQI category using a publicly available dataset of global air pollution factors.

 2) Use machine learning to perform a regression task and predict AQI values using a dataset of global air pollution readings. 

3) Use clustering techniques to group locations that have similar air pollution patterns based on the AQI Value and PM 2.5 AQI Value. 

We solve each problem with three different models, using either supervised or unsupervised machine learning methods.

Data Set Description: 

The dataset is a global air pollution dataset taken from Kaggle. It contains information on air pollution levels in various locations around the world, with a total of 23,463 rows/instances and 12 columns/attributes.

The features in the dataset include 'Country', 'City', 'AQI Value', 'AQI Category', 'CO AQI Value', 'CO AQI Category', 'Ozone AQI Value', 'Ozone AQI Category', 'NO2 AQI Value', 'NO2 AQI Category', 'PM2.5 AQI Value', and 'PM2.5 AQI Category'. The 'Country' and 'City' columns contain categorical data and the rest of the columns contain numerical data. The 'AQI Value' feature represents the Air Quality Index value and the 'AQI Category' feature represents the Air Quality Index category. The other features represent the AQI values for CO, ozone, NO2, and PM2.5. These features are used in the analysis to understand the relationship between air pollution and AQI. In order to understand the characteristics of the dataset, we calculated various descriptive statistics.



Experiment-1: Develop a classification model to predict the AQI category using a publicly available dataset of global air pollution factors.

Model-1: Logistic Regression: In this experiment, we developed a classification model to predict air quality index (AQI) categories using a dataset of global air pollution factors. We employed logistic regression, a commonly used machine learning algorithm for classification tasks. The dataset was preprocessed and columns like 'City' and 'Country' were dropped. The AQI categories were transformed into binary values, and the data was split into training and testing sets. We trained both multinomial and Gaussian logistic regression models. Our analysis revealed that the Gaussian logistic regression model outperformed the multinomial model in terms of accuracy, recall, and precision. ROC AUC analysis demonstrated a high ability to distinguish between positive and negative classes, indicating strong classifier performance. In summary, our logistic regression models effectively predicted AQI categories based on pollution factors, with the Gaussian model showing superior results.


Model-2: Support Vector Classification
Support Vector Classification (SVC) is a powerful machine learning algorithm used for classification tasks, especially when the data isn't linearly separable. SVC finds a hyperplane that optimally separates different classes, and it can handle non-linear decision boundaries by utilizing kernel functions. For our air pollution dataset, where linear separation isn't feasible, SVC proved to be an excellent choice. We trained the model using the same data as our previous model, achieving remarkable results. The SVC exhibited perfect accuracy, F1-score, recall, and precision, all at 1.0, indicating its exceptional ability to classify instances accurately and minimize false positives. This highlights the model's outstanding performance and its suitability for this classification problem.

Model-3: Naive Bayes Theorem
Model 3 utilizes the Naive Bayes Theorem, a probabilistic algorithm based on Bayes' theorem and the assumption of feature independence. Before applying this algorithm, the data was normalized using the Standard Scaler. The model was trained using pre-processed data, and its performance was evaluated. The accuracy of 0.95 indicates that the Naive Bayes model successfully classified 95% of the test samples, showcasing its strong predictive capabilities on our dataset.
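A minimal sketch of how the three classifiers above could be trained and compared (the file name, feature selection, and binarization of the target are assumptions; the original notebook may differ):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

df = pd.read_csv("global_air_pollution.csv")   # hypothetical file name
X = df[["CO AQI Value", "Ozone AQI Value", "NO2 AQI Value", "PM2.5 AQI Value"]]
y = (df["AQI Category"] == "Good").astype(int)  # assumed binarization of the category

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("SVC", SVC()),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, pred),
          "f1:", f1_score(y_test, pred),
          "recall:", recall_score(y_test, pred),
          "precision:", precision_score(y_test, pred))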

Analysis for Experiment-1:

Based on the evaluation metrics of the three models, the SVC model performed best, with an accuracy of 1.0, an F1-score of 1.0, a recall score of 1.0, and a precision score of 1.0. The logistic regression model had an AUC of 0.99 and good accuracy; however, it had some convergence issues. The Naive Bayes model had an accuracy of 0.95.

Depending on the specific use case and requirements of your problem, you may want to choose the model that offers the best trade-off between performance and computational complexity. However, SVC performed best in this experiment.

Experiment-2:

Use a machine learning algorithm to perform a regression task and predict AQI values using a dataset of global air pollution factors. For this problem, I chose three models: Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor.
Model-1:
Linear Regression proved to be a suitable machine learning approach for predicting AQI values based on a dataset of global air pollution factors. It's a supervised learning algorithm that establishes relationships between independent variables like pollution factors and dependent variables like AQI values. The data, already preprocessed and cleaned, was used to train the model. Evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R2 Score were applied to assess performance. With an MAE of 4.82, MSE of 73.53, RMSE of 8.58, and an impressive R2 Score of 0.97, indicating that 97.69% of the variance is explained by the model, we can conclude that the model demonstrates strong predictive capabilities and a good fit for the dataset. Scatter plots and box plots further illustrated the alignment between actual and predicted values.

Model-2:
The Random Forest Regressor, an ensemble machine learning algorithm, was employed for regression tasks using the same training data as before. This powerful algorithm combines multiple decision trees to enhance prediction accuracy and is robust to outliers and missing values. The model achieved an impressive R2 score of 0.99, signifying its ability to explain 99.78% of the data's variance. This outstanding result confirms the model's strong fit for the dataset, making it a reliable choice for regression tasks.
Model-3:
In this experiment, we explored three regression models, including a Gradient Boosting Regressor. The models were trained and evaluated using the provided dataset. The results revealed that the Gradient Boosting Regressor yielded a relatively low Mean Squared Error (MSE) of 10.88 and an impressive R2 score of 0.99, signifying excellent model performance and a strong fit to the data. In comparing all models, the Random Forest Regressor emerged as the top performer with the lowest MSE and the highest R2 score.

This suggests that Random Forest is the most suitable model among the three for this specific regression task. Further details are in the experiment code.
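A minimal sketch of the regression comparison (the same hypothetical file and feature choices as in the classification sketch above):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("global_air_pollution.csv")   # hypothetical file name
X = df[["CO AQI Value", "Ozone AQI Value", "NO2 AQI Value", "PM2.5 AQI Value"]]
y = df["AQI Value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [("Linear Regression", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(random_state=42)),
                    ("Gradient Boosting", GradientBoostingRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print(f"{name}: MAE={mean_absolute_error(y_test, pred):.2f} MSE={mse:.2f} "
          f"RMSE={np.sqrt(mse):.2f} R2={r2_score(y_test, pred):.3f}")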

Experiment-3:

Using unsupervised machine learning techniques to group locations with similar air pollution patterns into clusters based on the AQI Value and PM 2.5 AQI Value.
Model-1:
In this model utilizing K-Means clustering, we aimed to partition the dataset into meaningful clusters. Following preprocessing, we employed the elbow method and silhouette score to determine the optimal number of clusters, which was found to be 6. We then trained the K-Means model on the data, excluding the 'Country' column, and assigned cluster labels to locations based on AQI Value and PM 2.5 AQI Value attributes. The within-cluster sum of squares (WCSS) score for k=6 was 13.109, while the Silhouette score, a measure of similarity within clusters, reached 0.54. These metrics collectively affirm that the K-Means model with 6 clusters effectively groups the observations into distinct clusters, offering valuable insights into the dataset's underlying structure.
Model-2:
In our Hierarchical Clustering model, we adopted a more flexible approach where the number of clusters doesn't require pre-specification. The algorithm initially treats each data point as its own cluster and subsequently merges the closest clusters. The achieved score of 0.52 indicates moderate similarity within clusters and moderate dissimilarity between clusters, reflecting the structure of the data.
Model-3:
In Model 3, we employed a Gaussian Mixture Model (GMM), a probabilistic approach that assumes data points are generated from a mixture of Gaussian distributions. The model was trained with a specified number of clusters (n_components=6), but the Silhouette score of 0.16 indicated that the data points did not align well with their respective clusters, suggesting that the clusters were not well-defined. This outcome may be influenced by various factors, such as an incorrect choice of cluster count or covariance structure. Overall, based on Silhouette scores, the K-Means algorithm with 6 clusters emerged as the most effective among the three models, with a Silhouette score of 0.54, indicating better-defined clusters and more suitable clustering for the dataset.
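A minimal sketch of the K-Means part of this experiment, including the elbow/silhouette search for k (the file name and the scaling step are assumptions):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("global_air_pollution.csv")   # hypothetical file name
X = StandardScaler().fit_transform(df[["AQI Value", "PM2.5 AQI Value"]])

# Elbow method: inspect the within-cluster sum of squares (WCSS) and silhouette for each k
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, "WCSS:", km.inertia_, "silhouette:", silhouette_score(X, km.labels_))

# Final model with the chosen k = 6
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)
df["cluster"] = kmeans.labels_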


Conclusion:
The final conclusion of this study is that machine learning techniques can be effectively applied to the global air pollution dataset to identify patterns, classify sources, and cluster locations with similar air pollution patterns. By using supervised and unsupervised algorithms, air pollution levels and sources in different locations can be accurately predicted and understood.

The benefits of using this dataset in the study are numerous. By understanding the patterns and sources of air pollution, policymakers and decision-makers can take informed actions to reduce the negative impacts of air pollution on human health and the environment. Additionally, the use of machine learning techniques allows for the analysis of a large amount of data in a quick and efficient manner, providing insights that may not have been apparent through traditional methods. Overall, this study demonstrates the potential for machine learning to inform and improve efforts to address air pollution on a global scale.

Dataset:

Global Air Quality Index. [online] Kaggle. Available at:

Hand-Free Notes

 Introduction:

In today's fast-paced world, access to information is more valuable than ever. We understand the importance of having the right resources at your fingertips, which is why we've created Hands-Free Notes.

Our platform is your one-stop destination for hassle-free PDF file sharing and downloading. Whether you're a student looking for study materials, a professional in search of reference documents, or anyone with a thirst for knowledge, we've got you covered.

Key Features:

Effortless PDF Sharing: Upload your PDF files with ease, making them accessible to a global community of users who may benefit from your knowledge.

Seamless Downloading: Discover a treasure trove of PDF documents uploaded by our community. Download any file you need without restrictions.

User-Friendly Interface: Our intuitive design ensures a smooth and convenient user experience, whether you're a tech enthusiast or a casual visitor.

Technology: Powered by PHP, CSS, JavaScript, and HTML.

How does it work?

Login Page: Log in to the website with your email and password; if you are a new user, sign up first.

Add Books: Once users log in successfully, they can add books or notes here and fill in fields like title and description so they can easily find them later.

Books: The user can view a book by clicking the Books icon.

Main Page where we can Add and view Books







Tuesday, September 19, 2023

Harnessing Large Language Models for Sustainable Restaurant Operations: A Chatbot Approach

 Abstract:

In an age where technology permeates every facet of our lives, it's only natural that our dining experiences evolve with it. Artificial Intelligence (AI) has opened up exciting avenues to revolutionize the way we dine, moving beyond mere culinary delights to cater to our unique tastes, dietary needs, and environmental concerns. This shift raises a compelling question: How can AI not only enhance dining but also promote responsible eating and sustainable practices in restaurants?

Surprisingly, while AI is omnipresent in our digital lives, AI chatbots tailored for food recommendations are a rarity. This gap in the market prompted the idea of harnessing the formidable power of AI language models, exemplified by OpenAI's GPT-3.5 models, to build a chatbot. This chatbot's mission: to deliver personalized food recommendations and dietary plans to users, wrapped in a user-friendly interface and backed by a CSV agent.

Objective:

Chatbot Application: Develop an intuitive and user-friendly chatbot application that utilizes large language models for natural language understanding and generation. 

Leveraging Large Language Models: Employ cutting-edge language models, specifically GPT-3.5 models, to empower the chatbot with an effective understanding of user queries and contextually relevant responses. 

Intuitive User Interface: Design an intuitive user interface using Streamlit to ensure that users of all technical backgrounds can easily navigate and interact with the chatbot. 

Conversational Interaction: Implement a conversational interaction model that allows users to engage with the chatbot through questions, recommendations, and dynamic dialogues to enhance user engagement and personalization. 

Question-Answer and Food Recommendations: Focus on two core functionalities - answering user questions effectively and providing tailored food recommendations based on user preferences and dietary requirements

Architecture:

Implementation of the Chatbot:

Necessary requirements to implement the application:

1. API key to connect to the LLM models

2. Platform to build the project: Google Colab

3. Streamlit for the user interface

For this project I have chosen GPT-3.5 models ('gpt-3.5-turbo', 'gpt-3.5-turbo-16k', 'gpt-3.5-turbo-0613'), a conversational model, and the CSV agent.

To build this chatbot, we need an API key to connect to the LLM models. If we don't have one, we can create an API key through this link (https://platform.openai.com/account/api-keys).

Open Google Colab and install the necessary libraries and Streamlit.

In this section, we'll provide a step-by-step guide to implementing the chatbot for sustainable restaurant operations. We'll cover the essential components and explain how they come together to create a user-friendly and effective tool.

Step 1. Set Up Your Environment:

Import necessary libraries such as streamlit, pandas, openai, and others.

Configure your OpenAI API key for language models.
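A minimal sketch of this setup in a Colab cell (the package set and key handling are assumptions; never hardcode a real key):

!pip install -q streamlit openai langchain pandas

import pandas as pd
import streamlit as st
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"   # placeholder for your own key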


Step 2. Create a Streamlit App
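A minimal Streamlit skeleton for the app (the titles and layout here are assumptions, not the exact app):

# app.py
import streamlit as st

st.title("Sustainable Restaurant Chatbot")
st.sidebar.header("Upload restaurant data and start chatting")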


Step 3. Upload CSV file and View Data

In this step, we focus on how users can interact with CSV data uploaded to the chatbot. CSV (Comma-Separated Values) files are a common format for storing structured data, making them suitable for various applications, including our chatbot. We use a library called langchain to create a CSV agent, which acts as an interface between the user and the uploaded data.
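A sketch of the upload-and-preview part of this step (the widget labels are assumptions):

import pandas as pd
import streamlit as st

uploaded_file = st.file_uploader("Upload a CSV file", type="csv")
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.dataframe(df.head())   # let the user view the data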


Screenshot of Step 3

Step 4: View data

Step 4: Know about CSV


Step 4. Set up the CSV agent for interaction.
When a user uploads a CSV file, the chatbot reads and processes this data, making it available for answering user queries. Users can input specific questions or prompts related to the data, and the chatbot responds with relevant information from the CSV file. It's a powerful way to provide data-driven recommendations and responses based on the uploaded dataset.
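A sketch of wiring up the CSV agent; this uses the create_csv_agent helper as it existed in the 2023 langchain releases (later versions moved it to langchain_experimental), and the file name here is hypothetical:

import streamlit as st
from langchain.llms import OpenAI
from langchain.agents import create_csv_agent

# Requires the OPENAI_API_KEY environment variable to be set
agent = create_csv_agent(OpenAI(temperature=0), "restaurant_menu.csv", verbose=True)

query = st.text_input("Ask a question about the uploaded data")
if query:
    st.write(agent.run(query))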
Step 5: Conversation Model Setup

In this step, we enable users to have dynamic conversations with the chatbot. We introduce a conversation model that understands and responds to user queries and prompts. Users can enter text-based queries, and the chatbot interacts with them in a conversational manner.



The conversation model is powered by advanced language models like 'gpt-3.5-turbo', 'gpt-3.5-turbo-16k', and 'gpt-3.5-turbo-0613'. Users can ask questions, seek recommendations, or engage in a dialogue with the chatbot. The chatbot understands natural language and responds contextually. The conversation history is displayed, allowing users to see the ongoing interaction between themselves and the chatbot.
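A minimal sketch of the conversational loop, using the pre-1.0 openai Python SDK that was current at the time (the model choice and prompts are assumptions):

import openai

openai.api_key = "YOUR_OPENAI_API_KEY"   # placeholder

def chat(history):
    # history is a list of {"role": ..., "content": ...} messages
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    return response["choices"][0]["message"]["content"]

history = [{"role": "system", "content": "You are a helpful food recommendation assistant."},
           {"role": "user", "content": "Suggest a low-waste vegetarian dinner."}]
print(chat(history))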

                                              

Step 5: Recommendations

This step transforms the chatbot from a data-driven tool into a conversational AI assistant, making it user-friendly and engaging. It adds a dynamic layer to the chatbot's capabilities, enhancing its utility in helping users make informed food-related decisions.

Future Work:

In the future, our work will focus on enhancing the chatbot's capabilities by integrating advanced AI models like GPT-4 to improve language comprehension and nuanced recommendations.

We will also prioritize multilingual proficiency, expanding beyond English to cater to global markets, and delivering culturally adapted dining suggestions. Additionally, we aim to offer highly personalized experiences by leveraging user data and implementing advanced inventory management techniques for accurate demand forecasting.

The exploration of innovative voice and visual interfaces, coupled with ethical considerations, will be pivotal as we strive to promote responsible AI use and sustainability metrics. Seamless integration with restaurant systems, offline access, and a robust feedback mechanism will ensure that the chatbot continues to evolve and provide valuable support for sustainable dining practices.

Conclusion:

In conclusion, our AI-powered chatbot, fueled by advanced language models like GPT-3.5 models and supported by robust data-driven agents, signifies a transformative step in the realm of sustainable restaurant operations. It excels in providing food recommendations and dietary guidance, empowering patrons to make informed and nutritious choices, thereby reducing food waste and promoting healthier dining. Additionally, by optimizing inventory management, the chatbot aids in cost reduction and minimizes the environmental footprint associated with food production and disposal, underlining its potential to revolutionize the restaurant industry and foster sustainability.

References:

1.https://openai.com/blog/openai-api

2.https://python.langchain.com/docs/integrations/toolkits/csv

3.https://streamlit.io/

GitHub Link: https://github.com/yamini542/Thesis-Project






Monday, October 31, 2022

Predicting whether a person earns more or less than 50k a year on the Income Classifier dataset, using Python and MongoDB.



Introduction:

    Visualization lets people easily understand the content it represents. In this blog, I try to predict whether a person makes more than 50k a year by writing code.
    In this blog, you will learn how to build a decision tree on a sample Income Classifier dataset and how to create a GUI window in Python with a secure database connection.

Overview:

The following topics will be discussed in this blog:
1. What is a decision tree?
2. Installation guides for the necessary IDEs.
3. How to import the data into the dataset and use it in a Python program.
4. How to implement a Python program to build a decision tree with a MongoDB connection.
5. How the program works.
6. References.
 

What is a Decision Tree?

    When we make any decision we weigh the probabilities, for example: do I need to take food to college today or not? I decide based on the circumstances and their probability. A decision tree visualizes those same probabilities and circumstances in a tree, for a better understanding of the situation and to help choose the right decision.

    Decision tree: 
        
        A decision tree is a decision-making tool that employs a tree-like model of decisions and their potential outcomes, such as chance event outcomes, resource costs, and utility. It is also one way to display an algorithm that consists only of conditional control statements.

Decision trees are a common machine-learning technique used to identify the course of action most likely to achieve a goal. They are frequently used in operations research, notably in decision analysis.

    How Decision Tree Algorithm works:
        
           The decision tree algorithm helps a machine understand the data and predict a decision using the root attributes and labels. Nodes, edges, root, leaves, and splitting are the common terms we use with decision trees.

There are 3 methods to make a split in the decision tree.
  • Information Gain & Entropy
  • Gini index
  • Gain ratio
Decision trees use various algorithms to decide how to split a node into two or more sub-nodes, so that the purity of the sub-nodes increases with respect to the target variable. The decision tree tries splits on all available variables and then selects the split which results in the most homogeneous sub-nodes.

The algorithm begins with the original set as the root node.

On each iteration, it iterates through every unused attribute of the set and calculates the Entropy (H) and Information Gain (IG) of this attribute.

Select the attribute which has the smallest entropy or largest information gain.

Split the set by the selected attribute to produce a data subset.

The algorithm goes on to each subset, considering only attributes never selected before.

Attribute Selection Measures:

      Attribute selection measures are the basic set of rules to choose the splitting criterion that helps in dividing the dataset in the best way. Information Gain, Gini Index, and Gain Ratio are the most commonly used and popular selection measures.

Gini impurity and Information Gain:

    Gini impurity says that if we select two items from a population at random, they should be of the same class; the probability of this is 1 if the population is pure. Information Gain builds on the idea that a less impure node requires less information to describe it, while a more impure node requires more. Information theory defines this degree of disorganization in a system as entropy.
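A small numeric illustration of the two measures, where p is the list of class proportions in a node:

import numpy as np

def gini(p):
    # Gini impurity: 0 for a pure node, larger for mixed nodes
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy: 0 for a pure node, maximal when the classes are evenly mixed
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # pure node    -> 0.0, 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # evenly mixed -> 0.5, 1.0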




Installation of the necessary IDEs:

     We need to choose an IDE that suits our problem statement. For this problem I chose Anaconda for the programming, and for the database connection I used MongoDB. Anaconda is like an app store in which many IDEs are available; it is an environment manager and sets up the install paths as well. For the Python programming I used Spyder: in the Spyder console we can also see the plots, which makes it easy to find mistakes while debugging.

    For the database connection we are using MongoDB, which works both in the cloud and on localhost. Install MongoDB Compass for the localhost option, and go to MongoDB.com and create an account to store the data in the cloud.

    Links for installation are given below:

    
How to import data in the Dataset and implement it in a python program:

    We can create a dataset in two ways: either by importing it into MongoDB, or by creating the fields one by one, where it looks like JSON.

 Step 1: I downloaded the dataset from the Kaggle website as CSV files (comma-separated values). The link is given below for your reference.


Step 2: Go to MongoDB and create a database name and collection name. There is an option to import data from the website; just import the data, but before importing, check whether the data has any null values (any row that is empty or null). Upload it and you can see the data shown below.



Database name: Predict_Income  
Collection name: Income_data

Step 3: If you are connecting through the cloud, connect to the URL that provides the admin user and password. In this article I used the MongoDB cloud with my database accessible to everyone; otherwise, the database is accessed only through localhost.



    Below is the URL for the MongoDB connection; you need to connect through this link from MongoDB Compass.  👇

mongodb+srv://Yamini_admin:password@ranking.opqbg9f.mongodb.net/?retryWrites=true&w=majority 






(https://www.mongodb.com/ to get to know more about MongoDb)

After completing the necessary setup for our database, we import the data into our database in the cloud.




Step 4: First check whether the necessary libraries are available; otherwise, install them from the Anaconda command prompt or the Spyder console:

                                    conda install <library-name>   (for example: conda install pymongo)


How to connect to MongoDB in python programming:

First we need the pymongo library (import pymongo) and a MongoClient (from pymongo import MongoClient) for the connection. As explained above, the database name is Predict_Income and the collection name is Income_data (here a collection holds the records); we assign those names to two variables.


This is how we connect to the database. If MongoDB runs on localhost, we use client = MongoClient("mongodb://localhost:27017"), i.e. the localhost address and port number, to connect to the database locally.

👉 If you are connecting through the cloud, grant access to the IP addresses with which you want to share your database.

👉 If MongoDB is installed correctly and the Python code is fine, but you still get a certificate error when connecting, then use this 👇

client=MongoClient("mongodb+srv://Yamini_admin:oKov2HN7On5CBD4n@ranking.opqbg9f.mongodb.net/?retryWrites=true&w=majority", tls=True, tlsAllowInvalidCertificates=True).

How to Implement a python program to build a Decision Tree with MongoDB connections:

    We chose the dataset Incomedata.csv and imported it into MongoDB; then we import all the necessary libraries. In the dataset we can see the fields given below.


This data is collected from the 1994 census bureau. The dataset consists of attributes like:

👉Age: It consists the data of age of the person ( Integer data type) 

👉 Workclass : Work class consists of the data of the person's work class like Private, Local-gov, self-emp-not-income, Federal-gov, State-gov.

👉 Education: It contains the education acquired by the person, like 11th, HS-grad, Doctorate, Bachelors, 7th or 8th.
👉 Education-num: A numeric code for the education level; for example, '11th' is encoded as 7.

👇 ...and so on for the remaining attributes.


From the dataset we can say that the target column is the Income field, which is the one we are going to predict based on the feature columns (age, capital-gain, capital-loss, and hours-per-week).

Step 1: Import the libraries; we need to import the DecisionTreeClassifier.
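A minimal set of imports for the steps below (a sketch; the original script may import more):

import pandas as pd
from pymongo import MongoClient
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt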


Step 2: Connect to the database from which we get our data. MongoDB returns records in a JSON format, so first we need to convert them into a data frame, as shown below 👇

client = MongoClient("mongodb://localhost:27017")   # or the cloud connection string given above
mydb = client["Predict_Income"]
# ---- the collection name is Income_data; we imported the CSV file into MongoDB ----
Name_Collection = mydb["Income_data"]
Convert_to_Dataframe = pd.DataFrame(list(Name_Collection.find()))
# --- converting the JSON documents from MongoDB into a pandas data frame ---

Step 3: The database records are now converted into a data frame, which has 15 columns; we use 4 of them to predict the target column.
We need to convert the non-integer values to numbers, because we cannot do arithmetic operations on strings.

Check every column of the data: if it has any strings, convert those values to numbers (of your choice) using the map() function, check for null values, and change them back to appropriate data.
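Continuing from the data frame built in Step 2, a sketch of this encoding (the column name and label strings are assumptions based on this post; adjust them to your data):

df = Convert_to_Dataframe.dropna()                            # drop rows with null values
df["Income"] = df["Income"].map({"<=50K": 99, ">50K": 100})   # strings -> the 99/100 codes used in Step 6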


Step 4: Do the step 3 process on every feature column.

DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of shape (n_samples, n_features) holding the training samples, and an array Y of integer values of shape (n_samples,) holding the class labels for the training samples:

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)

After being fitted, the model can then be used to predict the class of samples:
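Continuing the scikit-learn example:

>>> clf.predict([[2., 2.]])
array([1])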


Step 5: Using the DecisionTreeClassifier, we fit the model to X and Y, show the values as a tree model, and train it.
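A sketch of this step, reusing the df from the sketch in Step 3 (the feature column names are assumptions based on the text):

X = df[["age", "capital-gain", "capital-loss", "hours-per-week"]]
Y = df["Income"]

model = DecisionTreeClassifier()
model = model.fit(X, Y)

plt.figure(figsize=(12, 8))
tree.plot_tree(model, feature_names=list(X.columns), filled=True)
plt.show()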



    
Run the program and it will show something like the image below. Make sure that the Plots pane is enabled in Spyder, otherwise you can't see the tree; or run it in Jupyter if you are familiar with that, and you can save the image as well.

👆 Decision tree for only 10 rows

Step 6: Try a prediction to check whether we get the right value. For the Income column I replaced the label for income of at most 50k with 99 and the label for income above 50k with 100.

        print(model.predict([[30, 0, 0, 7]]))
age: 30, capital-gain: 0, capital-loss: 0, hours-per-week: 7

From the above prediction we get 99, which means the person does not earn an income of more than 50 thousand in this scenario.

Program Flow:



 Step 1: testdisplay.py


1. Class Show_DataAnalysis - the main class, which has methods to show the tkinter window, the graph, and the image (a minimal sketch of this structure follows the lists below).

  • Connects to the database using MongoClient.

  • Starts the tkinter window: creates the tkinter object and sets its size.

  • Adds the title and heading label.

  • Creates two buttons, click_DecisionTree and Click_graph, which call the functions Display_DecisionTree and Display_Graph.


2. Function: Display_Graph

  • Gets the data frame from Convert_to_Dataframe.
  • Draws a scatter plot with the x and y labels taken from the data frame fields: x = educational_num, y = education.
  • Uses a canvas to show the graph on the tkinter window.
  • Shows the graph when the Display_Graph button is clicked.



3. Function: Display_DecisionTree:

  • Shows the image of the decision tree obtained earlier; save the image in the same path as the script.
  • Assigns the image to a variable 'd'.
  • Passes the image to a label and packs it with a size.
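A minimal sketch of this structure (widget names follow the bullet points above; the actual testdisplay.py may differ):

import tkinter as tk
import pandas as pd
from pymongo import MongoClient
from matplotlib.figure import Figure
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

class Show_DataAnalysis:
    def __init__(self):
        # Connection to the database by MongoClient
        client = MongoClient("mongodb://localhost:27017")
        collection = client["Predict_Income"]["Income_data"]
        self.Convert_to_Dataframe = pd.DataFrame(list(collection.find()))

        # tkinter window, title, heading label and the two buttons
        self.window = tk.Tk()
        self.window.geometry("800x600")
        self.window.title("Income Prediction")
        tk.Label(self.window, text="Income Classifier - Decision Tree").pack()
        tk.Button(self.window, text="Click_graph", command=self.Display_Graph).pack()
        tk.Button(self.window, text="click_DecisionTree", command=self.Display_DecisionTree).pack()

    def Display_Graph(self):
        # Scatter plot of educational_num vs education, drawn on a tkinter canvas
        df = self.Convert_to_Dataframe
        fig = Figure(figsize=(5, 4))
        ax = fig.add_subplot(111)
        ax.scatter(df["educational_num"], df["education"])
        ax.set_xlabel("educational_num")
        ax.set_ylabel("education")
        FigureCanvasTkAgg(fig, master=self.window).get_tk_widget().pack()

    def Display_DecisionTree(self):
        # Show the decision-tree image saved earlier (same path as this script)
        d = tk.PhotoImage(file="decision_tree.png")
        label = tk.Label(self.window, image=d)
        label.image = d   # keep a reference so the image is not garbage-collected
        label.pack()

if __name__ == "__main__":
    app = Show_DataAnalysis()
    app.window.mainloop()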


How the output looks:

Output-1:



Output-2: When you click on Button-1

This is how the graph looks





Output-3: This is how the final window appears


References:
            https://en.wikipedia.org/wiki/Decision_tree
            https://www.mongodb.com
            https://docs.python.org
            https://docs.spyder-ide.org/current/index.html
            Find out the source code here