Tech Blog @ Yamini.Peketi: Predicting whether a person earns more or less than 50k a year from the Income classifier dataset, using Python and MongoDB.

Introduction:

Visualization makes it easy for people to understand the content it presents. In this blog, I try to predict, with code, whether or not a person makes more than 50k a year.

In this blog, you will learn how to create a decision tree from a sample of the Income classifier dataset and how to create a GUI window in Python with a secure database connection.

Overview:

The following Topics will be discussed in this blog:

1. What is a Decision Tree.

2. Installation guides for the necessary IDE.

3. How to import data in the Dataset and implement it in a python program.

4. How to implement a python program to build a Decision Tree with a MongoDB connection.

5. How the program works.

6. References.

What is a Decision Tree?

Whenever we make a decision, we weigh the probabilities. For example: should I take food to college today or not? I reach a decision based on the circumstances and their probabilities. Drawing those same circumstances and probabilities as a tree gives a better understanding of the situation and helps us choose the right decision.

Decision tree:

A decision tree is a decision-making tool that employs a tree-like model of decisions and their potential outcomes, such as chance event outcomes, resource costs, and utility. It is one way to display an algorithm that contains only conditional control statements.

The most likely course of action to achieve a goal is identified using decision trees, a common machine-learning technique. Decision trees are frequently used in operations research, notably in decision analysis.

How Decision Tree Algorithm works:

The decision tree algorithm helps a machine understand the data and predict a decision from the class labels and the attributes of the data. Nodes, edges, root, leaves, and splitting are the common terms used with decision trees.

There are 3 methods to make a split in the decision tree.

Information Gain & Entropy
Gini index
Gain ratio

Decision trees use various algorithms to decide whether to split a node into two or more sub-nodes. Each split increases the purity of the resulting sub-nodes with respect to the target variable. The decision tree tries splits on all available variables and then selects the split that results in the most homogeneous sub-nodes.

The algorithm begins with the original set as the root node.

On each iteration, it goes through every unused attribute of the set and calculates the entropy (H) and information gain (IG) of that attribute.

Select the attribute which has the smallest Entropy or Largest Information gain.

Split the set by the selected attribute to produce a data subset.

The algorithm goes on to each subset, considering only attributes never selected before.
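The attribute selection step described above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the tiny `rows`/`labels` sample are made up for this example and are not part of the program in this post.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy H of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the whole set minus the weighted entropy of its subsets."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Pick the unused attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical sample: "degree" separates the classes, "full_time" does not.
rows = [
    {"degree": "yes", "full_time": "yes"},
    {"degree": "yes", "full_time": "no"},
    {"degree": "no", "full_time": "yes"},
    {"degree": "no", "full_time": "no"},
]
labels = [">50K", ">50K", "<=50K", "<=50K"]

chosen = best_attribute(rows, labels, ["degree", "full_time"])
print(chosen)
```

Here splitting on "degree" produces two pure subsets, so it has the largest information gain and becomes the first split.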

Attribute Selection Measures :

Attribute selection measures are the basic set of rules to choose the splitting criterion that helps in dividing the dataset in the best way. Information Gain, Gini Index, and Gain Ratio are the most commonly used and popular selection measures.

Gini impurity :

Gini says that if we select two items from a population at random, the probability that they are of the same class is 1 when the population is pure. Gini impurity is one minus that probability, so a pure node has an impurity of 0.

Information Gain :

A less impure node requires less information to describe it, and a more impure node requires more information. Information theory defines a measure of this degree of disorganization in a system, known as entropy.
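As a small illustration of the Gini measure, the sketch below computes the impurity of a list of class labels as one minus the sum of the squared class proportions. The function name `gini_impurity` is chosen for this example only.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the probability that two random draws share the same class."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


pure = gini_impurity(["<=50K", "<=50K", "<=50K"])   # only one class
mixed = gini_impurity(["<=50K", ">50K"])            # evenly split, two classes
print(pure, mixed)
```

A node with a single class scores 0, and an even two-class split scores 0.5, which is the highest value possible with two classes.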

Installation of necessary IDEs:

We need to choose an IDE that suits our problem statement. For this one, I chose Anaconda for programming and MongoDB for the database connections. Anaconda is like a play store that gives you access to many IDEs: it provides the environment and takes care of the install path settings as well. For the Python programming I used Spyder. Spyder shows plots next to the console, which also makes it easier to find mistakes while debugging.

For the database connection we are using MongoDB, which can run either in the cloud or on localhost. Install MongoDB Compass for localhost use, and go to MongoDB.com and create an account to store the data in the cloud.

Links for installation are given below:

How to import data in the Dataset and implement it in a python program:

We can create a dataset in MongoDB in two ways: either by importing a file or by creating the fields one by one, in which case each record looks like JSON.

Step 1: I downloaded the data set from the Kaggle website as a CSV (comma-separated values) file. The link is given below for your reference.

Download the CSV file adult.csv

Step 2: Go to MongoDB and create a database and a collection with the names below. There is an option to import data, so just import the CSV file. Before importing, check whether the data has any null values (rows where a field is empty or null). Once it is uploaded, you can see the data as shown below.
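One way to run the null-value check before importing is a short script like the one below. It is a sketch under a few assumptions: the helper name `count_missing` is made up, the three-row sample stands in for the real file, and the list of markers treats "?" as missing because that is how the adult data set usually marks unknown values.

```python
import csv
import io


def count_missing(csv_file, markers=("", "?", "NA", "null")):
    """Count, per column, the cells that are empty or hold a missing marker."""
    counts = {}
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        for column, value in row.items():
            if column is None:          # extra cells beyond the header
                continue
            if value is None or value.strip() in markers:
                counts[column] = counts.get(column, 0) + 1
    return counts


# Tiny in-memory sample instead of the real file
sample = io.StringIO(
    "age,workclass,income\n"
    "39,State-gov,<=50K\n"
    "50,?,>50K\n"
    "38,,<=50K\n"
)
missing = count_missing(sample)
print(missing)
```

For the real file, open it with `open("adult.csv", newline="")` and pass the file object to `count_missing`.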

Database name: Predict_Income

Collection name: Income_data

Step 3: If you are connecting through the cloud, connect with the URL that carries the admin user name and password. In this article I used the MongoDB cloud, so my database can be accessed by everyone; otherwise, the database would be accessible only through localhost.

Below is the URL for the MongoDB connection; use it to connect from MongoDB Compass. 👇

mongodb+srv://Yamini_admin:password@ranking.opqbg9f.mongodb.net/?retryWrites=true&w=majority

(https://www.mongodb.com/ to get to know more about MongoDb)

With the database set up and the data imported into the cloud database, we can move on to the Python side.

Step 4: First check whether the necessary libraries are installed. If not, install them from the Anaconda command prompt or the Spyder console like this:

conda install libraryname (for example: conda install pymongo)

How to connect to MongoDB in python programming:

First we need the PyMongo library (import pymongo) and a MongoClient (from pymongo import MongoClient) for the connection. As explained above, the database name is Predict_Income and the collection name is Income_data (a collection here is the set of records); we assign those names to two variables.

This is how we connect to the database. If we are using MongoDB on localhost, we need to pass the localhost address and port number, client = MongoClient("mongodb://localhost:27017"), to connect to the database locally.

👉 If you are connecting through the cloud, give access to the IP addresses that you want to share your database with.

👉 If MongoDB is installed correctly and the Python code has no mistakes, but you still get a certificate error when connecting, use this 👇 (note that it turns off certificate validation, so it is suitable only for testing)

client = MongoClient("mongodb+srv://Yamini_admin:password@ranking.opqbg9f.mongodb.net/?retryWrites=true&w=majority", tls=True, tlsAllowInvalidCertificates=True)
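The connection steps above can be put together roughly as follows. This is a sketch, not the exact code of the program: the helpers `build_uri` and `get_collection` and the sample password are invented for illustration. Escaping the user name and password with `quote_plus` is what the PyMongo documentation recommends when they contain special characters, and `mongodb+srv` URLs typically also need the dnspython package installed.

```python
from urllib.parse import quote_plus


def build_uri(user, password, host):
    """Build a cloud connection string with escaped credentials."""
    return (f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}/?retryWrites=true&w=majority")


def get_collection(uri, db_name="Predict_Income", collection_name="Income_data"):
    """Return the collection object; no query is sent until it is first used."""
    from pymongo import MongoClient  # imported here so build_uri works without pymongo
    client = MongoClient(uri)
    return client[db_name][collection_name]


# Hypothetical password containing characters that must be escaped
uri = build_uri("Yamini_admin", "p@ss word", "ranking.opqbg9f.mongodb.net")
print(uri)
```

For a local database, call `get_collection("mongodb://localhost:27017")` instead of building a cloud URL.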

How to Implement a python program to build a Decision Tree with MongoDB connections:

We have chosen the dataset Incomedata.csv and imported it into MongoDB; next we import all the necessary libraries. The data set contains the fields described below.

This data was collected by the Census Bureau in 1994. The data set consists of attributes like

👉 Age: The age of the person (integer data type).

👉 Workclass: The person's work class, such as Private, Local-gov, Self-emp-not-inc, Federal-gov, State-gov.

👉 Education: The education acquired by the person, such as 11th, HS-grad, Doctorate, Bachelors, 7th-8th.

👉 Education-num: A numeric code for the education level reached; for example, a person who studied till 11th has the value 7.

👇....

Find out more about the data set click here

From the data set we can see that the target column is the Income field, which is the one we are going to predict based on the feature columns (age, capital-gain, capital-loss, and hours-per-week).

Step 1: Import the libraries; in particular, we need to import the Decision Tree classifier.

Step 2: Connect to the database that holds our data. Records in the database are in a JSON format, so we first need to convert them into a data frame as shown below 👇

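import pandas as pd
from pymongo import MongoClient

# The connection string is left empty here; use your own URL
client = MongoClient("")
mydb = client["Predict_Income"]

# Income_data is the collection that holds the rows imported from the CSV file
Name_Collection = mydb["Income_data"]

# find() returns JSON-like documents; convert them into a data frame
# (this line sits inside a class method, hence the self. prefix)
self.Convert_to_Dataframe = pd.DataFrame(list(Name_Collection.find()))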

Step 3: The data from the database is now a data frame with 15 columns; we are using 4 of them to predict the target column.

We need to convert the non-numeric values to numbers, because we cannot do arithmetic operations on strings.

Check every column of the data: if it contains strings, convert those values to numbers (of the user's choice) by using the map() function. Also check for null values and replace them with appropriate data.

Step 4: Repeat the step 3 process on every feature column.
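The conversion in steps 3 and 4 could look like the sketch below, which uses pandas `Series.map` with a dictionary. The three-row data frame and the workclass codes are invented for illustration; only the 99/100 encoding of the income column follows the values used later in this post.

```python
import pandas as pd

# Small stand-in for the data frame loaded from MongoDB
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "?"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# The numbers are the user's choice
income_codes = {"<=50K": 99, ">50K": 100}
workclass_codes = {"Private": 0, "State-gov": 1}

df["income"] = df["income"].map(income_codes)

# Values that are not in the dictionary (here "?") become NaN after map(),
# so they are replaced with -1 before converting the column to integers
df["workclass"] = df["workclass"].map(workclass_codes).fillna(-1).astype(int)

print(df)
```

After this, every column holds numbers and the data frame can be passed to the classifier.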

DecisionTreeClassifier takes two arrays as input: an array X, sparse or dense, of shape (n_samples, n_features), holding the feature values of the training samples, and an array Y of integer values, of shape (n_samples,), holding the class labels for the training samples:

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)

After being fitted, the model can then be used to predict the class of samples:

>>> clf.predict([[2., 2.]])
array([1])

Step 5: Using the DecisionTreeClassifier, we fit the model to X and Y to train it, and then show the result as a tree.

Run the program and it will show a tree like the one below. Make sure that the plot pane is on in Spyder, otherwise you can't see the tree; alternatively, run it in Jupyter if you are familiar with that. You can save the image as well.

👆 Decision tree for only 10 rows

Step 6: Try a prediction and check whether we get the right value. In the income column I replaced the string <=50K with 99 and >50K with 100.

print(model.predict([[30, 0, 0, 7]]))

age: 30, capital-gain: 0, capital-loss: 0, hours-per-week: 7

The prediction above is 99, which means that a person cannot earn an income of more than 50 thousand in this scenario.
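To try steps 5 and 6 without the database, the sketch below trains the classifier on six made-up rows with the same four feature columns and the 99/100 labels. The rows are invented for illustration, so the resulting tree is not the one shown in the image above. `export_text` prints the tree as text, which is handy when no plot pane is available.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["age", "capital-gain", "capital-loss", "hours-per-week"]

# Invented training rows: 99 stands for <=50K, 100 for >50K
X = [
    [25, 0, 0, 20],
    [30, 0, 0, 10],
    [38, 0, 0, 40],
    [45, 5000, 0, 50],
    [50, 7000, 0, 60],
    [52, 15000, 0, 45],
]
y = [99, 99, 99, 100, 100, 100]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Text rendering of the fitted tree
print(export_text(model, feature_names=feature_names))

# Same query as in the post: age 30, no capital gain or loss, 7 hours per week
prediction = model.predict([[30, 0, 0, 7]])
print(prediction[0])
```

With these rows the query falls on the 99 side of the tree, matching the outcome described above.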

Program Flow:

Step1: testdisplay.py

1. Class Show_DataAnalysis - the main class, which has methods to show the tkinter window, the graph, and the image.

Connects to the database with MongoClient
Starts the tkinter window: creates the tkinter object and sets its size
Adds the title and heading label
Creates two buttons, click_DecisionTree and Cick_graph; these buttons call the
functions Display_DecisionTree, Display_Graph.

2. Function: Display_Graph

Gets the data frame from Convert_to_Dataframe.
Draws a scatter plot with x and y labels taken from the data frame fields
x-educational_num
y-education
Uses a canvas to show the graph on the tkinter window
Shows the graph upon clicking the button for Display_Graph

3. Function: Display_DecisionTree:

Shows the decision tree image we generated earlier; the image is saved in the same path as the program
Assigns the image to the variable 'd'.
Passes the image to the label and packs it with a size.
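The program flow above might be laid out roughly like this. It is a simplified sketch rather than the actual testdisplay.py: the window size, button texts, the image file name decision_tree.png, and the column name educational-num are assumptions, and the data frame is passed in instead of being read from MongoDB inside the class.

```python
class Show_DataAnalysis:
    """Sketch of the main class: a tkinter window with two buttons."""

    def __init__(self, dataframe, image_path="decision_tree.png"):
        # Imported here so that the class can be defined on a machine
        # without a display; creating an instance is what needs one
        import tkinter as tk
        self.tk = tk
        self.Convert_to_Dataframe = dataframe
        self.image_path = image_path

        self.window = tk.Tk()
        self.window.title("Income Prediction")
        self.window.geometry("900x700")
        tk.Label(self.window, text="Income Classifier").pack()
        tk.Button(self.window, text="Decision Tree",
                  command=self.Display_DecisionTree).pack()
        tk.Button(self.window, text="Graph",
                  command=self.Display_Graph).pack()
        self.image_label = tk.Label(self.window)
        self.image_label.pack()

    def Display_DecisionTree(self):
        # Load the saved tree image and pass it to the label
        d = self.tk.PhotoImage(file=self.image_path)
        self.image_label.configure(image=d)
        self.image_label.image = d  # keep a reference so it is not garbage collected

    def Display_Graph(self):
        from matplotlib.figure import Figure
        from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

        df = self.Convert_to_Dataframe
        fig = Figure(figsize=(6, 4))
        ax = fig.add_subplot(111)
        ax.scatter(df["educational-num"], df["education"])
        ax.set_xlabel("educational-num")
        ax.set_ylabel("education")

        # Embed the matplotlib figure in the tkinter window through a canvas
        canvas = FigureCanvasTkAgg(fig, master=self.window)
        canvas.draw()
        canvas.get_tk_widget().pack()
```

To start the window, create an instance with a data frame, for example `app = Show_DataAnalysis(df)`, and then call `app.window.mainloop()`.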