Machine learning project in python to predict loan approval (Part 6 of 6)

We have the dataset with the loan applicants data and whether the application was approved or not. In this tutorial we will build a machine learning model to predict the loan approval probabilty. This would be last project in this course.

Steps involved in this machine learning project:

Following are the steps involved in creating a well-defined ML project:

Understand and define the problem
Analyse and prepare the data
Apply the algorithms
Reduce the errors
Predict the result

Our Third Project : Predict if the loan application will get approved

We have the loan application information like the applicant's name, personal details, financial information and requested loan amount and related details and the outcome (whether the application was approved or rejected). Based on this we are going to train a model and predict if a loan will get approved or not.

Here's our data set:

Loading our Loan-applications-information dataset:

Launch Anaconda navigator and open the terminal
Type the below command to start the python environment

python

Lets make sure the python environment is up and running. Copy paste the below command in the terminal to check if its working properly

print("Hello World")

Lets load the required libraries for our analysis

#Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_graphviz

Lets load the loan applications training data set and assign it to a variable called "iris". We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization

#Load dataset
url = "https://raw.githubusercontent.com/callxpert/datasets/master/Loan-applicant-details.csv"
names = ['Loan_ID','Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History','Property_Area','Loan_Status']
dataset = pd.read_csv(url, names=names)

Lets take a peek at the data

print(dataset.head(20))

sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. This can be done using the following code:

from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    dataset[i] = le.fit_transform(dataset[i])

Splitting the Data set

As we have seen already, In Machine learning we have two kinds of datasets

Training dataset - used to train our model
Testing dataset - used to test if our model is making accurate predictions

Our dataset has 480 records. We are going to use 80% of it for training the model and 20% of the records to evaluate our model. copy paste the below commands to prepare our data sets

Though our dataset has lot of columns, we are only going to use the Income fields, loan amount, loan duration and credit history fields to train our model.

array = dataset.values
X = array[:,6:11]
Y = array[:,12]
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)

Evaluating the model and training the Model

We are going to apply the below four algorithms to this problem and evaluate its effectiveness. And finally choose the best algorithm and train it.

Logistic Regression : Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables

model = LogisticRegression()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Result: 0.7708333333333334

Decision tree : Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables

model = DecisionTreeClassifier()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Result: 0.6458333333333334

Random forest : Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees

model = RandomForestClassifier(n_estimators=100)
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Result: 0.7604166666666666

So, Regression algorithm works fine for our use case. We can change / play around with the variables that we used for training the model till we get better accuracy.

Summary

We built an end-to-end project and tested different algorithms in this tutorial. This concludes this mini course on machine learning. Hope the course gave you a good primer to the machine learning concepts and boosted your overall confidence with machine learning.

Next steps:

Continue working on more projects and build your portfolio
Learn more about various machine learning algorithms
Understand how the algorithms work behind the scenes and how we can fine tune it
Sign up and share your knowledge with others in this copycoding community

nVector

posted on 06 Sep 18

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds