Random Forest
Random Forest: Machine Learning Classification and Regression Techniques

Random Forest

CART (Classification and Regression Trees)

An Introduction and Motivation for choosing CART:

Two of the very basic ideas behind any algorithm used for predictive analysis are:

  1. prediction on the basis of regression, and
  2. prediction on the basis of classification.

Both are important concepts in data science; having said that, there are several differences between them.
In regression, the predicted outcome is a numeric variable, and a continuous one at that.
In a classification task, the predicted outcome is not numeric; it represents categorical classes or factors, i.e. the outcome variable takes a limited number of values, which may be binary (dichotomous) or multinomial (more than two classes).
In this analysis we are motivated to work only on the classification side of predictive tasks, keeping our focus on classification trees rather than regression trees, even though the name CART stands for 'Classification and Regression Trees'.
Coming to classification trees, a good deal of decision making is involved in such tasks: decisions have to be taken about how the outcome variable is classified when we predict.
These are not human decisions. As analysts we merely set the parameters of the decision criteria; after that, the entire decision-making process during classification is carried out by machine learning algorithms such as decision trees and random forests.
The 'tree' here is analogous to the tree we know from game theory: a combination of decision nodes and sub-nodes linked to one another in a logical order. Based on the result at each node, the algorithm applies the node's criterion, decides which rows of the data set satisfy the condition generated there, and splits those rows accordingly.
This whole procedure is self-learnt by the machine (the computer software here). The machine learns from its decisions on each row of the training sample and later applies that knowledge to the testing sample, giving us an idea of how accurate its learning is.
Together these concepts give us the name 'decision tree learning': decision tree models that the machine uses on its own to segregate and classify cases.
Sometimes, however, a single decision tree classifier is not enough for a good classification. In most such cases we need another machine learning algorithm, the random forest, which is an ensemble of many decision trees combined to obtain better predictive performance than any single constituent tree could achieve alone.
So our focus will be primarily on the random forest classifier and its comparison with a single decision tree.
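To make the ensemble idea concrete, here is a minimal, illustrative sketch (not the implementation used later, which relies on the randomForest package) of bagging decision trees with majority voting in R, assuming the rpart package is available:

library(rpart)

# Conceptual sketch only: grow several trees on bootstrap resamples of the
# data and combine their predictions by majority vote. A real random forest
# additionally samples a random subset of variables at each split.
bagged_trees <- function(formula, data, n_trees = 25) {
  lapply(seq_len(n_trees), function(i) {
    boot_rows <- sample(nrow(data), replace = TRUE)   # bootstrap resample
    rpart(formula, data = data[boot_rows, ], method = "class")
  })
}

predict_majority <- function(trees, newdata) {
  # Collect each tree's predicted class and take the most frequent one per row.
  votes <- sapply(trees, function(tr) as.character(predict(tr, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}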

Objective of our Coding problem and the relevant Methodology:

We need to build a classifier that predicts the "Salary" attribute. We may apply different data pre-processing steps and transformations (e.g. grouping attribute values, converting them to binary, etc.), but we will explain why we have chosen each of them.
We will split the data set into training and testing sets so that we can set the parameters properly and evaluate the quality of the classifier.
We could use any tool, such as classifiers in R, Weka, Python, Orange, scikit-learn or other software. Whichever classifier we use, we need to explain it and make sure we are producing valid results.
Here, however, we will limit ourselves to the most common classifiers under the broad umbrella of CART, viz. decision trees and random forests.
We will use R as our software tool and write out the entire analysis in R Markdown format.

Data Set:

We have a single data set, named 'Train', which we split in a 7:3 ratio into a training set and an in-sample testing set respectively. As usual, we build the model in R on the training set and test it on the testing set.
The data set ‘Train’ is available here: –
[train.csv: https://drive.google.com/file/d/0B7cB2Fb0wrTEMEdiU2FVcXdXZkE/view?usp=sharing]

Data Description:

The data set consists of the following variables: –
ID
Age
Employment class
Fnlwgt
Education level
Education years
Marital status
Occupation
Relationship status
Race
Sex
Capital gain
Capital loss
Work hours per week
Native country
Salary
Some variables are categorical while some are numeric.

Coding Demonstration:

We now start the modelling using decision trees and random forests.

Loading the libraries

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
##     format.pval, round.POSIXt, trunc.POSIXt, units
library(caret)
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
##     cluster
library(caret)
library(kernlab)
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
##     alpha
library(pROC)
## Type 'citation("pROC")' for a citation.
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var
library(plotly)
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
##     subplot
## The following object is masked from 'package:ggplot2':
##
##     last_plot
## The following object is masked from 'package:stats':
##
##     filter
## The following object is masked from 'package:graphics':
##
##     layout
library(rpart.plot)
## Loading required package: rpart

Getting the data: –

setwd("C://Users//arkad//Documents//STEP UP ANALYTICS//Blog Project//Random Forests")

The working directory is set to the folder on my system that contains the data file.

df <- read.csv('train.csv')

Reading the structure of the data

str(df)
## 'data.frame':    38000 obs. of  16 variables:
##  $ ID                 : int  40947 17139 29557 10344 33206 834 12148 47271 37594 40982 ...
##  $ Age                : int  50 27 30 40 52 58 37 33 41 34 ...
##  $ Employment.class   : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 5 5 6 5 5 5 2 ...
##  $ Fnlwgt             : int  104729 138705 144593 181015 110563 349910 218490 222205 315834 67083 ...
##  $ Education.level    : Factor w/ 16 levels " 10th"," 11th",..: 12 12 12 12 16 15 13 12 10 10 ...
##  $ Education.years    : int  9 9 9 9 10 15 14 9 13 13 ...
##  $ Marital.status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 1 3 5 6 3 3 3 3 5 5 ...
##  $ Occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 8 4 9 9 5 11 5 4 11 5 ...
##  $ Relationship.status: Factor w/ 6 levels " Husband"," Not-in-family",..: 3 1 2 5 1 1 1 6 2 5 ...
##  $ Race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 2 ...
##  $ Sex                : Factor w/ 2 levels " Female"," Male": 1 2 2 1 2 2 2 1 1 2 ...
##  $ Capital.gain       : int  0 0 0 0 0 0 0 0 0 1471 ...
##  $ Capital.loss       : int  0 0 0 0 0 1977 0 0 1876 0 ...
##  $ Work.hours.per.week: int  48 53 40 47 40 50 55 40 40 40 ...
##  $ Native.country     : Factor w/ 42 levels " ?"," Cambodia",..: 40 40 1 40 40 40 9 40 40 2 ...
##  $ Salary             : int  0 0 0 0 1 1 1 1 0 0 ...

Reading the first few rows

head(df)
##      ID Age Employment.class Fnlwgt Education.level Education.years
## 1 40947  50          Private 104729         HS-grad               9
## 2 17139  27          Private 138705         HS-grad               9
## 3 29557  30          Private 144593         HS-grad               9
## 4 10344  40          Private 181015         HS-grad               9
## 5 33206  52          Private 110563    Some-college              10
## 6   834  58     Self-emp-inc 349910     Prof-school              15
##        Marital.status         Occupation Relationship.status   Race
## 1            Divorced  Machine-op-inspct      Other-relative  White
## 2  Married-civ-spouse       Craft-repair             Husband  White
## 3       Never-married      Other-service       Not-in-family  Black
## 4           Separated      Other-service           Unmarried  White
## 5  Married-civ-spouse    Exec-managerial             Husband  White
## 6  Married-civ-spouse     Prof-specialty             Husband  White
##       Sex Capital.gain Capital.loss Work.hours.per.week Native.country
## 1  Female            0            0                  48  United-States
## 2    Male            0            0                  53  United-States
## 3    Male            0            0                  40              ?
## 4  Female            0            0                  47  United-States
## 5    Male            0            0                  40  United-States
## 6    Male            0         1977                  50  United-States
##   Salary
## 1      0
## 2      0
## 3      0
## 4      0
## 5      1
## 6      1

It seems from looking at the initial rows of the Data Frame that the variable ‘ID’ is not useful for any analysis.
The variable ‘Fnlwgt’ is a kind of final sampling weight.

describe(df)
## df
##
##  16  Variables      38000  Observations
## ---------------------------------------------------------------------------
## ID
##        n  missing distinct     Info     Mean      Gmd      .05      .10
##    38000        0    38000        1    23988    15982     2388     4794
##      .25      .50      .75      .90      .95
##    12013    24004    35947    43177    45591
##
## lowest :     1     2     3     4     5, highest: 47996 47997 47998 47999 48000
## ---------------------------------------------------------------------------
## Age
##        n  missing distinct     Info     Mean      Gmd      .05      .10
##    38000        0       71    0.999    38.53    15.39       19       22
##      .25      .50      .75      .90      .95
##       28       37       47       58       63
##
## lowest : 17 18 19 20 21, highest: 83 84 85 88 90
## ---------------------------------------------------------------------------
## Employment.class
##        n  missing distinct
##    38000        0        9
##
## ? (2196, 0.058), Federal-gov (1147, 0.030), Local-gov (2498, 0.066),
## Never-worked (8, 0.000), Private (26392, 0.695), Self-emp-inc (1285,
## 0.034), Self-emp-not-inc (3023, 0.080), State-gov (1443, 0.038),
## Without-pay (8, 0.000)
## ---------------------------------------------------------------------------
## Fnlwgt
##        n  missing distinct     Info     Mean      Gmd      .05      .10
##    38000        0    16399        1   189386   111893    38948    65460
##      .25      .50      .75      .90      .95
##   117746   178370   236328   326886   378460
##
## lowest :   12285   13769   14878   18827   19214
## highest: 1097453 1125613 1161363 1226583 1455435
## ---------------------------------------------------------------------------
## Education.level
##        n  missing distinct
##    38000        0       16
##
## 10th (1125, 0.030), 11th (1316, 0.035), 12th (501, 0.013), 1st-4th (198,
## 0.005), 5th-6th (381, 0.010), 7th-8th (740, 0.019), 9th (594, 0.016),
## Assoc-acdm (1240, 0.033), Assoc-voc (1652, 0.043), Bachelors (6188,
## 0.163), Doctorate (491, 0.013), HS-grad (12303, 0.324), Masters (1911,
## 0.050), Preschool (52, 0.001), Prof-school (650, 0.017), Some-college
## (8658, 0.228)
## ---------------------------------------------------------------------------
## Education.years
##        n  missing distinct     Info     Mean      Gmd      .05      .10
##    38000        0       16     0.95    10.07    2.725        5        7
##      .25      .50      .75      .90      .95
##        9       10       12       13       14
##
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency     52   198   381   740   594  1125  1316   501 12303  8658
## Proportion 0.001 0.005 0.010 0.019 0.016 0.030 0.035 0.013 0.324 0.228
##
## Value         11    12    13    14    15    16
## Frequency   1652  1240  6188  1911   650   491
## Proportion 0.043 0.033 0.163 0.050 0.017 0.013
## ---------------------------------------------------------------------------
## Marital.status
##        n  missing distinct
##    38000        0        7
##
## Divorced (5287, 0.139), Married-AF-spouse (34, 0.001), Married-civ-spouse
## (17350, 0.457), Married-spouse-absent (485, 0.013), Never-married (12514,
## 0.329), Separated (1218, 0.032), Widowed (1112, 0.029)
## ---------------------------------------------------------------------------
## Occupation
##        n  missing distinct
##    38000        0       15
##
## ? (2204, 0.058), Adm-clerical (4337, 0.114), Armed-Forces (13, 0.000),
## Craft-repair (4897, 0.129), Exec-managerial (4690, 0.123), Farming-fishing
## (1174, 0.031), Handlers-cleaners (1617, 0.043), Machine-op-inspct (2382,
## 0.063), Other-service (3852, 0.101), Priv-house-serv (203, 0.005),
## Prof-specialty (4812, 0.127), Protective-serv (724, 0.019), Sales (4182,
## 0.110), Tech-support (1113, 0.029), Transport-moving (1800, 0.047)
## ---------------------------------------------------------------------------
## Relationship.status
##        n  missing distinct
##    38000        0        6
##
## Value             Husband  Not-in-family Other-relative      Own-child
## Frequency           15268           9775           1191           5800
## Proportion          0.402          0.257          0.031          0.153
##
## Value           Unmarried           Wife
## Frequency            4136           1830
## Proportion          0.109          0.048
## ---------------------------------------------------------------------------
## Race
##        n  missing distinct
##    38000        0        5
##
## Value      Amer-Indian-Eskimo Asian-Pac-Islander              Black
## Frequency                 386               1226               3624
## Proportion              0.010              0.032              0.095
##
## Value                   Other              White
## Frequency                 309              32455
## Proportion              0.008              0.854
## ---------------------------------------------------------------------------
## Sex
##        n  missing distinct
##    38000        0        2
##
## Value      Female   Male
## Frequency   12622  25378
## Proportion  0.332  0.668
## ---------------------------------------------------------------------------
## Capital.gain
##        n  missing distinct     Info     Mean      Gmd      .05      .10
##    38000        0      114    0.229     1098     2122        0        0
##      .25      .50      .75      .90      .95
##        0        0        0        0     5013
##
## lowest :     0   114   401   594   914, highest: 25236 27828 34095 41310 99999
## ---------------------------------------------------------------------------
## Capital.loss
##        n  missing distinct     Info     Mean      Gmd      .05      .10
##    38000        0       89    0.134     87.8    168.2        0        0
##      .25      .50      .75      .90      .95
##        0        0        0        0        0
##
## lowest :    0  155  213  323  419, highest: 3004 3683 3770 3900 4356
## ---------------------------------------------------------------------------
## Work.hours.per.week
##        n  missing distinct     Info     Mean      Gmd      .05      .10
##    38000        0       90    0.896    40.48    12.27       18       25
##      .25      .50      .75      .90      .95
##       40       40       45       55       60
##
## lowest :  1  2  3  4  5, highest: 95 96 97 98 99
## ---------------------------------------------------------------------------
## Native.country
##        n  missing distinct
##    38000        0       42
##
## lowest :  ?                Cambodia         Canada           China            Columbia
## highest:  Thailand         Trinadad&Tobago  United-States    Vietnam          Yugoslavia
## ---------------------------------------------------------------------------
## Salary
##        n  missing distinct     Info      Sum     Mean      Gmd
##    38000        0        2    0.546     9087   0.2391   0.3639
##
## ---------------------------------------------------------------------------

Getting some plots done to understand the data visually: –

hist(df$Age[df$Salary==0], col=rgb(1,0,0,0.60),breaks = 30,
     main= 'Overlapping Histogram', xlab='Age')
hist(df$Age[df$Salary==1],col=rgb(0,0,1,0.60), breaks = 30, add=T)
legend("topright", c("Age | Salary = 0", "Age | Salary = 1"), fill=c(rgb(1,0,0,0.60), rgb(0,0,1,0.60)))
box()


From the graph it is pretty clear that the non-salaried people are concentrated towards the younger age groups,
while the salaried people are concentrated around the middle age groups.

hist(df$Fnlwgt[df$Salary==0], col=rgb(1,0,0,0.60),breaks =30,
     main= 'Overlapping Histogram', xlab='Fnlwgt')
hist(df$Fnlwgt[df$Salary==1],col=rgb(0,0,1,0.60), breaks =30, add=T)
legend("topright", c("Fnlwgt | Salary = 0", "Fnlwgt | Salary = 1"), fill=c(rgb(1,0,0,0.60), rgb(0,0,1,0.60)))
box()


We see that both the salaried and the non-salaried people are concentrated around Fnlwgt values of 150,000 to 200,000,
so there is no stark difference between the two groups.
Moreover, Fnlwgt is just the final sampling weight assigned to each record; since it makes no sense to use it as a predictor, we will drop it later.

hist(df$Education.years[df$Salary==0], col=rgb(1,0,0,0.60),breaks =30,
     main= 'Overlapping Histogram', xlab='Education Years')
hist(df$Education.years[df$Salary==1],col=rgb(0,0,1,0.60), breaks =30, add=T)
legend("topright", c("Education Years | Salary = 0", "Education Years | Salary = 1"), fill=c(rgb(1,0,0,0.60), rgb(0,0,1,0.60)))
box()


We see that both the Non-Salaried and the Salaried groups of people are present across all values of Education years.
This signifies that Years of Education does not have that much impact on the distribution of Salaried and Non-Salaried people.

hist(df$Work.hours.per.week[df$Salary==0], col=rgb(1,0,0,0.60),breaks =30,
     main= 'Overlapping Histogram', xlab='Work Hours Per Week')
hist(df$Work.hours.per.week[df$Salary==1],col=rgb(0,0,1,0.60), breaks =30, add=T)
legend("topright", c("Work hours per week | Salary = 0", "Work hours per week | Salary = 1"), fill=c(rgb(1,0,0,0.60), rgb(0,0,1,0.60)))
box()


The largest number of non-salaried people work about 40 hours per week,
and the same is true of the salaried people.
However, people working fewer than 40 hours per week are mostly non-salaried,
because the number of salaried workers working fewer than 40 hours per week is much smaller.

hist(df$Capital.gain[df$Salary==0], col=rgb(1,0,0,0.60),breaks =15, main= 'Overlapping Histogram', xlab='Capital gain')
hist(df$Capital.gain[df$Salary==1],col=rgb(0,0,1,0.60), breaks =60, add=T)
legend("topright", c('Capital gain | Salary = 0', 'Capital gain | Salary = 1'), fill=c(rgb(1,0,0,0.60), rgb(0,0,1,0.60)))
box()


Capital gain is concentrated at 0 for the non-salaried people.
For the salaried people the frequency is also highest at 0 and gradually decreases as capital gain increases.
Capital gain is heavily skewed, so it is better to bin it, which we do later.

hist(df$Capital.loss[df$Salary==0], col=rgb(1,0,0,0.60),breaks =30, main= 'Overlapping Histogram', xlab='Capital loss')
hist(df$Capital.loss[df$Salary==1],col=rgb(0,0,1,0.60), breaks =15, add=T)
legend("topright", c('Capital loss | Salary = 0', 'Capital loss | Salary = 1'), fill=c(rgb(1,0,0,0.60), rgb(0,0,1,0.60)))
box()


Capital loss for non-salaried people is concentrated at a value of 0.
Capital loss for salaried people is also highest at 0.
But the number of non-salaried people at a capital loss of zero is much higher than the number of salaried people.
Capital loss is also heavily skewed, so it is better to bin it as well.

empl0 <- table(df$Employment.class[df$Salary == 0])
empl1 <- table(df$Employment.class[df$Salary == 1])
freq_empl <- data.frame(empl0, empl1)
p_empl <- plot_ly(freq_empl, x = ~Var1, y = ~Freq, type = 'bar', name = 'Salary = 0') %>%
  add_trace(y = ~Freq.1, name = 'Salary = 1') %>%
  layout(xaxis = list(title = NULL), yaxis = list(title = 'Frequency'), barmode = 'group')
p_empl
 

 
The plot above compares the employment classes. Non-salaried people are also far more numerous at every education level except Doctorate, Masters and Prof-school; a sketch of the corresponding education-level plot is given below.
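A hypothetical education-level plot, following the same plot_ly pattern used for the other categorical variables (the original code does not include it):

educ0 <- table(df$Education.level[df$Salary == 0])
educ1 <- table(df$Education.level[df$Salary == 1])
freq_educ <- data.frame(educ0, educ1)
p_educ <- plot_ly(freq_educ, x = ~Var1, y = ~Freq, type = 'bar', name = 'Salary = 0') %>%
  add_trace(y = ~Freq.1, name = 'Salary = 1') %>%
  layout(xaxis = list(title = NULL), yaxis = list(title = 'Frequency'), barmode = 'group')
p_educ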

marri0 <- table(df$Marital.status[df$Salary == 0])
marri1 <- table(df$Marital.status[df$Salary == 1])
freq_marri <- data.frame(marri0, marri1)
p_marri <- plot_ly(freq_marri, x = ~Var1, y = ~Freq, type = 'bar', name = 'Salary = 0') %>%
  add_trace(y = ~Freq.1, name = 'Salary = 1') %>%
  layout(xaxis = list(title = NULL), yaxis = list(title = 'Frequency'), barmode = 'group')
p_marri
 

 
The number of non-salaried people is always higher across all marital statuses.

occu0 <- table(df$Occupation[df$Salary == 0])
occu1 <- table(df$Occupation[df$Salary == 1])
freq_occu <- data.frame(occu0, occu1)
p_occu <- plot_ly(freq_occu, x = ~Var1, y = ~Freq, type = 'bar', name = 'Salary = 0') %>%
  add_trace(y = ~Freq.1, name = 'Salary = 1') %>%
  layout(xaxis = list(title = NA), yaxis = list(title = 'Frequency'), barmode = 'group')
p_occu
 

 
Non-salaried people outnumber salaried people across all occupations.

Relan0 <- table(df$Relationship.status[df$Salary == 0])
Relan1 <- table(df$Relationship.status[df$Salary == 1])
freq_Relan <- data.frame(Relan0, Relan1)
p_Relan <- plot_ly(freq_Relan, x = ~Var1, y = ~Freq, type = 'bar', name = 'Salary = 0') %>%
  add_trace(y = ~Freq.1, name = 'Salary = 1') %>%
  layout(xaxis = list(title = NA), yaxis = list(title = 'Frequency'), barmode = 'group')
p_Relan
 

 
The number of non-salaried people is always greater across all relationship statuses.

race0 <- table(df$Race[df$Salary == 0])
race1 <- table(df$Race[df$Salary == 1])
freq_race <- data.frame(race0, race1)
p_race <- plot_ly(freq_race, x = ~Var1, y = ~Freq, type = 'bar', name = 'Salary = 0') %>%
  add_trace(y = ~Freq.1, name = 'Salary = 1') %>%
  layout(xaxis = list(title = NA), yaxis = list(title = 'Frequency'), barmode = 'group')
p_race
 

 
Non-salaried people are most numerous among Whites, and their count exceeds that of salaried people across all races.

sex0 <- table(df$Sex[df$Salary == 0])
sex1 <- table(df$Sex[df$Salary == 1])
freq_sex <- data.frame(sex0, sex1)
p_sex <- plot_ly(freq_sex, x = ~Var1, y = ~Freq, type = 'bar', name = 'Salary = 0') %>%
  add_trace(y = ~Freq.1, name = 'Salary = 1') %>%
  layout(xaxis = list(title = NA), yaxis = list(title = 'Frequency'), barmode = 'group')
p_sex
 

Interestingly, although non-salaried people are more numerous for both males and females, the disparity is smaller for males than for females.
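For a quick numeric check of this disparity, one could tabulate the salary rate by sex with base R (a small sketch, not part of the original code):

# Proportion of Salary = 0 and Salary = 1 within each sex (rows sum to 1).
round(prop.table(table(df$Sex, df$Salary), margin = 1), 3)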
Now we will draw several bivariate plots.

p_age_educ <- plot_ly(data = df, x = ~Age, y = ~Education.years,
        marker = list(size = 5,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'rgba(152, 0, 0, .8)',
                                   width = 2))) %>%
  layout(title = 'Styled Scatter',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p_age_educ
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
 

 
We see no definite pattern between age and years of education here.

p_age_work_hours <- plot_ly(data = df, x = ~Age, y = ~Work.hours.per.week,
        marker = list(size = 5,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'rgba(152, 0, 0, .8)',
                                   width = 2))) %>%
  layout(title = 'Styled Scatter',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p_age_work_hours
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
 

 
To see the trend between work hours per week and age, we draw this scatter plot.
Again, there is no definite pattern here.

p_educ_work_hours <- plot_ly(data = df, x = ~Education.years, y = ~Work.hours.per.week,
        marker = list(size = 5,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'rgba(152, 0, 0, .8)',
                                   width = 2))) %>%
  layout(title = 'Styled Scatter',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p_educ_work_hours
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
 

 
To see the trend between work hours per week and years of education, we draw this scatter plot. Once again, there is no definite pattern here.

Setting up the Data

str(df)
## 'data.frame':    38000 obs. of  16 variables:
##  $ ID                 : int  40947 17139 29557 10344 33206 834 12148 47271 37594 40982 ...
##  $ Age                : int  50 27 30 40 52 58 37 33 41 34 ...
##  $ Employment.class   : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 5 5 6 5 5 5 2 ...
##  $ Fnlwgt             : int  104729 138705 144593 181015 110563 349910 218490 222205 315834 67083 ...
##  $ Education.level    : Factor w/ 16 levels " 10th"," 11th",..: 12 12 12 12 16 15 13 12 10 10 ...
##  $ Education.years    : int  9 9 9 9 10 15 14 9 13 13 ...
##  $ Marital.status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 1 3 5 6 3 3 3 3 5 5 ...
##  $ Occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 8 4 9 9 5 11 5 4 11 5 ...
##  $ Relationship.status: Factor w/ 6 levels " Husband"," Not-in-family",..: 3 1 2 5 1 1 1 6 2 5 ...
##  $ Race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 2 ...
##  $ Sex                : Factor w/ 2 levels " Female"," Male": 1 2 2 1 2 2 2 1 1 2 ...
##  $ Capital.gain       : int  0 0 0 0 0 0 0 0 0 1471 ...
##  $ Capital.loss       : int  0 0 0 0 0 1977 0 0 1876 0 ...
##  $ Work.hours.per.week: int  48 53 40 47 40 50 55 40 40 40 ...
##  $ Native.country     : Factor w/ 42 levels " ?"," Cambodia",..: 40 40 1 40 40 40 9 40 40 2 ...
##  $ Salary             : int  0 0 0 0 1 1 1 1 0 0 ...

Converting the variables 'Capital gain' and 'Capital loss' into categories.

We divide the variables Capital.gain and Capital.loss into three groups: 'None' (zero), 'Low' (positive, up to the median of the positive values) and 'High' (above that median).

df$Capital.gain.class <- cut(df$Capital.gain,
                             breaks = c(-Inf, 0, median(df$Capital.gain[df$Capital.gain > 0]), Inf),
                             labels = c("None", "Low", "High"))
df$Capital.loss.class <- cut(df$Capital.loss,
                             breaks = c(-Inf, 0, median(df$Capital.loss[df$Capital.loss > 0]), Inf),
                             labels = c("None", "Low", "High"))
head(df)
##      ID Age Employment.class Fnlwgt Education.level Education.years
## 1 40947  50          Private 104729         HS-grad               9
## 2 17139  27          Private 138705         HS-grad               9
## 3 29557  30          Private 144593         HS-grad               9
## 4 10344  40          Private 181015         HS-grad               9
## 5 33206  52          Private 110563    Some-college              10
## 6   834  58     Self-emp-inc 349910     Prof-school              15
##        Marital.status         Occupation Relationship.status   Race
## 1            Divorced  Machine-op-inspct      Other-relative  White
## 2  Married-civ-spouse       Craft-repair             Husband  White
## 3       Never-married      Other-service       Not-in-family  Black
## 4           Separated      Other-service           Unmarried  White
## 5  Married-civ-spouse    Exec-managerial             Husband  White
## 6  Married-civ-spouse     Prof-specialty             Husband  White
##       Sex Capital.gain Capital.loss Work.hours.per.week Native.country
## 1  Female            0            0                  48  United-States
## 2    Male            0            0                  53  United-States
## 3    Male            0            0                  40              ?
## 4  Female            0            0                  47  United-States
## 5    Male            0            0                  40  United-States
## 6    Male            0         1977                  50  United-States
##   Salary Capital.gain.class Capital.loss.class
## 1      0               None               None
## 2      0               None               None
## 3      0               None               None
## 4      0               None               None
## 5      1               None               None
## 6      1               None               High
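As a quick sanity check (a small sketch, not in the original code), the sizes of the new bins can be inspected with table():

# Number of observations falling into each capital gain / capital loss bin.
table(df$Capital.gain.class)
table(df$Capital.loss.class)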

Now we drop the variables that are unnecessary, viz. 'ID', 'Fnlwgt', 'Capital gain', 'Capital loss' and 'Native country' (the numeric capital gain and loss columns are replaced by their binned versions).

str(df)
## 'data.frame':    38000 obs. of  18 variables:
##  $ ID                 : int  40947 17139 29557 10344 33206 834 12148 47271 37594 40982 ...
##  $ Age                : int  50 27 30 40 52 58 37 33 41 34 ...
##  $ Employment.class   : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 5 5 6 5 5 5 2 ...
##  $ Fnlwgt             : int  104729 138705 144593 181015 110563 349910 218490 222205 315834 67083 ...
##  $ Education.level    : Factor w/ 16 levels " 10th"," 11th",..: 12 12 12 12 16 15 13 12 10 10 ...
##  $ Education.years    : int  9 9 9 9 10 15 14 9 13 13 ...
##  $ Marital.status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 1 3 5 6 3 3 3 3 5 5 ...
##  $ Occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 8 4 9 9 5 11 5 4 11 5 ...
##  $ Relationship.status: Factor w/ 6 levels " Husband"," Not-in-family",..: 3 1 2 5 1 1 1 6 2 5 ...
##  $ Race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 2 ...
##  $ Sex                : Factor w/ 2 levels " Female"," Male": 1 2 2 1 2 2 2 1 1 2 ...
##  $ Capital.gain       : int  0 0 0 0 0 0 0 0 0 1471 ...
##  $ Capital.loss       : int  0 0 0 0 0 1977 0 0 1876 0 ...
##  $ Work.hours.per.week: int  48 53 40 47 40 50 55 40 40 40 ...
##  $ Native.country     : Factor w/ 42 levels " ?"," Cambodia",..: 40 40 1 40 40 40 9 40 40 2 ...
##  $ Salary             : int  0 0 0 0 1 1 1 1 0 0 ...
##  $ Capital.gain.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Capital.loss.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 3 1 1 2 1 ...
df <- df[,-c(1,4,12,13,15)]
View(df)
str(df)
## 'data.frame':    38000 obs. of  13 variables:
##  $ Age                : int  50 27 30 40 52 58 37 33 41 34 ...
##  $ Employment.class   : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 5 5 6 5 5 5 2 ...
##  $ Education.level    : Factor w/ 16 levels " 10th"," 11th",..: 12 12 12 12 16 15 13 12 10 10 ...
##  $ Education.years    : int  9 9 9 9 10 15 14 9 13 13 ...
##  $ Marital.status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 1 3 5 6 3 3 3 3 5 5 ...
##  $ Occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 8 4 9 9 5 11 5 4 11 5 ...
##  $ Relationship.status: Factor w/ 6 levels " Husband"," Not-in-family",..: 3 1 2 5 1 1 1 6 2 5 ...
##  $ Race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 2 ...
##  $ Sex                : Factor w/ 2 levels " Female"," Male": 1 2 2 1 2 2 2 1 1 2 ...
##  $ Work.hours.per.week: int  48 53 40 47 40 50 55 40 40 40 ...
##  $ Salary             : int  0 0 0 0 1 1 1 1 0 0 ...
##  $ Capital.gain.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Capital.loss.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 3 1 1 2 1 ...

We see that most of the variables are in the required format,
but Salary is numeric,
so we need to convert it to a factor.

df$Salary <- as.factor(df$Salary)
str(df)
## 'data.frame':    38000 obs. of  13 variables:
##  $ Age                : int  50 27 30 40 52 58 37 33 41 34 ...
##  $ Employment.class   : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 5 5 6 5 5 5 2 ...
##  $ Education.level    : Factor w/ 16 levels " 10th"," 11th",..: 12 12 12 12 16 15 13 12 10 10 ...
##  $ Education.years    : int  9 9 9 9 10 15 14 9 13 13 ...
##  $ Marital.status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 1 3 5 6 3 3 3 3 5 5 ...
##  $ Occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 8 4 9 9 5 11 5 4 11 5 ...
##  $ Relationship.status: Factor w/ 6 levels " Husband"," Not-in-family",..: 3 1 2 5 1 1 1 6 2 5 ...
##  $ Race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 2 ...
##  $ Sex                : Factor w/ 2 levels " Female"," Male": 1 2 2 1 2 2 2 1 1 2 ...
##  $ Work.hours.per.week: int  48 53 40 47 40 50 55 40 40 40 ...
##  $ Salary             : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Capital.gain.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Capital.loss.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 3 1 1 2 1 ...

Train Test Split

trainIndex <- createDataPartition(df$Salary, p = .7, list = FALSE, times = 1)
dfTrain <- df[trainIndex,]
str(dfTrain)
## 'data.frame':    26601 obs. of  13 variables:
##  $ Age                : int  50 30 52 58 37 41 34 41 61 35 ...
##  $ Employment.class   : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 6 5 5 2 6 5 3 ...
##  $ Education.level    : Factor w/ 16 levels " 10th"," 11th",..: 12 12 16 15 13 10 10 16 6 12 ...
##  $ Education.years    : int  9 9 10 15 14 13 13 10 4 9 ...
##  $ Marital.status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 1 5 3 3 3 5 5 3 3 1 ...
##  $ Occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 8 9 5 11 5 11 5 6 15 6 ...
##  $ Relationship.status: Factor w/ 6 levels " Husband"," Not-in-family",..: 3 2 1 1 1 2 5 1 1 4 ...
##  $ Race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 3 5 5 5 5 2 5 3 2 ...
##  $ Sex                : Factor w/ 2 levels " Female"," Male": 1 2 2 2 2 1 2 2 2 2 ...
##  $ Work.hours.per.week: int  48 40 40 50 55 40 40 54 40 56 ...
##  $ Salary             : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 2 1 1 ...
##  $ Capital.gain.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Capital.loss.class : Factor w/ 3 levels "None","Low","High": 1 1 1 3 1 2 1 1 2 1 ...
dfTest  <- df[-trainIndex,]
str(dfTest)
## 'data.frame':    11399 obs. of  13 variables:
##  $ Age                : int  27 40 33 44 66 36 40 63 25 53 ...
##  $ Employment.class   : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 5 5 5 6 5 5 5 ...
##  $ Education.level    : Factor w/ 16 levels " 10th"," 11th",..: 12 12 12 13 7 9 12 12 10 12 ...
##  $ Education.years    : int  9 9 9 14 5 11 9 9 13 9 ...
##  $ Marital.status     : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 6 3 5 6 1 3 1 3 3 ...
##  $ Occupation         : Factor w/ 15 levels " ?"," Adm-clerical",..: 4 9 4 11 9 14 5 6 11 8 ...
##  $ Relationship.status: Factor w/ 6 levels " Husband"," Not-in-family",..: 1 5 6 2 2 5 1 2 1 1 ...
##  $ Race               : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 5 3 5 5 5 5 5 ...
##  $ Sex                : Factor w/ 2 levels " Female"," Male": 2 1 1 1 1 1 2 2 2 2 ...
##  $ Work.hours.per.week: int  53 47 40 35 30 40 50 25 40 40 ...
##  $ Salary             : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 2 ...
##  $ Capital.gain.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Capital.loss.class : Factor w/ 3 levels "None","Low","High": 1 1 1 1 1 1 1 1 1 1 ...

We check for missing values.

anyNA(dfTrain)
## [1] FALSE
 anyNA(dfTest)
## [1] FALSE

So, there are no missing values.
Now we need to train the decision tree classifier.

Training a Decision Tree classifier model first, for the sake of understanding.

The caret package provides the train() function for fitting many different algorithms; we just pass different parameter values for each algorithm. Before calling train(), we first set up trainControl(), which controls the computational nuances of train() — here, 10-fold cross-validation repeated 3 times.

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(Salary ~., data = dfTrain, method = "rpart", parms = list(split = "information"),
              trControl=trctrl, tuneLength = 10)

So we have fit a decision tree model with 'rpart' as the method, using information gain as the splitting criterion.
We can check the result of train() by printing the fitted object.
It shows the accuracy metrics for different values of cp, the complexity parameter of the tree.

dtree_fit
## CART
##
## 26601 samples
##    12 predictor
##     2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 23941, 23940, 23941, 23941, 23941, 23941, ...
## Resampling results across tuning parameters:
##
##   cp           Accuracy   Kappa
##   0.001572080  0.8493164  0.5353711
##   0.002200912  0.8461084  0.5289723
##   0.002279516  0.8458077  0.5285530
##   0.002358120  0.8456824  0.5268535
##   0.003772992  0.8411212  0.5181588
##   0.003930200  0.8405323  0.5141373
##   0.005502280  0.8390285  0.5036012
##   0.038673165  0.8293047  0.4572364
##   0.039301997  0.8244425  0.4368127
##   0.121836189  0.7936166  0.2410693
##
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.00157208.

In the table above we see the accuracy for each candidate cp; the largest accuracy determines the final value, cp = 0.00157208.
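Optionally, the accuracy-versus-cp profile can also be visualised directly; caret provides a plot method for train objects:

# Plot resampled accuracy against the candidate cp values.
plot(dtree_fit)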

Visualisation of the fitted tree.

knitr::opts_chunk$set(fig.width=12, fig.height=10)

We visualise the fitted tree using the prp() function.

prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)


We can clearly see that this tree is not that good: it has grown very deep through its later child nodes,
so there is a high chance of overfitting.
We need to do something else, which we will show later.
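One simple remedy (a side sketch, not part of the original analysis) is to prune the fitted rpart object back at a larger complexity parameter, trading a little accuracy for a much smaller tree; cp = 0.005 below is only an illustrative value taken from the tuning table above:

# Prune the final model at a larger cp and redraw the smaller tree.
dtree_pruned <- prune(dtree_fit$finalModel, cp = 0.005)
prp(dtree_pruned, box.palette = "Reds", tweak = 1.2)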

Prediction using this decision tree model

The final model was built with cp = 0.00157208, as reported above.

test_Pred <- predict(dtree_fit, newdata = dfTest)
test_Pred
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
## [35] 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1
## [69] 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0
## ...
## ...
## [11255] 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1
## [11289] 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
## [11323] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [11357] 1 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [11391] 0 0 0 0 0 0 0 0 0
## Levels: 0 1

We now want to see the confusion matrix for the predictions made.

confusionMatrix(test_Pred, dfTest$Salary )
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 8251 1353
##          1  422 1373
##
##                Accuracy : 0.8443
##                  95% CI : (0.8375, 0.8509)
##     No Information Rate : 0.7609
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.5154
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.9513
##             Specificity : 0.5037
##          Pos Pred Value : 0.8591
##          Neg Pred Value : 0.7649
##              Prevalence : 0.7609
##          Detection Rate : 0.7238
##    Detection Prevalence : 0.8425
##       Balanced Accuracy : 0.7275
##
##        'Positive' Class : 0
## 

The accuracy is about 84.4% when information gain is used as the splitting criterion.
Let us try to fit a different decision tree using the Gini index as the criterion.

set.seed(3333)
dtree_fit_gini <- train(Salary ~., data = dfTest, method = "rpart",
                   parms = list(split = "gini"),
                   trControl=trctrl,
                   tuneLength = 10)
dtree_fit_gini
## CART
##
## 11399 samples
##    12 predictor
##     2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 10260, 10259, 10259, 10260, 10258, 10259, ...
## Resampling results across tuning parameters:
##
##   cp           Accuracy   Kappa
##   0.001650770  0.8482893  0.5589958
##   0.001711910  0.8482893  0.5589958
##   0.001834189  0.8476462  0.5585721
##   0.002292737  0.8459506  0.5543998
##   0.002494497  0.8458341  0.5540337
##   0.004157496  0.8399559  0.5272490
##   0.004181952  0.8398682  0.5262446
##   0.035950110  0.8315043  0.4705871
##   0.040719002  0.8251008  0.4446362
##   0.121973588  0.7841048  0.1835723
##
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.00171191.

In the table above we again see the accuracy for each candidate cp. Note, however, that this model was fit with data = dfTest (hence the 11399 samples in the summary), so it is trained on the same set it will be evaluated on; for a fair comparison with the first tree the call should use data = dfTrain, and the accuracy reported below should be read with that caveat.

Visualisation of the newly fitted tree.

prp(dtree_fit_gini$finalModel, box.palette = "Blues", tweak = 1.2)


This tree looks better balanced, as it has a richer but still limited number of nodes.

Prediction using this newly fitted Decision Tree.

test_pred_gini <- predict(dtree_fit_gini, newdata = dfTest)

We now look at the confusion matrix for this model.

confusionMatrix(test_pred_gini, dfTest$Salary )
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 7928  904
##          1  745 1822
##
##                Accuracy : 0.8553
##                  95% CI : (0.8487, 0.8617)
##     No Information Rate : 0.7609
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.5944
##  Mcnemar's Test P-Value : 9.989e-05
##
##             Sensitivity : 0.9141
##             Specificity : 0.6684
##          Pos Pred Value : 0.8976
##          Neg Pred Value : 0.7098
##              Prevalence : 0.7609
##          Detection Rate : 0.6955
##    Detection Prevalence : 0.7748
##       Balanced Accuracy : 0.7912
##
##        'Positive' Class : 0
## 

We see that the accuracy is about 85.5%, which appears to be a slightly better classification result,
though recall the caveat that this model was trained on the test set itself, so the comparison is optimistic.
Now we proceed to the random forest mechanism.

Training the Random Forest Model

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## Attaching package: 'randomForest'
## The following object is masked from 'package:Hmisc':
##
##     combine
## The following object is masked from 'package:ggplot2':
##
##     margin

We now fit the model.

fit.rf <- randomForest(Salary ~ ., data=dfTrain, importance=T)
fit.rf
## Call:
##  randomForest(formula = Salary ~ ., data = dfTrain, importance = T)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
##
##         OOB estimate of  error rate: 9.88%
## Confusion matrix:
##       0    1 class.error
## 0 19258  982  0.04851779
## 1  1646 4715  0.25876435

We see that the random forest fit contains 500 trees by default.
The out-of-bag (OOB) estimate of the error rate is 9.88%, which is still fairly high.
We also need to check the predictions and the model's performance statistics.

plot(fit.rf)


From the plot we see that the error rate stabilises after about 400 decision trees have been grown.
Beyond 400 trees, there is no significant reduction in the error rate.
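If training time mattered, the forest could be refit with fewer trees (a sketch; ntree = 400 is an illustrative value suggested by the error plot above):

# Refit with 400 trees, since the OOB error has essentially flattened by then.
fit.rf.400 <- randomForest(Salary ~ ., data = dfTrain, ntree = 400, importance = TRUE)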
We also want to plot which variables are more important than others.

varImpPlot(fit.rf, sort = T, main="Variable Importance")


This plot gives a detailed visual view of variable importance.
For the exact numbers, we look at the following table.

var.imp <- data.frame(importance(fit.rf, type=2))
var.imp$Variables <- row.names(var.imp)
var.imp[order(var.imp$MeanDecreaseGini,decreasing = T),]
##                     MeanDecreaseGini           Variables
## Age                        1121.4979                 Age
## Relationship.status        1043.5788 Relationship.status
## Occupation                  917.0874          Occupation
## Marital.status              864.6072      Marital.status
## Capital.gain.class          772.8944  Capital.gain.class
## Work.hours.per.week         680.1271 Work.hours.per.week
## Education.years             628.1257     Education.years
## Education.level             562.0340     Education.level
## Employment.class            371.0931    Employment.class
## Capital.loss.class          160.5204  Capital.loss.class
## Race                        146.6559                Race
## Sex                         119.7670                 Sex

Based on the random forest variable importance, a subset of variables could be selected for further predictive modelling or other machine learning techniques, as sketched below.
Here we see that variables such as 'Age', 'Relationship.status', 'Occupation' and 'Marital.status' are more influential than the others.
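A minimal sketch of such a selection (illustrative only; the cut-off of six variables is arbitrary):

# Keep the six predictors with the highest MeanDecreaseGini and refit.
top_vars <- head(var.imp$Variables[order(var.imp$MeanDecreaseGini, decreasing = TRUE)], 6)
reduced_formula <- reformulate(top_vars, response = "Salary")
fit.rf.reduced <- randomForest(reduced_formula, data = dfTrain)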
We now use this fitted model for prediction.

Predicting using Random Forests

dfTest$predicted_salary <- predict(fit.rf, dfTest)

So we have predicted the values of 'Salary' using the random forest.
We now need to check how good the prediction is.

library(e1071)
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
##     impute
library(caret)

We go for the Confusion Matrix.

predicted.rfs <- predict(fit.rf, dfTest)
dfTest$predicted <- predicted.rfs
confusionMatrix(predicted.rfs , dfTest$Salary)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 8274  761
##          1  399 1965
##
##                Accuracy : 0.8982
##                  95% CI : (0.8925, 0.9037)
##     No Information Rate : 0.7609
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.707
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.9540
##             Specificity : 0.7208
##          Pos Pred Value : 0.9158
##          Neg Pred Value : 0.8312
##              Prevalence : 0.7609
##          Detection Rate : 0.7259
##    Detection Prevalence : 0.7926
##       Balanced Accuracy : 0.8374
##
##        'Positive' Class : 0
## 

We see that the accuracy of the model is about 89.8%, which is good.
So the model has worked well in this case.
However, we also want to check the model's performance from another angle.
So we look at the Gini index of the predictions and the corresponding Lorenz curve: –

library(gains)
library(ineq)
# Gini Index
ineq(dfTest$predicted,type="Gini")
## [1] 0.1361431
## Lorenz Curve
plot(Lc(dfTest$predicted),col="darkred",lwd=2)


So the Gini index is only about 0.136, which is not that good, and the Lorenz curve deviates only slightly from the line of equality, indicating that the random forest classifier selected here does not suit the data especially well and gives us somewhat inefficient results.
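As a complementary check (a sketch using the pROC package loaded earlier; not part of the original analysis), an ROC curve on the predicted class probabilities gives a threshold-free view of the same model:

# ROC / AUC for the random forest, based on the predicted probability of class "1".
rf_probs <- predict(fit.rf, dfTest, type = "prob")[, "1"]
roc_rf <- roc(response = dfTest$Salary, predictor = rf_probs)
auc(roc_rf)
plot(roc_rf)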

So the model may require further, different procedures to be explored.

Applying support vector machines could perhaps serve our purpose better than this.

Our very basic analysis on Random Forests ends here.

Basic Comparison with SVM :

Since many different machine learning algorithms exist for classification tasks, we might also improve on the accuracy by implementing an SVM.
SVM requires considerable computing space and time, but it classifies better in some cases.
We try the SVM classification separately and compare its results with the random forest model above. The intricate details of the SVM code and its explanation are kept in the background rather than presented here, to avoid clutter and unnecessary confusion.
We do not guarantee a large change in results from using SVM, though a little improvement can be expected.

memory.limit(size = 900000)
## [1] 9e+05
library(caret)
library(e1071)
library(kernlab)
library(pROC)

Fitting the SVM Model

model_svm <- svm(Salary ~ . , dfTrain)
model_svm
##
## Call:
## svm(formula = Salary ~ ., data = dfTrain)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  radial
##        cost:  1
##       gamma:  0.01639344
##
## Number of Support Vectors:  9196

Predicting using SVM

pred_svm <- predict(model_svm, dfTest)
dfTest$predicted_Salary_SVM <- pred_svm

Confusion Matrix and accuracy check of SVM.

Conf_SVM_Test <- confusionMatrix(pred_svm , dfTest$Salary)
Conf_SVM_Test
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 8242 1281
##          1  431 1445
##
##                Accuracy : 0.8498
##                  95% CI : (0.8431, 0.8563)
##     No Information Rate : 0.7609
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.5379
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.9503
##             Specificity : 0.5301
##          Pos Pred Value : 0.8655
##          Neg Pred Value : 0.7703
##              Prevalence : 0.7609
##          Detection Rate : 0.7230
##    Detection Prevalence : 0.8354
##       Balanced Accuracy : 0.7402
##
##        'Positive' Class : 0
## 

The accuracy of the SVM model is 84.98%, which is less than that of the random forest model.

SVM Model Performance

# Gini Index
ineq(dfTest$predicted_Salary_SVM,type="Gini")
## [1] 0.1180607
## Lorenz Curve
plot(Lc(dfTest$predicted_Salary_SVM),col="darkred",lwd=2)

Comparison and Conclusion:

We see that SVM does not give us a good classification either.
But that is not the end of the world. Looking closely, the random forest had a Gini coefficient of about 0.136 while this SVM fit has a Gini coefficient of about 0.118, i.e. the goodness of fit has decreased for SVM, but not by a big margin.
So the SVM classifier is less effective than the random forest classifier here, though both are unsatisfactory as far as overall fit is concerned.
This shows that accuracy alone is not the ultimate decider; the overall goodness of fit of the model should also be part of what we maximise in a classification task.
In this scenario the random forest classified better than the SVM in terms of both accuracy and goodness of fit.
The random forest turns out to be the winner so far; in other cases the SVM might come out on top.

Our comparison task is thus completed here.
