Decision Tree
Decision tree in R

Introduction

A decision tree is a decision support tool that uses
a tree-like graph or model of decisions and their possible
consequences, including chance event outcomes, resource costs, and
utility. It is one way to display an algorithm that contains only
conditional control statements.
Decision trees are commonly used in operations research, specifically
in decision analysis, to help identify the strategy most likely to reach a
goal, and they are also a popular tool in machine learning. In this
technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant
splitter / differentiator among the input variables.
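One common way to measure how "significant" a splitter is, and the default criterion rpart uses for classification, is the decrease in Gini impurity. The sketch below is only an illustration on made-up toy data; the helper names gini and gini_gain are hypothetical and not part of any package.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity decrease obtained by splitting on a logical condition
gini_gain <- function(labels, goes_left) {
  n <- length(labels)
  n_left <- sum(goes_left)
  n_right <- n - n_left
  gini(labels) -
    (n_left / n) * gini(labels[goes_left]) -
    (n_right / n) * gini(labels[!goes_left])
}

# Toy data (hypothetical): 8 observations, two candidate predictors
y  <- c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)   # separates the classes perfectly at 4.5
x2 <- c(1, 5, 2, 6, 3, 7, 4, 8)   # leaves both halves evenly mixed

gain_x1 <- gini_gain(y, x1 < 4.5)   # 0.5: both child nodes are pure
gain_x2 <- gini_gain(y, x2 < 4.5)   # 0: the split does not help at all
```

A tree-growing algorithm would evaluate many such candidate splits and keep the one with the largest gain, here the split on x1.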

Types of Decision Trees

The type of decision tree depends on the type of target variable. There are two types:

  • Regression Trees: Decision trees with a continuous target variable are called regression trees.

We are all familiar with linear regression as a way of
making quantitative predictions. In simple linear regression, a
real-valued dependent variable Y is modeled as a linear function of a
real-valued independent variable X plus noise. In multiple
regression, we allow several independent variables X1, X2, . . . , Xp
and frame the model in the same way.
This works well as long as the variables are independent and each
has a strictly additive effect on Y. Even if the variables are
not independent, it is possible to incorporate some
interactions. However, as the number of variables grows, this gets
harder and harder. Moreover, the relationship may no longer be linear.
Hence the need for regression trees.
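To make the point about non-linear relationships concrete, here is a small sketch that compares a linear model with a regression tree on simulated data. The data-generating step function is an assumption chosen purely for illustration; rpart's method = "anova" fits a regression tree.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(1)
n <- 200
x <- runif(n)
# Hypothetical non-linear truth: a jump at x = 0.5, plus small noise
y <- ifelse(x < 0.5, 0, 10) + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)

fit_lm   <- lm(y ~ x, data = d)
fit_tree <- rpart(y ~ x, data = d, method = "anova")  # regression tree

# Mean squared error on the training data
mse <- function(fit) mean((d$y - predict(fit, d))^2)
mse_lm   <- mse(fit_lm)
mse_tree <- mse(fit_tree)
c(linear = mse_lm, tree = mse_tree)
```

On data like this the tree can place a split near the jump, while the straight line cannot follow it, so the tree's training error should be far lower. On data that really is linear, the comparison would go the other way.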

  • Classification Tree: A classification tree is very
    similar to a regression tree, except that it is used to predict a
    qualitative response rather than a quantitative one. In a classification
    tree, we predict that each observation belongs to the most commonly
    occurring class of training observations in the region to which it
    belongs. In interpreting the results of a classification tree, we are
    often interested not only in the class prediction corresponding to a
    particular terminal node region, but also in the class proportions among
    the training observations that fall in that region.
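As a small worked example, the counts below are those of one terminal node (node 7) of the Carseats tree printed later in this post: 10 training observations of class No and 63 of class Yes. The variable names are only illustrative.

```r
# Class counts of the training observations that fall in one region
counts <- c(No = 10, Yes = 63)

proportions <- counts / sum(counts)      # class proportions in the region
predicted   <- names(which.max(counts))  # majority class is the prediction

round(proportions, 3)   # No 0.137, Yes 0.863
predicted               # "Yes"
```

The prediction for every observation in this region is Yes, and the proportions indicate how confident that prediction is.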

Advantages

1. Easy to Explain: Decision trees are very easy to
understand, even for people from a non-analytical background. Reading
and interpreting them requires no statistical knowledge. In fact,
they can be even easier to interpret than linear regression!
2. Useful in Data exploration: A decision tree is one
of the fastest ways to identify the most significant variables and the
relations between two or more variables. With the help of decision trees,
we can create new variables / features that have better power to predict
the target variable.
3. Less data cleaning required: Decision trees are fairly insensitive to outliers and missing values.
4. Non-Parametric Method: A decision tree is
considered a non-parametric method. This means that decision trees
make no assumptions about the underlying distribution of the data or the
structure of the classifier.

Disadvantages

1. Overfitting: Overfitting is one of the main
practical difficulties with decision tree models. It is addressed
by setting constraints on model parameters and by pruning.
2. Lack of predictive accuracy: A single decision tree generally does not reach the predictive accuracy of some other regression and classification approaches.
3. Non-Robust: Decision trees are non-robust, meaning that a small change in the data can cause a large change in the final estimated tree.
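The two remedies for overfitting mentioned above, constraints and pruning, might look as follows in rpart. This is a sketch on the kyphosis data set that ships with rpart, not on the data used later in this post; the particular control values are arbitrary choices, and n_leaves is a hypothetical helper.

```r
library(rpart)

set.seed(42)  # the cross-validation inside rpart is random
# Grow a deliberately large tree: cp = 0 switches off the complexity penalty,
# while minsplit and maxdepth are the kind of constraints that limit growth
big <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(cp = 0, minsplit = 5, maxdepth = 10))

# Choose the cp value whose cross-validated error (xerror) is smallest
cp_table <- big$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to that complexity
pruned <- prune(big, cp = best_cp)

# Count terminal nodes before and after pruning
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(full = n_leaves(big), pruned = n_leaves(pruned))
```

Because the cross-validation folds are random, the selected cp and the size of the pruned tree can change with the seed, which is itself a small demonstration of the non-robustness noted in point 3.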

DECISION TREE IN R:

For this, we will use the Carseats data set from the ISLR package, a
simulated data set on sales of child car seats at 400 different stores.
It is a data frame with 400 observations on the following 11 variables:

  1. Sales: Unit sales (in thousands) at each location
  2. CompPrice: Price charged by competitor at each location
  3. Income: Community income level (in thousands of dollars)
  4. Advertising: Local advertising budget for company at each location (in thousands of dollars)
  5. Population: Population size in region (in thousands)
  6. Price: Price company charges for car seats at each site
  7. ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
  8. Age: Average age of the local population
  9. Education: Education level at each location
  10. Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
  11. US: A factor with levels No and Yes to indicate whether the store is in the US or not

R-CODE:

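library(ISLR)   # provides the Carseats data frame
# Label each store "Yes" if unit sales are at least 8 (thousand), else "No"
high=ifelse(Carseats$Sales<8,"No","Yes")
# Car holds the data plus the label for inspection; the model below uses Carseats and high directly
Car=cbind(Carseats,high)
head(Car)
library(rpart)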

 

## Warning: package 'rpart' was built under R version 3.2.5
tree=rpart(high~.-Sales,Carseats,method="class")
tree
## n= 400
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##   1) root 400 164 No (0.59000000 0.41000000)
##     2) ShelveLoc=Bad,Medium 315  98 No (0.68888889 0.31111111)
##       4) Price>=92.5 269  66 No (0.75464684 0.24535316)
##         8) Advertising< 13.5 224  41 No (0.81696429 0.18303571)
##          16) CompPrice< 124.5 96   6 No (0.93750000 0.06250000) *
##          17) CompPrice>=124.5 128  35 No (0.72656250 0.27343750)
##            34) Price>=109.5 107  20 No (0.81308411 0.18691589)
##              68) Price>=126.5 65   6 No (0.90769231 0.09230769) *
##              69) Price< 126.5 42  14 No (0.66666667 0.33333333)
##               138) Age>=49.5 22   2 No (0.90909091 0.09090909) *
##               139) Age< 49.5 20   8 Yes (0.40000000 0.60000000) *
##            35) Price< 109.5 21   6 Yes (0.28571429 0.71428571) *
##         9) Advertising>=13.5 45  20 Yes (0.44444444 0.55555556)
##          18) Age>=54.5 20   5 No (0.75000000 0.25000000) *
##          19) Age< 54.5 25   5 Yes (0.20000000 0.80000000) *
##       5) Price< 92.5 46  14 Yes (0.30434783 0.69565217)
##        10) Income< 57 10   3 No (0.70000000 0.30000000) *
##        11) Income>=57 36   7 Yes (0.19444444 0.80555556) *
##     3) ShelveLoc=Good 85  19 Yes (0.22352941 0.77647059)
##       6) Price>=142.5 12   3 No (0.75000000 0.25000000) *
##       7) Price< 142.5 73  10 Yes (0.13698630 0.86301370) *
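Each line of the printout follows the header "node), split, n, loss, yval, (yprob)": n is the number of observations in the node, loss is the number that would be misclassified if the node predicted yval, and yprob gives the class proportions. The short check below re-derives some of these figures by hand from the numbers printed above; the variable names are only illustrative.

```r
# Node 2 from the printout: n = 315, loss = 98, predicted class "No"
n_node <- 315
loss   <- 98
yprob  <- c(No = (n_node - loss) / n_node, Yes = loss / n_node)
round(yprob, 8)   # 0.68888889 0.31111111, as printed for node 2

# loss values of the 11 terminal nodes (lines ending in *), in print order:
# nodes 16, 68, 138, 139, 35, 18, 19, 10, 11, 6, 7
leaf_loss <- c(6, 6, 2, 8, 6, 5, 5, 3, 7, 3, 10)

train_errors     <- sum(leaf_loss)      # 61 misclassified training observations
train_error_rate <- train_errors / 400  # 0.1525, since the root node has n = 400
```

So the tree misclassifies 61 of the 400 training observations, compared with 164 for the root node alone, which predicts No for every store.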
summary(tree)
## Call:
## rpart(formula = high ~ . - Sales, data = Carseats, method = "class")
##   n= 400
##
##           CP nsplit rel error    xerror       xstd
## 1 0.28658537      0 1.0000000 1.0000000 0.05997967
## 2 0.10975610      1 0.7134146 0.7134146 0.05547692
## 3 0.04573171      2 0.6036585 0.6158537 0.05298128
## 4 0.03658537      4 0.5121951 0.6097561 0.05280643
## 5 0.02743902      5 0.4756098 0.5975610 0.05244966
## 6 0.02439024      7 0.4207317 0.5853659 0.05208331
## 7 0.01219512      8 0.3963415 0.5975610 0.05244966
## 8 0.01000000     10 0.3719512 0.5609756 0.05132104
##
## Variable importance
##       Price   ShelveLoc         Age Advertising   CompPrice      Income
##          34          25          11          11           9           5
##  Population   Education
##           3           1
##
## Node number 1: 400 observations,    complexity param=0.2865854
##   predicted class=No   expected loss=0.41  P(node) =1
##     class counts:   236   164
##    probabilities: 0.590 0.410
##   left son=2 (315 obs) right son=3 (85 obs)
##   Primary splits:
##       ShelveLoc   splits as  LRL,       improve=28.991900, (0 missing)
##       Price       < 92.5  to the right, improve=19.463880, (0 missing)
##       Advertising < 6.5   to the left,  improve=17.277980, (0 missing)
##       Age         < 61.5  to the right, improve= 9.264442, (0 missing)
##       Income      < 60.5  to the left,  improve= 7.249032, (0 missing)
##
## Node number 2: 315 observations,    complexity param=0.1097561
##   predicted class=No   expected loss=0.3111111  P(node) =0.7875
##     class counts:   217    98
##    probabilities: 0.689 0.311
##   left son=4 (269 obs) right son=5 (46 obs)
##   Primary splits:
##       Price       < 92.5  to the right, improve=15.930580, (0 missing)
##       Advertising < 7.5   to the left,  improve=11.432570, (0 missing)
##       ShelveLoc   splits as  L-R,       improve= 7.543912, (0 missing)
##       Age         < 50.5  to the right, improve= 6.369905, (0 missing)
##       Income      < 60.5  to the left,  improve= 5.984509, (0 missing)
##   Surrogate splits:
##       CompPrice < 95.5  to the right, agree=0.873, adj=0.13, (0 split)
##
## Node number 3: 85 observations,    complexity param=0.03658537
##   predicted class=Yes  expected loss=0.2235294  P(node) =0.2125
##     class counts:    19    66
##    probabilities: 0.224 0.776
##   left son=6 (12 obs) right son=7 (73 obs)
##   Primary splits:
##       Price       < 142.5 to the right, improve=7.745608, (0 missing)
##       US          splits as  LR,        improve=5.112440, (0 missing)
##       Income      < 35    to the left,  improve=4.529433, (0 missing)
##       Advertising < 6     to the left,  improve=3.739996, (0 missing)
##       Education   < 15.5  to the left,  improve=2.565856, (0 missing)
##   Surrogate splits:
##       CompPrice < 154.5 to the right, agree=0.882, adj=0.167, (0 split)
##
## Node number 4: 269 observations,    complexity param=0.04573171
##   predicted class=No   expected loss=0.2453532  P(node) =0.6725
##     class counts:   203    66
##    probabilities: 0.755 0.245
##   left son=8 (224 obs) right son=9 (45 obs)
##   Primary splits:
##       Advertising < 13.5  to the left,  improve=10.400090, (0 missing)
##       Age         < 49.5  to the right, improve= 8.083998, (0 missing)
##       ShelveLoc   splits as  L-R,       improve= 7.023150, (0 missing)
##       CompPrice   < 124.5 to the left,  improve= 6.749986, (0 missing)
##       Price       < 126.5 to the right, improve= 5.646063, (0 missing)
##
## Node number 5: 46 observations,    complexity param=0.02439024
##   predicted class=Yes  expected loss=0.3043478  P(node) =0.115
##     class counts:    14    32
##    probabilities: 0.304 0.696
##   left son=10 (10 obs) right son=11 (36 obs)
##   Primary splits:
##       Income      < 57    to the left,  improve=4.000483, (0 missing)
##       ShelveLoc   splits as  L-R,       improve=3.189762, (0 missing)
##       Advertising < 9.5   to the left,  improve=1.388592, (0 missing)
##       Price       < 80.5  to the right, improve=1.388592, (0 missing)
##       Age         < 64.5  to the right, improve=1.172885, (0 missing)
##
## Node number 6: 12 observations
##   predicted class=No   expected loss=0.25  P(node) =0.03
##     class counts:     9     3
##    probabilities: 0.750 0.250
##
## Node number 7: 73 observations
##   predicted class=Yes  expected loss=0.1369863  P(node) =0.1825
##     class counts:    10    63
##    probabilities: 0.137 0.863
##
## Node number 8: 224 observations,    complexity param=0.02743902
##   predicted class=No   expected loss=0.1830357  P(node) =0.56
##     class counts:   183    41
##    probabilities: 0.817 0.183
##   left son=16 (96 obs) right son=17 (128 obs)
##   Primary splits:
##       CompPrice   < 124.5 to the left,  improve=4.881696, (0 missing)
##       Age         < 49.5  to the right, improve=3.960418, (0 missing)
##       ShelveLoc   splits as  L-R,       improve=3.654633, (0 missing)
##       Price       < 126.5 to the right, improve=3.234428, (0 missing)
##       Advertising < 6.5   to the left,  improve=2.371276, (0 missing)
##   Surrogate splits:
##       Price      < 115.5 to the left,  agree=0.741, adj=0.396, (0 split)
##       Age        < 50.5  to the right, agree=0.634, adj=0.146, (0 split)
##       Population < 405   to the right, agree=0.629, adj=0.135, (0 split)
##       Education  < 11.5  to the left,  agree=0.585, adj=0.031, (0 split)
##       Income     < 22.5  to the left,  agree=0.580, adj=0.021, (0 split)
##
## Node number 9: 45 observations,    complexity param=0.04573171
##   predicted class=Yes  expected loss=0.4444444  P(node) =0.1125
##     class counts:    20    25
##    probabilities: 0.444 0.556
##   left son=18 (20 obs) right son=19 (25 obs)
##   Primary splits:
##       Age       < 54.5  to the right, improve=6.722222, (0 missing)
##       CompPrice < 121.5 to the left,  improve=4.629630, (0 missing)
##       ShelveLoc splits as  L-R,       improve=3.250794, (0 missing)
##       Income    < 99.5  to the left,  improve=3.050794, (0 missing)
##       Price     < 127   to the right, improve=2.933429, (0 missing)
##   Surrogate splits:
##       Population  < 363.5 to the left,  agree=0.667, adj=0.25, (0 split)
##       Income      < 39    to the left,  agree=0.644, adj=0.20, (0 split)
##       Advertising < 17.5  to the left,  agree=0.644, adj=0.20, (0 split)
##       CompPrice   < 106.5 to the left,  agree=0.622, adj=0.15, (0 split)
##       Price       < 135.5 to the right, agree=0.622, adj=0.15, (0 split)
##
## Node number 10: 10 observations
##   predicted class=No   expected loss=0.3  P(node) =0.025
##     class counts:     7     3
##    probabilities: 0.700 0.300
##
## Node number 11: 36 observations
##   predicted class=Yes  expected loss=0.1944444  P(node) =0.09
##     class counts:     7    29
##    probabilities: 0.194 0.806
##
## Node number 16: 96 observations
##   predicted class=No   expected loss=0.0625  P(node) =0.24
##     class counts:    90     6
##    probabilities: 0.938 0.062
##
## Node number 17: 128 observations,    complexity param=0.02743902
##   predicted class=No   expected loss=0.2734375  P(node) =0.32
##     class counts:    93    35
##    probabilities: 0.727 0.273
##   left son=34 (107 obs) right son=35 (21 obs)
##   Primary splits:
##       Price     < 109.5 to the right, improve=9.764582, (0 missing)
##       ShelveLoc splits as  L-R,       improve=6.320022, (0 missing)
##       Age       < 49.5  to the right, improve=2.575061, (0 missing)
##       Income    < 108.5 to the right, improve=1.799546, (0 missing)
##       CompPrice < 143.5 to the left,  improve=1.741982, (0 missing)
##
## Node number 18: 20 observations
##   predicted class=No   expected loss=0.25  P(node) =0.05
##     class counts:    15     5
##    probabilities: 0.750 0.250
##
## Node number 19: 25 observations
##   predicted class=Yes  expected loss=0.2  P(node) =0.0625
##     class counts:     5    20
##    probabilities: 0.200 0.800
##
## Node number 34: 107 observations,    complexity param=0.01219512
##   predicted class=No   expected loss=0.1869159  P(node) =0.2675
##     class counts:    87    20
##    probabilities: 0.813 0.187
##   left son=68 (65 obs) right son=69 (42 obs)
##   Primary splits:
##       Price     < 126.5 to the right, improve=2.9643900, (0 missing)
##       CompPrice < 147.5 to the left,  improve=2.2337090, (0 missing)
##       ShelveLoc splits as  L-R,       improve=2.2125310, (0 missing)
##       Age       < 49.5  to the right, improve=2.1458210, (0 missing)
##       Income    < 60.5  to the left,  improve=0.8025853, (0 missing)
##   Surrogate splits:
##       CompPrice   < 129.5 to the right, agree=0.664, adj=0.143, (0 split)
##       Advertising < 3.5   to the right, agree=0.664, adj=0.143, (0 split)
##       Population  < 53.5  to the right, agree=0.645, adj=0.095, (0 split)
##       Age         < 77.5  to the left,  agree=0.636, adj=0.071, (0 split)
##       US          splits as  RL,        agree=0.626, adj=0.048, (0 split)
##
## Node number 35: 21 observations
##   predicted class=Yes  expected loss=0.2857143  P(node) =0.0525
##     class counts:     6    15
##    probabilities: 0.286 0.714
##
## Node number 68: 65 observations
##   predicted class=No   expected loss=0.09230769  P(node) =0.1625
##     class counts:    59     6
##    probabilities: 0.908 0.092
##
## Node number 69: 42 observations,    complexity param=0.01219512
##   predicted class=No   expected loss=0.3333333  P(node) =0.105
##     class counts:    28    14
##    probabilities: 0.667 0.333
##   left son=138 (22 obs) right son=139 (20 obs)
##   Primary splits:
##       Age         < 49.5  to the right, improve=5.4303030, (0 missing)
##       CompPrice   < 137.5 to the left,  improve=2.1000000, (0 missing)
##       Advertising < 5.5   to the left,  improve=1.8666670, (0 missing)
##       ShelveLoc   splits as  L-R,       improve=1.4291670, (0 missing)
##       Population  < 382   to the right, improve=0.8578431, (0 missing)
##   Surrogate splits:
##       Income      < 46.5  to the left,  agree=0.595, adj=0.15, (0 split)
##       Education   < 12.5  to the left,  agree=0.595, adj=0.15, (0 split)
##       CompPrice   < 131.5 to the right, agree=0.571, adj=0.10, (0 split)
##       Advertising < 5.5   to the left,  agree=0.571, adj=0.10, (0 split)
##       Population  < 221.5 to the left,  agree=0.571, adj=0.10, (0 split)
##
## Node number 138: 22 observations
##   predicted class=No   expected loss=0.09090909  P(node) =0.055
##     class counts:    20     2
##    probabilities: 0.909 0.091
##
## Node number 139: 20 observations
##   predicted class=Yes  expected loss=0.4  P(node) =0.05
##     class counts:     8    12
##    probabilities: 0.400 0.600
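The CP table at the top of the summary can be used to judge how far the tree should be pruned. The sketch below copies that table by hand and applies the "1-SE rule", a common heuristic that was not used in the analysis above: pick the smallest tree whose cross-validated error is within one standard error of the minimum. The column name rel_error and the other variable names are only illustrative.

```r
# CP table copied from the summary(tree) output above
cp_table <- data.frame(
  CP        = c(0.28658537, 0.10975610, 0.04573171, 0.03658537,
                0.02743902, 0.02439024, 0.01219512, 0.01000000),
  nsplit    = c(0, 1, 2, 4, 5, 7, 8, 10),
  rel_error = c(1.0000000, 0.7134146, 0.6036585, 0.5121951,
                0.4756098, 0.4207317, 0.3963415, 0.3719512),
  xerror    = c(1.0000000, 0.7134146, 0.6158537, 0.6097561,
                0.5975610, 0.5853659, 0.5975610, 0.5609756),
  xstd      = c(0.05997967, 0.05547692, 0.05298128, 0.05280643,
                0.05244966, 0.05208331, 0.05244966, 0.05132104)
)

# Error rate of the root node, which predicts "No" for all 400 stores
root_error <- 164 / 400

# The table's errors are relative to the root, so rescale to absolute rates
train_error <- cp_table$rel_error[8] * root_error   # about 0.1525
cv_error    <- min(cp_table$xerror) * root_error    # about 0.23

# 1-SE rule: first (smallest) tree with xerror <= minimum xerror + its xstd
best      <- which.min(cp_table$xerror)
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- which(cp_table$xerror <= threshold)[1]
cp_table[one_se, c("CP", "nsplit")]   # row 4: CP = 0.03658537, 4 splits
```

The gap between the training error (about 15%) and the cross-validated error (about 23%) hints at some overfitting in the full 10-split tree. The selected CP value could then be passed to rpart's prune() function. Since xerror comes from random cross-validation folds, these figures will differ somewhat from run to run.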

 

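plot(tree, margin=0.1)            # draw the branches; the margin keeps labels from being clipped
text(tree, use.n=TRUE, cex=0.8)   # add split labels and per-class counts at each leaf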

 
