Logistic Regression in R

Why Logistic Regression?

The linear regression model assumes that the response variable Y is quantitative. In many situations, however, the response variable is qualitative.

For example, eye colour is qualitative, taking on the values blue, brown or green. Qualitative variables are often referred to as categorical.

Here we study approaches for predicting a qualitative response, a process known as classification. Predicting a qualitative response for an observation is referred to as classifying that observation, since it involves assigning the observation to a category, or class. However, classification methods often first predict the probability of each category of the qualitative variable and then use those probabilities as the basis for the classification. In this sense they also behave like regression models.
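To make the link between predicted probabilities and classes concrete, here is a minimal sketch in R. The values in eta and the 0.5 cutoff are illustrative assumptions and do not come from the Titanic data.

```r
# Illustrative linear-predictor values (assumed, not from the Titanic data)
eta <- c(-2, 0, 1.5)

# The logistic function maps any real number to a probability in (0, 1);
# plogis() is base R's implementation of 1 / (1 + exp(-x))
prob <- plogis(eta)

# Turn each probability into a class by thresholding at 0.5 (an assumed cutoff)
label <- ifelse(prob > 0.5, 1, 0)

print(data.frame(eta, prob, label))
```

The model itself only produces the probabilities; the cutoff is a separate choice made when classifying.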

In this blog I have used the Titanic dataset, which is available on Kaggle (https://www.kaggle.com/c/titanic/data). The dataset includes 11 predictors and a response variable.

Performing Logistic Regression in R

# getting working directory
getwd()

# setting working directory
# (use forward slashes: a single backslash starts an escape sequence in an R string)
setwd("D:/GitHub/Titanic/Data")

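# reading data
# (since R 4.0, read.csv() no longer converts strings to factors by default,
#  so we ask for it explicitly; the is.factor() checks below rely on this)
titanic <- read.csv("train.csv", stringsAsFactors = TRUE)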

# viewing data
View(titanic)

# structure of the data: shows whether each variable is categorical, numeric, etc.
str(titanic)

# keeping only the columns that can plausibly contribute to the response, selected by position:
# Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
titanic <- subset(titanic, select = c(2,3,5,6,7,8,10,12))

# checking whether the categorical variables are stored as factors
is.factor(titanic$Sex)

is.factor(titanic$Embarked)
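If either check returns FALSE, the column can be converted explicitly. The sketch below uses a small made-up data frame (toy) rather than the real data, purely to illustrate the conversion.

```r
# Toy data frame standing in for the Titanic columns (values are made up)
toy <- data.frame(Sex = c("male", "female", "female"),
                  Embarked = c("S", "C", "Q"),
                  stringsAsFactors = FALSE)

# Character columns are not factors
print(is.factor(toy$Sex))

# Convert the categorical columns to factors so they are treated as categories
toy$Sex <- factor(toy$Sex)
toy$Embarked <- factor(toy$Embarked)

str(toy)
```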

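# missing value treatment using the k-nearest-neighbours (kNN) approach
# (DMwR has been archived on CRAN; its successor package DMwR2 also provides knnImputation())
library(DMwR)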

?knnImputation
titanic <- knnImputation(titanic)
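It is worth counting the missing values before and after imputation to confirm that the step did what was expected. The sketch below shows the check on a made-up data frame (toy) using base R only; it does not perform the kNN imputation itself.

```r
# Toy data frame with some missing values (values are made up)
toy <- data.frame(Age  = c(22, NA, 35, NA, 54),
                  Fare = c(7.25, 71.28, 8.05, 53.10, NA))

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Running colSums(is.na(titanic)) after knnImputation() should report zero for every column that was stored as NA; note that empty strings in a text column are not counted as NA by this check.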

# splitting data into train and test
dim(titanic)

train <- titanic[1:800,]
test <- titanic[801:891,]
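Taking the first 800 rows as the training set implicitly assumes the rows are in no meaningful order. A random split is a common alternative; the sketch below shows one way to do it on a stand-in data frame (toy) with the same number of rows. The seed value is arbitrary and the object names are illustrative.

```r
# Stand-in data frame with the same number of rows as the Titanic training file
toy <- data.frame(id = 1:891)

set.seed(42)  # arbitrary seed, used only to make the split reproducible

# Draw 800 row indices at random, without replacement
train_idx <- sample(nrow(toy), size = 800)

train_toy <- toy[train_idx, , drop = FALSE]
test_toy  <- toy[-train_idx, , drop = FALSE]

print(dim(train_toy))
print(dim(test_toy))
```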

# fitting the logistic regression when considering all the predictors
basemodel <- glm(Survived~., family = binomial(link = "logit"), data = train)

# summary of the fitted model
summary(basemodel)

 

Interpretation:

The summary shows that the predictors Parch, Fare and Embarked are not statistically significant, since their p-values are greater than 0.05. We therefore drop these predictors and fit a simpler model.
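The p-values can also be pulled out of the summary programmatically instead of being read off the printed table. The sketch below uses simulated data (sim, with one informative predictor x1 and one pure-noise predictor), so the numbers are not those of the Titanic model.

```r
# Simulated data: y depends on x1 but not on noise (all values are made up)
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1))

fit <- glm(y ~ x1 + noise, family = binomial(link = "logit"), data = sim)

# coef(summary(fit)) is a matrix; the last column holds the Wald-test p-values
p_values <- coef(summary(fit))[, "Pr(>|z|)"]
print(round(p_values, 4))

# Terms whose p-value exceeds the 0.05 threshold used above
print(names(p_values)[p_values > 0.05])
```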

#analysis of variance table of the fitted model
anova(basemodel, test = "Chisq")

 

# fitting of logistic regression when considering only the statistically significant predictors
model <- glm(Survived~.-Parch-Fare-Embarked, family = binomial(link = "logit"),data = train)

# summary of the fitted model
summary(model)

# analysis of variance table of the fitted model
anova(model, test = "Chisq")
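When two models are nested, as the reduced and full models are here, anova() can also compare them directly with a likelihood-ratio test, which checks whether the dropped predictors improved the fit significantly. The sketch below illustrates the idea on simulated data; the data and the names full and reduced are assumptions made for the example.

```r
# Simulated data: y depends on x1 but not on noise (all values are made up)
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1))

full    <- glm(y ~ x1 + noise, family = binomial(link = "logit"), data = sim)
reduced <- glm(y ~ x1,         family = binomial(link = "logit"), data = sim)

# Likelihood-ratio (deviance) test: does adding 'noise' improve the fit?
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# AIC gives a second view: lower values indicate a better trade-off
print(AIC(reduced, full))

# Exponentiated coefficients are odds ratios
print(exp(coef(reduced)))
```

A large p-value in the second row of the table suggests the extra predictors can be dropped without a significant loss of fit.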

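# prediction of the response on the basis of the fitted model
# (stored as pred_prob so that the predict() function is not masked)
pred_prob <- predict(model, newdata = test, type = "response")

# checking the accuracy
# (caret is loaded here, but the manual calculation below uses only base R)
library(caret)
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
error <- mean(pred_class != test$Survived)
print(paste('Accuracy', 1 - error))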

As we can see, the accuracy of this model on the 91 held-out test observations is 82.41 percent, which is reasonably good.
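Accuracy alone does not show which kind of mistakes the model makes, so a confusion matrix is a useful companion. The sketch below builds one with base R's table() on made-up labels and probabilities; the vectors actual and prob are illustrative and are not the Titanic predictions.

```r
# Made-up true labels and predicted probabilities (illustrative only)
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3)

# Classify with the same 0.5 cutoff used above
predicted <- ifelse(prob > 0.5, 1, 0)

# Rows are predicted classes, columns are actual classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

The off-diagonal cells separate false positives from false negatives, which a single accuracy figure hides.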

GitHub Repository for Sample Data and Code Link
