Why Logistic Regression?
The linear regression model assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative.
For example, eye colour is qualitative, taking on values such as blue, brown or green. Qualitative variables are often referred to as categorical.
Here we study approaches for predicting qualitative responses, a process known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, the methods used for classification often first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression models.
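To make the "probability first, class second" idea concrete, here is a minimal sketch in R. It assumes the logistic (inverse logit) function as the link between a linear score and a probability, and a 0.5 cut-off for the class; the scores in eta are made-up numbers, not taken from any dataset.

```r
# Linear predictor values for five hypothetical observations (made-up numbers)
eta <- c(-3, -0.5, 0, 0.5, 3)

# The logistic (inverse logit) function maps any real number into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))
prob <- logistic(eta)

# Base R provides the same function as plogis()
same_as_plogis <- isTRUE(all.equal(prob, plogis(eta)))

# Turn each probability into a class using a 0.5 cut-off
pred_class <- ifelse(prob > 0.5, 1, 0)

print(data.frame(eta, prob = round(prob, 3), pred_class))
```

The cut-off of 0.5 is only a convention; a different threshold trades one kind of misclassification for the other.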
In this blog I have used the Titanic dataset, which is available on Kaggle (https://www.kaggle.com/c/titanic/data). The dataset includes 11 predictors and a response variable.
Performing Logistic Regression in R
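# getting the working directory
getwd()
# setting the working directory (forward slashes, because "\" starts an escape sequence in R strings)
setwd("D:/GitHub/Titanic/Data")
# reading the data; stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0)
titanic <- read.csv("train.csv", stringsAsFactors = TRUE)
# viewing the data
View(titanic)
# structure of the data: whether the variables are categorical, numeric, etc.
str(titanic)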
# keeping only the subset of variables that can contribute to the response variable
titanic <- subset(titanic, select = c(2, 3, 5, 6, 7, 8, 10, 12))
# checking whether the variable is categorical or not
is.factor(titanic$Sex)
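Selecting columns by position works, but it is easy to lose track of what each index means. In the standard Kaggle train.csv layout, positions 2, 3, 5, 6, 7, 8, 10 and 12 correspond to Survived, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked. A small sketch of the same idea using column names, on a made-up three-row data frame (toy, keep and toy_sub are hypothetical names):

```r
# A tiny made-up data frame that mimics a few Titanic columns
toy <- data.frame(
  PassengerId = 1:3,
  Survived = c(0, 1, 1),
  Pclass = c(3, 1, 3),
  Name = c("A", "B", "C"),
  Sex = c("male", "female", "female"),
  stringsAsFactors = FALSE
)

# Keep columns by name instead of by position
keep <- c("Survived", "Pclass", "Sex")
toy_sub <- toy[, keep]

# Convert a text column to a factor so that it is treated as categorical
toy_sub$Sex <- factor(toy_sub$Sex)

is.factor(toy_sub$Sex)
str(toy_sub)
```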
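# missing value treatment using the kNN approach
# (DMwR has been archived on CRAN; its successor DMwR2 also provides knnImputation())
library(DMwR)
# help page for the imputation function
?knnImputation
titanic <- knnImputation(titanic)
# splitting the data into train and test
dim(titanic)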
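Before imputing, it can help to see how many values are actually missing in each column. The sketch below uses base R only on a made-up data frame (the values are invented for illustration). It also counts empty strings separately, since those are not treated as NA.

```r
# Made-up data with two missing ages and one empty Embarked entry
toy <- data.frame(
  Age = c(22, NA, 26, 35, NA),
  Fare = c(7.25, 71.28, 7.93, 53.10, 8.05),
  Embarked = c("S", "C", "", "S", "Q"),
  stringsAsFactors = TRUE
)

# Number of NA values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Empty strings are not NA, so they are counted separately
empty_counts <- sapply(toy, function(col) sum(as.character(col) == "", na.rm = TRUE))
print(empty_counts)
```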
train <- titanic[1:800, ]
test <- titanic[801:891, ]
# fitting the logistic regression model with all the predictors
basemodel <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)
# summary of the fitted model
summary(basemodel)
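The split above takes the first 800 rows for training and the remaining 91 for testing. If the row order carried any pattern, a random split would be a common alternative. Here is a minimal sketch on a made-up data frame of the same size (toy, train_idx, train_toy and test_toy are hypothetical names, and the seed is arbitrary):

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Made-up data with the same number of rows as the Titanic training file
n <- 891
toy <- data.frame(id = seq_len(n), y = rbinom(n, 1, 0.4))

# Draw 800 row indices at random for training; the rest form the test set
train_idx <- sample(seq_len(n), size = 800)
train_toy <- toy[train_idx, ]
test_toy <- toy[-train_idx, ]

dim(train_toy)
dim(test_toy)
```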
From the summary we can see that the predictors Parch, Fare and Embarked are not statistically significant, since their p-values are greater than 0.05. So we drop these predictors and refit the model.
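Rather than reading the p-values off the printed summary, they can also be pulled out of the coefficient table. The sketch below does this on simulated data, not on the Titanic data: x1 is built to influence the response and x2 is pure noise, and all names (sim, fit, p_values) are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data: y depends on x1 only; x2 is unrelated noise
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1))
sim <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = sim)

# The coefficient table is a matrix; the p-values sit in the "Pr(>|z|)" column
coefs <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|z|)"]
print(round(p_values, 4))

# Terms whose p-value exceeds the 0.05 level
names(p_values)[p_values > 0.05]
```

The 0.05 level is the conventional choice used in this post; it is a rule of thumb rather than a guarantee that a predictor is useless.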
# analysis of variance table of the fitted model
anova(basemodel, test = "Chisq")
# fitting the logistic regression model with only the statistically significant predictors
model <- glm(Survived ~ . - Parch - Fare - Embarked, family = binomial(link = "logit"), data = train)
# summary of the fitted model
summary(model)
# analysis of variance table of the fitted model
anova(model, test = "Chisq")
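The anova() calls above give a sequential table for a single model. A related check is to compare a reduced model directly against a fuller one with a likelihood ratio test. This is sketched here on simulated data rather than on the Titanic models; full, reduced and the other names are hypothetical.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Simulated data: y depends on x1 only; x2 is unrelated noise
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 1.2 * x1))
sim <- data.frame(y, x1, x2)

full <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = sim)
reduced <- glm(y ~ x1, family = binomial(link = "logit"), data = sim)

# Likelihood ratio (chi-squared) test between the two nested models
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# The same test by hand: the drop in deviance is compared with a
# chi-squared distribution with 1 degree of freedom (one extra coefficient)
dev_diff <- deviance(reduced) - deviance(full)
p_manual <- pchisq(dev_diff, df = 1, lower.tail = FALSE)
```

A large p-value in this comparison suggests the extra predictor adds little, which is the same reasoning used when dropping terms above.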
# predicting the response on the test set from the fitted model
predicted <- predict(model, newdata = test, type = "response")
# checking the accuracy
library(caret)
# classifying as survived (1) when the predicted probability exceeds 0.5
predicted <- ifelse(predicted > 0.5, 1, 0)
error <- mean(predicted != test$Survived)
print(paste('Accuracy', 1 - error))
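Accuracy is a single number, so it can hide which kind of mistake the model makes. A confusion matrix built with base R's table() shows the breakdown. The sketch below uses made-up labels and probabilities, not the Titanic predictions.

```r
# Made-up true labels and predicted probabilities for ten observations
actual <- c(0, 0, 1, 1, 0, 1, 0, 1, 1, 0)
prob <- c(0.1, 0.6, 0.8, 0.4, 0.2, 0.9, 0.3, 0.7, 0.55, 0.45)

# Same 0.5 cut-off as in the post
predicted <- ifelse(prob > 0.5, 1, 0)

# Rows are predicted classes, columns are actual classes
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(paste('Accuracy', accuracy))
```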
As we can see, the accuracy of this model is 82.41 percent, which seems to be good.
GitHub Repository for Sample Data and Code Link