Logistic Regression in R

Why Logistic Regression?

The linear regression model assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative.

For example, eye colour is qualitative, taking on values such as blue, brown, or green. Qualitative variables are often referred to as categorical.

Here we study approaches for predicting a qualitative response, a process known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, the methods used for classification often first predict the probability of each category of a qualitative variable as the basis for making the classification; in this sense they also behave like regression models.
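At the heart of logistic regression is the logistic (sigmoid) function, which squashes any real-valued linear combination of predictors into the interval (0, 1) so the output can be read as a probability. A minimal sketch (the function name `sigmoid` is just an illustrative choice):

```r
# Logistic (sigmoid) function: maps any real number to (0, 1),
# which is why logistic regression outputs can be read as probabilities.
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(0)    # 0.5: exactly at the decision boundary
sigmoid(4)    # ~0.982: strongly in favour of class 1
sigmoid(-4)   # ~0.018: strongly in favour of class 0
```

In a fitted model, `x` is the linear predictor (the intercept plus the weighted predictors), and a probability above a chosen threshold (commonly 0.5) is classified as class 1.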

In this blog I have used the Titanic dataset, which is available on Kaggle (https://www.kaggle.com/c/titanic/data). The dataset includes 11 predictors and a response variable.

Performing Logistic Regression in R

# getting working directory
getwd()

# setting working directory (replace with the folder containing train.csv)
# setwd("path/to/data")

# reading data
titanic <- read.csv("train.csv")

# viewing the first few rows of the data
head(titanic)

# structure of the data: whether the variables are categorical, numeric, etc.
str(titanic)

# keeping only the variables likely to contribute to the response:
# Survived, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked
titanic <- subset(titanic, select = c(2, 3, 5, 6, 7, 8, 10, 12))

# checking whether the variables are categorical or not
sapply(titanic, class)


# missing value treatment using the kNN approach
# (knnImputation is from the DMwR package)
library(DMwR)
titanic <- knnImputation(titanic)

# splitting data into train and test sets

train <- titanic[1:800,]
test <- titanic[801:891,]
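The split above simply takes the first 800 rows for training, which assumes the rows are already in random order. If they are not, a random split avoids accidental ordering bias; a sketch (the seed value 123 is an arbitrary choice for reproducibility):

```r
# Alternative: a random 800/91 split instead of taking the first 800 rows,
# in case the rows are not already in random order.
set.seed(123)                                    # for reproducibility
idx   <- sample(seq_len(nrow(titanic)), size = 800)
train <- titanic[idx, ]                          # 800 randomly chosen rows
test  <- titanic[-idx, ]                         # the remaining 91 rows
```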

# fitting the logistic regression when considering all the predictors
basemodel <- glm(Survived~., family = binomial(link = "logit"), data = train)

# summary of the fitted model
summary(basemodel)



Clearly, predictors like Parch, Fare and Embarked are statistically insignificant, since their p-values are greater than 0.05. Dropping these predictors from the model gives a better result.

#analysis of variance table of the fitted model
anova(basemodel, test = "Chisq")


# fitting of logistic regression when considering only the statistically significant predictors
model <- glm(Survived ~ . - Parch - Fare - Embarked, family = binomial(link = "logit"), data = train)

# summary of the fitted model
summary(model)

# analysis of variance table of the fitted model
anova(model, test = "Chisq")
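The coefficients reported by `summary(model)` are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios, a common way to interpret a fitted logistic regression (the interval here is a Wald interval via `confint.default`, one of several reasonable choices):

```r
# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
# An odds ratio below 1 means the predictor decreases the odds of survival.
exp(coef(model))

# Wald 95% confidence intervals for the odds ratios
exp(confint.default(model))
```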

# prediction of the response on the basis of fitted model
predict <- predict(model,newdata = test,type = "response")

# converting predicted probabilities to class labels using a 0.5 threshold
predict <- ifelse(predict > 0.5, 1, 0)

# misclassification error on the test set (accuracy = 1 - error)
error <- mean(predict != test$Survived)
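Overall accuracy hides *which* errors the model makes. A confusion matrix, built here with base R's `table()` from the `predict` and `test` objects above, separates false positives from false negatives:

```r
# Confusion matrix: rows are predicted classes, columns are actual classes.
# (Assumes `predict` holds the 0/1 predictions and `test` the held-out data.)
table(Predicted = predict, Actual = test$Survived)

# accuracy recomputed from the error rate above
accuracy <- 1 - error
print(paste("Accuracy:", round(accuracy * 100, 2), "percent"))
```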

As we can see, the accuracy for this model is 82.41 percent, which seems good.

GitHub Repository for Sample Data and Code Link
