**Hypothesis Testing**

The primary objective of any statistical analysis is to gather information about some characteristics of a population. Usually only a part of the population (i.e. a sample) can be accessed, so one has to infer the characteristics of the population from the knowledge obtained from the sample. This process of deducing characteristics of a population from a known sample is called statistical inference.

**Testing of hypothesis** is one of the principal areas of statistical inference where the concern is to examine the validity of some statement about unknown parameters or about the form of a distribution in the light of a sample.

Following are the common terms used in testing of hypothesis:

**Statistical hypothesis:** It is an assertion about certain characteristics of a population, to be verified on the basis of a sample. For example, "the probability of getting more than 2 heads in 5 tosses of a coin is 0.7" is a hypothesis. "The mean of a Poisson random variable X is 4" is another example of a hypothesis.

**Test of a hypothesis:** It is a rule which helps in deciding whether to accept or reject the hypothesis under consideration based on the available sample values. For example, to test whether a coin is fair using 10 tosses of the coin, a test may be: reject the hypothesis whenever more than 8 heads or fewer than 3 heads are obtained. This condition is the rule which helps in deciding, based on a sample, whether or not to accept the hypothesis that the coin is fair. If 6 heads are obtained in 10 tosses of the coin, then based on this data the hypothesis is accepted.

**Null hypothesis & Alternative hypothesis:** The hypothesis set up for testing based on a sample is called the Null hypothesis, and any hypothesis that contradicts the null hypothesis is called an Alternative hypothesis. In the example given above, the null hypothesis is that the coin is fair whereas the alternative hypothesis is that the coin is biased.

Note that rejection of the null hypothesis based on sample observations does not necessarily imply that the alternative hypothesis is true.

**Simple & Composite hypothesis:** If a hypothesis specifies the population completely then it is called a simple hypothesis; if it fails to do so, it is a composite hypothesis. For example, in the coin toss example the coin is fair if the probability (p) of obtaining a head in a single toss of the coin is 0.5. The hypothesis p=0.5 is then a simple hypothesis, whereas p≠0.5 is a composite hypothesis, since under p≠0.5 the value of p could be 0.4 or 0.3 or 0.7 etc.

**Sample Space:** The collection of all possible samples is called the sample space. For example, the sample space for the number of heads obtained in 3 tosses of a coin is {0, 1, 2, 3}.

**Critical Region:** The sample space (**W**) can be divided into subspaces. Suppose **w** is a subspace such that if the observed sample falls in **w** the null hypothesis is rejected. Then **w** is called the critical region. For example, in 10 tosses of the coin the rule was to reject the null hypothesis that the coin is fair if more than 8 or fewer than 3 heads were obtained. If X denotes the number of heads in 10 tosses of the coin, then the critical region is X>8 or X<3. The region **W-w** is called the **region of acceptance**. Hence, in the above example the acceptance region is 3≤X≤8.

**Test Statistic:** It is a function of the sample values which is used for formulating a test. The critical region is usually described in terms of the test statistic. In the example stated above, if X (the total number of heads in 10 tosses) is used to decide whether the null hypothesis is to be rejected or accepted, then X is the test statistic. Some common test statistics are the t-statistic, the F-statistic etc.

Note that every hypothesis test is associated with two types of error. They are:

**Type I error:** The error committed in rejecting the null hypothesis when it is actually true is called the type I error. The upper bound on the probability of a type I error is called the **level of significance** of the test.

**Type II error:** The error committed in accepting the null hypothesis when it is actually false is called the type II error. One minus the probability of a type II error, i.e. the probability of rejecting the null hypothesis when it is false, is called the **power of the test.**
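As an illustration of these two error probabilities, the following minimal sketch computes them for the coin rule described earlier (reject fairness when more than 8 or fewer than 3 heads appear in 10 tosses). The helper name `reject_prob` and the alternative value p = 0.7 are assumptions made only for this example.

```r
# Size and power of the coin-toss test: reject "coin is fair" when X > 8 or X < 3,
# where X = number of heads in n = 10 tosses.
n <- 10

# Probability of landing in the critical region for a given head probability p:
# P(X <= 2) + P(X >= 9)
reject_prob <- function(p) {
  pbinom(2, size = n, prob = p) + pbinom(8, size = n, prob = p, lower.tail = FALSE)
}

alpha <- reject_prob(0.5)  # probability of a type I error (the coin really is fair)
power <- reject_prob(0.7)  # power against the hypothetical alternative p = 0.7
beta  <- 1 - power         # probability of a type II error at p = 0.7

print(c(alpha = alpha, power = power, beta = beta))
```

For a fair coin this rule rejects with probability 67/1024, roughly 0.065, so the rule would not qualify as a test at the 5% level.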

A statistical test might be **one-sided** or **two-sided** depending on the alternative hypothesis. For example, if the alternative hypothesis is p>p_{1}, then the test will be one-sided, and if it is p≠p_{1}, then the test will be two-sided.
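The practical difference between the two kinds of test can be sketched with a standard normal test statistic. The observed value z = 1.8 below is hypothetical and chosen only to show that the same statistic can lead to different decisions.

```r
alpha <- 0.05

# Critical values of the standard normal distribution
crit_two_sided <- qnorm(1 - alpha / 2)  # reject when |z| exceeds this (about 1.96)
crit_one_sided <- qnorm(1 - alpha)      # reject when z exceeds this (about 1.645)

# p-values for a hypothetical observed statistic
z <- 1.8
p_two_sided <- 2 * pnorm(-abs(z))            # alternative of the form "not equal"
p_one_sided <- pnorm(z, lower.tail = FALSE)  # alternative of the form "greater than"

print(c(two_sided = p_two_sided, one_sided = p_one_sided))
```

With this z, the one-sided p-value falls below 0.05 while the two-sided p-value does not, which is why the form of the alternative hypothesis has to be fixed before looking at the data.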

**General procedure for testing a hypothesis:**

- The first step is to state both the null and the alternative hypothesis based on the problem at hand. The nature of the alternative hypothesis helps in deciding whether to use a one-sided or a two-sided test.
- Secondly, consider all the statistical assumptions made about the sample, if any. For example, independence of the sample observations is often assumed.
- Then state a suitable test statistic and its sampling distribution under the null hypothesis. In large-sample tests the test statistic usually follows a standard normal distribution.
- Next, choose the level of significance (α). Usually α is taken to be 1% or 5%.
- Then compute the critical region of the test at the chosen level of significance. The probability of the critical region under the null hypothesis must be equal to α (for discrete test statistics, at most α).
- Compute the value of the test statistic based on the sample.
- Based on the observed value of the test statistic, decide whether to reject or accept the null hypothesis.
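The steps above can be sketched for a two-sided test of a mean with known standard deviation. The sample `x`, the hypothesized mean `mu0` and the standard deviation `sigma` are all hypothetical values used only to make the sketch runnable.

```r
# Hypothetical sample and hypothesized values, chosen only for illustration
x     <- c(5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3)
mu0   <- 5     # H0: population mean = 5; H1: population mean != 5 (two-sided)
sigma <- 0.3   # population standard deviation, assumed known
alpha <- 0.05  # level of significance

# Test statistic; it follows N(0, 1) under H0
z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))

# Critical region: |z| greater than the critical value
critical <- qnorm(1 - alpha / 2)
reject   <- abs(z) > critical

cat("z =", round(z, 3), " critical value =", round(critical, 3),
    " reject H0:", reject, "\n")
```

Here the observed statistic lies inside the acceptance region, so the null hypothesis is not rejected for this hypothetical sample.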

**Alternatively**, the commonly used approach to test hypothesis is

- Compute the value of test statistic based on sample.
- Compute the p-value of the test. It is the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one that was observed.
- Reject the null hypothesis if the p-value of the test is less than α; otherwise accept the null hypothesis.
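The p-value approach can be sketched on the coin example, where 6 heads were observed in 10 tosses. This uses R's exact binomial test, `binom.test`, rather than the fixed rejection rule stated earlier, so it is an illustration of the approach and not a reproduction of that rule.

```r
# Exact binomial test of H0: p = 0.5 with 6 heads observed in 10 tosses
result <- binom.test(x = 6, n = 10, p = 0.5, alternative = "two.sided")

alpha    <- 0.05
p_value  <- result$p.value
decision <- if (p_value < alpha) "reject H0" else "do not reject H0"

cat("p-value =", round(p_value, 4), "->", decision, "\n")
```

The p-value is far above 0.05, which agrees with the earlier statement that 6 heads in 10 tosses gives no reason to reject fairness.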

## Some of the commonly used test statistics are described below:

**Z-test:** It is a statistical test used to test whether the means of two populations are different or not when their variances are known and the sample size is large. This is the two-sample case. The Z-test can also be used in the single-sample case, where it tests whether the mean of the population from which the sample is drawn is equal to a specified value or not.

**Example:** Suppose the marks of a class test in a particular subject are available for 10 students, and the average mark in that subject for the entire class is known to be 82 with known population variance 16. This is a case of the one-sample Z-test. The null hypothesis is that the mean mark of the population from which the 10 students are drawn is equal to 82. The level of significance is set at 5%.

The test statistic for the Z-test is $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$, where $\bar{X}$ is the sample mean, $\mu$ is the population mean, $\sigma/\sqrt{n}$ is the standard deviation of the sample mean and $n$ is the number of observations.

R code for the test is –

    # creating the data set of marks of 10 students
    test = c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)

    # created a function called z to calculate the test statistic of the z test
    z = function(test, mu, var) {
      z_stat = (mean(test) - mu) / (sqrt(var / length(test)))
      return(z_stat)
    }

    # obtaining the observed value of the test statistic by calling the function
    z(test, 82, 16)
    # [1] -8.53815   (observed value of the test statistic)

Now, the critical value for a two-sided test at the 5% level of significance is 1.96. Since the absolute value of the observed Z statistic is greater than the critical value, the null hypothesis is rejected at the 5% level of significance, and it can be concluded that the mean of the population from which the sample is drawn is not equal to 82.

Suppose the marks of a class test in a particular subject are available for 10 students from each of two sections in a class. The hypothesis of interest is whether the average marks of the two sections are equal or not, when the variances of the two sections are known to be 9 & 14. As before, the level of significance is set at 5%. Here the test statistic is of the form $Z = (\bar{X}_1 - \bar{X}_2)/\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$, where the subscripts 1 & 2 denote the two different samples. R code for the test is –

    # creating the data sets of marks of 10 students for sections 1 & 2
    test1 = c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)
    test2 = c(64, 52, 68, 75, 82, 66, 95, 83, 67, 76)

    # created a function called z to calculate the test statistic of the z test
    z = function(test1, test2, var1, var2) {
      z_stat = (mean(test1) - mean(test2)) / sqrt(var1 / length(test1) + var2 / length(test2))
      return(z_stat)
    }

    # obtaining the observed value of the test statistic by calling the function
    z(test1, test2, 9, 14)
    # [1] -1.055009   (observed value of the test statistic)

Now, the critical value for a two-sided test at the 5% level of significance is 1.96. Since the absolute value of the observed Z statistic is less than the critical value, the null hypothesis is not rejected at the 5% level of significance, i.e. the data give no evidence that the average marks of the two sections differ.

**T-test**: The one-sample t-test is used as a location test, i.e. to test whether the mean of a population is equal to a specified value when the variance of the population is not known. Two-sample t-tests are performed to test whether the means of two populations are equal or not. The variances of the two populations are usually assumed to be equal, in which case a pooled estimate of the variance is used. If the variances are not equal, Welch's approximation is used instead (this is the default in R's t.test). A two-sample test is unpaired if the samples are independent and paired if they are not. t-tests are also used to test whether the slope of a regression line differs significantly from 0 or not.

Example: Consider again the data from the one-sample Z-test example, now with the variance treated as unknown. R code for the test is –

    # performs a one-sample t-test (two-sided alternative hypothesis)
    # Note: if the argument alternative = "greater" is used, then the alternative
    # hypothesis would be that the true mean is greater than 82
    t.test(test, mu = 82)

            One Sample t-test

    data:  test
    t = -2.2255, df = 9, p-value = 0.05309
    alternative hypothesis: true mean is not equal to 82
    95 percent confidence interval:
     60.22187 82.17813
    sample estimates:
    mean of x 
         71.2 

Since the p-value is slightly greater than 0.05, the null hypothesis is not rejected at the 5% level, i.e. when the variance is unknown the data do not provide sufficient evidence that the population mean differs from 82.
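To make the connection with the test statistic explicit, the following sketch recomputes the one-sample t statistic and its two-sided p-value by hand and compares them with `t.test`. The variable names `t_stat`, `p_value` and `builtin` are chosen here for illustration.

```r
test <- c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)  # marks of the 10 students
mu0  <- 82                                         # hypothesized mean
n    <- length(test)

# t statistic: same form as the Z statistic, with the sample sd replacing sigma
t_stat <- (mean(test) - mu0) / (sd(test) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

builtin <- t.test(test, mu = mu0)
print(c(manual_t = t_stat, builtin_t = unname(builtin$statistic)))
print(c(manual_p = p_value, builtin_p = builtin$p.value))
```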

Now consider the data from the two-sample Z-test example, with the variances treated as unknown. R code for the test is –

    # performs a two-sample t-test (two-sided alternative hypothesis), unpaired
    t.test(test1, test2)

            Welch Two Sample t-test

    data:  test1 and test2
    t = -0.25921, df = 17.049, p-value = 0.7986
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -14.62042  11.42042
    sample estimates:
    mean of x mean of y 
         71.2      72.8 

As in the one-sample test, a one-sided test can be performed by using the alternative argument.

Since the p-value is greater than 0.05, the null hypothesis is not rejected, i.e. there is no significant difference between the means of the two sections at the 5% level of significance.
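By default `t.test` does not assume equal variances and applies Welch's approximation, which is why the output is titled "Welch Two Sample t-test". A minimal sketch of the equal-variance (pooled) version, for comparison, passes `var.equal = TRUE`:

```r
test1 <- c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)  # section 1
test2 <- c(64, 52, 68, 75, 82, 66, 95, 83, 67, 76)  # section 2

welch  <- t.test(test1, test2)                    # default: var.equal = FALSE
pooled <- t.test(test1, test2, var.equal = TRUE)  # pooled estimate of the variance

# Compare the statistic, degrees of freedom and p-value of the two versions
print(rbind(
  welch  = c(t = unname(welch$statistic),  df = unname(welch$parameter),  p = welch$p.value),
  pooled = c(t = unname(pooled$statistic), df = unname(pooled$parameter), p = pooled$p.value)
))
```

Because both samples have the same size, the two statistics coincide here and only the degrees of freedom (and hence the p-value) differ slightly.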

Consider the two-sample example again, but instead of marks in the same subject for two different sections, let the data be marks in two different subjects for the same individuals. The observations are then paired. The hypothesis of interest is whether the mean marks of the two subjects are equal or not. The level of significance is set at 5%. R code for the test is –

    # performs a two-sample t-test (two-sided alternative hypothesis), paired
    t.test(test1, test2, paired = TRUE)

            Paired t-test

    data:  test1 and test2
    t = -0.21183, df = 9, p-value = 0.837
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -18.68623  15.48623
    sample estimates:
    mean of the differences 
                       -1.6 

As in the one-sample test, a one-sided test can be performed by using the alternative argument.

Since the p-value is greater than 0.05, the null hypothesis is not rejected, i.e. the paired t-test shows no significant difference between the mean marks of the two subjects at the 5% level of significance.
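A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below checks this on the same data; the variable name `differences` is introduced here for illustration.

```r
test1 <- c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)  # first subject
test2 <- c(64, 52, 68, 75, 82, 66, 95, 83, 67, 76)  # second subject, same students

differences <- test1 - test2

paired     <- t.test(test1, test2, paired = TRUE)
one_sample <- t.test(differences, mu = 0)

print(c(paired_t = unname(paired$statistic), one_sample_t = unname(one_sample$statistic)))
print(c(paired_p = paired$p.value, one_sample_p = one_sample$p.value))
```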

Consider the **mtcars** dataset. Suppose a regression model of **mpg on cyl** is fitted. A t-test can then be used to test whether the coefficient of cyl (i.e. the slope of the model) is significantly different from 0. As before, the level of significance is set at 5%. R code for the test is –

    model = lm(mtcars$mpg ~ mtcars$cyl, data = mtcars)  # regression model is fit
    # gives the value of the fitted coefficient, its standard error,
    # the t-test statistic and the p-value
    summary(model)$coefficients

                Estimate Std. Error   t value     Pr(>|t|)
    (Intercept) 37.88458  2.0738436 18.267808 8.369155e-18
    mtcars$cyl  -2.87579  0.3224089 -8.919699 6.112687e-10

Since the p-value for the mtcars$cyl coefficient is less than 0.05, the null hypothesis is rejected and it can be concluded that the coefficient of the cyl variable is significantly different from 0. This test is typically useful in variable selection, i.e. in deciding which variables to keep in a model because they significantly explain part of the response variable.
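The t value reported for a coefficient is simply the estimate divided by its standard error, referred to a t distribution with the residual degrees of freedom. The sketch below verifies this; it writes the formula as `mpg ~ cyl`, so the coefficient row is labelled `cyl` rather than `mtcars$cyl`.

```r
model <- lm(mpg ~ cyl, data = mtcars)
coefs <- summary(model)$coefficients

# t statistic for the slope: estimate divided by its standard error
t_manual <- coefs["cyl", "Estimate"] / coefs["cyl", "Std. Error"]

# Two-sided p-value using the residual degrees of freedom of the model
p_manual <- 2 * pt(-abs(t_manual), df = df.residual(model))

print(c(t_manual = t_manual, t_reported = coefs["cyl", "t value"]))
print(c(p_manual = p_manual, p_reported = coefs["cyl", "Pr(>|t|)"]))
```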

**F-test:** It is used to test whether the variances of two populations are equal or not. It is also used to test whether the means of several populations are equal or not when they have the same standard deviation; this approach is used in analysis of variance (ANOVA). Further, the F-test is used to test whether a regression model fits the data well, and to test whether a simpler regression model fits the data as well as a larger one when the models are nested within each other.

**Example:** Suppose there are two samples of 100 observations from two different normal populations.

To test whether they have equal variances or not, F test is used. Level of significance for the test is set at 5%.

R code for the test is –

    # note: no random seed is set, so the numbers below will differ from run to run
    x = rnorm(100, 50, 4)  # random sample of 100 observations from a normal distribution with mean 50 and sd 4
    y = rnorm(100, 20, 6)  # random sample of 100 observations from a normal distribution with mean 20 and sd 6

    # performs the F test with a two-sided alternative hypothesis; the null hypothesis
    # is that the ratio of the variances equals 1, i.e. the variances are equal
    var.test(x, y)

            F test to compare two variances

    data:  x and y
    F = 0.40278, num df = 99, denom df = 99, p-value = 9.058e-06
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.2710058 0.5986221
    sample estimates:
    ratio of variances 
             0.4027779 

Since the p-value is less than 0.05, the null hypothesis is rejected based on the sample observations, i.e. the variances of the two populations are not equal. The conclusion is correct, since sample x was drawn from a population with variance 16 whereas y was drawn from a population with variance 36, so the variances of the two populations are indeed not equal. If x had been drawn from a population with variance 35.4025 (i.e. sd = 5.95), the null hypothesis would not have been rejected (p-value = 0.3211).
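The F statistic in this test is the ratio of the two sample variances. The following sketch recomputes it and the two-sided p-value by hand; the seed value is an arbitrary assumption added for reproducibility, so the numbers will not match the output shown above.

```r
set.seed(1)  # arbitrary seed, an assumption made here for reproducibility
x <- rnorm(100, mean = 50, sd = 4)
y <- rnorm(100, mean = 20, sd = 6)

# F statistic: ratio of the sample variances
f_stat <- var(x) / var(y)
df1 <- length(x) - 1
df2 <- length(y) - 1

# Two-sided p-value: twice the smaller tail probability of the F distribution
lower_tail <- pf(f_stat, df1, df2)
p_value <- 2 * min(lower_tail, 1 - lower_tail)

builtin <- var.test(x, y)
print(c(manual_F = f_stat, builtin_F = unname(builtin$statistic)))
print(c(manual_p = p_value, builtin_p = builtin$p.value))
```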

Let us explore another example of the F-test. Suppose there are three types of fertilizer (A, B, C) that can be used for growing crops, and the farmer wants to use the fertilizer which gives the best crop yield. This is a case of ANOVA. Here, the null hypothesis of interest is that the average yields for the three fertilizers are equal. If the hypothesis is rejected, a multiple comparison method can then be used to detect which fertilizer gives the best yield. The level of significance for the F-test is set at 5%.

The yields (in kg) for the three fertilizers are:

- Fertilizer A: 5.58 4.49 4.62 3.65 2.1 4.2 3.16 4.13 4.78
- Fertilizer B: 6.14 8.19 4.31 5.64 4.9 6.27 5.36 8.94 6.72
- Fertilizer C: 6.62 7.42 6.01 6.54 7.42 5.17 6.2 5.2 5.3

R code for the test is –

    # created data for the test
    yield = c(5.58, 4.49, 4.62, 3.65, 2.1, 4.2, 3.16, 4.13, 4.78,
              6.14, 8.19, 4.31, 5.64, 4.9, 6.27, 5.36, 8.94, 6.72,
              6.62, 7.42, 6.01, 6.54, 7.42, 5.17, 6.2, 5.2, 5.3)
    fertilizer = c(rep("A", 9), rep("B", 9), rep("C", 9))
    sample = data.frame(yield, fertilizer)

    fit = aov(yield ~ fertilizer, data = sample)
    summary(fit)  # summary gives the anova table for the model

                Df Sum Sq Mean Sq F value   Pr(>F)    
    fertilizer   2  28.08  14.042   10.43 0.000551 ***
    Residuals   24  32.32   1.347                     
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table it is observed that the p-value of the F-test is less than 0.05, so it can be concluded that the means of the three groups, i.e. the average yields for the three fertilizers, are not all equal. Hence, the null hypothesis is rejected at the 5% level based on the given data.
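The multiple comparison step mentioned above can be sketched with Tukey's honest significant difference procedure, available in base R as `TukeyHSD`. This is one possible choice of multiple comparison method, not necessarily the one the analysis would use in practice.

```r
yield <- c(5.58, 4.49, 4.62, 3.65, 2.1, 4.2, 3.16, 4.13, 4.78,
           6.14, 8.19, 4.31, 5.64, 4.9, 6.27, 5.36, 8.94, 6.72,
           6.62, 7.42, 6.01, 6.54, 7.42, 5.17, 6.2, 5.2, 5.3)
# TukeyHSD needs the grouping variable to be a factor
fertilizer <- factor(rep(c("A", "B", "C"), each = 9))

fit <- aov(yield ~ fertilizer)

# Pairwise differences in mean yield with simultaneous 95% confidence intervals
tukey <- TukeyHSD(fit, conf.level = 0.95)
print(tukey)
```

Each row of the result compares one pair of fertilizers; pairs whose adjusted p-value is below 0.05 differ significantly in mean yield.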

Let us explore another scenario. Suppose there are two nested models. To decide which of the two is better, an F-test is performed using the ANOVA table. The null hypothesis of interest is that the additional term in the larger model has a zero coefficient, i.e. that the two models fit the data equally well. The level of significance for the test is set at 5%.

Let us use the **mtcars** dataset of R for this purpose, which contains data on several variables such as **mpg** (miles per gallon), **cyl** (number of cylinders), **disp**, **wt**, **hp** (horsepower) etc. R code for the test is –

    # created two models using the mtcars dataset where model1 is nested within model2
    model1 = lm(mtcars$mpg ~ mtcars$cyl + mtcars$disp, data = mtcars)
    model2 = lm(mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp, data = mtcars)

    anova(model1, model2)  # performs the F-test

    Analysis of Variance Table

    Model 1: mtcars$mpg ~ mtcars$cyl + mtcars$disp
    Model 2: mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     29 270.74                           
    2     28 261.37  1    9.3709 1.0039  0.325

Since the p-value in the ANOVA table is greater than 0.05, the null hypothesis is not rejected. Hence, the two models perform similarly: adding hp does not significantly improve the fit. In this case the choice of model can be made on business requirements, and the simpler model is usually preferred.
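The F statistic in this table compares the drop in the residual sum of squares with the residual variance of the larger model. The sketch below recomputes it by hand (with the formulas written as `mpg ~ cyl + disp`, which fits the same models).

```r
model1 <- lm(mpg ~ cyl + disp, data = mtcars)       # smaller model
model2 <- lm(mpg ~ cyl + disp + hp, data = mtcars)  # larger model

rss1 <- sum(resid(model1)^2)
rss2 <- sum(resid(model2)^2)
df_extra <- df.residual(model1) - df.residual(model2)  # number of added terms

# F = (reduction in RSS per added term) / (residual mean square of the larger model)
f_stat  <- ((rss1 - rss2) / df_extra) / (rss2 / df.residual(model2))
p_value <- pf(f_stat, df_extra, df.residual(model2), lower.tail = FALSE)

tab <- anova(model1, model2)
print(c(manual_F = f_stat, table_F = tab$F[2]))
print(c(manual_p = p_value, table_p = tab[["Pr(>F)"]][2]))
```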

**Chi-square test:** It is a statistical test used to test goodness of fit, homogeneity and independence. The goodness of fit test helps to decide whether an observed frequency distribution is close to a theoretical distribution or not. The chi-square homogeneity test is applied to a single categorical variable from two or more populations to determine whether the frequency counts are distributed identically across the populations or not. The chi-square test of independence is applied to observations on two categorical variables from a single population to detect whether the variables are independent of each other or not. A contingency table is used for this purpose.

**Example:** Let us explore the **survey** data in the MASS package in R, which contains information on some individuals such as their height, pulse, smoking habits, exercising habits etc. Smoking habits have been classified as **never, occasional, regular, heavy** & exercising habits have been classified as **none, some, frequent.** Here, the null hypothesis of interest is that smoking habits are **independent** of exercising habits. The level of significance is set at 5%. R code for the test is –

    library(MASS)  # need to attach the library to access the survey dataset
    table(survey$Smoke, survey$Exer)  # this creates the contingency table

            Freq None Some
      Heavy    7    1    3
      Never   87   18   84
      Occas   12    3    4
      Regul    9    1    7

    # this performs the test
    # often the contingency table is saved in a variable and that is passed to chisq.test
    chisq.test(table(survey$Smoke, survey$Exer))

            Pearson's Chi-squared test

    data:  table(survey$Smoke, survey$Exer)
    X-squared = 5.4885, df = 6, p-value = 0.4828

    Warning message:
    In chisq.test(table(survey$Smoke, survey$Exer)) :
      Chi-squared approximation may be incorrect

Since the p-value is greater than 0.05, the null hypothesis is not rejected based on the sample, i.e. the data are consistent with smoking habits being independent of exercising habits. Note that the warning message is due to small cell values in the contingency table. To address it, some columns or rows may be merged before applying the chi-squared approximation.
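The source of the warning can be inspected through the expected cell counts, and one alternative to merging categories is a Monte Carlo p-value via the `simulate.p.value` argument of `chisq.test`. The sketch below shows both; the seed and the number of replicates `B` are arbitrary assumptions.

```r
library(MASS)  # provides the survey dataset

tab <- table(survey$Smoke, survey$Exer)

# Expected counts under independence; small values trigger the warning
expected <- suppressWarnings(chisq.test(tab))$expected
print(round(expected, 2))
cat("cells with expected count below 5:", sum(expected < 5), "\n")

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(123)  # arbitrary seed, an assumption made here for reproducibility
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
print(sim$p.value)
```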

Let us now explore another example. Here data on **television program viewing habits** have been collected **across children**. The children (boys and girls) are asked which program they like most among **Game of Thrones, Friends or Lone Ranger**. The hypothesis of interest is whether the viewing preferences of boys and girls are similar or not, i.e. here the **test of homogeneity** across the two populations is applicable. The level of significance for the test is set at 5%.

R code for the test is –

    # the data has been created as a matrix
    tv_choice = matrix(c(61, 50, 39, 56, 40, 54), nrow = 2, ncol = 3, byrow = TRUE,
                       dimnames = list(c("Boys", "Girls"),
                                       c("Game of Thrones", "Friends", "Lone Ranger")))
    tv_choice

          Game of Thrones Friends Lone Ranger
    Boys               61      50          39
    Girls              56      40          54

    chisq.test(tv_choice)  # performing the test

            Pearson's Chi-squared test

    data:  tv_choice
    X-squared = 3.7441, df = 2, p-value = 0.1538

Since the p-value is greater than 0.05, the null hypothesis is not rejected, i.e. according to the observed data there is no significant difference between the viewing preferences of boys and girls.
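The chi-squared statistic compares each observed count with the count expected if both groups shared the same preference distribution. A minimal sketch of that calculation for the table above, with variable names chosen here for illustration:

```r
tv_choice <- matrix(c(61, 50, 39, 56, 40, 54), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Boys", "Girls"),
                                    c("Game of Thrones", "Friends", "Lone Ranger")))

# Expected counts under homogeneity: (row total * column total) / grand total
expected <- outer(rowSums(tv_choice), colSums(tv_choice)) / sum(tv_choice)

# Pearson chi-squared statistic and its degrees of freedom
chi_sq  <- sum((tv_choice - expected)^2 / expected)
df      <- (nrow(tv_choice) - 1) * (ncol(tv_choice) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(expected, 1))
cat("X-squared =", round(chi_sq, 4), " df =", df, " p-value =", round(p_value, 4), "\n")
```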

Let us explore a different example. Suppose an observed frequency distribution and the corresponding expected frequencies are available. To test how well the expected frequencies fit the data, one uses the chi-square goodness of fit test. Here the null hypothesis is that the fit is good. The level of significance is set at 5%.

R code for the test is –

    # creating the vectors of observed frequencies and expected frequencies
    observed = c(45, 52, 17, 26, 49, 53)
    expected = c(39, 47, 22, 30, 55, 49)

    # performing the test
    # the argument p takes probabilities, which should sum to 1; rescale.p = TRUE
    # rescales the expected frequencies so that they do
    chisq.test(x = observed, p = expected, rescale.p = TRUE)

            Chi-squared test for given probabilities

    data:  observed
    X-squared = 4.1058, df = 5, p-value = 0.5343

Since the p-value is greater than 0.05, the null hypothesis is not rejected based on the observed data, i.e. the fit is good.
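The goodness of fit statistic can also be computed directly from its definition, the sum of (observed - expected)^2 / expected over the categories. The sketch below does this and compares the result with `chisq.test`; the direct formula applies here because the observed and expected totals coincide.

```r
observed <- c(45, 52, 17, 26, 49, 53)
expected <- c(39, 47, 22, 30, 55, 49)

# The direct formula needs expected counts on the same total as the observed counts
stopifnot(sum(observed) == sum(expected))

chi_sq  <- sum((observed - expected)^2 / expected)
df      <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

builtin <- chisq.test(x = observed, p = expected, rescale.p = TRUE)
print(c(manual = chi_sq, builtin = unname(builtin$statistic)))
print(c(manual_p = p_value, builtin_p = builtin$p.value))
```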