Dispersion
Dispersion means the variability, or spread, in the data. An average gives a single representative value for the data, but the average is more reliable when the dispersion is small. Consider the following example. Suppose there are three screw-manufacturing machines, each of which is supposed to produce screws of length 3 cm. A sample of size 5 is drawn from the screws produced by each of the 3 machines, and the length of each screw is measured. The results are tabulated below.
Machine no. | Length of screws in each sample (cm) | Sample average (cm) |
1 | 1.5, 2.5, 3, 3.5, 4.5 | 3 |
2 | 1.95, 2.95, 3, 3.05, 4.05 | 3 |
3 | 1.99, 2.99, 3, 3.01, 4.01 | 3 |
Each machine produced screws with an average length of 3 cm, but the lengths of the screws produced by machine 1 are widely dispersed (from 1.5 cm to 4.5 cm). The variation in the length of screws produced by machine 2 is also significant. This example shows why it is important to measure dispersion along with averages.
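As a rough check on the table above, the spread of each sample can be put into numbers. The following R sketch uses only the sample values from the table; the names `samples`, `avg` and `spread` are illustrative and not part of the original example.

```r
# Lengths (cm) of the five screws sampled from each machine
samples <- list(
  machine1 = c(1.5, 2.5, 3, 3.5, 4.5),
  machine2 = c(1.95, 2.95, 3, 3.05, 4.05),
  machine3 = c(1.99, 2.99, 3, 3.01, 4.01)
)

# Sample average for each machine
avg <- sapply(samples, mean)

# Largest minus smallest observation, a simple measure of spread
spread <- sapply(samples, function(x) max(x) - min(x))

print(avg)
print(spread)
```

All three averages come out as 3 cm, while the spread differs from machine to machine and is largest for machine 1.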
To compare the variability of two groups measured in different units, we need unitless measures. Measures of dispersion are not unitless, so for every measure of dispersion there is a corresponding unitless measure used for comparison. These are known as coefficients of dispersion.
Measures of dispersion
Range
The simplest measure of dispersion is range. Suppose L is the largest observation and S is the smallest observation in the data then range is defined as
Range = L-S
Though the range is easy to calculate, it depends on only two observations in the data, so it is not a reliable measure of dispersion.
Suppose the marks of students of two divisions in the subject statistics are as follows:
Division | Marks |
A | 10, 25, 30, 30, 36, 30, 98 |
B | 10, 45, 56, 75, 85, 90, 98 |
The range of marks of students in division A is 98-10 = 88 and that of students in division B is also 98-10 = 88. Both divisions have the same range, but it cannot be said that they show the same pattern of variation. The main drawback of the range is that it does not depend on all observations.
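A short R sketch can make this drawback concrete. It uses the marks from the table above; dropping the smallest and largest mark of each division is only an illustrative device chosen here, not a measure defined in this article.

```r
A <- c(10, 25, 30, 30, 36, 30, 98)
B <- c(10, 45, 56, 75, 85, 90, 98)

# Range = largest observation - smallest observation
range_A <- max(A) - min(A)
range_B <- max(B) - min(B)

# Spread of the five middle marks, after dropping the two extremes
inner_A <- sort(A)[2:6]
inner_B <- sort(B)[2:6]
inner_range_A <- max(inner_A) - min(inner_A)
inner_range_B <- max(inner_B) - min(inner_B)

print(c(range_A = range_A, range_B = range_B))
print(c(inner_range_A = inner_range_A, inner_range_B = inner_range_B))
```

Both ranges are 88, yet the middle marks of division A lie within 11 marks of each other while those of division B span 45 marks. The range cannot see this difference.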
a) For ungrouped data,
Range = L-S
Coefficient of range = (L-S)/(L+S)
For example,
The heights (in cm) of 10 students in the 10th class are
5.35, 6.01, 4.59, 4.98, 4.10, 5.02, 6.08, 5.69, 4.84, 5.31
Here, range = 6.08-4.10 = 1.98
Coefficient of range = (6.08-4.10)/(6.08+4.10) = 0.1945
a <- c(5.35, 6.01, 4.59, 4.98, 4.10, 5.02, 6.08, 5.69, 4.84, 5.31)
a
# [1] 5.35 6.01 4.59 4.98 4.10 5.02 6.08 5.69 4.84 5.31
range <- (max(a)-min(a))   # range = L - S
range
# [1] 1.98
coefr <- (max(a)-min(a))/(max(a)+min(a))   # coefficient of range
coefr
# [1] 0.194499
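The reason for using a coefficient can be seen by changing the unit of measurement. The sketch below is illustrative only: the helper names `range_of` and `coef_range` and the conversion factor 10 are assumptions introduced here, not part of the original example.

```r
a <- c(5.35, 6.01, 4.59, 4.98, 4.10, 5.02, 6.08, 5.69, 4.84, 5.31)

# Hypothetical helpers implementing Range = L - S and (L - S)/(L + S)
range_of   <- function(x) max(x) - min(x)
coef_range <- function(x) (max(x) - min(x)) / (max(x) + min(x))

# The same data expressed in a unit ten times smaller (assumed factor)
b <- a * 10

print(c(range_of(a), range_of(b)))       # the range changes with the unit
print(c(coef_range(a), coef_range(b)))   # the coefficient does not
```

The range is multiplied by the conversion factor, while the coefficient of range stays the same, which is what makes it usable for comparing groups measured in different units.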
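b) For a discrete frequency distribution, subtract the smallest value of X from the largest value of X. The frequencies do not enter the calculation.
X | Frequency |
10 | 2 |
12 | 6 |
22 | 8 |
26 | 5 |
Range = 26-10 = 16
Coefficient of range = (26-10)/(26+10) = 0.4444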
X <- c(10,12,22,26)
range <- max(X)-min(X)
range
# [1] 16
coefr <- (max(X)-min(X))/(max(X)+min(X))
coefr
# [1] 0.4444444
c) For a continuous frequency distribution, calculate the class marks (mid values) and subtract the smallest class mark from the largest.
Class interval | Class mark | Frequency |
10-20 | 15 | 5 |
20-30 | 25 | 8 |
30-40 | 35 | 4 |
Range = 35-15 = 20
Coefficient of range = (35-15)/(35+15) = 0.4
lb <- seq(10,30,10)   # lower class limits
lb
# [1] 10 20 30
ub <- seq(20,40,10)   # upper class limits
ub
# [1] 20 30 40
X <- (lb+ub)/2   # class marks
X
# [1] 15 25 35
range <- max(X)-min(X)
range
# [1] 20
coefr <- (max(X)-min(X))/(max(X)+min(X))
coefr
# [1] 0.4
Quartile deviation or semi interquartile range
Quartile deviation is based on the middle 50% of the data, which lies between the first quartile (Q1) and the third quartile (Q3).
Quartile deviation is given by:
Quartile deviation = (Q3-Q1)/2
Coefficient of quartile deviation = (Q3-Q1)/(Q3+Q1)
The demerit of Q.D. is that it does not depend on all observations.
a) For ungrouped data,
Suppose the numbers of misprints on 11 randomly selected pages of a book are 2, 2, 4, 0, 8, 8, 6, 9, 2, 5, 2.
First arrange these numbers in ascending order: 0, 2, 2, 2, 2, 4, 5, 6, 8, 8, 9.
Now find the first and third quartiles (here n = 11).
Q1 = the value of the ((n+1)/4)th observation in the ordered arrangement = 3rd observation = 2
Q3 = the value of the (3(n+1)/4)th observation in the ordered arrangement = 9th observation = 8
Quartile deviation = (8-2)/2 = 3
Coefficient of quartile deviation = (8-2)/(8+2) = 0.6
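The R commands below use a different sample, the ten values 9, 10, 4, 2, 1, 3, 5, 7, 8, 6.
s <- c(9,10,4,2,1,3,5,7,8,6)
s
# [1]  9 10  4  2  1  3  5  7  8  6
QD <- (quantile(s,.75)-quantile(s,.25))/2   # quartile deviation
QD
#  75%
# 2.25
coefqd <- (quantile(s,.75)-quantile(s,.25))/(quantile(s,.75)+quantile(s,.25))
coefqd
#       75%
# 0.4090909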
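The hand calculation for the misprints data uses the ((n+1)/4)th observation, whereas `quantile()` in R uses a different positional rule by default (`type = 7`). The sketch below, offered as an illustration rather than as the author's method, reproduces the hand calculation with `type = 6`, which places the p-th quantile at position p(n+1), and shows that the default can give a different third quartile.

```r
x <- c(2, 2, 4, 0, 8, 8, 6, 9, 2, 5, 2)   # misprints on 11 pages
n <- length(x)
xs <- sort(x)

# Positional rule from the text; (n+1)/4 = 3 and 3(n+1)/4 = 9 are whole
# numbers here, so they can be used directly as indices
q1 <- xs[(n + 1) / 4]
q3 <- xs[3 * (n + 1) / 4]

# Same rule through quantile(): type = 6 uses position p * (n + 1)
q_type6 <- quantile(x, c(0.25, 0.75), type = 6, names = FALSE)

# Default rule (type = 7) for comparison
q_default <- quantile(x, c(0.25, 0.75), names = FALSE)

qd <- (q3 - q1) / 2
coefqd <- (q3 - q1) / (q3 + q1)

print(c(q1 = q1, q3 = q3, qd = qd, coefqd = coefqd))
print(q_type6)
print(q_default)
```

When results from R are compared with hand calculations, it is worth checking which quantile rule is in use, since the two need not agree for small samples.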
b) For a discrete frequency distribution, the quartiles can be found in the same way as for ungrouped data, by writing each observation as many times as its frequency.
X | Frequency |
1 | 8 |
2 | 3 |
8 | 1 |
18 | 3 |
12 | 8 |
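X <- c(1,2,8,18,12)   # values from the table above
f <- c(8,3,1,3,8)     # corresponding frequencies
X1 <- rep(X,f)        # repeat each value as many times as its frequency
X1
# [1]  1  1  1  1  1  1  1  1  2  2  2  8 18 18 18 12 12 12 12 12 12 12 12
QD <- (quantile(X1,.75)-quantile(X1,.25))/2
QD
# 75%
# 5.5
coefqd <- (quantile(X1,.75)-quantile(X1,.25))/(quantile(X1,.75)+quantile(X1,.25))
coefqd
#       75%
# 0.8461538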
c) For a continuous frequency distribution, calculate the first and third quartiles using the following procedure.
A) First quartile
1) Find N/4.
2) Find the cumulative frequencies (c.f.).
3) Find the lower quartile class. It is the class in which the (N/4)th observation falls. In other words, it is the class whose c.f. exceeds N/4 for the first time.
4) Apply the formula
Q1 = l+((N/4-c.f.)/f)*h
Where,
l = lower limit of the lower quartile class
N = sum of all frequencies
c.f. = c.f. of the class preceding the lower quartile class
f = frequency of the lower quartile class
h = width of the lower quartile class
B) Third quartile
1) Find 3N/4.
2) The c.f. values have already been calculated above, so there is no need to calculate them again.
3) Find the upper quartile class. It is the class in which the (3N/4)th observation falls. In other words, it is the class whose c.f. exceeds 3N/4 for the first time.
4) Apply the formula
Q3 = l+((3N/4-c.f.)/f)*h
Where,
l = lower limit of the upper quartile class
N = sum of all frequencies
c.f. = c.f. of the class preceding the upper quartile class
f = frequency of the upper quartile class
h = width of the upper quartile class
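Class interval | Frequency | c.f. |
10-20 | 65 | 65 |
20-30 | 56 | 121 (Q1 class) |
30-40 | 50 | 171 |
40-50 | 93 | 264 (Q3 class) |
50-60 | 21 | 285 |
Here N = 285, so N/4 = 71.25 and 3N/4 = 213.75. The class width is h = 10.
Q1 = 20+((71.25-65)/56)*10 = 21.11607
Q3 = 40+((213.75-171)/93)*10 = 44.59677
Q.D. = (44.59677-21.11607)/2 = 11.74035
Coefficient of Q.D. = (44.59677-21.11607)/(44.59677+21.11607) = 0.3573229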
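lb <- seq(10,50,10)   # lower class limits
lb
# [1] 10 20 30 40 50
ub <- seq(20,60,10)   # upper class limits
ub
# [1] 20 30 40 50 60
f <- c(65,56,50,93,21)
f
# [1] 65 56 50 93 21
X <- (lb+ub)/2   # class marks
X
# [1] 15 25 35 45 55
n <- sum(f)
n
# [1] 285
h <- ub[1]-lb[1]   # class width (all classes have the same width here)
lcf <- cumsum(f)   # cumulative frequencies
lcf
# [1]  65 121 171 264 285
mcq1 <- min(which(lcf>=(n/4)))   # index of the lower quartile class
q1 <- lb[mcq1]+((n/4)-lcf[mcq1-1])*(h/f[mcq1])
q1
# [1] 21.11607
mcq3 <- min(which(lcf>=(3*n/4)))   # index of the upper quartile class
q3 <- lb[mcq3]+((3*n/4)-lcf[mcq3-1])*(h/f[mcq3])
q3
# [1] 44.59677
QD <- (q3-q1)/2
QD
# [1] 11.74035
coefqd <- (q3-q1)/(q3+q1)
coefqd
# [1] 0.3573229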
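The quartile procedure for a continuous frequency distribution can also be wrapped in one small function. The sketch below is a possible arrangement, not the author's code: the name `grouped_quantile` and its arguments are hypothetical, and it assumes the classes are given in increasing order. Unlike indexing with `lcf[k-1]` directly, it takes the preceding c.f. as 0 when the quartile falls in the first class.

```r
# Hypothetical helper: p-th quantile of a continuous frequency distribution
# lb, ub: lower and upper class limits; f: class frequencies; p: e.g. 0.25
grouped_quantile <- function(lb, ub, f, p) {
  N <- sum(f)                     # total frequency
  cf <- cumsum(f)                 # cumulative frequencies
  target <- p * N                 # N/4 for Q1, 3N/4 for Q3
  k <- min(which(cf >= target))   # first class whose c.f. reaches the target
  cf_prev <- if (k == 1) 0 else cf[k - 1]   # c.f. of the preceding class
  h <- ub[k] - lb[k]              # width of the quantile class
  lb[k] + ((target - cf_prev) / f[k]) * h
}

lb <- seq(10, 50, 10)
ub <- seq(20, 60, 10)
f  <- c(65, 56, 50, 93, 21)

q1 <- grouped_quantile(lb, ub, f, 0.25)
q3 <- grouped_quantile(lb, ub, f, 0.75)
qd <- (q3 - q1) / 2
coefqd <- (q3 - q1) / (q3 + q1)

print(c(q1 = q1, q3 = q3, qd = qd, coefqd = coefqd))
```

Calling the function with `p = 0.5` gives the median of the same distribution by the same interpolation rule.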
Mean deviation
Mean deviation about an average is the arithmetic mean of the absolute deviations from that average. It depends on all observations. However, because it discards the signs of the deviations by taking absolute values, it is not well suited to further mathematical treatment.
a) For ungrouped data
Mean deviation about mean
The arithmetic mean of the absolute deviations from the arithmetic mean is called the mean deviation about the arithmetic mean.
Suppose Xi, i=1,2,...,n are n observations.
Step 1. Calculate the A.M.
Step 2. Find |X-mean| for each observation.
Step 3. Find Σ|X-mean|.
Step 4. Obtain the M.D. about mean using the formula
M.D. about mean = Σ|X-mean|/n
Example,
Suppose the Observations are 6,10,29,20,25,20,14,17,26,21
Mean=ΣX/n=18.8
X | X-mean | |X-mean| |
6 | -12.8 | 12.8 |
10 | -8.8 | 8.8 |
29 | 10.2 | 10.2 |
20 | 1.2 | 1.2 |
25 | 6.2 | 6.2 |
20 | 1.2 | 1.2 |
14 | -4.8 | 4.8 |
17 | -1.8 | 1.8 |
26 | 7.2 | 7.2 |
21 | 2.2 | 2.2 |
Total | | 56.4 |
Thus mean deviation about mean = 56.4/10 = 5.64
Coefficient of M.D. about mean = (M.D. about mean)/mean = 5.64/18.8 = 0.3
Similarly, the mean deviation about the median or the mode, and the corresponding coefficient, can be obtained by replacing the mean with that average.
M.D. about mean using R
A <- c(6,10,29,20,25,20,14,17,26,21)
A
# [1]  6 10 29 20 25 20 14 17 26 21
MD1 <- (sum(abs(A-mean(A))))/length(A)
MD1
# [1] 5.64
coefmd1 <- MD1/mean(A)
coefmd1
# [1] 0.3
M.D. about median using R
MD2 <- (sum(abs(A-median(A))))/length(A)
MD2
# [1] 5.4
coefmd2 <- MD2/median(A)
coefmd2
# [1] 0.27
M.D. about mode using R
Unlike the mean and the median, there is no command in R that gives the mode directly. It can be obtained as below.
t <- table(A)   # frequency of each distinct value
t
# A
#  6 10 14 17 20 21 25 26 29
#  1  1  1  1  2  1  1  1  1
m <- which(t==max(t))   # position of the most frequent value
m
# 20
#  5
so <- sort(unique(A))
so
# [1]  6 10 14 17 20 21 25 26 29
mode <- so[m]
mode
# [1] 20
MD3 <- (sum(abs(A-mode))/length(A))
MD3
# [1] 5.4
coefmd3 <- MD3/mode
coefmd3
# [1] 0.27
b) For discrete frequency distribution
Suppose Xi, i=1,2,...,n are the n distinct values and fi, i=1,2,...,n are their corresponding frequencies, with N = Σf the total number of observations.
Step 1. Calculate the A.M. (mean = ΣfX/N).
Step 2. Find f|X-mean| for each value.
Step 3. Find Σf|X-mean|.
Step 4. Obtain the M.D. about mean using the formula
M.D. about mean = Σf|X-mean|/N
Alternatively, the discrete frequency distribution can be converted into ungrouped data by repeating each observation as many times as its frequency, as shown below.
X | Frequency |
1 | 2 |
2 | 3 |
8 | 9 |
18 | 5 |
12 | 4 |
X <- c(1,2,8,18,12)
X
# [1]  1  2  8 18 12
f <- c(2,3,9,5,4)
f
# [1] 2 3 9 5 4
X1 <- rep(X,f)   # repeat each value as many times as its frequency
X1
# [1]  1  1  2  2  2  8  8  8  8  8  8  8  8  8 18 18 18 18 18 12 12 12 12
# Mean deviation about mean and its coefficient
MD1 <- (sum(abs(X1-mean(X1))))/length(X1)
MD1
# [1] 4.582231
c1 <- MD1/mean(X1)
c1
# [1] 0.4834464
# Mean deviation about median and its coefficient
MD2 <- (sum(abs(X1-median(X1))))/length(X1)
MD2
# [1] 4.26087
c2 <- MD2/median(X1)
c2
# [1] 0.5326087
# Mean deviation about mode and its coefficient
t <- table(X1)
t
# X1
#  1  2  8 12 18
#  2  3  9  4  5
m <- which(t==max(t))
m
# 8
# 3
st <- sort(unique(X1))
mode <- st[m]
mode
# [1] 8
MD3 <- (sum(abs(X1-mode))/length(X1))
MD3
# [1] 4.26087
c3 <- MD3/mode
c3
# [1] 0.5326087
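The formula M.D. about mean = Σf|X-mean|/N can also be applied directly to the values and frequencies, without expanding the data with `rep()`. The sketch below checks that both routes agree for the table above; the names `wmean`, `MD_weighted` and `MD_expanded` are illustrative.

```r
X <- c(1, 2, 8, 18, 12)   # distinct values
f <- c(2, 3, 9, 5, 4)     # corresponding frequencies
N <- sum(f)               # total number of observations

# Route 1: frequency-weighted formula
wmean <- sum(f * X) / N
MD_weighted <- sum(f * abs(X - wmean)) / N

# Route 2: expand to ungrouped data and use the ungrouped formula
X1 <- rep(X, f)
MD_expanded <- sum(abs(X1 - mean(X1))) / length(X1)

print(c(MD_weighted = MD_weighted, MD_expanded = MD_expanded))
```

The weighted route avoids building a long vector, which can matter when the frequencies are large.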
c) For continuous frequency distribution
Suppose Xi, i=1,2,...,n are the class marks and fi, i=1,2,...,n are the corresponding frequencies, with N = Σf.
Step 1. Calculate the A.M. (mean = ΣfX/N).
Step 2. Find f|X-mean| for each class.
Step 3. Find Σf|X-mean|.
Step 4. Obtain the M.D. about mean using the formula
M.D. about mean = Σf|X-mean|/N
The data are tabulated below.
Class interval | Class mark X = (lower limit + upper limit)/2 | Frequency (f) | X*f | f|X-mean| |
10-20 | 15 | 65 | 975 | 1183.68421 |
20-30 | 25 | 56 | 1400 | 459.78947 |
30-40 | 35 | 50 | 1750 | 89.47368 |
40-50 | 45 | 93 | 4185 | 1096.42105 |
50-60 | 55 | 21 | 1155 | 457.57895 |
Total | | N=285 | 9465 | 3286.947 |
Mean = ΣXf/N = 9465/285 = 33.21053
M.D. about mean = Σf|X-mean|/N = 3286.947/285 = 11.53315
Coefficient of M.D. about mean = (M.D. about mean)/mean = 11.53315/33.21053 = 0.3472739
lb <- seq(10,50,10)
lb
# [1] 10 20 30 40 50
ub <- seq(20,60,10)
ub
# [1] 20 30 40 50 60
f <- c(65,56,50,93,21)
f
# [1] 65 56 50 93 21
X <- (lb+ub)/2   # class marks
X
# [1] 15 25 35 45 55
n <- sum(f)
n
# [1] 285
Mean deviation about mean
mean <- sum(X*f)/n
mean
# [1] 33.21053
MD <- (sum(f*abs(X-mean)))/n
MD
# [1] 11.53315
c1 <- MD/mean
c1
# [1] 0.3472739
Mean deviation about median
lcf <- cumsum(f)   # cumulative frequencies
lcf
# [1]  65 121 171 264 285
mc <- min(which(lcf>=(n/2)))   # index of the median class
mc
# [1] 3
h <- ub[mc]-lb[mc]   # width of the median class
median <- lb[mc]+((n/2)-lcf[mc-1])*(h/f[mc])
median
# [1] 34.3
MD <- (sum(f*abs(X-median)))/n
MD
# [1] 11.36877
c2 <- MD/median
c2
# [1] 0.3314511
Mean deviation about mode
mo <- which(f==max(f))   # index of the modal class
mo
# [1] 4
h <- ub[mo]-lb[mo]   # width of the modal class
mode <- lb[mo]+((f[mo]-f[mo-1])/(2*f[mo]-f[mo-1]-f[mo+1]))*h
mode
# [1] 43.73913
MD <- (sum(f*abs(X-mode)))/n
MD
# [1] 13.01098
c3 <- MD/mode
c3
# [1] 0.2974678
Variance (σ²) and coefficient of variation (C.V.)
Variance
Variance is the arithmetic mean of the squares of the deviations taken from the arithmetic mean. Though variance is harder to understand and calculate than the other measures, it satisfies almost all requisites of an ideal measure of dispersion. Amongst all measures of dispersion, variance is the least affected by sampling fluctuations. Standard deviation (sd) is the positive square root of variance.
Coefficient of variation (C.V.)
To compare the variability of two different data sets, we cannot use the measures of dispersion themselves, because they carry the same units as the quantity being measured, whereas comparison needs a unitless measure. In this case one can use the coefficients of the measures of dispersion. One more coefficient of dispersion, based on the standard deviation and the mean, is the coefficient of variation.
C.V. = (standard deviation)/|mean| *100
When comparing the variability of two data sets, the one with the smaller C.V. is said to be more consistent.
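As an illustration of such a comparison, the sketch below computes the C.V. for the marks of divisions A and B from the range example. The helper name `cv` is hypothetical, and the standard deviation inside it divides by n, following the variance formula used in this article.

```r
# Hypothetical helper: coefficient of variation in percent
cv <- function(x) {
  m <- mean(x)
  s <- sqrt(sum((x - m)^2) / length(x))   # standard deviation with divisor n
  s / abs(m) * 100
}

A <- c(10, 25, 30, 30, 36, 30, 98)   # marks, division A
B <- c(10, 45, 56, 75, 85, 90, 98)   # marks, division B

cv_A <- cv(A)
cv_B <- cv(B)

print(c(cv_A = cv_A, cv_B = cv_B))
```

Division B has the smaller C.V. (about 44% against about 70% for division A), so by this criterion its marks are the more consistent of the two, even though both divisions have the same range.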
a) For ungrouped data
Suppose Xi ,i=1,2,..,n are n observations.
Then,
variance=(Σ(x-mean)^2)/n
Example,
Suppose the Observations are 6,10,29,20,25,20,14,17,26,21
Mean=ΣX/n=18.8
X | (x-mean)^2 |
6 | 163.84 |
10 | 77.44 |
29 | 104.04 |
20 | 1.44 |
25 | 38.44 |
20 | 1.44 |
14 | 23.04 |
17 | 3.24 |
26 | 51.84 |
21 | 4.84 |
Total | 469.6 |
Variance = 469.6/10=46.96
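A <- c(6,10,29,20,25,20,14,17,26,21)
A
# [1]  6 10 29 20 25 20 14 17 26 21
n <- length(A)   # number of observations
var <- ((n-1)/n)*var(A)   # rescale var(), which divides by n-1, so that the divisor is n
var
# [1] 46.96
sd <- sqrt(var)
sd
# [1] 6.852737
cv <- (sd/abs(mean(A)))*100
cv
# [1] 36.4507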
Note that the “var” command in R uses the formula
variance = (Σ(x-mean)^2)/(n-1)
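The difference between the two divisors can be seen on the same ten observations. The sketch below is illustrative; `pop_var` and `samp_var` are names chosen here for the variance with divisor n and the value returned by `var()`.

```r
A <- c(6, 10, 29, 20, 25, 20, 14, 17, 26, 21)
n <- length(A)

# Variance as defined in this article: divisor n
pop_var <- sum((A - mean(A))^2) / n

# Variance as computed by R's var(): divisor n - 1
samp_var <- var(A)

print(c(pop_var = pop_var, samp_var = samp_var))

# The two are linked by the factor (n - 1)/n
print(isTRUE(all.equal(pop_var, samp_var * (n - 1) / n)))
```

The value from `var()` is always the larger of the two, and the gap shrinks as n grows.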
b) For discrete frequency distribution
Suppose Xi, i=1,2,...,n are the n distinct values and fi, i=1,2,...,n are their corresponding frequencies, with N = Σf.
variance = (Σf(x-mean)^2)/N
X | Frequency (f) | X*f | f(x-mean)^2 |
1 | 2 | 2 | 143.76 |
2 | 3 | 6 | 167.77 |
8 | 9 | 72 | 19.67 |
18 | 5 | 90 | 363.10 |
12 | 4 | 48 | 25.44 |
Total | N=23 | 218 | 719.74 |
Mean= 218/23=9.4783
Variance= 719.74/23 =31.29304
Sd = √variance = 5.594019
C.V. = sd*100/|mean| = 59.0192
x <- c(1,2,8,18,12)
x
# [1]  1  2  8 18 12
f <- c(2,3,9,5,4)
f
# [1] 2 3 9 5 4
N <- sum(f)   # total frequency
N
# [1] 23
mean <- sum(x*f)/N
mean
# [1] 9.478261
var <- sum(f*(x-mean)^2)/N
var
# [1] 31.29301
sd <- sqrt(var)
sd
# [1] 5.594015
CV <- (sd*100)/abs(mean)
CV
# [1] 59.01943
c) For continuous frequency distribution
For the given class intervals, find the corresponding class marks. The remaining procedure is the same as for a discrete frequency distribution, with the class marks taken as the X observations.
variance=(Σf(x-mean)^2)/N
Class interval | Class mark X = (lower limit + upper limit)/2 | Frequency (f) | X*f | f(x-mean)^2 |
10-20 | 15 | 65 | 975 | 21555.5212 |
20-30 | 25 | 56 | 1400 | 3775.1170 |
30-40 | 35 | 50 | 1750 | 160.1101 |
40-50 | 45 | 93 | 4185 | 12926.2191 |
50-60 | 55 | 21 | 1155 | 9970.4011 |
Total | | N=285 | 9465 | 48387.37 |
Mean = (Σx*f)/N = 33.21053
Variance = 48387.37/285 =169.7802
Sd = √variance = 13.02997
lb <- seq(10,50,10)
lb
# [1] 10 20 30 40 50
ub <- seq(20,60,10)
ub
# [1] 20 30 40 50 60
f <- c(65,56,50,93,21)
f
# [1] 65 56 50 93 21
X <- (lb+ub)/2   # class marks
X
# [1] 15 25 35 45 55
n <- sum(f)
n
# [1] 285
mean <- sum(X*f)/n
mean
# [1] 33.21053
var <- (sum(f*(X-mean)^2))/n
var
# [1] 169.7802
sd <- sqrt(var)
sd
# [1] 13.02997
cv <- (sd/abs(mean))*100
cv
# [1] 39.23447
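A hand calculation of the grouped variance can be cross-checked with the algebraically equivalent form variance = ΣfX²/N - mean². This identity is not used elsewhere in this article and is added here only as a check; the names `v_def` and `v_short` are illustrative.

```r
lb <- seq(10, 50, 10)
ub <- seq(20, 60, 10)
f  <- c(65, 56, 50, 93, 21)
X  <- (lb + ub) / 2      # class marks
N  <- sum(f)

m <- sum(f * X) / N      # mean of the grouped data

# Definition: mean of squared deviations from the mean
v_def <- sum(f * (X - m)^2) / N

# Equivalent form: mean of squares minus square of the mean
v_short <- sum(f * X^2) / N - m^2

print(c(v_def = v_def, v_short = v_short))
```

Both forms give the same variance for this table; the second avoids computing each deviation separately, which is convenient when working by hand.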