When the input data is unlabeled and we have to find hidden patterns or clusters in the data set unsupervised learning comes in the picture. In clustering what we do is given a data set we look for similarities among the data points and make clusters accordingly. The clusters are formed so that data points in the cluster are similar to each other and differ significantly from points in another cluster.

Suppose Meera is a data science student her teacher asks her to make groups of students. Meera can make groups in various ways like a) she can simply divide the students according to gender. b) students can be grouped according to the grades secured by them in last year. Or c) according to hobbies like dancing, singing or anything else. This is how Meera used clustering. However Meera decides to apply k means clustering for making groups. Let’s see how Meera does it. Suppose she has a data of marks of students in the first exam.

It’s as follows:

35,40,72,65,82,73,90,46,87,92,45,76,56,58,78. She decides to make 3 groups. Step 1: Since she has decided to make 3 groups she randomly picks 3 centroids from the data as C1 =40, C2=56, C3=87. Step 2: She calculates distance of each data point from the three centroids and assigns the data point to that cluster for which the distance of the data point from the centroid corresponding to that cluster is minimum. Following the same thing for each data point she gets the following three clusters Cluster 1: 35, 40, 46, 45 Cluster 2: 56, 65, 58 Cluster 3: 72, 82, 73, 90, 87, 92, 76, 78 Step 3: Calculating averages of each cluster she gets the new centroids to be as follows C1=41.5, C2=59.67, C3= 81.75 For new centroids she repeats step 2 and obtains new clusters as below Cluster 1: 35, 40, 46, 45 Cluster 2: 56, 65, 58 Cluster 3: 72, 82, 73, 90, 87, 92, 76, 78 The clusters are same as obtained in step 2 so the procedure stops here. This is how Meera found 3 groups using k means clustering. Let’s have a look at the algorithm of k means clustering Step 1: choose k (i.e. the number of clusters in which you want to divide your data).

Sometimes deciding value of k cannot be done by just looking at it at that time we go for elbow method. It’s discussed after the algorithm. Step 2: Calculate distance of each data point from the centroids. Assign the data point to that cluster for which distance between centroid and data point is minimum. Step 3: Once all data points are assigned calculate average of each cluster take them as new centroids then go to step 2. Continue this loop until clusters stop changing. The problem of deciding value of k can be solved using elbow method. In this method we calculate sum of squares due to error [SSE i.e. the sum of squares of distances (here we consider Euclidean distance) between the data points and the centroids of their corresponding clusters] for a range of k values then plot number of clusters against SSE. The graph obtained looks like an arm. The elbow on arm is the best value of k. Though SSE decreases as number of clusters increases we cannot use those values to be best value of k since for large values of k the data points are assigned to their own clusters.

Let’s see how to do k means clustering using R. Iris data is somewhat easy to understand so I’m going to use it here. Let’s store the data set in variable I

I <- iris

To have a look at some rows in the data we can use the command head()

head(I)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa

Since we apply k means algorithm to find natural groupings in the data we assume that we don’t know how the data is classified and now we are finding it so let us remove the last column

IRIS <- I[,-5]

Now let us see how the new data looks

head(IRIS)

## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 5.1 3.5 1.4 0.2 ## 2 4.9 3.0 1.4 0.2 ## 3 4.7 3.2 1.3 0.2 ## 4 4.6 3.1 1.5 0.2 ## 5 5.0 3.6 1.4 0.2 ## 6 5.4 3.9 1.7 0.4

Let us apply k means clustering to this data. Applying k means clustering using R is somewhat easy you just have to use the kmeans() commands.

km <- kmeans(IRIS,3)

km ## K-means clustering with 3 clusters of sizes 62, 38, 50 ## ## Cluster means: ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 5.901613 2.748387 4.393548 1.433871 ## 2 6.850000 3.073684 5.742105 2.071053 ## 3 5.006000 3.428000 1.462000 0.246000 ## ## Clustering vector: ## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 ## [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 ## [141] 2 2 1 2 2 2 1 2 2 1 ## ## Within cluster sum of squares by cluster: ## [1] 39.82097 23.87947 15.15100 ## (between_SS / total_SS = 88.4 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" ## [5] "tot.withinss" "betweenss" "size" "iter" ## [9] "ifault"

Here we already knew that there are 3 clusters in the data. What if we don’t know in how many clusters we should divide the data? as discussed above in such situations we use elbow mothod.For applying it we find SSE and then plot k against SSE and take elbow on the arm as k

sse <- sapply(1:10,function(k){kmeans(IRIS,k, nstart= 20, iter.max = 15)$tot.withinss}) sse ## [1] 681.37060 152.34795 78.85144 57.22847 46.44618 39.03999 34.61250 ## [8] 29.98894 27.86026 26.27656 plot(1:10,sse,type="b",main="Elbow method",xlab="Number of clusters (k)",ylab="Sum of squares due to error (SSE)")

We choose the value of k as the elbow on the arm however we can see two such points in the above graph one at 2 and another at 3 we go for the one with small SSE so k=3 is more accurate. Thus we select k as 3 using elbow method.