dplyr package in R
dplyr cheat sheet

dplyr package in R

dplyr package in R

WHAT IS “dplyr”?

“dplyr” package, widely used in R, is basically a grammar of data manipulation. It is written and maintained by Hadley Wickham.

WHY “dplyr”?

The package helps in transformation and summarization of data frames (i.e., data recorded in tabular form with rows and columns). It provides the most important verbs available to the users to work on R. Besides, it also allows the users to use the same interface while working data in different forms, be it in data frame or a table or from a database itself.

INSTALLATION AND USE

The code to install this package:

install.packages('dplyr', repos ='https://cran.rstudio.com/bin/windows/contrib/3.4/dplyr_0.7.4.zip')
library(dplyr)

Now we will discuss about a set of functions in the package, which performs common data manipulation operations.

The functions discussed below are:
1. Filter ()
2. Select ()
3. Mutate ()
4. Arrange ()
5. Summarize ()
6. Group_by ()
7. Piping ()

IMPORT DATA

We explain the above functions using a data set available in R – “flights”.
To get this data set we have to install and then call two packages,

install.packages('nycflights13', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/nycflights13_0.2.2.zip')
install.packages('tidyverse', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/tidyverse_1.2.1.zip')
library(nycflights13)
library(tidyverse)

Storing the data set ‘flights’ with the name ‘data’,

data <- nycflights13::flights

This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics.

DATA CLEANING

This is a vital step for all the analytical models. The data imported is not clean, as we can see huge amounts of missing values. So we have to do missing value treatment.
To check if there are missing values,

sum(is.na(data))
## [1] 46595

We see there are 46595 NA values, which are considered missing. We remove these and store the rest dataset in a new ‘data1’,

data1 <- na.omit(data)

Now we are left with 327346 records to work on.
Secondly, we see that the variables like, “dep_delay” and “arr_delay” has negetive values, which ultimately makes no sense. The logic being,
‘dep_delay’ = ‘dep_time’ – ‘sched_dep_time’ , should be the case.
But if we have ‘dep_delay’ negetive means that ‘sched_dep_time’ is more than ‘dep_time’, which means that the flight departed before time, so it is ethically not delayed. Hence negetive means nothing relevant under “dep_delay”.
Same logic goes for “arr_delay”. Hence negetive means nothing relevant for this as well.
So we replace the negative values with 0 for both the fields, showing that there is no delay in those flights.

data1$dep_delay <- ifelse(data1$dep_delay < 0, 0, data1$dep_delay)
data1$arr_delay <- ifelse(data1$arr_delay < 0, 0, data1$arr_delay)

Thirdly, we see the structure of the data is not appropriate,

str(data1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    327346 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 0 0 0 0 0 0 0 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 0 0 12 19 0 0 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
##   .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...

We have to correct the data types of few fields,

data1$carrier <- as.factor(data1$carrier)
data1$flight <- as.factor(data1$flight)
data1$tailnum <- as.factor(data1$tailnum)
data1$origin <- as.factor(data1$origin)
data1$dest <- as.factor(data1$dest)

Fourthly, we convert the data set into data frame, to work easily with it,

data2 <- as.data.frame(data1)
str(data2)
## 'data.frame':    327346 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 0 0 0 0 0 0 0 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 0 0 12 19 0 0 8 ...
##  $ carrier       : Factor w/ 16 levels "9E","AA","AS",..: 12 12 2 4 5 12 4 6 4 2 ...
##  $ flight        : Factor w/ 3835 levels "1","2","3","4",..: 1381 1544 1041 676 424 1526 468 3693 69 265 ...
##  $ tailnum       : Factor w/ 4037 levels "D942DN","N0EGMQ",..: 180 524 2400 3201 2660 1141 1828 3297 2206 1177 ...
##  $ origin        : Factor w/ 3 levels "EWR","JFK","LGA": 1 3 2 2 3 1 1 3 2 3 ...
##  $ dest          : Factor w/ 104 levels "ABQ","ACK","ALB",..: 44 44 58 13 5 69 36 43 54 69 ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
##   .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...

Now the data looks fine to be used for our analysis. Data cleaning part is now over.

Starting with the function of ‘dplyr’

Filter ()
This functions returns only the rows that matches with the condition entered by the user. It is called the filtering process, where the rows returned as output holds the given condition true. There can be one or more than one condition given by the user at a time.
Some useful filter functions are – . == , > , >= , < , <= . & , | , ! . xor() , is.na() . between() , near()
The syntax being: filter (dataset name, conditions)
For example, if we want the records of students whose age is more than 15, from student dataset: filter (student, age > 15)
For multiple conditions, we mention all the conditions using filter functions in between them. Example, if we want the records of female students having age more than 15: filter (student, sex == “F” & age > 15)
Working with our data set, first showing with a single condition,

f1 <- filter(data2, origin == 'EWR')
head(f1)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      554            558         0      740            728
## 3 2013     1   1      555            600         0      913            854
## 4 2013     1   1      558            600         0      923            937
## 5 2013     1   1      559            600         0      854            902
## 6 2013     1   1      601            600         1      844            850
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        12      UA   1696  N39463    EWR  ORD      150      719    5
## 3        19      B6    507  N516JB    EWR  FLL      158     1065    6
## 4         0      UA   1124  N53441    EWR  SFO      361     2565    6
## 5         0      UA   1187  N76515    EWR  LAS      337     2227    6
## 6         0      B6    343  N644JB    EWR  PBI      147     1023    6
##   minute           time_hour
## 1     15 2013-01-01 05:00:00
## 2     58 2013-01-01 05:00:00
## 3      0 2013-01-01 06:00:00
## 4      0 2013-01-01 06:00:00
## 5      0 2013-01-01 06:00:00
## 6      0 2013-01-01 06:00:00

Here we get only the flight records whose ‘origin’ is ‘EWR’. there are 117127 such records.
For multiple conditions,

f2 <- filter(data2, origin == 'EWR' & dest == 'IAH')
head(f2)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      739            739         0     1104           1038
## 3 2013     1   1      908            908         0     1228           1219
## 4 2013     1   1     1044           1045         0     1352           1351
## 5 2013     1   1     1205           1200         5     1503           1505
## 6 2013     1   1     1356           1350         6     1659           1640
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        26      UA   1479  N37408    EWR  IAH      249     1400    7
## 3         9      UA   1220  N12216    EWR  IAH      233     1400    9
## 4         1      UA    455  N667UA    EWR  IAH      229     1400   10
## 5         0      UA   1461  N39418    EWR  IAH      221     1400   12
## 6        19      UA   1258  N26906    EWR  IAH      218     1400   13
##   minute           time_hour
## 1     15 2013-01-01 05:00:00
## 2     39 2013-01-01 07:00:00
## 3      8 2013-01-01 09:00:00
## 4     45 2013-01-01 10:00:00
## 5      0 2013-01-01 12:00:00
## 6     50 2013-01-01 13:00:00

Here there are two conditions mentioned, and the output records must fulfil both the conditions simultaneously. That is, the flight ‘origin’ has to be ‘EWR’ and simultaneously the flight ‘dest’ has to be ‘IAH’. We get 3923 such records.
Also we notice that the number of output records for multiple conditions is less than that for single condition. Hence the simplest observation is that more the conditions imposed to come true, lesser is the number of output records.

Select ()

This function is usually called when there is a large data set, i.e., the number of variables considered and the observations are both huge. It often happens that when we work on any data set, we are not interested in the whole set of observations, instead we want to work on a particular set of observations only. Hence it helps us to extract a part of the original large data set of our interest and work on it.
There are few functions / arguments which work only inside select (). These are: . starts_with () , ends_with () , contains () . matches () , num_range ()
The syntax being: select (table name, the columns we want to display separated by commas)
For example, if we want to extract name, sex and age of the students, select (student, name, sex, age)
We can also use the functions inside the select statement to extract the desired records. Example, select (student, starts_with (“total”)) Here we get the records of all the students whose column name starts with “total”. Example, it can be the column “total marks”.
Now if we use a minus sign before the column names then it means that we want to drop those particular columns from the extracted table. For example, select (student, -name)
It will extract a table from the data frame student without the column “name”. In other words it will extract the whole of the student table, after dropping the column “name”.
Showing with our present data set,

s1 <- select(data2, sched_dep_time, sched_arr_time, flight)
head(s1)
##   sched_dep_time sched_arr_time flight
## 1            515            819   1545
## 2            529            830   1714
## 3            540            850   1141
## 4            545           1022    725
## 5            600            837    461
## 6            558            728   1696

Here we extract records of only three columns from the dataset- sched_dep_time, sched_arr_time and flight. There shows 327346 records with only 3 variables.
Now if we want to extract all the columns that has column name containing “arr”, then,

s2 <- select(data2, contains ("arr"))
head(s2)
##   arr_time sched_arr_time arr_delay carrier
## 1      830            819        11      UA
## 2      850            830        20      UA
## 3      923            850        33      AA
## 4     1004           1022         0      B6
## 5      812            837         0      DL
## 6      740            728        12      UA

This will show the 4 columns that starts with “arr”- arr_time, sched_arr_time, arr_delay and carrier.
Similarly we can use other embedded functions also as mentioned above.
Next if we want to extract the whole data set, except the column ‘year’, ‘month’, and ‘day’ in it,

s3 <- select(data2, -year, -month, -day)
head(s3)
##   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## 1      517            515         2      830            819        11
## 2      533            529         4      850            830        20
## 3      542            540         2      923            850        33
## 4      544            545         0     1004           1022         0
## 5      554            600         0      812            837         0
## 6      554            558         0      740            728        12
##   carrier flight tailnum origin dest air_time distance hour minute
## 1      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6      UA   1696  N39463    EWR  ORD      150      719    5     58
##             time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00

We see that these three columns does not show up in the output. Hence we can work with the rest of the dataset. So we see only 16 variables are there now, as desired.

Mutate ()

This function creates a new column to the existing data frame. The column thus created should essentially be the function of the existing variables in the concerned data frame.
There are few useful functions which are used in mutate (): . Arithmetical operators like, – , + , * , / . log () . Cumulative functions like, cumsum (), cummin (), cummax (), etc. . if_else (), etc…
The syntax being: mutate (table name, derived column name = the calculations with the existing column)
For example, if we want to find the average marks of all the students, mutate (student, avg_marks = (maths_marks + eng_marks)/2)
This will eventually create a new column “avg_marks” containing marks of the individual students.
Another feature of this function is that we can drop a particular variable by setting its value as NULL.
Example, mutate (student, address = NULL)
This command will set the address column will NULL. In this way we can drop the unrequired variables.
Explaining this by using our dataset.
If we want to specify the flights whose ‘arr_delay’ is more than 100 to be “Bad rated flight” and the rest to be “Average rated flight”,

m1 <- mutate(data2, Flight_Remarks = ifelse(arr_delay > 100, "Bad rated flight", "Average rated flight"))
head(m1)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545         0     1004           1022
## 5 2013     1   1      554            600         0      812            837
## 6 2013     1   1      554            558         0      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
## 4         0      B6    725  N804JB    JFK  BQN      183     1576    5
## 5         0      DL    461  N668DN    LGA  ATL      116      762    6
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute           time_hour       Flight_Remarks
## 1     15 2013-01-01 05:00:00 Average rated flight
## 2     29 2013-01-01 05:00:00 Average rated flight
## 3     40 2013-01-01 05:00:00 Average rated flight
## 4     45 2013-01-01 05:00:00 Average rated flight
## 5      0 2013-01-01 06:00:00 Average rated flight
## 6     58 2013-01-01 05:00:00 Average rated flight

We see that a new column is added to the dataset, which shows ‘Flight_Remarks’ for individual records. Hence we now get 327346 records for 20 variables.
Now suppose if we want, we can drop the column “time_hour” and work with rest of the columns, as it is just the concatenation of ‘year’, ‘day’ and ‘month’ columns in the same data set. This can be done by,

m2 <- mutate(data2, time_hour = NULL)
head(m2)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545         0     1004           1022
## 5 2013     1   1      554            600         0      812            837
## 6 2013     1   1      554            558         0      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
## 4         0      B6    725  N804JB    JFK  BQN      183     1576    5
## 5         0      DL    461  N668DN    LGA  ATL      116      762    6
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5
##   minute
## 1     15
## 2     29
## 3     40
## 4     45
## 5      0
## 6     58

As the value of the column “time_hour” is set NULL, it implies that the particular column is dropped. Hence the whole data set, except the column “time_hour” is show as an output. so, We see 18 variables now.

Arrange ()

This function is used to re-order rows according to the variable specified by the user. The default re-arranging pattern is ascending. To make it descending we need to mention desc (). This function also allows group_by () in it, for arranging records according to groups.
The syntax is: arrange (table name, col names by which we want to arrange separated by comas)
For example, if we want student records to be arranged in order of total marks, arrange (student, total_marks)
If we want to order the students according to highest to lowest marks, arrange (table name, desc (total_marks))
Explaining this by using our dataset.
If we want to arrange the data set in the order of column “distance”, so that it is easy for us to identify which flights are for short trips or which one is for long trips. Also we can easily determine the distance between the origin and destination countries sorted,

a1 <- arrange(data2, distance)
head(a1)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   3     2127           2129         0     2222           2224
## 2 2013     1   4     1240           1200        40     1333           1306
## 3 2013     1   4     1829           1615       134     1937           1721
## 4 2013     1   4     2128           2129         0     2218           2224
## 5 2013     1   5     1155           1200         0     1241           1306
## 6 2013     1   6     2125           2129         0     2224           2224
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1         0      EV   3833  N13989    EWR  PHL       30       80   21
## 2        27      EV   4193  N14972    EWR  PHL       30       80   12
## 3       136      EV   4502  N15983    EWR  PHL       28       80   16
## 4         0      EV   4645  N27962    EWR  PHL       32       80   21
## 5         0      EV   4193  N14902    EWR  PHL       29       80   12
## 6         0      EV   4619  N22909    EWR  PHL       22       80   21
##   minute           time_hour
## 1     29 2013-01-03 21:00:00
## 2      0 2013-01-04 12:00:00
## 3     15 2013-01-04 16:00:00
## 4     29 2013-01-04 21:00:00
## 5      0 2013-01-05 12:00:00
## 6     29 2013-01-06 21:00:00

We see that the records are arranged according to “distance”, but by default in ascending order of the distance amount.
Now if we want the same records but by descending order of the “distance”, to identify the longest route flights easily,

a2 <- arrange(data2, desc(distance))
head(a2)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      857            900         0     1516           1530
## 2 2013     1   2      909            900         9     1525           1530
## 3 2013     1   3      914            900        14     1504           1530
## 4 2013     1   4      900            900         0     1516           1530
## 5 2013     1   5      858            900         0     1519           1530
## 6 2013     1   6     1019            900        79     1558           1530
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1         0      HA     51  N380HA    JFK  HNL      659     4983    9
## 2         0      HA     51  N380HA    JFK  HNL      638     4983    9
## 3         0      HA     51  N380HA    JFK  HNL      616     4983    9
## 4         0      HA     51  N384HA    JFK  HNL      639     4983    9
## 5         0      HA     51  N381HA    JFK  HNL      635     4983    9
## 6        28      HA     51  N385HA    JFK  HNL      611     4983    9
##   minute           time_hour
## 1      0 2013-01-01 09:00:00
## 2      0 2013-01-02 09:00:00
## 3      0 2013-01-03 09:00:00
## 4      0 2013-01-04 09:00:00
## 5      0 2013-01-05 09:00:00
## 6      0 2013-01-06 09:00:00

Now we get our desired output.

Summarise ()

This function draws a summary statistics from a particular column in a data frame. In other words, it brings down to a single value from multiple values. The function works more significantly when used on group level data, created by the function group by (). The output thus formed after applying this function will be one row per group.
The aggregate functions used in summarise () are: . mean (), median () . max() , min() . n () , first (), last (), distinct (), etc.
The syntax is:
summarise (table name, aggregate functions function of the existing variables separated by commas)
For example, if we want to know the minimum, maximum and average marks of the student dataset,
summarise (student, min(total_marks), max(total_marks), mean(total_marks))
Explaining this by using our dataset.
Suppose we want to know the maximum, minimum and average ‘distance’ of the flights in 2013,

su1 <- summarise(data2, minimum= min(distance), maximum= max(distance), average= mean(distance))
su1
##   minimum maximum  average
## 1      80    4983 1048.371

Here we see that three single values are derived from a whole data set, the three values showing the maximum, minimum and average ‘distance’ covered by all the flights in 2013, as desired.
Group_by ()
This is used when we want to group the dataset with respect to a particular attribute.
From our data set if we want to group the records according to the year first, then month and then day,

g1 <- group_by(data2, year, month, day)
head(g1)
## # A tibble: 6 x 19
## # Groups:   year, month, day [1]
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515        2.      830
## 2  2013     1     1      533            529        4.      850
## 3  2013     1     1      542            540        2.      923
## 4  2013     1     1      544            545        0.     1004
## 5  2013     1     1      554            600        0.      812
## 6  2013     1     1      554            558        0.      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <fct>, flight <fct>, tailnum <fct>, origin <fct>, dest <fct>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

Here we get all the records but these are sorted and grouped according to the year first, then month and then by day.

Piping ()

This is used mainly when we have multiple operations to execute. Writing multiple commands in separate lines make the program look clumsy. So we can write multiple commands in a single line, connecting all with piping function. The pipe looks like, %>%.
Now working with our data set if we want to group by the data set on the basis of ‘year’, then ‘month’ and then ‘day’. Also we want to extract only certain fields like, ‘arr_delay’, ‘dep_delay’, ‘flight’, ‘origin’, ‘dest’ and ‘distance’. We can create this using a single command,

p1 <- data2 %>% group_by(year, month, day) %>% select(arr_delay, dep_delay, flight, origin, dest, distance)
head(p1)
## # A tibble: 6 x 9
## # Groups:   year, month, day [1]
##    year month   day arr_delay dep_delay flight origin dest  distance
##   <int> <int> <int>     <dbl>     <dbl> <fct>  <fct>  <fct>    <dbl>
## 1  2013     1     1       11.        2. 1545   EWR    IAH      1400.
## 2  2013     1     1       20.        4. 1714   LGA    IAH      1416.
## 3  2013     1     1       33.        2. 1141   JFK    MIA      1089.
## 4  2013     1     1        0.        0. 725    JFK    BQN      1576.
## 5  2013     1     1        0.        0. 461    LGA    ATL       762.
## 6  2013     1     1       12.        0. 1696   EWR    ORD       719.

So we get 327346 records and only 9 variables, just like we wanted.

Combination of Group_by () and Piping ()

Suppose the task is to find the number of flights falling under each ‘carrier’.
So we can start it, by grouping the data set on the basis of ‘carrier’. Creating a dummy column ‘count’ and then adding the value with every occurrence of unique ‘carrier’ codes.

data3 <- data2
data3$count <- 1
c1 <- data3 %>% group_by(carrier) %>% mutate(Num_of_flights = sum(count))

We get 327346 records showing 21 attributes. The two extra columns shown are for ‘count’ and the new column created by mutate function, ‘Num_of_flights’.
Now we see that a lot of repeatations in the ‘Num_of_flights’ column. It is not easy for us to understand, exactly which ‘carrier’ has how many flights in total.
So we extract two columns, ‘carrier’ and ‘Num_of_flights’ in a new data, ‘c2’.

c2 <- c1[, c(10,21)]

We get all the records of only these two above mentioned columns.
Now we take only the unique codes under ‘carrier’ column, to make the look of what we want easier,

c2 <- c2[!duplicated(c2),]

Finally we reach our destination. We see 16 different ‘carrier’ codes and the number of flights under them.
Now, suppose if we want to cross-check to whether the grouping is done properly or not, we can add the number of flights under each ‘carrier’ and see if we get our original number of records we had in data2,

sum(c2$Num_of_flights)
## [1] 327346

So we see the value comes to 327346, exactly the same number of records we were originally working with. Hence the grouping done is correct.
Hence to conclude “dplyr” is a very powerful package that can make easy calculations and manipulations on data sets, which can actually make our life easier.

Leave a Reply

Close Menu