The post dplyr package in R appeared first on StepUp Analytics.
The “dplyr” package, widely used in R, provides a grammar of data manipulation. It is written and maintained by Hadley Wickham.
The package helps in transforming and summarising data frames (i.e., data recorded in tabular form with rows and columns). It provides the most important data manipulation verbs available to R users. It also allows users to work through the same interface whether the data sits in a data frame, a data table or a database.
The code to install this package:
install.packages('dplyr')
library(dplyr)
Now we will discuss a set of functions in the package that perform common data manipulation operations.
We will explain these functions using a data set available in R – “flights”.
To get this data set we have to install and then call two packages,
install.packages('nycflights13')
install.packages('tidyverse')
library(nycflights13)
library(tidyverse)
Storing the data set ‘flights’ with the name ‘data’,
data <- nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics.
Data cleaning is a vital step for any analytical model. The imported data is not clean: it contains a huge number of missing values, so we have to treat them first.
To check if there are missing values,
sum(is.na(data))
## [1] 46595
We see there are 46,595 NA values, which are treated as missing. We remove the rows containing them and store the remaining data in a new object, ‘data1’,
data1 <- na.omit(data)
Now we are left with 327346 records to work on.
Secondly, we see that variables like “dep_delay” and “arr_delay” have negative values, which at first sight make no sense. The logic is:
‘dep_delay’ = ‘dep_time’ – ‘sched_dep_time’
A negative ‘dep_delay’ means that ‘sched_dep_time’ is later than ‘dep_time’, i.e., the flight departed early, so it is effectively not delayed. Hence negative values carry no relevant information in “dep_delay”.
The same logic applies to “arr_delay”, so negative values are not relevant there either.
So we replace the negative values with 0 in both fields, indicating that those flights were not delayed.
data1$dep_delay <- ifelse(data1$dep_delay < 0, 0, data1$dep_delay)
data1$arr_delay <- ifelse(data1$arr_delay < 0, 0, data1$arr_delay)
Thirdly, we see that some variables do not have appropriate data types,
str(data1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 327346 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 0 0 0 0 0 0 0 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 0 0 12 19 0 0 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
## - attr(*, "na.action")=Class 'omit' Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
## .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...
We have to correct the data types of a few fields,
data1$carrier <- as.factor(data1$carrier)
data1$flight <- as.factor(data1$flight)
data1$tailnum <- as.factor(data1$tailnum)
data1$origin <- as.factor(data1$origin)
data1$dest <- as.factor(data1$dest)
Fourthly, we convert the data set into a data frame, to work with it more easily,
data2 <- as.data.frame(data1)
str(data2)
## 'data.frame': 327346 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 0 0 0 0 0 0 0 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 0 0 12 19 0 0 8 ...
## $ carrier : Factor w/ 16 levels "9E","AA","AS",..: 12 12 2 4 5 12 4 6 4 2 ...
## $ flight : Factor w/ 3835 levels "1","2","3","4",..: 1381 1544 1041 676 424 1526 468 3693 69 265 ...
## $ tailnum : Factor w/ 4037 levels "D942DN","N0EGMQ",..: 180 524 2400 3201 2660 1141 1828 3297 2206 1177 ...
## $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 1 3 2 2 3 1 1 3 2 3 ...
## $ dest : Factor w/ 104 levels "ABQ","ACK","ALB",..: 44 44 58 13 5 69 36 43 54 69 ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
## - attr(*, "na.action")=Class 'omit' Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
## .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...
Now the data is ready for our analysis; the data cleaning part is over.
Filter ()
This function returns only the rows that match the condition entered by the user. This is called filtering: the rows returned as output are those for which the given condition holds true. The user can give one condition or several at a time.
Some useful filter helpers and operators are:
- == , > , >= , < , <=
- & , | , !
- xor() , is.na()
- between() , near()
The syntax being: filter (dataset name, conditions)
For example, if we want the records of students whose age is more than 15, from student dataset: filter (student, age > 15)
For multiple conditions, we combine them with logical operators. For example, if we want the records of female students older than 15: filter (student, sex == “F” & age > 15)
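These calls can be tried on a small, hypothetical ‘student’ data frame; the data below is invented purely for illustration and is not part of any real data set:

```r
library(dplyr)

# Hypothetical 'student' data, invented only to illustrate filter()
student <- data.frame(
  name = c("Asha", "Ben", "Carl", "Dia"),
  sex  = c("F", "M", "M", "F"),
  age  = c(16, 15, 17, 14)
)

filter(student, age > 15)               # single condition: Asha and Carl
filter(student, sex == "F" & age > 15)  # multiple conditions: only Asha
filter(student, between(age, 15, 16))   # helper: ages 15 to 16 inclusive
```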
Working with our data set, first showing with a single condition,
f1 <- filter(data2, origin == 'EWR')
head(f1)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 554 558 0 740 728
## 3 2013 1 1 555 600 0 913 854
## 4 2013 1 1 558 600 0 923 937
## 5 2013 1 1 559 600 0 854 902
## 6 2013 1 1 601 600 1 844 850
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 12 UA 1696 N39463 EWR ORD 150 719 5
## 3 19 B6 507 N516JB EWR FLL 158 1065 6
## 4 0 UA 1124 N53441 EWR SFO 361 2565 6
## 5 0 UA 1187 N76515 EWR LAS 337 2227 6
## 6 0 B6 343 N644JB EWR PBI 147 1023 6
## minute time_hour
## 1 15 2013-01-01 05:00:00
## 2 58 2013-01-01 05:00:00
## 3 0 2013-01-01 06:00:00
## 4 0 2013-01-01 06:00:00
## 5 0 2013-01-01 06:00:00
## 6 0 2013-01-01 06:00:00
Here we get only the flight records whose ‘origin’ is ‘EWR’. There are 117,127 such records.
For multiple conditions,
f2 <- filter(data2, origin == 'EWR' & dest == 'IAH')
head(f2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 739 739 0 1104 1038
## 3 2013 1 1 908 908 0 1228 1219
## 4 2013 1 1 1044 1045 0 1352 1351
## 5 2013 1 1 1205 1200 5 1503 1505
## 6 2013 1 1 1356 1350 6 1659 1640
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 26 UA 1479 N37408 EWR IAH 249 1400 7
## 3 9 UA 1220 N12216 EWR IAH 233 1400 9
## 4 1 UA 455 N667UA EWR IAH 229 1400 10
## 5 0 UA 1461 N39418 EWR IAH 221 1400 12
## 6 19 UA 1258 N26906 EWR IAH 218 1400 13
## minute time_hour
## 1 15 2013-01-01 05:00:00
## 2 39 2013-01-01 07:00:00
## 3 8 2013-01-01 09:00:00
## 4 45 2013-01-01 10:00:00
## 5 0 2013-01-01 12:00:00
## 6 50 2013-01-01 13:00:00
Here two conditions are specified, and the output records must satisfy both simultaneously: the flight ‘origin’ must be ‘EWR’ and the ‘dest’ must be ‘IAH’. We get 3,923 such records.
We also notice that multiple conditions return fewer records than a single condition. The simple observation is: the more conditions imposed, the fewer records come out.
Select ()
This function is most useful on large data sets, where both the number of variables and the number of observations are huge. Often we are not interested in the whole set of variables; we want to work on a particular subset of columns only. select() lets us extract the part of the original data set we are interested in and work on it.
There are a few helper functions which work only inside select(). These are:
- starts_with() , ends_with() , contains()
- matches() , num_range()
The syntax being: select (table name, the columns we want to display separated by commas)
For example, if we want to extract name, sex and age of the students, select (student, name, sex, age)
We can also use these helper functions inside select() to pick columns. For example, select (student, starts_with (“total”)) returns every column whose name starts with “total” – for instance, a column “total_marks”.
If we put a minus sign before a column name, that column is dropped from the result. For example, select (student, -name) extracts the whole student table without the column “name”.
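A quick runnable sketch of both ideas, again on a hypothetical ‘student’ data frame (the column names are made up for illustration):

```r
library(dplyr)

# Hypothetical 'student' data, invented only to illustrate select()
student <- data.frame(
  name             = c("Asha", "Ben"),
  total_marks      = c(80, 72),
  total_attendance = c(38, 35)
)

select(student, starts_with("total"))  # both columns whose names start with "total"
select(student, -name)                 # every column except 'name'
```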
Showing with our present data set,
s1 <- select(data2, sched_dep_time, sched_arr_time, flight)
head(s1)
## sched_dep_time sched_arr_time flight
## 1 515 819 1545
## 2 529 830 1714
## 3 540 850 1141
## 4 545 1022 725
## 5 600 837 461
## 6 558 728 1696
Here we extract only three columns from the data set – sched_dep_time, sched_arr_time and flight. The result has 327,346 records with only 3 variables.
Now, if we want to extract all the columns whose names contain “arr”,
s2 <- select(data2, contains("arr"))
head(s2)
## arr_time sched_arr_time arr_delay carrier
## 1 830 819 11 UA
## 2 850 830 20 UA
## 3 923 850 33 AA
## 4 1004 1022 0 B6
## 5 812 837 0 DL
## 6 740 728 12 UA
This shows the 4 columns whose names contain “arr” – arr_time, sched_arr_time, arr_delay and carrier.
Similarly we can use other embedded functions also as mentioned above.
Next, if we want to extract the whole data set except the columns ‘year’, ‘month’ and ‘day’,
s3 <- select(data2, -year, -month, -day)
head(s3)
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 0 1004 1022 0
## 5 554 600 0 812 837 0
## 6 554 558 0 740 728 12
## carrier flight tailnum origin dest air_time distance hour minute
## 1 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 DL 461 N668DN LGA ATL 116 762 6 0
## 6 UA 1696 N39463 EWR ORD 150 719 5 58
## time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
We see that these three columns do not show up in the output, so we can work with the rest of the data set. Only 16 variables remain, as desired.
Mutate ()
This function adds a new column to an existing data frame. The column created is typically a function of the existing variables in the data frame.
A few useful functions often used inside mutate():
- arithmetic operators: + , – , * , /
- log()
- cumulative functions: cumsum() , cummin() , cummax() , etc.
- if_else() , etc.
The syntax being: mutate (table name, derived column name = the calculations with the existing column)
For example, if we want to find the average marks of all the students, mutate (student, avg_marks = (maths_marks + eng_marks)/2)
This creates a new column “avg_marks” containing the average marks of the individual students.
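The cumulative helpers listed above can be sketched on a small, hypothetical ‘sales’ data frame (the figures are invented for illustration):

```r
library(dplyr)

# Hypothetical monthly sales figures, only to demonstrate cumulative helpers
sales <- data.frame(month = 1:5, revenue = c(10, 20, 15, 30, 25))

mutate(sales,
       running_total = cumsum(revenue),   # cumulative sum so far
       best_so_far   = cummax(revenue))   # largest revenue seen so far
```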
Another feature of this function is that we can drop a particular variable by setting it to NULL.
Example, mutate (student, address = NULL)
This command sets the address column to NULL, which drops it. In this way we can remove unrequired variables.
Explaining this by using our dataset.
If we want to specify the flights whose ‘arr_delay’ is more than 100 to be “Bad rated flight” and the rest to be “Average rated flight”,
m1 <- mutate(data2, Flight_Remarks = ifelse(arr_delay > 100, "Bad rated flight", "Average rated flight"))
head(m1)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 0 1004 1022
## 5 2013 1 1 554 600 0 812 837
## 6 2013 1 1 554 558 0 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5
## 4 0 B6 725 N804JB JFK BQN 183 1576 5
## 5 0 DL 461 N668DN LGA ATL 116 762 6
## 6 12 UA 1696 N39463 EWR ORD 150 719 5
## minute time_hour Flight_Remarks
## 1 15 2013-01-01 05:00:00 Average rated flight
## 2 29 2013-01-01 05:00:00 Average rated flight
## 3 40 2013-01-01 05:00:00 Average rated flight
## 4 45 2013-01-01 05:00:00 Average rated flight
## 5 0 2013-01-01 06:00:00 Average rated flight
## 6 58 2013-01-01 05:00:00 Average rated flight
We see that a new column, ‘Flight_Remarks’, is added for each record. We now have 327,346 records with 20 variables.
Now, if we want, we can drop the column “time_hour” and work with the remaining columns, since it is just a date-time combination of the ‘year’, ‘month’ and ‘day’ columns already present in the data set. This can be done by,
m2 <- mutate(data2, time_hour = NULL)
head(m2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 0 1004 1022
## 5 2013 1 1 554 600 0 812 837
## 6 2013 1 1 554 558 0 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5
## 4 0 B6 725 N804JB JFK BQN 183 1576 5
## 5 0 DL 461 N668DN LGA ATL 116 762 6
## 6 12 UA 1696 N39463 EWR ORD 150 719 5
## minute
## 1 15
## 2 29
## 3 40
## 4 45
## 5 0
## 6 58
As the column “time_hour” is set to NULL, it is dropped, so the whole data set except “time_hour” is shown as output. We now see 18 variables.
Arrange ()
This function re-orders rows according to the variable(s) specified by the user. The default ordering is ascending; for descending order we wrap the variable in desc(). It can also be combined with group_by() to arrange records within groups.
The syntax is: arrange (table name, column names to arrange by, separated by commas)
For example, if we want student records to be arranged in order of total marks, arrange (student, total_marks)
If we want to order the students from highest to lowest marks, arrange (student, desc (total_marks))
Explaining this by using our dataset.
Suppose we want to arrange the data set by the column “distance”, so that it is easy to identify which flights are short trips and which are long trips, with the origin–destination distances sorted,
a1 <- arrange(data2, distance)
head(a1)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 3 2127 2129 0 2222 2224
## 2 2013 1 4 1240 1200 40 1333 1306
## 3 2013 1 4 1829 1615 134 1937 1721
## 4 2013 1 4 2128 2129 0 2218 2224
## 5 2013 1 5 1155 1200 0 1241 1306
## 6 2013 1 6 2125 2129 0 2224 2224
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 0 EV 3833 N13989 EWR PHL 30 80 21
## 2 27 EV 4193 N14972 EWR PHL 30 80 12
## 3 136 EV 4502 N15983 EWR PHL 28 80 16
## 4 0 EV 4645 N27962 EWR PHL 32 80 21
## 5 0 EV 4193 N14902 EWR PHL 29 80 12
## 6 0 EV 4619 N22909 EWR PHL 22 80 21
## minute time_hour
## 1 29 2013-01-03 21:00:00
## 2 0 2013-01-04 12:00:00
## 3 15 2013-01-04 16:00:00
## 4 29 2013-01-04 21:00:00
## 5 0 2013-01-05 12:00:00
## 6 29 2013-01-06 21:00:00
We see that the records are arranged according to “distance”, by default in ascending order.
Now, to identify the longest-route flights easily, we arrange the same records in descending order of “distance”,
a2 <- arrange(data2, desc(distance))
head(a2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 857 900 0 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 0 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 0 HA 51 N380HA JFK HNL 659 4983 9
## 2 0 HA 51 N380HA JFK HNL 638 4983 9
## 3 0 HA 51 N380HA JFK HNL 616 4983 9
## 4 0 HA 51 N384HA JFK HNL 639 4983 9
## 5 0 HA 51 N381HA JFK HNL 635 4983 9
## 6 28 HA 51 N385HA JFK HNL 611 4983 9
## minute time_hour
## 1 0 2013-01-01 09:00:00
## 2 0 2013-01-02 09:00:00
## 3 0 2013-01-03 09:00:00
## 4 0 2013-01-04 09:00:00
## 5 0 2013-01-05 09:00:00
## 6 0 2013-01-06 09:00:00
Now we get our desired output.
Summarise ()
This function computes summary statistics from the columns of a data frame; in other words, it collapses multiple values down to a single value. It is most useful on grouped data created by group_by(): the output then has one row per group.
The aggregate functions used in summarise() include:
- mean() , median()
- max() , min()
- n() , first() , last() , n_distinct() , etc.
The syntax is:
summarise (table name, aggregate functions of the existing variables, separated by commas)
For example, if we want to know the minimum, maximum and average marks of the student dataset,
summarise (student, min(total_marks), max(total_marks), mean(total_marks))
Explaining this by using our dataset.
Suppose we want to know the maximum, minimum and average ‘distance’ of the flights in 2013,
su1 <- summarise(data2, minimum = min(distance), maximum = max(distance), average = mean(distance))
su1
## minimum maximum average
## 1 80 4983 1048.371
Here three single values are derived from the whole data set, showing the maximum, minimum and average ‘distance’ covered by the flights in 2013, as desired.
Group_by ()
This is used when we want to group the dataset with respect to a particular attribute.
From our data set if we want to group the records according to the year first, then month and then day,
g1 <- group_by(data2, year, month, day)
head(g1)
## # A tibble: 6 x 19
## # Groups: year, month, day [1]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 0. 1004
## 5 2013 1 1 554 600 0. 812
## 6 2013 1 1 554 558 0. 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <fct>, flight <fct>, tailnum <fct>, origin <fct>, dest <fct>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Here we get all the records, but they are now grouped according to year first, then month and then day.
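group_by() shows its real value when combined with summarise(): the summary is then computed once per group. A minimal sketch, using a toy data frame standing in for the flight records:

```r
library(dplyr)

# Toy stand-in for the flight data: origin airport and arrival delay
toy <- data.frame(origin    = c("EWR", "EWR", "JFK"),
                  arr_delay = c(10, 20, 5))

by_origin <- group_by(toy, origin)
summarise(by_origin, avg_delay = mean(arr_delay), flights = n())
# one row per origin: EWR averages 15 over 2 flights, JFK averages 5 over 1
```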
The Pipe Operator (%>%)
This is used mainly when we have multiple operations to execute. Writing each command on a separate line makes the program look clumsy, so we can chain multiple commands together in a single pipeline using the pipe operator, which looks like %>%.
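The pipe feeds the result on its left in as the first argument of the function on its right, so x %>% f(y) means f(x, y). A small sketch:

```r
library(dplyr)  # re-exports %>% from magrittr

nums <- c(4, 9, 16)

sqrt(sum(nums))            # nested form: read inside-out
nums %>% sum() %>% sqrt()  # piped form: same computation, read left to right
```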
Working with our data set, suppose we want to group it by ‘year’, ‘month’ and ‘day’, and also extract only certain fields – ‘arr_delay’, ‘dep_delay’, ‘flight’, ‘origin’, ‘dest’ and ‘distance’. We can do this with a single command,
p1 <- data2 %>% group_by(year, month, day) %>% select(arr_delay, dep_delay, flight, origin, dest, distance)
head(p1)
## # A tibble: 6 x 9
## # Groups: year, month, day [1]
## year month day arr_delay dep_delay flight origin dest distance
## <int> <int> <int> <dbl> <dbl> <fct> <fct> <fct> <dbl>
## 1 2013 1 1 11. 2. 1545 EWR IAH 1400.
## 2 2013 1 1 20. 4. 1714 LGA IAH 1416.
## 3 2013 1 1 33. 2. 1141 JFK MIA 1089.
## 4 2013 1 1 0. 0. 725 JFK BQN 1576.
## 5 2013 1 1 0. 0. 461 LGA ATL 762.
## 6 2013 1 1 12. 0. 1696 EWR ORD 719.
So we get 327346 records and only 9 variables, just like we wanted.
Suppose the task is to find the number of flights falling under each ‘carrier’.
We start by grouping the data set on the basis of ‘carrier’, creating a dummy column ‘count’, and then summing its value over every occurrence of each unique ‘carrier’ code.
data3 <- data2
data3$count <- 1
c1 <- data3 %>% group_by(carrier) %>% mutate(Num_of_flights = sum(count))
We get 327346 records showing 21 attributes. The two extra columns shown are for ‘count’ and the new column created by mutate function, ‘Num_of_flights’.
Now we see a lot of repetition in the ‘Num_of_flights’ column; it is not easy to tell exactly which ‘carrier’ has how many flights in total.
So we extract two columns, ‘carrier’ and ‘Num_of_flights’ in a new data, ‘c2’.
c2 <- c1[, c(10,21)]
We get all the records of only these two above mentioned columns.
Now we keep only the unique rows, so that each ‘carrier’ code appears once with its total,
c2 <- c2[!duplicated(c2),]
Finally we reach our goal: 16 different ‘carrier’ codes, each with its number of flights.
Now, to cross-check whether the grouping was done properly, we can add up the number of flights under each ‘carrier’ and see if we get back the original number of records in data2,
sum(c2$Num_of_flights)
## [1] 327346
So we see the value comes to 327346, exactly the same number of records we were originally working with. Hence the grouping done is correct.
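As an aside (our suggestion, not part of the original walkthrough): the same per-carrier totals can be reached in one step with summarise() and n(), without a dummy count column or de-duplication. A sketch on a toy stand-in for data2:

```r
library(dplyr)

# Toy stand-in for data2: one row per flight, keeping only the carrier code
toy <- data.frame(carrier = c("UA", "UA", "AA", "B6", "UA"))

per_carrier <- toy %>%
  group_by(carrier) %>%
  summarise(Num_of_flights = n())   # n() counts the rows in each group
per_carrier
```

Summing the totals still returns the original row count, the same cross-check used above.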
To conclude, “dplyr” is a very powerful package that makes calculations and manipulations on data sets easy, which can actually make our life easier.
The post SEGMENTATION OF CUSTOMERS USING RFM MODEL appeared first on StepUp Analytics.
Customer Lifetime Value (CLV), also known as Lifetime Value (LTV), is the present value of the future cash flows from a customer during his or her entire relationship with the company. In other words, it is the dollar value of a customer relationship, based on the present value of the estimated future cash flows from that particular customer’s relationship with the company.
CLV is mainly used by companies for customer segmentation. Because not every customer is equally important to a company, a CLV-based segmentation model allows the company to identify its most profitable group of customers, understand their nature, and focus on them rather than spending time, money and resources on less profitable customers.
The Recency-Frequency-Monetary (RFM) model is one of the predictive models used to calculate CLV. By segmenting customers with the RFM model, we can analyse each group individually and determine which set of customers has the highest CLV, and hence contributes most to the profitability of the company.
RFM analysis determines quantitatively which customers are the best ones by examining the following factors: how recently a customer purchased (Recency), how often they purchase (Frequency), and how much they spend (Monetary value).
It follows the axiom that 80% of business comes from 20% of customers (the set that is most important).
To start with the RFM model, install the following packages first.
install.packages('ggplot2')
install.packages('ggvis')
install.packages('dplyr')
install.packages('plotly')
install.packages('lubridate')
Now loading these above packages,
library(ggplot2)
library(ggvis)
library(dplyr)
library(plotly)
library(lubridate)
Now, importing the dataset in R,
setwd("C:/Users/Dell/Desktop/Step Up Analytics")
data1 <- read.csv("Online Retail.csv", stringsAsFactors = FALSE)
The link to view the whole dataset “Online Retail” in csv format is,
Workable Data: Download
To have a look at the preview of the dataset and see its structure,
head(data1)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 01-12-2010 08:26 2.55 17850 United Kingdom
## 2 01-12-2010 08:26 3.39 17850 United Kingdom
## 3 01-12-2010 08:26 2.75 17850 United Kingdom
## 4 01-12-2010 08:26 3.39 17850 United Kingdom
## 5 01-12-2010 08:26 3.39 17850 United Kingdom
## 6 01-12-2010 08:26 7.65 17850 United Kingdom
str(data1)
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: chr "01-12-2010 08:26" "01-12-2010 08:26" "01-12-2010 08:26" "01-12-2010 08:26" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
It is an invoice data set of an online retailer, with 541,909 records on 8 variables.
We see that a few variables have misinterpreted data types. We also see negative values for Quantity and UnitPrice, which logically cannot happen. So we need to treat all of these.
Firstly, correcting the datatypes of required variables,
data1$Country <- as.factor(data1$Country)
data1$InvoiceNo <- as.factor(data1$InvoiceNo)
data1$StockCode <- as.factor(data1$StockCode)
data1$CustomerID <- as.factor(data1$CustomerID)
data1$InvoiceDate <- as.POSIXct(data1$InvoiceDate, format = '%d-%m-%Y %H:%M')
Also, to treat the illogical negative values in Quantity and UnitPrice,
data1$Quantity <- replace(data1$Quantity, data1$Quantity <= 0, NA)
data1$UnitPrice <- replace(data1$UnitPrice, data1$UnitPrice <= 0, NA)
data1 <- na.omit(data1)
What we do is replace the non-positive values with NA, and then omit the NAs.
Checking the structure again,
str(data1)
## 'data.frame': 397884 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : Factor w/ 4372 levels "12346","12347",..: 4049 4049 4049 4049 4049 4049 4049 4049 4049 541 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:144025] 142 155 236 237 238 239 240 241 242 623 ...
## .. ..- attr(*, "names")= chr [1:144025] "142" "155" "236" "237" ...
The records have now decreased to 397,884 observations on the same 8 variables, and the data is clean, with no illogical records or data types in it.
Let us now create a calculated column showing total price, so that it is easy for us to calculate the monetary value of customers.
data1$TotalPrice <- data1$Quantity * data1$UnitPrice
Now firstly, we calculate Recency of a customer in number of days,
data1$Recency <- round(difftime(now(),data1$InvoiceDate, units = "days"), 0)
Here we take the difference in days between the customer’s purchase date and today, and round the figure. This measures how recent each purchase is.
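One caveat (our observation, not part of the original walkthrough): because these invoices date from 2010–2011, measuring recency against now() gives values that grow every time the script is re-run. A common alternative is to anchor recency to a fixed reference date, such as the day after the last invoice in the data. A sketch with toy dates:

```r
library(lubridate)

# Toy invoice timestamps standing in for data1$InvoiceDate
inv <- as.POSIXct(c("2011-12-01 10:00", "2011-12-09 15:30"),
                  format = "%Y-%m-%d %H:%M", tz = "UTC")

ref <- max(inv) + days(1)                        # fixed reference: day after last invoice
recency <- round(difftime(ref, inv, units = "days"), 0)
recency   # stable across re-runs: 9 and 1 days
```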
Next calculating Frequency of a customer,
data1$Count <- 1
data1 <- data1 %>% group_by(CustomerID) %>% mutate(Freq = sum(Count))
To find how frequent a customer is, we create a dummy column ‘Count’ with a fixed value of 1, group by CustomerID, and sum ‘Count’ within each group. This gives the total number of purchase occurrences for a particular customer, which we keep as Frequency.
Next calculating Monetary Value of a customer,
data1 <- data1 %>% group_by(CustomerID) %>% mutate(MonetaryValue = sum(TotalPrice))
To find the monetary value of a customer, we group by CustomerID and sum TotalPrice within each group. This gives the total money spent by a particular customer, which we keep as Monetary Value.
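The three ingredients can also be computed in a single grouped summarise(), giving one row per customer directly; a sketch on toy transactions (the table and its column names are invented for illustration):

```r
library(dplyr)

# Toy transaction table: each row is one purchase
tx <- data.frame(
  CustomerID = c("C1", "C1", "C2"),
  TotalPrice = c(10, 20, 5),
  DaysAgo    = c(3, 1, 7)     # days since the purchase
)

rfm <- tx %>%
  group_by(CustomerID) %>%
  summarise(Recency       = min(DaysAgo),      # most recent purchase
            Freq          = n(),               # number of purchases
            MonetaryValue = sum(TotalPrice))   # total spend
rfm
```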
Now viewing the preview of the data,
head(data1)
## # A tibble: 6 x 13
## # Groups: CustomerID [1]
## InvoiceNo StockCode Description Quantity
## <fctr> <fctr> <chr> <int>
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## # ... with 9 more variables: InvoiceDate <dttm>, UnitPrice <dbl>,
## # CustomerID <fctr>, Country <fctr>, TotalPrice <dbl>, Recency <time>,
## # Count <dbl>, Freq <dbl>, MonetaryValue <dbl>
So we get 397,884 records with 13 variables. But we need only three variables per customer to carry out the segmentation: Recency, Frequency and Monetary Value. So we separate these (along with CustomerID) into another data set.
data2 <- data1[,c(7, 10, 12, 13)]
Extracting the Customer ID, Recency, Frequency and Monetary Value in data2.
Checking its structure now,
We find that it is in table form, and the variable datatypes are not as desired. So we apply the necessary treatment:
data2$Recency <- as.numeric(data2$Recency)
data2$Freq <- as.numeric(data2$Freq)
data2$MonetaryValue <- as.numeric(data2$MonetaryValue)
data2 <- as.data.frame(data2)
Checking the structure again
str(data2)
## 'data.frame': 397884 obs. of 4 variables:
##  $ CustomerID   : Factor w/ 4372 levels "12346","12347",..: 4049 4049 4049 4049 4049 4049 4049 4049 4049 541 ...
##  $ Recency      : num 2667 2667 2667 2667 2667 ...
##  $ Freq         : num 297 297 297 297 297 297 297 297 297 172 ...
##  $ MonetaryValue: num 5391 5391 5391 5391 5391 ...
To start with segmentation modelling, we have to remove duplicate combinations of recency, frequency and monetary value, so that we get a unique set of records for our segmentation. We keep these unique records in a new data set.
data3 <- data2[!duplicated(data2),]
Now we create a new dataset containing only the scaled values of Recency, Frequency and Monetary value, so that no bias arises from the different units of measurement. We scale the three columns and store the result in another new dataset:
data4 <- scale(data3[,2:4])
scale() returns a matrix, so for ease of work we convert the result back to a dataframe:
data4 <- as.data.frame(data4)
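The de-duplication and scaling steps can likewise be sketched in pandas; the toy values below are invented, and the z-score arithmetic reproduces what R’s scale() does (centre each column, then divide by its sample standard deviation):

```python
import pandas as pd

# Toy R/F/M table with one duplicated row (invented values)
rfm = pd.DataFrame({
    "Recency": [10.0, 10.0, 40.0, 70.0],
    "Freq": [5.0, 5.0, 2.0, 8.0],
    "MonetaryValue": [100.0, 100.0, 50.0, 200.0],
})

# Drop duplicate R/F/M combinations (R: data2[!duplicated(data2),])
uniq = rfm.drop_duplicates()

# Column-wise z-scores (R: scale() centres and divides by the sample sd)
scaled = (uniq - uniq.mean()) / uniq.std(ddof=1)
print(scaled.round(3))
```

After scaling, each column has mean 0 and sample standard deviation 1, so no single feature dominates the distance computations in clustering.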
Checking the structure of the dataset now,
str(data4)
## 'data.frame': 16763 obs. of 3 variables:
##  $ Recency      : num 1.88 1.88 1.88 1.88 1.88 ...
##  $ Freq         : num -0.0644 -0.197 -0.1174 -0.3498 -0.3763 ...
##  $ MonetaryValue: num -0.1343 -0.2168 -0.0618 -0.3045 -0.3073 ...
Now we get the required structure on which we can now segment our customers, using the algorithm of clustering.
Firstly, we install and load a few libraries to work on clustering,
install.packages('factoextra', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/factoextra_1.0.5.zip')
install.packages('cluster', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/cluster_2.0.6.zip')
install.packages('magrittr', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/magrittr_1.5.zip')
install.packages('NbClust', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/NbClust_3.0.zip')
install.packages('fpc', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/fpc_2.1-11.zip')
install.packages('clValid', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/clValid_0.6-6.zip')
install.packages('clustertend', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/clustertend_1.4.zip')
library("cluster")
library("factoextra")
library("magrittr")
library("NbClust")
library("fpc")
library("clValid")
library("clustertend")
To start with segmentation, we use the widely used K-means clustering. We divide our customers into three groups based on their three features: recency, frequency and monetary value.
fit <- kmeans(data4, 3)
fit
## K-means clustering with 3 clusters of sizes 7435, 623, 8705
##
## Cluster means:
##      Recency       Freq MonetaryValue
## 1  0.9466706 -0.1438961    -0.1228018
## 2 -0.0238894  4.1049257     3.7164380
## 3 -0.8068481 -0.1708789    -0.1610924
Running this code we see that the data is divided into three groups of sizes 7435, 623 and 8705, respectively. Their individual cluster means are calculated with respect to each feature, and a vector containing the cluster number for each record is provided. The between-cluster sum of squares accounts for about 64% of the total, which is good.
The within-cluster sums of squares and the between-cluster sum of squares are:
fit$withinss
## [1]  4462.834 10108.185  3535.715
fit$betweenss
## [1] 32179.27
We see that the between-cluster sum of squares is much larger than the within-cluster sums of squares, which indicates well-separated clusters.
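This decomposition always satisfies total SS = within SS + between SS, which a small sketch (invented 1-D data with fixed cluster labels) makes concrete:

```python
import numpy as np

# Toy 1-D data already assigned to two clusters (invented values)
points = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])

grand_mean = points.mean()
within = between = 0.0
for k in (0, 1):
    cluster = points[labels == k]
    centroid = cluster.mean()
    within += ((cluster - centroid) ** 2).sum()             # like fit$withinss, summed
    between += len(cluster) * (centroid - grand_mean) ** 2  # like fit$betweenss

total = ((points - grand_mean) ** 2).sum()                  # like fit$totss
print(within, between, total, between / total)
```

The final ratio is the “between_SS / total_SS” figure that kmeans() prints; the closer it is to 1, the better the clusters explain the spread of the data.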
We can plot the clusters,
plot(data4, col=fit$cluster, pch=16)
Now, we export our outputs to CSV files, to read them better,
write.csv(data4, "values.csv")
write.csv(fit$cluster, "clusters.csv")
My output file, containing the customers segmented into three clusters, is attached below. Click the link to view the output,
Output file: Download
Customer IDs in pink coloured cells are from Cluster 1, those in blue cells are in Cluster 2 and those in yellow cells are in Cluster 3.
For Cluster 1 – Recency ranges from 2294 days to 2463 days. Variability among the customers is very low.
The number of purchases ranges from 1 to 2700 times. The variability captured is very low.
The monetary value of the purchases ranges from Rs. 6.2 to Rs. 91062. The variability observed is very low.
For Cluster 2 – Recency ranges from 2294 days to 2667 days, so this cluster takes in somewhat older customers. Variability among the customers is very low.
The number of purchases ranges from 3 to 7847 times, so it takes in customers who purchase far more often. The variability captured is moderately higher than in the previous cluster.
The monetary value of the purchases ranges from Rs. 33720 to Rs. 280206, so it takes in customers who add much more value through their purchases. The variability observed is moderately higher than in the previous cluster.
For Cluster 3 – Recency ranges from 2458 days to 2667 days, so it also takes in somewhat older customers. Variability among the customers is very low.
The number of purchases ranges from 1 to 2700 times, so it takes in customers who purchase a moderate number of times. The variability captured is moderately higher than in the previous cluster, with a few outliers at the higher end.
The monetary value of the purchases ranges from Rs. 3.75 to Rs. 91062, so it takes in customers who add moderate value through their purchases. Variability is high at the upper end; in fact, many outliers can be identified among the higher purchase amounts.
So we can conclude that:
The post SEGMENTATION OF CUSTOMERS USING RFM MODEL appeared first on StepUp Analytics.
]]>The post How F-tests works in Analysis of Variance (ANOVA) appeared first on StepUp Analytics.
]]>value <- c(2.3,2.1,2.4,2.5,2.2,2.0,1.9,2.1,2.2,2.3,2.4,2.6,2.4,2.7,2.6,2.7,2.3,2.5,2.3,2.4)
brand <- rep(c("ACME","AJAX","CHAMP","TUFFY","XTRA"),each=4)
data <- data.frame(brand,value)
summary(aov(value ~ brand , data = data))
qf(0.95,4,15)
Output will be as follows,
value <- c(2.3,2.1,2.4,2.5,2.2,2.0,1.9,2.1,2.2,2.3,2.4,2.6,2.4,2.7,2.6,2.7,2.3,2.5,2.3,2.4)
brand <- rep(c("ACME","AJAX","CHAMP","TUFFY","XTRA"),each=4)
data <- data.frame(brand,value)
summary(aov(value ~ brand , data = data))
Df Sum Sq Mean Sq F value Pr(>F)
brand 4 0.6170 0.15425 7.404 0.00168 **
Residuals 15 0.3125 0.02083
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> qf(0.95,4,15)
[1] 3.055568
So, from the above we see that the observed value of the F-statistic is 7.404 and the critical value F-crit is 3.055568. Since 7.404 > 3.055568, we reject the null hypothesis at the 5% level of significance.
So, this is how F-statistic is used in ANOVA technique.
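The arithmetic behind that F value can be reproduced by hand; the plain-Python sketch below recomputes the ANOVA table from the same twenty observations:

```python
# Reproducing the ANOVA F statistic from the R output above in plain Python.
values = [2.3, 2.1, 2.4, 2.5, 2.2, 2.0, 1.9, 2.1, 2.2, 2.3,
          2.4, 2.6, 2.4, 2.7, 2.6, 2.7, 2.3, 2.5, 2.3, 2.4]
groups = [values[i:i + 4] for i in range(0, 20, 4)]  # 5 brands, 4 observations each

grand_mean = sum(values) / len(values)
# Between-group SS: group sizes times squared deviations of group means
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group SS: squared deviations about each group's own mean
ss_within = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)

df_between, df_within = len(groups) - 1, len(values) - len(groups)  # 4 and 15
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 3))  # matches the 7.404 reported by summary(aov(...))
```

The two sums of squares come out as 0.617 and 0.3125, exactly the “Sum Sq” column printed by R.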
The post How F-tests works in Analysis of Variance (ANOVA) appeared first on StepUp Analytics.
]]>The post Beginner to Advance level – Steps to make regression model part 2 appeared first on StepUp Analytics.
]]>Let's start with ANOVA:
A basic idea behind ANOVA, that of partitioning variation, is fundamental to experimental statistics. ANOVA belies its name in that it is not concerned with analyzing variances as such, but rather with analyzing variation in means.
There are two types of ANOVA: one-way ANOVA and two-way ANOVA.
I have explained One way and Two way ANOVA respectively.
Now let's discuss the Coefficient of Determination.
The coefficient of determination, denoted by R² or r² and pronounced “R-squared”, is a ratio of sums of squares:
R² or r²=SS(reg)/SS(t)
R² never decreases when an extra regressor variable is added (and usually increases), so we cannot depend much on R² for comparing models.
If R² isn’t a solution, there might be another way to judge the fit of the model. Yes, there is a solution, known as Adjusted R².
The properties above hold for both R² and Adjusted R².
The adjusted R^{2} is defined as
Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1)
where n is the number of observations and p is the number of regressor variables.
Adjusted R^{2} can also be written as
Adjusted R² = 1 − [SS(res)/(n − p − 1)] / [SS(t)/(n − 1)]
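A quick numeric check of the adjusted R² idea (the sums of squares below are assumed toy values, not taken from any fitted model):

```python
# Toy illustration (invented numbers): R-squared vs adjusted R-squared.
n = 20          # number of observations
p = 3           # number of regressor variables
ss_res = 30.0   # residual sum of squares (assumed)
ss_tot = 100.0  # total sum of squares (assumed)

r2 = 1 - ss_res / ss_tot                       # plain R-squared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalises extra regressors
print(r2, adj_r2)
```

Here the adjusted value (about 0.644) is smaller than R² (0.7), and the gap widens as p grows relative to n, which is exactly the penalty that makes it safer for model comparison.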
Next is Model Adequacy checking, Multicollinearity and selecting significant explanatory variables.
We will discuss these remaining topics in the next article of this series. Till then, if you have any doubt or suggestion please feel free to shoot me an email on khanirfan.khan21@gmail.com or mention in comment.
Article originally posted.
The post Beginner to Advance level – Steps to make regression model part 2 appeared first on StepUp Analytics.
]]>The post Beginner to Advance level: Steps to make regression model part 1 appeared first on StepUp Analytics.
]]>Assumptions in terms of Y (i.e., ε(Y)): here I am not going into the details of writing out the equation; I will tell you what to do: just replace ε(i) with ε(Y).
And the mean will become = β(0)+β(1)*x and the variance will be the same, equal to σ^2.
Here is the Least Square Estimation of the Parameter we are going to discuss further.
Least Square Estimation(LSE):-
(x1,y1),(x2,y2), – – – – , (xn,yn)
So, the estimators β_0ˆ, β_1ˆ are the solutions of the equations
∑(Y(i)-β0ˆ-β1ˆ*xi) = 0 —–(1),
∑(Y(i)-β0ˆ-β1ˆ*xi)*xi = 0 —-(2)
The solutions of the above equations are attached as images; I am attaching images here because I can’t write mathematical equations here, so enjoy the snapshots.
Equation Solution:
Equation 2 solution:
Above we have calculated the parameters using least square estimator.
We have not discussed the benefits of using LSE. In one line: under the standard assumptions, the least squares estimators are the best linear unbiased estimators (the Gauss–Markov theorem).
Properties of Least Square Estimator:-
∑e(i) = ∑(y(i)−y_iˆ) = 0 (the residuals about the regression line sum to zero)
Statistical properties of LS estimation:-
β_1ˆ = [∑(x(i)-mean(x))*(y(i)-mean(y))]/∑(x(i)-mean(x))^2
= [∑(x(i)-mean(x))*y(i)]/∑(x(i)-mean(x))^2
so β_1ˆ = ∑c(i)*y(i), where c(i) = (x(i)-mean(x))/S(xx)
Similarly we can also do for β_0ˆ
β_0ˆ = mean(y)−β_1ˆ*mean(x)
= (1/n)∑y(i) – β_1ˆ*mean(x)
Take the value of beta 1 parameter from above equation.
Note:- I am not going to prove this; if you need the proof, please message me @ irrfankhann29@gmail.com and I will personally send my documents.
Similarly, we can calculate the variances of β_0ˆ and β_1ˆ
v(β_1ˆ) = [σ^2/S(xx)] :where S(xx) = ∑[(x_i – mean(x)]^2
v(β_0ˆ) = σ^2[(1/n)+{mean(x)^2}/S(xx)]
NOTE:- β_0ˆ & β_1ˆ are unbiased estimators of β_0 & β_1
Estimation of σ^2: it is obtained from the residual sum of squares,
SS(res) = SS(yy) – (β_1ˆ)^2 * S(xx), and σ^2ˆ = SS(res)/(n-2) = MS(res).
We now have the values of the coefficients and the sums of squares (regression, residual and total); using these we can test the null hypothesis, which is based on the t and z tests.
The t and z test value can be calculated using this formula:
Usually the variance (σ²) is unknown; if the variance (σ²) is unknown then we follow the t-test hypothesis
t = (β_1ˆ-β_1)/√(MS_res/S(xx))
if |t|>t[(α/2), (n-2)] we reject null hypothesis.
∴ |t| is the calculated value and t[(α/2), (n-2)] is the tabulated value.
And when (σ²) is known we follow the z-test hypothesis
z = (β_1ˆ-β_1)/√(σ²/S(xx)), which follows a standard normal distribution N(0,1)
if |z|>z(α/2) we reject null hypothesis.
∴ |z| is the calculated value and z(α/2) is the tabulated value.
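The formulas above can be exercised end to end on toy data (the x and y values below are invented); this plain-Python sketch computes the least-squares estimates, MS(res) and the t statistic for H0: β_1 = 0:

```python
import math

# Toy data (invented) to illustrate the least-squares formulas above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)                         # S(xx)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))    # S(xy)

b1 = sxy / sxx            # slope estimate beta_1^
b0 = ybar - b1 * xbar     # intercept estimate beta_0^

ss_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
ms_res = ss_res / (n - 2)               # unbiased estimate of sigma^2
t = b1 / math.sqrt(ms_res / sxx)        # t statistic for H0: beta_1 = 0
print(round(b1, 3), round(b0, 3), round(t, 2))
```

With these numbers the slope comes out near 2 and the t statistic is very large, so the slope would be judged highly significant against t[(α/2), (n-2)].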
Now we have all the values to calculate ANOVA table which will describe in next article, so stay tuned.
Queries or Docs/ notes related please shoot me an email on khanirfan.khan21@gmail.com.
The post Beginner to Advance level: Steps to make regression model part 1 appeared first on StepUp Analytics.
]]>The post Python pandas library for DataScience appeared first on StepUp Analytics.
]]>pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
import pandas as pd
df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
                   {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                   {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
                  index=['Store 1', 'Store 1', 'Store 2'])
df
Cost | Item Purchased | Name | |
Store 1 | 22.5 | Sponge | Chris |
Store 1 | 2.5 | Kitty Litter | Kevyn |
Store 2 | 5 | Spoon | Filip |
df['Date'] = ['December 1', 'January 1', 'mid-May']
df
Cost | Item Purchased | Name | Date | |
Store 1 | 22.5 | Sponge | Chris | 1-Dec |
Store 1 | 2.5 | Kitty Litter | Kevyn | 1-Jan |
Store 2 | 5 | Spoon | Filip | mid-May |
df['Delivered'] = True
df
Cost | Item Purchased | Name | Date | Delivered | |
Store 1 | 22.5 | Sponge | Chris | 1-Dec | TRUE |
Store 1 | 2.5 | Kitty Litter | Kevyn | 1-Jan | TRUE |
Store 2 | 5 | Spoon | Filip | mid-May | TRUE |
df['Feedback'] = ['Positive', None, 'Negative']
df
Cost | Item Purchased | Name | Date | Delivered | Feedback | |
Store 1 | 22.5 | Sponge | Chris | 1-Dec | TRUE | Positive |
Store 1 | 2.5 | Kitty Litter | Kevyn | 1-Jan | TRUE | None |
Store 2 | 5 | Spoon | Filip | mid-May | TRUE | Negative |
adf = df.reset_index()
adf['Date'] = pd.Series({0: 'December 1', 2: 'mid-May'})
adf
index | Cost | Item Purchased | Name | Date | Delivered | Feedback | |
0 | Store 1 | 22.5 | Sponge | Chris | 1-Dec | TRUE | Positive |
1 | Store 1 | 2.5 | Kitty Litter | Kevyn | NaN | TRUE | None |
2 | Store 2 | 5 | Spoon | Filip | mid-May | TRUE | Negative |
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())
Name Role
Kelly Director of HR
Sally Course liasion
James Grader
Name School
James Business
Mike Law
Sally Engineering
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
Role | School | |
Name | ||
James | Grader | Business |
Kelly | Director of HR | NaN |
Mike | NaN | Law |
Sally | Course liasion | Engineering |
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)
Role | School | |
Name | ||
Sally | Course liasion | Engineering |
James | Grader | Business |
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
Role | School | |
Name | ||
Kelly | Director of HR | NaN |
Sally | Course liasion | Engineering |
James | Grader | Business |
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
Role | School | |
Name | ||
James | Grader | Business |
Mike | NaN | Law |
Sally | Course liasion | Engineering |
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
Name | Role | School | |
0 | Kelly | Director of HR | NaN |
1 | Sally | Course liasion | Engineering |
2 | James | Grader | Business |
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
Location_x | Name | Role | Location_y | School | |
0 | State Street | Kelly | Director of HR | NaN | NaN |
1 | Washington Avenue | Sally | Course liasion | 512 Wilson Crescent | Engineering |
2 | Washington Avenue | James | Grader | 1024 Billiard Avenue | Business |
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
First Name | Last Name | Role | School | |
0 | Sally | Brooks | Course liasion | Engineering |
Output of the below code is not shared. Download the data and execute the below code to check the output.
import pandas as pd
df = pd.read_csv('census.csv')
df
(df.where(df['SUMLEV']==50) .dropna() .set_index(['STNAME','CTYNAME']) .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
df = df[df['SUMLEV']==50]
df.set_index(['STNAME','CTYNAME'], inplace=True)
df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})
import numpy as np
def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})
df.apply(min_max, axis=1)
import numpy as np
def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row
df.apply(min_max, axis=1)
rows = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
        'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df.apply(lambda x: np.max(x[rows]), axis=1)
By “group by” we refer to a process involving one or more of the following steps
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, one might wish to do one of the following:
Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
- Compute group sums or means
- Compute group sizes / counts
Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
- Standardizing data (zscore) within group
- Filling NAs within groups with a value derived from each group
Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
- Discarding data that belongs to groups with only a few members
- Filtering out data based on the group sum or mean
Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the above categories
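The three flavours of the apply step can be illustrated on a small frame (the 'team'/'score' columns below are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b", "c"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0, 9.0],
})
g = df.groupby("team")["score"]

# Aggregation: one summary value per group
sums = g.sum()

# Transformation: a like-indexed result (here, centre scores within each group)
centered = df["score"] - g.transform("mean")

# Filtration: keep only groups with more than one member
big = df.groupby("team").filter(lambda grp: len(grp) > 1)
print(sums.to_dict(), centered.tolist(), sorted(big["team"].unique()))
```

Aggregation collapses each group to a scalar, transformation returns a result the same length as the input, and filtration returns the original rows for the groups that pass the test.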
import pandas as pd
import numpy as np
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
df
%%timeit -n 10
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))
%%timeit -n 10
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))
df.head()
df = df.set_index('STNAME')

def fun(item):
    if item[0]<'M':
        return 0
    if item[0]<'Q':
        return 1
    return 2

for group, frame in df.groupby(fun):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
df.groupby('STNAME').agg({'CENSUS2010POP': np.average})
print(type(df.groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'])) print(type(df.groupby(level=0)['POPESTIMATE2010']))
(df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'] .agg({'avg': np.average, 'sum': np.sum}))
(df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'] .agg({'avg': np.average, 'sum': np.sum}))
(df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'] .agg({'POPESTIMATE2010': np.average, 'POPESTIMATE2011': np.sum}))
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)
df
Grades | |
excellent | A+ |
excellent | A |
excellent | A- |
good | B+ |
good | B |
good | B- |
ok | C+ |
ok | C |
ok | C- |
poor | D+ |
poor | D |
df['Grades'].astype('category').head()
excellent A+
excellent A
excellent A-
good B+
good B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, …, C+, C-, D, D+]
grades = df['Grades'].astype(pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-',
                                                             'B', 'B+', 'A-', 'A', 'A+'],
                                                 ordered=True))
grades.head()
excellent A+
excellent A
excellent A-
good B+
good B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C … B+ < A- < A < A+]
grades > 'C'
excellent True
excellent True
excellent True
good True
good True
good True
ok True
ok False
ok False
poor False
poor False
Name: Grades, dtype: bool
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg({'avg': np.average})
pd.cut(df['avg'],10)
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
Download the data to perform below exercise
#http://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64
df = pd.read_csv('cars.csv')
df.head()
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=[np.mean,np.min], margins=True)
pandas has proven very successful as a tool for working with time series data, especially in the financial data analysis space. Since the 0.8 release, the time series API in pandas has improved by leaps and bounds, using the new NumPy datetime64 dtype.
import pandas as pd
import numpy as np
pd.Timestamp('9/1/2016 10:05AM')
Timestamp(‘2016-09-01 10:05:00’)
pd.Period('1/2016')
Period(‘2016-01’, ‘M’)
pd.Period('3/5/2016')
Period(‘2016-03-05’, ‘D’)
t1 = pd.Series(list('abc'), [pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'),
                             pd.Timestamp('2016-09-03')])
t1
2016-09-01 a
2016-09-02 b
2016-09-03 c
dtype: object
type(t1.index)
pandas.core.indexes.datetimes.DatetimeIndex
t2 = pd.Series(list('def'), [pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])
t2
2016-09 d
2016-10 e
2016-11 f
Freq: M, dtype: object
type(t2.index)
pandas.core.indexes.period.PeriodIndex
Below are some good references for Python datetime conversion.
strptime: Python 2, Python 3
strftime format mask: Python 2, Python 3
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']
ts3 = pd.DataFrame(np.random.randint(10, 100, (4,2)), index=d1, columns=list('ab'))
ts3
a | b | |
2-Jun-13 | 51 | 19 |
29-Aug-14 | 32 | 68 |
6/26/2015 | 59 | 18 |
7/12/2016 | 54 | 58 |
ts3.index = pd.to_datetime(ts3.index) ts3
a b
2013-06-02 51 19
2014-08-29 32 68
2015-06-26 59 18
2016-07-12 54 58
pd.to_datetime('4.7.12', dayfirst=True)
Timestamp(‘2012-07-04 00:00:00’)
Timedeltas are differences in times, expressed in difference units, e.g. days, hours, minutes, seconds. They can be both positive and negative
pd.Timestamp('9/3/2016')-pd.Timestamp('9/1/2016')
Timedelta(‘2 days 00:00:00’)
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')
Timestamp(‘2016-09-14 11:10:00’)
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
dates
DatetimeIndex([‘2016-10-02’, ‘2016-10-16’, ‘2016-10-30’, ‘2016-11-13’,
‘2016-11-27’, ‘2016-12-11’, ‘2016-12-25’, ‘2017-01-08’,
‘2017-01-22′],
dtype=’datetime64[ns]’, freq=’2W-SUN’)
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df
Count 1 | Count 2 | |
10/2/2016 | 107 | 116 |
10/16/2016 | 115 | 120 |
10/30/2016 | 110 | 125 |
11/13/2016 | 112 | 118 |
11/27/2016 | 111 | 117 |
12/11/2016 | 117 | 115 |
12/25/2016 | 113 | 121 |
1/8/2017 | 110 | 129 |
1/22/2017 | 117 | 120 |
df.index.weekday_name
Index([‘Sunday’, ‘Sunday’, ‘Sunday’, ‘Sunday’, ‘Sunday’, ‘Sunday’, ‘Sunday’,
‘Sunday’, ‘Sunday’],
dtype=’object’)
df.diff()
Count 1 | Count 2 | |
10/2/2016 | NaN | NaN |
10/16/2016 | 8 | 4 |
10/30/2016 | -5 | 5 |
11/13/2016 | 2 | -7 |
11/27/2016 | -1 | -1 |
12/11/2016 | 6 | -2 |
12/25/2016 | -4 | 6 |
1/8/2017 | -3 | 8 |
1/22/2017 | 7 | -9 |
df.resample('M').mean()
Count 1 | Count 2 | |
10/31/2016 | 110.666667 | 120.333333 |
11/30/2016 | 111.5 | 117.5 |
12/31/2016 | 115 | 118 |
1/31/2017 | 113.5 | 124.5 |
df['2017']
Count 1 | Count 2 | |
1/8/2017 | 110 | 129 |
1/22/2017 | 117 | 120 |
df['2016-12']
Count 1 | Count 2 | |
12/11/2016 | 117 | 115 |
12/25/2016 | 113 | 121 |
df['2016-12':]
Count 1 | Count 2 | |
12/11/2016 | 117 | 115 |
12/25/2016 | 113 | 121 |
1/8/2017 | 110 | 129 |
1/22/2017 | 117 | 120 |
df.asfreq('W', method='ffill')
Check The Result
import matplotlib.pyplot as plt
%matplotlib inline

df.plot()
For Complete Article on Visualization Refer:
The post Python pandas library for DataScience appeared first on StepUp Analytics.
]]>The post Beginners guide to Statistical Cluster Analysis in detail part-2 appeared first on StepUp Analytics.
]]>Non-HCA methods start from either-
or
NOTE: One way to start is to randomly select seed points from among the items, or to randomly partition the data (i.e. items/objects) into initial groups.
K-means method or the method of iterative relocation:-
K-means is an algorithm that assigns each object to the cluster having the nearest centroid/nuclei.
Algorithm:
NOTE:- Rather than starting with a partition of all items into k initial groups (as in step 1), we can also choose K initial centroids/nuclei (seed points) and then proceed to step 2 after a first pass through the data.
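The algorithm can be condensed into a minimal sketch; this is a pure-Python 1-D version with invented data and fixed seed points, not production code:

```python
# Minimal 1-D k-means following the steps above (toy data, fixed seed points).
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids = [0.0, 5.0]  # initial seed points (step 1)

for _ in range(10):  # iterate until the centroids stop moving
    # Step 2: assign each item to the cluster with the nearest centroid
    clusters = [[], []]
    for p in points:
        k = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[k].append(p)
    # Step 3: recompute each centroid from its new membership
    new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    if new == centroids:
        break
    centroids = new

print(centroids)  # the two cluster means
```

On this toy data the assignments stabilise after one relocation pass, with the two centroids settling at the means of the two obvious groups.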
A hand-written solved example is attached below; go through the steps and try to understand them. If anything is unclear, please shoot me an email @ irrfankhann29@gmail.com.
statistical non-hierarchical cluster analysis – this is the PDF file of the example
CLUSTER CRITERIA:- Comparing different partitions-
Objective:- to have a criterion for the optimum partition of the data, so that, for a given set of cases and a given number of clusters, the problem reduces to partitioning the data into ‘g’ clusters such that the clustering criterion is optimized.
We can write as following
Then within cluster sum of square (SS) & cross product matrix
==> pooled within cluster scatter matrix ‘g’ cluster
The between cluster SS and cross product matrix
Popular clustering criteria are based on univariate functions of S(b), S(w) or ∑.
Will share the criteria in the next part of the cluster analysis series. Till then stay tuned, or practice your skills on cluster analysis; if you have any doubts please ask me by shooting an email @ irrfankhann29@gmail.com.
Article Originally posted Here
The post Beginners guide to Statistical Cluster Analysis in detail part-2 appeared first on StepUp Analytics.
]]>The post Beginners guide to Statistical Cluster Analysis in detail part-1 appeared first on StepUp Analytics.
]]>Hierarchical cluster Analysis(HCA):
There are two types of approaches for HCA:
Agglomerative HCA:
Divisive HCA:
Note: Result of both the approaches are displayed through the dendrogram tree.
Steps Involved in Agglomerative HCA:
Possible distance measures between two clusters:
Here i∈k1, j∈k2
Distance between cluster 1 &2 ?
min[d(1,2),(3,4,5)] =
Under single linkage approach min[d(1,2),(3,4,5)] = d(2,5)
Here is the example of single linkage attached in pdf
New Doc 2017-12-19
d(1,3), d(1,4),d(1,5) | d(2,3), d(2,4),d(2,5)
complete linkage distance between cluster 1 and 2 = d(1,4)
Here is the complete linkage example attached
New Doc 2017-12-19 (1)
Average linkage distance between clusters
= (1/6)∑d(i,j), where i runs over the two items of cluster 1 and j over the three items of cluster 2 (6 pairs in all)
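The three linkage rules can be compared on a toy configuration; the item positions below are invented (chosen so that, echoing the example above, the single-linkage minimum is attained by d(2,5) and the complete-linkage maximum by d(1,4)):

```python
from itertools import product

# Toy 1-D items; cluster 1 = {1, 2}, cluster 2 = {3, 4, 5} (positions invented)
pos = {1: 0.0, 2: 1.0, 3: 4.0, 4: 7.0, 5: 2.5}
c1, c2 = [1, 2], [3, 4, 5]

# The 2*3 = 6 pairwise distances between the clusters
pairs = [abs(pos[i] - pos[j]) for i, j in product(c1, c2)]

single = min(pairs)                # nearest pair of points
complete = max(pairs)              # farthest pair of points
average = sum(pairs) / len(pairs)  # mean over all 6 pairs
print(single, complete, average)
```

Single linkage tends to chain clusters together through close pairs, complete linkage favours compact clusters, and average linkage sits between the two.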
Hierarchical cluster analysis ends here, in the next tutorial article I will explain Non-Hierarchical cluster analysis.
Till then stay tuned and keep visiting for learning tutorials which you won’t get anywhere else.
If you have any doubts please mention in comments or shoot me an email @ irrfankhann29@gmail.com.
This article posted here .
The post Beginners guide to Statistical Cluster Analysis in detail part-1 appeared first on StepUp Analytics.
]]>The post Applied Data Science with Python – Part 2 appeared first on StepUp Analytics.
]]>We will quickly start with a non-comprehensive overview of the fundamental data structures in pandas. The fundamental behavior of data types, indexing, and axis labeling/alignment applies across all of these objects. To get started, load pandas into your namespace.
import pandas as pd
pd.Series?
We’ll create variable animals of string type and convert it into pandas series.
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
animals = ['Tiger', 'Bear', None]
pd.Series(animals)
We’ll create pandas series of Integer type
numbers = [1, 2, 3]
pd.Series(numbers)
numbers = [1, 2, None]
pd.Series(numbers)
Import numpy
To construct a DataFrame with missing data, use np.nan
for those values which are missing. Alternatively, you may pass a numpy.MaskedArray
as the data argument to the DataFrame constructor, and its masked entries will be considered missing
import numpy as np
np.nan == None
NaN (not a number) is the standard missing data marker used in pandas
np.nan == np.nan
To make detecting missing values easier (and across different array dtypes), pandas provides the isna()
and notna()
functions, which are also methods on Series
and DataFrame
objects
np.isnan(np.nan)
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
s.index
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
s.iloc[3]
‘South Korea’
s.loc['Golf']
‘Scotland’
s[3]
‘South Korea’
s['Golf']
‘Scotland’
sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0] #This won't call s.iloc[0] as one might expect, it generates an error instead
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s
0 100.0
1 120.0
2 101.0
3 3.0
dtype: float64
total = 0
for item in s:
    total+=item
print(total)
324.0
import numpy as np
total = np.sum(s)
print(total)
324.0
#this creates a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
s.head()
0 486
1 951
2 111
3 142
4 457
dtype: int64
len(s)
10000
%%timeit -n 100
summary = 0
for item in s:
    summary+=item
100 loops, best of 3: 1.31 ms per loop
%%timeit -n 100
summary = np.sum(s)
100 loops, best of 3: 74.3 µs per loop
s+=2 #adds two to each item in s using broadcasting
s.head()
0 488
1 953
2 113
3 144
4 459
dtype: int64
for label, value in s.iteritems():
    s.set_value(label, value+2)
s.head()
0 490
1 955
2 115
3 146
4 461
dtype: int64
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.iteritems():
    s.loc[label]= value+2
10 loops, best of 3: 966 ms per loop
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
10 loops, best of 3: 317 µs per loop
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s
0 1
1 2
2 3
Animal Bears
dtype: object
original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'],
                                     index=['Cricket', 'Cricket', 'Cricket', 'Cricket'])
all_countries = original_sports.append(cricket_loving_countries)
original_sports
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
dtype: object
cricket_loving_countries
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
all_countries
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
all_countries.loc['Cricket']
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
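Note that Series.append was removed in newer pandas releases (2.0+); a sketch of the same concatenation with pd.concat, using a shortened version of the series above:

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])

# pd.concat replaces the removed Series.append and produces the same result
all_countries = pd.concat([original_sports, cricket])
```

As with append, the duplicate 'Cricket' labels are kept, and all_countries.loc['Cricket'] returns all of them.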
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, an SQL table, or a dict of Series objects. Below are some examples covering different scenarios.
import pandas as pd

purchase_1 = pd.Series({'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()
Cost | Item Purchased | Name | |
Store 1 | 22.5 | Dog Food | Chris |
Store 1 | 2.5 | Kitty Litter | Kevyn |
Store 2 | 5 | Bird Seed | Vinod |
df.loc['Store 2']
Cost 5
Item Purchased Bird Seed
Name Vinod
Name: Store 2, dtype: object
type(df.loc['Store 2'])
pandas.core.series.Series
df.loc['Store 1']
Cost | Item Purchased | Name | |
Store 1 | 22.5 | Dog Food | Chris |
Store 1 | 2.5 | Kitty Litter | Kevyn |
df.loc['Store 1', 'Cost']
Store 1 22.5
Store 1 2.5
Name: Cost, dtype: float64
DataFrame.T is used to transpose the DataFrame
df.T
Store 1 | Store 1 | Store 2 | |
Cost | 22.5 | 2.5 | 5 |
Item Purchased | Dog Food | Kitty Litter | Bird Seed |
Name | Chris | Kevyn | Vinod |
df.T.loc['Cost']
Store 1 22.5
Store 1 2.5
Store 2 5
Name: Cost, dtype: object
df['Cost']
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
df.loc['Store 1']['Cost']
Store 1 22.5
Store 1 2.5
Name: Cost, dtype: float64
df.loc[:,['Name', 'Cost']]
Name | Cost | |
Store 1 | Chris | 22.5 |
Store 1 | Kevyn | 2.5 |
Store 2 | Vinod | 5 |
df.drop('Store 1')
Cost | Item Purchased | Name | |
Store 2 | 5 | Bird Seed | Vinod |
df
Cost | Item Purchased | Name | |
Store 1 | 22.5 | Dog Food | Chris |
Store 1 | 2.5 | Kitty Litter | Kevyn |
Store 2 | 5 | Bird Seed | Vinod |
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df
Cost | Item Purchased | Name | |
Store 2 | 5 | Bird Seed | Vinod |
copy_df.drop?
del copy_df['Name']
copy_df
Cost | Item Purchased | |
Store 2 | 5 | Bird Seed |
df['Location'] = None
df
Cost | Item Purchased | Name | Location | |
Store 1 | 22.5 | Dog Food | Chris | None |
Store 1 | 2.5 | Kitty Litter | Kevyn | None |
Store 2 | 5 | Bird Seed | Vinod | None |
costs = df['Cost']
costs
costs += 2
costs
df
Cost | Item Purchased | Name | Location | |
Store 1 | 24.5 | Dog Food | Chris | None |
Store 1 | 4.5 | Kitty Litter | Kevyn | None |
Store 2 | 7 | Bird Seed | Vinod | None |
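Because costs above is a view into df, the in-place addition changed the original frame. A sketch of how to avoid that, assuming you want costs to be independent of df:

```python
import pandas as pd

df = pd.DataFrame({'Cost': [22.5, 2.5, 5.0]}, index=['Store 1', 'Store 1', 'Store 2'])

# Taking an explicit copy keeps the increment from leaking back into df
costs = df['Cost'].copy()
costs += 2
```

After this, costs holds 24.5, 4.5 and 7.0, while df['Cost'] is unchanged.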
Download the data set and run the commands below to check the output.
Data Set Used: Download
df = pd.read_csv('olympics.csv')
df.head()
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
df.head()
df.columns
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
df.head()
df['Gold'] > 0
only_gold = df.where(df['Gold'] > 0)
only_gold.head()
only_gold['Gold'].count()
df['Gold'].count()
only_gold = only_gold.dropna()
only_gold.head()
only_gold = df[df['Gold'] > 0]
only_gold.head()
len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])
df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]
df.head()
df['country'] = df.index
df = df.set_index('Gold')
df.head()
df = df.reset_index()
df.head()
Download the Data
df = pd.read_csv('census.csv')
df.head()
df['SUMLEV'].unique()
df = df[df['SUMLEV'] == 50]
df.head()
columns_to_keep = ['STNAME', 'CTYNAME',
                   'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012',
                   'BIRTHS2013', 'BIRTHS2014', 'BIRTHS2015',
                   'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                   'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()
df = df.set_index(['STNAME', 'CTYNAME'])
df.head()
df.loc['Michigan', 'Washtenaw County']
df.loc[ [('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')] ]
In this section, we will discuss missing (also referred to as NA) values in pandas
Download the Data
df = pd.read_csv('log.csv')
df
df.fillna?
df = df.set_index('time')
df = df.sort_index()
df
df = df.reset_index()
df = df.set_index(['time', 'user'])
df
df = df.fillna(method='ffill')
df.head()
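In recent pandas releases, fillna(method='ffill') is deprecated; a sketch of the same forward fill with the .ffill() method, on a small stand-in series since log.csv may not be at hand:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'volume': [10.0, np.nan, np.nan, 12.0]})

# .ffill() propagates the last valid observation forward over the NaN gaps
filled = df['volume'].ffill()
```

Each NaN is replaced by the most recent non-missing value above it, so the result is 10.0, 10.0, 10.0, 12.0. There is a matching .bfill() that fills backwards instead.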
The post Applied Data Science with Python – Part 2 appeared first on StepUp Analytics.
The post Applied Data Science with Python – Part 1 appeared first on StepUp Analytics.
x = 1
y = 2
x + y
add_numbers is a function that takes two numbers and adds them together.
def add_numbers(x, y):
    return x + y

add_numbers(1, 2)
add_numbers updated to take an optional third parameter. Using print allows printing of multiple expressions within a single cell.
def add_numbers(x, y, z=None):
    if z is None:
        return x + y
    else:
        return x + y + z

print(add_numbers(1, 2))
print(add_numbers(1, 2, 3))
add_numbers updated to take an optional flag parameter.
def add_numbers(x, y, z=None, flag=False):
    if flag:
        print('Flag is true!')
    if z is None:
        return x + y
    else:
        return x + y + z

print(add_numbers(1, 2, flag=True))
Assign the function add_numbers to the variable a.
def add_numbers(x, y):
    return x + y

a = add_numbers
a(1, 2)
Use type to return the object's type.
type('This is a string')
type(None)
type(1)
type(1.0)
type(add_numbers)
Tuples are an immutable data structure (cannot be altered).
x = (1, 'a', 2, 'b')
type(x)
Lists are a mutable data structure.
x = [1, 'a', 2, 'b']
type(x)
Use append to add an object to the end of a list.
x.append(3.3)
print(x)
This is an example of how to loop through each item in the list.
for item in x:
    print(item)
Or using the indexing operator:
i = 0
while i != len(x):
    print(x[i])
    i = i + 1
Use + to concatenate lists.
[1,2] + [3,4]
Use * to repeat lists.
[1]*3
Use the in operator to check if something is inside a list.
1 in [1, 2, 3]
Now let’s look at strings. Use bracket notation to slice a string.
x = 'This is a string'
print(x[0])    # first character
print(x[0:1])  # first character, but we have explicitly set the end character
print(x[0:2])  # first two characters
This will return the last element of the string.
x[-1]
This will return the slice starting from the 4th element from the end and stopping before the 2nd element from the end.
x[-4:-2]
This is a slice from the beginning of the string and stopping before the 3rd element.
x[:3]
And this is a slice starting from the 3rd element of the string and going all the way to the end.
x[3:]
firstname = 'Christopher'
lastname = 'Brooks'
print(firstname + ' ' + lastname)
print(firstname * 3)
print('Chris' in firstname)
split returns a list of all the words in a string, or a list split on a specific character.
firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0]   # [0] selects the first element of the list
lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1]   # [-1] selects the last element of the list
print(firstname)
print(lastname)
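The inverse of split is join, which glues a list of strings back together on a separator:

```python
# split breaks the name into words; ' '.join reassembles them
parts = 'Christopher Arthur Hansen Brooks'.split(' ')
rejoined = ' '.join(parts)
```

Here rejoined is identical to the original string.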
Make sure you convert objects to strings before concatenating.
'Chris' + 2
'Chris' + str(2)
Dictionaries associate keys with values.
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}
x['Christopher Brooks']  # Retrieve a value by using the indexing operator
x['Kevyn Collins-Thompson'] = None
x['Kevyn Collins-Thompson']
Iterate over all of the keys:
for name in x:
    print(x[name])
Iterate over all of the values:
for email in x.values():
    print(email)
Iterate over all of the items in the dictionary:
for name, email in x.items():
    print(name)
    print(email)
You can unpack a sequence into different variables:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu')
fname, lname, email = x
fname
lname
Make sure the number of values you are unpacking matches the number of variables being assigned.
x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')
fname, lname, email = x
print('Chris' + 2)
print('Chris' + str(2))
Python has a built-in method for convenient string formatting.
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}
sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'
print(sales_statement.format(sales_record['person'],
                             sales_record['num_items'],
                             sales_record['price'],
                             sales_record['num_items'] * sales_record['price']))
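In Python 3.6+, f-strings express the same statement more directly; a sketch using the same sales_record:

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}

# f-strings interpolate the expressions in-place, no .format call needed
total = sales_record['num_items'] * sales_record['price']
sales_statement = (f"{sales_record['person']} bought {sales_record['num_items']} "
                   f"item(s) at a price of {sales_record['price']} each "
                   f"for a total of {total}")
print(sales_statement)
```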
Let’s import our datafile mpg.csv, which contains fuel economy data for 234 cars. Download
import csv

%precision 2

with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))

mpg[:3]  # The first three dictionaries in our list.
csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list is composed of 234 dictionaries.
len(mpg)
keys gives us the column names of our csv.
mpg[0].keys()
This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.
sum(float(d['cty']) for d in mpg) / len(mpg)
Similarly this is how to find the average hwy fuel economy across all cars.
sum(float(d['hwy']) for d in mpg) / len(mpg)
Use set to return the unique values for the number of cylinders the cars in our dataset have.
cylinders = set(d['cyl'] for d in mpg)
cylinders
Here’s a more complex example where we are grouping the cars by number of cylinders, and finding the average cty mpg for each group.
CtyMpgByCyl = []

for c in cylinders:                # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg:                  # iterate over all dictionaries
        if d['cyl'] == c:          # if the cylinder level matches,
            summpg += float(d['cty'])  # add the cty mpg
            cyltypecount += 1      # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount))  # append the tuple ('cylinder', 'avg mpg')

CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl
Use set to return the unique values for the class types in our dataset.
vehicleclass = set(d['class'] for d in mpg)  # what are the class types
vehicleclass
And here’s an example of how to find the average hwy mpg for each class of vehicle in our dataset.
HwyMpgByClass = []

for t in vehicleclass:             # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg:                  # iterate over all dictionaries
        if d['class'] == t:        # if the vehicle class matches,
            summpg += float(d['hwy'])  # add the hwy mpg
            vclasscount += 1       # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount))  # append the tuple ('class', 'avg mpg')

HwyMpgByClass.sort(key=lambda x: x[1])
HwyMpgByClass
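The grouping loops above can be written more compactly with collections.defaultdict; this sketch uses a few inline records in place of the mpg.csv rows:

```python
from collections import defaultdict

# stand-in rows shaped like the csv.DictReader dictionaries above
records = [{'class': 'suv', 'hwy': '17'}, {'class': 'suv', 'hwy': '19'},
           {'class': 'compact', 'hwy': '30'}]

totals = defaultdict(lambda: [0.0, 0])  # class -> [running sum, count]
for d in records:
    totals[d['class']][0] += float(d['hwy'])
    totals[d['class']][1] += 1

# one pass over the accumulator gives the (class, average) tuples
avg_by_class = sorted((cls, s / n) for cls, (s, n) in totals.items())
```

This avoids the nested loop over the whole dataset once per group: each row is visited exactly once.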
import datetime as dt
import time as tm
time returns the current time in seconds since the Epoch (January 1st, 1970).
tm.time()
Convert the timestamp to datetime.
dtnow = dt.datetime.fromtimestamp(tm.time())
dtnow
Handy datetime attributes:
dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second  # get year, month, day, etc. from a datetime
timedelta is a duration expressing the difference between two dates.
delta = dt.timedelta(days=100)  # create a timedelta of 100 days
delta
date.today returns the current local date.
today = dt.date.today()
today - delta # the date 100 days ago
today > today-delta # compare dates
An example of a class in python:
class Person:
    department = 'School of Information'  # a class variable

    def set_name(self, new_name):  # a method
        self.name = new_name

    def set_location(self, new_location):
        self.location = new_location
person = Person()
person.set_name('Christopher Brooks')
person.set_location('Ann Arbor, MI, USA')
print('{} lives in {} and works in the department {}'.format(person.name, person.location, person.department))
Here’s an example of mapping the min function between two lists.
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]
cheapest = map(min, store1, store2)
cheapest
Now let’s iterate through the map object to see the values.
for item in cheapest:
    print(item)
Here’s an example of lambda that takes in three parameters and adds the first two.
my_function = lambda a, b, c : a + b
my_function(1, 2, 3)
Let’s iterate from 0 to 999 and return the even numbers.
my_list = []
for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)
my_list
Now the same thing but with list comprehension.
my_list = [number for number in range(0,1000) if number % 2 == 0] my_list
import numpy as np
Create a list and convert it to a numpy array
mylist = [1, 2, 3]
x = np.array(mylist)
x
Or just pass in a list directly
y = np.array([4, 5, 6])
y
Pass in a list of lists to create a multidimensional array.
m = np.array([[7, 8, 9], [10, 11, 12]])
m
Use the shape attribute to find the dimensions of the array (rows, columns).
m.shape
arange returns evenly spaced values within a given interval.
n = np.arange(0, 30, 2)  # start at 0, count up by 2, stop before 30
n
reshape returns an array containing the same data with a new shape.
n = n.reshape(3, 5)  # reshape array to be 3x5
n
linspace returns evenly spaced numbers over a specified interval.
o = np.linspace(0, 4, 9)  # return 9 evenly spaced values from 0 to 4
o
resize changes the shape and size of an array in-place.
o.resize(3, 3)
o
ones returns a new array of given shape and type, filled with ones.
np.ones((3, 2))
zeros returns a new array of given shape and type, filled with zeros.
np.zeros((2, 3))
eye returns a 2-D array with ones on the diagonal and zeros elsewhere.
np.eye(3)
diag extracts a diagonal or constructs a diagonal array.
np.diag(y)
Create an array using a repeating list (or see np.tile).
np.array([1, 2, 3] * 3)
Repeat elements of an array using repeat.
np.repeat([1, 2, 3], 3)
p = np.ones([2, 3], int)
p
Use vstack to stack arrays in sequence vertically (row-wise).
np.vstack([p, 2*p])
Use hstack to stack arrays in sequence horizontally (column-wise).
np.hstack([p, 2*p])
Use +, -, *, / and ** to perform element-wise addition, subtraction, multiplication, division and power.
print(x + y)  # elementwise addition       [1 2 3] + [4 5 6] = [5 7 9]
print(x - y)  # elementwise subtraction    [1 2 3] - [4 5 6] = [-3 -3 -3]
print(x * y)  # elementwise multiplication [1 2 3] * [4 5 6] = [4 10 18]
print(x / y)  # elementwise division       [1 2 3] / [4 5 6] = [0.25 0.4 0.5]
print(x**2)   # elementwise power          [1 2 3] ** 2 = [1 4 9]
x.dot(y) # dot product 1*4 + 2*5 + 3*6
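Since Python 3.5, the @ operator is an equivalent spelling of the dot product for NumPy arrays:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# x @ y computes the same dot product as x.dot(y): 1*4 + 2*5 + 3*6
product = x @ y
```

Both forms return 32 here; @ reads more naturally in longer matrix expressions.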
z = np.array([y, y**2])
print(len(z))  # number of rows of array
Let’s look at transposing arrays. Transposing permutes the dimensions of the array.
z = np.array([y, y**2])
z
The shape of array z is (2, 3) before transposing.
z.shape
Use .T to get the transpose.
z.T
The number of rows has swapped with the number of columns.
z.T.shape
Use .dtype to see the data type of the elements in the array.
z.dtype
Use .astype to cast to a specific type.
z = z.astype('f')
z.dtype
NumPy has many built-in math functions that can be performed on arrays.
a = np.array([-4, -2, 1, 3, 5])
a.sum()
a.max()
a.min()
a.mean()
a.std()
argmax and argmin return the index of the maximum and minimum values in the array.
a.argmax()
a.argmin()
s = np.arange(13)**2
s
Use bracket notation to get the value at a specific index. Remember that indexing starts at 0.
s[0], s[4], s[-1]
Use : to indicate a range: array[start:stop]. Leaving start or stop empty will default to the beginning/end of the array.
s[1:5]
Use negatives to count from the back.
s[-4:]
A second : can be used to indicate step size: array[start:stop:stepsize].
Here we are starting at the 5th element from the end, and counting backwards by 2 until the beginning of the array is reached.
s[-5::-2]
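A step of -1 with empty start and stop reverses the whole array:

```python
import numpy as np

s = np.arange(13)**2  # the same squares series as above

# [::-1] walks the array from the last element back to the first
reversed_s = s[::-1]
```

The result starts at 144 and ends at 0.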
Let’s look at a multidimensional array.
r = np.arange(36)
r.resize((6, 6))
r
Use bracket notation to slice: array[row, column]
r[2, 2]
And use : to select a range of rows or columns
r[3, 3:6]
r[:2, :-1]
This is a slice of the last row, and only every other element.
r[-1, ::2]
We can also perform conditional indexing. Here we are selecting values from the array that are greater than 30. (Also see np.where)
r[r > 30]
Here we are assigning all values in the array that are greater than 30 to the value of 30.
r[r > 30] = 30
r
Be careful with copying and modifying arrays in NumPy!
r2 is a slice of r:
r2 = r[:3, :3]
r2
Set this slice’s values to zero ([:] selects the entire array)
r2[:] = 0
r2
r has also been changed!
r
To avoid this, use r.copy to create a copy that will not affect the original array.
r_copy = r.copy()
r_copy
Now when r_copy is modified, r will not be changed.
r_copy[:] = 10
print(r_copy, '\n')
print(r)
Let’s create a new 4 by 3 array of random numbers 0-9.
test = np.random.randint(0, 10, (4, 3))
test
Iterate by row:
for row in test:
    print(row)
Iterate by index:
for i in range(len(test)):
    print(test[i])
Iterate by row and index:
for i, row in enumerate(test):
    print('row', i, 'is', row)
Use zip to iterate over multiple iterables.
test2 = test**2
test2
for i, j in zip(test, test2):
    print(i, '+', j, '=', i + j)
Next On Data Science Using Python:
The Series Data Structure
The DataFrame Data Structure
Dataframe Indexing and Loading
Querying a DataFrame
Indexing Dataframes
Missing values