Market Basket Analysis
Market Basket Analysis using Apriori Algorithm

Market Basket Analysis in R: From Sellers to Intelligent Sellers

The Main Idea:
The term ‘e-commerce’ is familiar to most of us: it means trade and business conducted over the internet, popularly known as ‘online shopping’. Nowadays, retailers who traditionally sold their products strictly in ‘brick-and-mortar’ stores also display their products online and sell them through various platforms.
Both customers and sellers benefit from this. Customers can search for the products they want and compare prices online, while sellers can run their trade in a cost-effective manner.
One of the biggest perks of an online presence for a seller is that the recorded sales data reveals customer behaviour, which makes it possible to correct past mistakes in business policy.
However, generating this data and digging into it for useful insights requires a systematic algorithm. One widely used algorithm is ‘Apriori’. Such algorithms typically need trained marketing analysts to run them and to interpret the results.
This is where ‘Market Basket Analysis’ comes in. It is now a very common procedure, performed not only by online retailers but also by sellers who prefer physical ‘brick-and-mortar’ stores.
The term ‘market basket’ refers to the bundle of goods a customer picks up for final purchase. Such a bundle need not consist of a single product; it also covers a customer buying several different items in one go, which together make up his or her ‘market basket’.
How such an analysis helps:
Market basket analysis gives the retailer information about the purchase behaviour of buyers. This information helps the retailer understand the buyer’s needs and redesign the store’s layout accordingly, develop cross-promotional programs, or even capture new buyers, all of which help it survive in the market.
Well-known examples include ‘Amazon’, ‘Flipkart’, ‘eBay’, etc.

Market basket analysis: The Basics

Terminology:
Items are the objects that we are identifying associations between. For an online retailer, each item is a product in the shop. For a publisher, each item might be an article, a blog post, a video etc. A group of items is an item set.
Transactions are instances of groups of items occurring together. For an online retailer, a transaction is generally a monetary transaction. For a publisher, a transaction might be the group of articles read in a single visit to the website. (It is up to the analyst to define over what period to measure a transaction.) For each transaction, then, we have an item set.

Rules are statements of the form

{i_1, i_2, …} => {i_k}

i.e. if a transaction contains the items in the item set on the left-hand side (LHS) of the rule, {i_1, i_2, …}, then it is likely that the visitor will also be interested in the item on the right-hand side (RHS), {i_k}. For example, one of the rules mined later in this article is:

{462-5406} => {462-5726}

The output of a market basket analysis is generally a set of rules, that we can then exploit to make business decisions (related to marketing or product placement, for example).
The support of an item or an item set is the fraction of transactions in our data set that contain that item or item set. In general, it is useful to identify rules that have a high support, as these will be applicable to a large number of transactions. For supermarket retailers, this is likely to involve basic products that are popular across the entire customer base (e.g. bread, milk). A printer cartridge retailer, on the other hand, may not have any products with a high support, because each customer only buys cartridges that are specific to his or her own printer.
The confidence of a rule is the likelihood that it is true for a new transaction that contains the items on the LHS of the rule, i.e. the probability that the transaction also contains the item(s) on the RHS. Formally:

confidence(LHS => RHS) = support(LHS ∪ RHS) / support(LHS)

The lift of a rule is the ratio of the support of the items on the LHS of the rule co-occurring with the items on the RHS, to the probability that the LHS and RHS would co-occur if the two were independent:

lift(LHS => RHS) = support(LHS ∪ RHS) / (support(LHS) × support(RHS))

If the lift is greater than 1, it suggests that the presence of the items on the LHS has increased the probability that the items on the RHS will occur in the transaction. If the lift is below 1, it suggests that the presence of the items on the LHS lowers the probability that the items on the RHS will be part of the transaction.
If the lift is 1, it suggests that the items on the LHS and RHS really are independent, i.e. knowing that the items on the LHS are present makes no difference to the probability that the items on the RHS will occur.
When we perform market basket analysis, then, we are looking for rules with a lift of more than 1. Rules with higher confidence are ones where the probability of an item appearing on the RHS is high given the presence of the items on the LHS. It is also preferable to act on rules that have a high support, as these will be applicable to a larger number of transactions.
However, in the case of long-tail retailers, this may not be possible. In practice, it is often not possible to maximize support and confidence at the same time. In businesses that deal with products in relatively low demand, it is advisable to maximize confidence while keeping support at or above an acceptable threshold.
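
To make the three measures concrete, here is a minimal sketch in base R on a made-up set of five baskets. The item names, the `baskets` list and the `support()` helper are illustrative assumptions; none of them come from the data set analysed below.

```r
# Toy data: five hypothetical baskets (not from the article's data set)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "bread"
rhs <- "milk"

sup_rule   <- support(c(lhs, rhs), baskets)        # LHS and RHS together
confidence <- sup_rule / support(lhs, baskets)     # share of LHS baskets that also hold RHS
lift       <- sup_rule / (support(lhs, baskets) * support(rhs, baskets))

print(c(support = sup_rule, confidence = confidence, lift = lift))
```

In this toy set, bread and milk each appear in four of the five baskets and together in three, so the rule {bread} => {milk} has support 0.6, confidence 0.75 and a lift just below 1: buying bread makes milk very slightly less likely than its baseline.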
The following steps take us through the analytical process of Market Basket Analysis using R:

Implementing Market Basket Analysis using Apriori Algorithm

First we read the data set of transactions.
The data set used in this analysis is “AprioriTransactionsReduced.csv”, a CSV file.
If anyone needs access to this data set, it is available from the link below.
data set – AprioriTransactionsReduced.csv
We now set the working directory and import the CSV file into R.
After importing the file, we look at its initial structure.

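setwd('G:/StepUpAnalytics.com/Arka')  # forward slashes: a lone backslash starts an escape sequence in an R string
df <- read.csv("AprioriTransactionsReduced.csv", stringsAsFactors = TRUE)  # factors, as in the output below (the default before R 4.0)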
str(df)
## 'data.frame':    40236 obs. of  2 variables:
##  $ Invoice: Factor w/ 10246 levels "110002","110003",..: 294 637 637 637 822 2814 2814 2965 3173 5232 ...
##  $ Item   : Factor w/ 6822 levels "001-0012","001-0013",..: 4579 4719 4827 5217 23 4719 4827 23 4 23 ...

We sort the data set in ascending order of ‘Invoice’ and take a brief look at the sorted data. The last row of the tail() output (INVOICE / Part) appears to be a stray header line in the raw file.

df_sorted <- df[order(df$Invoice),]
head(df_sorted)
##    Invoice     Item
## 24  110002 340-7800
## 25  110003 086-0604
## 26  110003 086-1568
## 27  110003 086-3126
## 28  110003 138-0116
## 29  110003 352-4505
tail(df_sorted)
##       Invoice     Item
## 40059   41102 084-3926
## 40060   41102 404-0420
## 40061   41102 404-0438
## 40062   41102 436-1265
## 40063   41102 436-1275
## 40064 INVOICE     Part

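We convert ‘Invoice’ to numeric and check its class. Because ‘Invoice’ was read in as a factor, as.numeric() returns the internal level codes rather than the invoice numbers themselves, which is why the str() output further down shows 1, 2, 2, …; wrap the column in as.character() first if the actual invoice numbers are needed.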

df_sorted$Invoice <- as.numeric(df_sorted$Invoice)
class(df_sorted$Invoice)
## [1] "numeric"

We then convert Item to categorical format and look at its data type and structure.

df_sorted$Item <- as.factor(df_sorted$Item)
class(df_sorted$Item)
## [1] "factor"
str(df_sorted)
## 'data.frame':    40236 obs. of  2 variables:
##  $ Invoice: num  1 2 2 2 2 2 2 2 3 4 ...
##  $ Item   : Factor w/ 6822 levels "001-0012","001-0013",..: 4863 2019 2042 2107 2935 4935 5974 5987 6041 1 ...
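
The 1 2 2 2 … values in the ‘Invoice’ column above are factor level codes, not invoice numbers. The following small sketch, using a made-up four-element factor, shows the difference between the two conversions; the vector itself is an illustrative assumption.

```r
# A tiny factor of invoice labels (illustrative values)
invoice <- factor(c("110003", "110002", "110003", "41102"))

codes  <- as.numeric(invoice)                # internal level codes
values <- as.numeric(as.character(invoice))  # the numbers as written

print(codes)
print(values)
```

Factor levels are sorted as text, so "41102" sorts after "110003" and receives the highest code. A non-numeric entry, such as the stray INVOICE row seen in the tail() output, would become NA (with a warning) under the second conversion.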

Now we convert the data frame to transaction format using ddply, grouping all the items that were bought together on the same invoice. Note that the call below works on the original ‘df’, not on ‘df_sorted’.

library(plyr)
df_itemList <- ddply(df,'Invoice', function(df1)paste(df1$Item,collapse = ","))
head(df_itemList)
##   Invoice                                                             V1
## 1  110002                                                       340-7800
## 2  110003 086-0604,086-1568,086-3126,138-0116,352-4505,462-5602,462-5724
## 3  110004                                                       471-2812
## 4  110005                                                       001-0012
## 5  110006                                              053-9626,110-4368
## 6  110007          148-0294,148-0736,452-1691,462-3070,462-5406,462-5726
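
For readers who prefer not to load plyr, the same grouping can be sketched in base R with aggregate(). This is an alternative under stated assumptions, not the method used in the rest of this article; the four-row `toy` data frame is a hand-copied excerpt for illustration.

```r
# Four rows in the same shape as df (Invoice, Item)
toy <- data.frame(
  Invoice = c("110002", "110003", "110003", "110004"),
  Item    = c("340-7800", "086-0604", "086-1568", "471-2812"),
  stringsAsFactors = FALSE
)

# One row per invoice, with its items collapsed into a comma-separated string
item_list <- aggregate(Item ~ Invoice, data = toy,
                       FUN = function(x) paste(x, collapse = ","))
print(item_list)
```

The result has one row per invoice and an ‘Item’ column holding the comma-separated item list, which is the same shape as the ‘V1’ column produced by ddply above.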

Now, we remove the column ‘Invoice’.

df_itemList$Invoice <- NULL
head(df_itemList)
##                                                               V1
## 1                                                       340-7800
## 2 086-0604,086-1568,086-3126,138-0116,352-4505,462-5602,462-5724
## 3                                                       471-2812
## 4                                                       001-0012
## 5                                              053-9626,110-4368
## 6          148-0294,148-0736,452-1691,462-3070,462-5406,462-5726

Next, we rename the only column left in the data set.

colnames(df_itemList) <- c("Item List")
head(df_itemList)
##                                                        Item List
## 1                                                       340-7800
## 2 086-0604,086-1568,086-3126,138-0116,352-4505,462-5602,462-5724
## 3                                                       471-2812
## 4                                                       001-0012
## 5                                              053-9626,110-4368
## 6          148-0294,148-0736,452-1691,462-3070,462-5406,462-5726

We now export the data set to a CSV file, which serves as a backup and as the input for the next step.

write.csv(df_itemList,"ItemList.csv", quote = FALSE, row.names = TRUE)

We bring in the association rule mining algorithm: ‘apriori’

We load the packages required.

library(arules)
## Warning: package 'arules' was built under R version 3.4.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
##     abbreviate, write

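We now read the CSV file in basket format and inspect it. Note that the header line written by write.csv is read as a transaction of its own ({Item List} in the output below), which is why apriori later reports 10247 transactions for 10246 invoice groups; passing skip = 1 to read.transactions would drop it.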

txn = read.transactions(file="ItemList.csv", rm.duplicates= FALSE, format="basket",sep=",",cols=1);
## Warning in asMethod(object): removing duplicated items in transactions
inspect(head(txn))
##     items       transactionID
## [1] {Item List}
## [2] {340-7800}              1
## [3] {086-0604,
##      086-1568,
##      086-3126,
##      138-0116,
##      352-4505,
##      462-5602,
##      462-5724}              2
## [4] {471-2812}              3
## [5] {001-0012}              4
## [6] {053-9626,
##      110-4368}              5

For our convenience we remove the quotes from transactions.

txn@itemInfo$labels <- gsub("\"", "", txn@itemInfo$labels)  # strip any double quotes from the item labels

The next step is to run the apriori algorithm.

basket_rules1 <- apriori(txn,parameter = list(sup = 0.001, conf = 0.8, target="rules"))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 10
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6823 item(s), 10247 transaction(s)] done [0.01s].
## sorting and recoding items ... [855 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(txn, parameter = list(sup = 0.001, conf = 0.8, target =
## "rules")): Mining stopped (maxlen reached). Only patterns up to a length of
## 10 returned!
##  done [0.00s].
## writing ... [37190 rule(s)] done [0.01s].
## creating S4 object  ... done [0.02s].
basket_rules2 <- apriori(txn,parameter = list(sup = 0.01, conf = 0.25, target="rules"))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 102
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6823 item(s), 10247 transaction(s)] done [0.00s].
## sorting and recoding items ... [14 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Note: Here we have created two different sets of basket rules: one with a high confidence threshold and a low support threshold (basket_rules1), and the other with a higher support threshold and a lower confidence threshold (basket_rules2).
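
The ‘Absolute minimum support count’ lines in the two logs above can be related to the relative support thresholds with a little arithmetic. The sketch below only reproduces the numbers shown in the logs; how arules derives that line internally is an assumption here, not something taken from its documentation.

```r
n_transactions <- 10247          # "10247 transaction(s)" in the apriori logs
min_support    <- c(0.001, 0.01) # the two support thresholds used above

raw <- min_support * n_transactions  # 10.247 and 102.47 transactions

reported       <- floor(raw)    # coincides with the counts printed in the logs
smallest_count <- ceiling(raw)  # fewest transactions that actually reach the threshold

print(reported)
print(smallest_count)
```

This is consistent with the summaries that follow: the rarest rule in basket_rules1 has a count of 11, the smallest whole number of transactions whose share of 10247 is at least 0.001.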
Now we view the rules that we have created.

summary(basket_rules1)
## set of 37190 rules
##
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6    7    8    9   10
##   15  866 3630 7246 9433 8422 5126 2002  450
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00    5.00    6.00    6.25    7.00   10.00
##
## summary of quality measures:
##     support           confidence          lift            count
##  Min.   :0.001073   Min.   :0.8000   Min.   : 43.72   Min.   :11.00
##  1st Qu.:0.001073   1st Qu.:0.9167   1st Qu.: 86.18   1st Qu.:11.00
##  Median :0.001073   Median :1.0000   Median :121.99   Median :11.00
##  Mean   :0.001150   Mean   :0.9569   Mean   :134.11   Mean   :11.78
##  3rd Qu.:0.001171   3rd Qu.:1.0000   3rd Qu.:167.40   3rd Qu.:12.00
##  Max.   :0.004294   Max.   :1.0000   Max.   :640.44   Max.   :44.00
##
## mining info:
##  data ntransactions support confidence
##   txn         10247   0.001        0.8
summary(basket_rules2)
## set of 2 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 2
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       2       2       2       2       2       2
##
## summary of quality measures:
##     support         confidence          lift           count
##  Min.   :0.0122   Min.   :0.6477   Min.   :35.87   Min.   :125
##  1st Qu.:0.0122   1st Qu.:0.6547   1st Qu.:35.87   1st Qu.:125
##  Median :0.0122   Median :0.6617   Median :35.87   Median :125
##  Mean   :0.0122   Mean   :0.6617   Mean   :35.87   Mean   :125
##  3rd Qu.:0.0122   3rd Qu.:0.6687   3rd Qu.:35.87   3rd Qu.:125
##  Max.   :0.0122   Max.   :0.6757   Max.   :35.87   Max.   :125
##
## mining info:
##  data ntransactions support confidence
##   txn         10247    0.01       0.25

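We now convert the basket rules to data frames and view them. We also rescale the ‘confidence’ and ‘support’ columns: confidence is multiplied by 100 to give a percentage, and support is multiplied by nrow(df). Note that nrow(df) is the number of item rows (40236), not the number of transactions (10247), so the rescaled ‘support’ is not a transaction count; the ‘count’ column already holds the number of transactions that match each rule.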

df_basket1 <- as(basket_rules1,"data.frame")
head(df_basket1)
##                      rules     support confidence     lift count
## 1 {300-3512} => {300-3535} 0.001073485  1.0000000 365.9643    11
## 2 {036-8303} => {036-8304} 0.001073485  1.0000000 640.4375    11
## 3 {300-3532} => {300-3535} 0.001366254  0.8750000 320.2188    14
## 4 {161-1271} => {339-6516} 0.001366254  0.8235294 122.3001    14
## 5 {338-4760} => {338-4715} 0.001366254  0.9333333 222.4155    14
## 6 {338-4760} => {339-6516} 0.001268664  0.8666667 128.7063    13
tail(df_basket1)
##                                                                                                  rules
## 37185 {092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-2684}
## 37186 {092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7603}
## 37187 {092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-0526}
## 37188 {092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-1480}
## 37189 {092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510} => {552-9860}
## 37190 {092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7620}
##           support confidence     lift count
## 37185 0.001073485  1.0000000 121.9881    11
## 37186 0.001073485  1.0000000 129.7089    11
## 37187 0.001073485  1.0000000 133.0779    11
## 37188 0.001073485  1.0000000 150.6912    11
## 37189 0.001073485  0.9166667 138.1336    11
## 37190 0.001073485  1.0000000 204.9400    11
df_basket1$confidence <- df_basket1$confidence * 100
df_basket1$support <- df_basket1$support * nrow(df)
head(df_basket1)
##                      rules  support confidence     lift count
## 1 {300-3512} => {300-3535} 43.19274  100.00000 365.9643    11
## 2 {036-8303} => {036-8304} 43.19274  100.00000 640.4375    11
## 3 {300-3532} => {300-3535} 54.97258   87.50000 320.2188    14
## 4 {161-1271} => {339-6516} 54.97258   82.35294 122.3001    14
## 5 {338-4760} => {338-4715} 54.97258   93.33333 222.4155    14
## 6 {338-4760} => {339-6516} 51.04596   86.66667 128.7063    13
tail(df_basket1)
##                                                                                                  rules
## 37185 {092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-2684}
## 37186 {092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7603}
## 37187 {092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-0526}
## 37188 {092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-1480}
## 37189 {092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510} => {552-9860}
## 37190 {092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7620}
##        support confidence     lift count
## 37185 43.19274  100.00000 121.9881    11
## 37186 43.19274  100.00000 129.7089    11
## 37187 43.19274  100.00000 133.0779    11
## 37188 43.19274  100.00000 150.6912    11
## 37189 43.19274   91.66667 138.1336    11
## 37190 43.19274  100.00000 204.9400    11
df_basket2 <- as(basket_rules2,"data.frame")
head(df_basket2)
##                      rules    support confidence     lift count
## 1 {462-5406} => {462-5726} 0.01219869  0.6756757 35.87383   125
## 2 {462-5726} => {462-5406} 0.01219869  0.6476684 35.87383   125
tail(df_basket2)
##                      rules    support confidence     lift count
## 1 {462-5406} => {462-5726} 0.01219869  0.6756757 35.87383   125
## 2 {462-5726} => {462-5406} 0.01219869  0.6476684 35.87383   125
df_basket2$confidence <- df_basket2$confidence * 100
df_basket2$support <- df_basket2$support * nrow(df)
head(df_basket2)
##                      rules  support confidence     lift count
## 1 {462-5406} => {462-5726} 490.8266   67.56757 35.87383   125
## 2 {462-5726} => {462-5406} 490.8266   64.76684 35.87383   125
tail(df_basket2)
##                      rules  support confidence     lift count
## 1 {462-5406} => {462-5726} 490.8266   67.56757 35.87383   125
## 2 {462-5726} => {462-5406} 490.8266   64.76684 35.87383   125
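
As a quick check on the scaling step, the sketch below multiplies two support values taken from the tables above by both candidate totals. The variable names are illustrative; the numbers are copied from the outputs shown earlier.

```r
n_transactions <- 10247   # transactions reported by apriori
n_rows         <- 40236   # rows in df, from str(df)

support <- c(0.001073485, 0.01219869)  # fractional supports from the rule tables

as_count  <- round(support * n_transactions)  # transactions matching each rule
as_scaled <- support * n_rows                 # what support * nrow(df) produces

print(as_count)
print(round(as_scaled, 2))
```

Multiplying by the number of transactions recovers the ‘count’ column (11 and 125), whereas multiplying by nrow(df) gives the roughly 43.19 and 490.83 seen in the rescaled ‘support’ column.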

Mining rules for recommendations:

We split lhs and rhs into two columns.

library(reshape2)
df_basket_split1 <- transform(df_basket1, rules = colsplit(rules, pattern = " => ", names = c("lhs","rhs")))
head(df_basket_split1)
##    rules.lhs  rules.rhs  support confidence     lift count
## 1 {300-3512} {300-3535} 43.19274  100.00000 365.9643    11
## 2 {036-8303} {036-8304} 43.19274  100.00000 640.4375    11
## 3 {300-3532} {300-3535} 54.97258   87.50000 320.2188    14
## 4 {161-1271} {339-6516} 54.97258   82.35294 122.3001    14
## 5 {338-4760} {338-4715} 54.97258   93.33333 222.4155    14
## 6 {338-4760} {339-6516} 51.04596   86.66667 128.7063    13
tail(df_basket_split1)
##                                                                                rules.lhs
## 37185 {092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37186 {092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37187 {092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37188 {092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37189 {092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510}
## 37190 {092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860}
##        rules.rhs  support confidence     lift count
## 37185 {092-2684} 43.19274  100.00000 121.9881    11
## 37186 {092-7603} 43.19274  100.00000 129.7089    11
## 37187 {092-0526} 43.19274  100.00000 133.0779    11
## 37188 {092-1480} 43.19274  100.00000 150.6912    11
## 37189 {552-9860} 43.19274   91.66667 138.1336    11
## 37190 {092-7620} 43.19274  100.00000 204.9400    11
df_basket_split2 <- transform(df_basket2, rules = colsplit(rules, pattern = " => ", names = c("lhs","rhs")))
head(df_basket_split2)
##    rules.lhs  rules.rhs  support confidence     lift count
## 1 {462-5406} {462-5726} 490.8266   67.56757 35.87383   125
## 2 {462-5726} {462-5406} 490.8266   64.76684 35.87383   125
tail(df_basket_split2)
##    rules.lhs  rules.rhs  support confidence     lift count
## 1 {462-5406} {462-5726} 490.8266   67.56757 35.87383   125
## 2 {462-5726} {462-5406} 490.8266   64.76684 35.87383   125

Next, we remove curly brackets around the rules.

df_basket_split1$rules$lhs <- gsub("\\{|\\}", "", df_basket_split1$rules$lhs)  # backslashes must be doubled inside an R string
df_basket_split1$rules$rhs <- gsub("\\{|\\}", "", df_basket_split1$rules$rhs)
head(df_basket_split1)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  300-3512  300-3535 43.19274  100.00000 365.9643    11
## 2  036-8303  036-8304 43.19274  100.00000 640.4375    11
## 3  300-3532  300-3535 54.97258   87.50000 320.2188    14
## 4  161-1271  339-6516 54.97258   82.35294 122.3001    14
## 5  338-4760  338-4715 54.97258   93.33333 222.4155    14
## 6  338-4760  339-6516 51.04596   86.66667 128.7063    13
tail(df_basket_split1)
##                                                                              rules.lhs
## 37185 092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37186 092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37187 092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37188 092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37189 092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510
## 37190 092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860
##       rules.rhs  support confidence     lift count
## 37185  092-2684 43.19274  100.00000 121.9881    11
## 37186  092-7603 43.19274  100.00000 129.7089    11
## 37187  092-0526 43.19274  100.00000 133.0779    11
## 37188  092-1480 43.19274  100.00000 150.6912    11
## 37189  552-9860 43.19274   91.66667 138.1336    11
## 37190  092-7620 43.19274  100.00000 204.9400    11
df_basket_split2$rules$lhs <- gsub("\\{|\\}", "", df_basket_split2$rules$lhs)
df_basket_split2$rules$rhs <- gsub("\\{|\\}", "", df_basket_split2$rules$rhs)
head(df_basket_split2)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  462-5406  462-5726 490.8266   67.56757 35.87383   125
## 2  462-5726  462-5406 490.8266   64.76684 35.87383   125
tail(df_basket_split2)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  462-5406  462-5726 490.8266   67.56757 35.87383   125
## 2  462-5726  462-5406 490.8266   64.76684 35.87383   125
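
The split-and-strip steps above can also be sketched in base R without reshape2. This is an alternative illustration rather than the approach used in this article; the second rule string is a shortened, hypothetical version of one of the long rules.

```r
# Rule strings in the "{lhs} => {rhs}" form produced by as(rules, "data.frame")
rules <- c("{462-5406} => {462-5726}",
           "{092-0526,092-1480} => {092-2684}")  # shortened, hypothetical rule

# Split each string at the arrow, treating it as literal text
parts <- strsplit(rules, " => ", fixed = TRUE)
lhs <- vapply(parts, `[`, character(1), 1)
rhs <- vapply(parts, `[`, character(1), 2)

# Drop the curly brackets; a character class avoids escaping altogether
lhs <- gsub("[{}]", "", lhs)
rhs <- gsub("[{}]", "", rhs)

print(data.frame(lhs, rhs, stringsAsFactors = FALSE))
```

Because the result is a plain data frame with two character columns, it avoids the nested ‘rules’ data frame that transform() with colsplit() creates.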

Next, we convert the rules to character format to make them presentable. (gsub() already returns character vectors, so this step is a safeguard rather than a change.)

df_basket_split1$rules$lhs <- as.character(df_basket_split1$rules$lhs)
df_basket_split1$rules$rhs <- as.character(df_basket_split1$rules$rhs)
head(df_basket_split1)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  300-3512  300-3535 43.19274  100.00000 365.9643    11
## 2  036-8303  036-8304 43.19274  100.00000 640.4375    11
## 3  300-3532  300-3535 54.97258   87.50000 320.2188    14
## 4  161-1271  339-6516 54.97258   82.35294 122.3001    14
## 5  338-4760  338-4715 54.97258   93.33333 222.4155    14
## 6  338-4760  339-6516 51.04596   86.66667 128.7063    13
tail(df_basket_split1)
##                                                                              rules.lhs
## 37185 092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37186 092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37187 092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37188 092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37189 092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510
## 37190 092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860
##       rules.rhs  support confidence     lift count
## 37185  092-2684 43.19274  100.00000 121.9881    11
## 37186  092-7603 43.19274  100.00000 129.7089    11
## 37187  092-0526 43.19274  100.00000 133.0779    11
## 37188  092-1480 43.19274  100.00000 150.6912    11
## 37189  552-9860 43.19274   91.66667 138.1336    11
## 37190  092-7620 43.19274  100.00000 204.9400    11
str(df_basket_split1)
## 'data.frame':    37190 obs. of  5 variables:
##  $ rules     :'data.frame':  37190 obs. of  2 variables:
##   ..$ lhs: chr  "300-3512" "036-8303" "300-3532" "161-1271" ...
##   ..$ rhs: chr  "300-3535" "036-8304" "300-3535" "339-6516" ...
##  $ support   : num  43.2 43.2 55 55 55 ...
##  $ confidence: num  100 100 87.5 82.4 93.3 ...
##  $ lift      : num  366 640 320 122 222 ...
##  $ count     : num  11 11 14 14 14 13 27 20 18 23 ...
df_basket_split2$rules$lhs <- as.character(df_basket_split2$rules$lhs)
df_basket_split2$rules$rhs <- as.character(df_basket_split2$rules$rhs)
head(df_basket_split2)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  462-5406  462-5726 490.8266   67.56757 35.87383   125
## 2  462-5726  462-5406 490.8266   64.76684 35.87383   125
tail(df_basket_split2)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  462-5406  462-5726 490.8266   67.56757 35.87383   125
## 2  462-5726  462-5406 490.8266   64.76684 35.87383   125
str(df_basket_split2)
## 'data.frame':    2 obs. of  5 variables:
##  $ rules     :'data.frame':  2 obs. of  2 variables:
##   ..$ lhs: chr  "462-5406" "462-5726"
##   ..$ rhs: chr  "462-5726" "462-5406"
##  $ support   : num  491 491
##  $ confidence: num  67.6 64.8
##  $ lift      : num  35.9 35.9
##  $ count     : num  125 125

Now, we create a copy of the basket outputs.

df_basket_output1 <- df_basket_split1
head(df_basket_output1)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  300-3512  300-3535 43.19274  100.00000 365.9643    11
## 2  036-8303  036-8304 43.19274  100.00000 640.4375    11
## 3  300-3532  300-3535 54.97258   87.50000 320.2188    14
## 4  161-1271  339-6516 54.97258   82.35294 122.3001    14
## 5  338-4760  338-4715 54.97258   93.33333 222.4155    14
## 6  338-4760  339-6516 51.04596   86.66667 128.7063    13
str(df_basket_output1)
## 'data.frame':    37190 obs. of  5 variables:
##  $ rules     :'data.frame':  37190 obs. of  2 variables:
##   ..$ lhs: chr  "300-3512" "036-8303" "300-3532" "161-1271" ...
##   ..$ rhs: chr  "300-3535" "036-8304" "300-3535" "339-6516" ...
##  $ support   : num  43.2 43.2 55 55 55 ...
##  $ confidence: num  100 100 87.5 82.4 93.3 ...
##  $ lift      : num  366 640 320 122 222 ...
##  $ count     : num  11 11 14 14 14 13 27 20 18 23 ...
nrow(df_basket_output1)
## [1] 37190
df_basket_output2 <- df_basket_split2
head(df_basket_output2)
##   rules.lhs rules.rhs  support confidence     lift count
## 1  462-5406  462-5726 490.8266   67.56757 35.87383   125
## 2  462-5726  462-5406 490.8266   64.76684 35.87383   125
str(df_basket_output2)
## 'data.frame':    2 obs. of  5 variables:
##  $ rules     :'data.frame':  2 obs. of  2 variables:
##   ..$ lhs: chr  "462-5406" "462-5726"
##   ..$ rhs: chr  "462-5726" "462-5406"
##  $ support   : num  491 491
##  $ confidence: num  67.6 64.8
##  $ lift      : num  35.9 35.9
##  $ count     : num  125 125
nrow(df_basket_output2)
## [1] 2

We rename the columns of the nested ‘rules’ data frame for simplicity.

str(df_basket_output1$rules)
## 'data.frame':    37190 obs. of  2 variables:
##  $ lhs: chr  "300-3512" "036-8303" "300-3532" "161-1271" ...
##  $ rhs: chr  "300-3535" "036-8304" "300-3535" "339-6516" ...
colnames(df_basket_output1$rules)<- c("Lhs","Rhs")
head(df_basket_output1$rules)
##        Lhs      Rhs
## 1 300-3512 300-3535
## 2 036-8303 036-8304
## 3 300-3532 300-3535
## 4 161-1271 339-6516
## 5 338-4760 338-4715
## 6 338-4760 339-6516
str(df_basket_output1$rules)
## 'data.frame':    37190 obs. of  2 variables:
##  $ Lhs: chr  "300-3512" "036-8303" "300-3532" "161-1271" ...
##  $ Rhs: chr  "300-3535" "036-8304" "300-3535" "339-6516" ...
str(df_basket_output2$rules)
## 'data.frame':    2 obs. of  2 variables:
##  $ lhs: chr  "462-5406" "462-5726"
##  $ rhs: chr  "462-5726" "462-5406"
colnames(df_basket_output2$rules)<- c("Lhs","Rhs")
head(df_basket_output2$rules)
##        Lhs      Rhs
## 1 462-5406 462-5726
## 2 462-5726 462-5406
str(df_basket_output2$rules)
## 'data.frame':    2 obs. of  2 variables:
##  $ Lhs: chr  "462-5406" "462-5726"
##  $ Rhs: chr  "462-5726" "462-5406"

Next, we create the final output and look at its structure.
We first create an empty data frame with the required number of rows and columns, and then copy the required columns into it from the ‘df_basket_output1’ data frame.
We do this for the 1st basket output and then repeat the process for the 2nd basket output.

output1 <- data.frame(Lhs=character(nrow(df_basket_output1)),
                      Rhs=character(nrow(df_basket_output1)),
                      Support=double(nrow(df_basket_output1)),
                      Confidence=double(nrow(df_basket_output1)),
                      Lift=double(nrow(df_basket_output1)),
                      Count=double(nrow(df_basket_output1)),stringsAsFactors=FALSE)
str(output1)
## 'data.frame':    37190 obs. of  6 variables:
##  $ Lhs       : chr  "" "" "" "" ...
##  $ Rhs       : chr  "" "" "" "" ...
##  $ Support   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Confidence: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Lift      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Count     : num  0 0 0 0 0 0 0 0 0 0 ...
output1$Lhs <- df_basket_output1$rules$Lhs
output1$Rhs <- df_basket_output1$rules$Rhs
output1$Support <- df_basket_output1$support
output1$Confidence <- df_basket_output1$confidence
output1$Lift <- df_basket_output1$lift
output1$Count <- df_basket_output1$count
str(output1)
## 'data.frame':    37190 obs. of  6 variables:
##  $ Lhs       : chr  "300-3512" "036-8303" "300-3532" "161-1271" ...
##  $ Rhs       : chr  "300-3535" "036-8304" "300-3535" "339-6516" ...
##  $ Support   : num  43.2 43.2 55 55 55 ...
##  $ Confidence: num  100 100 87.5 82.4 93.3 ...
##  $ Lift      : num  366 640 320 122 222 ...
##  $ Count     : num  11 11 14 14 14 13 27 20 18 23 ...
head(output1)
##        Lhs      Rhs  Support Confidence     Lift Count
## 1 300-3512 300-3535 43.19274  100.00000 365.9643    11
## 2 036-8303 036-8304 43.19274  100.00000 640.4375    11
## 3 300-3532 300-3535 54.97258   87.50000 320.2188    14
## 4 161-1271 339-6516 54.97258   82.35294 122.3001    14
## 5 338-4760 338-4715 54.97258   93.33333 222.4155    14
## 6 338-4760 339-6516 51.04596   86.66667 128.7063    13
tail(output1)
##                                                                                    Lhs
## 37185 092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37186 092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37187 092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37188 092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37189 092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510
## 37190 092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860
##            Rhs  Support Confidence     Lift Count
## 37185 092-2684 43.19274  100.00000 121.9881    11
## 37186 092-7603 43.19274  100.00000 129.7089    11
## 37187 092-0526 43.19274  100.00000 133.0779    11
## 37188 092-1480 43.19274  100.00000 150.6912    11
## 37189 552-9860 43.19274   91.66667 138.1336    11
## 37190 092-7620 43.19274  100.00000 204.9400    11

We do the same for the 2nd basket output.

output2 <- data.frame(Lhs=character(nrow(df_basket_output2)),
                      Rhs=character(nrow(df_basket_output2)),
                      Support=double(nrow(df_basket_output2)),
                      Confidence=double(nrow(df_basket_output2)),
                      Lift=double(nrow(df_basket_output2)),
                      Count=double(nrow(df_basket_output2)),stringsAsFactors=FALSE)
str(output2)
## 'data.frame':    2 obs. of  6 variables:
##  $ Lhs       : chr  "" ""
##  $ Rhs       : chr  "" ""
##  $ Support   : num  0 0
##  $ Confidence: num  0 0
##  $ Lift      : num  0 0
##  $ Count     : num  0 0
output2$Lhs <- df_basket_output2$rules$Lhs
output2$Rhs <- df_basket_output2$rules$Rhs
output2$Support <- df_basket_output2$support
output2$Confidence <- df_basket_output2$confidence
output2$Lift <- df_basket_output2$lift
output2$Count <- df_basket_output2$count
str(output2)
## 'data.frame':    2 obs. of  6 variables:
##  $ Lhs       : chr  "462-5406" "462-5726"
##  $ Rhs       : chr  "462-5726" "462-5406"
##  $ Support   : num  491 491
##  $ Confidence: num  67.6 64.8
##  $ Lift      : num  35.9 35.9
##  $ Count     : num  125 125
head(output2)
##        Lhs      Rhs  Support Confidence     Lift Count
## 1 462-5406 462-5726 490.8266   67.56757 35.87383   125
## 2 462-5726 462-5406 490.8266   64.76684 35.87383   125
tail(output2)
##        Lhs      Rhs  Support Confidence     Lift Count
## 1 462-5406 462-5726 490.8266   67.56757 35.87383   125
## 2 462-5726 462-5406 490.8266   64.76684 35.87383   125
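
The pre-allocate-then-fill pattern above works, but the same flat data frame can be built in a single data.frame() call. The sketch below is an alternative under stated assumptions: `nested` is a hand-built stand-in for ‘df_basket_output2’, with values copied from the output above.

```r
# Stand-in for df_basket_output2: a data frame whose 'rules' column is itself a data frame
nested <- data.frame(
  support    = c(490.8266, 490.8266),
  confidence = c(67.56757, 64.76684),
  lift       = c(35.87383, 35.87383),
  count      = c(125, 125)
)
nested$rules <- data.frame(
  Lhs = c("462-5406", "462-5726"),
  Rhs = c("462-5726", "462-5406"),
  stringsAsFactors = FALSE
)

# Build the flat output directly from the nested columns
flat <- data.frame(
  Lhs        = nested$rules$Lhs,
  Rhs        = nested$rules$Rhs,
  Support    = nested$support,
  Confidence = nested$confidence,
  Lift       = nested$lift,
  Count      = nested$count,
  stringsAsFactors = FALSE
)
str(flat)
```

The result has the same six columns as ‘output2’, without the intermediate data frame of empty strings and zeros.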

Next, we write the outputs to CSV files.

write.csv(output1,"Output1.csv")
write.csv(output2,"Output2.csv")
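
Once the rules are in a flat table, using them for recommendations is a simple lookup. The sketch below assumes a hypothetical helper, `recommend()`, and a three-row excerpt of ‘output1’ copied from the tables above; it matches single-item left-hand sides only and is meant as an illustration, not a complete recommender.

```r
# Three rules copied from the head of output1
rules_tbl <- data.frame(
  Lhs        = c("338-4760", "338-4760", "300-3532"),
  Rhs        = c("338-4715", "339-6516", "300-3535"),
  Confidence = c(93.33333, 86.66667, 87.5),
  Lift       = c(222.4155, 128.7063, 320.2188),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rules whose LHS is exactly `item`, strongest lift first
recommend <- function(item, rules) {
  hits <- rules[rules$Lhs == item, ]
  hits[order(-hits$Lift), c("Rhs", "Confidence", "Lift")]
}

print(recommend("338-4760", rules_tbl))
```

For item 338-4760 this returns 338-4715 ahead of 339-6516, since the first rule has the higher lift. Handling multi-item left-hand sides would require splitting ‘Lhs’ on commas and comparing sets.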
Thank You !!!