In-depth explanation of R language association rules

Before using R to mine association rules, let us first review the relevant definitions.

The purpose of association rules is to discover possible associations or connections between things that are hidden in the data. Association rule mining is an unsupervised machine learning method aimed at knowledge discovery rather than prediction.

Mining association rules involves two main stages: the first stage finds all frequent item sets in the data set, and the second stage generates association rules from those frequent item sets.

Next, let us look at the two main parameters of an association rule: support and confidence.

In simplified terms, support is the probability that the related items appear together, while confidence is the probability that the second item appears given that the first one does.
Take the rule beef -> chicken. Suppose the proportion of all customers who buy both beef and chicken is 3/7, while the proportion of beef buyers who also buy chicken is 3/4 (for instance, 7 customers, 4 of whom buy beef and 3 of whom buy both). These two proportions are the key metrics of an association rule, and they are called support and confidence. For the rule beef -> chicken, the support is 3/7: 3/7 of all customers buy beef and chicken together, which reflects how much of the customer base the rule covers. The confidence is 3/4: 3/4 of the customers who buy beef also buy chicken, which reflects how predictable the rule is, that is, how likely a customer is to buy chicken given that they buy beef.
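The two proportions above can be checked with a few lines of base R. This is a minimal sketch: the `baskets` list is hypothetical data, made up only so that 4 of 7 customers buy beef and 3 of them also buy chicken.

```r
# Hypothetical baskets (not from the article): 7 customers,
# 4 buy beef, and 3 of those 4 also buy chicken.
baskets <- list(
  c("beef", "chicken", "milk"),
  c("beef", "cheese"),
  c("cheese", "boots"),
  c("beef", "chicken", "cheese"),
  c("beef", "chicken", "milk"),
  c("chicken", "milk"),
  c("chicken", "milk")
)

# One logical flag per basket for each item of interest.
has_beef    <- vapply(baskets, function(b) "beef" %in% b, logical(1))
has_chicken <- vapply(baskets, function(b) "chicken" %in% b, logical(1))

# Support: share of all baskets that contain both items.
support <- mean(has_beef & has_chicken)

# Confidence: among baskets with beef, the share that also contain chicken.
confidence <- sum(has_beef & has_chicken) / sum(has_beef)

print(c(support = support, confidence = confidence))
```

Here `support` evaluates to 3/7 and `confidence` to 3/4, matching the figures in the text.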

The most commonly used association rule algorithm is the Apriori algorithm.
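To make the two stages concrete, here is a simplified base R sketch of the idea behind Apriori: candidate pairs are built only from items that are already frequent, and rules are then generated from the frequent pairs. The data, the thresholds and the helper `support_of` are all hypothetical, and the sketch stops at pairs; it illustrates the principle and is not how the arules package implements the algorithm.

```r
# Hypothetical baskets and thresholds, for illustration only.
baskets <- list(
  c("beef", "chicken", "milk"),
  c("beef", "cheese"),
  c("cheese", "boots"),
  c("beef", "chicken", "cheese"),
  c("beef", "chicken", "milk"),
  c("chicken", "milk"),
  c("chicken", "milk")
)
min_support    <- 0.4
min_confidence <- 0.7

# Support of an item set: share of baskets containing all of its items.
support_of <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Stage 1a: frequent single items.
items          <- sort(unique(unlist(baskets)))
item_support   <- vapply(items, support_of, numeric(1))
frequent_items <- items[item_support >= min_support]

# Stage 1b: candidate pairs are built only from frequent items, because a
# pair cannot be frequent if one of its items is infrequent.
pairs          <- combn(frequent_items, 2, simplify = FALSE)
pair_support   <- vapply(pairs, support_of, numeric(1))
frequent_pairs <- pairs[pair_support >= min_support]

# Stage 2: turn each frequent pair into rules in both directions and keep
# those that reach the minimum confidence.
rules <- NULL
for (p in frequent_pairs) {
  for (i in 1:2) {
    lhs  <- p[i]
    rhs  <- p[3 - i]
    conf <- support_of(p) / support_of(lhs)
    if (conf >= min_confidence) {
      rules <- rbind(rules, data.frame(lhs = lhs, rhs = rhs,
                                       support = support_of(p),
                                       confidence = conf,
                                       stringsAsFactors = FALSE))
    }
  }
}
print(rules)
```

With this toy data the infrequent item `boots` is dropped in stage 1a, so no pair containing it is ever counted, which is the saving that makes Apriori practical on large data sets.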

Next, we work through an association rule example in R. The arules package provides the algorithms, and we use its built-in Groceries dataset as the example.

library(arules)
data(Groceries) #Load the dataset
inspect(Groceries) #View the data content
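Because inspect(Groceries) prints every transaction, it can help to look at a compact overview first. The following sketch assumes the arules package is installed; the variable name `top_items` is only illustrative.

```r
library(arules)
data(Groceries)

# Overview: number of transactions and items, density, most frequent items.
summary(Groceries)

# Look at only the first five transactions instead of the whole data set.
inspect(Groceries[1:5])

# Relative frequency (single-item support) of the five most common items.
top_items <- sort(itemFrequency(Groceries), decreasing = TRUE)[1:5]
print(top_items)
```

The single-item frequencies are a quick guide when choosing a minimum support threshold for the mining step.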

With the data loaded, we need to find the frequent item sets, that is, the item sets whose support meets the minimum support threshold. Here we use the eclat function.

freq=eclat(Groceries,parameter = list(support=0.05,maxlen=10))
inspect(freq) #Check the frequent item sets

     items                          support
[1]  {whole milk,yogurt}            0.05602440
[2]  {whole milk,rolls/buns}        0.05663447
[3]  {other vegetables,whole milk}  0.07483477
[4]  {whole milk}                   0.25551601
[5]  {other vegetables}             0.19349263
[6]  {rolls/buns}                   0.18393493
[7]  {yogurt}                       0.13950178
[8]  {soda}                         0.17437722
[9]  {root vegetables}              0.10899847
[10] {tropical fruit}               0.10493137
[11] {bottled water}                0.11052364
[12] {sausage}                      0.09395018
[13] {shopping bags}                0.09852567
[14] {citrus fruit}                 0.08276563
[15] {pastry}                       0.08896797
[16] {pip fruit}                    0.07564820
[17] {whipped/sour cream}           0.07168277
[18] {fruit/vegetable juice}        0.07229283
[19] {domestic eggs}                0.06344687
[20] {newspapers}                   0.07981698
[21] {butter}                       0.05541434
[22] {margarine}                    0.05856634
[23] {brown bread}                  0.06487036
[24] {bottled beer}                 0.08052872
[25] {frankfurter}                  0.05897306
[26] {pork}                         0.05765125
[27] {napkins}                      0.05236401
[28] {curd}                         0.05327911
[29] {beef}                         0.05246568
[30] {coffee}                       0.05805796
[31] {canned beer}                  0.07768175

Judging from the results, there are 31 frequent item sets in total, but 28 of them contain only a single item, which suggests that a minimum support of 0.05 is too high.
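The size distribution of the item sets can be checked directly rather than by counting rows. The sketch below assumes the arules package is installed; the second call, with a lower threshold of 0.02 and `minlen = 2`, is an illustrative choice rather than a value taken from the analysis above.

```r
library(arules)
data(Groceries)

# Same mining call as in the article.
freq <- eclat(Groceries, parameter = list(support = 0.05, maxlen = 10))

# How many frequent item sets are there of each size?
print(table(size(freq)))

# Illustrative variation: lower the threshold and keep only item sets with
# at least two items, since single items cannot form rules between items.
freq_pairs <- eclat(Groceries, parameter = list(support = 0.02, minlen = 2))
inspect(sort(freq_pairs, by = "support")[1:5])
```

Restricting the result with `minlen = 2` makes it easier to see whether a given support level leaves any multi-item sets worth turning into rules.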
Next, we choose a lower support threshold and use the apriori function to build a model.

model<-apriori(Groceries,parameter=list(support=0.01,confidence=0.5))
summary(model)

set of 15 rules

rule length distribution (lhs + rhs):sizes
3
15

Min. 1st Qu. Median Mean 3rd Qu. Max.
3 3 3 3 3 3

summary of quality measures:
support confidence lift
Min. :0.01007 Min. :0.5000 Min. :1.984
1st Qu.:0.01174 1st Qu.:0.5151 1st Qu.:2.036
Median :0.01230 Median :0.5245 Median :2.203
Mean :0.01316 Mean :0.5411 Mean :2.299
3rd Qu.:0.01403 3rd Qu.:0.5718 3rd Qu.:2.432
Max. :0.02227 Max. :0.5862 Max. :3.030

mining info:
data ntransactions support confidence
Groceries 9835 0.01 0.5

Next, inspect the individual rules.

inspect(model)

     lhs                                      rhs                support    confidence lift
[1]  {curd,yogurt}                         => {whole milk}       0.01006609 0.5823529  2.279125
[2]  {other vegetables,butter}             => {whole milk}       0.01148958 0.5736041  2.244885
[3]  {other vegetables,domestic eggs}      => {whole milk}       0.01230300 0.5525114  2.162336
[4]  {yogurt,whipped/sour cream}           => {whole milk}       0.01087951 0.5245098  2.052747
[5]  {other vegetables,whipped/sour cream} => {whole milk}       0.01464159 0.5070423  1.984385
[6]  {pip fruit,other vegetables}          => {whole milk}       0.01352313 0.5175097  2.025351
[7]  {citrus fruit,root vegetables}        => {other vegetables} 0.01037112 0.5862069  3.029608
[8]  {tropical fruit,root vegetables}      => {other vegetables} 0.01230300 0.5845411  3.020999
[9]  {tropical fruit,root vegetables}      => {whole milk}       0.01199797 0.5700483  2.230969
[10] {tropical fruit,yogurt}               => {whole milk}       0.01514997 0.5173611  2.024770
[11] {root vegetables,yogurt}              => {other vegetables} 0.01291307 0.5000000  2.584078
[12] {root vegetables,yogurt}              => {whole milk}       0.01453991 0.5629921  2.203354
[13] {root vegetables,rolls/buns}          => {other vegetables} 0.01220132 0.5020921  2.594890
[14] {root vegetables,rolls/buns}          => {whole milk}       0.01270971 0.5230126  2.046888
[15] {other vegetables,yogurt}             => {whole milk}       0.02226741 0.5128806  2.007235

We can sort the rules by support and view the top 10.

inspect(sort(model,by="support")[1:10])

     lhs                                      rhs                support    confidence lift
[1]  {other vegetables,yogurt}             => {whole milk}       0.02226741 0.5128806  2.007235
[2]  {tropical fruit,yogurt}               => {whole milk}       0.01514997 0.5173611  2.024770
[3]  {other vegetables,whipped/sour cream} => {whole milk}       0.01464159 0.5070423  1.984385
[4]  {root vegetables,yogurt}              => {whole milk}       0.01453991 0.5629921  2.203354
[5]  {pip fruit,other vegetables}          => {whole milk}       0.01352313 0.5175097  2.025351
[6]  {root vegetables,yogurt}              => {other vegetables} 0.01291307 0.5000000  2.584078
[7]  {root vegetables,rolls/buns}          => {whole milk}       0.01270971 0.5230126  2.046888
[8]  {other vegetables,domestic eggs}      => {whole milk}       0.01230300 0.5525114  2.162336
[9]  {tropical fruit,root vegetables}      => {other vegetables} 0.01230300 0.5845411  3.020999
[10] {root vegetables,rolls/buns}          => {other vegetables} 0.01220132 0.5020921  2.594890

You can see that the rule {other vegetables,yogurt} => {whole milk} has the highest support, about 0.022: roughly 2.2% of all transactions contain other vegetables, yogurt and whole milk together.
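Support is only one way to rank rules; the same sort call accepts the other quality measures. As a sketch, assuming the arules package is installed, the rules can be ordered by lift instead (`by_lift` is just an illustrative variable name):

```r
library(arules)
data(Groceries)

# Same model as in the article.
model <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))

# Sort in decreasing order of lift and show the three strongest rules.
by_lift <- sort(model, by = "lift")
inspect(by_lift[1:3])

# The quality measures are also available as a plain data frame.
print(quality(by_lift)[1:3, ])
```

Ranking by lift tends to surface rules whose items co-occur more often than their individual popularity would suggest, whereas ranking by support favors rules built on very common items such as whole milk.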

We also need to filter the association rules according to the actual business situation, and we can remove obviously useless rules while building the model.
For example, we can require that the right-hand side of the rule is whole milk and that the lift value is at least 2.2.

inspect(subset(model,subset=rhs%in%"whole milk"&lift>=2.2))

    lhs                                 rhs          support    confidence lift
[1] {curd,yogurt}                    => {whole milk} 0.01006609 0.5823529  2.279125
[2] {other vegetables,butter}        => {whole milk} 0.01148958 0.5736041  2.244885
[3] {tropical fruit,root vegetables} => {whole milk} 0.01199797 0.5700483  2.230969
[4] {root vegetables,yogurt}         => {whole milk} 0.01453991 0.5629921  2.203354

Looking at the results, only 4 association rules with higher lift values remain.
lift=P(L,R)/(P(L)P(R)) is an index that plays a role similar to a correlation coefficient. When lift=1, L and R are independent. The further the value rises above 1, the less likely it is that L and R appear in the same shopping basket by chance.
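The formula can be verified by hand with numbers already printed above. This base R sketch uses the rule {curd,yogurt} => {whole milk} and the support of {whole milk} from the frequent item set listing; the variable names are only illustrative.

```r
# Values copied from the outputs shown earlier in the article.
support_rule <- 0.01006609  # P(L,R): support of {curd,yogurt} => {whole milk}
confidence   <- 0.5823529   # P(R|L): confidence of the same rule
support_rhs  <- 0.25551601  # P(R):   support of {whole milk}

# Confidence is P(L,R) / P(L), so P(L) can be recovered from it.
support_lhs <- support_rule / confidence

# lift = P(L,R) / (P(L) * P(R)), which is the same as confidence / P(R).
lift <- support_rule / (support_lhs * support_rhs)
print(lift)
```

The result agrees with the lift of about 2.279 reported for this rule: customers who buy curd and yogurt are roughly 2.3 times as likely to buy whole milk as customers in general.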

Explanation of the filtering operators:

%in% is an exact match on item labels; it selects item sets that contain any of the listed items
%pin% is a partial match, that is, similar to item like '%A%' or item like '%B%'
%ain% is an exact match that requires all of the listed items, that is, the item set has 'A' and the item set has 'B'
You can also combine filter conditions on support, confidence, and lift with the logical operators (&, |, !).
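The three operators can be compared side by side on the model built earlier. This is a sketch that assumes the arules package is installed; the item names and the confidence cut-off of 0.55 are arbitrary choices for illustration.

```r
library(arules)
data(Groceries)

# Same model as in the article.
model <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))

# %in%: the left-hand side contains at least one of the listed items.
any_item <- subset(model, subset = lhs %in% c("curd", "butter"))

# %pin%: partial match on the label, here any item with "vegetables" in it.
partial <- subset(model, subset = lhs %pin% "vegetables")

# %ain%: the left-hand side contains all of the listed items.
all_items <- subset(model, subset = lhs %ain% c("root vegetables", "yogurt"))

# Item conditions can be combined with quality measures using & and |.
strong <- subset(model, subset = lhs %pin% "vegetables" & confidence > 0.55)

inspect(all_items)
```

Filtering on lhs or rhs separately, rather than on the whole rule, keeps it clear whether an item is required as a condition or as a consequence.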
