I encountered such a problem in the project. The sorted result table does not have member ID information, only member phone number information, and each behavior is unique to a member.
Another member table contains member ID and member phone number information, but this table has duplicate member information, which means that a member appears many times in the member table, and the member table is much larger than the sorted table.
illustrate:
1. During the project process, the data volume is large and the package is used. The statements in SQL need to be used to reduce the amount of code, so the SQLDF package is needed.
2. In the following example, a1 in xx has corresponding values in yy.
A1 and b in the same value of a1 correspond to one-to-one relationship, and there is no different value of b corresponding to the same value of a1.
4. The purpose of this program is to use the yy table to match the xx table, that is, the final result should be to add the information of the column b on the basis of xx, and the number of rows in the data set xx remains unchanged.
As shown below:
<span style="font-size:14px;">> library() > library(sqldf) > xx <- (a1=1:6,c=c(8,5,8,6,23,7),d=c('adf','af','sdf','utr','af','ruti'))</span> <span style="font-family:SimSun;font-size:14px;">> xx a1 c d 1: 1 8 adf 2: 2 5 af 3: 3 8 sdf 4: 4 6 utr 5: 5 23 af 6: 6 7 ruti</span>
<span style="font-size:14px;">yy <- (a1=c(2,3,1,4,5,5,7,6,8,9,2,2,3,6),b=c('b','c','a','d','e','e','g', 'f','h','i','b','b','c','f'))</span>
<span style="font-size:14px;">> yy a1 b 1: 2 b 2: 3 c 3: 1 a 4: 4 d 5: 5 e 6: 5 e 7: 7 g 8: 6 f 9: 8 h 10: 9 i 11: 2 b 12: 2 b 13: 3 c 14: 6 f </span>
<span style="font-size:14px;">#According to a1 to connect all information to include> dataxy<- merge(xx, yy, by = "a1", = TRUE) > dataxy a1 c d b 1: 1 8 adf a 2: 2 5 af b 3: 2 5 af b 4: 2 5 af b 5: 3 8 sdf c 6: 3 8 sdf c 7: 4 6 utr d 8: 5 23 af e 9: 5 23 af e 10: 6 7 ruti f 11: 6 7 ruti f</span>
<span style="font-size:14px;">#Delete duplicate lines*/> sqldf("select DISTINCT a1,b,c,d from left1") a1 b c d 1 1 a 8 adf 2 2 b 5 af 3 3 c 8 sdf 4 4 d 6 utr 5 5 e 23 af 6 6 f 7 ruti</span>
Supplement: Several ways to filter out non-repetitive lines in R language
During the process of working on the project, we encountered filtering non-duplicate member information to match. This time, we introduced five types of data that filter non-duplicate rows:
The five methods are as follows:
>>> library(dplyr) >>> library(sqldf)
Method 1:
>>> data1 <- data7_0 %>% group_by(CELLPHONE,MEMBERID) %>% filter(row_number() == 1) %>% ungroup()
Method 2:
>>> data2 <- data7_0 %>% distinct(CELLPHONE,MEMBERID, .keep_all = TRUE)
Method 3:
>>> data3 <- sqldf("select DISTINCT CELLPHONE,MEMBERID from data7_0")
Method 4:
>>> data4 <- base::unique(data7_0)
Method 5:
>>> data5 <- (data7_0[!duplicated(data7_0$CELLPHONE), ])
The above is personal experience. I hope you can give you a reference and I hope you can support me more. If there are any mistakes or no complete considerations, I would like to give you advice.