The merge function in R is similar to VLOOKUP in Excel: it matches the rows of two data tables on common columns and joins them into one table.
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)
x, y: the two data frames to merge.
by, by.x, by.y: the columns used to join the two data sets. intersect(a, b) returns the intersection of vectors a and b, and names(x) returns the column names of x, so the default by = intersect(names(x), names(y)) uses every column name that x and y have in common as the join columns. When there are several common columns and only one should be used as the key, name it explicitly, for example names(x)[1] to use the first column of x. It can also be written directly as by = 'common column name', provided both data sets contain a column with exactly that name, including upper and lower case, because R is case sensitive. Use by.x and by.y when the key column has a different name in each data set.
all, all.x, all.y: whether all rows of x and y should appear in the output, including rows that have no match in the other data set.
sort: whether the result should be sorted on the by columns.
suffixes: the suffixes appended to columns (other than the by columns) that have the same name in both data sets.
incomparables: values in the by columns that should not be matched.
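As a rough illustration of by.x, by.y and suffixes, here is a minimal sketch. The data frames and their column names (id, key, score) are made up for this example and are not from the original post.

```r
# Hypothetical data: the key column has a different name in each data frame
x <- data.frame(id = c(1, 2, 3), score = c(90, 80, 70))
y <- data.frame(key = c(2, 3, 4), score = c(85, 75, 65))

# by.x / by.y name the join column on each side;
# suffixes distinguishes the non-key column "score" that exists in both
res <- merge(x, y, by.x = "id", by.y = "key",
             suffixes = c(".left", ".right"))
print(res)
```

The key column in the result takes its name from x (here id), and only the ids present on both sides (2 and 3) are kept, since all = FALSE is the default.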
The merge function supports four join modes: inner, left, right and outer. Inner is the default. all = TRUE gives a full (outer) join, all.x = TRUE gives a left join, and all.y = TRUE gives a right join.
Inner mode: only rows whose key values appear in both data sets are returned
# When there are multiple common columns, indicate which column is used as the join column
merge(x, y, by = intersect(names(x)[1], names(y)[1]))
# When the join column is called 'name' in both data sets, it can be specified with by.x and by.y
merge(x, y, by.x = 'name', by.y = 'name')
# When both data sets have the join column, directly specify its name with by
merge(x, y, by = 'name')
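The inner join described above might look like this in practice. This is only a sketch; x, y and their contents are hypothetical sample data.

```r
# Hypothetical sample data sharing the key column "name"
x <- data.frame(name = c("aa", "bb", "cc"), age = c(20, 29, 30))
y <- data.frame(name = c("bb", "cc", "dd"),
                city = c("Paris", "Rome", "Oslo"))

# Inner join (default all = FALSE): only names present in both are kept
inner <- merge(x, y, by = "name")
print(inner)
```

Rows "aa" (only in x) and "dd" (only in y) are dropped, which is what distinguishes the inner mode from the other three.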
Outer mode: all rows of both tables are returned, and values that a table does not have are filled with NA
merge(x, y, all=TRUE, sort=TRUE)
# all = TRUE keeps all rows of both x and y; sort = TRUE sorts the result by the by columns, in ascending order
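A small sketch of the outer join, again with made-up data, shows where the NA values end up:

```r
# Hypothetical sample data sharing the key column "name"
x <- data.frame(name = c("aa", "bb", "cc"), age = c(20, 29, 30))
y <- data.frame(name = c("bb", "cc", "dd"),
                city = c("Paris", "Rome", "Oslo"))

# Outer join: keep every row from both sides
outer <- merge(x, y, by = "name", all = TRUE)
print(outer)

# Rows without a partner get NA in the columns of the other table
is.na(outer$city[outer$name == "aa"])  # "aa" has no row in y
is.na(outer$age[outer$name == "dd"])   # "dd" has no row in x
```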
left matching pattern
merge(x, y, all.x = TRUE, sort = TRUE)
# by is left at its default, so all common columns are used as join columns. Left join: with all.x = TRUE the result keeps every row of x and adds the columns of y that x does not have; rows of x without a match get NA in those columns
merge(x, y, by = 'name', all.x = TRUE, sort = TRUE)
# With multiple common columns, specify the join column. Left join: with all.x = TRUE the result contains every value of the key column of x
right matching pattern
merge(x, y, by = 'name', all.y = TRUE, sort = TRUE)
# With multiple common columns, specify the join column. Right join: with all.y = TRUE the result contains every value of the key column of y
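To make the difference between the left and right modes concrete, the following sketch runs both on the same pair of hypothetical data frames (the data is invented for illustration):

```r
# Hypothetical sample data sharing the key column "name"
x <- data.frame(name = c("aa", "bb", "cc"), age = c(20, 29, 30))
y <- data.frame(name = c("bb", "cc", "dd"),
                city = c("Paris", "Rome", "Oslo"))

# Left join: every row of x is kept, unmatched y columns become NA
left <- merge(x, y, by = "name", all.x = TRUE)
print(left)

# Right join: every row of y is kept, unmatched x columns become NA
right <- merge(x, y, by = "name", all.y = TRUE)
print(right)
```

In the left result "aa" is kept with city set to NA, while in the right result "dd" is kept with age set to NA.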
Supplement: Use of subset and merge functions in R language
1. The operation of the merge function on the data frame
Select the rows of two data frames whose key values are equal and combine them into a new data frame
df1=data.frame(name=c("aa","bb","cc"),age=c(20,29,30),sex=c("f","m","f"))
df2=data.frame(name=c("dd","bb","cc"),age=c(40,35,36),sex=c("f","m","f"))
mergedf=merge(df1,df2,by="name")
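Because df1 and df2 share all three column names, the choice of by matters here. The sketch below reuses the two data frames from the example and compares by = "name" with the default; the variable name default_merge is introduced only for this illustration.

```r
df1 <- data.frame(name = c("aa", "bb", "cc"), age = c(20, 29, 30),
                  sex = c("f", "m", "f"))
df2 <- data.frame(name = c("dd", "bb", "cc"), age = c(40, 35, 36),
                  sex = c("f", "m", "f"))

# Join on name only: the other shared columns get the .x / .y suffixes
mergedf <- merge(df1, df2, by = "name")
print(names(mergedf))
print(mergedf)

# Default by: name, age and sex must ALL match, so no row qualifies here
default_merge <- merge(df1, df2)
print(nrow(default_merge))
```

With by = "name" the rows for "bb" and "cc" are returned with columns age.x, sex.x, age.y and sex.y; with the default by, the ages differ between the two data frames, so the result is empty.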
2. Subset function
Select data or related columns that meet certain criteria from a certain data frame
(1) Single condition query
> selectresult=subset(df1,name=="aa")
> selectresult
  name age sex
1   aa  20   f
> df1
  name age sex
1   aa  20   f
2   bb  29   m
3   cc  30   f
(2) Specify the display column
> selectresult=subset(df1,name=="aa",select=c(age,sex))
> selectresult
  age sex
1  20   f
(3) Multi-condition query
> selectresult=subset(df1,name=="aa" & sex=="f",select=c(age,sex))
> selectresult
  age sex
1  20   f
> df1
  name age sex
1   aa  20   f
2   bb  29   m
3   cc  30   f
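Conditions can also be combined with | (or), and the same selection can be written with plain bracket indexing. The following is a minimal sketch on df1 from above; the condition and the names res and res2 are chosen only for illustration.

```r
df1 <- data.frame(name = c("aa", "bb", "cc"), age = c(20, 29, 30),
                  sex = c("f", "m", "f"))

# Rows where age is over 25 OR sex is "m", keeping only name and age
res <- subset(df1, age > 25 | sex == "m", select = c(name, age))
print(res)

# The same selection written with bracket indexing
res2 <- df1[df1$age > 25 | df1$sex == "m", c("name", "age")]
print(res2)
```

subset lets the column names be used directly inside the condition, whereas bracket indexing needs the df1$ prefix; for interactive work subset is usually the more readable of the two.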
The above is based on personal experience. I hope it serves as a useful reference. If there are mistakes or anything I have not fully considered, corrections are welcome.