SoFunction
Updated on 2025-03-01

Commonly used data frame operations in R

In the previous section, we briefly introduced the definition of the data frame. In this section, let's look at data frame operations in detail.

First, the function that creates a data frame is data.frame( ). Let's consult the R help document to learn its usage:

Usage
data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = default.stringsAsFactors())
default.stringsAsFactors()
Arguments
... : these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.
row.names : NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.

Of course, data.frame( ) has more parameters, which I will not cover one by one here; the first two are the ones mainly used. "..." represents the table data, that is, the columns that make up the body of the data frame, and row.names gives the row names of the data frame. Since a data frame is R's equivalent of a table, it should have both row names and column names, so where do the column names come from? Much data processing software and many algorithms work column by column. The matrix we built earlier was filled by column by default (byrow=FALSE), and in a data frame the column names are fixed at creation time: each column takes the name of the vector (or the tag) it was passed in with. See the following code for details:

I want to create a data frame called "mydataframe": first define the columns it will contain, then call the data.frame( ) function.

> C1 <-c(1,2,3,4)
> C2 <-c(5,6,7,8)
> C3 <-c(9,10,11,12)
> C4 <-c(13,14,15,16)
> C5 <-c(17,18,19,20)
> mydataframe <- data.frame(C1,C2,C3,C4,C5, row.names = c("R1","R2","R3","R4"))
> mydataframe
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
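
How the column names arise can be checked directly. The following is a minimal sketch; the names `df_auto`, `df_tagged`, `first`, and `second` are illustrative and not part of the original example. It shows that an untagged argument contributes its own variable name as the column name, while a `tag = value` argument uses the tag.

```r
# Two illustrative column vectors
C1 <- c(1, 2, 3, 4)
C2 <- c(5, 6, 7, 8)

# Untagged arguments: column names come from the deparsed argument names
df_auto <- data.frame(C1, C2)

# Tagged arguments: column names come from the tags; row.names sets the row labels
df_tagged <- data.frame(first = C1, second = C2,
                        row.names = c("R1", "R2", "R3", "R4"))

print(names(df_auto))
print(names(df_tagged))
print(rownames(df_tagged))
```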

As you can see, a data frame splices existing columns into a table. Careful readers will notice that this data frame looks exactly like the matrix from the previous section. Let's review how that matrix was created:

> mydata <- c(1:20)
> cnames <- c("C1","C2","C3","C4","C5")
> rnames <- c("R1","R2","R3","R4")
> myarray <- matrix(mydata,nrow = 4,ncol = 5,dimnames = list(rnames,cnames))
> myarray
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20

Indeed, they look the same, but all elements of a matrix must have the same type, whereas a data frame can hold several types of data. This is not an arbitrary mix: it is organized by column. Different columns may have different element types, but all elements within one column must share a type. In this loose sense, a matrix behaves like a special case of a data frame in which every column has the same type. Why does this matter? Statistical data usually mixes types. Take a simple transcript: it contains character fields such as "name", "student number", and "subject", numeric fields such as "score", and Boolean fields such as "whether passed". Data frames are therefore the more general structure, while matrices are mostly used for mathematical computation. Enough talk; let's create a data frame and demonstrate its operations:

> names <- c("Xiao Ming","Xiaohong","Xiaolan")
> StudentID <- c("2014","2015","2016")
> subjects <- c("English","English","English")
> scores <- c(87,98,93)
> Result <- data.frame(StudentID,names,subjects,scores)
> Result
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
2      2015  Xiaohong  English     98
3      2016   Xiaolan  English     93
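
The type difference between a matrix and a data frame can be made concrete. The sketch below uses illustrative names (`m`, `transcript`, `passed`) that are not from the original, and sets `stringsAsFactors = FALSE` explicitly as an assumption so that text stays character in any R version.

```r
# A matrix holds a single type: mixing text and numbers coerces everything to character
m <- cbind(c("Xiao Ming", "Xiaohong"), c(87, 98))
print(typeof(m))

# A data frame keeps one type per column
transcript <- data.frame(
  name   = c("Xiao Ming", "Xiaohong"),
  score  = c(87, 98),
  passed = c(TRUE, TRUE),
  stringsAsFactors = FALSE   # assumption: keep text as character in every R version
)
print(sapply(transcript, class))
```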

As can be seen above, when no row names are specified for the data frame, R numbers the rows automatically starting from 1, somewhat like an Excel sheet. As usual, let's first learn the basic operations on the data frame type.

Access to data frame elements: if a matrix resembles a special data frame, do the matrix access methods also apply to data frames? Largely yes. dataframe[1,] accesses the first row and dataframe[,1] accesses the first column; in the latter case the column is returned as a vector, so it is printed in a single row. A data frame also has list-style access: dataframe[1] returns the first column and dataframe["column name"] returns the named column, each as a one-column data frame. Several rows or columns can be accessed at once; see the code for details:

> Result[1,]    # access the first row
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
> Result[,1]    # access the first column (returned as a vector)
[1] 2014 2015 2016
Levels: 2014 2015 2016
> Result[1]     # access the first column (returned as a data frame)
  StudentID
1      2014
2      2015
3      2016
> Result["names"]    # access the column with the specified name
      names
1 Xiao Ming
2  Xiaohong
3   Xiaolan

> Result[1:3,]    # access rows 1-3
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
2      2015  Xiaohong  English     98
3      2016   Xiaolan  English     93
> Result[1:3]     # access columns 1-3
  StudentID     names subjects
1      2014 Xiao Ming  English
2      2015  Xiaohong  English
3      2016   Xiaolan  English
> Result[c(1,3),]    # access only rows 1 and 3; note the use of c( )
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
3      2016   Xiaolan  English     93
> Result[c(1,4)]     # access only columns 1 and 4; note the use of c( )
  StudentID scores
1      2014     87
2      2015     98
3      2016     93
> Result[c("names","scores")]    # access only the names and scores columns; note the use of c( )
      names scores
1 Xiao Ming     87
2  Xiaohong     98
3   Xiaolan     93
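
The access forms differ in what they return, which matters for later calculations. The following is a small sketch on a rebuilt copy of the example data; the variable names `one_cell`, `as_vector`, `as_frame`, and `kept_frame` are illustrative. It also shows that a single cell can be addressed with `[row, column]`, and that `drop = FALSE` keeps the matrix-style form as a data frame.

```r
# Rebuild a small version of the example data (character columns kept as-is)
Result <- data.frame(
  StudentID = c("2014", "2015", "2016"),
  names     = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores    = c(87, 98, 93),
  stringsAsFactors = FALSE
)

one_cell   <- Result[2, "scores"]               # a single element: row 2 of the scores column
as_vector  <- Result[, "scores"]                # matrix-style column access returns a plain vector
as_frame   <- Result["scores"]                  # list-style access returns a one-column data frame
kept_frame <- Result[, "scores", drop = FALSE]  # drop = FALSE keeps the data frame structure

print(one_cell)
print(class(as_vector))
print(class(as_frame))
print(class(kept_frame))
```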

From the above we can see that to select several specific rows or columns at once, the indices or names must be combined into a vector with c( ). We also see that ordinary access returns the data together with its row and column names, which is sometimes inconvenient: to calculate the average score, for example, we only need the bare values of the scores column. So how can the columns of a data frame be accessed directly, without these labels?

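Method 1: Use the attach and detach functions. For example, to work with the names and scores directly, you can write the following. (The messages printed by attach( ) warn that variables with the same names already exist in the global environment and take precedence over the attached columns.)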

> attach(Result)
The following objects are masked _by_ .GlobalEnv:
    names, scores, StudentID, subjects
The following objects are masked from Result (pos = 3):
    names, scores, StudentID, subjects
> name <- names
> score <- scores
> detach(Result)
> name
[1] "Xiao Ming" "Xiaohong" "Xiaolan"
> score
[1] 87 98 93
> mean(score)
[1] 92.66667

Method 2: Use the with function

> score <- with(Result, scores)   # assignments made inside with( ) stay local, so assign its return value
> score
[1] 87 98 93
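
Because with( ) returns the value of its expression, a result can be computed and stored in one step without attaching anything. A minimal sketch on a rebuilt copy of the example data; `avg` and `top` are illustrative names.

```r
Result <- data.frame(
  names  = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores = c(87, 98, 93),
  stringsAsFactors = FALSE
)

# Column names are visible inside with(); the value of the expression is returned
avg <- with(Result, mean(scores))
top <- with(Result, names[which.max(scores)])

print(avg)
print(top)
```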

The above covers creating and reading data frames. What if we need to add or delete a column?

> Result$age <- c(12,14,13)    # add an age column
> Result
  StudentID     names subjects scores age
1      2014 Xiao Ming  English     87  12
2      2015  Xiaohong  English     98  14
3      2016   Xiaolan  English     93  13
> Result2 <- Result[-2]        # delete the names column (column 2)
> Result2
  StudentID subjects scores age
1      2014  English     87  12
2      2015  English     98  14
3      2016  English     93  13
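
Deleting by position breaks if the column order changes, so deleting by name is often safer. The sketch below shows two common base R ways of doing that on a rebuilt copy of the example data; `no_names` is an illustrative variable name.

```r
Result <- data.frame(
  StudentID = c("2014", "2015", "2016"),
  names     = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores    = c(87, 98, 93),
  stringsAsFactors = FALSE
)

Result$age <- c(12, 14, 13)                          # add a column

# Drop a column by name into a new data frame, leaving the original untouched
no_names <- Result[setdiff(names(Result), "names")]

# Remove a column from the data frame itself by assigning NULL
Result$age <- NULL

print(names(no_names))
print(names(Result))
```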

What if we need to look up the students whose score equals 98?

> Result[which(Result$scores==98),]
  StudentID     names subjects scores age
2      2015  Xiaohong  English     98  14
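
The same kind of query can be written with a logical index or with subset( ). A minimal sketch on a rebuilt copy of the example data; the threshold of 90 and the names `high` and `high_names` are illustrative.

```r
Result <- data.frame(
  names  = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores = c(87, 98, 93),
  stringsAsFactors = FALSE
)

# Logical indexing: keep the rows where the condition is TRUE
high <- Result[Result$scores >= 90, ]

# subset() evaluates the condition inside the data frame and can also pick columns
high_names <- subset(Result, scores >= 90, select = names)

print(high)
print(high_names)
```

Wrapping the condition in which( ), as in the query above, additionally drops rows where the condition is NA.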

As mentioned above, matrices and data frames are two different data types, and R data types can be converted into one another: is.***( ) tests whether a variable is of type ***, and as.***( ) converts a variable to type ***. Accordingly, a matrix is converted to a data frame with as.data.frame( ) and checked with is.data.frame( ):

> myarray
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> myarrayframe <- as.data.frame(myarray)
> myarrayframe
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> is.data.frame(myarray)
[1] FALSE
> is.data.frame(myarrayframe)
[1] TRUE
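
The conversion in the other direction deserves one caution: a matrix has a single element type, so converting a data frame with mixed columns produces a character matrix. A small sketch with illustrative data (`num_df`, `mixed_df`, and the other names are not from the original):

```r
num_df   <- data.frame(a = c(1, 2), b = c(3, 4))
mixed_df <- data.frame(id = c("x", "y"), score = c(1, 2),
                       stringsAsFactors = FALSE)

num_m   <- as.matrix(num_df)     # all columns numeric: numeric matrix
mixed_m <- as.matrix(mixed_df)   # mixed columns: everything becomes character

back <- as.data.frame(num_m)     # and back to a data frame

print(is.numeric(num_m))
print(is.character(mixed_m))
print(is.data.frame(back))
```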

As with matrices, data frames also support the rbind and cbind functions, and their usage is roughly the same. Interested readers can try them out briefly, so I won't go into details here.
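
For readers who want a starting point, here is a minimal sketch of both functions on data frames; the data and the names `scores_a`, `scores_b`, `all_rows`, and `with_age` are illustrative. rbind( ) expects matching column names, and cbind( ) expects matching row counts.

```r
scores_a <- data.frame(names = c("Xiao Ming", "Xiaohong"), scores = c(87, 98),
                       stringsAsFactors = FALSE)
scores_b <- data.frame(names = "Xiaolan", scores = 93,
                       stringsAsFactors = FALSE)

# rbind() appends rows; both data frames must have the same column names
all_rows <- rbind(scores_a, scores_b)

# cbind() appends columns; the new column must have one value per row
with_age <- cbind(all_rows, age = c(12, 14, 13))

print(with_age)
```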

Finally, let’s talk about data processing operations in data frames:

As mentioned above, dataframe[column number] or dataframe["column name"] reads one column of a data frame, and the return value is still a data frame. This is inconvenient for the sum and mean functions we used before: the result is a table that carries row and column names, not a plain numeric vector, so mean( ) cannot process it. Some may ask: why not create the data frame without row and column names? First, a data frame is always assigned default row and column names when it is created. Second, a data frame without names would lose much of its purpose.

> mydataframe
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> mydataframe["C4"]
   C4
R1 13
R2 14
R3 15
R4 16
> mean(mydataframe["C4"])
[1] NA
Warning message:
In mean.default(mydataframe["C4"]) :
  argument is not numeric or logical: returning NA
> is.data.frame(mydataframe["C4"])
[1] TRUE
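
The reason for the NA can be inspected directly: the single-bracket result is a list-like data frame, not a numeric vector. A minimal diagnostic sketch on a reduced copy of the example data; `col_df` and `result` are illustrative names.

```r
mydataframe <- data.frame(C1 = c(1, 2, 3, 4), C4 = c(13, 14, 15, 16),
                          row.names = c("R1", "R2", "R3", "R4"))

col_df <- mydataframe["C4"]

print(class(col_df))        # the object is still a data frame
print(is.list(col_df))      # a data frame is a list of columns
print(is.numeric(col_df))   # so it does not count as numeric

# mean() therefore gives up and returns NA (the warning is suppressed here)
result <- suppressWarnings(mean(col_df))
print(result)
```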

Method 1: Convert the data frame back into a matrix, pick out the data to process by matrix index, and then apply the usual matrix or vector functions.

> myarray2 <- as.matrix(mydataframe)
> is.matrix(myarray2)
[1] TRUE
> myarray2
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> x <- myarray2[,3]    # read the values of column 3
> x
R1 R2 R3 R4
 9 10 11 12
> is.vector(x)         # check whether x is a vector
[1] TRUE
> mean(x)
[1] 10.5
> sum(x)
[1] 42
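
Once the data is in matrix form, whole-matrix helpers become available as well. The sketch below, on a reduced copy of the example data, computes all column means and row sums at once with the base R functions colMeans( ) and rowSums( ); the variable names are illustrative.

```r
mydataframe <- data.frame(C1 = c(1, 2, 3, 4), C3 = c(9, 10, 11, 12),
                          row.names = c("R1", "R2", "R3", "R4"))

m <- as.matrix(mydataframe)

col_means <- colMeans(m)   # one mean per column
row_sums  <- rowSums(m)    # one sum per row

print(col_means)
print(row_sums)
```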

Method 2: Read the column with dataframe$column_name instead; the return value is a vector.

> col3 <- mydataframe$C3
> col3
[1]  9 10 11 12
> is.vector(col3)
[1] TRUE
> mean(col3)
[1] 10.5
> sum(col3)
[1] 42
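
The `$` operator needs the column name written literally. When the name is stored in a variable, the double-bracket form `[[ ]]` returns the same vector. A minimal sketch on a reduced copy of the example data; `target`, `by_dollar`, and `by_bracket` are illustrative names.

```r
mydataframe <- data.frame(C1 = c(1, 2, 3, 4), C3 = c(9, 10, 11, 12))

target <- "C3"                        # column name held in a variable

by_dollar  <- mydataframe$C3          # literal name after $
by_bracket <- mydataframe[[target]]   # name looked up from the variable

print(identical(by_dollar, by_bracket))
print(mean(by_bracket))

# $ does not evaluate the variable: this looks for a column literally called "target"
print(is.null(mydataframe$target))
```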

Similarly, dataframe$new_column_name <- new_vector adds a new column to the data frame. For example:

> mydataframe$sum <- mydataframe$C1 +mydataframe$C4
> mydataframe$mean <- (mydataframe$C1+mydataframe$C4)/2
> mydataframe
   C1 C2 C3 C4 C5 sum mean
R1  1  5  9 13 17  14    7
R2  2  6 10 14 18  16    8
R3  3  7 11 15 19  18    9
R4  4  8 12 16 20  20   10

Another popular method is the transform function, which builds a new data frame directly. The usage is as follows:

> x1 <- mydataframe$C1
> x2 <- mydataframe$C3
> mydataframe2 <- transform(mydataframe,sum2=x1+x2,mean2=(x1+x2)/2)
> mydataframe2
   C1 C2 C3 C4 C5 sum mean sum2 mean2
R1  1  5  9 13 17  14    7   10     5
R2  2  6 10 14 18  16    8   12     6
R3  3  7 11 15 19  18    9   14     7
R4  4  8 12 16 20  20   10   16     8
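
transform( ) can also refer to the columns by name inside the call, so the intermediate vectors are optional. A minimal sketch on a reduced copy of the example data:

```r
mydataframe <- data.frame(C1 = c(1, 2, 3, 4), C3 = c(9, 10, 11, 12),
                          row.names = c("R1", "R2", "R3", "R4"))

# Column names are looked up inside the data frame; the original is not modified
mydataframe2 <- transform(mydataframe,
                          sum2  = C1 + C3,
                          mean2 = (C1 + C3) / 2)

print(mydataframe2)
```

Note that a column created in a transform( ) call cannot be used by another expression in the same call, which is why mean2 repeats C1 + C3 instead of reusing sum2.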
