In the previous section, we briefly introduced the data frame. In this section, let's look at data frame operations in detail.
First, data frames are created with the data.frame() function. The R help documentation describes its usage as follows:
Usage
data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
default.stringsAsFactors()
Arguments
...: these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.
row.names: NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
There are several more parameters, which I will not go through one by one here; the first two are the ones used most. "..." is the table data, that is, the columns that make up the body of the data frame, and row.names gives the row names of the data frame. Since a data frame is R's equivalent of a table, it needs column names as well as row names, so where do the column names come from? Much data processing software and many algorithms work column by column. When we built a matrix earlier, it was filled by column by default (byrow=FALSE). A data frame is likewise assembled from columns, and its column names are fixed at creation time: each column takes the name of its tag or of the variable passed in. See the following code for details:
Suppose I want to create a data frame called "mydataframe". First define the columns that go into it, then call data.frame():
> C1 <- c(1,2,3,4)
> C2 <- c(5,6,7,8)
> C3 <- c(9,10,11,12)
> C4 <- c(13,14,15,16)
> C5 <- c(17,18,19,20)
> mydataframe <- data.frame(C1,C2,C3,C4,C5, row.names = c("R1","R2","R3","R4"))
> mydataframe
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
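The help text also mentions the tag = value form. As a minimal sketch (the names df_tagged, first, and second are made up for illustration), the tags become the column names directly:

```r
# Column names come from the tags when arguments are written as tag = value.
# df_tagged, first and second are illustrative names, not from the article.
df_tagged <- data.frame(
  first  = c(1, 2, 3, 4),
  second = c(5, 6, 7, 8),
  row.names = c("R1", "R2", "R3", "R4")
)

colnames(df_tagged)  # the tags: "first" "second"
rownames(df_tagged)  # the values passed to row.names
dim(df_tagged)       # 4 rows, 2 columns
```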
As you can see, a data frame splices existing columns together into a table. Careful readers will notice that this data frame looks exactly like the matrix from the previous section. Let's review how that matrix was created:
> mydata <- c(1:20)
> cnames <- c("C1","C2","C3","C4","C5")
> rnames <- c("R1","R2","R3","R4")
> myarray <- matrix(mydata,nrow = 4,ncol = 5,dimnames = list(rnames,cnames))
> myarray
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
Indeed, they look the same, but all elements of a matrix must have the same type, whereas a data frame can hold several types of data. It is not an arbitrary jumble, though: it is organized by column. Different columns may have different element types, but all elements within one column must share a type. In that sense, a matrix resembles a special case of a data frame in which every column has the same type. What is the point of this? In statistics we routinely need several types of data side by side. Take a simple transcript: it contains character fields such as "name", "student number", and "subject", numeric fields such as "score", and Boolean fields such as "passed or not". So data frames are the more general structure, while matrices are mostly used for mathematical computation. Enough talk; let's create a data frame and demonstrate the operations on it:
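To make the type rule concrete, here is a small sketch with made-up data (the variables m and transcript are illustrative). Mixing numbers and strings in a matrix coerces everything to one type, while a data frame keeps one type per column:

```r
# A matrix holds a single type: mixing numbers and strings coerces all to character.
m <- cbind(c(87, 98), c("a", "b"))
typeof(m)

# A data frame keeps one type per column.
transcript <- data.frame(
  name   = c("Xiao Ming", "Xiaohong"),
  score  = c(87, 98),
  passed = c(TRUE, TRUE),
  stringsAsFactors = FALSE  # keep strings as character regardless of R version
)
column_types <- sapply(transcript, class)
column_types
```

Setting stringsAsFactors = FALSE explicitly is a choice made for this sketch so that the string column stays a character vector on any R version.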
> names <- c("Xiao Ming","Xiaohong","Xiaolan")
> StudentID <- c("2014","2015","2016")
> subjects <- c("English","English","English")
> scores <- c(87,98,93)
> Result <- data.frame(StudentID,names,subjects,scores)
> Result
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
2      2015  Xiaohong  English     98
3      2016   Xiaolan  English     93
As shown above, when no row names are specified, R numbers the rows automatically starting from 1, much like an Excel sheet. As usual, let's start with the basic operations on data frames.
Accessing data frame elements: since a matrix and a data frame look alike, does matrix-style indexing also work on data frames? Largely yes, but a data frame is organized by rows and columns, so in practice we usually pull out a whole row or a whole column. dataframe[1,] returns the first row, still as a data frame. dataframe[,1] returns the first column as a plain vector, which is why it prints on a single line. A column can also be selected with dataframe[1] (by position) or dataframe["column name"] (by name), and both return a one-column data frame. Several rows or columns can be selected at once; see the code for details:
> Result[1,] #Access the first row
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
> Result[,1] #Access the first column
[1] 2014 2015 2016
Levels: 2014 2015 2016
> Result[1] #Access the first column
  StudentID
1      2014
2      2015
3      2016
> Result["names"] #Access the column with the specified label
      names
1 Xiao Ming
2  Xiaohong
3   Xiaolan
> Result[1:3,] #Access rows 1-3
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
2      2015  Xiaohong  English     98
3      2016   Xiaolan  English     93
> Result[1:3] #Access columns 1-3
  StudentID     names subjects
1      2014 Xiao Ming  English
2      2015  Xiaohong  English
3      2016   Xiaolan  English
> Result[c(1,3),] #Access only rows 1 and 3, note the use of c( )
  StudentID     names subjects scores
1      2014 Xiao Ming  English     87
3      2016   Xiaolan  English     93
> Result[c(1,4)] #Access only columns 1 and 4, note the use of c( )
  StudentID scores
1      2014     87
2      2015     98
3      2016     93
> Result[c("names","scores")] #Access only the names and scores columns, note the use of c( )
      names scores
1 Xiao Ming     87
2  Xiaohong     98
3   Xiaolan     93
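Beyond whole rows and columns, a single cell can be reached by giving both a row and a column index. The sketch below rebuilds a shortened Result so that it runs on its own; `[[ ]]` and `drop = FALSE` are standard R indexing features that this article does not otherwise cover:

```r
# A shortened copy of the article's Result table, rebuilt so the sketch is self-contained.
Result <- data.frame(
  StudentID = c("2014", "2015", "2016"),
  names     = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores    = c(87, 98, 93),
  stringsAsFactors = FALSE
)

cell   <- Result[2, "scores"]               # one cell: row 2 of the scores column
as_vec <- Result[["scores"]]                # one column as a plain vector
as_df  <- Result[, "scores", drop = FALSE]  # one column, kept as a data frame
```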
From the above we can see that to select several rows or columns at once, the indices must be combined into a vector with c( ). We can also see that ordinary access always has to spell out the data frame name together with a row or column index, which is sometimes unnecessarily cumbersome. For example, to calculate the average score we have to write out the full reference to the scores column. So how can we refer to the columns of a data frame directly by name?
Method 1: Use the attach and detach functions. For example, to print all the names and scores, you can write the code below. Note the message R prints: because vectors with the same names already exist in the workspace, those take precedence over the attached columns.
> attach(Result)
The following objects are masked _by_ .GlobalEnv:
    names, scores, StudentID, subjects
The following objects are masked from Result (pos = 3):
    names, scores, StudentID, subjects
> name <- names
> score <- scores
> detach(Result)
> name
[1] "Xiao Ming" "Xiaohong"  "Xiaolan"
> score
[1] 87 98 93
> mean(score)
[1] 92.66667
Method 2: Use the with function
> score <- with(Result, scores) #with() returns the value; assignments made inside it stay local
> score
[1] 87 98 93
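Since with() evaluates an expression with the columns available as variables and returns the result, the computation itself can go inside the call. A minimal sketch, with a reduced Result rebuilt here and avg and top as illustrative names:

```r
# Reduced copy of the article's Result table so the sketch is self-contained.
Result <- data.frame(
  names  = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores = c(87, 98, 93),
  stringsAsFactors = FALSE
)

# Columns are visible by name inside with(); the value of the expression is returned.
avg <- with(Result, mean(scores))
top <- with(Result, names[scores == max(scores)])
```

Compared with attach, nothing is added to the search path, so there is no masking to worry about and nothing to detach afterwards.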
So far we have covered creating and reading data frames. What if we need to add or delete a column?
> Result$age <- c(12,14,13) #Add age column
> Result
  StudentID     names subjects scores age
1      2014 Xiao Ming  English     87  12
2      2015  Xiaohong  English     98  14
3      2016   Xiaolan  English     93  13
> Result2 <- Result[-2] #Delete names column
> Result2
  StudentID subjects scores age
1      2014  English     87  12
2      2015  English     98  14
3      2016  English     93  13
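Deleting by position works, but it silently removes the wrong column if the column order ever changes. As a hedged alternative sketch (kept is an illustrative name), a column can be removed in place by assigning NULL, or dropped by name:

```r
# Reduced copy of the article's Result table so the sketch is self-contained.
Result <- data.frame(
  StudentID = c("2014", "2015", "2016"),
  names     = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores    = c(87, 98, 93),
  stringsAsFactors = FALSE
)

Result$age <- c(12, 14, 13)  # add a column
Result$age <- NULL           # remove it again, in place

# Drop a column by name instead of by position.
kept <- Result[, setdiff(colnames(Result), "names"), drop = FALSE]
```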
What if we need to look up the students whose score equals 98?
> Result[which(Result$scores==98),]
  StudentID    names subjects scores age
2      2015 Xiaohong  English     98  14
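The same kind of query can be written in a couple of other ways. This sketch uses a reduced Result and two standard base R idioms, a logical row index and subset(); exact and high are illustrative names:

```r
# Reduced copy of the article's Result table so the sketch is self-contained.
Result <- data.frame(
  names  = c("Xiao Ming", "Xiaohong", "Xiaolan"),
  scores = c(87, 98, 93),
  stringsAsFactors = FALSE
)

# A logical vector can index rows directly, without which().
exact <- Result[Result$scores == 98, ]

# subset() filters rows and, optionally, picks columns by name.
high <- subset(Result, scores >= 90, select = c(names, scores))
```

One difference worth knowing: which() drops NA comparisons, while a bare logical index returns an all-NA row for each of them, so the two forms only agree when the column has no missing values.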
As mentioned above, matrices and data frames are two different data types, and data types can be converted into one another: is.***( ) tests whether a variable is of type ***, and as.***( ) converts a variable to type ***. Accordingly, a matrix is converted to a data frame with as.data.frame( ) and tested with is.data.frame( ):
> myarray
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> myarrayframe <- as.data.frame(myarray)
> myarrayframe
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> is.data.frame(myarray)
[1] FALSE
> is.data.frame(myarrayframe)
[1] TRUE
Like matrices, data frames also support the rbind and cbind functions, and the usage is roughly the same. Interested readers can try them out, so I won't go into details here.
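For readers who want a starting point, here is a minimal sketch with made-up data (new_row, by_row, and by_col are illustrative names). rbind appends rows and expects matching column names; cbind appends columns and expects a matching number of rows:

```r
Result <- data.frame(
  names  = c("Xiao Ming", "Xiaohong"),
  scores = c(87, 98),
  stringsAsFactors = FALSE
)
new_row <- data.frame(names = "Xiaolan", scores = 93, stringsAsFactors = FALSE)

by_row <- rbind(Result, new_row)              # add a row; column names must match
by_col <- cbind(by_row, age = c(12, 14, 13))  # add a column; length must equal nrow
```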
Finally, let’s talk about data processing operations in data frames:
As mentioned above, dataframe[column number] or dataframe["column name"] reads one column of the data frame, and the return value is still a data frame. That makes it inconvenient to apply the sum and mean functions we covered earlier directly, because those functions expect a numeric vector, while the result here is a table that still carries its row and column names. Some people may ask: couldn't I just create the data frame without row and column names? First, a data frame is always given default row and column names when it is created. Second, without row and column names, what would be the point of creating a data frame at all?
> mydataframe
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> mydataframe["C4"]
   C4
R1 13
R2 14
R3 15
R4 16
> mean(mydataframe["C4"])
[1] NA
Warning message:
In mean.default(mydataframe["C4"]) :
  argument is not numeric or logical: returning NA
> is.data.frame(mydataframe["C4"])
[1] TRUE
Method 1: Convert the data frame back into a matrix, pick out the data to be processed by matrix index, and then apply the usual matrix or vector functions.
> myarray2 <- as.matrix(mydataframe)
> is.matrix(myarray2)
[1] TRUE
> myarray2
   C1 C2 C3 C4 C5
R1  1  5  9 13 17
R2  2  6 10 14 18
R3  3  7 11 15 19
R4  4  8 12 16 20
> x <- myarray2[,3] #Read the values of column 3
> x
R1 R2 R3 R4
 9 10 11 12
> is.vector(x) #Check whether x is a vector
[1] TRUE
> mean(x)
[1] 10.5
> sum(x)
[1] 42
Method 2: Read the column with dataframe$column_name instead; the return value is a vector.
> col3 <- mydataframe$C3 #col3 rather than c, which is also the name of the c( ) function
> col3
[1]  9 10 11 12
> is.vector(col3)
[1] TRUE
> mean(col3)
[1] 10.5
> sum(col3)
[1] 42
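When the same statistic is needed for every column, extracting the columns one at a time gets repetitive. As a sketch using two standard base R functions that this article does not otherwise cover, colMeans() and sapply() work on the whole data frame at once (shown on a two-column copy of the table):

```r
# Two columns of the article's mydataframe, rebuilt so the sketch is self-contained.
mydataframe <- data.frame(
  C1 = c(1, 2, 3, 4),
  C3 = c(9, 10, 11, 12),
  row.names = c("R1", "R2", "R3", "R4")
)

col_means <- colMeans(mydataframe)     # mean of each column, as a named vector
col_sums  <- sapply(mydataframe, sum)  # apply sum() to each column in turn
```

Both assume that every column is numeric.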
In the same way, dataframe$new_column_name <- new vector adds a new column to the data frame. The specific operation is as follows:
> mydataframe$sum <- mydataframe$C1 + mydataframe$C4
> mydataframe$mean <- (mydataframe$C1 + mydataframe$C4)/2
> mydataframe
   C1 C2 C3 C4 C5 sum mean
R1  1  5  9 13 17  14    7
R2  2  6 10 14 18  16    8
R3  3  7 11 15 19  18    9
R4  4  8 12 16 20  20   10
The most commonly used approach is the next one: the transform function builds a new data frame directly. The specific usage is as follows:
> x1 <- mydataframe$C1
> x2 <- mydataframe$C3
> mydataframe2 <- transform(mydataframe,sum2=x1+x2,mean2=(x1+x2)/2)
> mydataframe2
   C1 C2 C3 C4 C5 sum mean sum2 mean2
R1  1  5  9 13 17  14    7   10     5
R2  2  6 10 14 18  16    8   12     6
R3  3  7 11 15 19  18    9   14     7
R4  4  8 12 16 20  20   10   16     8
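The helper vectors x1 and x2 are not strictly needed, because transform() evaluates its expressions with the columns of the data frame in scope. A minimal sketch on a two-column copy of the table:

```r
# Two columns of the article's mydataframe, rebuilt so the sketch is self-contained.
mydataframe <- data.frame(
  C1 = c(1, 2, 3, 4),
  C3 = c(9, 10, 11, 12),
  row.names = c("R1", "R2", "R3", "R4")
)

# Columns are referred to by name inside transform(); the original is left unchanged.
mydataframe2 <- transform(mydataframe, sum2 = C1 + C3, mean2 = (C1 + C3) / 2)
```

Note that a column created in a transform() call cannot be used by another expression in the same call, which is why mean2 repeats C1 + C3 instead of reusing sum2.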