In R versions before 4.0.0, if "stringsAsFactors = FALSE" is not explicitly specified when importing data, all string columns are converted to factors by default, which slows down the processing of large batches of data. (Since R 4.0.0 the default is FALSE.)
The sample data is as follows:
name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018
Checking the data summary shows that the strings have been converted to factors by default and that a count per group has been computed (this is also one of the reasons for the slow processing speed).
The summary is as follows:
     name        math         english     sex        year     
 guagua:1   Min.   :65.0   Min.   :68.0   F:2   Min.   :2018  
 MM    :1   1st Qu.:72.5   1st Qu.:75.5   M:2   1st Qu.:2018  
 yiifaa:1   Median :80.0   Median :83.0         Median :2018  
 yiifee:1   Mean   :80.0   Mean   :83.0         Mean   :2018  
            3rd Qu.:87.5   3rd Qu.:90.5         3rd Qu.:2018  
            Max.   :95.0   Max.   :98.0         Max.   :2018  
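To confirm which columns were turned into factors without reading through the whole summary, the column classes can be checked directly. The following is only a minimal sketch: it inlines the sample data through the text argument of read.csv() instead of reading a file, and it sets stringsAsFactors = TRUE explicitly so that the result does not depend on the R version. Both choices are assumptions made to keep the example self-contained.

```r
# Sample data inlined so the example runs without an external file (assumption).
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Request factor conversion explicitly to mimic the pre-4.0.0 default.
scores <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

# One class per column: the string columns name and sex come back as factors.
col_classes <- sapply(scores, class)
print(col_classes)
```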
But such a group count is meaningless here, so the affected factor columns need to be converted back into character vectors, as follows:
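#!/usr/bin/env Rscript
setwd("D:/Workspace/R-Works/R-Stat")
# Read the CSV file. The file name is blank in the source, so supply your own path;
# read.csv() is assumed here, based on the arguments used.
scores <- read.csv("", header = TRUE, sep = ",", quote = "\"", encoding = "UTF-8", stringsAsFactors = TRUE)
# Convert factor to character
scores$name <- as.character(scores$name)
# Convert one more column for testing
scores$sex <- as.character(scores$sex)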
Check the summary again, as follows:
     name                math         english          sex                 year     
 Length:4           Min.   :65.0   Min.   :68.0   Length:4           Min.   :2018  
 Class :character   1st Qu.:72.5   1st Qu.:75.5   Class :character   1st Qu.:2018  
 Mode  :character   Median :80.0   Median :83.0   Mode  :character   Median :2018  
                    Mean   :80.0   Mean   :83.0                      Mean   :2018  
                    3rd Qu.:87.5   3rd Qu.:90.5                      3rd Qu.:2018  
                    Max.   :95.0   Max.   :98.0                      Max.   :2018  
The summary no longer contains a count per group for these columns; it only reports the total length. If you want to restore the group count, you need to recreate the factor, as follows:
scores$sex <- factor(scores$sex, levels=c("M", "F"), ordered = TRUE)
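As a quick check that the group count really comes back, the same factor() call can be tried on a standalone vector. This is only an illustrative sketch; the vector sex below is a hypothetical stand-in for the scores$sex column after it has been converted to character.

```r
# Character values as they look after as.character() (stand-in for scores$sex).
sex <- c("M", "F", "M", "F")

# Recreate the factor with an explicit level order, as in the line above.
sex_factor <- factor(sex, levels = c("M", "F"), ordered = TRUE)

# summary() on a factor reports one count per level, in level order.
sex_counts <- summary(sex_factor)
print(sex_counts)
```

Because the levels are given explicitly, the counts are reported in the order M, F rather than alphabetically.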
Conclusion
When importing large batches of data, take the following two steps wherever possible to improve performance:
1. Explicitly specify "stringsAsFactors = FALSE";
2. Convert only the required data columns (vectors) into factors, one at a time.
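The two steps could be sketched as follows. This is a minimal illustration rather than a definitive recipe: the data is inlined through the text argument of read.csv() to keep the example self-contained, and the vector factor_cols is a hypothetical name for the list of columns you decide are genuinely categorical.

```r
# Sample data inlined so the example runs without an external file (assumption).
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Step 1: keep strings as character vectors during the import.
scores <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Step 2: convert only the columns that should be factors.
factor_cols <- c("sex")  # hypothetical choice for this data set
scores[factor_cols] <- lapply(scores[factor_cols], factor)

str(scores)
```

Here name stays a character column while sex becomes a factor, so summary(scores) reports a group count only where it is meaningful.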
Supplement: Converting between variable names and strings in R
In R, you often need to convert between variable names and strings.
For example, suppose a loop runs 1000 times and each result should be stored in its own variable, such as x_1, x_2, ..., x_1000. The assign() function can be used for this. A basic example:
> a
Error: object 'a' not found
> assign('a', 1)
> a
[1] 1
The example above creates a variable named a from the string 'a' and assigns it the value 1.
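Applied to the loop scenario, the idea might look like the sketch below. The squared value is a hypothetical placeholder for the real calculation, and only 5 iterations are used instead of 1000 to keep the example short.

```r
# Create x_1 ... x_5, each holding the result of one iteration.
n <- 5
for (i in seq_len(n)) {
  result <- i^2                    # placeholder calculation (assumption)
  assign(paste0("x_", i), result)  # build the variable name from a string
}

print(x_3)
```

paste0() builds the name "x_1", "x_2", and so on, and assign() binds the value to that name in the current environment.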
Conversely, what if we want to iterate over a sequence of variable names and operate on each of the variables? The get() function does this. For example:
> a <- 1
> b <- 2
> c <- 3
> sequence <- c('a', 'b', 'c')
> for (var in sequence){print(var + 10)}
Error in var + 10 : non-numeric argument to binary operator
> for (var in sequence){print(get(var) + 10)}
[1] 11
[1] 12
[1] 13
As the output shows, get() looks up the variable whose name is stored in the string var, so the subsequent operation is performed on that variable's value rather than on the string itself.
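The same lookup can be written without an explicit for loop. The sketch below is one possible variant, not the only way to do it: it uses sapply() together with get(), and the related base R function mget(), which fetches several variables at once. The names var_names, shifted, and values are hypothetical and chosen only for this illustration.

```r
a <- 1
b <- 2
c <- 3
var_names <- c("a", "b", "c")

# get() looks up one variable by name; sapply() applies it to each name in turn.
shifted <- sapply(var_names, function(nm) get(nm) + 10)
print(shifted)

# mget() fetches several variables at once and returns them as a named list.
values <- mget(var_names)
str(values)
```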
The above is my personal experience. I hope it serves as a useful reference, and I appreciate your support. If there are any mistakes or points I have not fully considered, corrections are welcome.