SoFunction
Updated on 2025-03-01

R: Converting between factors and strings

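When importing large batches of data into a data frame in R versions before 4.0.0, all strings are converted to factors by default unless "stringsAsFactors = FALSE" is explicitly specified, which slows down data processing. Since R 4.0.0 the default is FALSE, but the conversion still happens whenever "stringsAsFactors = TRUE" is passed.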

The sample data is as follows:

name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018

Checking the data summary shows that the strings were converted to factors by default and that a count per level (a group count) was computed, which is also one of the reasons for the slow processing speed.

The summary is as follows:

  name        math         english     sex        year     
 guagua:1   Min.   :65.0   Min.   :68.0   F:2   Min.   :2018  
 MM    :1   1st Qu.:72.5   1st Qu.:75.5   M:2   1st Qu.:2018  
 yiifaa:1   Median :80.0   Median :83.0         Median :2018  
 yiifee:1   Mean   :80.0   Mean   :83.0         Mean   :2018  
            3rd Qu.:87.5   3rd Qu.:90.5         3rd Qu.:2018  
            Max.   :95.0   Max.   :98.0         Max.   :2018  

But such a group count is meaningless for these columns, so the factors need to be converted into character vectors, as follows:

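#!/usr/bin/env Rscript
setwd("D:/Workspace/R-Works/R-Stat")
# "scores.csv" is a placeholder file name and read.csv is the assumed reader function
scores <- read.csv("scores.csv", header = TRUE, sep = ",", quote = "\"", encoding = "UTF-8", stringsAsFactors = TRUE)
# Convert the factor to character
scores$name <- as.character(scores$name)
# Convert one more column for testing
scores$sex <- as.character(scores$sex)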
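When many columns are affected, converting them one at a time becomes tedious. The following is a minimal sketch, not the author's code, of converting every factor column in one pass. It reads the sample data from an inline string so that it does not depend on a file, and the names `csv_text` and `factor_cols` are illustrative choices only.

```r
# Inline copy of the sample data, so the example is self-contained.
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Read with factors enabled to reproduce the situation described above.
scores <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Logical vector marking which columns were read as factors.
factor_cols <- vapply(scores, is.factor, logical(1))

# Replace each factor column with its character representation.
scores[factor_cols] <- lapply(scores[factor_cols], as.character)

# Show the class of every column after the conversion.
print(vapply(scores, class, character(1)))
```

Numeric columns are left untouched because only the columns flagged by `is.factor` are rewritten.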

Check out the summary again, as follows:

name                math         english         sex                 year     
 Length:4           Min.   :65.0   Min.   :68.0   Length:4           Min.   :2018  
 Class :character   1st Qu.:72.5   1st Qu.:75.5   Class :character   1st Qu.:2018  
 Mode  :character   Median :80.0   Median :83.0   Mode  :character   Median :2018  
                    Mean   :80.0   Mean   :83.0                      Mean   :2018  
                    3rd Qu.:87.5   3rd Qu.:90.5                      3rd Qu.:2018  
                    Max.   :95.0   Max.   :98.0                      Max.   :2018  

The summary no longer shows group counts; the character columns now report only their length, class, and mode. To restore the group count, recreate the factor, as follows:

scores$sex <- factor(scores$sex, levels=c("M", "F"), ordered = TRUE)
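The round trip between strings and factors can be checked on a small vector. This is a sketch for illustration only: the vector `sex` stands in for the column above, and `ordered = TRUE` is left out because the sketch only demonstrates the conversion itself.

```r
# A plain character vector standing in for the "sex" column.
sex <- c("M", "F", "M", "F")

# String -> factor, with the level order given explicitly.
sex_factor <- factor(sex, levels = c("M", "F"))
print(levels(sex_factor))   # [1] "M" "F"
print(table(sex_factor))    # group counts: M = 2, F = 2

# Factor -> string recovers the original values.
sex_back <- as.character(sex_factor)
print(identical(sex_back, sex))   # [1] TRUE

# Internally a factor stores integer codes that index into its levels.
print(as.integer(sex_factor))     # [1] 1 2 1 2
```

The integer codes follow the order given in `levels`, so changing that order changes the codes but not the strings returned by `as.character()`.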

Conclusion

When importing large batches of data, take the following two steps wherever possible to improve performance:

1. Explicitly specify "stringsAsFactors = FALSE";

2. Convert only the data columns (vectors) that actually need it into factors, one at a time.
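The two steps above can be sketched as follows. This is an illustrative example under stated assumptions rather than a definitive recipe: the data is read from an inline string (`csv_text` is a made-up name) instead of a file, and only the "sex" column is treated as categorical.

```r
# Inline copy of the sample data, so the example is self-contained.
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Step 1: keep strings as character vectors while importing.
scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Step 2: convert only the column that is really categorical.
scores$sex <- factor(scores$sex, levels = c("M", "F"))

# "name" stays character, "sex" is now a factor.
str(scores)

# The group count is available again for the factor column.
print(summary(scores$sex))
```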

Supplement: Converting between variable names and strings in R

In R, you often need to convert between variable names and strings.

For example, suppose you run a loop 1000 times and want to store the results in 1000 variables, such as x_1, x_2, ..., x_1000. In this case you can use the assign() function, for example:

> a
Error: object 'a' not found
> assign('a', 1)
> a
[1] 1

The example above creates a variable named a from the string 'a' and assigns it the value 1.
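For the loop scenario mentioned earlier, the variable names can be built with paste0(). The sketch below uses 3 iterations instead of 1000 and stores `i^2` as a stand-in for a real calculation; both are assumptions made for illustration. It also shows a named list as a common alternative that avoids creating many separate variables.

```r
# Create x_1, x_2, x_3 by building each name as a string.
for (i in 1:3) {
  assign(paste0("x_", i), i^2)   # i^2 is a placeholder for the real result
}
print(x_2)   # [1] 4

# Alternative: keep the results in one named list instead.
results <- list()
for (i in 1:3) {
  results[[paste0("x_", i)]] <- i^2
}
print(results[["x_3"]])   # [1] 9
```

The list version keeps all results in a single object, which is usually easier to loop over afterwards.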

Conversely, what if we want to iterate over a sequence of variable names and operate on each variable? We can use the get() function, for example:

> a <- 1
> b <- 2
> c <- 3
> sequence <- c('a', 'b', 'c')
> for (var in sequence){print(var + 10)}
Error in var + 10 : non-numeric argument to binary operator

> for (var in sequence){print(get(var) + 10)}
[1] 11
[1] 12
[1] 13

As we can see, get() looks up the variable whose name is stored in the string var and returns its value, so the subsequent operations work on that value.
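When several variables are needed at once, base R also provides mget(), which takes a vector of names and returns a named list of values. The following is a minimal sketch using the same three variables; the names `var_names`, `values`, and `shifted` are illustrative.

```r
a <- 1
b <- 2
c <- 3
var_names <- c("a", "b", "c")

# mget() fetches several variables by name and returns a named list.
values <- mget(var_names)
print(values$b)   # [1] 2

# get() inside vapply() applies the same operation to each variable.
shifted <- vapply(var_names, function(nm) get(nm) + 10, numeric(1))
print(shifted)
```

Here `shifted` is a named numeric vector, so each result stays associated with the variable name it came from.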

The above is based on my personal experience, and I hope it serves as a useful reference. If there are any mistakes or anything I have not fully considered, corrections are welcome.