Solving the Data Imbalance Problem in R
1. Project environment
Development Tools: RStudio
R: 3.5.2
Related packages: dplyr, ROSE, DMwR
2. What is data imbalance, and why does it need to be handled?
The first question is what "data imbalance" actually means. Literally, it means the class distribution of the data is uneven: in supervised learning, if one class accounts for a much larger (or much smaller) share of the observations than the others, the dataset has a data imbalance problem.
So what effect does this have on the subsequent analysis? A simple example makes it clear.
Suppose we need to train a model that picks out the terrorists in a crowd, and we are given data on 10,000 people. Before the analysis even starts, we know that the proportion of terrorists in any crowd is far smaller than the proportion of ordinary people.
If only one of these 10,000 people is a terrorist, the ratio of normal people to terrorists is 9999:1.
If we run supervised learning directly, without any processing, the model can simply classify everyone as a normal person and still reach 99.99% accuracy, yet such a model is obviously useless.
The features that actually characterize terrorists are essentially ignored by the model, and this is exactly why we need to deal with data imbalance.
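To see how misleading accuracy becomes here, the following is a minimal sketch (not from the original article) that simulates the 9999:1 setting above with made-up labels and a classifier that always predicts "normal":

# An always-"normal" classifier on a 9999:1 dataset: high accuracy, zero usefulness
labels <- factor(c(rep("normal", 9999), "terrorist"))
pred   <- factor(rep("normal", 10000), levels = levels(labels))

mean(pred == labels)                               # accuracy: 0.9999 (99.99%)
sum(pred == "terrorist" & labels == "terrorist")   # terrorists detected: 0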
3. Common data imbalance processing methods
Here are some common methods for dealing with data imbalance:
1. Undersampling method
2. Oversampling method
3. Synthetic Data Generation
4. Cost-Sensitive Learning
[Note]: This article focuses on implementation, so the theory behind these methods is not explained here.
Before processing the data, let's first look at its class distribution.
load("C:/Users/User/Desktop/") table(data$classification) (table(data$classification))
> table(data$classification)
-8 1 2 3 4 5
12 104 497 1158 4817 1410
> prop.table(table(data$classification))
-8 1 2 3 4 5
0.001500375 0.013003251 0.062140535 0.144786197 0.602275569 0.176294074
1. Undersampling
######### Method 1 ############
library(ROSE)

# Since this is a multi-class problem, first extract the largest and the smallest
# class and balance them (turning it into a binary classification problem)
test <- data[which(data$classification == -8 | data$classification == 4), ]

# Convert the class label into a factor (otherwise an error is reported)
test$classification <- as.factor(test$classification)

# Undersample:
#   method = "under"  use undersampling
#   N = 40            size of the final dataset
#   seed              random seed, so the sampling can be reproduced
under <- ovun.sample(classification ~ ., data = test, method = "under", N = 40, seed = 1)$data

# View the result
table(under$classification)
> table(under$classification)
4 -8
28 12
########## Method 2 ############
library(dplyr)

# Since this is a multi-class problem, first extract the largest and the smallest
# class and balance them (turning it into a binary classification problem)
test <- data[which(data$classification == -8 | data$classification == 4), ]

# Extract the large class
test1 <- test[which(test$classification == 4), ]

# Reduce the large class to 12 observations; undersampling samples without replacement
down <- sample_n(test1, 12, replace = FALSE)

# Merge the undersampled large class with the small class
down <- rbind(test[which(test$classification == -8), ], down)
table(down$classification)
> table(down$classification)
-8 4
12 12
[Note]: Undersampling is sampling without replacement.
2. Oversampling
######### Method 1 ############
library(ROSE)

test <- data[which(data$classification == -8 | data$classification == 4), ]
test$classification <- as.factor(test$classification)

# The implementation is roughly the same as for undersampling; only method is
# changed to "over", and there is no limit on the total number of observations
under <- ovun.sample(classification ~ ., data = test, method = "over", seed = 1)$data
table(under$classification)
> table(under$classification)
4 -8
4817 4785
########## Method 2 ############
library(dplyr)

test <- data[which(data$classification == -8 | data$classification == 4), ]

# Extract the small class
test1 <- test[which(test$classification == -8), ]

# Grow the small class to 4817 observations (the same size as the large class).
# The oversampling used here simply resamples the small class with replacement
# until it reaches the specified size.
down <- sample_n(test1, 4817, replace = TRUE)

down <- rbind(test[which(test$classification == 4), ], down)
table(down$classification)
> table(down$classification)
-8 4
4817 4817
3. Synthetic Data Generation
######### Method 1 ############
library(ROSE)

# Since this is a multi-class problem, first extract the largest and the smallest
# class and balance them (turning it into a binary classification problem)
test <- data[which(data$classification == -8 | data$classification == 4), ]

# Convert the class label into a factor (otherwise an error is reported)
test$classification <- as.factor(test$classification)

# ROSE provides the ROSE() function to synthesize artificial data
rose <- ROSE(classification ~ ., data = test, seed = 1)$data

# View the result
table(rose$classification)
> table(rose$classification)
4 -8
2483 2346
########## Method 2 ############
library(DMwR)

test <- data[which(data$classification == -8 | data$classification == 4), ]
test$classification <- as.factor(test$classification)

# perc.over  = n: the small class grows to (n/100)*a + a observations
#                 (a is the original size of the small class)
# perc.under = m: the large class becomes (n*m/100^2)*a observations
# Here the small class becomes (3500/100)*12 + 12 = 432
# and the large class becomes (3500*300/100^2)*12 = 1260
down <- SMOTE(classification ~ ., test, perc.over = 3500, perc.under = 300)
table(down$classification)
> table(down$classification)
-8 4
432 1260
[Note]: Compared with the first two methods, synthetic data generation is less prone to the overfitting that oversampling can cause, and it avoids the large loss of information that undersampling entails.
4. Cost-Sensitive Learning
[Note]: I haven't figured out how to write this part yet.
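One common way to approach cost-sensitive learning, shown here only as a minimal sketch under assumptions (the binary test data frame from the examples above, the rpart package, and an arbitrary 10:1 cost ratio), is to pass a misclassification cost matrix to the learner so that errors on the rare class are penalized more heavily:

# Illustrative sketch: cost-sensitive decision tree with rpart
library(rpart)

# Loss matrix: rows = true class, columns = predicted class, zero on the diagonal.
# Factor levels are ("-8", "4"), so missing a rare "-8" costs 10, the reverse costs 1.
loss <- matrix(c(0, 10,
                 1,  0), nrow = 2, byrow = TRUE)

fit <- rpart(classification ~ ., data = test, parms = list(loss = loss))
table(predicted = predict(fit, test, type = "class"), actual = test$classification)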
4. Conclusion
This article restricts the analysis to two classes because the functions used above to address data imbalance are essentially designed for binary classification; if the input data contains more than two classes, they report an error.
In practice, however, multi-class problems are far more common. In that case we need to decompose the multi-class problem into binary ones and balance the classes pair by pair, as sketched below.
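As an illustration of this pairwise approach, here is a minimal sketch (not from the original article; using the majority class "4" as the reference and ovun.sample() for balancing are my assumptions) that balances each minority class against the majority class in turn:

# Balance every minority class against the majority class ("4"), one pair at a time
library(ROSE)

data$classification <- as.factor(data$classification)
majority   <- "4"                                  # largest class in this dataset
minorities <- setdiff(levels(data$classification), majority)

balanced_pairs <- lapply(minorities, function(cls) {
  pair <- droplevels(data[data$classification %in% c(majority, cls), ])
  ovun.sample(classification ~ ., data = pair, method = "over", seed = 1)$data
})
names(balanced_pairs) <- minorities

# One balanced binary dataset per minority class
lapply(balanced_pairs, function(d) table(d$classification))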