Converting categorical variables into dummy variables in R

Generate test data

a1 <- c("f", "f", "b", "b", "c", "c")

Use the class.ind function from the nnet package

> class.ind(a1)
     b c f
[1,] 0 0 1
[2,] 0 0 1
[3,] 1 0 0
[4,] 1 0 0
[5,] 0 1 0
[6,] 0 1 0

Code

class.ind <- function(cl) {
  n <- length(cl)
  cl <- as.factor(cl)
  x <- matrix(0, n, length(levels(cl)))
  # unclass() returns each element's position in the level table;
  # that position is the column, so the cell's index in the
  # column-major vector is n * (column - 1) + row
  x[n * (unclass(cl) - 1) + (1:n)] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}
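The index arithmetic in the function works because R stores a matrix column by column. The following is a minimal sketch of that single step on the test vector from above; the variable names (codes, idx) are illustrative and not part of nnet.

```r
# Same test vector as above.
cl <- factor(c("f", "f", "b", "b", "c", "c"))
n <- length(cl)

# Integer level codes (what unclass() exposes): b = 1, c = 2, f = 3.
codes <- as.integer(cl)

# Column-major position of the cell (row i, column codes[i]).
idx <- n * (codes - 1) + 1:n

x <- matrix(0, n, nlevels(cl), dimnames = list(NULL, levels(cl)))
x[idx] <- 1
x
```

Each row ends up with exactly one 1, in the column of its own level.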

Supplement: Setting up dummy variables in R

When building a regression model, if the independent variable X is continuous, the regression coefficient β is interpreted as the average change in the dependent variable Y for each one-unit change in X, holding the other independent variables constant. If X is a binary categorical variable, such as whether someone drinks alcohol (1 = yes, 0 = no), β is interpreted as the average difference in Y between X = 1 (drinkers) and X = 0 (non-drinkers), holding the other independent variables constant.

However, when X is a multi-category variable, such as occupation, education, blood type, or disease severity, a single regression coefficient cannot adequately describe how the categories differ from one another or how they affect the dependent variable.

In that case we usually convert the original multi-category variable into dummy variables. Each dummy variable represents only the difference between two levels (or groups of levels). Fitting the regression model then yields an estimated coefficient for each dummy variable, which makes the results easier to interpret and more useful in practice.

A dummy variable, also called an indicator variable, is, as the name suggests, an artificial variable, usually taking the value 0 or 1 to mark the different attributes of a variable. For an independent variable with n categories, one category is normally chosen as the reference, so n-1 dummy variables are generated.
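As a small illustration of the n-1 rule, the sketch below uses base R's model.matrix() on a made-up four-level factor (the name grp and its values are hypothetical). With R's default treatment contrasts, the first level acts as the reference.

```r
# Made-up factor with n = 4 levels.
grp <- factor(c("a", "b", "c", "d", "a", "c"))

# Design matrix: an intercept plus one 0/1 column per non-reference level.
mm <- model.matrix(~ grp)

nlevels(grp)    # 4 categories
ncol(mm) - 1    # 3 dummy columns, i.e. n - 1

# Rows belonging to the reference level "a" have all dummies equal to 0.
mm[grp == "a", ]
```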

When do you need dummy variables?

1. Unordered multi-category variables need to be converted into dummy variables before they enter the model.

For example, blood type is generally divided into four types: A, B, O, and AB. It is an unordered multi-category variable. When entering data, we often code the types as 1, 2, 3, and 4 to make them numeric.

Taken as numbers, the codes 1, 2, 3, and 4 imply an ordering from small to large. In fact there is no such ordering among the four blood types; they are on an equal footing and independent of one another. Putting the codes 1, 2, 3, and 4 directly into a regression model is therefore unreasonable, and the variable needs to be converted into dummy variables.
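The difference between the two codings can be seen in a small sketch. The outcome values y below are made up purely for illustration, and the variable names are hypothetical.

```r
# Blood type stored as numeric codes: 1 = A, 2 = B, 3 = O, 4 = AB.
blood_code <- c(1, 2, 3, 4, 1, 2, 3, 4)
y <- c(5.1, 6.3, 4.8, 7.0, 5.4, 6.0, 5.0, 6.7)  # made-up outcome

# Numeric coding: a single slope, which wrongly assumes A < B < O < AB.
fit_num <- lm(y ~ blood_code)

# Factor coding: lm() builds the dummy variables itself, one per non-reference type.
blood <- factor(blood_code, levels = 1:4, labels = c("A", "B", "O", "AB"))
fit_fac <- lm(y ~ blood)

names(coef(fit_num))
names(coef(fit_fac))
```

The first model has one coefficient for blood type; the second has three, each comparing a type with the reference type A.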

2. Ordered multi-category variables should be handled case by case when they enter the model.

For example, disease severity is generally divided into mild, moderate, and severe, which can be treated as an ordered multi-category variable. We often code it as 1, 2, 3 (equal spacing) or 1, 2, 4 (equal ratio), so that the increasing numbers reflect the ranking of severity.

Note, however, that coding the variable with equally spaced or equal-ratio numbers implicitly assumes that severity itself follows the same equal-spacing or equal-ratio pattern. In reality, because disease is clinically complex, the severity levels are not strictly equally spaced or in equal ratio, so this kind of coding may be unreasonable. In that case the variable can be converted into dummy variables instead.

3. Continuous variables can be converted into dummy variables when they are transformed.

For continuous variables, many people assume they can go straight into the regression model, but sometimes they need an appropriate transformation based on their clinical meaning. Take age: entered as a continuous variable, its coefficient is the effect on the dependent variable of each additional year of age. But the effect of a single year is often very weak and has little practical significance.

In that case we can discretize age into 10-year groups, such as 0-10, 11-20, 21-30, and 31-40, and code the groups 1, 2, 3, and 4. The regression coefficient is then interpreted as the effect on the dependent variable of each 10-year increase in age.

This coding rests on the premise that the relationship between age and the dependent variable is linear. That is not always so. For example, mortality from a disease may be high in the youngest and oldest groups and relatively low in young and middle-aged adults, giving a U-shaped relationship between age and death. Coding the age groups 1, 2, 3, and 4 is then unreasonable.

Therefore, when a continuous independent variable has been discretized and we cannot be sure how it relates to the dependent variable, we can consider converting it into dummy variables.
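The discretization described above can be sketched with base R's cut(). The ages are made up, and the group labels follow the 10-year bands in the text.

```r
# Made-up ages, for illustration only.
age <- c(5, 18, 27, 34, 8, 25)

# Discretize into 10-year bands; include.lowest = TRUE keeps age 0 in the first band.
age_group <- cut(age,
                 breaks = c(0, 10, 20, 30, 40),
                 labels = c("0-10", "11-20", "21-30", "31-40"),
                 include.lowest = TRUE)

# Option 1: codes 1, 2, 3, 4, which assume a linear effect across bands.
age_code <- as.integer(age_group)

# Option 2: dummy variables, which let each band have its own effect.
age_dummies <- model.matrix(~ age_group)

age_code
colnames(age_dummies)
```

The second option costs more parameters but does not force the U-shaped pattern onto a straight line.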

There is one more case. When BMI is divided by clinical diagnostic criteria into underweight, normal weight, overweight, obese, and so on, the cut-off points between categories are not equally spaced, so coding them 1, 2, 3, and so on does not reflect reality; here too, conversion into dummy variables can be considered.

Which category should be the reference when setting up dummy variables?

1. In general, choose a category with a specific meaning or one at the end of a natural ordering.

For example, marital status is divided into unmarried, married, divorced, widowed, and so on, and "unmarried" can be the reference. Education can be divided into primary school, middle school, university, graduate school, and so on, which have a natural order; using "primary school" as the reference makes the regression coefficients easier to interpret.

2. The clinically normal level can be the reference.

For example, BMI is divided by clinical diagnostic criteria into underweight, normal weight, overweight, obese, and so on. "Normal weight" can be chosen as the reference, so that every other category is compared with normal weight, which is more useful clinically.

3. The category the researchers care most about can also be the reference.

For example, blood type is divided into A, B, O, and AB. If the researchers are mainly interested in people with type O blood, type O can be the reference, and the analysis then shows how the effect of each other blood type on the outcome differs from that of type O.
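In R, the reference category of a factor is its first level, and base R's relevel() moves a chosen level to the front. The sketch below applies this to the blood type example with type O as the reference; the data are made up and the variable names are hypothetical.

```r
# Made-up blood types, for illustration only.
blood <- factor(c("A", "B", "O", "AB", "O", "A"))

# By default the levels are sorted, so "A" would be the reference.
levels(blood)

# Make type O the reference level instead.
blood_o <- relevel(blood, ref = "O")
levels(blood_o)

# The dummy columns now compare A, AB and B with O.
colnames(model.matrix(~ blood_o))
```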

Implementation in R

When modeling data that include categorical variables, R generally converts factors into dummy variables automatically. Some functions do not, however; the neuralnet function in the neuralnet package does no such preprocessing. If you pass the raw data straight in, you get the error "requires numeric/complex matrix/vector arguments".

In that case, apart from dropping these variables, the only option is to convert each factor into 0/1 dummy variables by hand. The functions generally used are model.matrix() and, from the nnet package, class.ind().

The following uses the UCI German credit data as an example.

First, download the dataset from the UCI website and take a quick look at it with the str function.

download.file("/ml/machine-learning-databases/statlog/german/", "./")
data <- read.table("./")
str(data)

The data contain 21 variables. V21 is the target variable, and V1-V20 are either integers or factors. Below we use the categorical variable V1 (4 levels) and three numerical variables, V2, V5, and V8, as explanatory variables.

## 'data.frame':    1000 obs. of  21 variables:
##  $ V1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ V2 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ V3 : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ V4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ V5 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ V6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ V7 : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ V8 : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ V9 : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ V10: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ V11: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ V12: Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ V13: int  67 22 49 45 53 35 53 35 61 28 ...
##  $ V14: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ V15: Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ V16: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ V17: Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ V18: int  1 1 2 2 2 2 1 1 1 1 ...
##  $ V19: Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ V20: Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ V21: int  1 2 1 1 2 1 1 1 1 2 ...

First, load the neuralnet package and model with the numerical variables only; this runs without error.

library("neuralnet")
NNModelAllNum <- neuralnet(V21 ~ V2 + V5 + V8, data)
NNModelAllNum

When we add V1 to the explanatory variables, the following error occurs:

NNModel <- neuralnet(V21 ~ V1 + V2 + V5 + V8, data)
## Error: requires numeric/complex matrix/vector arguments

In that case, you can use model.matrix() to convert V1 into three dummy variables: V1A12, V1A13, and V1A14.

>dummyV1 <- model.matrix(~V1, data)
>head(cbind(dummyV1, data$V1))
   (Intercept) V1A12 V1A13 V1A14  
 1           1     0     0     0 1
 2           1     1     0     0 2
 3           1     0     0     1 4
 4           1     0     0     0 1
 5           1     0     0     0 1
 6           1     0     0     1 4

Because model.matrix() leaves numerical variables unchanged (and a two-level factor simply becomes a single 0/1 column), you can apply it to all the variables at once to generate a new dataset, modelData, and model with that dataset.

>modelData <- model.matrix(~V1 + V2 + V5 + V8 + V21, data)
>head(modelData)
   (Intercept) V1A12 V1A13 V1A14 V2   V5 V8 V21
 1           1     0     0     0  6 1169  4   1
 2           1     1     0     0 48 5951  2   2
 3           1     0     0     1 12 2096  2   1
 4           1     0     0     0 42 7882  2   1
 5           1     0     0     0 24 4870  3   2
 6           1     0     0     1 36 9055  2   1
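To hand such a matrix to a modeling function, the "(Intercept)" column usually has to be dropped and the formula has to name the new dummy columns. The sketch below shows one way to do that with base R's reformulate(). The data frame toy is a made-up stand-in for the German credit data, and the final neuralnet call is left as a comment because it is an assumption about how the result would be used.

```r
# Made-up stand-in for the German credit data (illustration only).
toy <- data.frame(
  V1  = factor(c("A11", "A12", "A14", "A11", "A13", "A14")),
  V2  = c(6, 48, 12, 42, 24, 36),
  V21 = c(1, 2, 1, 1, 2, 1)
)

# Expand the factor into dummies, then drop the constant intercept column.
mm <- model.matrix(~ V1 + V2 + V21, toy)
modelData <- as.data.frame(mm[, colnames(mm) != "(Intercept)"])

# Build the formula from the column names: V21 on the left, the rest on the right.
f <- reformulate(setdiff(names(modelData), "V21"), response = "V21")
f

# Assumed usage (not run here):
# NNModel <- neuralnet::neuralnet(f, modelData)
```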

Another way is to use the class.ind function from the nnet package.

>library("nnet")
>dummyV12 <- class.ind(data$V1)
>head(dummyV12)
      A11 A12 A13 A14
 [1,]   1   0   0   0
 [2,]   0   1   0   0
 [3,]   0   0   0   1
 [4,]   1   0   0   0
 [5,]   1   0   0   0
 [6,]   0   0   0   1

As you can see, the result differs slightly from that of model.matrix(): four dummy variables are generated, one per level. Note that to avoid multicollinearity, for a categorical variable with n levels you should keep only n-1 of the dummy variables (any n-1 will do).
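A minimal sketch of that last step, assuming the nnet package is installed: drop one column of the class.ind() result so that the dropped level serves as the reference. The vector v is a made-up stand-in for data$V1.

```r
library(nnet)

# Made-up stand-in for data$V1, with the same four levels.
v <- factor(c("A11", "A12", "A14", "A11", "A11", "A14"),
            levels = c("A11", "A12", "A13", "A14"))

full <- class.ind(v)      # one column per level: 4 columns
reduced <- full[, -1]     # drop "A11", which becomes the reference

colnames(reduced)

# The remaining columns match the dummies that model.matrix() produces.
all(reduced == model.matrix(~ v)[, -1])
```

Dropping a different column would simply change which level the others are compared against.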
