The data processing technique I will introduce to you now is length to width, which is equivalent to transposition in Excel. However, the length to width and data merging implemented in R language are naturally much more powerful than Excel.
Here are 4 functions, among which melt() and dcast() come from the reshape2 package, gather() and spread() come from the tidyr package
1. Width to length—melt(), gather()
mydata<-( name=c("store1","store2","store3","store4"), address=c("Putuo District","Huangpu District","Xuhui District","Pudong New Area"), sale2014=c(3000,2500,2100,1000), sale2015=c(3020,2800,3900,2000), sale2016=c(5150,3600,2700,2500), sale2017=c(4450,4100,4000,3200) ) #Wide to length—meltmydata1<-melt( mydata, =c("address","name"),#The main field to be retained = "Year",#Converted classification field name (dimension) = "Sale" #Converted metric value name)
Output result
> mydata1<-melt( + mydata, + =c("address","name"),#The main field to be retained+ = "Year",#Converted classification field name (dimension)+ = "Sale" #Converted metric value name+ ) > mydata1 address name Year Sale 1 Putuo District store1 sale2014 3000 2 Huangpu District store2 sale2014 2500 3 Xuhui District store3 sale2014 2100 4 Pudong New Area store4 sale2014 1000 5 Putuo District store1 sale2015 3020 6 Huangpu District store2 sale2015 2800 7 Xuhui District store3 sale2015 3900 8 Pudong New Area store4 sale2015 2000 9 Putuo District store1 sale2016 5150 10 Huangpu District store2 sale2016 3600 11 Xuhui District store3 sale2016 2700 12 Pudong New Area store4 sale2016 2500 13 Putuo District store1 sale2017 4450 14 Huangpu District store2 sale2017 4100 15 Xuhui District store3 sale2017 4000 16 Pudong New Area store4 sale2017 3200
Let's see how to use the gather() function
> #Wide to length—gather> mydata1<-tidyr::gather( + data=mydata, + key="Year", + value="sale", + sale2014:sale2017 + ) > mydata1 name address Year sale 1 store1 Putuo District sale2014 3000 2 store2 Huangpu District sale2014 2500 3 store3 Xuhui District sale2014 2100 4 store4 Pudong New Area sale2014 1000 5 store1 Putuo District sale2015 3020 6 store2 Huangpu District sale2015 2800 7 store3 Xuhui District sale2015 3900 8 store4 Pudong New Area sale2015 2000 9 store1 Putuo District sale2016 5150 10 store2 Huangpu District sale2016 3600 11 store3 Xuhui District sale2016 2700 12 store4 Pudong New Area sale2016 2500 13 store1 Putuo District sale2017 4450 14 store2 Huangpu District sale2017 4100 15 store3 Xuhui District sale2017 4000 16 store4 Pudong New Area sale2017 3200
Unlike melt() function, the gather() function needs to specify the key field key and the corresponding value of the key field, but the gather() function is easier to understand.
2. Length to width—dcast() and spread()
Let’s use the data1 dataset above, let’s take a look at the dcast() function first
#Length to width—dcastdcast( data=mydata1, name+address~Year #The left side is the field to be retained, the right side is the categorical variable to be divided, the number of columns is equal to the expression #Number of categories of categorical variables on the right)
> #Length to width—dcast> dcast( + data=mydata1, + name+address~Year + #The left side is the field to be retained, the right side is the categorical variable to be divided, the number of columns is equal to the expression+ #Number of categories of categorical variables on the right+ ) Using sale as value column: use to override. name address sale2014 sale2015 sale2016 sale2017 1 store1 Putuo District 3000 3020 5150 4450 2 store2 Huangpu District 2500 2800 3600 4100 3 store3 Xuhui District 2100 3900 2700 4000 4 store4 Pudong New Area 1000 2000 2500 3200
The rules of using dcast() function need to be pondered before you can understand it. Please take a look at the comments section and then take a look at spread()
#Length to width—spreadtidyr::spread( data=mydata1, key=Year, value=sale )
> #Length to width—spread> tidyr::spread( + data=mydata1, + key=Year, + value=sale + ) name address sale2014 sale2015 sale2016 sale2017 1 store1 Putuo District 3000 3020 5150 4450 2 store2 Huangpu District 2500 2800 3600 4100 3 store3 Xuhui District 2100 3900 2700 4000 4 store4 Pudong New Area 1000 2000 2500 3200
To call tidyr::spread directly, you need to specify the key field key and the corresponding value.
But from an understanding point of view, I personally prefer the functions of the tidyr package. The use is very clear. You can choose according to the actual situation. Okay, today's sharing ends, see you next time!
Supplement: Use R language to convert data ranks and queues
Software used: R language, mysql
System of use: ubuntu16.04
Comparison before and after use:
Before use
x1 | x2 | y |
---|---|---|
a | A | 1 |
b | B | 2 |
c | C | 3 |
a | A | 4 |
After use (the values inside can undergo multiple transformations, mainly changing x1 and x2 to their crosstab)
A | B | C | |
---|---|---|---|
a | 5(1+4) | NA | NA |
b | NA | 2 | NA |
c | NA | NA | 3 |
Language and mysql database connection
Download RMysql (because I'm using ubuntu, so I use RMysql):
("RMySQL")
Load the required program package:
library(RMySQL)
Import mysql data into R:
con<-dbConnect(MySQL(),dbname="database_name",user="user_name",password="password")
After importing, the resulting data type is Dataframe()
For the RMySQL package, there are several more important statements:
summary(con) : used to view the mysql information corresponding to con
dbListTables(con) : Used to view all tables in the con
dbRemoveTable(con,”table_name”)
dbReadTable(con,”table_name”)
dbWriteTable(con,"table_name",value) where the value type needs to be a data frame
2. Implement data transformation in R
We use the above statement to read the table that needs to be transformed
data=dbReadTable(con,"table_name")
Then use an important function: taply()
Note that I am using the table I used in the example I gave below
data2=tapply(data\$y,list(data\$x1,data$x2),sum)
For taply(x,list(a,b),fac)
a and b are the row list headers that form the crosstab
And x is the value, of course we can not use the value in data as x
fac is a function operation on x. For example, using sum, which is to add the same y of x1 and x2
3. Import the data back to mysql
We have transformed the data into the desired table form above. However, after using taply, the data2 class will become a matrix. If you want to know the data type, you can view it using class() in R. The RMySQL package we use requires a data frame to write data in r to mysql, so we need to convert the matrix into a data frame. So how to do it? It's very simple.
data3=(data2)
In this way, our data3 is a data frame, using statements
dbWriteTable(con,test,data3)
This writes a table in mysql, called test, and the data value is data3.
The above is personal experience. I hope you can give you a reference and I hope you can support me more. If there are any mistakes or no complete considerations, I would like to give you advice.