SoFunction
Updated on 2025-03-10

Acquiring data in R: a detailed walkthrough with examples

Introduction

R ships with a large number of built-in datasets that can be used for analysis and practice, and it can also generate data that follow a specific distribution. In actual work, however, data analysts usually face external data from multiple sources, that is, data files with various extensions such as .txt, .csv, .xlsx, and .xls. Different extensions indicate different file formats, which often causes trouble for analysts.

R provides a wide range of data import tools.

1. Get built-in datasets

The built-in datasets in R are distributed across packages. Among them, the base package datasets contains only datasets and no functions. This package provides nearly 100 datasets covering fields such as medicine, nature, and sociology.

You can use the following command to view:

data(package = "datasets")
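The listing returned by the command above can also be captured and searched programmatically. The following is a minimal sketch; the object names info and catalog are illustrative, and it relies on the results component of the object that data( ) returns.

```r
# Capture the listing instead of only displaying it
info <- data(package = "datasets")

# info$results is a character matrix; keep the dataset name and its title
catalog <- as.data.frame(info$results[, c("Item", "Title")])

nrow(catalog)   # number of listed datasets in this R installation
head(catalog)   # first few entries

# Search the titles for a keyword, here "flower" (case-insensitive)
catalog[grepl("flower", catalog$Title, ignore.case = TRUE), ]
```

The exact number of entries depends on the installed R version, so no count is shown here.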

If you want to load a dataset, you can use the function data( ). Run the following command and R will load the dataset iris into the workspace.

data(iris)
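Once a dataset is loaded, it helps to check its size and structure before analysis. A short sketch using only base functions:

```r
data(iris)

dim(iris)                    # 150 rows, 5 columns
str(iris)                    # variable names and types
head(iris, n = 3)            # first 3 rows
summary(iris$Sepal.Length)   # five-number summary plus the mean
```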

In addition to the datasets package, many other packages in R also come with datasets. If a package is not one of the base packages that are loaded automatically when R starts, we need to install and load it before we can use its data. The following takes the dataset bacteria in the MASS package as an example to illustrate the process:

library(MASS)
data(bacteria)
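If you would rather not attach the whole package, the package argument of data( ) loads a single dataset. The sketch below assumes MASS is installed (it is usually distributed with R) and skips the step otherwise.

```r
# Load one dataset from MASS without calling library(MASS)
if (requireNamespace("MASS", quietly = TRUE)) {
  data(bacteria, package = "MASS")
  str(bacteria)    # inspect the variables
} else {
  message("MASS is not installed; run install.packages(\"MASS\") first")
}
```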

2. Simulate data with specific distributions

R provides a series of functions that can be used for numerical simulation. The names of these functions begin with r; commonly used ones include rnorm( ), runif( ), rbinom( ), and rpois( ). For example:

# The subsequent visualization section will introduce the histogram in detail
r1 <- rnorm(n = 100, mean = 0, sd = 1)
# head(r1) # Look at the first 6 values
hist(r1)
r2 <- runif(n = 10000, min = 0, max = 100)
hist(r2)
r3 <- rbinom(n = 80, size = 100, prob = 0.1)
hist(r3)
r4 <- rpois(n = 50, lambda = 1)
hist(r4)
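Because these functions draw random numbers, each run gives different values. A common way to make a simulation reproducible is to call set.seed( ) first. The sketch below uses an arbitrary seed (123) and larger samples than above, and compares sample statistics with the theoretical values.

```r
set.seed(123)   # arbitrary seed, chosen only for illustration

x <- rnorm(n = 1000, mean = 0, sd = 1)
y <- rbinom(n = 1000, size = 100, prob = 0.1)

# Sample statistics should be close to the theoretical values
c(sample_mean = mean(x), sample_sd = sd(x))          # theory: 0 and 1
c(sample_mean = mean(y), theoretical = 100 * 0.1)    # theory: size * prob

# Resetting the same seed reproduces exactly the same draws
set.seed(123)
x2 <- rnorm(n = 1000, mean = 0, sd = 1)
identical(x, x2)   # TRUE
```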

3. Get data in other formats

3.1 txt and csv formats

If the data source is a plain-text (ASCII) file created with Windows Notepad or another plain text editor, we can use the function read.table( ) to read the data; it returns a data frame.

For example, suppose the data file created from the data frame patients is stored in the current working directory. We can then use the following commands to read the data:

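# getwd() # Get the current working directory
# Create the data file temporarily
# (the file name "patients.txt" and the object names pain.f and patients.data are illustrative)
ID <- 1:5
sex <- c("male", "female", "male", "female", "male")
age <- c(25, 34, 38, 28, 52)
pain <- c(1, 3, 2, 2, 3)
pain.f <- factor(pain, levels = 1:3, labels = c("mild", "medium", "severe"))
patients <- data.frame(ID, sex, age, pain.f)
write.table(patients, "patients.txt", row.names = FALSE)
patients.data <- read.table("patients.txt", header = TRUE)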

Spreadsheet and database applications often generate delimited text files. Among them, .csv files use commas as separators (Comma Separated Values). The function read.csv( ) is a variant of read.table( ) dedicated to reading .csv files.

The default parameter values of read.table( ) and read.csv( ) differ.

In read.table( ), the default value of the parameter header is FALSE, that is, the first line of the file is assumed to contain data rather than variable names.

In read.csv( ), the default value of the parameter header is TRUE. Therefore, before reading the data, it is recommended to open the original file and look at it, and then set the appropriate parameters so that the data are read correctly.

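# The file name "patients.csv" and the object name patients.data are illustrative
write.csv(patients, "patients.csv", row.names = FALSE)
patients.data <- read.csv("patients.csv")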
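The effect of the different header defaults is easy to see by reading the same file with both functions. The sketch below writes a tiny comma-separated file to a temporary location; the file contents and object names are made up for illustration.

```r
# A small comma-separated file with one header line and two data lines
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,age", "1,25", "2,34"), tmp)

# read.table(): the header line is treated as data here
no_header <- read.table(tmp, sep = ",")
names(no_header)    # "V1" "V2"
nrow(no_header)     # 3

# read.csv(): the first line is used as variable names
with_header <- read.csv(tmp)
names(with_header)  # "ID" "age"
nrow(with_header)   # 2

unlink(tmp)         # remove the temporary file
```

In the first case the header line becomes a data row, so both columns are read as character rather than numeric. This is one reason to check the raw file before choosing parameters.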

3.2 xls or xlsx format

There are many ways to read spreadsheet data. The easiest is to save the data file as a comma-separated (.csv) file in Excel and then read it into R using the method for .csv files described above. You can also use third-party packages (such as the openxlsx, readxl, and gdata packages) to read data files in xlsx or xls format directly.

Take the openxlsx package as an example:

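library(openxlsx)
# The file name "patients.xlsx" and the object name patients.data are illustrative
write.xlsx(patients, "patients.xlsx")
patients.data <- read.xlsx("patients.xlsx", sheet = 1)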
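The readxl package mentioned above can read the same kind of file. The following is a hedged sketch: it assumes both openxlsx and readxl are installed and does nothing otherwise, and the data frame and object names are made up for illustration.

```r
# Small illustrative data frame written to a temporary .xlsx file
demo <- data.frame(ID = 1:3, age = c(25, 34, 38))
tmp <- tempfile(fileext = ".xlsx")

have_pkgs <- requireNamespace("openxlsx", quietly = TRUE) &&
  requireNamespace("readxl", quietly = TRUE)

xlsx_back <- NULL
if (have_pkgs) {
  openxlsx::write.xlsx(demo, tmp)
  # read_excel() returns a tibble; convert to a plain data frame
  xlsx_back <- as.data.frame(readxl::read_excel(tmp, sheet = 1))
  print(xlsx_back)
  unlink(tmp)
}
```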

3.3 Import data from other statistical software

Sometimes we need to read data generated by other statistical software, such as SPSS, SAS, Stata, or Minitab. One way is to export the data from that software as a text file and then read it into R with the function read.table( ) or read.csv( ). Another is to use an extension package such as foreign, whose main purpose is to read and write data from other statistical software.

The following is an example of importing SPSS data files.

Assuming the data file is stored in the current working directory, we can use the following commands to read the dataset into R:

# To avoid shipping attachments, download the file directly into the working directory
URL <- "/qlhatmok4/"
download.file(URL, destfile = "./", method = "curl")
library(foreign)
# The parameter `to.data.frame` in function `read.spss( )` defaults to FALSE.
# If it is not set to TRUE, the data are returned as a list.
spss.data <- read.spss("", to.data.frame = TRUE)  # spss.data is an illustrative object name

The process of importing SAS, Stata, and other data files with the foreign package is similar to the above. For details, see the package documentation.

4. Data entry

You can enter data directly in R, but if the dataset is large (more than 10 columns or more than 30 rows), entering it in R is not the best choice. For small-scale data, we can use spreadsheet software such as Excel instead.
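For a handful of records, one convenient way to enter data directly in R is to type it as text and let read.table( ) parse it through its text argument. A minimal sketch with made-up values:

```r
# Type a small table inline; columns are separated by whitespace
small <- read.table(header = TRUE, text = "
ID sex    age
1  male   25
2  female 34
3  male   38
")

str(small)   # 3 observations of 3 variables
```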

However, if the amount of data is large, manual entry in spreadsheet software is also error-prone. In that case, software designed specifically for data entry, such as the free program EpiData, is more suitable. It makes it convenient to set constraints for data entry, such as range checking and line wrapping, and it also lets you add labels to each variable and to variable values.

The function read.epiinfo( ) in the foreign package can read the .rec file generated by EpiData directly. However, it is recommended to first export the entered data from EpiData as a Stata data file, and then read it in R with the function read.dta( ). The advantage is that variable properties preset in EpiData, such as variable labels and descriptions, are retained.
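The reading step can be sketched without EpiData by creating a Stata file with write.dta( ) from the same package and reading it back. This is only an illustration under that assumption: the data frame, the temporary file, and the object names are made up, and a file exported from EpiData would take the place of the one written here.

```r
library(foreign)

# Illustrative data; a real file would come from EpiData's Stata export
demo <- data.frame(
  ID   = 1:3,
  age  = c(25, 34, 38),
  pain = factor(c("mild", "severe", "medium"),
                levels = c("mild", "medium", "severe"))
)

tmp <- tempfile(fileext = ".dta")
write.dta(demo, tmp)           # write a Stata-format file

stata_back <- read.dta(tmp)    # read it back into a data frame
str(stata_back)

# Extra information stored in the file is attached as attributes
names(attributes(stata_back))

unlink(tmp)
```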
