Since I have been running simulations for papers recently, I have tried two parallel methods: parallel and snowfall. Each has its own advantages and disadvantages, but I still recommend snowfall: it is relatively stable overall and less likely to throw errors due to insufficient memory or too many parallel threads.
Parallel computing
Parallel computing: simply put, it means using multiple computing resources simultaneously to solve one computational problem; it is an effective way to improve the computing speed and processing capacity of a computer system. (Reference: Introduction to parallel computing)
- A problem is broken down into a series of discrete parts that can be executed concurrently;
- Each part can be further broken down into a series of discrete instructions;
- Instructions from each part can be executed simultaneously on different processors;
- An overall control/coordination mechanism is needed to schedule the execution of the different parts.
In our usual simulations on a computer or server, the computing task is distributed across multiple cores and processed at the same time.
Where can parallel be used during simulation?
Parallel operation is generally suitable for repeated computations, such as repeatedly generating data from the same distribution and running the simulation on each data set at the same time; parallelism can be used here. Likewise, when we need permutations to compute p-values and similar quantities, we can also parallelize, because the operation is completed by simple repetition (a small serial sketch of both cases follows below).
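To make "simple repetition" concrete, here is a minimal serial sketch (the data, sample sizes, and function names are my own illustration, not from any actual study): both the repeated data generation and the permutation p-value are independent repetitions, exactly the structure that the parallel tools below can take over.
sim_once <- function(i) mean(rnorm(100))   # one simulation run under a fixed distribution
sim_results <- sapply(1:1000, sim_once)    # every call is independent of the others

x <- rnorm(30); y <- rnorm(30, mean = 0.5) # toy two-group data
obs <- mean(x) - mean(y)                   # observed statistic
perm_once <- function(i) {
  z <- sample(c(x, y))                     # permute the pooled sample
  mean(z[1:30]) - mean(z[31:60])
}
perm_stats <- sapply(1:2000, perm_once)    # again, plain independent repetitions
p_value <- mean(abs(perm_stats) >= abs(obs))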
By contrast, algorithms based on iteration or recursion are difficult to parallelize; these are called serial. Because each step requires information from the previous step, you must finish computing one step before you can compute the next.
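For contrast, a contrived sketch of a serial computation (an AR(1)-style recursion I made up for illustration): step i needs the value from step i - 1, so the iterations cannot simply be split across cores.
x <- numeric(10)
x[1] <- 1
for (i in 2:10) {
  x[i] <- 0.5 * x[i - 1] + rnorm(1)  # each step depends on the previous one
}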
When running actual simulations to compare the advantages and disadvantages of several methods, the experiment usually needs to be repeated hundreds or thousands of times, and this outer loop can generally be parallelized; it is also the easiest place to write the parallel code. But there is a drawback: the server may run for a long time without producing any results, and you have no idea how far it has gotten. There are some ways to monitor progress (e.g., the sfCat() function in snowfall, although its output is rather messy and sometimes nothing is printed at all; its specific usage will be introduced later), but it may still take a long time to get any results, and if there is a small flaw in the parallel dimensions or the code, the entire result is lost.
Therefore, if you can write the parallelism into each specific algorithm, try to do so (parallelize the permutation loop if you need permutations; if you compute statistics and similar quantities many times, directly replace that for loop), which makes later runs much more convenient. (The disadvantage is that it may use too much memory, which in turn causes parallel errors.) A hypothetical skeleton of this layout follows below.
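A hypothetical skeleton of that recommendation (placeholder data, counts, and statistics of my own; parSapply() and makeCluster() come from the parallel package introduced below): the permutation loop inside each repetition runs in parallel, while the outer loop over repetitions stays serial, so progress can be printed after every repetition.
library(parallel)
cl <- makeCluster(detectCores() - 1)
results <- numeric(500)
for (b in 1:500) {
  g1 <- rnorm(50); g2 <- rnorm(50, mean = 0.3)    # one simulated data set
  obs <- mean(g1) - mean(g2)
  pooled <- c(g1, g2)
  perm <- parSapply(cl, 1:1000, function(i, z) {  # parallel permutation loop
    z <- sample(z)
    mean(z[1:50]) - mean(z[51:100])
  }, z = pooled)
  results[b] <- mean(abs(perm) >= abs(obs))       # permutation p-value
  cat("finished repetition", b, "\n")             # progress stays visible
}
stopCluster(cl)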
How do we check in R whether parallelism is available?
Just use the following command (from the base parallel package) to see how many threads the computer can use:
detectCores()
In theory, if this value is ≥ 2, the computer can run in parallel (nowadays computers basically have at least 4). Of course, we usually do not use all the threads for parallelism; otherwise the computer is likely to freeze.
Back to the topic: below are two commonly used ways to run parallel code in R (the examples all revolve around the apply family of functions).
parallel (simple)
The first is the parallel package. Its biggest advantage is convenience: simply change our original apply() to parApply(), lapply() to parLapply(), our commonly used sapply() to parSapply(), and so on, and then add the corresponding initialization and cleanup statements at the beginning and end.
Here is an example (reference: How-to go parallel in R – basics + tips).
First we use lapply() to perform the following vectorized operation:
lapply(1:3, function(x) c(x, x ^ 2, x ^ 3))
The output result is:
[[1]]
[1] 1 1 1

[[2]]
[1] 2 4 8

[[3]]
[1] 3 9 27
Now we modify this into the parallel version. First, initialize the parallelism:
library(parallel)                # load the parallel package
no_cores <- detectCores() - 1    # number of threads to use, leaving one free
cl <- makeCluster(no_cores)      # initialize the cluster
Then modify the original lapply() call:
parLapply(cl, 1:3, function(x) c(x, x ^ 2, x ^ 3))
Note: compared with the ordinary lapply(), the only change is the added cl argument.
The output result is:
[[1]]
[1] 1 1 1

[[2]]
[1] 2 4 8

[[3]]
[1] 3 9 27
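The same one-argument change works for the other apply functions too; for example, the parSapply() mentioned above returns a simplified matrix (a small sketch reusing the same cluster cl):
parSapply(cl, 1:3, function(x) c(x, x ^ 2, x ^ 3))
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    4    9
# [3,]    1    8   27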
We are not done yet. We initialized the parallelism earlier; now we need to end it, releasing the threads and memory we used back to the system, with the following statement:
stopCluster(cl)
At this point, a simple parallelism is completed.
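Putting the three steps together, here is a minimal reusable template (my own sketch, not from the original post): wrapping stopCluster() in on.exit() guarantees the threads are returned to the system even if the parallel code fails halfway.
library(parallel)

run_parallel <- function(n_rep = 1000) {
  cl <- makeCluster(detectCores() - 1)
  on.exit(stopCluster(cl))   # always release the threads, even on error
  parSapply(cl, seq_len(n_rep), function(i) mean(rnorm(100)))
}

res <- run_parallel()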
But things are far from that simple. When we handle very complex parallel tasks and repeatedly use the parallel methods in the parallel package, we cannot push the thread count to the maximum; sometimes even half of it is too much, and an error like the following is reported:
Error in unserialize(node$con) : error reading from connection
The cause of this situation is complicated, but it boils down to a mismatch between the number of cores called and the computer's memory: if your data set is large and you call many cores, and your computer's memory cannot keep up, connections will fail and the machine may even hang. In short, the memory is exhausted.
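One way to see why the memory "bursts" (a sketch with illustrative sizes, not a measurement): with the default PSOCK cluster each worker is a separate R process, so every large object you export is copied once per thread.
library(parallel)
big_data <- matrix(rnorm(2e7), ncol = 100)  # roughly 160 MB in the main session
cl <- makeCluster(8)
clusterExport(cl, "big_data")               # ~160 MB copied to EACH of the 8 workers
res <- parSapply(cl, 1:100, function(j) mean(big_data[, j]))
stopCluster(cl)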
Solutions (they cannot be said to solve it completely, only to alleviate it effectively):
Use fewer threads for parallelism;
If your computer has very little memory, an easy way to determine the maximum number of threads to use is:
max cores = total available memory / memory used per thread;
Run a large parallel job in smaller parts; in the code, use rm() more often to delete useless variables and gc() to reclaim memory (see the sketch after this list);
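A sketch of the "smaller parts" idea (the chunk size and the toy task are arbitrary choices of mine): start a modest cluster for each chunk, collect the partial results, and clean up between chunks with rm() and gc().
library(parallel)

n_total    <- 10000
chunk_size <- 1000
all_res    <- vector("list", n_total / chunk_size)

for (k in seq_along(all_res)) {
  cl  <- makeCluster(4)                    # a modest number of threads per chunk
  idx <- ((k - 1) * chunk_size + 1):(k * chunk_size)
  all_res[[k]] <- parSapply(cl, idx, function(i) mean(rnorm(100)))
  stopCluster(cl)
  rm(cl); gc()                             # delete the cluster object, reclaim memory
}
res <- unlist(all_res)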
Later we will introduce another parallel method, snowfall, which is relatively more stable (although its code is more complicated to write); a detailed introduction is given in the next post: Two commonly used parallel methods in R.
The above is a detailed explanation of parallel methods commonly used in the R language. For more information about parallel methods in R, please follow my other related articles!