
Introduction to the Python efficient computing library Joblib

1. What is the Joblib library?

Joblib is an open-source library for efficient computing in Python. It provides tools for memory mapping and parallel computing that can greatly improve the efficiency of scientific computing and data analysis, and it is especially suited to tasks that involve repeated computation or large-scale data processing.

Commonly used key functions of the Joblib library include efficient object serialization, caching of function return values, and parallel computing, all of which can streamline data processing. Installing the Joblib library in Python is also very simple; just run the following pip command:

pip install joblib

After the installation is complete, execute the following code. If the corresponding version number is printed, the installation was successful.

import joblib
print(joblib.__version__)

2. Core functions introduction and demonstration

The main functions of the joblib library fall into the following three areas:

  • Efficient serialization and deserialization: with the exception of a few special Python objects, the Joblib library can efficiently serialize numerical objects and save them locally; it is often used to save and load data objects and model objects;
  • Fast disk caching: provides efficient disk caching and lazy loading; the return value of a function can be cached to disk to avoid repeated computation;
  • Parallel computing: compute tasks can easily be distributed across multiple processor cores.

2.1 Efficient serialization and deserialization of objects

Similar to the pickle library, the Joblib library provides dump and load functions that can efficiently save large data objects (such as large arrays or machine learning models) to local files and load them back. Joblib is specifically optimized for numpy arrays, using a special serialization format that is more efficient than general-purpose serialization.

The following example compares the efficiency of the pickle library and the Joblib library when saving and loading a large array. First, a (10000, 10000) array is generated; each library then saves and loads the data 5 times. The average processing times are given in the comments of the code below:

import numpy as np
import pickle, joblib, time

# Generate a large numpy array object, e.g. a 10000 x 10000 array
large_array = np.random.rand(10000, 10000)

# Repeat each operation 5 times
n = 5

# pickle save: average processing time 2.54 s
for i in range(n):
    with open(f'pickle_data_{i}.pkl', 'wb') as f:
        pickle.dump(large_array, f)

# pickle load: average processing time 0.72 s
for i in range(n):
    with open(f'pickle_data_{i}.pkl', 'rb') as f:
        load_large_array = pickle.load(f)

# joblib save: average processing time 2.16 s
for i in range(n):
    joblib.dump(large_array, f'joblib_data_{i}.joblib')

# joblib load: average processing time 0.04 s
for i in range(n):
    load2_large_array = joblib.load(f'joblib_data_{i}.joblib')

Compared with loading .pkl files through the pickle library, loading .joblib files through the Joblib library is far faster on average, and saving files is also somewhat faster. In addition, the Joblib interface is simpler to use, so it often replaces the pickle library for tasks involving large amounts of data.
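One more capability worth mentioning here, although it is not part of the timing comparison above: joblib.dump accepts a compress argument that trades extra CPU time for smaller files. A minimal sketch, reusing large_array from the example above:

import joblib

# compress takes an integer from 0 (no compression) to 9 (maximum);
# higher values produce smaller files but slower saves
joblib.dump(large_array, 'joblib_data_compressed.joblib', compress=3)
restored_array = joblib.load('joblib_data_compressed.joblib')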

This save-and-reload pattern is commonly used to distribute a trained model or a computed data set to other users, and also as a way to deep-copy large-scale data (compared with a direct deep copy, saving and then loading back is often faster).
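A minimal sketch of that deep-copy trick (reusing large_array from above; the file name is only illustrative, and whether the round trip actually beats copy.deepcopy depends on the object):

import copy
import joblib

# Direct deep copy
copy1 = copy.deepcopy(large_array)

# "Deep copy" by round-tripping through disk
joblib.dump(large_array, 'tmp_copy.joblib')
copy2 = joblib.load('tmp_copy.joblib')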

2.2 Fast disk caching

Another core function of the Joblib library is quickly caching a function's computed return values to disk (via its Memory mechanism). When the function is called again and its input parameters have not changed, Joblib loads the result directly from the cache instead of recomputing it.

As the following example shows, we define a cache directory, create a Memory object, and add the corresponding decorator to the function. The first time the function runs, its result is cached to disk; when it is called again with the same input parameters, the result is read back from disk, avoiding repeated computation. A natural question is why not simply keep the function's results in a hash table, with the input parameters as keys and the computed results as values. That is certainly feasible, but it can consume a great deal of memory. Joblib instead caches to disk the results that would otherwise occupy memory, and both caching and retrieval are very fast.

from joblib import Memory
import time

cachedir = './my_cache'  # Define the cache directory
memory = Memory(cachedir, verbose=0)

@memory.cache
def expensive_computation(a, b):
    print("Computing expensive_computation...")
    sum_ = 0
    for i in range(1000000):
        sum_ += a * b / 10 + a / b
    return sum_

# The first call computes and caches the result
result = expensive_computation(20, 3)
# 0.0967 s

# The second call loads the result directly from the cache
result = expensive_computation(20, 3)
# 0.000997 s

The decorator in the above code can be understood as passing the function expensive_computation as a parameter into the memory.cache() method; the decorator syntax is equivalent to writing expensive_computation = memory.cache(expensive_computation).
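The same mechanism also works without decorator syntax, which is handy for caching functions you did not write yourself. A small sketch (cached_sqrt is a name chosen here just for illustration):

import numpy as np
from joblib import Memory

memory = Memory('./my_cache', verbose=0)

# Wrap an existing function; the wrapped version caches its results to disk
cached_sqrt = memory.cache(np.sqrt)
a = cached_sqrt(np.arange(1000000))  # computed and cached
b = cached_sqrt(np.arange(1000000))  # loaded from the cache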

Obviously, for tasks with a lot of repeated computation, this library can greatly improve processing efficiency. It is worth noting that the function defined above contains a print statement print(...). When the function is executed for the first time, the print statement runs; on repeated calls with the same arguments, the earlier result is fetched directly from the cache without going through the computation logic in between, so the statement is not printed.
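If a recomputation ever needs to be forced, the Memory object can empty its cache directory. A minimal sketch, using the memory object defined above:

# Remove everything cached under ./my_cache; the next call to
# expensive_computation will recompute (and print) again
memory.clear(warn=False)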

2.3 Parallel computing

The most central feature of Joblib is probably its high-level (easy to use) parallelization tools, which let us easily distribute compute tasks across multiple CPU cores.

As shown below, when we have multiple independent tasks to execute, we can use Joblib's Parallel and delayed functions to process these tasks in parallel and save time.

from joblib import Parallel, delayed
import numpy as np

def process(i):
    data = np.random.rand(1000, 1000)  # simulate a compute-heavy task

# Ordinary sequential loop: 5.798 s
for i in range(1000):
    process(i)

# Parallel computing with Joblib: 3.237 s
Parallel(n_jobs=4)(delayed(process)(i) for i in range(1000))

In the code above, n_jobs sets the number of parallel workers; with n_jobs=-1, all available CPU cores are used. delayed() takes the task (the function) as input, and the following (i) passes the argument to that task. Running the 1000 process() tasks under Joblib's parallel computing takes 3.237 s, while running the loop sequentially takes 5.798 s.
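One detail the example above does not show: Parallel collects the tasks' return values into a list, in the same order as the inputs. A minimal sketch (math.sqrt is just a stand-in task):

from joblib import Parallel, delayed
import math

# Each task's return value is collected into a list, in input order
results = Parallel(n_jobs=2)(delayed(math.sqrt)(i) for i in range(10))
print(results)  # [0.0, 1.0, 1.414..., 1.732..., 2.0, ...]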

As tasks grow more computationally complex and the number of independent tasks increases, the advantage of parallel computing becomes more pronounced. Relative to the number of parallel workers, however, the speedup is sometimes less than expected. The reason is that, by default, Parallel starts separate Python worker processes to execute tasks concurrently on different CPU cores, and the input and output data must be serialized through a queue to communicate with those worker processes, which can add considerable overhead. As a result, Joblib's parallel computing can be less efficient for small-scale tasks.
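When that serialization overhead dominates, one option worth knowing is Parallel's backend parameter. The sketch below (reusing the process function from above) switches to the thread-based backend, which avoids inter-process serialization; note that threads only help when the task spends its time in I/O or in C code that releases the GIL, as many numpy operations do:

from joblib import Parallel, delayed
import numpy as np

def process(i):
    data = np.random.rand(1000, 1000)

# Threads share memory, so arguments and results are not serialized
Parallel(n_jobs=4, backend="threading")(delayed(process)(i) for i in range(100))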

This concludes this introductory tutorial on the Python efficient computing library Joblib. For more content related to Joblib, please search my previous articles or continue browsing the related articles below, and I hope everyone will keep supporting me!