Speeding up Python with numba
Python is really easy to use, but it is painfully slow; C++ is very fast, but hard enough to write that many of us would avoid touching it for life if we could. Is there any way to get the best of both worlds? As the saying goes, there are always more solutions than difficulties. Everyone runs into this problem, so naturally people have tried to solve it, which brings us to today's protagonist: numba.
Before introducing numba, though, we should first look at why Python is so slow.
1. Why is Python so slow
Anyone who has used Python knows that it can be much slower than C++, especially where loops are involved, which is why many people avoid introducing complex for loops into Python code. Let's think about how writing Python differs from writing C++:
Dynamic typing
If you have written C/C++, you know that variables must be declared with strict types: an int is an int and a float is a float. Python is different. As anyone who has written Python knows, it does away with variable declarations and explicit data types; whatever the data is, you can just assign it and not worry. How does Python achieve this freedom? The answer is that everything in Python is an object, and the actual data lives inside the object. Even for a simple addition of two variables, Python must first determine the types of the operands and then extract the values before calculating; for C, a couple of memory reads and a machine ADD instruction are enough. C/C++ can express dynamically typed values too, but the declarations involved are complex and headache-inducing.
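To make the object overhead concrete, here is a minimal illustration (exact byte counts vary across CPython versions and platforms; the figures in the comments are typical for 64-bit CPython):

import sys

# Even a "plain" integer is a full object, carrying a reference count and
# a type pointer in addition to the actual digits.
print(sys.getsizeof(1))         # typically 28 bytes on 64-bit CPython
print(sys.getsizeof(2 ** 100))  # Python ints are arbitrary precision, so larger

# Before `a + b` runs, the interpreter must inspect both operands' types
# at run time and dispatch to the matching __add__ implementation.
a, b = 1, 2.5
print(type(a), type(b), a + b)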
Interpreted language
The biggest advantage of compiled languages like C/C++ is that compilation happens before running: the compiler converts the source code into executable machine code ahead of time, which saves a great deal of time at run time. Python, as an interpreted language, cannot be compiled once and simply re-run afterwards; on every run, the interpreter has to translate the source code before executing it. The upside is that this makes Python very easy to debug (let me sigh once more: Python really is a beginner-friendly language~). Of course, there are solutions to the resulting slowness, and a key technology is JIT (just-in-time) compilation: at run time, the functions or blocks being called are compiled into machine code and loaded into memory to speed up execution. To put it bluntly, just before a piece of code is executed for the first time, it is compiled, and then the compiled version is executed.
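We can even look at the bytecode the interpreter executes, using the standard library's dis module; this is just a quick illustration of the stages described above:

import dis

def add(a, b):
    return a + b

# Prints the bytecode CPython generated for `add`. The addition is one
# generic opcode (BINARY_ADD, or BINARY_OP on CPython 3.11+) that the
# interpreter loop must dispatch anew on every call.
dis.dis(add)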
The above lists only two points; there are more reasons that I won't detail here for space. But the numba we mentioned at the beginning accelerates Python code exactly through JIT. So how do we use numba to speed up our code? Let's look at some simple examples:
2. Small examples of speeding up Python
How simple and convenient is it to accelerate Python code with numba? Let's see.
If you are asked to compute the sum of all elements of a matrix in plain Python, it is easy to write the following code:
def cal_sum(a):
    result = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            result += a[i, j]
    return result
When the matrix is small, the speed seems acceptable, but if the input matrix has shape (500, 500):

import numpy as np

a = np.random.random((500, 500))
%timeit cal_sum(a)
The output result is:
47.8 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Let's try adding numba:
import numba

@numba.jit(nopython=True)
def cal_sum(a):
    result = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            result += a[i, j]
    return result
Feed in a matrix of the same size:

a = np.random.random((500, 500))
%timeit cal_sum(a)
The output result is:
236 µs ± 545 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that we used %timeit to measure the run time here (we will come back to the reason later). Comparing the two timings, numba clearly achieves a very significant speedup!
Now let's look at how numba is actually used to accelerate Python code. In practice, numba is applied to a Python function in the form of a decorator; the user does not have to care what method numba uses to optimize the code, just call the function. Note that the @jit decorator takes a nopython parameter, which selects between numba's two modes of operation: nopython mode and object mode. Only nopython mode delivers the best speedup. If numba finds something in your code that it cannot understand, it automatically falls back to object mode so that the program can at least run (which, of course, largely defeats the purpose of using numba). If we write the decorator as @jit(nopython=True) or @njit, numba assumes you already understand the function being accelerated well enough, and it refuses to fall back to object mode; if compilation fails, an exception is raised instead.
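As a quick illustration of the two modes (a sketch; the exact exception type and message depend on your numba version), a fully typeable function compiles fine under @njit, while one using an unsupported feature such as file I/O raises an error at its first call instead of silently dropping into object mode:

import numba

@numba.njit
def ok(n):
    total = 0
    for i in range(n):
        total += i
    return total

@numba.njit
def not_ok(path):
    # open() is not supported in nopython mode
    return open(path).read()

print(ok(10))           # compiles at the first call, then runs as machine code
try:
    not_ok("data.txt")  # compilation fails here, at the first call
except Exception as e:
    print("nopython compilation failed:", type(e).__name__)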
At this point you may be wondering: how does numba manage to speed up Python code?
Running Python code involves four stages: lexical analysis -> syntax analysis -> bytecode generation -> interpreting the bytecode into machine operations. Common Python interpreters include CPython, IPython, PyPy, Jython, and IronPython. Unlike these interpreters, numba uses LLVM compilation technology to translate the bytecode into machine code.
LLVM is a compiler infrastructure that takes intermediate code and compiles it into machine code, applying many additional optimization passes in the process. For example, LLVM can treat frequently executed blocks as "hot code" and optimize them aggressively. The LLVM toolchain is very good at this kind of optimization: it not only compiles the code numba produces, but also optimizes it.
The first time a numba-decorated function is called, numba infers the argument types from the actual call and compiles a machine-code version of the function for those types. This takes some time, but once compilation is done, numba caches the machine-code version for that specific combination of argument types. If the function is called again with the same types, the cached machine code is reused without recompiling. Three practical notes follow from this:
1. When measuring performance, a single naive timing includes the time spent compiling the function during that first run; the most accurate measure is the run time of the second and later calls. (This is the reason we used %timeit earlier: it runs the function many times, so the one-off compilation cost does not dominate.)
2. As for the influence of the input types, we can run a simple experiment to see the effect:
import time

a = np.random.random((5000, 5000))

# The first call's time includes the compilation time
start = time.time()
cal_sum(a)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

# The function has been compiled and the machine code cached
start = time.time()
cal_sum(a)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

# a itself has dtype np.float64; b holds the same data as np.float32
b = a.astype(np.float32)

# Call the same function, but the input dtype is now np.float32
start = time.time()
cal_sum(b)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
Output result:
Elapsed (with compilation) = 0.20406198501586914
Elapsed (after compilation) = 0.025263309478759766
Elapsed (after compilation) = 0.07892274856567383
You can see that when we pass in a data type different from the one seen at the first compilation, numba must compile a new specialization, so that call takes noticeably longer again, though still far less than the original compilation.
3. Explicitly specifying the input and output data types for the numba decorator speeds up the first call, because the function is then compiled eagerly when the decorator is applied rather than lazily at the first call. The drawback is that, once specified, the function can only be called with the declared data types.
a = np.random.random((500, 500))

@numba.jit(nopython=True)
def cal_sum1(a):
    result = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            result += a[i, j]
    return result

@numba.jit('float64(float64[:, :])', nopython=True)
def cal_sum2(a):
    result = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            result += a[i, j]
    return result

# No declared signature: numba infers the types at the first call
start = time.time()
cal_sum1(a)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

# Declared signature: compilation already happened at decoration time
start = time.time()
cal_sum2(a)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))
The timings:
Elapsed (after compilation) = 0.054465532302856445
Elapsed (after compilation) = 0.0004112720489501953
As you can see, the time spent at the first call drops dramatically; in fact, the second figure is essentially just the cost of running the already-compiled machine code.
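We can also confirm the per-type caching described above by inspecting the jitted function object itself: numba's dispatcher exposes a signatures attribute listing every compiled specialization (the exact printed form varies by numba version):

# After the float64 and float32 calls in the experiment above, cal_sum
# holds one compiled specialization per argument type it has seen:
print(cal_sum.signatures)
# e.g. [(array(float64, 2d, C),), (array(float32, 2d, C),)]

# To keep compiled machine code across interpreter restarts, numba can
# also cache it on disk: @numba.jit(nopython=True, cache=True)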
Having said all this, you might object: numpy has long had a function for summing a matrix; it's simple to use and fast, so why go through all this trouble?
True: for the simple example above, numpy and numba give roughly the same speedup. But in practice, not every for loop can be replaced by an existing numpy function, whereas numba delivers a very good speedup on essentially any for loop, provided the code inside the loop is something numba can understand.
In actual use, the general recommendation is to extract the computation-intensive part of the code into a separate function and optimize that function in nopython mode, keeping the rest in native Python. That way, the code that is not an obvious hot spot (or cannot be accelerated) retains access to arbitrary Python functions for your program logic, while the hot path still enjoys numba's speedup; see the sketch below.
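Here is a sketch of that structure (the function names are made up for illustration): the numeric hot loop lives in a small nopython function, while validation, conversion, and I/O stay in plain Python:

import numpy as np
from numba import njit

@njit
def _pairwise_min_dist(points):
    # compute-intensive kernel: plain loops over numba-friendly types
    best = np.inf
    n, dim = points.shape
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.0
            for k in range(dim):
                diff = points[i, k] - points[j, k]
                d += diff * diff
            if d < best:
                best = d
    return best ** 0.5

def report_min_distance(points, logger=print):
    # plain-Python driver: argument handling and logging stay out here
    points = np.ascontiguousarray(points, dtype=np.float64)
    logger("min distance: %.4f" % _pairwise_min_dist(points))

report_min_distance(np.random.random((100, 2)))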
3. Accelerating numpy operations
As mentioned above, one highlight of numba is accelerating for loops, but numba can also accelerate numpy operations themselves, since even numpy's precompiled routines are not always as fast as the machine code numba generates. numba is especially good at accelerating numpy's basic operations (addition, multiplication, squaring, and so on); to be precise, it works best when the numpy code applies the same operation to every element.
Let's look at a brief example of numba accelerating a numpy operation:
import time
import numpy as np
import numba

a = np.ones((1000, 1000), np.int64) * 5
b = np.ones((1000, 1000), np.int64) * 10
c = np.ones((1000, 1000), np.int64) * 15

def add_arrays(a, b, c):
    return a + b + c

@numba.njit
def add_arrays_numba(a, b, c):
    return a + b + c

# The first call triggers (and includes) the compilation
add_arrays_numba(a, b, c)

# The function has been compiled and the machine code cached
start = time.time()
add_arrays_numba(a, b, c)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

# Without numba
start = time.time()
add_arrays(a, b, c)
end = time.time()
print("Elapsed = %s" % (end - start))
Elapsed (after compilation) = 0.002088785171508789
Elapsed = 0.0031290054321289062
When we perform basic array computations on numpy arrays, such as addition, multiplication, and squaring, numpy automatically vectorizes them internally, which is why numpy outperforms native Python code. But in specific cases numpy code still falls short of optimized machine code, so applying numba directly to numpy operations can yield a further speedup.
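In the same spirit, numba can also build numpy-style ufuncs from scalar code via its @vectorize decorator. The following is a sketch (the function itself is made up for illustration):

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64)'])
def scaled_diff(a, b):
    # written as scalar code, compiled into a ufunc that broadcasts
    # over whole arrays like any built-in numpy operation
    return (a - b) * 0.5

x = np.random.random(1000000)
y = np.random.random(1000000)
z = scaled_diff(x, y)  # elementwise over the full arrays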
Another example, slightly simplified from MMDetection3D, clips point coordinates (x, y) into a given range [x_min, y_min, x_max, y_max]:
import numpy as np
from numba import njit

x = np.random.random(5000) * 5000
y = np.random.random(5000) * 5000
x_min = 0
x_max = 1000
y_min = 0
y_max = 2000

@njit
def get_clip_numba(x, y, x_min, y_min, x_max, y_max):
    z = np.stack((x, y), axis=1)
    z[:, 0] = np.clip(z[:, 0], x_min, x_max)
    z[:, 1] = np.clip(z[:, 1], y_min, y_max)
    return z

def get_clip(x, y, x_min, y_min, x_max, y_max):
    z = np.stack((x, y), axis=1)
    z[:, 0] = np.clip(z[:, 0], x_min, x_max)
    z[:, 1] = np.clip(z[:, 1], y_min, y_max)
    return z

%timeit get_clip_numba(x, y, x_min, y_min, x_max, y_max)
%timeit get_clip(x, y, x_min, y_min, x_max, y_max)
The timings, respectively:
33.8 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
57.2 µs ± 258 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In practice, not every numpy function gets faster under numba; in some cases numba can even slow numpy down. It is therefore recommended to benchmark ahead of time to confirm the speedup in your actual use case. numba typically pays off on numpy code when for loops and numpy operations are mixed together; it supports most of numpy's commonly used functions.
4. Accelerating with CUDA
An even more powerful feature of numba is that we can write CUDA kernels directly in Python and have our Python program compiled for and run on the GPU. numba supports CUDA GPU programming by compiling Python code into CUDA kernels and device functions that follow the CUDA execution model (although, in fact, numba currently supports only a small subset of the CUDA APIs; one hopes the development team will be a bit more attentive~~~). To avoid wasting time copying numpy arrays to a device and then storing results back into a numpy array, numba provides functions to declare arrays on, and send arrays to, a given device, saving unnecessary copies to the CPU.
Common memory allocation and transfer functions:
- cuda.device_array(): allocate an empty array on the device, similar to numpy.empty();
- cuda.to_device(): copy data from the host to the device;
- cuda.copy_to_host(): copy data from the device back to the host.
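A minimal usage sketch of these transfer helpers (this needs a CUDA-capable GPU; cuda.device_array_like is a convenience variant of cuda.device_array, and copy_to_host is called on the device array itself):

from numba import cuda
import numpy as np

a = np.arange(10, dtype=np.float32)

d_a = cuda.to_device(a)            # host -> device copy
d_out = cuda.device_array_like(a)  # uninitialized buffer on the device

# ... launch a kernel that reads d_a and writes d_out ...

out = d_out.copy_to_host()         # device -> host copy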
We can see the effect of CUDA acceleration through numba with a simple matrix-addition example:
from numba import cuda  # CUDA support in numba
import numpy as np
import math
from time import time

@cuda.jit
def matrix_add(a, b, result, m, n):
    # map the flat 1D thread index onto 2D matrix coordinates
    pos = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    idx = pos // n
    idy = pos % n
    if idx < m and idy < n:
        result[idx, idy] = a[idx, idy] + b[idx, idy]

m = 5000
n = 4000
x = np.arange(m * n).reshape((m, n)).astype(np.int32)
y = np.arange(m * n).reshape((m, n)).astype(np.int32)

# Copy the data to the device
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

# Allocate space on the device to hold the GPU results
gpu_result1 = cuda.device_array((m, n))
gpu_result2 = cuda.device_array((m, n))
cpu_result = np.zeros((m, n))

threads_per_block = 1024
blocks_per_grid = math.ceil(m * n / threads_per_block)

# The first call includes the compilation time
start = time()
matrix_add[blocks_per_grid, threads_per_block](x_device, y_device, gpu_result1, m, n)
cuda.synchronize()
print("gpu matrix add time (with compilation) " + str(time() - start))

start = time()
matrix_add[blocks_per_grid, threads_per_block](x_device, y_device, gpu_result2, m, n)
cuda.synchronize()
print("gpu matrix add time (after compilation) " + str(time() - start))

start = time()
cpu_result = np.add(x, y)
print("cpu matrix add time " + str(time() - start))
The running times are:
gpu matrix add time (with compilation) 0.15977692604064941
gpu matrix add time (after compilation) 0.0005376338958740234
cpu matrix add time 0.023023128509521484
CUDA acceleration in numba is done through the @cuda.jit decorator. From the results, we can see that by offloading the computation to CUDA, numba significantly accelerates the Python program.
5. The influence of how the loop is written
The following code is excerpted from MMDetection3D and is used to determine whether each of a series of points lies inside each of a series of convex polygons.
There are two ways we can write it.
Version 1: access the polygon variable and recompute the edge vector inside every iteration of the loop:
import numpy as np
import numba

@numba.jit(nopython=True)
def points_in_convex_polygon1(points, polygon, clockwise=True):
    # first convert polygon to directed lines
    num_points_of_polygon = polygon.shape[1]
    num_points = points.shape[0]
    num_polygons = polygon.shape[0]
    vec1 = np.zeros((2,), dtype=polygon.dtype)
    ret = np.zeros((num_points, num_polygons), dtype=np.bool_)
    success = True
    cross = 0.0
    for i in range(num_points):
        for j in range(num_polygons):
            success = True
            for k in range(num_points_of_polygon):
                # recompute the edge vector on every iteration
                if clockwise:
                    vec1 = polygon[j, k] - polygon[j, k - 1]
                else:
                    vec1 = polygon[j, k - 1] - polygon[j, k]
                cross = vec1[1] * (polygon[j, k, 0] - points[i, 0])
                cross -= vec1[0] * (polygon[j, k, 1] - points[i, 1])
                if cross >= 0:
                    success = False
                    break
            ret[i, j] = success
    return ret
Version 2: pre-compute all the edge vectors before the loop:
@numba.jit(nopython=True)
def points_in_convex_polygon2(points, polygon, clockwise=True):
    # first convert polygon to directed lines
    num_points_of_polygon = polygon.shape[1]
    num_points = points.shape[0]
    num_polygons = polygon.shape[0]
    # edge vectors for all the polygons, computed once up front
    if clockwise:
        vec1 = polygon - polygon[:, [num_points_of_polygon - 1] + list(range(num_points_of_polygon - 1)), :]
    else:
        vec1 = polygon[:, [num_points_of_polygon - 1] + list(range(num_points_of_polygon - 1)), :] - polygon
    ret = np.zeros((num_points, num_polygons), dtype=np.bool_)
    success = True
    cross = 0.0
    for i in range(num_points):
        for j in range(num_polygons):
            success = True
            for k in range(num_points_of_polygon):
                vec = vec1[j, k]
                cross = vec[1] * (polygon[j, k, 0] - points[i, 0])
                cross -= vec[0] * (polygon[j, k, 1] - points[i, 1])
                if cross >= 0:
                    success = False
                    break
            ret[i, j] = success
    return ret
A simple speed test of the two versions:

import time

points = np.random.random((20000, 2)) * 100
polygon = np.random.random((1000, 100, 2)) * 200

start = time.time()
points_in_convex_polygon1(points, polygon)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
points_in_convex_polygon1(points, polygon)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

start = time.time()
points_in_convex_polygon2(points, polygon)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

start = time.time()
points_in_convex_polygon2(points, polygon)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
The output timings:
Elapsed (with compilation) = 3.9232356548309326
Elapsed (after compilation) = 3.6778993606567383
Elapsed (with compilation) = 0.6269152164459229
Elapsed (after compilation) = 0.22288227081298828
The tests show that the second version is faster. In general, minimizing memory accesses and redundant computation inside for loops reduces a function's run time.
Summary
We have introduced common scenarios where numba can effectively speed up code. When you use it, though, it is recommended to experiment and compare the timings with and without numba (sometimes code actually gets slower with numba...). In addition, MMDetection3D adopted numba-accelerated code very early on, and we have upgraded the numba version in MMDetection3D to obtain better numpy compatibility and acceleration.
That is all for this article on speeding up Python with numba; I hope it helps you accelerate your own code!