
How Python introduces coroutines, with an analysis of the principles

Related concepts

  • Concurrency: within a period of time, several programs are running on the same CPU, but at any given moment only one of them is actually running on the CPU. For example, if the CPU switches between 100 processes within one second, the CPU can be said to have a concurrency of 100.
  • Parallelism: at any point in time, more than one program is running on the CPUs at the same time. This can be understood as having multiple CPUs, each independently running its own program without interfering with the others. The degree of parallelism equals the number of CPUs.

We usually talk about high concurrency rather than high parallelism, because the number of CPUs is limited and cannot be increased at will.

A picture to understand it: the CPU corresponds to a person, and the program corresponds to making tea. Making tea takes four steps (which can correspond to the program opening four threads): 1 boil the water, 2 prepare the tea leaves, 3 wash the tea cups, 4 brew the tea.

Concurrent way: while the water is boiling, do 2 prepare the tea leaves and 3 wash the tea cups; once the water has boiled, execute 4 brew the tea. This saves time compared with executing 1, 2, 3, 4 strictly in sequence.

Parallel way: four people are called in (four processes are opened) to execute tasks 1, 2, 3, 4 at the same time, and the total execution time depends on the most time-consuming step.

  • Synchronous (note that synchronous and asynchronous only describe IO operations): a call to an IO operation must wait for the IO operation to complete before a new call can be started.
  • Asynchronous: a way of calling IO operations that does not have to wait for the IO operation to complete before starting a new call.
  • Blocking: when the function is called, the current thread is suspended.
  • Non-blocking: when the function is called, the current thread is not suspended; the call returns immediately (a short socket sketch follows this list).
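To make blocking vs. non-blocking concrete, here is a minimal sketch (not from the original article) using a plain socket; the peer address is hypothetical and only for illustration. A blocking recv() suspends the calling thread until data arrives, while a non-blocking socket returns immediately and raises BlockingIOError if nothing is ready yet.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('example.com', 80))    # hypothetical peer, for illustration only
sock.sendall(b'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n')

# Blocking mode (the default): recv() would suspend the thread until data arrives.
# data = sock.recv(4096)

# Non-blocking mode: recv() returns at once, ready or not.
sock.setblocking(False)
try:
    data = sock.recv(4096)
except BlockingIOError:
    data = b''                       # no data yet; the thread is free to do other work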

IO multiplexing

select, poll, and epoll are all IO multiplexing mechanisms. IO multiplexing is a mechanism by which one process can monitor many file descriptors; once a descriptor becomes ready (generally read-ready or write-ready), the process is notified so it can perform the corresponding read or write. However, select, poll, and epoll are all essentially synchronous IO, because once a read/write event is ready the process still has to do the read or write itself (that is, copy the data from kernel space into the application buffer), and that read/write step is blocking. Asynchronous IO, by contrast, does not require the process to do the read or write itself: the asynchronous IO implementation takes care of copying the data from kernel space to user space.

select
The select function monitors three kinds of file descriptor sets: writefds, readfds, and exceptfds. After it is called, select blocks until a descriptor becomes ready (there is data to read or write, or an exception occurs) or until the timeout expires (timeout specifies how long to wait; passing NULL makes select block indefinitely, while a zero timeout makes it return immediately). When select returns, the ready descriptors can be found by traversing the fdset.

Advantages: good cross-platform support (almost all platforms support it).
Disadvantages: there is a maximum limit on the number of file descriptors a single process can monitor, generally 1024 on Linux. The limit can be raised by modifying a macro definition or even recompiling the kernel, but that also lowers efficiency.
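As an illustration of the select interface, here is a minimal single-threaded echo server sketched with Python's select.select; the address, port, and echo behaviour are made up for the example. Note how, after select returns, the program still has to walk the ready list itself.

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 8000))    # hypothetical address and port
server.listen()
server.setblocking(False)

inputs = [server]                   # descriptors we want to read from
while inputs:
    # Block until at least one descriptor is ready or the 1-second timeout expires.
    readable, writable, exceptional = select.select(inputs, [], inputs, 1.0)
    for s in readable:              # scan the ready list ourselves
        if s is server:             # new incoming connection
            conn, addr = server.accept()
            conn.setblocking(False)
            inputs.append(conn)
        else:
            data = s.recv(1024)
            if data:
                s.sendall(data)     # echo the data back
            else:                   # the peer closed the connection
                inputs.remove(s)
                s.close()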

poll

  Unlike select, which uses three bitmaps to represent the fd sets, poll is implemented with a pointer to an array of pollfd structures.

The pollfd structure contains both the events to monitor and the events that occurred, so there is no need for select's "parameter-value" style of passing. There is also no maximum number of pollfd entries (although performance degrades when the number gets too large). Like select, after poll returns the pollfd array still has to be traversed to find the ready descriptors.
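The same echo pattern can be sketched with Python's select.poll (available on most Unix systems, but not on Windows); descriptors are registered once with an event mask instead of being passed as three lists on every call. The address and port are again hypothetical.

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 8001))             # hypothetical address and port
server.listen()
server.setblocking(False)

poller = select.poll()                       # no fixed 1024-style upper limit
poller.register(server, select.POLLIN)       # tell poll we care about readability
fd_to_socket = {server.fileno(): server}

while True:
    # poll() returns (fd, event) pairs, which we still have to walk through.
    for fd, event in poller.poll(1000):      # timeout in milliseconds
        sock = fd_to_socket[fd]
        if sock is server and event & select.POLLIN:
            conn, _ = server.accept()
            conn.setblocking(False)
            poller.register(conn, select.POLLIN)
            fd_to_socket[conn.fileno()] = conn
        elif event & select.POLLIN:
            data = sock.recv(1024)
            if data:
                sock.sendall(data)           # echo the data back
            else:
                poller.unregister(fd)
                fd_to_socket.pop(fd)
                sock.close()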

From the above, both select and poll have to traverse the file descriptors after they return in order to find the ready sockets. In practice, out of a large number of simultaneously connected clients, only a few may be in the ready state at any given moment, so their efficiency decreases as the number of monitored descriptors grows.

epoll

epoll was introduced in the Linux 2.6 kernel (Windows does not support it) and is an enhanced version of the earlier select and poll. Compared with select and poll, epoll is more flexible and has no descriptor limit. epoll uses one file descriptor to manage many descriptors: it stores the events of the file descriptors the user cares about in a kernel event table, so the copy between user space and kernel space only needs to happen once.
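For comparison, here is a Linux-only sketch of the same echo pattern written with select.epoll from the standard library (address and port are hypothetical): descriptors are registered with the kernel once, and epoll.poll() hands back only the descriptors that are actually ready, so there is no per-call scan of everything that was registered.

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 8002))          # hypothetical address and port
server.listen()
server.setblocking(False)

epoll = select.epoll()
epoll.register(server.fileno(), select.EPOLLIN)
connections = {}

while True:
    for fd, events in epoll.poll(1):      # timeout in seconds
        if fd == server.fileno():         # new incoming connection
            conn, _ = server.accept()
            conn.setblocking(False)
            epoll.register(conn.fileno(), select.EPOLLIN)
            connections[conn.fileno()] = conn
        elif events & select.EPOLLIN:
            conn = connections[fd]
            data = conn.recv(1024)
            if data:
                conn.sendall(data)        # echo the data back
            else:
                epoll.unregister(fd)
                connections.pop(fd).close()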

How to choose?

① When concurrency is high but the connections are not very active, epoll performs better than select (for example, in a website or web system the user may close the connection at any time after requesting a page).

② When concurrency is not high but the connections are very active, select performs better than epoll (for example, in a game, once a connection is established the data stays active and is not interrupted).

Skipping some details: writing select-style code requires nesting multiple layers of callback functions, which brings a series of problems such as poor readability, difficulty managing shared state, and complexity when tracking down exceptions. This is why coroutines were introduced: they are both simple to write and fast to run.

Coroutines

For the above, we would like to address the following points:

  1. Write asynchronous code in a synchronous style, which keeps the code readable and easy to follow.
  2. Switch tasks within a single thread (switching between functions inside one thread is extremely fast).

(1) Threads are switched by the operating system; switching within a single thread means the programmer schedules the tasks.

(2) No locks are needed and concurrency is high: switching between functions inside a single thread performs far better than switching threads, so higher concurrency can be achieved.

For example, when writing a crawler:

def get_url(url):
    html = get_html(url)  # the network download is a time-consuming IO operation; we would like to switch to another function while it runs
    infos = parse_html(html)

# download the html for a url
def get_html(url):
    pass

# parse the web page
def parse_html(html):
    pass

In other words, we need a function that can pause, and into which we can send values while it is paused. (Recall that our generator functions satisfy both of these conditions.) This is how coroutines come in.

Generators: advanced usage

  • A generator can not only produce values but also receive values, via the send() method. Note: before calling send() with a non-None value, the generator must first be activated, either with ① next() or ② send(None).
def gen_func():
    html = yield ''  # a yield with an = sign on its left can 1: produce a value, 2: receive a value passed in by the caller
    print(html)
    yield 2
    yield 3
    return 'bobby'

if __name__ == '__main__':
    gen = gen_func()
    url = next(gen)
    print(url)
    html = 'bobby'
    gen.send(html)  # send() both passes a value into the generator and resumes it until the next yield

Print results:

bobby
  • close() method.
def gen_func():
    yield ''
    yield 2
    yield 3
    return 'bobby'

if __name__ == '__main__':
    gen = gen_func()
    url = next(gen)
    gen.close()
    next(gen)

output result:
StopIteration

Special note: after close() is called, a GeneratorExit exception is raised inside the generator at the paused yield. If you catch that exception with try/except and the generator then runs on to another yield, a RuntimeError ("generator ignored GeneratorExit") is raised, because once close() has been called the generator must terminate (and any further next() call raises StopIteration). So we should not try to swallow this exception. (This note can be ignored for now.)

def gen_func():
    try:
        yield ''
    except GeneratorExit:
        pass
    yield 2
    yield 3
    return 'bobby'

if __name__ == '__main__':
    gen = gen_func()
    print(next(gen))
    gen.close()
    next(gen)

output result:
RuntimeError: generator ignored GeneratorExit
  • The throw() method: throws an exception into the generator at the paused yield. The exception can be caught and handled inside the generator (see the sketch after the output below).
def gen_func():
    yield ''
    yield 2
    yield 3
    return 'bobby'

if __name__ == '__main__':
    gen = gen_func()
    print(next(gen))
    gen.throw(Exception, 'Download Error')

output result:
Exception: Download Error
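As mentioned above, the exception thrown in with throw() can also be caught inside the generator. A minimal sketch (not from the original article): if the paused yield is wrapped in try/except, the generator handles the exception and simply continues to its next yield.

def gen_func():
    try:
        yield ''
    except Exception as e:            # the exception thrown in via gen.throw() lands at the paused yield
        print('caught:', e)
    yield 2
    yield 3

if __name__ == '__main__':
    gen = gen_func()
    print(next(gen))                                 # prints the empty string yielded first
    # Because the generator catches the exception, throw() resumes it and
    # returns the next value it yields.
    print(gen.throw(Exception, 'Download Error'))    # prints "caught: Download Error", then 2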

yield from

Let's start with a function from the standard library: itertools.chain.

from itertools import chain

my_list = [1, 2, 3]
my_dict = {'frank': 'yangchao', 'ailsa': 'liuliu'}
for value in chain(my_list, my_dict, range(5, 10)):  # chain() accepts multiple iterables and traverses them one after another
    print(value)

Print results:
1
2
3
frank
ailsa
5
6
7
8
9

The same thing can be implemented with yield from. The first use of yield from: it yields the values of an iterable object one by one.

my_list = [1, 2, 3]
my_dict = {'frank': 'yangchao', 'ailsa': 'liuliu'}

def chain(*args, **kwargs):
    for iterable in args:
        yield from iterable

for value in chain(my_list, my_dict, range(5, 10)):
    print(value)

Look at the following code:

def gen():
    yield 1

def g1(gen):
    yield from gen

def main():
    g = g1(gen())
    g.send(None)

Code analysis: in this code main calls g1. main is called the caller, g1 is called the delegating generator, and gen is called the sub-generator. yield from establishes a two-way channel between the caller main and the sub-generator gen (in other words, the caller can talk to the sub-generator directly, bypassing the delegating generator).

Example: when yield from is used inside the delegating generator middle(), the caller main forms a data channel directly with the sub-generator sales_sum.

final_result = {}

def sales_sum(pro_name):
    total = 0
    nums = []
    while True:
        x = yield
        print(pro_name + ' sales', x)
        if not x:
            break
        total += x
        nums.append(x)
    return total, nums  # when the sub-generator returns, the return value is handed to the delegating generator, i.e. it becomes final_result[key] in middle

def middle(key):
    while True:  # keeps listening for sales_sum to return data (three times in this example)
        final_result[key] = yield from sales_sum(key)
        print(key + ' sales tally complete!')

def main():
    data_sets = {
        'Mask': [1200, 1500, 3000],
        'Cell phone': [88, 100, 98, 108],
        'Clothes': [280, 560, 778, 70],
    }

    for key, data_set in data_sets.items():
        print('start key', key)
        m = middle(key)
        m.send(None)  # prime the delegating generator
        for value in data_set:
            m.send(value)
        m.send(None)  # send None so that x in sales_sum becomes None and the while loop exits

    print(final_result)

if __name__ == '__main__':
    main()

Output:
start key Mask
Mask sales 1200
Mask sales 1500
Mask sales 3000
Mask sales None
Mask sales tally complete!
start key Cell phone
Cell phone sales 88
Cell phone sales 100
Cell phone sales 98
Cell phone sales 108
Cell phone sales None
Cell phone sales tally complete!
start key Clothes
Clothes sales 280
Clothes sales 560
Clothes sales 778
Clothes sales 70
Clothes sales None
Clothes sales tally complete!
{'Mask': (5700, [1200, 1500, 3000]), 'Cell phone': (394, [88, 100, 98, 108]), 'Clothes': (1688, [280, 560, 778, 70])}

One might wonder why main() can't just call sales_sum directly; adding a delegating generator seems to complicate the code. Look at the following code, which calls sales_sum directly from the main block:

def sales_sum(pro_name):
    total = 0
    nums = []
    while True:
        x = yield
        print(pro_name + ' sales', x)
        if not x:
            break
        total += x
        nums.append(x)
    return total, nums

if __name__ == '__main__':
    my_gen = sales_sum('Mask')
    my_gen.send(None)
    my_gen.send(1200)
    my_gen.send(1500)
    my_gen.send(3000)
    my_gen.send(None)

output result:
Mask sales 1200
Mask sales 1500
Mask sales 3000
Mask sales None
Traceback (most recent call last):
 File "D:/MyCode/Cuiqingcai/Flask/", line 56, in <module>
 my_gen.send(None)
StopIteration: (5700, [1200, 1500, 3000])

As the code above shows, even though the data is returned, a StopIteration exception is still raised when the sub-generator finishes. One of the biggest advantages of yield from is that it automatically handles such exceptions raised by the sub-generator.

Summary of the yield from function:

The values produced by the sub-generator are passed straight through to the caller; the values the caller sends with send() are passed straight to the sub-generator. If None is sent, the sub-generator's __next__() method is called; if a non-None value is sent, its send() method is called.
When the sub-generator executes return EXPR, it raises StopIteration(EXPR) on exit.
The value of the yield from expression is the first argument of the StopIteration exception raised when the sub-generator terminates.
If the sub-generator exits with StopIteration, the delegating generator resumes running; any other exception bubbles upward.
Of the exceptions thrown into the delegating generator, all except GeneratorExit are passed to the sub-generator's throw() method; if that throw() call raises StopIteration, the delegating generator resumes, and any other exception bubbles upward.
If close() is called on the delegating generator, or GeneratorExit is thrown into it, the sub-generator's close() method is called if it has one. If that close() call raises an exception, the exception bubbles upward; otherwise the delegating generator raises GeneratorExit.
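A small sketch (not part of the original article) that ties these rules together: values sent by the caller pass straight through the delegating generator to the sub-generator, and the sub-generator's return value becomes the value of the yield from expression.

def sub():
    total = 0
    while True:
        x = yield            # receives whatever the caller send()s
        if x is None:
            break
        total += x
    return total             # becomes the value of "yield from sub()"

def delegator():
    result = yield from sub()        # StopIteration(result) is absorbed here
    print('sub-generator returned', result)

if __name__ == '__main__':
    d = delegator()
    d.send(None)             # prime the delegating generator
    d.send(10)               # passed straight through to sub()
    d.send(20)
    try:
        d.send(None)         # makes sub() return; the delegator then finishes too
    except StopIteration:
        pass                 # prints: sub-generator returned 30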

That is the detailed analysis of how Python introduces coroutines and of the principles behind them. For more information about Python coroutines, please check out my other related articles!