
How Python multithreading works with multiple files at the same time

Python multithreading: processing multiple files at the same time

When you need to perform the same operation on a large number of files, processing them one by one is very time-consuming. In this case, we can use Python's multithreading to improve processing efficiency and reduce the total processing time.

Background to the issue

For example, suppose we need to read all the videos under a folder and then process each video frame by frame.

Since frame-by-frame processing is itself a relatively time-consuming task, handling each video file serially, one after another, is very inefficient. A simple solution is to use Python's multithreading.

Defining Generic Handler Functions

Concurrency is suitable for processing similar tasks. For example, if we need to process each frame of a video, we can write a handler function that accepts a list of video names and processes the videos in the list sequentially:

import cv2

def func(video_names):
    for video_name in video_names:
        cap = cv2.VideoCapture(video_name)   # open the video file
        while True:
            ret, frame = cap.read()          # read the next frame
            if ret:
                # process the current frame here
                pass
            else:
                break
        cap.release()                        # free the capture handle

In this way, we only need to divide the names of the videos to be processed into several sub-lists, one per thread, and each thread can then process its batch concurrently.
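A minimal sketch of this splitting step might look like the following (the helper name split_into_chunks and the file names are placeholders for illustration, not part of the original code):

def split_into_chunks(names, num_chunks):
    # Split a list of file names into num_chunks roughly equal sub-lists.
    chunk_size = (len(names) + num_chunks - 1) // num_chunks
    return [names[i:i + chunk_size] for i in range(0, len(names), chunk_size)]

# e.g. 4 threads -> 4 sub-lists (file names are only illustrative)
all_video_names = ['v1.mp4', 'v2.mp4', 'v3.mp4', 'v4.mp4',
                   'v5.mp4', 'v6.mp4', 'v7.mp4', 'v8.mp4']
video_names_list = split_into_chunks(all_video_names, 4)
# -> [['v1.mp4', 'v2.mp4'], ['v3.mp4', 'v4.mp4'], ['v5.mp4', 'v6.mp4'], ['v7.mp4', 'v8.mp4']]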

Multithreading with threading.Thread

Multithreading in Python is implemented through the threading library. The multiprocessing library can also be used for concurrent processing, but the two differ in how they work. Here we use threading to process multiple different files concurrently.

import threading

# video_names_list = [part_names_1_list, part_names_2_list, ..., part_names_k_list]
for part_video in video_names_list:
    thread = threading.Thread(target=func, args=(part_video,))
    thread.start()

Here, the list of file names to be processed is first divided into sub-lists, and a thread is then started for each sub-list.
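Putting the two steps together, a minimal end-to-end sketch might look like this (split_into_chunks and all_video_names are the placeholders from the earlier sketch, and func is the handler defined above; keeping the Thread objects in a list also lets us join() them and wait for all chunks to finish):

import threading

NUM_THREADS = 4
video_names_list = split_into_chunks(all_video_names, NUM_THREADS)

threads = []
for part_video in video_names_list:
    t = threading.Thread(target=func, args=(part_video,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()   # block until every chunk has been processed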

Python multithreaded file manipulation

Here we use Python with multithreading to write one million URLs from a CSV file into a MongoDB database.

The full code is as follows:

import os
import threading  # threading module used for multithreading
import csv
import time
from Mongo_cache import MongoCache
import win32com.client  # assumed reconstruction: used for the text-to-speech notice at the end
import winsound

NUM_THREAD = 5
COUNT = 0
lock = threading.Lock()
cache = MongoCache()     # Database connection initialization

def worker():
    """
    func: read data from a csv file and return the data
    """
    for path in os.listdir(os.getcwd()):
        #print("Current working directory", path)
        file_name = path.split('.')
        #print(file_name)
        if file_name[-1] == 'csv':
            #print("The address is:", path)
            file = open(path)
            data = csv.reader(file)
            return data
        else:
            pass

def save_info(data, i, num_retries=2):
    """
    func: save the data
    """
    global COUNT
    global lock
    global cache
    for _, website in data:
        try:
            lock.acquire()
            #print("Thread {} is running".format(threading.current_thread().name, i))
            item = {'website': website}
            cache(item)
            COUNT += 1
        except:
            if num_retries > 0:
                save_info(data, i, num_retries - 1)
        finally:
            lock.release()

def main():
    """
    Start the worker threads
    """
    print("start working")
    print("working...")
    data = worker()
    threads = []   # list of worker threads
    for i in range(NUM_THREAD):
        t = threading.Thread(target=save_info, args=(data, i))
        threads.append(t)
    for i in range(NUM_THREAD):
        threads[i].start()
    for i in range(NUM_THREAD):
        threads[i].join()
    print("all was done!")

if __name__ == '__main__':
    s_time = time.time()
    main()
    e_time = time.time()
    print("Total number of message entries:", COUNT)
    print("Total time consumed:", e_time - s_time)
    speak = win32com.client.Dispatch('SAPI.SpVoice')  # assumed reconstruction: Windows SAPI text-to-speech
    speak.Speak("Good morning, eric, end of program!")

Data Storage Module

import pickle
import zlib
from bson.binary import Binary
from datetime import datetime, timedelta
from pymongo import MongoClient
import time

class MongoCache(object):
    def __init__(self, client=None, expires=timedelta(days=30)):
        self.client = MongoClient('localhost', 27017) if client is None else client
        self.db = self.client.cache  # database name assumed; the original was lost in formatting
        #self.db.webpage.create_index('timestamp', expireAfterSeconds=expires.total_seconds()) # Set the automatic deletion time
    def __call__(self, url):
        self.db.webpage.insert(url)  # assumed reconstruction: the original call target was lost in formatting
        #print("Saved successfully")
    def __contains__(self, url):
        try:
            self[url]
        except KeyError:
            return False
        else:
            return True
    def __getitem__(self, url):
        record = self.db.webpage.find_one({'_id': url})
        if record:
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')
    def __setitem__(self, url, result):
        record = {'result': Binary(zlib.compress(pickle.dumps(result))), 'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)
    def clear(self):
        self.db.webpage.drop()
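As a quick sanity check of the storage module on its own, a minimal usage sketch might look like the following (it assumes a local MongoDB instance is running on the default port 27017; the URL and result values are only illustrative):

from Mongo_cache import MongoCache

cache = MongoCache()
cache['http://example.com'] = {'html': 'example page content'}  # stored compressed via __setitem__
print('http://example.com' in cache)                            # True, via __contains__
print(cache['http://example.com'])                              # decompressed again via __getitem__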

The time taken to save one million URLs from the CSV file to the database is as follows:

start working
working...
all was done!
Total number of message entries: 1000000
Total time consumed: 427.4034459590912

Summary

The above is based on my personal experience; I hope it can serve as a reference for you, and I appreciate your support.