Implementing a simple file search engine using Python

This article covers basic and advanced Python file operations. The basics include reading and writing files, file and directory management, error handling, file path operations, file encoding, processing large files, temporary files, file permissions, and a simple file search engine example. The advanced section covers file modes, buffering, file locking, advanced file search techniques, file system monitoring, cross-platform file path handling, performance considerations, security, and a further optimized file search engine example.

Basics

Reading and writing files

Sample code:

# Read the file ('example.txt' is a placeholder file name)
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)

# Write to a file
with open('example.txt', 'w') as file:
    file.write('Hello, World!')

No additional packages need to be installed: Python's built-in open function can read and write files.

File and directory management

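Sample code:

import os
import shutil

# Create a directory
os.mkdir('new_directory')

# Rename the directory
os.rename('new_directory', 'renamed_directory')

# Delete the file
os.remove('old_file.txt')

# Copy the file ('source.txt' and 'destination.txt' are placeholder names)
shutil.copy('source.txt', 'destination.txt')

# List the contents of the directory
print(os.listdir('.'))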

Package introduction:

os module: provides a rich set of functions for working with files and directories.
shutil module: provides higher-level operations on files and collections of files.
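
As a rough illustration of how these os and shutil calls fit together, the sketch below runs them inside a throwaway directory created with tempfile, so nothing in the working directory is touched. All file and directory names in it are hypothetical.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workspace:
    # All names below are hypothetical and live inside the throwaway workspace.
    new_dir = os.path.join(workspace, 'new_directory')
    renamed_dir = os.path.join(workspace, 'renamed_directory')
    os.mkdir(new_dir)
    os.rename(new_dir, renamed_dir)

    source = os.path.join(workspace, 'source.txt')
    with open(source, 'w') as file:
        file.write('Hello, World!')

    # Copy the file into the renamed directory, then delete the original.
    shutil.copy(source, os.path.join(renamed_dir, 'copy.txt'))
    os.remove(source)

    top_level = sorted(os.listdir(workspace))
    inside_renamed = os.listdir(renamed_dir)
    print(top_level, inside_renamed)
```

After the block finishes, the workspace and everything in it are removed automatically.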

Error handling

It is important to handle potential errors during file operations. For example, trying to open a file that does not exist raises FileNotFoundError. The try and except statements let you handle these situations gracefully:

try:
    with open('non_existent_file.txt', 'r') as file:
        content = file.read()
except FileNotFoundError:
    print("The file does not exist.")
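
The same pattern extends to other failures. As a minimal sketch, the hypothetical helper read_text_or_default (it is not part of the original example) handles FileNotFoundError and PermissionError separately and falls back to a default value:

```python
def read_text_or_default(path, default=''):
    """Return the file's text, or `default` if the file cannot be read.

    read_text_or_default is a hypothetical helper used only for illustration.
    """
    try:
        with open(path, 'r', encoding='utf-8') as file:
            return file.read()
    except FileNotFoundError:
        print(f"The file {path!r} does not exist.")
    except PermissionError:
        print(f"No permission to read {path!r}.")
    return default


content = read_text_or_default('non_existent_file.txt', default='(empty)')
print(content)
```

Catching specific exception classes rather than a bare except keeps unrelated bugs visible.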

Context Manager

Python's with statement provides a concise way to manage resources, especially for file operations. Using with ensures that the file is closed correctly after use, even if an exception occurs during the file operation.

# 'example.txt' is a placeholder file name
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)
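
To illustrate that guarantee, the following sketch raises an exception inside a with block and then checks the file object's closed attribute. The temporary directory and the file name demo.txt are assumptions made only for this demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name
    try:
        with open(path, 'w') as file:
            file.write('Hello, World!')
            raise RuntimeError('simulated failure')
    except RuntimeError:
        pass
    # The with statement closed the file even though an exception was raised.
    closed_after_error = file.closed
    print(closed_after_error)
```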

File path operations

Although the os module provides basic path functions, the pathlib module offers a more object-oriented way to handle file paths. Using pathlib can make path operations more intuitive and easier to maintain:

from pathlib import Path

# Current directory path
current_dir = Path('.')
# List all entries (files and directories) in the current directory
for file in current_dir.iterdir():
    print(file)

# Read the file ('example.txt' is a placeholder file name)
file_path = current_dir / 'example.txt'
with file_path.open('r') as file:
    content = file.read()

File encoding

When working with text files, it is very important to consider the encoding of the file. By default, Python opens files with the system's default encoding, which can cause problems when porting code between different systems. Specifying the encoding ensures that the file is read and written correctly:

# Open a file using UTF-8 encoding ('example.txt' is a placeholder file name)
with open('example.txt', 'r', encoding='utf-8') as file:
    content = file.read()
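
To see why the encoding matters, this small sketch writes a string containing non-ASCII characters as UTF-8 and reads it back twice, once with the matching encoding and once with Latin-1. The file name encoded.txt and the sample text are made up for the demonstration.

```python
import os
import tempfile

text = 'caf\u00e9 na\u00efve'  # contains the non-ASCII characters é and ï

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'encoded.txt')  # hypothetical file name

    # Write the text as UTF-8.
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)

    # Reading with the same encoding returns the original text.
    with open(path, 'r', encoding='utf-8') as file:
        same_encoding = file.read()

    # Reading with a different encoding (Latin-1 here) raises no error
    # but produces garbled characters.
    with open(path, 'r', encoding='latin-1') as file:
        wrong_encoding = file.read()

print(same_encoding == text)
print(wrong_encoding == text)
```

The mismatch goes unnoticed at read time, which is why stating the encoding explicitly on both sides is the safer habit.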

Processing large files

For very large files, reading their contents at once can consume a lot of memory. Using iterators to read line by line can reduce memory usage:

with open('large_file.txt', 'r') as file:
    for line in file:
        process(line)  # Process each line
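
Line-by-line iteration works well for text files. For binary files, or files with very long lines, reading fixed-size chunks is a common alternative. The sketch below assumes a hypothetical generator named read_in_chunks and an arbitrary 64 KiB chunk size.

```python
import os
import tempfile

def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive chunks of at most chunk_size bytes (hypothetical helper)."""
    with open(path, 'rb') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            yield chunk


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'large_file.bin')
    with open(path, 'wb') as file:
        file.write(b'x' * 150_000)

    chunk_sizes = [len(chunk) for chunk in read_in_chunks(path)]
    print(chunk_sizes)
```

Only one chunk is held in memory at a time, regardless of the file's total size.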

Temporary files

Sometimes you need temporary files to store data that is no longer needed once the program finishes. The tempfile module provides functions for creating temporary files and directories:

import tempfile

# Create a temporary file
with tempfile.TemporaryFile('w+t') as temp_file:
    temp_file.write('Hello, World!')
    temp_file.seek(0)  # Go back to the beginning of the file
    print(temp_file.read())
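
Temporary directories work in a similar way. As a small sketch, tempfile.TemporaryDirectory creates a directory that is deleted, together with its contents, when the with block ends. The file name scratch.txt is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # 'scratch.txt' is a hypothetical file name.
    scratch_path = os.path.join(tmp_dir, 'scratch.txt')
    with open(scratch_path, 'w') as file:
        file.write('intermediate data')
    existed_inside = os.path.exists(scratch_path)

# After the with block, the directory and everything in it are gone.
exists_after = os.path.exists(tmp_dir)
print(existed_inside, exists_after)
```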

File permissions

On Linux and UNIX systems, file permissions are crucial to file security. With the os module, you can check and modify file permissions:

import os

# Modify file permissions to read-only ('example.txt' is a placeholder file name)
os.chmod('example.txt', 0o444)
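
Checking permissions can be sketched with os.stat together with the standard stat module. The example below is an illustration on a temporary file (report.txt is a hypothetical name). The symbolic form shown in the comment is what a Linux system would typically print.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'report.txt')  # hypothetical file name
    with open(path, 'w') as file:
        file.write('data')

    os.chmod(path, 0o444)  # read-only for owner, group, and others

    mode = os.stat(path).st_mode
    permission_bits = stat.S_IMODE(mode)  # numeric permission bits only
    readable_form = stat.filemode(mode)   # for example '-r--r--r--' on Linux

    print(oct(permission_bits), readable_form)

    # Restore write permission so the temporary directory can be cleaned up
    # on every platform.
    os.chmod(path, 0o644)
```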

Comprehensive example – a simple file search engine

This example is a file search engine that lets the user specify a root directory and a file name (or part of a file name), and then searches for files matching that name in that directory and all its subdirectories.

import os
import time

def find_files(directory, filename):
    matches = []
    # Traverse the root directory
    for root, dirnames, filenames in os.walk(directory):
        for name in filenames:
            # Check whether the file name contains the search keyword
            if filename.lower() in name.lower():
                matches.append(os.path.join(root, name))
    return matches

# User input
root_directory = input("Please enter the root directory to search: ")
file_to_find = input("Please enter the file name to search for (partial matches are supported): ")

# Record the start time
start_time = time.time()

# Search for files
found_files = find_files(root_directory, file_to_find)

# Record the end time
end_time = time.time()

# Output the results
print(f"{len(found_files)} files were found:")
for file in found_files:
    print(file)

# Output the elapsed time
print(f"Search time: {end_time - start_time:.2f} seconds")

This script uses the os.walk() function, which traverses a directory and all of its subdirectories. The script adds the full path of every matching file to a list and prints those paths after the search completes.

The user is first prompted to enter the root directory and the file name to search for. The script then calls the find_files function to perform the search. The output shows the number of files found and their paths.

Note that matching is case-insensitive, because the script uses the .lower() method to convert both the search term and each file name to lowercase before comparing them.

$ python3
Please enter the root directory to search: /DB6/project
Please enter the file name to search for (partial matches are supported):
531 files were found:
/DB6/project/blog/BlogSSR/node_modules/@kangc/v-md-editor/src/components/scrollbar/
......
Search time: 46.71 seconds
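
To check the case-insensitive matching without searching a real project, the sketch below runs the same substring test on a tiny directory tree built in a temporary directory. It assumes the traversal is done with os.walk. The tree layout and file names are invented for the demonstration.

```python
import os
import tempfile

def find_files(directory, filename):
    """Return paths of files whose names contain filename, ignoring case."""
    matches = []
    for root, dirnames, filenames in os.walk(directory):
        for name in filenames:
            if filename.lower() in name.lower():
                matches.append(os.path.join(root, name))
    return matches


with tempfile.TemporaryDirectory() as root_directory:
    # Build a tiny, hypothetical tree: two matching files and one that does not match.
    os.makedirs(os.path.join(root_directory, 'docs', 'old'))
    for relative in ('README.md',
                     os.path.join('docs', 'readme.txt'),
                     os.path.join('docs', 'old', 'notes.txt')):
        with open(os.path.join(root_directory, relative), 'w') as file:
            file.write('sample')

    found = find_files(root_directory, 'ReadMe')
    found_names = sorted(os.path.basename(path) for path in found)
    print(found_names)
```

Both README.md and readme.txt are reported for the search term ReadMe, while notes.txt is not.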

Advanced

File modes in detail

When you call the open function, you can open the file in different modes, which determine whether the file can be read or written and how it behaves.

# Write mode: if the file exists, its original content is overwritten
# ('example.txt' and 'example.bin' are placeholder file names)
with open('example.txt', 'w') as file:
    file.write('Hello, Python!')

# Append mode: the written content is added to the end of the file
with open('example.txt', 'a') as file:
    file.write('\nAppend text.')

# Binary write mode
with open('example.bin', 'wb') as file:
    file.write(b'\x00\xFF')
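
Two further modes are often useful: 'x' (exclusive creation) and 'r+' (read and write on an existing file). The following sketch exercises them on a temporary file. The name modes.txt is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes.txt')  # hypothetical file name

    # 'x' creates a new file and fails if the file already exists.
    with open(path, 'x') as file:
        file.write('first line\n')

    try:
        open(path, 'x')
        created_twice = True
    except FileExistsError:
        created_twice = False

    # 'a' appends to the end instead of overwriting.
    with open(path, 'a') as file:
        file.write('second line\n')

    # 'r+' opens an existing file for both reading and writing.
    with open(path, 'r+') as file:
        lines = file.read().splitlines()

    print(created_twice, lines)
```

Exclusive creation is a simple way to avoid overwriting a file that another run of the program has already produced.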

Buffering

Buffering is an important concept in file operations, which affects the timing of data being written to files. Python allows you to control the buffering behavior of files.

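# Open the file in unbuffered mode; buffering=0 is only allowed in binary mode
# ('example.txt' is a placeholder file name)
with open('example.txt', 'rb', buffering=0) as file:
    print(file.read())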
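
The effect of buffering on write timing can be observed directly. In this sketch (the file name buffered.txt is hypothetical), a short string written with default buffering normally does not reach the file until flush() is called or the file is closed:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'buffered.txt')  # hypothetical file name

    with open(path, 'w') as file:  # default buffering
        file.write('Hello, Python!')
        # The short string is still held in Python's buffer at this point,
        # so the file on disk is normally still empty.
        size_before_flush = os.path.getsize(path)

        file.flush()  # push the buffered data to the operating system
        size_after_flush = os.path.getsize(path)

    print(size_before_flush, size_after_flush)
```

If another process must see the data promptly, calling flush() explicitly or choosing a smaller buffer is the usual approach.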

File locking

In a multi-threaded or multi-process environment, file locks can be used to avoid data conflicts. The example uses the third-party portalocker package.

import portalocker

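# 'example.txt' is a placeholder file name
with open('example.txt', 'a') as file:
    portalocker.lock(file, portalocker.LOCK_EX)  # acquire an exclusive lock
    file.write('Locked file.\n')
    portalocker.unlock(file)  # release the lock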

Advanced file search techniques

Combining os.walk with regular expressions lets you implement complex file search logic.

import os
import re

def search_files(directory, pattern):
    regex = re.compile(pattern)
    for root, _, files in os.walk(directory):
        for name in files:
            if regex.search(name):
                print(os.path.join(root, name))

search_files('.', 'example.*')
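
Note that a pattern such as 'example.*' means something different as a regular expression than as a shell wildcard. The sketch below compares the two on a made-up list of names. It uses fnmatch.translate from the standard library to turn a wildcard into an equivalent regular expression.

```python
import fnmatch
import re

names = ['example.txt', 'example.py', 'my_example.txt', 'examples']

# As a regular expression used with search(), 'example.*' matches anywhere
# in the name, and '.' stands for any character.
regex = re.compile('example.*')
regex_hits = [name for name in names if regex.search(name)]

# fnmatch.translate() turns a shell-style wildcard into an anchored regex,
# so 'example.*' only matches names that start with 'example.'.
glob_regex = re.compile(fnmatch.translate('example.*'))
glob_hits = [name for name in names if glob_regex.match(name)]

print(regex_hits)
print(glob_hits)
```

The regular expression matches all four names, while the wildcard version matches only example.txt and example.py.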

File system monitoring

The watchdog library can monitor changes in the file system, which is very useful for applications that need to respond in real time to file updates.

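from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

event_handler = LoggingEventHandler()
observer = Observer()
# Watch the current directory and all of its subdirectories
observer.schedule(event_handler, path='.', recursive=True)
observer.start()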

Cross-platform file path processing

The pathlib module provides an object-oriented way to work with file paths.

from pathlib import Path

p = Path('example.txt')  # placeholder file name
with p.open('r') as file:
    print(file.read())
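
The cross-platform aspect is easiest to see with pathlib's pure path classes, which only manipulate path strings and never touch the file system. The paths in this sketch are made up for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths only manipulate path strings; they never touch the file system,
# so both flavours can be used on any operating system.
posix_path = PurePosixPath('/home/user') / 'docs' / 'report.txt'
windows_path = PureWindowsPath('C:/Users/user') / 'docs' / 'report.txt'

print(posix_path)                # /home/user/docs/report.txt
print(windows_path)              # C:\Users\user\docs\report.txt
print(posix_path.suffix)         # .txt
print(windows_path.parent.name)  # docs
```

The same / operator produces the separator that each platform expects, so the calling code does not need to hard-code either style.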

Performance considerations

The mmap module can improve the efficiency of processing large files through memory mapping.

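import mmap

# 'example.txt' is a placeholder file name; the file must not be empty
with open('example.txt', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
    print(mm.readline())  # print the first line
    mm.close()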
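
A memory-mapped file can also be searched and sliced like a bytes object. As a rough sketch, the code below builds a sample file in a temporary directory (data.bin and its contents are invented for the demonstration) and locates a marker with mm.find:

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.bin')  # hypothetical file name
    with open(path, 'wb') as f:
        f.write(b'header\n' + b'x' * 100_000 + b'NEEDLE' + b'y' * 100_000)

    with open(path, 'rb') as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = mm.find(b'NEEDLE', 0)  # search from the start of the mapping
            found = mm[offset:offset + 6]   # slicing copies only the requested bytes
            first_line = mm.readline()      # reads up to and including the newline

    print(first_line, offset, found)
```

The operating system pages the file in on demand, so the search does not require reading the whole file into a Python bytes object first.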

Security

When dealing with file paths, especially those from users, special care is required to avoid security vulnerabilities.

from pathlib import Path

def safe_open(file_path, root_directory):
    root = Path(root_directory).resolve()
    absolute_path = (root / file_path).resolve()
    if root not in absolute_path.parents:
        raise ValueError("No access to files outside the root directory")
    return open(absolute_path, 'r')

user_path = '../'
try:
    with safe_open(user_path, '.') as file:
        print(file.read())
except ValueError as e:
    print(e)
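
On Python 3.9 and later, the containment check can also be written with Path.is_relative_to. The sketch below uses a hypothetical helper named is_inside and a few invented paths. Unlike the check above, it treats the root directory itself as being inside the root.

```python
from pathlib import Path
import tempfile

def is_inside(root_directory, user_path):
    """Return True if user_path, resolved against root_directory, stays inside it.

    is_inside is a hypothetical helper; Path.is_relative_to needs Python 3.9+.
    """
    root = Path(root_directory).resolve()
    candidate = (root / user_path).resolve()
    return candidate.is_relative_to(root)


with tempfile.TemporaryDirectory() as root_directory:
    checks = {
        'notes/todo.txt': is_inside(root_directory, 'notes/todo.txt'),
        '../secret.txt': is_inside(root_directory, '../secret.txt'),
        'notes/../../secret.txt': is_inside(root_directory, 'notes/../../secret.txt'),
        '/etc/passwd': is_inside(root_directory, '/etc/passwd'),
    }
    print(checks)
```

Resolving the path before checking it is what defeats the '..' segments. An absolute user path replaces the root entirely when joined, which is why it is rejected as well.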

Comprehensive example - a further optimized file search engine

import os
import re
import time
from concurrent.futures import ThreadPoolExecutor

def search_files(directory, pattern):
    """
    Search the specified directory for files whose names match a regular expression.
    """
    matches = []
    regex = re.compile(pattern)
    for root, dirnames, filenames in os.walk(directory):
        for name in filenames:
            if regex.search(name):
                matches.append(os.path.join(root, name))
    return matches

def search_directory(directory, pattern):
    """
    Search a single directory.
    """
    try:
        return search_files(directory, pattern)
    except PermissionError:
        return []  # Ignore permission errors

def main(root_directory, pattern):
    """
    Main function: search the directories in parallel and collect the results.
    """
    start_time = time.time()
    matches = []

    # Use ThreadPoolExecutor to search in parallel
    with ThreadPoolExecutor() as executor:
        futures = []
        for root, dirs, files in os.walk(root_directory):
            for dirname in dirs:
                future = executor.submit(search_directory, os.path.join(root, dirname), pattern)
                futures.append(future)

        # Wait for all threads to complete and collect the results
        for future in futures:
            matches.extend(future.result())

    end_time = time.time()

    # Print search results
    print(f"{len(matches)} files were found:")
    # for match in matches:
    #     print(match)

    print(f"Search time: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print("Usage: python search_engine.py [root directory] [search pattern]")
    else:
        main(sys.argv[1], sys.argv[2])

os: Used to interact with the operating system, including traversing the directory tree.
re: Used for regular expression matching to search for file names by pattern.
time: Used to measure the start and end times of the search operation to calculate the total time.
concurrent.futures.ThreadPoolExecutor: Used to parallelize search tasks to improve search efficiency.

search_files function

This function takes two parameters, directory (the directory path to search) and pattern (a regular expression pattern), and returns a list of the full paths of all files whose names match the pattern.

First, create an empty list, matches, to store the matching file paths.
Compile the regular expression pattern with re.compile(pattern) for use in the search.
Use os.walk(directory) to iterate over the specified directory and all its subdirectories. For each directory, os.walk returns a triple (root, dirnames, filenames), in which root is the path of the current directory, dirnames is a list of the names of its subdirectories, and filenames is a list of the names of the files in it.
In each directory, iterate over all file names and use the compiled pattern's .search(name) method to check whether the file name matches. If it does, build the full path with os.path.join(root, name) and append it to the matches list.
Finally, return the matches list, which contains the paths of all matching files.

search_directory function

This function wraps search_files to search a single directory and to handle a possible PermissionError.

It accepts the same parameters as search_files.
It tries to call search_files. If a PermissionError is raised (for example, because the process lacks permission to access a directory), the exception is caught and an empty list is returned, meaning that no matching files were found.

main function

This is the script's main function. It starts the parallel searches, collects the results, and prints the number of matching files and the elapsed time.

Record the search start time.
Create an empty list, matches, to store all matching file paths.
Create a thread pool with ThreadPoolExecutor to run the search tasks in parallel. The function walks the root directory and all its subdirectories with os.walk and submits one search_directory task to the pool for each subdirectory. Note that search_files itself descends into subdirectories, so a file nested several levels deep is found by more than one task and counted more than once, while files located directly in the root directory are not searched.
Submit each task with executor.submit and append the returned Future object to the futures list.
Call future.result() to wait for each task to finish, and extend the matches list with the file paths that task found.
Record the search end time and calculate the total elapsed time.
Print the total number of matching files and the search time. The commented-out lines can be uncommented to print the path of each matching file.
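
One possible way to avoid walking nested directories more than once is to submit a single task per top-level subdirectory and to check the files in the root directory separately. The sketch below is a hypothetical variant, not the script above: the names search_tree and parallel_search, and the sample tree, are assumptions made for illustration.

```python
import os
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor

def search_tree(directory, regex):
    """Walk one directory tree and return paths whose file names match regex."""
    matches = []
    for root, _, filenames in os.walk(directory):
        for name in filenames:
            if regex.search(name):
                matches.append(os.path.join(root, name))
    return matches

def parallel_search(root_directory, pattern):
    """Search each top-level subdirectory in its own task (hypothetical variant)."""
    regex = re.compile(pattern)
    matches = []
    subdirectories = []
    with os.scandir(root_directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirectories.append(entry.path)
            elif entry.is_file() and regex.search(entry.name):
                matches.append(entry.path)  # files directly in the root

    with ThreadPoolExecutor() as executor:
        # Each subtree is walked exactly once, so no file is reported twice.
        for result in executor.map(lambda d: search_tree(d, regex), subdirectories):
            matches.extend(result)
    return matches


with tempfile.TemporaryDirectory() as root_directory:
    os.makedirs(os.path.join(root_directory, 'a', 'b'))
    for relative in ('index.html',
                     os.path.join('a', 'index.js'),
                     os.path.join('a', 'b', 'index.css'),
                     os.path.join('a', 'other.txt')):
        with open(os.path.join(root_directory, relative), 'w') as file:
            file.write('sample')

    results = parallel_search(root_directory, r'index.*')
    result_names = sorted(os.path.basename(path) for path in results)
    print(result_names)
```

With this layout every matching file appears exactly once in the result, including index.html in the root directory. Whether threads actually speed up the search depends on the storage and the operating system, so the timing should be measured rather than assumed.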

Script entry

Check the number of command-line arguments. If it is not equal to 3 (script name, root directory, and search pattern), print the usage message.
If the number of arguments is correct, call the main function and pass it the root directory and the search pattern.

Run it to see the effect

$ python3 /DB6/project index.*
1409008 files were found:
Search time: 147.67 seconds
