Common methods for operating PDF files in Python

1. Install the required libraries

First, install the required libraries with the following commands:

pip install PyPDF2
pip install pdfminer.six
pip install reportlab
pip install PyMuPDF

2. Use PyPDF2 to work with PDF files

PyPDF2 is a popular library that supports merging, splitting, encrypting, decrypting, and rotating PDF files. The examples below use the PyPDF2 3.x API (PdfReader, PdfWriter, PdfMerger).

2.1 Merge multiple PDF files

import PyPDF2

# Create a PDF merger object
pdf_merger = PyPDF2.PdfMerger()

# List of PDF files to merge (placeholder names; replace with your own files)
pdf_files = ['file1.pdf', 'file2.pdf', 'file3.pdf']

# Append each file to the merger in order
for pdf in pdf_files:
    pdf_merger.append(pdf)

# Write the merged PDF file
pdf_merger.write('merged_output.pdf')
pdf_merger.close()

print("PDF file merge is complete!")
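When the files to merge all sit in one folder, the list can be built instead of typed by hand. The following is a minimal sketch under a few assumptions: the helper names natural_key, collect_pdfs, and merge_folder are illustrative and not part of PyPDF2, and the merge step assumes PyPDF2 3.x is installed. File names are sorted so that part2.pdf comes before part10.pdf.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that compares digit runs numerically (part2 < part10)."""
    # re.split with a capture group alternates text and digit runs,
    # so odd positions always hold the digit runs.
    parts = re.split(r'(\d+)', Path(path).name.lower())
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def collect_pdfs(folder):
    """Return the .pdf files directly inside `folder`, in natural order."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == '.pdf'),
        key=natural_key,
    )


def merge_folder(folder, output_path):
    """Merge every PDF in `folder` into `output_path` (assumes PyPDF2 3.x)."""
    import PyPDF2  # imported here so the helpers above work without it

    merger = PyPDF2.PdfMerger()
    try:
        for pdf in collect_pdfs(folder):
            merger.append(str(pdf))
        merger.write(str(output_path))
    finally:
        merger.close()
```

A call such as merge_folder('reports', 'merged_output.pdf') would merge everything in a hypothetical reports folder. Write the output outside that folder, otherwise a second run would pick up the previous result as an input.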

2.2 Split PDF files

import PyPDF2

# Open the PDF file (placeholder name)
with open('sample.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # Get the number of pages in the PDF file
    total_pages = len(reader.pages)

    # Split into one PDF file per page
    for page_num in range(total_pages):
        # Use a fresh writer for each page so every output file holds exactly one page
        writer = PyPDF2.PdfWriter()
        writer.add_page(reader.pages[page_num])

        # Write the page to a new PDF file
        with open(f'page_{page_num + 1}.pdf', 'wb') as output_file:
            writer.write(output_file)

print("PDF file split is complete!")
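Splitting does not have to be one page per file. As a hedged variation, the sketch below writes files of up to chunk_size pages each. The names chunk_ranges and split_into_chunks and the chunk_N.pdf naming scheme are assumptions made for illustration; the page handling assumes PyPDF2 3.x.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) page index pairs; stop is exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


def split_into_chunks(input_path, chunk_size, prefix='chunk'):
    """Write prefix_1.pdf, prefix_2.pdf, ... and return the file names."""
    import PyPDF2  # imported here so chunk_ranges works without it

    reader = PyPDF2.PdfReader(input_path)
    outputs = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges, start=1):
        # One writer per chunk, so pages never leak into the next file
        writer = PyPDF2.PdfWriter()
        for page_num in range(start, stop):
            writer.add_page(reader.pages[page_num])
        output_name = f'{prefix}_{index}.pdf'
        with open(output_name, 'wb') as output_file:
            writer.write(output_file)
        outputs.append(output_name)
    return outputs
```

For a five-page document and chunk_size=2, chunk_ranges yields the index pairs (0, 2), (2, 4), and (4, 5), so the last file holds the single remaining page.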

2.3 Extract text from PDF file

import PyPDF2

# Open the PDF file (placeholder name)
with open('sample.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    text = ""

    # Extract text from all pages
    for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]
        text += page.extract_text()

print("PDF file content:")
print(text)

3. Use pdfminer to extract PDF text

pdfminer (installed as pdfminer.six) is a library that focuses on extracting text from PDFs. Compared with PyPDF2, it is better suited to complex text extraction. It supports extracting text and metadata from PDFs.

3.1 Extract text from PDF files

from pdfminer.high_level import extract_text

# Extract text from a PDF file (placeholder name)
text = extract_text('sample.pdf')

print("Extracted text content:")
print(text)
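extract_text also accepts a page_numbers argument with zero-based page indexes, which is useful when only part of a document is needed. The sketch below is one possible wrapper: to_zero_based and extract_pages_text are hypothetical helper names, and the 1-based input is an assumption chosen to match the page numbers a PDF viewer displays.

```python
def to_zero_based(pages):
    """Convert 1-based page numbers to sorted, de-duplicated 0-based indexes."""
    pages = list(pages)
    if not pages:
        raise ValueError("at least one page number is required")
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted({p - 1 for p in pages})


def extract_pages_text(pdf_path, pages):
    """Extract text from the given 1-based pages only (assumes pdfminer.six)."""
    from pdfminer.high_level import extract_text  # imported on first use

    return extract_text(pdf_path, page_numbers=to_zero_based(pages))
```

The empty-list check is deliberate: pdfminer treats an empty page selection as "all pages", which is rarely what a caller asking for specific pages expects.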

4. Create PDF files using reportlab

reportlab is a powerful library used mainly to generate PDF files. It provides a rich API for designing and generating PDFs.

4.1 Create a simple PDF file

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

# Create a PDF file and draw text
def create_pdf(output_filename):
    c = canvas.Canvas(output_filename, pagesize=letter)
    c.drawString(100, 750, "Hello, this is a simple PDF created with ReportLab!")
    c.save()

# Call the function to generate a PDF file (placeholder output name)
create_pdf("simple.pdf")
print("PDF file creation is complete!")
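drawString places a single line at a fixed position, so longer text needs the y coordinate tracked by hand. The following sketch shows one way to do that, assuming a letter-size page, a 1-inch bottom margin (72 points), and 14-point line spacing. layout_lines and create_multiline_pdf are illustrative names, not reportlab functions.

```python
def layout_lines(line_count, top=750, bottom=72, leading=14):
    """Return a (page_index, y) position for each line, top to bottom."""
    positions = []
    page, y = 0, top
    for _ in range(line_count):
        if y < bottom:  # no room left: continue at the top of a new page
            page += 1
            y = top
        positions.append((page, y))
        y -= leading
    return positions


def create_multiline_pdf(output_filename, lines, left=100):
    """Draw each string in `lines` on its own line, adding pages as needed."""
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(output_filename, pagesize=letter)
    current_page = 0
    for text, (page, y) in zip(lines, layout_lines(len(lines))):
        if page != current_page:
            c.showPage()  # finish the current page and start the next one
            current_page = page
        c.drawString(left, y, text)
    c.save()
```

Keeping the position calculation separate from the drawing makes the page-break logic easy to check without generating a file. This sketch does not wrap lines that are too wide for the page.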

4.2 Add an image to a PDF

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf_with_image(output_filename):
    c = canvas.Canvas(output_filename, pagesize=letter)
    c.drawString(100, 750, "Here is an image below:")

    # Add an image (placeholder file name) at the given position and size
    c.drawImage("image.jpg", 100, 500, width=200, height=150)
    c.save()

create_pdf_with_image("pdf_with_image.pdf")
print("The PDF file (with image) is created!")

5. Use PyMuPDF (fitz) to extract text

PyMuPDF is a fast, full-featured library that handles PDF, XPS, EPUB, and other file formats. You can use it to extract text, images, and other content.

5.1 Extract text from PDF files

import fitz  # PyMuPDF

# Open the PDF file (placeholder name)
doc = fitz.open('sample.pdf')

# Extract text from all pages
text = ""
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    text += page.get_text()

doc.close()

print("Contents of PDF files:")
print(text)
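Since PyMuPDF can also extract images, a short sketch of that follows. It relies on Page.get_images and Document.extract_image; the function names, the output naming scheme, and the output folder are assumptions made for illustration.

```python
from pathlib import Path


def image_filename(page_number, image_index, ext):
    """Build a file name such as page1_img2.png (numbers are 1-based)."""
    return f'page{page_number}_img{image_index}.{ext}'


def extract_images(pdf_path, output_dir='.'):
    """Save every embedded image to `output_dir` and return the saved paths."""
    import fitz  # PyMuPDF, imported here so image_filename works without it

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    doc = fitz.open(pdf_path)
    try:
        for page_index in range(len(doc)):
            page = doc.load_page(page_index)
            images = page.get_images(full=True)
            for image_index, info in enumerate(images, start=1):
                xref = info[0]  # first item of each entry is the image's xref
                image = doc.extract_image(xref)  # dict with 'image' bytes and 'ext'
                name = image_filename(page_index + 1, image_index, image['ext'])
                target = out / name
                target.write_bytes(image['image'])
                saved.append(target)
    finally:
        doc.close()
    return saved
```

An image that appears on several pages is saved once per page in this sketch; tracking the xref values already seen would avoid the duplicates.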

6. Encrypt and decrypt PDF files

6.1 Encrypt PDF using PyPDF2

import PyPDF2

# Open the PDF file (placeholder name)
with open('sample.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    writer = PyPDF2.PdfWriter()

    # Add all pages in the PDF to the writer object
    for page in reader.pages:
        writer.add_page(page)

    # Set the password
    password = "your_password"
    writer.encrypt(password)

    # Write the encrypted file
    with open('encrypted_sample.pdf', 'wb') as encrypted_file:
        writer.write(encrypted_file)

print("PDF file encryption is complete!")

6.2 Decrypt PDF using PyPDF2

import PyPDF2

# Open the encrypted PDF file
with open('encrypted_sample.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # Decrypt the PDF file
    password = "your_password"
    if reader.is_encrypted:
        reader.decrypt(password)

    # Create a PDF writer object
    writer = PyPDF2.PdfWriter()

    # Add the decrypted pages to the writer
    for page in reader.pages:
        writer.add_page(page)

    # Output the decrypted PDF file
    with open('decrypted_sample.pdf', 'wb') as decrypted_file:
        writer.write(decrypted_file)

print("PDF file decryption is complete!")
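The decryption example does not check whether the password was actually accepted. A minimal sketch of such a check follows, assuming that PdfReader.decrypt returns 0 (or an enum value equal to 0) when the password is wrong, as PyPDF2 3.x does. password_accepted and open_decrypted are hypothetical helper names.

```python
def password_accepted(result):
    """True if the value returned by PdfReader.decrypt() signals success."""
    # Assumption: 0 means "not decrypted"; 1 and 2 mean the user or
    # owner password matched.
    return int(result) != 0


def open_decrypted(path, password):
    """Return a PdfReader for `path`, raising if the password is rejected."""
    import PyPDF2  # imported here so password_accepted works without it

    reader = PyPDF2.PdfReader(path)
    if reader.is_encrypted and not password_accepted(reader.decrypt(password)):
        raise ValueError(f"wrong password for {path}")
    return reader
```

Failing early with a clear error is easier to diagnose than the exception that otherwise appears later, when the pages of a still-encrypted file are first read.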

Summary

Processing PDF files is a common task in Python, and different libraries cover different PDF operations:

PyPDF2: Merges, splits, encrypts, and decrypts PDF files, and extracts text.
pdfminer: Focuses on extracting text from PDFs; suitable when complex text parsing is required.
reportlab: Generates PDF files, with support for drawing and for adding text and images.
PyMuPDF (fitz): Efficiently extracts text, images, and other content from PDF files.
