1. Install the required libraries
First, install the required libraries. Run each of the following commands:
pip install PyPDF2
pip install pdfminer.six
pip install reportlab
The PyMuPDF example in section 5 additionally needs: pip install PyMuPDF
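Before running the examples, it can help to confirm that the packages are importable. The following is a minimal sketch that uses only the standard library. The helper name `missing_modules` is made up for illustration, and the list of names assumes the import names used in this article (`fitz` is the import name of PyMuPDF).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names used by the examples in this article.
    required = ["PyPDF2", "pdfminer", "reportlab", "fitz"]
    missing = missing_modules(required)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All libraries are available.")
```

If anything is reported as missing, install it with pip and run the check again.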
2. Use PyPDF2 to work with PDF files
PyPDF2 is a popular library that supports merging, splitting, encrypting, decrypting, and rotating PDF files.
2.1 Merge multiple PDF files
import PyPDF2

# Create a PDF merger object
pdf_merger = PyPDF2.PdfMerger()

# List of PDF files to merge (fill in your own file paths)
pdf_files = ['', '', '']

# Append each file to the merger, in list order
for pdf in pdf_files:
    pdf_merger.append(pdf)

# Write the merged PDF file and release the merger's resources
pdf_merger.write('merged_output.pdf')
pdf_merger.close()
print("PDF file merge is complete!")
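Instead of listing files by hand, the merge step can take its input from a folder. This is only a sketch under assumptions: `collect_pdfs` and `merge_folder` are hypothetical helper names, and files are merged in file-name order, which may not be the order you want.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by file name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def merge_folder(folder, output_path):
    """Merge every PDF found in `folder` into `output_path` (needs PyPDF2)."""
    import PyPDF2  # imported here so collect_pdfs works without PyPDF2

    merger = PyPDF2.PdfMerger()
    try:
        for pdf_path in collect_pdfs(folder):
            merger.append(str(pdf_path))
        merger.write(str(output_path))
    finally:
        # Close the merger even if one of the inputs cannot be read.
        merger.close()
```

Write the output outside the input folder; otherwise a second run would pick up the previous result as one of the inputs.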
2.2 Split PDF files
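import PyPDF2

# Open the PDF file (fill in the path to your PDF)
with open('', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    # Get the number of pages in the PDF file
    total_pages = len(reader.pages)
    # Split into one PDF file per page
    for page_num in range(total_pages):
        # Create a new writer for every page; reusing a single writer
        # would make each output file accumulate all earlier pages
        writer = PyPDF2.PdfWriter()
        writer.add_page(reader.pages[page_num])
        # Write the page to a new PDF file
        with open(f'page_{page_num + 1}.pdf', 'wb') as output_file:
            writer.write(output_file)
print("PDF file split is completed!")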
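The same approach extends to splitting into chunks of several pages rather than single pages. The sketch below is illustrative only: `chunk_ranges`, `split_in_chunks`, and the `part_N.pdf` naming scheme are assumptions, not part of PyPDF2.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) page-index pairs, stop exclusive, covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_in_chunks(input_path, chunk_size, prefix="part"):
    """Write part_1.pdf, part_2.pdf, ... with up to `chunk_size` pages each."""
    import PyPDF2  # imported here so chunk_ranges works without PyPDF2

    reader = PyPDF2.PdfReader(input_path)
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PyPDF2.PdfWriter()  # fresh writer per output file
        for page_num in range(start, stop):
            writer.add_page(reader.pages[page_num])
        with open(f"{prefix}_{index}.pdf", "wb") as output_file:
            writer.write(output_file)
```

For a five-page document and `chunk_size=2`, the ranges are `(0, 2)`, `(2, 4)`, and `(4, 5)`, so the last file holds the single remaining page.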
2.3 Extract text from PDF file
import PyPDF2

# Open the PDF file (fill in the path to your PDF)
with open('', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    text = ""
    # Extract text from all pages
    for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]
        text += page.extract_text()
print("PDF file content:")
print(text)
3. Use pdfminer to extract PDF text
pdfminer is a library that focuses on extracting text from PDFs. Compared with PyPDF2, it is better suited to complex text extraction. It supports extracting text and metadata from PDFs.
3.1 Extract text from PDF files
from pdfminer.high_level import extract_text

# Extract text from a PDF file (fill in the path to your PDF)
text = extract_text('')
print("Extracted text content:")
print(text)
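`extract_text` returns the whole document as one string. If per-page text is needed, one option is to split that string afterwards. The sketch below assumes that pdfminer.six ends each page's text with a form feed character (`\f`); verify this on your own files before relying on it. `split_pages` and `extract_pages` are hypothetical helper names.

```python
def split_pages(text):
    """Split text on form feeds and drop the empty piece after the last one."""
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [page.strip() for page in pages]


def extract_pages(pdf_path):
    """Return a list with one text string per page (needs pdfminer.six)."""
    from pdfminer.high_level import extract_text  # lazy import

    return split_pages(extract_text(pdf_path))
```

Blank pages are kept as empty strings, so list positions still line up with page order.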
4. Create PDF files using reportlab
reportlab is a powerful library used mainly to generate PDF files. It provides a rich API for designing and generating PDFs.
4.1 Create a simple PDF file
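from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

# Create a PDF file and draw text on it
def create_pdf(output_filename):
    c = canvas.Canvas(output_filename, pagesize=letter)
    # Coordinates are in points, measured from the bottom-left corner
    c.drawString(100, 750, "Hello, this is a simple PDF created with ReportLab!")
    c.save()

# Call the function to generate a PDF file (fill in the output file name)
create_pdf("")
print("PDF file creation is complete!")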
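The coordinates passed to `drawString` are in points (1/72 inch) measured from the bottom-left corner of the page, so `750` on a US Letter page (792 points tall) is near the top. Drawing several lines therefore means stepping the y value downward. The following is a rough sketch of that idea; `line_baselines`, `create_multiline_pdf`, and the margin and leading values are illustrative assumptions.

```python
LETTER_HEIGHT = 792  # height of a US Letter page in points (11 inches * 72)


def line_baselines(line_count, top_margin=42, leading=14, page_height=LETTER_HEIGHT):
    """Return one y coordinate per line, stepping down from the top margin."""
    first = page_height - top_margin
    return [first - i * leading for i in range(line_count)]


def create_multiline_pdf(output_filename, lines):
    """Draw each string in `lines` on its own line (needs reportlab)."""
    from reportlab.lib.pagesizes import letter  # lazy imports
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(output_filename, pagesize=letter)
    baselines = line_baselines(len(lines), page_height=letter[1])
    for text, y in zip(lines, baselines):
        c.drawString(100, y, text)
    c.save()
```

This sketch does not start a new page when the text runs past the bottom edge; long inputs would need a page break check.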
4.2 Add an image to a PDF
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf_with_image(output_filename):
    c = canvas.Canvas(output_filename, pagesize=letter)
    c.drawString(100, 750, "Here is an image below:")
    # Add an image (fill in the image path); x, y set the position of its
    # bottom-left corner, width and height set its size in points
    c.drawImage("", 100, 500, width=200, height=150)
    c.save()

create_pdf_with_image("pdf_with_image.pdf")
print("The PDF file (with image) is created!")
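A fixed `width=200, height=150` stretches any image that does not have a 4:3 aspect ratio. One way to avoid that is to compute a size that fits inside the box while keeping the proportions. This is a sketch under assumptions: `fit_within` and `draw_scaled_image` are hypothetical helpers, and reading the image size relies on reportlab's `ImageReader`, which generally needs Pillow installed for formats other than JPEG.

```python
def fit_within(image_width, image_height, max_width, max_height):
    """Scale (image_width, image_height) to fit the box, keeping proportions."""
    scale = min(max_width / image_width, max_height / image_height)
    return image_width * scale, image_height * scale


def draw_scaled_image(c, image_path, x, y, max_width=200, max_height=150):
    """Draw the image on canvas `c` without distorting it (needs reportlab)."""
    from reportlab.lib.utils import ImageReader  # lazy import

    image = ImageReader(image_path)
    pixel_width, pixel_height = image.getSize()
    width, height = fit_within(pixel_width, pixel_height, max_width, max_height)
    c.drawImage(image, x, y, width=width, height=height)
```

`drawImage` also accepts `preserveAspectRatio=True`, which achieves a similar result without the manual calculation.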
5. Use PyMuPDF (fitz) to extract text
PyMuPDF is a fast, feature-rich library that processes PDF, XPS, EPUB, and other file formats. You can use it to extract text, images, and other content.
5.1 Extract text from PDF files
import fitz  # PyMuPDF

# Open the PDF file (fill in the path to your PDF)
doc = fitz.open('')

# Extract text from all pages
text = ""
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    text += page.get_text()
print("Contents of PDF files:")
print(text)
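Once text is available page by page, a common follow-up is locating the pages that mention a term. The sketch below shows one possible way to do that; `pages_containing` and `find_keyword` are made-up names, and the match is a plain case-insensitive substring test rather than a real search index.

```python
def pages_containing(page_texts, keyword):
    """Return 1-based page numbers whose text contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]


def find_keyword(pdf_path, keyword):
    """Return the page numbers of `pdf_path` that mention `keyword` (needs PyMuPDF)."""
    import fitz  # lazy import so pages_containing works without PyMuPDF

    # The document is closed automatically when the with-block ends.
    with fitz.open(pdf_path) as doc:
        return pages_containing([page.get_text() for page in doc], keyword)
```

Scanned PDFs without a text layer yield empty page text, so nothing will match in that case.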
6. Encrypt and decrypt PDF files
6.1 Encrypt PDF using PyPDF2
import PyPDF2

# Open the PDF file (fill in the path to your PDF)
with open('', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    writer = PyPDF2.PdfWriter()
    # Add all pages in the PDF to the writer object
    for page in reader.pages:
        writer.add_page(page)
    # Set the password
    password = "your_password"
    writer.encrypt(password)
    # Write the encrypted file
    with open('encrypted_sample.pdf', 'wb') as encrypted_file:
        writer.write(encrypted_file)
print("PDF file encryption is complete!")
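Hard-coding the password in the script is fine for a demonstration, but in practice it is usually read from somewhere else. As one possible approach, the sketch below takes it from an environment variable. The variable name `PDF_PASSWORD` and the helpers `read_password` and `encrypt_pdf` are assumptions made for this example.

```python
import os


def read_password(env_var="PDF_PASSWORD"):
    """Return the password stored in `env_var`, or raise if it is unset or empty."""
    password = os.environ.get(env_var, "")
    if not password:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return password


def encrypt_pdf(input_path, output_path, env_var="PDF_PASSWORD"):
    """Copy `input_path` to `output_path`, protected by the password (needs PyPDF2)."""
    import PyPDF2  # lazy import so read_password works without PyPDF2

    reader = PyPDF2.PdfReader(input_path)
    writer = PyPDF2.PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.encrypt(read_password(env_var))
    with open(output_path, "wb") as encrypted_file:
        writer.write(encrypted_file)
```

Failing early when the variable is missing avoids silently producing a file protected by an empty password.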
6.2 Decrypt PDF using PyPDF2
import PyPDF2

# Open the encrypted PDF file
with open('encrypted_sample.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    # Decrypt the PDF file
    password = "your_password"
    if reader.is_encrypted:
        reader.decrypt(password)
    # Create a PDF writer object
    writer = PyPDF2.PdfWriter()
    # Add the decrypted pages to the writer
    for page in reader.pages:
        writer.add_page(page)
    # Output the decrypted PDF file
    with open('decrypted_sample.pdf', 'wb') as decrypted_file:
        writer.write(decrypted_file)
print("PDF file decryption is complete!")
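The decryption example does not check whether the password was accepted, so a wrong password only shows up later as an error when the pages are read. A small guard can make that failure explicit. This sketch assumes that `decrypt` returns a falsy value (0) when the password is rejected, which matches recent PyPDF2 releases as far as I know; `unlock` is a hypothetical helper name.

```python
def unlock(reader, password):
    """Decrypt `reader` in place; raise ValueError if the password is rejected."""
    if reader.is_encrypted and not reader.decrypt(password):
        raise ValueError("Wrong password: the PDF could not be decrypted")
    return reader


if __name__ == "__main__":
    import PyPDF2  # only needed for this usage example

    with open("encrypted_sample.pdf", "rb") as file:
        reader = unlock(PyPDF2.PdfReader(file), "your_password")
        print("Pages available after unlocking:", len(reader.pages))
```

Because the helper relies only on `is_encrypted` and `decrypt`, it works with any reader object that offers those two members.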
Summary
Processing PDF files is a common task in Python, and different libraries cover different operations:
- PyPDF2: Merges, splits, encrypts, and decrypts PDF files, and extracts text.
- pdfminer: Focuses on extracting text from PDFs; suitable for scenarios that require complex text parsing.
- reportlab: Generates PDF files, with support for drawing and for adding text, images, and more.
- PyMuPDF (fitz): Efficiently extracts text, images, and other content, and processes PDF files.