1. Environment setup
Install the required libraries
pip install python-docx PyMuPDF openpyxl beautifulsoup4 pillow requests
pip install pdfplumber  # alternative PDF parsing solution
# tkinter ships with the standard Python installers; no pip install is required
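Several of these packages are imported under a different name than the one passed to pip. The sketch below checks which of them are importable; the REQUIRED mapping and the missing_packages helper are illustrative names, not part of the original project.

```python
import importlib.util

# pip package name -> module name used in `import` statements
REQUIRED = {
    "python-docx": "docx",
    "PyMuPDF": "fitz",
    "openpyxl": "openpyxl",
    "beautifulsoup4": "bs4",
    "pillow": "PIL",
    "pdfplumber": "pdfplumber",
}

def missing_packages(required):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```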
Tool selection
Development environment: VSCode + Python extension
Debugging tool: Python IDLE (beginner friendly)
Packaging tool: PyInstaller (optional, used to build an .exe)
2. Project architecture design
image-extractor/
├── # Main program entry
├── core/
│   ├── docx_extractor.py
│   ├── pdf_extractor.py
│   ├── excel_extractor.py
│   └── html_extractor.py
└── outputs/ # Default output directory
3. Core function implementation
(1) Word document extraction (docx_extractor.py)
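import os
import zipfile

def extract_docx_images(file_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    # A .docx file is a ZIP archive, so open it with zipfile
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        # Embedded pictures are stored in the word/media/ folder
        image_files = [f for f in zip_ref.namelist()
                       if f.startswith('word/media/') and not f.endswith('/')]
        for img_file in image_files:
            # Extract the picture (this recreates word/media/ under output_dir)
            zip_ref.extract(img_file, output_dir)
            # Move the file to the top level of the output directory
            src = os.path.join(output_dir, img_file)
            dst = os.path.join(output_dir, os.path.basename(img_file))
            os.replace(src, dst)  # overwrites an existing file of the same name
    return len(image_files)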
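Before extracting anything, it can help to see which archive entries the word/media/ filter actually matches. The following is a minimal sketch; list_docx_media is a hypothetical helper name, and it accepts either a path or a file-like object because zipfile.ZipFile does.

```python
import zipfile

def list_docx_media(file_path):
    """Return the archive entries stored under word/media/ in a .docx file."""
    with zipfile.ZipFile(file_path) as zip_ref:
        return [name for name in zip_ref.namelist()
                if name.startswith("word/media/") and not name.endswith("/")]
```

Calling list_docx_media on a document with no pictures returns an empty list, which is a quick way to tell "no images" apart from "extraction failed".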
(2) PDF file extraction (pdf_extractor.py)
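import os
import fitz  # PyMuPDF

def extract_pdf_images(file_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(file_path)
    img_count = 0
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        images = page.get_images(full=True)
        for img_index, img in enumerate(images):
            xref = img[0]  # the first item of each entry is the image's xref
            base_image = doc.extract_image(xref)
            img_data = base_image["image"]
            # extract_image returns the bytes in their stored format,
            # so use the reported extension instead of forcing .png
            img_ext = base_image["ext"]
            img_path = os.path.join(output_dir, f"pdf_page{page_num}_img{img_index}.{img_ext}")
            with open(img_path, "wb") as f:
                f.write(img_data)
            img_count += 1
    doc.close()
    return img_count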
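A PDF can reference the same image on several pages (a logo in the header, for example), in which case looping over every page saves it repeatedly. One possible refinement, not part of the original code, is to collect each xref only once. The helper name unique_image_xrefs is an assumption; it only relies on len(doc), doc.load_page() and page.get_images(), so it takes an already opened document.

```python
def unique_image_xrefs(doc):
    """Collect each image xref once, in first-seen order, across all pages.

    `doc` is assumed to be an open PyMuPDF document, e.g. fitz.open(path).
    """
    seen = set()
    ordered = []
    for page_num in range(len(doc)):
        for img in doc.load_page(page_num).get_images(full=True):
            xref = img[0]  # first tuple item is the image's xref number
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered
```

Each returned xref can then be passed to doc.extract_image(xref) exactly once.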
(3) Excel file extraction (excel_extractor.py)
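import os
from openpyxl import load_workbook

def extract_excel_images(file_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    wb = load_workbook(file_path)
    img_count = 0
    for sheet in wb.worksheets:
        # sheet._images and image._data() are private openpyxl members
        # and may change between versions
        for image in sheet._images:
            # Get the picture data as bytes
            img = image._data()
            img_path = os.path.join(output_dir, f"excel_{sheet.title}_img{img_count}.png")
            with open(img_path, "wb") as f:
                f.write(img)
            img_count += 1
    return img_count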
(4) HTML file extraction (html_extractor.py)
import base64
import os
import requests
from bs4 import BeautifulSoup

def extract_html_images(html_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    if html_path.startswith('http'):
        response = requests.get(html_path)
        soup = BeautifulSoup(response.text, 'html.parser')
    else:
        with open(html_path, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
    img_tags = soup.find_all('img')
    img_count = 0
    for img in img_tags:
        src = img.get('src')
        # Only base64-encoded (data URI) pictures are handled here
        if src and src.startswith('data:image'):
            header, data = src.split(',', 1)
            img_format = header.split('/')[1].split(';')[0]
            img_data = base64.b64decode(data)
            img_path = os.path.join(output_dir, f"html_img{img_count}.{img_format}")
            with open(img_path, 'wb') as f:
                f.write(img_data)
            img_count += 1
    return img_count
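The data URI handling is the part of this extractor most likely to trip over unusual input, such as a missing src attribute or a subtype like svg+xml. As a sketch, that step can be isolated in a small helper that is easy to test on its own; parse_data_uri is a hypothetical name, and the helper deliberately ignores data URIs that are not base64-encoded.

```python
import base64

def parse_data_uri(src):
    """Split a base64 image data URI into (file_extension, raw_bytes).

    Returns None if src is not a base64-encoded image data URI.
    """
    if not src or not src.startswith("data:image/") or "," not in src:
        return None
    header, data = src.split(",", 1)
    if ";base64" not in header:
        return None  # percent-encoded payloads are out of scope for this sketch
    # header looks like "data:image/png;base64" or "data:image/svg+xml;base64"
    mime_subtype = header.split("/", 1)[1].split(";", 1)[0]
    extension = mime_subtype.split("+", 1)[0]  # "svg+xml" -> "svg"
    return extension, base64.b64decode(data)
```

For instance, parse_data_uri("data:image/png;base64,aGVsbG8=") yields the extension "png" together with the decoded bytes.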
4. Interactive interface development ()
import os
import tkinter as tk
from tkinter import filedialog, messagebox

from core import docx_extractor, pdf_extractor, excel_extractor, html_extractor

class ImageExtractorApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Multi-format picture extraction tool")
        # File path variables
        self.file_path = tk.StringVar()
        self.output_dir = tk.StringVar(value="outputs")
        # Create interface components
        self.create_widgets()

    def create_widgets(self):
        # File selection
        tk.Label(self.root, text="Select File:").grid(row=0, column=0, padx=5, pady=5)
        tk.Entry(self.root, textvariable=self.file_path, width=40).grid(row=0, column=1)
        tk.Button(self.root, text="Browse", command=self.select_file).grid(row=0, column=2)
        # Output directory
        tk.Label(self.root, text="Output Directory:").grid(row=1, column=0)
        tk.Entry(self.root, textvariable=self.output_dir, width=40).grid(row=1, column=1)
        tk.Button(self.root, text="Select Directory", command=self.select_output_dir).grid(row=1, column=2)
        # Execute button
        tk.Button(self.root, text="Start extraction", command=self.start_extraction).grid(row=2, column=1, pady=10)
        # Log area
        self.log_text = tk.Text(self.root, height=10, width=50)
        self.log_text.grid(row=3, column=0, columnspan=3)

    def select_file(self):
        file_types = [
            ('Supported file types', '*.docx *.pdf *.xlsx *.html'),
            ('Word documents', '*.docx'),
            ('PDF files', '*.pdf'),
            ('Excel files', '*.xlsx'),
            ('Web files', '*.html'),
        ]
        self.file_path.set(filedialog.askopenfilename(filetypes=file_types))

    def select_output_dir(self):
        self.output_dir.set(filedialog.askdirectory())

    def start_extraction(self):
        file_path = self.file_path.get()
        output_dir = self.output_dir.get()
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        # Choose the extractor from the file extension
        ext = os.path.splitext(file_path)[1].lower()
        try:
            if ext == '.docx':
                count = docx_extractor.extract_docx_images(file_path, output_dir)
            elif ext == '.pdf':
                count = pdf_extractor.extract_pdf_images(file_path, output_dir)
            elif ext == '.xlsx':
                count = excel_extractor.extract_excel_images(file_path, output_dir)
            elif ext == '.html':
                count = html_extractor.extract_html_images(file_path, output_dir)
            else:
                messagebox.showerror("Error", "Unsupported file type")
                return
            self.log_text.insert(tk.END, f"Successfully extracted {count} pictures to {output_dir}\n")
        except Exception as e:
            messagebox.showerror("Error", f"Extraction failed: {e}")

if __name__ == "__main__":
    root = tk.Tk()
    app = ImageExtractorApp(root)
    root.mainloop()
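The extension check in start_extraction grows by one branch for every new format. A possible alternative, shown here only as a sketch, is a lookup table from extension to extractor function. pick_extractor and DEMO_EXTRACTORS are illustrative names; the lambdas are stand-ins for the real functions in the core package.

```python
import os

def pick_extractor(file_path, extractors):
    """Return the extractor registered for file_path's extension, or None."""
    ext = os.path.splitext(file_path)[1].lower()
    return extractors.get(ext)

# Stand-in extractors for demonstration only; in the project these would be
# docx_extractor.extract_docx_images, pdf_extractor.extract_pdf_images, etc.
DEMO_EXTRACTORS = {
    ".docx": lambda path, out: 0,
    ".pdf": lambda path, out: 0,
}
```

With this table, supporting a new format means adding one dictionary entry, and a None result maps to the "Unsupported file type" message.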
5. Instructions for use
Operation steps
1. Run the main program
2. Click Browse to select a file (.docx, .pdf, .xlsx, and .html are supported)
3. Select the output directory (default: outputs)
4. Click Start extraction
5. Check the extraction results in the log area at the bottom
Example output
Successfully extracted 5 pictures to outputs/
Successfully extracted 3 pictures to outputs/
6. Frequently Asked Questions
Q1: Some Excel pictures cannot be extracted?
Reason: openpyxl can only extract embedded images; it cannot extract floating images
Solution: Use xlrd plus image coordinate recognition instead (requires more complicated processing)
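Another option, which is not from the original article, relies on the fact that an .xlsx file is a ZIP archive that keeps its picture files under xl/media/. The sketch below copies those files out directly. extract_xlsx_media is a hypothetical helper, it does not record which sheet each picture belongs to, and whether it recovers the images that openpyxl misses depends on how the workbook stores them.

```python
import os
import zipfile

def extract_xlsx_media(file_path, output_dir):
    """Copy every file under xl/media/ in an .xlsx archive into output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(file_path) as zip_ref:
        for name in zip_ref.namelist():
            if name.startswith("xl/media/") and not name.endswith("/"):
                target = os.path.join(output_dir, os.path.basename(name))
                # Read the bytes straight from the archive; no temp folders
                with open(target, "wb") as f:
                    f.write(zip_ref.read(name))
                saved.append(target)
    return saved
```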
Q2: Pictures extracted from a PDF are blurry?
Reason: The PDF embeds low-resolution pictures
Solution: Use the higher-precision extraction mode of pdfplumber
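To judge whether the embedded picture itself is the limiting factor, it helps to estimate the resolution at which it is displayed. PDF page coordinates are measured in points (72 per inch by default), so the effective DPI follows from the pixel width and the displayed width. The helper below is a small arithmetic sketch with a hypothetical name; with PyMuPDF the pixel width is available as base_image["width"], while the displayed width has to come from the page layout.

```python
def effective_dpi(pixel_width, display_width_pt):
    """Resolution at which an image is shown on the page.

    pixel_width: width of the embedded image in pixels.
    display_width_pt: width it occupies on the page, in points (1 pt = 1/72 inch).
    """
    inches = display_width_pt / 72.0
    return pixel_width / inches
```

A 300-pixel-wide image stretched across 216 points (3 inches) is displayed at 100 DPI, which looks soft on most screens and in print; no extraction mode can add detail that the file does not contain.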
Q3: The program becomes unresponsive?
Reason: Processing a large file takes time and blocks the main thread
Solution: Run the extraction in a separate thread instead (see the threading module)
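A minimal sketch of that approach, assuming a worker thread that reports back through a queue: run_in_background and poll are hypothetical helper names, and the code is kept free of tkinter so it can be tried on its own.

```python
import queue
import threading

def run_in_background(task, *args):
    """Run task(*args) in a worker thread.

    Returns a queue that will receive ("ok", result) or ("error", exception).
    """
    results = queue.Queue()

    def worker():
        try:
            results.put(("ok", task(*args)))
        except Exception as exc:  # report failures instead of dying silently
            results.put(("error", exc))

    threading.Thread(target=worker, daemon=True).start()
    return results

def poll(results):
    """Non-blocking check: return the (status, value) pair, or None if not ready."""
    try:
        return results.get_nowait()
    except queue.Empty:
        return None
```

In the GUI, start_extraction could call run_in_background with the chosen extractor and then schedule a periodic check with root.after(100, ...) that calls poll until a result arrives. Tkinter widgets should only be updated from the main thread, which is why the worker hands its result back through the queue rather than writing to the log area itself.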
7. Project expansion suggestions
Add batch processing: support importing an entire folder at once
Add image preview: show thumbnails in the interface
Support archives: decompress ZIP/RAR files directly and process their contents
Add format conversion: automatically convert special formats such as HEIC/WEBP
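As a starting point for the batch processing suggestion, the sketch below walks a folder and collects every file with a supported extension. The names SUPPORTED_EXTENSIONS and find_supported_files are illustrative; each returned path could then be handed to the matching extractor.

```python
import os

SUPPORTED_EXTENSIONS = {".docx", ".pdf", ".xlsx", ".html"}

def find_supported_files(folder):
    """Walk folder recursively and return sorted paths of supported files."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            # Compare extensions case-insensitively (.PDF and .pdf both match)
            if os.path.splitext(name)[1].lower() in SUPPORTED_EXTENSIONS:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```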
The above is a detailed explanation of how to extract pictures from common document formats with Python.