SoFunction
Updated on 2025-04-13

Extracting pictures from common document formats with Python: a detailed guide

1. Environment setup

Install the required libraries

pip install python-docx PyMuPDF openpyxl beautifulsoup4 pillow requests
pip install pdfplumber  # Alternative PDF parsing library
# tkinter ships with Python; no separate pip install is needed
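
Several of these packages are imported under a different name than the one passed to pip (for example, python-docx is imported as docx and PyMuPDF as fitz). As a quick check after installation, the sketch below reports which modules cannot be found. The helper name `missing_modules` and the `REQUIRED_MODULES` list are made up for this illustration.

```python
import importlib.util

# Import names differ from pip names for several packages:
# python-docx -> docx, PyMuPDF -> fitz, beautifulsoup4 -> bs4, pillow -> PIL
REQUIRED_MODULES = ["docx", "fitz", "openpyxl", "bs4", "PIL", "requests", "tkinter"]

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name that is reported as missing points to a package that still needs to be installed in the active environment.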

Tool selection

Development environment: VS Code with the Python extension

Debugging tool: Python IDLE (beginner-friendly)

Packaging tool: PyInstaller (optional, used to generate an .exe)

2. Project architecture design

image-extractor/
├──                 # Main program entry
├── core/
│   ├── docx_extractor.py
│   ├── pdf_extractor.py
│   ├── excel_extractor.py
│   └── html_extractor.py
└── outputs/             # Default output directory

3. Core function implementation

(1) Word document extraction (docx_extractor.py)

import zipfile
import os

def extract_docx_images(file_path, output_dir):
    # A .docx file is a ZIP archive, so open it directly
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        # Collect the pictures stored in the word/media/ folder
        image_files = [f for f in zip_ref.namelist()
                       if f.startswith('word/media/') and not f.endswith('/')]
        
        for img_file in image_files:
            # Extract the picture into the output directory
            zip_ref.extract(img_file, output_dir)
            # Move it out of the nested word/media/ path (overwrites an existing file)
            src = os.path.join(output_dir, img_file)
            dst = os.path.join(output_dir, os.path.basename(img_file))
            os.replace(src, dst)
            
    return len(image_files)
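
Because a .docx file is an ordinary ZIP archive, the listing step can be tried without Word installed. The sketch below builds a stand-in archive with the same word/media/ layout and lists its media entries. The file names and bytes are placeholders (not a valid Word document), and `list_docx_media` and `build_fake_docx` are hypothetical helper names.

```python
import os
import tempfile
import zipfile

def list_docx_media(docx_path):
    """Return the archive members stored under word/media/."""
    with zipfile.ZipFile(docx_path, "r") as zip_ref:
        return [
            name for name in zip_ref.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        ]

def build_fake_docx(path):
    """Write a ZIP with a docx-like layout; the contents are placeholder bytes."""
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/media/image1.png", b"placeholder-1")
        zf.writestr("word/media/image2.jpeg", b"placeholder-2")

with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "sample.docx")
    build_fake_docx(fake)
    media = list_docx_media(fake)
    print(media)
```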

(2) PDF file extraction (pdf_extractor.py)

import fitz  # PyMuPDF
import os

def extract_pdf_images(file_path, output_dir):
    doc = fitz.open(file_path)
    img_count = 0
    
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        images = page.get_images(full=True)
        
        for img_index, img in enumerate(images):
            xref = img[0]  # Cross-reference number of the image object
            base_image = doc.extract_image(xref)
            img_data = base_image["image"]
            
            # The bytes are written as-is (not re-encoded), so use the native extension
            img_path = os.path.join(output_dir, f"pdf_page{page_num}_img{img_index}.{base_image['ext']}")
            with open(img_path, "wb") as f:
                f.write(img_data)
            img_count += 1
    
    doc.close()
    return img_count
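
The bytes returned by doc.extract_image keep the image's original encoding (often JPEG), so the file extension should match the data rather than be assumed. PyMuPDF reports the format in the "ext" field of the returned dictionary. As a library-independent cross-check, the extension can also be guessed from the file signature. This is a minimal sketch: `guess_image_ext` is a hypothetical helper and only recognizes three common formats.

```python
def guess_image_ext(data, default="bin"):
    """Guess a file extension from the leading signature bytes of image data."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return default  # unknown signature: let the caller decide what to do

# Signature-only samples; these are not complete image files
png_ext = guess_image_ext(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
jpg_ext = guess_image_ext(b"\xff\xd8\xff\xe0" + b"\x00" * 8)
print(png_ext, jpg_ext)
```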

(3) Excel file extraction (excel_extractor.py)

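from openpyxl import load_workbook
import os

def extract_excel_images(file_path, output_dir):
    wb = load_workbook(file_path)
    img_count = 0
    
    for sheet in wb.worksheets:
        # _images and _data() are private openpyxl members and may change between versions
        for image in sheet._images:
            # Get the picture data as bytes
            img = image._data()
            # _data() keeps gif/jpeg/png as-is and converts other formats to PNG
            ext = image.format if image.format in ("gif", "jpeg", "png") else "png"
            img_path = os.path.join(output_dir, f"excel_{sheet.title}_img{img_count}.{ext}")
            with open(img_path, "wb") as f:
                f.write(img)
            img_count += 1
                
    return img_count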

(4) HTML file extraction (html_extractor.py)

import requests
from bs4 import BeautifulSoup
import os
import base64

def extract_html_images(html_path, output_dir):
    # Load the page from a URL or from a local file
    if html_path.startswith('http'):
        response = requests.get(html_path, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
    else:
        with open(html_path, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
    
    img_tags = soup.find_all('img')
    img_count = 0
    
    for img in img_tags:
        src = img.get('src')
        # Only inline base64 pictures are handled; linked image URLs are skipped
        if src and src.startswith('data:image'):
            # Split "data:image/<format>;base64" from the encoded payload
            header, data = src.split(',', 1)
            img_format = header.split('/')[1].split(';')[0]
            img_data = base64.b64decode(data)
            img_path = os.path.join(output_dir, f"html_img{img_count}.{img_format}")
            with open(img_path, 'wb') as f:
                f.write(img_data)
            img_count += 1
                
    return img_count
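
The base64 branch above boils down to splitting a data URI into its header and payload. The sketch below isolates that step so it can be tried on a plain string. `decode_data_uri` is a hypothetical helper, and the sample payload is arbitrary bytes rather than a real picture.

```python
import base64

def decode_data_uri(src):
    """Split 'data:image/<fmt>;base64,<payload>' into (fmt, raw_bytes)."""
    if not src.startswith("data:image/"):
        raise ValueError("not an image data URI")
    header, payload = src.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("only base64-encoded data URIs are handled")
    # header looks like "data:image/png;base64"; keep the part after the slash
    img_format = header.split("/")[1].split(";")[0]
    return img_format, base64.b64decode(payload)

sample = "data:image/png;base64," + base64.b64encode(b"fake image bytes").decode("ascii")
fmt, raw = decode_data_uri(sample)
print(fmt, len(raw))
```

Rejecting anything that is not base64-encoded keeps the helper from writing URL-encoded text (another legal data URI form) to disk as if it were binary image data.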

4. Interactive interface development (main program)

import tkinter as tk
from tkinter import filedialog, messagebox
from core import docx_extractor, pdf_extractor, excel_extractor, html_extractor
import os

class ImageExtractorApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Multi-format picture extraction tool")
        
        # Variables bound to the file path and output directory entries
        self.file_path = tk.StringVar()
        self.output_dir = tk.StringVar(value="outputs")
        
        # Create interface components
        self.create_widgets()
    
    def create_widgets(self):
        # File selection
        tk.Label(self.root, text="Select File:").grid(row=0, column=0, padx=5, pady=5)
        tk.Entry(self.root, textvariable=self.file_path, width=40).grid(row=0, column=1)
        tk.Button(self.root, text="Browse", command=self.select_file).grid(row=0, column=2)
        
        # Output directory
        tk.Label(self.root, text="Output Directory:").grid(row=1, column=0)
        tk.Entry(self.root, textvariable=self.output_dir, width=40).grid(row=1, column=1)
        tk.Button(self.root, text="Select Directory", command=self.select_output_dir).grid(row=1, column=2)
        
        # Execute button
        tk.Button(self.root, text="Start extraction", command=self.start_extraction).grid(row=2, column=1, pady=10)
        
        # Log area
        self.log_text = tk.Text(self.root, height=10, width=50)
        self.log_text.grid(row=3, column=0, columnspan=3)
    
    def select_file(self):
        file_types = [
            ('Supported file types', '*.docx *.pdf *.xlsx *.html'),
            ('Word Documents', '*.docx'),
            ('PDF file', '*.pdf'),
            ('Excel File', '*.xlsx'),
            ('Web File', '*.html')
        ]
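        self.file_path.set(filedialog.askopenfilename(filetypes=file_types))
    
    def select_output_dir(self):
        selected = filedialog.askdirectory()
        if selected:  # Keep the current value if the dialog is cancelled
            self.output_dir.set(selected)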
    
    def start_extraction(self):
        file_path = self.file_path.get()
        output_dir = self.output_dir.get()
        
        # Create the output directory if it does not exist yet
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
            
        # Choose the extractor from the file extension
        ext = os.path.splitext(file_path)[1].lower()
        
        try:
            if ext == '.docx':
                count = docx_extractor.extract_docx_images(file_path, output_dir)
            elif ext == '.pdf':
                count = pdf_extractor.extract_pdf_images(file_path, output_dir)
            elif ext == '.xlsx':
                count = excel_extractor.extract_excel_images(file_path, output_dir)
            elif ext == '.html':
                count = html_extractor.extract_html_images(file_path, output_dir)
            else:
                messagebox.showerror("Error", "Unsupported file type")
                return
                
            self.log_text.insert(tk.END, f"Successfully extracted {count} pictures to {output_dir}\n")
        except Exception as e:
            messagebox.showerror("Error", f"Extraction failed: {str(e)}")

if __name__ == "__main__":
    root = tk.Tk()
    app = ImageExtractorApp(root)
    root.mainloop()
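
The if/elif chain in start_extraction can also be written as a lookup table, which keeps the supported extensions in one place. The sketch below assumes each extractor takes (file_path, output_dir) and returns a count, as the functions above do. The stand-in extractors and the names `EXTRACTORS` and `extract_images` are invented for this illustration.

```python
import os

def _fake_extractor(label):
    """Build a stand-in extractor so the sketch runs without the core package."""
    def extractor(file_path, output_dir):
        print(f"[{label}] {file_path} -> {output_dir}")
        return 0  # a real extractor would return the number of pictures saved
    return extractor

# In the real project the values would be the functions from the core package
EXTRACTORS = {
    ".docx": _fake_extractor("docx"),
    ".pdf": _fake_extractor("pdf"),
    ".xlsx": _fake_extractor("xlsx"),
    ".html": _fake_extractor("html"),
}

def extract_images(file_path, output_dir):
    """Dispatch to the extractor registered for the file's extension."""
    ext = os.path.splitext(file_path)[1].lower()
    try:
        extractor = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}") from None
    return extractor(file_path, output_dir)

count = extract_images("report.PDF", "outputs")
```

With this layout, supporting a new format means adding one dictionary entry instead of another elif branch.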

5. Instructions for use

Steps

1. Run the main program

2. Click Browse to select a file (.docx, .pdf, .xlsx, and .html are supported)

3. Select the output directory (defaults to outputs)

4. Click Start extraction

5. Check the extraction result in the log area at the bottom

Example output

Successfully extracted 5 pictures to outputs/
Successfully extracted 3 pictures to outputs/

6. Frequently Asked Questions

Q1: Why can't some Excel pictures be extracted?

Reason: openpyxl can only extract embedded images; it cannot extract floating images

Solution: Use xlrd plus image-coordinate recognition instead (this requires more complicated processing)
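
Another option, following the same idea as the Word extractor: an .xlsx file is also a ZIP archive, and its pictures are stored under xl/media/. The sketch below copies those entries out directly without going through openpyxl. It does not tell you which sheet or cell a picture belongs to. `extract_xlsx_media` is a hypothetical helper, and the demo archive is a stand-in with placeholder bytes rather than a real workbook.

```python
import os
import tempfile
import zipfile

def extract_xlsx_media(xlsx_path, output_dir):
    """Copy every file under xl/media/ into output_dir and return the count."""
    os.makedirs(output_dir, exist_ok=True)
    count = 0
    with zipfile.ZipFile(xlsx_path, "r") as zf:
        for name in zf.namelist():
            if name.startswith("xl/media/") and not name.endswith("/"):
                target = os.path.join(output_dir, os.path.basename(name))
                with open(target, "wb") as out:
                    out.write(zf.read(name))
                count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive: same folder layout as an .xlsx, placeholder contents
    book = os.path.join(tmp, "demo.xlsx")
    with zipfile.ZipFile(book, "w") as zf:
        zf.writestr("xl/workbook.xml", "<workbook/>")
        zf.writestr("xl/media/image1.png", b"placeholder")
    out_dir = os.path.join(tmp, "out")
    extracted = extract_xlsx_media(book, out_dir)
    saved = sorted(os.listdir(out_dir))
    print(extracted, saved)
```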

Q2: Why are the pictures extracted from a PDF blurry?

Reason: The PDF embeds low-resolution pictures

Solution: Try the higher-precision extraction mode of pdfplumber

Q3: Why does the program become unresponsive?

Reason: Processing a large file takes time and blocks the main (GUI) thread

Solution: Run the extraction in a separate thread (see the threading module)
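
One way to apply this is to run the extractor in a worker thread and pass the result back through a queue. This is a sketch under stated assumptions: `run_in_background` and `slow_extract` are hypothetical names, and the stand-in extractor just returns a fixed count.

```python
import queue
import threading

def run_in_background(func, *args):
    """Start func(*args) in a worker thread; the outcome arrives on the returned queue."""
    results = queue.Queue()

    def worker():
        try:
            results.put(("ok", func(*args)))
        except Exception as exc:  # hand failures to the caller instead of losing them
            results.put(("error", exc))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, results

def slow_extract(file_path, output_dir):
    """Stand-in for a long-running extractor."""
    return 3

thread, results = run_in_background(slow_extract, "report.pdf", "outputs")
thread.join()  # a GUI would poll the queue instead of blocking here
status, value = results.get()
print(status, value)
```

In the tkinter app, the main thread would not call join(). It could instead schedule a periodic check with root.after(100, ...) that calls results.get_nowait() and updates the log area, since widgets should only be touched from the main thread.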

7. Project expansion suggestions

Add batch processing: support importing whole folders

Add image preview: show thumbnails in the interface

Support archives: decompress ZIP/RAR files directly and process their contents

Add format conversion: automatically convert special formats such as HEIC/WEBP
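
As a starting point for the batch-processing suggestion, the sketch below walks a folder and collects every file with a supported extension. Each path could then be passed to the matching extractor. `find_supported_files` and `SUPPORTED_EXTS` are hypothetical names, and the demo files are empty placeholders.

```python
import os
import tempfile

SUPPORTED_EXTS = {".docx", ".pdf", ".xlsx", ".html"}

def find_supported_files(folder):
    """Recursively collect files whose extension is supported (case-insensitive)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in SUPPORTED_EXTS:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

with tempfile.TemporaryDirectory() as tmp:
    # Create a small tree of empty placeholder files
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.docx", "notes.txt", os.path.join("sub", "b.PDF")):
        open(os.path.join(tmp, rel), "w").close()
    found = [os.path.relpath(p, tmp) for p in find_supported_files(tmp)]
    print(found)
```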

That concludes this detailed walkthrough of extracting pictures from common document formats with Python. For more on extracting pictures with Python, see my other related articles.