SoFunction
Updated on 2025-04-20

Python implements intelligent extraction and synthesis of Word document content

How do you extract content from about 10 DOCX documents and generate a new document? The extracted content includes the source documents' text, pictures, tables, formulas, and so on, while the output must follow the target document's style, font, and formatting, as well as its language style, word-usage conventions, and grammar habits. This is a rather complex requirement because it involves not only content extraction but also deep formatting and style imitation. A perfect, fully automated solution is extremely difficult, especially for complex formulas and subtle language styles.

A pragmatic solution is to adopt a hybrid strategy of automation plus manual assistance. Below is a detailed breakdown of the ideas, technical path, methods, and steps:

Core idea

Content extraction (mainly automated): Programmatically extract the required core content (text, pictures, tables, and some representation of formulas) from the source DOCX files.

Style application (automated): Based on a template document that defines the target style, layout, fonts, etc., insert the extracted content into a new document and apply the styles defined in the template.

Language style adjustment (automated assist + manual): Use a large language model (LLM) or other natural language processing (NLP) technology to perform a first pass of style, wording, and grammar adjustments on the extracted text, followed by manual review and refinement.

Complex element processing (mainly manual): Manually adjust elements that are difficult to process automatically (such as complex formulas and special typesetting).

Technical path

Main tool: Python programming language

Core libraries:

  • python-docx: used to read and write DOCX files (text, tables, pictures, basic style application).
  • (Optional) For formula processing: you may need to parse the DOCX's underlying XML (OOXML), find a library that specializes in MathML/OMML processing (the harder route), or extract formulas as images.
  • (Optional) For image processing: Pillow (the PIL fork) may be needed to handle extracted images.
  • (Optional) For language style adjustment: call a large language model API (such as the OpenAI GPT series, Google Gemini, or other similar services).

Auxiliary tools:

  • Microsoft Word: used to create the template document and for final review and tweaks.
  • XML Editor (optional): Used to deeply analyze DOCX internal structures (especially formulas).

Implementation steps

Phase 1: Preparation

1. Create a target template document:

  • Create a new document in Word.
  • Define styles: carefully set up all required styles (Heading 1, Heading 2, body text, quotation, lists, table styles, etc.), including font, font size, color, paragraph spacing, indentation, and so on. Make sure the style names are clear and understandable (e.g. TargetHeading1, TargetBodyText, TargetTableStyle).
  • Set the page layout: margins, paper size, headers and footers, etc.
  • Save: save this document as your template file; it will be the basis for all newly generated documents.

2. Define clear extraction rules:

Key: you need to define very precisely what content should be extracted from each source document. The rules can be based on:

  • Specific heading: "Extract everything under 'Chapter 3 Methods'".
  • Specific style: "Extract all content that has the 'source document focus' style applied".
  • Keywords/tags: "Extract paragraphs containing the '[EXTRACT]' tag".
  • Structural location: "Extract the second table of each document".
  • Manual marking (most flexible but slowest): manually mark the content to be extracted in the source documents (for example, with Word's comment feature or a specific highlight color), and have the script recognize these marks.
  • Document the rules: keep these rules clearly written down so that you can implement them in the scripts.
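Before wiring such rules into DOCX parsing, the matching logic itself can be prototyped on plain values. This is a hypothetical sketch: each paragraph is reduced to a (style name, text) pair, and the rule keys (style_name, keyword) mirror the kinds of rules listed above:

```python
def matches_rule(style_name, text, rules):
    """Return True if a paragraph (reduced to style name + text) satisfies a rule set."""
    # Style-based rule: the paragraph carries a specific source style
    if 'style_name' in rules and style_name == rules['style_name']:
        return True
    # Tag-based rule: the paragraph text contains a marker keyword
    if 'keyword' in rules and rules['keyword'] in text:
        return True
    return False

print(matches_rule('SourceHighlight', 'Key finding', {'style_name': 'SourceHighlight'}))  # True
print(matches_rule('Normal', 'Note [EXTRACT] this', {'keyword': '[EXTRACT]'}))            # True
print(matches_rule('Normal', 'Plain text', {'style_name': 'SourceHighlight'}))            # False
```

Heading-range rules ("between Chapter 3 and Chapter 4") additionally need state tracking across paragraphs, as the extraction script below shows.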

3. Set up the development environment:

Install Python.

Use pip to install the necessary libraries:

pip install python-docx Pillow requests  # requests is only needed if you call an LLM API
# Other libraries may be required, depending on the implementation

(Optional) Get the LLM API key.

Stage 2: Content Extraction (Python Script)

import os
from docx import Document
from docx.shared import Inches
from docx.table import Table
from docx.text.paragraph import Paragraph
# You may need to import other modules, e.g. for processing XML or calling APIs

# --- Configuration ---
SOURCE_DOCS_DIR = 'files/transform/docx/source_documents'
TARGET_TEMPLATE = 'files/transform/docx/template.docx'  # template filename assumed
OUTPUT_DOC_PATH = 'files/transform/docx/generated_document.docx'
EXTRACTION_RULES = {
    # Sample rules -- modify according to your actual situation
    'source_doc_1.docx': {'heading_start': 'Chapter 3', 'heading_end': 'Chapter 4'},
    'source_doc_2.docx': {'style_name': 'SourceHighlight'},
    # ... rules for other documents
}

# --- Helper function (example) ---
def should_extract_paragraph(paragraph, rules):
    # Decide whether a paragraph should be extracted based on the rules,
    # e.g. check whether its text or style matches; return True or False.
    # (This logic needs to be written according to your specific rules.)
    style_name = paragraph.style.name
    text = paragraph.text.strip()
    # Example: simple style-based rule
    if 'style_name' in rules and style_name == rules['style_name']:
        return True
    # Example: rules based on a starting heading need state management
    # if 'heading_start' in rules: ... (more complex logic to track the current chapter)
    return False  # no extraction by default

def extract_content_from_doc(source_path, rules):
    """Extract content from a single source document."""
    extracted_elements = []
    try:
        source_doc = Document(source_path)
        # Tracks whether we are inside the extraction zone
        # (e.g. between specific chapter headings)
        in_extraction_zone = False  # initial state depends on the rules
        for element in source_doc.element.body:
            # Handle the different element types: paragraphs, tables, etc.
            if element.tag.endswith('}p'):  # it is a paragraph
                paragraph = Paragraph(element, source_doc)

                # --- Core extraction logic ---
                # Implement your judgment logic based on EXTRACTION_RULES here:
                # detect the starting heading, the ending heading, style matches, etc.
                # This is a simplified example; real use may need finer state management.
                if ('heading_start' in rules
                        and paragraph.style.name.startswith('Heading')
                        and rules['heading_start'] in paragraph.text):
                    in_extraction_zone = True
                    continue  # skip the starting heading itself? depends on requirements
                if ('heading_end' in rules
                        and paragraph.style.name.startswith('Heading')
                        and rules['heading_end'] in paragraph.text):
                    in_extraction_zone = False
                    continue  # reached the ending heading, stop extracting
                if in_extraction_zone or should_extract_paragraph(paragraph, rules):
                    # Extract the text content
                    text_content = paragraph.text
                    # Extracting basic formatting (bold, italic) is more involved
                    # and may require traversing runs.
                    # TODO: extract pictures (check inline_shapes, or drawings in runs)
                    # TODO: extract formulas (extremely challenging, see discussion below)
                    extracted_elements.append({
                        'type': 'paragraph',
                        'text': text_content,
                        'style': paragraph.style.name,  # keep the source style name for reference
                    })
            elif element.tag.endswith('}tbl'):  # it is a table
                table = Table(element, source_doc)
                # --- Extract the table ---
                # TODO: implement table extraction logic; you may need to check
                # whether the table is within the extraction zone, e.g.:
                # if in_extraction_zone or table_should_be_extracted(table, rules):
                table_data = []
                for row in table.rows:
                    row_data = [cell.text for cell in row.cells]
                    table_data.append(row_data)
                extracted_elements.append({'type': 'table', 'data': table_data})

            # --- Processing pictures ---
            # Find pictures inside paragraphs (inline_shapes):
            # paragraph = Paragraph(element, source_doc)  # re-get the paragraph object if needed
            # for run in paragraph.runs:
            #     if run.element.findall('.//wp:inline') or run.element.findall('.//wp:anchor'):
            #         # This part is complex: you need the image's relationship id (rId)
            #         # and must relate it back to the actual image part in the docx package.
            #         # python-docx can extract images, but associating them perfectly
            #         # with their original position during extraction requires care.
            #         # Placeholder:
            #         # image_data = get_image_data(run, source_doc)
            #         # if image_data:
            #         #     extracted_elements.append({'type': 'image', 'data': image_data,
            #         #                                'filename': f'img_{len(extracted_elements)}.png'})
            pass  # placeholder for image extraction logic

    except Exception as e:
        print(f"Error processing {source_path}: {e}")
    return extracted_elements

# --- Main process ---
all_extracted_content = []
source_files = [f for f in os.listdir(SOURCE_DOCS_DIR) if f.endswith('.docx')]

for filename in source_files:
    source_path = os.path.join(SOURCE_DOCS_DIR, filename)
    rules = EXTRACTION_RULES.get(filename, {})  # get the extraction rules for this file
    if rules:  # only process files that have rules defined
        print(f"Extracting from: {filename}")
        content = extract_content_from_doc(source_path, rules)
        all_extracted_content.extend(content)
    else:
        print(f"Skipping {filename}, no rules defined.")

print(f"Total elements extracted: {len(all_extracted_content)}")

Stage 3: Language style adjustment (optional, Python + LLM API)

# --- (Continued from the previous stage) ---
import requests
import json

# --- Configure the LLM ---
LLM_API_URL = "YOUR_LLM_API_ENDPOINT"  # e.g. the OpenAI API URL
LLM_API_KEY = "YOUR_LLM_API_KEY"
LLM_PROMPT_TEMPLATE = """
Please rewrite this text according to the following requirements:
Target language style: [describe in detail here, e.g.: formal, objective, concise]
Wording specifications: [list them here, e.g.: use "user" instead of "customer", avoid abbreviations]
Grammar habits: [describe here, e.g.: prefer the active voice, keep sentence length moderate]
Target audience: [describe the target readers]

Original:
"{text}"

Rewritten text:
"""

def adapt_text_style(text):
    """Adjust text style using the LLM API."""
    if not text.strip():
        return text  # skip empty text
    prompt = LLM_PROMPT_TEMPLATE.format(text=text)
    headers = {
        "Authorization": f"Bearer {LLM_API_KEY}",
        "Content-Type": "application/json",
    }
    data = {
        "model": "gpt-4",    # or the model you use
        "prompt": prompt,
        "max_tokens": 1024,  # adjust as needed
        "temperature": 0.5,  # controls creativity; lower values are more conservative
    }
    try:
        response = requests.post(LLM_API_URL, headers=headers, json=data)
        response.raise_for_status()  # check for HTTP errors
        result = response.json()
        # Parse the result returned by the LLM; note that response formats
        # differ between APIs.
        rewritten_text = result['choices'][0]['text'].strip()  # sample path
        print(f"Original: {text[:50]}... | Rewritten: {rewritten_text[:50]}...")
        return rewritten_text
    except requests.exceptions.RequestException as e:
        print(f"Error calling LLM API: {e}")
        return text  # return the original text on error
    except (KeyError, IndexError) as e:
        print(f"Error parsing LLM response: {e} - Response: {response.text}")
        return text  # return the original text on error

# --- Apply style adjustment ---
adjusted_content = []
for element in all_extracted_content:
    if element['type'] == 'paragraph':
        # --- Call the LLM API ---
        # adjusted_text = adapt_text_style(element['text'])
        # element['text'] = adjusted_text  # update the text
        # --- Or skip the call for now and post-process after generation ---
        adjusted_content.append(element)
    elif element['type'] == 'table':
        # Table content could also be processed cell by cell, but that may work
        # poorly or be costly. A better approach might be to flatten the table
        # into text and describe it to the LLM, or to process it manually.
        adjusted_content.append(element)
    elif element['type'] == 'image':
        # Images cannot be processed directly
        adjusted_content.append(element)
    # Handle other types...

# --- (Continued in the next stage: document generation) ---

Stage 4: Generate target document (Python script)

# --- (Continued) ---
# --- Create the target document (based on the template) ---
try:
    target_doc = Document(TARGET_TEMPLATE)
except Exception as e:
    print(f"Error loading template {TARGET_TEMPLATE}: {e}")
    # Consider creating an empty document as a fallback:
    # target_doc = Document()
    exit()

# --- Fill in content and apply styles ---
for element in adjusted_content:  # use the adjusted content, or the original extracted content
    if element['type'] == 'paragraph':
        text = element['text']
        # --- Core: apply the styles defined in the template ---
        # Simple way: apply the default body-text style to every paragraph:
        # target_doc.add_paragraph(text, style='TargetBodyText')  # style assumed to exist in the template
        # Complex way: decide which target style to apply based on source document
        # information or content. Example: if the source style is Heading 1,
        # apply TargetHeading1.
        source_style = element.get('style', '')  # source style name (if saved during extraction)
        if source_style.startswith('Heading 1'):
            target_doc.add_paragraph(text, style='TargetHeading1')  # style assumed to exist in the template
        elif source_style.startswith('Heading 2'):
            target_doc.add_paragraph(text, style='TargetHeading2')
        # ... other style mapping rules
        else:
            target_doc.add_paragraph(text, style='TargetBodyText')  # default style
    elif element['type'] == 'table':
        table_data = element['data']
        if table_data:
            # Create a table
            num_rows = len(table_data)
            num_cols = len(table_data[0]) if num_rows > 0 else 0
            if num_rows > 0 and num_cols > 0:
                # --- Apply the table style defined in the template ---
                table = target_doc.add_table(rows=num_rows, cols=num_cols,
                                             style='TargetTableStyle')  # table style assumed to exist
                # Fill in the data
                for i, row_data in enumerate(table_data):
                    for j, cell_text in enumerate(row_data):
                        # Guard against column-count mismatches
                        if j < len(table.rows[i].cells):
                            table.rows[i].cells[j].text = cell_text
                # More table formatting code can go here, e.g. setting column widths
    elif element['type'] == 'image':
        # --- Add a picture ---
        # image_data = element['data']
        # image_filename = element['filename']
        # # Save image_data to a temporary file, or use BytesIO:
        # from io import BytesIO
        # image_stream = BytesIO(image_data)
        # try:
        #     target_doc.add_picture(image_stream, width=Inches(4.0))  # adjust the width
        # except Exception as e:
        #     print(f"Error adding image {image_filename}: {e}")
        pass  # placeholder for image insertion

    # --- Processing formulas (the hard part) ---
    # If the formula was extracted as an image:
    #   elif element['type'] == 'formula_image':
    #       # add the image ...
    # If the formula was extracted as MathML/OMML (an XML string):
    #   elif element['type'] == 'formula_mathml':
    #       # Hard to insert MathML directly with python-docx; you may need to
    #       # manipulate OOXML directly (very complex), or insert a placeholder
    #       # and replace it manually:
    #       target_doc.add_paragraph(f"[FORMULA: {element['id']}]", style='TargetBodyText')
    # If the formula was extracted as a plain-text approximation:
    #   elif element['type'] == 'formula_text':
    #       target_doc.add_paragraph(element['text'], style='FormulaStyle')  # may need a special style

# --- Save the final document ---
try:
    target_doc.save(OUTPUT_DOC_PATH)
    print(f"Document successfully generated: {OUTPUT_DOC_PATH}")
except Exception as e:
    print(f"Error saving document: {e}")
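The Heading 1 to TargetHeading1 branching in the generation script generalizes naturally to a mapping table, which is easier to extend than a chain of elif branches. This is a sketch under the assumption that the template defines the Target* styles; STYLE_MAP and map_style are hypothetical names:

```python
# Hypothetical mapping from source style prefixes to template style names;
# anything unmapped falls back to the body-text style
STYLE_MAP = {
    'Heading 1': 'TargetHeading1',
    'Heading 2': 'TargetHeading2',
}

def map_style(source_style):
    """Return the template style to apply for a given source style name."""
    for prefix, target in STYLE_MAP.items():
        if source_style.startswith(prefix):
            return target
    return 'TargetBodyText'

print(map_style('Heading 1'))       # TargetHeading1
print(map_style('List Paragraph'))  # TargetBodyText
```

In the generation loop, target_doc.add_paragraph(text, style=map_style(source_style)) then replaces the if/elif chain.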

Stage 5: Manual review and refinement

1. Open the generated document (generated_document.docx).

2. Check the overall structure and content integrity: Is all required content extracted and placed in the correct position?

3. Check the style and format:

  • Is the correct template style applied to all text?
  • Do the font, font size and spacing meet the requirements?
  • Is the table style correct? Do column width and alignment need to be adjusted?
  • Is the image position and size appropriate?

4. Check language style and specifications:

  • Read the text through and check whether the tone and words meet the target requirements.
  • Fix any errors or unnatural expressions the LLM may have introduced.
  • Ensure the terminology is uniform.
  • Perform spelling and grammar checks.

5. Handle complex elements:

Formulas: this is where manual work is most likely needed. If the script inserted placeholders, you need to manually copy the formulas over from the source documents, or recreate them with Word's equation editor. Make sure the formulas are numbered and referenced correctly.

Special layout: Check whether there are places where special layouts are needed (such as mixed pictures and texts, columns, etc.) and adjust them manually.

6. Final pass: save the revised document.

Challenges and strategies for formula processing

Difficulty: formulas in DOCX are usually stored as OMML (Office Math Markup Language), nested in complex XML structures, and python-docx has limited support for them.

Strategy:

  • Extract as an image (most feasible): try rendering or screenshotting each formula as an image during the extraction phase. This loses editability but preserves the visual result. It is still non-trivial to implement and may require other tools or libraries (the docx2python library may help, or you may need to analyze the OOXML to find an image representation).
  • Extract as MathML/OMML (complex): parse the OOXML and extract the XML fragments of the formulas. But python-docx cannot directly reinsert these fragments and render them as formulas; very low-level OOXML manipulation is required.
  • Extract as approximate text (simple but lossy): python-docx sometimes returns an approximate plain-text representation when reading the text property of a paragraph containing formulas. This may work for simple formulas, but complex formulas will be completely distorted.
  • Manual processing (most reliable): have the script identify formula positions and insert placeholders, then manually copy or recreate the formulas during the review phase.
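Whichever strategy you choose, the first step is locating the formulas: they appear as m:oMath elements in the OOXML math namespace inside the paragraph XML. The sketch below uses only the standard library on a hand-written stand-in fragment (a real document.xml contains the same structure, reachable via python-docx element objects):

```python
import xml.etree.ElementTree as ET

# OMML (Office Math) lives under this namespace inside document.xml
M_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/math'
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hand-written stand-in for a paragraph containing one formula
xml_fragment = f'''<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">
  <m:oMath><m:r><m:t>E=mc2</m:t></m:r></m:oMath>
</w:p>'''

root = ET.fromstring(xml_fragment)
# Find every formula element in the paragraph
formulas = root.findall(f'.//{{{M_NS}}}oMath')
print(len(formulas))
# The m:t children hold the (approximate) text content of the formula
texts = [t.text for t in root.findall(f'.//{{{M_NS}}}t')]
print(texts)
```

Counting the hits per paragraph tells the script where to insert [FORMULA] placeholders for the manual pass, and the collected XML fragments are the raw material for the MathML/OMML route.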

Summary

This is a multi-stage process that combines automated and manual work.

Automation strengths: repetitive content extraction, template-based style application, and preliminary text style conversion (using an LLM).

Manual intervention points: defining precise extraction rules, handling complex formulas, fine-tuning language style and terminology, and final formatting and quality checks.

The most time-consuming parts will be writing and debugging the extraction logic and the final manual review and correction. Be sure to start with a small number of documents and simple rules, then iterate and refine your scripts.

The above is the detailed content of Python implementing intelligent extraction and synthesis of Word document content. For more information about extracting and synthesizing Word document content with Python, please follow my other related articles!