Manipulating Word documents in Python is a common task, especially in office automation and data processing. This article summarizes and compares seven commonly used Python libraries and methods, covering their advantages and disadvantages, applicable scenarios, and code examples. We look at the specific features and usage tips of each method to help you understand them and choose the right one.
1. python-docx
Overview:
python-docx is a Python library for creating and modifying Microsoft Word documents (.docx format). It provides a rich API that allows developers to easily generate and edit Word documents.
Main functions:
- Create a new Word document
- Add paragraphs, headings, pictures, tables, etc.
- Set text format (font, size, color, etc.)
- Read and modify the content of an existing document
- Insert page breaks (bookmarks and hyperlinks have no high-level API and require editing the underlying XML)
Advantages:
- Cross-platform: Can run on any Python-enabled platform.
- Easy to use: The API is concise and clear, and the documentation is detailed.
- Feature-rich: Supports the creation and editing of multiple document elements.
Disadvantages:
- Only the .docx format is supported, not the .doc format.
- Support for complex Word documents, such as documents containing macros or embedded objects, is limited.
Applicable scenarios:
- Create and edit simple Word documents.
- Automatically generate reports, resumes and other documents.
- Batch processing of Word documents.
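Because python-docx reads only .docx, a batch job may want to check what each file really is before opening it. The sketch below is one way to do that with the standard library alone. `detect_word_format` and `split_by_format` are hypothetical helper names, not part of python-docx. It relies on two facts: a .docx file is a ZIP package that contains `word/document.xml`, and a legacy .doc file starts with the OLE2 compound-file signature.

```python
import zipfile
from pathlib import Path

# Signature that opens every OLE2 compound file, the container used by legacy .doc
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def detect_word_format(path):
    """Return 'docx', 'doc', or 'unknown' based on file content, not the extension."""
    path = Path(path)
    with path.open("rb") as fh:
        header = fh.read(8)
    if header == OLE2_MAGIC:
        # Legacy .xls and .ppt files share this signature, so treat 'doc' as "probably .doc"
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # A .docx package stores its main body under word/document.xml
            if "word/document.xml" in archive.namelist():
                return "docx"
    return "unknown"


def split_by_format(folder):
    """Group the file names in a folder by detected format."""
    groups = {"docx": [], "doc": [], "unknown": []}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            groups[detect_word_format(path)].append(path.name)
    return groups
```

Files that land in the "docx" group can go straight to python-docx; the "doc" group needs one of the other methods in this article first.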
Detailed features and code examples:
1. Create and edit documents
    from docx import Document
    from docx.shared import Inches, Pt, RGBColor

    # Create a new document
    doc = Document()

    # Add a title
    doc.add_heading('Document Title', 0)

    # Add a paragraph
    doc.add_paragraph('This is the first paragraph of the document.')

    # Add text with character formatting
    p = doc.add_paragraph('This paragraph contains ')
    run = p.add_run('specially formatted text. ')
    run.bold = True
    run.italic = True

    # Set the font
    run = p.add_run('This text has its font set.')
    run.font.name = 'Arial'
    run.font.size = Pt(14)
    run.font.bold = True
    run.font.italic = True
    run.font.color.rgb = RGBColor(0xFF, 0x00, 0x00)  # red

    # Add an image
    doc.add_picture('path_to_image.jpg', width=Inches(1.25))

    # Add a table
    table = doc.add_table(rows=2, cols=3)
    table.cell(0, 0).text = 'Row 1 Column 1'
    table.cell(0, 1).text = 'Row 1 Column 2'
    table.cell(1, 0).text = 'Row 2 Column 1'

    # Add a page break
    doc.add_page_break()

    # Bookmarks and hyperlinks: python-docx paragraphs have no add_bookmark()
    # or add_hyperlink() method; both require building the underlying XML by hand.
    doc.add_paragraph('This is the bookmark location')
    doc.add_paragraph('This is a hyperlink:')

    # Save the document
    doc.save('output.docx')  # placeholder file name
2. Read and modify existing documents
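    from docx import Document

    # Open an existing document
    doc = Document('existing_document.docx')

    # Read the document content
    for para in doc.paragraphs:
        print(para.text)

    # Modify the first paragraph
    para = doc.paragraphs[0]
    para.text = 'This is the modified content'

    # Add a new paragraph
    doc.add_paragraph('This is a new paragraph added')

    # Delete the second paragraph by removing its XML element from its parent
    para = doc.paragraphs[1]
    element = para._element
    element.getparent().remove(element)
    para._p = para._element = None  # drop the stale references held by the proxy object

    # Save the modified document
    doc.save('modified_document.docx')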
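A common variation on reading and modifying a document is a search-and-replace pass. The following is a minimal sketch rather than a definitive implementation; `replace_text` and `replace_in_docx` are hypothetical helper names. It assumes that losing character formatting inside the changed paragraphs is acceptable, because assigning to `.text` rebuilds a python-docx paragraph as a single run. It also walks table cells, since `doc.paragraphs` covers only body paragraphs.

```python
def replace_text(paragraphs, old, new):
    """Replace `old` with `new` in every paragraph-like object; return how many changed.

    Works on anything with a read/write `.text` attribute, which includes
    python-docx Paragraph objects.
    """
    changed = 0
    for para in paragraphs:
        if old in para.text:
            # Assigning .text on a python-docx paragraph collapses it to one run,
            # so bold, italic, and similar formatting in that paragraph is lost.
            para.text = para.text.replace(old, new)
            changed += 1
    return changed


def replace_in_docx(src, dst, old, new):
    """Open `src`, replace text in body paragraphs and table cells, save as `dst`."""
    from docx import Document  # imported here so replace_text works without python-docx

    doc = Document(src)
    changed = replace_text(doc.paragraphs, old, new)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                changed += replace_text(cell.paragraphs, old, new)
    doc.save(dst)
    return changed
```

For example, `replace_in_docx('existing_document.docx', 'modified_document.docx', '2023', '2024')` would return the number of paragraphs it rewrote.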
2. docx-mailmerge
Overview:
docx-mailmerge is a library for batch generation of Word documents. You define merge fields (placeholders) in a Word template and then fill them from a Python script to generate many similar documents.
Main functions:
- Read Word template files
- Replace placeholders in templates
- Generate multiple documents
Advantages:
- Easy to use: just define a template and the data to generate multiple documents.
- Supports structured data: table rows can be filled from a list of records with merge_rows().
Disadvantages:
- Narrow scope: mainly suited to mail-merge scenarios.
- Complex document editing operations are not supported.
Applicable scenarios:
- Generate contracts, invoices, certificates and other documents in batches.
- Automate the generation of personalized reports.
Detailed features and code examples:
Create templates and generate documents
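    from mailmerge import MailMerge

    TEMPLATE = 'template.docx'  # placeholder template path

    # Open the template file
    template = MailMerge(TEMPLATE)

    # View the merge fields (placeholders) in the template
    print(template.get_merge_fields())

    # Define the data
    data = {
        'name': 'John Doe',
        'address': '123 Main St',
        'city': 'Anytown',
        'state': 'Anystate',
        'zip': '12345'
    }

    # Generate a document
    template.merge(**data)
    template.write('output.docx')  # placeholder output path
    template.close()

    # Generate multiple documents
    data_list = [
        {'name': 'John Doe', 'address': '123 Main St', 'city': 'Anytown', 'state': 'Anystate', 'zip': '12345'},
        {'name': 'Jane Smith', 'address': '456 Elm St', 'city': 'Othertown', 'state': 'Otherstate', 'zip': '67890'}
    ]
    for i, data in enumerate(data_list):
        # Re-open the template for each record: merge() replaces the fields in memory
        template = MailMerge(TEMPLATE)
        template.merge(**data)
        template.write(f'output_{i+1}.docx')
        template.close()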
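When generating contracts or invoices in bulk, one incomplete record can silently produce a document with an unfilled field. The sketch below checks each record against the template's merge fields before writing anything. `missing_fields` and `merge_records` are hypothetical helper names, and the output file pattern is an assumption; only `MailMerge`, `get_merge_fields()`, `merge()`, and `write()` come from docx-mailmerge.

```python
def missing_fields(required, record):
    """Return the merge fields in `required` that `record` does not supply, sorted."""
    return sorted(set(required) - set(record))


def merge_records(template_path, records, output_pattern="output_{}.docx"):
    """Write one document per complete record; skip and report incomplete ones."""
    from mailmerge import MailMerge  # docx-mailmerge, imported only when merging

    written = []
    for index, record in enumerate(records, start=1):
        # Open a fresh copy of the template for every record
        with MailMerge(template_path) as document:
            missing = missing_fields(document.get_merge_fields(), record)
            if missing:
                print(f"Record {index} skipped, missing: {', '.join(missing)}")
                continue
            document.merge(**record)
            output_path = output_pattern.format(index)
            document.write(output_path)
            written.append(output_path)
    return written
```

Keeping the check in a separate function means it can be run over a whole data set before any file is produced.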
3. win32com.client (pywin32)
Overview:
win32com.client, part of the pywin32 package, is a library for calling Windows COM objects from Python. Through it, you can control the Microsoft Word application directly and perform advanced operations on Word documents.
Main functions:
- Open and close Word apps
- Create and edit documents
- Read and modify document content
- Handle complex document structures (such as macros, embedded objects, etc.)
Advantages:
- Powerful: Can implement almost all Word operations.
- Supports .doc and .docx formats.
Disadvantages:
- Windows only, and Microsoft Word must be installed.
- Steeper learning curve: you need to be familiar with COM objects and Word's object model.
- Slower: every run has to start a Word application process.
Applicable scenarios:
- Scenarios where complex document operations are required.
- Process documents containing macros or embedded objects.
Detailed features and code examples:
1. Create and edit documents
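    import os
    from win32com.client import Dispatch

    # Open the Word application
    word = Dispatch('Word.Application')
    word.Visible = 0        # run in the background, not displayed
    word.DisplayAlerts = 0  # no warnings displayed

    # Create a new document
    doc = word.Documents.Add()

    # Add a title
    title = doc.Paragraphs.Add().Range
    title.Text = 'Document Title'
    title.Font.Bold = True
    title.Font.Size = 16

    # Add a paragraph
    doc.Paragraphs.Add().Range.Text = 'This is the first paragraph of the document.'

    # Add text with character formatting
    p = doc.Paragraphs.Add().Range
    p.Text = 'This paragraph contains '
    p.Font.Bold = False
    p.Font.Italic = False
    p.Collapse(0)  # wdCollapseEnd
    p.Text = 'specially formatted text. '
    p.Font.Bold = True
    p.Font.Italic = True

    # Set the font
    p.Collapse(0)  # wdCollapseEnd
    p.Text = 'This text has its font set.'
    p.Font.Name = 'Arial'
    p.Font.Size = 14
    p.Font.Bold = True
    p.Font.Italic = True
    p.Font.Color = 255  # red

    # Add an image (Word resolves relative paths against its own working directory)
    doc.InlineShapes.AddPicture(os.path.abspath('path_to_image.jpg'),
                                LinkToFile=False, SaveWithDocument=True)

    # Add a table (COM indexes are 1-based)
    table = doc.Tables.Add(Range=doc.Paragraphs.Add().Range, NumRows=2, NumColumns=3)
    table.Cell(1, 1).Range.Text = 'Row 1 Column 1'
    table.Cell(1, 2).Range.Text = 'Row 1 Column 2'
    table.Cell(2, 1).Range.Text = 'Row 2 Column 1'

    # Add a page break
    doc.Paragraphs.Add().Range.InsertBreak(7)  # wdPageBreak

    # Add a bookmark
    doc.Bookmarks.Add('bookmark_name', doc.Paragraphs.Add().Range)
    doc.Bookmarks('bookmark_name').Range.Text = 'This is the bookmark location'

    # Add a hyperlink
    p = doc.Paragraphs.Add().Range
    doc.Hyperlinks.Add(Anchor=p, Address='https://example.com',  # placeholder URL
                       SubAddress='', ScreenTip='Click here', TextToDisplay='Click here')

    # Save the document
    doc.SaveAs(os.path.abspath('output.docx'))  # placeholder file name

    # Close the document and the Word application
    doc.Close()
    word.Quit()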
2. Read and modify existing documents
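    import os
    from win32com.client import Dispatch

    # Open the Word application
    word = Dispatch('Word.Application')
    word.Visible = 0        # run in the background, not displayed
    word.DisplayAlerts = 0  # no warnings displayed

    # Open an existing document (Word needs an absolute path)
    doc = word.Documents.Open(os.path.abspath('existing_document.docx'))

    # Read the document content
    for para in doc.Paragraphs:
        print(para.Range.Text)

    # Modify the first paragraph (COM collections are 1-based)
    para = doc.Paragraphs(1)
    para.Range.Text = 'This is the modified content'

    # Add a new paragraph
    doc.Paragraphs.Add().Range.Text = 'This is a new paragraph added'

    # Delete the second paragraph
    para = doc.Paragraphs(2)
    para.Range.Delete()

    # Save the modified document
    doc.Save()

    # Close the document and the Word application
    doc.Close()
    word.Quit()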
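Since Word itself can open .doc files, one practical use of this method is converting legacy files to .docx so that the .docx-only libraries above can process them. The following is a sketch under stated assumptions: pywin32 and Word are installed, and `com_path`, `docx_target`, `word_application`, and `convert_doc_to_docx` are hypothetical helper names. The value 12 is Word's `wdFormatXMLDocument` constant. The context manager also makes sure Word is closed even if an operation fails, which otherwise leaves hidden Word processes running.

```python
import os
import sys
from contextlib import contextmanager

WD_FORMAT_XML_DOCUMENT = 12  # Word's wdFormatXMLDocument constant (.docx)


def com_path(path):
    """Return an absolute path; Word resolves relative paths against its own directory."""
    return os.path.abspath(path)


def docx_target(doc_path):
    """Map 'report.doc' to an absolute 'report.docx' next to the source file."""
    root, _ext = os.path.splitext(com_path(doc_path))
    return root + ".docx"


@contextmanager
def word_application():
    """Start a hidden Word instance and always quit it, even if an error occurs."""
    if sys.platform != "win32":
        raise RuntimeError("Word automation through COM requires Windows")
    from win32com.client import Dispatch  # part of pywin32

    word = Dispatch("Word.Application")
    word.Visible = False
    word.DisplayAlerts = 0
    try:
        yield word
    finally:
        word.Quit()


def convert_doc_to_docx(doc_path):
    """Open a legacy .doc file and save a .docx copy beside it; return the new path."""
    target = docx_target(doc_path)
    with word_application() as word:
        doc = word.Documents.Open(com_path(doc_path))
        try:
            doc.SaveAs2(target, FileFormat=WD_FORMAT_XML_DOCUMENT)
        finally:
            doc.Close(False)  # False: close without saving further changes
    return target
```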
4. mammoth
Overview:
mammoth is a library for converting Word documents (.docx format) to HTML. It can help you extract the content of Word documents for use in web applications.
Main functions:
- Convert .docx files to HTML
- Extract text and style information from a document
Advantages:
- Lightweight: focuses on document conversion and has minimal dependencies.
- Easy to integrate: converted HTML can be easily embedded into web applications.
Disadvantages:
- Narrow scope: mainly used for document conversion and does not support document editing.
- Limited style conversion: complex formatting may not be fully retained.
Applicable scenarios:
- Convert Word documents to HTML for web presentation.
- Extract plain text content from the document.
Detailed features and code examples:
Convert documents
    from mammoth import convert_to_html

    # Read the .docx file
    with open('document.docx', 'rb') as docx_file:  # placeholder file name
        result = convert_to_html(docx_file)

    # Get the converted HTML
    html = result.value

    # Save the HTML file
    with open('output.html', 'w', encoding='utf-8') as html_file:  # placeholder file name
        html_file.write(html)

    # Report conversion messages (warnings and errors)
    for message in result.messages:
        print(f"{message.type}: {message.message}")
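mammoth returns an HTML fragment rather than a complete page, so a file written directly from `result.value` has no `<html>` or `<head>` element and no charset declaration. The sketch below wraps the fragment in a minimal page and also shows the plain-text route mentioned under applicable scenarios. `wrap_fragment`, `docx_to_page`, and `docx_to_text` are hypothetical helper names; `convert_to_html` and `extract_raw_text` are mammoth functions.

```python
import html


def wrap_fragment(body_html, title="Converted document"):
    """Wrap an HTML fragment in a minimal standalone page with a UTF-8 declaration."""
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n</head>\n"
        f"<body>\n{body_html}\n</body>\n</html>\n"
    )


def docx_to_page(docx_path, html_path):
    """Convert a .docx file to a standalone HTML page; return mammoth's messages."""
    import mammoth  # imported here so wrap_fragment works without mammoth

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
    with open(html_path, "w", encoding="utf-8") as html_file:
        html_file.write(wrap_fragment(result.value))
    return result.messages


def docx_to_text(docx_path):
    """Return the plain text of a .docx file, ignoring all formatting."""
    import mammoth

    with open(docx_path, "rb") as docx_file:
        return mammoth.extract_raw_text(docx_file).value
```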
5. pandoc
Overview:
pandoc is a powerful document conversion tool that supports conversion between many formats. Although it is not a Python library, you can convert documents by calling the pandoc command from a Python script.
Main functions:
- Convert documents in multiple formats to Word documents (.docx)
- Supports Markdown, LaTeX and other formats
Advantages:
- Supports a wide range of document formats.
- High conversion quality: document structure such as headings, lists, and tables is generally preserved.
Disadvantages:
- The pandoc command-line tool must be installed separately.
- Document editing operations are not supported.
Applicable scenarios:
- Convert documents in other formats to Word documents.
- Scenarios that require high-quality document conversion.
Detailed features and code examples:
Convert documents
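    import subprocess

    # The input and output file names below are placeholders.

    # Call the pandoc command to convert a Markdown file into a Word document
    subprocess.run(['pandoc', 'input.md', '-o', 'output.docx'])

    # Call the pandoc command to convert a LaTeX file into a Word document
    subprocess.run(['pandoc', 'input.tex', '-o', 'output.docx'])

    # Call the pandoc command to convert an HTML file into a Word document
    subprocess.run(['pandoc', 'input.html', '-o', 'output.docx'])

    # Handle conversion errors
    try:
        subprocess.run(['pandoc', 'input.md', '-o', 'output.docx'],
                       check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Error: {e.returncode} - {e.stderr}")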
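Because pandoc is an external program, a script is more robust if it checks that the executable exists before calling it. The sketch below is one way to wrap the call; `build_pandoc_command` and `convert` are hypothetical helper names. It assumes pandoc's `-o` option and its `--reference-doc` option, which takes a .docx file whose styles are applied to the output.

```python
import shutil
import subprocess


def build_pandoc_command(source, target, reference_doc=None):
    """Build the pandoc argument list; pandoc infers formats from the file extensions."""
    command = ["pandoc", str(source), "-o", str(target)]
    if reference_doc is not None:
        # --reference-doc names a .docx whose styles pandoc uses for the output
        command.append(f"--reference-doc={reference_doc}")
    return command


def convert(source, target, reference_doc=None):
    """Run pandoc and raise RuntimeError carrying pandoc's own message on failure."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH; install it first")
    try:
        subprocess.run(
            build_pandoc_command(source, target, reference_doc),
            check=True, capture_output=True, text=True,
        )
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"pandoc exited with code {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
```

Passing the arguments as a list, as the article's own examples do, avoids shell quoting problems with file names that contain spaces.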
6. PyWinAuto
Overview:
PyWinAuto is an automated testing tool that can be used to simulate user actions, including opening and editing Word documents. This approach is suitable for scenarios where complex interactions are required.
Main functions:
- Simulate user operations (click, enter text, etc.)
- Control application windows and menus
Advantages:
- High flexibility: can simulate any user operation.
- Supports complex interactions.
Disadvantages:
- Windows platforms only.
- The learning curve is steeper: you need to be familiar with the concepts and techniques of automated testing.
Applicable scenarios:
- Scenarios where complex interactions are required.
- Test and verify the functionality of Word documents.
Detailed features and code examples:
Simulate user operations
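    from pywinauto import Application

    # Path to the Word executable; adjust it to your Office installation
    WORD_EXE = r'C:\Program Files\Microsoft Office\Office16\WINWORD.EXE'
    # Title of the document window; it depends on the Word version and UI language
    DOC_TITLE = 'Untitled - Word'
    # Placeholder output path
    SAVE_PATH = r'C:\path\to\output.docx'

    # Launch the Word application
    app = Application().start(WORD_EXE)
    window = app.window(title=DOC_TITLE)
    window.wait('ready')

    # Simulate typing text
    window.type_keys('Hello, World!', with_spaces=True)

    # Save the document
    # menu_select needs a classic menu bar; ribbon-based Word versions
    # may require keyboard shortcuts instead.
    window.menu_select('File->Save As...')
    save_dialog = app.window(title='Save As')
    save_dialog.type_keys(SAVE_PATH, with_spaces=True)
    save_dialog.child_window(title='Save', class_name='Button').click()

    # Close the document
    window.menu_select('File->Close')

    # Close the Word application
    app.kill()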
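One detail that easily goes wrong when simulating typing is that `type_keys` interprets some characters as control syntax instead of literal text: `+`, `^`, and `%` act as Shift, Ctrl, and Alt, `~` sends Enter, and parentheses and braces group or name keys. The sketch below escapes them by wrapping each in braces. `escape_keys` and `type_literal` are hypothetical helpers, and the list of special characters reflects our reading of pywinauto's keyboard syntax, so verify it against the version you have installed.

```python
# Characters that type_keys treats as control syntax rather than literal text
SPECIAL_KEYS = set("+^%~(){}")


def escape_keys(text):
    """Wrap each special character in braces so that it is typed literally."""
    return "".join("{" + ch + "}" if ch in SPECIAL_KEYS else ch for ch in text)


def type_literal(window, text):
    """Type `text` into a pywinauto window exactly as written, spaces included."""
    window.type_keys(escape_keys(text), with_spaces=True)
```

With this helper, `type_literal(window, 'Growth: 50% (estimated)')` is meant to type the percent sign and parentheses as characters instead of triggering modifier keys.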
7. Apache POI via Py4J
Overview:
Apache POI is a Java library used to handle Microsoft Office file formats. With Py4J, you can call Java code in Python, thereby using Apache POI to process Word documents.
Main functions:
- Create and edit Word documents
- Read and modify document content
- Handle complex document structures
Advantages:
- Powerful: Can implement almost all Word operations.
- Supports multiple Office file formats.
Disadvantages:
- Requires a Java runtime.
- Steeper learning curve: you need to be familiar with both Java and Py4J.
Applicable scenarios:
- Scenarios that involve complex document structures.
- Scenarios that require cross-platform support.
Detailed features and code examples:
Create and edit documents
First, you need to install Py4J and Apache POI, and then write a Java class to handle Word documents.
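    // Java code (WordProcessor.java)
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.math.BigInteger;
    import org.apache.poi.util.Units;
    import org.apache.poi.xwpf.usermodel.*;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark;

    public class WordProcessor {
        public void createDocument(String path) {
            XWPFDocument document = new XWPFDocument();

            // Add a title
            XWPFParagraph titlePara = document.createParagraph();
            titlePara.setAlignment(ParagraphAlignment.CENTER);
            XWPFRun titleRun = titlePara.createRun();
            titleRun.setText("Document Title");
            titleRun.setFontSize(16);
            titleRun.setBold(true);

            // Add a paragraph
            XWPFParagraph para = document.createParagraph();
            XWPFRun run = para.createRun();
            run.setText("This is the first paragraph of the document.");

            // Add text with character formatting
            run = para.createRun();
            run.setText("This paragraph contains ");
            run = para.createRun();
            run.setText("specially formatted text. ");
            run.setBold(true);
            run.setItalic(true);

            // Set the font
            run = para.createRun();
            run.setText("This text has its font set.");
            run.setFontFamily("Arial");
            run.setFontSize(14);
            run.setBold(true);
            run.setItalic(true);
            run.setColor("FF0000"); // red

            // Add a picture
            try (InputStream pictureStream = new FileInputStream("path_to_image.jpg")) {
                XWPFRun picRun = document.createParagraph().createRun();
                picRun.addPicture(pictureStream, Document.PICTURE_TYPE_JPEG,
                        "path_to_image.jpg", Units.toEMU(100), Units.toEMU(100));
            } catch (Exception e) {
                e.printStackTrace();
            }

            // Add a table
            XWPFTable table = document.createTable(2, 3);
            table.getRow(0).getCell(0).setText("Row 1 Column 1");
            table.getRow(0).getCell(1).setText("Row 1 Column 2");
            table.getRow(1).getCell(0).setText("Row 2 Column 1");

            // Add a page break
            XWPFParagraph pageBreakPara = document.createParagraph();
            pageBreakPara.createRun().addBreak(BreakType.PAGE);

            // Add a bookmark (no high-level API; uses the underlying XML beans)
            XWPFParagraph bookmarkPara = document.createParagraph();
            CTBookmark bookmark = bookmarkPara.getCTP().addNewBookmarkStart();
            bookmark.setName("bookmark_name");
            bookmark.setId(BigInteger.ZERO);
            bookmarkPara.createRun().setText("This is the bookmark location");
            bookmarkPara.getCTP().addNewBookmarkEnd().setId(BigInteger.ZERO);

            // Add a hyperlink
            XWPFParagraph linkPara = document.createParagraph();
            String url = "https://example.com"; // placeholder URL
            XWPFHyperlinkRun link = linkPara.createHyperlinkRun(url);
            link.setText("Click here");

            // Save the document
            try (FileOutputStream out = new FileOutputStream(path)) {
                document.write(out);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }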
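Then, with a Py4J GatewayServer running on the Java side whose entry point exposes getWordProcessor(), call this Java class from Python:
    from py4j.java_gateway import GatewayParameters, JavaGateway

    # Connect to the Java gateway (25333 is Py4J's default port)
    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))

    # Get the Java object
    word_processor = gateway.entry_point.getWordProcessor()

    # Call the Java method
    word_processor.createDocument('output.docx')  # placeholder file name

    # Close the gateway
    gateway.close()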
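The Python side can only work while the Java process is running and listening, and a missing gateway otherwise surfaces as a connection error deep inside Py4J. The sketch below checks the port first and closes the gateway when done. `gateway_is_listening` and `create_document` are hypothetical helpers; the port number and the `getWordProcessor()` entry-point method are taken from the example above and must match whatever your Java side actually exposes.

```python
import socket

PY4J_DEFAULT_PORT = 25333  # the port used in the article's JavaGateway call


def gateway_is_listening(host="127.0.0.1", port=PY4J_DEFAULT_PORT, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def create_document(path, port=PY4J_DEFAULT_PORT):
    """Ask the Java-side WordProcessor to create a document at `path`."""
    if not gateway_is_listening(port=port):
        raise RuntimeError(f"No Py4J gateway on port {port}; start the Java process first")
    from py4j.java_gateway import GatewayParameters, JavaGateway

    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))
    try:
        # getWordProcessor() must be defined on the Java entry-point object
        gateway.entry_point.getWordProcessor().createDocument(path)
    finally:
        gateway.close()
```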
Summary
Method | Main functions | Advantages | Disadvantages | Applicable scenarios |
---|---|---|---|---|
python-docx | Create and edit Word documents | Cross-platform, easy to use, feature-rich | Supports only .docx; limited support for complex documents | Creating and editing simple documents, automated report generation |
docx-mailmerge | Batch generation of Word documents | Easy to use, supports structured data | Narrow scope, does not support document editing | Batch generation of contracts, invoices, etc. |
win32com.client | Control the Word application | Powerful, supports .doc and .docx | Windows only, steep learning curve | Complex document operations, documents with macros or embedded objects |
mammoth | Convert .docx to HTML | Lightweight, easy to integrate | Narrow scope, does not support document editing | Document conversion, web display |
pandoc | Document format conversion | Wide range of formats, high conversion quality | Requires the pandoc command-line tool | Document conversion, high-quality output |
PyWinAuto | Simulate user operations | Highly flexible, supports complex interactions | Windows only, steep learning curve | Complex interactions, test verification |
Apache POI via Py4J | Create and edit Word documents | Powerful, supports multiple formats | Requires a Java environment, steep learning curve | Complex document operations, cross-platform support |
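The table can also be read as a small decision procedure. The sketch below encodes it as a function; `choose_method`, its task names, and its parameters are hypothetical and only restate the recommendations in the table, so treat it as a starting point rather than a rule.

```python
def choose_method(task, complex_document=False, on_windows=False):
    """Map a task to the method the comparison table recommends.

    task: 'edit', 'mail_merge', 'to_html', 'to_docx', or 'ui_automation'
    complex_document: the document has macros, embedded objects, or similar
    on_windows: the script runs on Windows with Word installed
    """
    direct = {
        "mail_merge": "docx-mailmerge",
        "to_html": "mammoth",
        "to_docx": "pandoc",
        "ui_automation": "PyWinAuto",
    }
    if task in direct:
        return direct[task]
    if task == "edit":
        if not complex_document:
            return "python-docx"
        # Complex documents: drive Word itself on Windows, otherwise go through Java
        return "win32com.client" if on_windows else "Apache POI via Py4J"
    raise ValueError(f"unknown task: {task!r}")
```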