SoFunction
Updated on 2024-10-28

Python Office Automation: Reading the Contents of Word Documents

Preface

In the previous chapters we learned about common file operations, such as creating, copying, moving, renaming, and deleting files. We also worked through some basic exercises, such as finding a file by name and finding a file by its content.

In this chapter and the ones that follow, we will start learning to automate operations on some special file types, such as Word, Excel, and PPT. Although we call them special, they are in fact the file types we use most often in everyday work.

Next, we move on to automating Word documents.

New modules covered in the chapter

python-docx

pdfkit

pydocx

Batch file reading with python

python-docx, a powerful tool for Word

python-docx is a Python library for creating and modifying Microsoft Word (.docx) documents. It provides a full set of Word operations and is one of the most commonly used Word libraries.

Before you use it, understand a few concepts:

  • Document: a Word document object. Unlike a Worksheet in VBA, each Document is independent: opening different Word documents produces different Document objects that do not affect each other.
  • Paragraph: a paragraph. A Word document consists of multiple paragraphs. Pressing Enter in the document starts a new paragraph; pressing Shift + Enter inserts a line break without starting a new paragraph.
  • Run: a run of text. Each paragraph consists of one or more runs, and consecutive text with the same style within a paragraph forms one run, so a paragraph object holds a list of Run objects.

For example, take the Word document shown in the schematic below:

The structure of this Word document breaks down as follows:

python-docx installation

Installation:

pip install python-docx

If the installation is too slow, switch to a domestic mirror source (below):

pip install -i /simple python-docx

Import:

import docx
from docx import …

The Document class of python-docx

Import packages and modules:

from docx import Document

Usage:

Document(path to the Word file)

Return Value:

A Word document object

Reading paragraph content with python-docx

Reading a Word document mainly means reading its paragraphs and its tables. Whether the content sits in a paragraph or in a table, it is stored as a string, and our goal is to read those strings.

Let's first look at how paragraph content is read:

Source:

document_obj.paragraphs: the paragraphs property of the document object returns a list of paragraph objects. If the Word file contains multiple paragraphs, the list contains one paragraph object for each.

Usage:

Loop through the list to get each paragraph object, then read its text attribute.

The demo case script is below:

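# coding:utf-8

import os
from docx import Document

# Build the path to the test file relative to the current working directory
path = os.path.join(os.getcwd(), 'test_file/text.docx')
print("The path of 'text.docx' is:", path)     # Debug: print the path

doc = Document(path)

# Print the text of every paragraph in the document
for p in doc.paragraphs:
    print(p.text)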

The output is as follows (PS: the text is just for demonstration; I am not a training organization!):

Reading table contents with python-docx

Next, let's look at how to read the contents of a table in a Word file:

Source:

document_obj.tables: the tables property of the document object returns a list of table objects.

Usage:

Likewise, loop through the rows of each table and then through the cells of each row.

Return Value:

The text of each table cell (a string)

The demo case code is below:

# coding:utf-8

import os
from docx import Document

# Build the path to the test file relative to the current working directory
path = os.path.join(os.getcwd(), 'test_file/text.docx')
print("The path of 'text.docx' is:", path)     # Debug: print the path

doc = Document(path)

# for p in doc.paragraphs:
#     print(p.text)

for t in doc.tables:            # Loop over every table object in the document
    for row in t.rows:          # Get each row of the table
        row_str = []
        for cell in row.cells:  # Get each cell in the row and collect its text
            row_str.append(cell.text)
        print(row_str)          # Print the row once all of its cells are collected

# You can also read a table column by column via "columns"; try it yourself!

The results of the run are as follows:
