preamble
In the previous chapters we covered common file operations: creating, copying, moving, renaming, and deleting files. We also worked through some basic exercises, such as finding a file by name and finding a file by its content.
In this chapter and the ones that follow, we begin automating operations on some special file types, such as Word, Excel, and PPT. They are called special here, but they are in fact the file types we use most often in everyday work.
Next, we move on to automating operations on Word documents.
New modules covered in the chapter
python-docx
pdfkit
pydocx
Batch file reading with python
The Word power tool: python-docx
python-docx is a Python library for creating and modifying Microsoft Word (.docx) documents. It provides a full set of Word operations and is one of the most commonly used Python tools for Word.
Before you use it, understand a few concepts:
- Document: a Word document object. Unlike a Worksheet in VBA, each Document is independent: opening a different Word file gives a different Document object, and the objects do not affect each other.
- Paragraph: a paragraph. A Word document consists of multiple paragraphs. Pressing Enter in the document starts a new paragraph; pressing Shift + Enter inserts a line break without starting a new paragraph.
- Run: a run of text. Each paragraph consists of one or more runs, and consecutive text with the same style within a paragraph forms one run, so a paragraph object holds a list of Run objects.
For example, consider the Word document in the schematic below:
The structure of the Word document breaks down as follows:
python-docx installation
Installation:
pip install python-docx
If the installation is too slow, switch to a domestic mirror source (below):
pip install -i /simple python-docx
Import:
import docx
from docx import …
Document of python-docx
Import packages and modules:
from docx import Document
Usage:
Document(path to the Word file)
Return Value:
A Word document object
Reading paragraph content with python-docx
To read a Word document, we mainly need to read its paragraphs and its tables. Both paragraphs and tables ultimately hold strings, and our goal is to read the contents of those strings.
Let's first look at how paragraph content is read:
Source:
document_obj.paragraphs: the paragraphs property of the document object returns a list of paragraph objects, one for each paragraph in the Word file.
Usage:
Loop over the list and read the text attribute of each paragraph object.
The demo case script is below:
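# coding:utf-8
import os

from docx import Document

# Build the path to the sample file relative to the current working directory
path = os.path.join(os.getcwd(), 'test_file/text.docx')
print("The path of 'text.docx' is:", path)  # debug: check the path

doc = Document(path)
for p in doc.paragraphs:  # loop over every paragraph object
    print(p.text)         # print the paragraph's text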
Running results are as follows: (PS: the text is just a demonstration, I am not a training organization!)
Reading table contents with python-docx
Next, we look at how to read the contents of a table in a Word file:
Source:
document_obj.tables: the tables property of the document object returns a list of table objects.
Usage:
Loop over the rows of each table, then over the cells in each row, and read the text attribute of each cell.
Return Value:
The text of each table cell (string)
The demo case code is below:
# coding:utf-8
import os

from docx import Document

# Build the path to the sample file relative to the current working directory
path = os.path.join(os.getcwd(), 'test_file/text.docx')
print("The path of 'text.docx' is:", path)  # debug: check the path

doc = Document(path)
# for p in doc.paragraphs:
#     print(p.text)

for t in doc.tables:  # loop over every table object
    for row in t.rows:  # get each row of the table
        row_str = []
        for cell in row.cells:  # get each cell in the row and collect its text
            row_str.append(cell.text)
        print(row_str)  # print the row once all of its cells are collected
# You can also get the contents of a table column by column via "columns"; try it yourself!
The results of the run are as follows: