A guide to automating Office document processing with Python

1. Automatically process Word documents

1. Install the python-docx library

python-docx is a library for reading, modifying, and creating Word (.docx) documents. Before you start, make sure it is installed. You can install it with the following command:

pip install python-docx

2. Read Word document content

Reading a Word document is straightforward: you can read its text one paragraph at a time. Here is a sample:

from docx import Document

# Open a Word document
doc = Document('')  # put the path to your .docx file between the quotes

# Iterate over the paragraphs in the document and print their text
for paragraph in doc.paragraphs:
    print(paragraph.text)

This code opens the document and prints its text paragraph by paragraph.

3. Modify Word document content

python-docx also allows you to modify the content of the document. For example, you can replace specific words in a document:

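from docx import Document

# Open a Word document
doc = Document('')  # put the path to your .docx file between the quotes

# Iterate over the paragraphs and replace specific words
for paragraph in doc.paragraphs:
    if 'old_word' in paragraph.text:
        new_text = paragraph.text.replace('old_word', 'new_word')
        paragraph.text = new_text

# Save the modified document
doc.save('modified_example.docx')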

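This code replaces every occurrence of old_word in the document's body paragraphs with new_word and saves the result as a new document. Note that assigning to paragraph.text replaces the paragraph's content with a single run, so character formatting inside a modified paragraph (bold, italics, and so on) is lost.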
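If the formatting inside a paragraph matters, one option is to do the replacement run by run instead of on the whole paragraph. The helper below is a minimal sketch of that idea; the name replace_in_runs is made up for this example, and it only relies on the paragraph having a runs list whose items have a text attribute, as python-docx paragraphs do. It will not match a word that Word has split across two runs.

```python
def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` inside each run of `paragraph`.

    Only run.text is touched, so each run keeps its own formatting.
    Returns the number of runs that were changed.
    (Hypothetical helper for illustration, not part of python-docx.)
    """
    changed = 0
    for run in paragraph.runs:
        if old in run.text:
            run.text = run.text.replace(old, new)
            changed += 1
    return changed
```

With python-docx, you would call replace_in_runs(paragraph, 'old_word', 'new_word') for each paragraph in doc.paragraphs and then save the document as before.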

4. Add new paragraphs and text

You can also add new paragraphs and text to the document:

from docx import Document

# Open a Word document
doc = Document('')  # put the path to your .docx file between the quotes

# Add a new paragraph
new_paragraph = doc.add_paragraph()

# Add text to the new paragraph
new_paragraph.add_run('This is a new paragraph added by Python.')

# Save the modified document
doc.save('modified_example.docx')

This code adds a new paragraph at the end of the document and writes the specified text.

5. Practical case: Batch adjustment of Word style

If you have many Word documents whose font, font size, paragraph format, and other styles need to be made uniform, python-docx is well suited to the job. Here is sample code that styles Word documents in batch:

import os
from docx import Document
from docx.shared import Pt

# Define a function that adjusts styles
def adjust_word_style(file_path):
    doc = Document(file_path)
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            run.font.name = 'Times New Roman'  # Set the font
            run.font.size = Pt(12)  # Set the font size (in points)
        paragraph.paragraph_format.line_spacing = 1.5  # Set line spacing
    doc.save(file_path)

# Specify the folder path
folder_path = 'your_folder_path'  # Replace with your folder path

# Iterate through all files in the folder
for file_name in os.listdir(folder_path):
    if file_name.endswith('.docx'):
        file_path = os.path.join(folder_path, file_name)
        adjust_word_style(file_path)

This code iterates over all .docx files in the specified folder, applies the same style to each one, and saves each file in place, overwriting the original.
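Because a batch job like this overwrites the originals, you may prefer to collect the input files first and write the styled copies to a separate folder. The sketch below shows one way to do that with pathlib; the function names find_docx_files and output_path_for are made up for this example. It also skips the temporary lock files whose names start with ~$, which Word creates next to a document while it is open.

```python
from pathlib import Path


def find_docx_files(folder):
    """Return the .docx files in `folder`, sorted by name.

    Word's temporary lock files (names starting with '~$') are skipped.
    """
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file()
        and path.suffix.lower() == '.docx'
        and not path.name.startswith('~$')
    )


def output_path_for(source, output_folder):
    """Build the path of the styled copy of `source` inside `output_folder`."""
    return Path(output_folder) / Path(source).name
```

In the batch loop you could then open each file with Document(str(path)) and save it with doc.save(str(output_path_for(path, 'styled'))), where 'styled' is an example folder name that must already exist.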

2. Automatically process Excel documents

1. Install the openpyxl and pandas libraries

openpyxl and pandas are two powerful tools for processing Excel documents. You can install them with the following command:

pip install openpyxl pandas

2. Use openpyxl to read and modify Excel files

openpyxl makes it easy to read and modify Excel files. Here is a sample:

import openpyxl

# Load an existing Excel file
workbook = openpyxl.load_workbook('')  # put the path to your .xlsx file between the quotes

# Select the active worksheet
sheet = workbook.active

# Read a cell value
cell_value = sheet['A1'].value
print(f"The value of cell A1 is: {cell_value}")

# Modify a cell value
sheet['A1'] = "New Value"

# Save the modified file
workbook.save('modified_example.xlsx')

This code opens the Excel file, reads the value of cell A1, changes it to "New Value", and saves the workbook as a new file.

3. Use pandas to read, clean and save Excel data

pandas is more flexible and powerful when it comes to processing Excel data. Here is sample code that uses pandas to read, clean, and save Excel data:

import pandas as pd

# Read the Excel file
data = pd.read_excel('')  # put the path to your .xlsx file between the quotes

# View the first five rows of the data
print(data.head())

# Data cleaning: drop rows that contain empty values
data = data.dropna()

# Data filtering: select specific columns
selected_columns = data[['Name', 'Age']]

# Data sorting: sort the selected columns by Age in descending order
sorted_data = selected_columns.sort_values(by='Age', ascending=False)

# Save the processed data to a new Excel file
sorted_data.to_excel('cleaned_data.xlsx', index=False)

This code reads the Excel file, drops the rows that contain empty values, selects the Name and Age columns, sorts them by Age in descending order, and saves the processed data as a new Excel file.

4. Practical case: Data extraction and summary

Extracting specific data from a complex Excel table and summarizing it is a common task. Here is sample code that totals the sales for each month from a sales data table:

import openpyxl

# Load the Excel workbook
wb = openpyxl.load_workbook('sales_data.xlsx')

# Select the active worksheet
sheet = wb.active

# Initialize a dictionary to store sales per month
monthly_sales = {}

# Iterate over the rows in the table (assuming the first row is the title row)
for row in range(2, sheet.max_row + 1):
    month = sheet.cell(row=row, column=2).value  # Assume that the month is in the second column
    sales_amount = sheet.cell(row=row, column=3).value  # Assume that sales are in the third column
    if month is None or sales_amount is None:
        continue  # Skip rows with an empty month or sales cell
    if month in monthly_sales:
        monthly_sales[month] += sales_amount
    else:
        monthly_sales[month] = sales_amount

# Print the total sales per month
for month, sales in monthly_sales.items():
    print(f"{month}: {sales}")

This code reads an Excel file named sales_data.xlsx, extracts the total sales of each month, and prints it out.
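The summing step can also be separated from the spreadsheet access, which makes it easier to check on its own. Below is a minimal sketch using collections.defaultdict; the function name sum_sales_by_month is made up for this example, and skipping rows with a missing month or a non-numeric amount is an assumption about how you want to treat incomplete data.

```python
from collections import defaultdict


def sum_sales_by_month(rows):
    """Total the sales amount per month from (month, amount) pairs.

    Rows with a missing month or a non-numeric amount are skipped.
    """
    totals = defaultdict(float)
    for month, amount in rows:
        if month is None or not isinstance(amount, (int, float)):
            continue
        totals[month] += amount
    return dict(totals)


# Example with in-memory rows; with openpyxl the pairs could come from the sheet.
sample_rows = [('Jan', 100), ('Feb', 50.5), ('Jan', 20), (None, 5), ('Mar', None)]
print(sum_sales_by_month(sample_rows))
```

With the column layout assumed above, the pairs could be read with sheet.iter_rows(min_row=2, min_col=2, max_col=3, values_only=True), which yields one (month, sales) tuple per row.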

3. Automatically process PDF documents

1. Install PyPDF2 and pdfplumber libraries

PyPDF2 and pdfplumber are two major tools for processing PDF documents. You can install them with the following command:

pip install PyPDF2 pdfplumber

2. Read and merge PDF files using PyPDF2

PyPDF2 can read a PDF file, report its number of pages, extract the text of a given page, and write pages out to a new PDF file, which is also the basis for merging several PDFs. Here is a sample:

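import PyPDF2

# Open the PDF file; it must stay open while pages are read,
# so the remaining steps are inside the with block
# (PdfReader and PdfWriter are the class names used by PyPDF2 3.x)
with open('', 'rb') as file:  # put the path to your PDF file between the quotes
    reader = PyPDF2.PdfReader(file)

    # Get the number of pages of the PDF file
    num_pages = len(reader.pages)
    print(f"The PDF file has {num_pages} pages")

    # Extract the text of the first page
    page = reader.pages[0]
    text = page.extract_text()
    print(f"The content of the first page is:\n{text}")

    # Create a new PDF file containing every page
    writer = PyPDF2.PdfWriter()
    for i in range(num_pages):
        page = reader.pages[i]
        writer.add_page(page)

    with open('new_example.pdf', 'wb') as output_file:
        writer.write(output_file)
print("New PDF file saved")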

This code opens the PDF file, prints the text of the first page, and then writes a new PDF file that contains all of its pages.
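Merging several PDF files follows the same pattern: read each file in turn and add its pages to one writer. The sketch below assumes PyPDF2 3.x (the PdfReader and PdfWriter classes); the helper names list_pdfs and merge_pdfs are made up for this example. PyPDF2 is imported inside merge_pdfs only so that the file-listing helper can be used on a machine where PyPDF2 is not installed.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted by name so the merge order is predictable."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == '.pdf'
    )


def merge_pdfs(pdf_paths, output_path):
    """Append every page of each file in `pdf_paths` to one new PDF.

    Returns the number of pages written.
    """
    from PyPDF2 import PdfReader, PdfWriter  # PyPDF2 3.x class names

    writer = PdfWriter()
    page_count = 0
    for path in pdf_paths:
        reader = PdfReader(str(path))
        for page in reader.pages:
            writer.add_page(page)
            page_count += 1
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)
    return page_count
```

For example, merge_pdfs(list_pdfs('your_pdf_folder_path'), 'merged.pdf') would combine every PDF in that folder in name order. Build the list before writing the output if the merged file goes into the same folder.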

3. Use pdfplumber to extract PDF text more accurately

pdfplumber often extracts PDF text more accurately than PyPDF2, especially on pages with complex layouts. Here is sample code that extracts PDF text with pdfplumber:

import pdfplumber

# Open the PDF file
with pdfplumber.open('') as pdf:  # put the path to your PDF file between the quotes
    # Get the number of pages of the PDF file
    num_pages = len(pdf.pages)
    print(f"The PDF file has {num_pages} pages")

    # Extract the text of the first page
    first_page = pdf.pages[0]
    text = first_page.extract_text()

    print(f"The content of the first page is:\n{text}")

4. Practical case: Batch extraction of table data from PDFs

When a PDF file contains tables, pdfplumber can extract their content as rows of cells. Here is sample code that extracts table data from every PDF file in a specified folder:

import os
import pdfplumber
import pandas as pd

# Specify the folder path
folder_path = 'your_pdf_folder_path'  # Replace with your folder path
output_data = []

# Iterate through all files in the folder
for file_name in os.listdir(folder_path):
    if file_name.endswith('.pdf'):
        file_path = os.path.join(folder_path, file_name)

        # Open the PDF file
        with pdfplumber.open(file_path) as pdf:
            # Assume that each PDF file has only one page containing tabular data
            page = pdf.pages[0]  # Adjust the page number according to the actual situation

            # Extract the table; extract_table() returns None when no table is found
            table = page.extract_table()

            # Add table data to the output list (the data structure can be adjusted as needed)
            output_data.append({
                'file_name': file_name,
                'table_data': table or [],
            })

# Print the extracted table data
for item in output_data:
    print(f"File name: {item['file_name']}")
    for row in item['table_data']:
        print(row)
    print("\n")

# To save the result as an Excel file, use a pandas DataFrame and its to_excel method.
# The data structure needs to be adjusted to fit a DataFrame; as one possible layout,
# each table row becomes a spreadsheet row prefixed with the file name.
output_data_reformatted = [
    [item['file_name'], *row]
    for item in output_data
    for row in item['table_data']
]
df = pd.DataFrame(output_data_reformatted)
df.to_excel('extracted_tables.xlsx', index=False)

Note: In practical applications, it may be necessary to adjust the code to adapt to the table structure and data format of different PDF files. Furthermore, if the tables in the PDF file span multiple pages, the code needs to be modified accordingly to iterate through all relevant pages.
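For a table that continues over several pages, one approach is to extract the table from each page and then join the results. The helper below is a minimal sketch of the joining step; the name merge_page_tables is made up for this example, and it assumes that the first row of the first table is the header and that a continuation page may repeat that same header row.

```python
def merge_page_tables(page_tables):
    """Concatenate per-page tables into one list of rows.

    `page_tables` is an iterable with one table per page, each a list of rows,
    or None for a page without a table. The first row of the first table is
    kept as the header; identical header rows on later pages are dropped.
    """
    merged = []
    header = None
    for table in page_tables:
        if not table:
            continue  # page without a table
        if header is None:
            header = table[0]
            merged.append(header)
            body = table[1:]
        elif table[0] == header:
            body = table[1:]  # continuation page that repeats the header
        else:
            body = table  # continuation page without a header row
        merged.extend(body)
    return merged
```

With pdfplumber this could be called as merge_page_tables(page.extract_table() for page in pdf.pages) inside the with block, in place of reading only pdf.pages[0].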

This article has covered the basic methods of using Python to automate the processing of Word, Excel, and PDF documents. These techniques can save time on repetitive office work and reduce the errors that come with manual operations. As you become more familiar with these libraries, you can explore their more advanced features to meet more complex document processing needs.
