Introduction
In day-to-day document processing, splitting a large Word document into multiple independent files is a common requirement. Splitting a document brings several benefits:
- Easier management: Large documents contain a lot of information and are complex to process and maintain. Splitting breaks the content into smaller parts, which simplifies management and updates.
- Easier collaboration: When several team members work on the same document at the same time, each member can take responsibility for a different part, which reduces conflicts and improves efficiency.
- Better performance: Large documents can slow software down when loading, editing, and saving. Smaller files reduce the impact of file size on performance and make operations smoother.
- Simpler version control: Changes to small files are easier to track, trace, and review, and they avoid repeated operations on the entire large document.
- Better organization and search: Splitting by chapter or topic helps classify and organize information and makes later searching and citing easier.
This article introduces four ways to split a Word document into multiple documents using Python:
- Split Word documents by section in Python
- Split Word documents by heading in Python
- Split Word documents by bookmark in Python
- Split Word documents into multiple HTML pages in Python
Tools used
To split a Word document in Python, you can use the for Python library.
for Python is mainly used to create, read, edit, and convert Word files in Python applications. It handles various Word formats, including Doc, Docx, Docm, Dot, Dotx, and Dotm. It can also convert Word documents to other file formats, such as PDF, RTF, HTML, plain text, images, and OFD/XPS/PostScript.
You can install for Python from PyPI by running the following command in the terminal:
pip install
Split Word documents by section in Python
In Word, sections divide a document into parts, and each section can have its own header, footer, page orientation, margins, and other formatting settings. Splitting a Word document by section saves each section as a separate file, which makes it easier to manage, edit, and collaborate on specific sections without affecting the entire document.
The main steps for splitting Word documents by section are as follows:
- Open the source document: Create an instance of the Document class and load the source Word document that needs to be split.
- Traverse the sections: Access each section in the source document one by one. For each section:
  - Create a new document: Generate a new Word document for the section.
  - Copy the section contents: Copy the current section from the source document to the new document.
  - Save the file: Save the new document as a separate file.
Implementation code:
from import *
from import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Test.docx")

    # Iterate over all sections in the document
    for sec_index in range():
        # Access the current section
        section = document.Sections[sec_index]

        # Create a new document for the current section
        with Document() as new_document:
            # Copy the current section to the new document
            new_document.(())

            # Copy the theme and styles of the source document to the new document to keep formatting consistent
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document as a separate file
            output_file = f"Output/Section{sec_index + 1}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)
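For intuition about where one section ends and the next begins, here is a minimal sketch that does not use the library above. It assumes the standard .docx layout, in which the body is stored as WordprocessingML in word/document.xml and each section is closed by a `w:sectPr` element: inside the `w:pPr` of the section's last paragraph for every section except the final one, and as a direct child of `w:body` for the final one. The helper names and the sample XML are illustrative only.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_text(paragraph):
    """Join the text of all w:t runs inside one paragraph."""
    return "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t"))


def group_paragraphs_by_section(document_xml):
    """Return one list of paragraph texts per section (hypothetical helper)."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    sections, current = [], []
    for child in body:
        if child.tag == f"{{{W}}}p":
            current.append(paragraph_text(child))
            # A sectPr inside pPr marks this paragraph as the last one of its section
            if child.find("w:pPr/w:sectPr", NS) is not None:
                sections.append(current)
                current = []
        elif child.tag == f"{{{W}}}sectPr":
            # The body-level sectPr closes the final section
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections


# Made-up sample with two sections
SAMPLE_XML = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:t>Cover page</w:t></w:r></w:p>
<w:p><w:pPr><w:sectPr/></w:pPr><w:r><w:t>End of section one</w:t></w:r></w:p>
<w:p><w:r><w:t>Chapter text</w:t></w:r></w:p>
<w:sectPr/>
</w:body></w:document>"""

sections = group_paragraphs_by_section(SAMPLE_XML)
for number, paragraphs in enumerate(sections, start=1):
    print(f"Section {number}: {paragraphs}")
```

For a real file, the same XML could be read with the standard library via `zipfile.ZipFile("Test.docx").read("word/document.xml")`. The sketch only inspects text and ignores tables, headers, and formatting, so it is a way to check how many sections to expect rather than a replacement for the splitting code.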
Split Word documents by heading in Python
Another common way to split a Word document is by heading. This method splits the document into multiple independent files at each paragraph that uses a specified heading style (such as "Heading1").
The main steps for splitting Word documents by heading are as follows:
- Open the source document: Create an instance of the Document class and load the source Word document to be split.
- Traverse the sections: Access each section in the source document one by one. For each section:
  - Identify the heading: Access each object in the section one by one, and treat every paragraph with the style "Heading1" as a split point.
  - Create a new document: When a "Heading1" paragraph is found, generate a new document and copy the heading paragraph into it.
  - Copy content: Continue copying content to the new document until the next "Heading1" paragraph is encountered.
  - Save the file: Save each new document as a separate file.
Implementation code:
from import *
from import *

# Load the source document
with Document() as source_document:
    source_document.LoadFromFile("Test.docx")

    # Initialize variables
    new_documents = []
    new_document = None
    new_section = None
    is_inside_heading = False

    # Iterate over all sections in the document
    for sec_index in range(source_document.):
        # Access the current section
        section = source_document.Sections[sec_index]

        # Iterate over all objects in the current section
        for obj_index in range():
            # Access the current object
            obj = [obj_index]

            # Check whether the current object is a paragraph
            if isinstance(obj, Paragraph):
                para = obj
                # Check whether the paragraph style is "Heading1"
                if == "Heading1":
                    # Add the previous document to the list
                    if new_document is not None:
                        new_documents.append(new_document)
                    # Create a new document
                    new_document = Document()
                    # Add a new section to the new document
                    new_section = new_document.AddSection()
                    # Copy the section properties of the source document to the section of the new document
                    section.CloneSectionPropertiesTo(new_section)
                    # Copy the heading paragraph into the section of the new document
                    new_section.(())
                    # Set the is_inside_heading flag to True
                    is_inside_heading = True
                else:
                    if is_inside_heading:
                        # Copy paragraphs that precede the next Heading1 into the section of the new document
                        new_section.(())
            else:
                if is_inside_heading:
                    # Copy non-paragraph objects into the section of the new document
                    new_section.(())

    # Add the last document to the list
    if new_document is not None:
        new_documents.append(new_document)

    # Iterate over all documents in the list
    for i, doc in enumerate(new_documents):
        # Copy the theme and styles of the source document to keep formatting consistent
        source_document.CloneThemesTo(doc)
        source_document.CloneDefaultStyleTo(doc)

        # Save the document as a separate file
        output_file = f"Output/Heading content{i + 1}.docx"
        doc.SaveToFile(output_file, FileFormat.Docx2016)
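The control flow above (start a new output at each "Heading1", keep appending until the next one) can be tried out on plain data without any Word library. The following is a minimal sketch under that assumption: each block is modelled as a `(style, text)` tuple, and `split_by_heading` is a hypothetical helper name, not part of any library.

```python
def split_by_heading(blocks, heading_style="Heading1"):
    """Group (style, text) pairs into chunks that each start at a heading."""
    chunks = []
    current = None  # stays None until the first heading is seen
    for style, text in blocks:
        if style == heading_style:
            # A heading opens a new chunk and becomes its first item
            current = [text]
            chunks.append(current)
        elif current is not None:
            # Everything after a heading belongs to the chunk it opened
            current.append(text)
        # Blocks that appear before the first heading are skipped
    return chunks


# Made-up sample content
blocks = [
    ("Normal", "Preface text"),
    ("Heading1", "Chapter 1"),
    ("Normal", "First paragraph"),
    ("Heading2", "Section 1.1"),
    ("Heading1", "Chapter 2"),
    ("Normal", "Second paragraph"),
]

chunks = split_by_heading(blocks)
for number, chunk in enumerate(chunks, start=1):
    print(f"Document {number}: {chunk}")
```

As in the implementation above, anything that comes before the first "Heading1" paragraph is not copied anywhere, so a preface or cover page would need separate handling if it has to be kept.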
Split Word documents by bookmark in Python
Bookmarks are markers that identify specific locations or ranges in a document. You can insert bookmarks wherever you want a custom split point, then generate separate files that match a specific structure or logic.
The main steps for splitting Word documents by bookmark are as follows:
- Open the source document: Create an instance of the Document class and load the source Word document to be split.
- Traverse the bookmarks: Access each bookmark in the source document one by one. For each bookmark:
  - Create a new document: Generate a new document for the bookmark.
  - Add a section: Add a new section to the new document.
  - Replace the bookmark content: Use the BookmarksNavigator class to extract the content of the current bookmark, insert a bookmark of the same name into the new document, and replace the content of the new bookmark with the extracted content.
  - Save the file: Save the new document as a separate file.
Implementation code:
from import *
from import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Test.docx")

    # Iterate over all bookmarks in the document
    for bookmark_index in range():
        # Access the current bookmark
        bookmark = [bookmark_index]

        # Create a new document for the current bookmark
        with Document() as new_document:
            # Add a new section to the new document
            new_section = new_document.AddSection()
            # Copy the section properties
            document.Sections[0].CloneSectionPropertiesTo(new_section)

            # Create a bookmark navigator for the source document
            bookmarks_navigator = BookmarksNavigator(document)
            # Navigate to the current bookmark
            bookmarks_navigator.MoveToBookmark()
            # Get the bookmark content
            textBodyPart = bookmarks_navigator.GetBookmarkContent()

            # Add a paragraph to the new document
            paragraph = new_section.AddParagraph()
            # Add the same bookmark to the paragraph
            ()
            ()

            # Create a bookmark navigator for the new document
            new_bookmarks_navigator = BookmarksNavigator(new_document)
            # Navigate to the newly added bookmark in the new document
            new_bookmarks_navigator.MoveToBookmark()
            # Replace the content of the new bookmark with the bookmark content from the source document
            new_bookmarks_navigator.ReplaceBookmarkContent(textBodyPart)

            # Copy the theme and styles of the source document to keep formatting consistent
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document as a separate file
            output_file = f"Output/Bookmark_{}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)
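Before splitting, it can help to see which bookmarks a document contains and roughly what each one covers. The sketch below is one way to do that with the standard library only. It assumes the usual WordprocessingML markup, in which a bookmark is delimited by a `w:bookmarkStart` element (carrying `w:id` and `w:name`) and a `w:bookmarkEnd` element with the same `w:id`. The function name and sample XML are illustrative, and only plain text is collected.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def bookmark_texts(document_xml):
    """Map each bookmark name to the text between its start and end markers."""
    open_ids = {}  # bookmark id -> bookmark name, for bookmarks currently open
    texts = {}
    # iter() walks the tree in document order, so markers and text are seen in sequence
    for element in ET.fromstring(document_xml).iter():
        if element.tag == f"{{{W}}}bookmarkStart":
            name = element.get(f"{{{W}}}name")
            open_ids[element.get(f"{{{W}}}id")] = name
            texts[name] = ""
        elif element.tag == f"{{{W}}}bookmarkEnd":
            open_ids.pop(element.get(f"{{{W}}}id"), None)
        elif element.tag == f"{{{W}}}t":
            # Text belongs to every bookmark that is open at this point
            for name in open_ids.values():
                texts[name] += element.text or ""
    return texts


# Made-up sample: one bookmark inside a paragraph, one spanning two paragraphs
SAMPLE_XML = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:t>Intro outside any bookmark. </w:t></w:r></w:p>
<w:p><w:bookmarkStart w:id="0" w:name="summary"/><w:r><w:t>Summary text.</w:t></w:r><w:bookmarkEnd w:id="0"/></w:p>
<w:p><w:bookmarkStart w:id="1" w:name="details"/><w:r><w:t>Detail one. </w:t></w:r></w:p>
<w:p><w:r><w:t>Detail two.</w:t></w:r><w:bookmarkEnd w:id="1"/></w:p>
</w:body></w:document>"""

found = bookmark_texts(SAMPLE_XML)
for name, text in found.items():
    print(f"{name}: {text!r}")
```

Because bookmarks can span several paragraphs or overlap, the sketch tracks open bookmarks by id instead of assuming that each start is immediately followed by its end.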
Split Word documents into multiple HTML pages in Python
Splitting a Word document into multiple HTML pages means dividing the document content and converting each part into an independent HTML web page. This lets the document be displayed as multiple pages in a browser, which makes browsing and navigation more flexible.
Here are the main steps to split a Word document into multiple HTML pages by section:
- Open the source document: Create an instance of the Document class and load the source Word document to be split.
- Traverse the sections: Access each section in the source document one by one. For each section:
  - Create a new document: Create a new document for the current section.
  - Copy the section contents: Copy the current section from the source document to the new document.
  - Embed CSS and images: Set the HTML export options of the new document so that CSS styles and images are embedded into the HTML page.
  - Save as an HTML document: Save the new document as an HTML file.
Implementation code:
from import *
from import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Test.docx")

    # Iterate over all sections in the document
    for sec_index in range():
        # Get the current section
        section = document.Sections[sec_index]

        # Create a new document
        new_document = Document()
        # Copy the current section to the new document
        new_document.(())

        # Copy the theme and styles of the source document to keep formatting consistent
        document.CloneThemesTo(new_document)
        document.CloneDefaultStyleTo(new_document)

        # Embed CSS styles and image data into the HTML page
        new_document. =
        new_document. = True

        # Save the new document as a separate HTML file
        output_file = f"Output/Section-{sec_index + 1}.html"
        new_document.SaveToFile(output_file, )
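To make the "embed CSS and images" step more concrete, here is a small standard-library sketch of what a self-contained HTML page looks like: the stylesheet sits in a `<style>` element and the image is inlined as a base64 data URI, so the page needs no external files. It is not the library's exporter; the function name, parameters, and sample values are assumptions made for illustration.

```python
import base64
import html


def build_page(title, paragraphs, css, image_bytes=None, image_type="image/png"):
    """Build one self-contained HTML page (hypothetical helper)."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{html.escape(title)}</title>",
        # Internal stylesheet: the CSS travels inside the page itself
        f"<style>{css}</style>",
        "</head>",
        "<body>",
    ]
    # Escape paragraph text so characters such as & and < display literally
    parts += [f"<p>{html.escape(text)}</p>" for text in paragraphs]
    if image_bytes is not None:
        # Inline the image as a data URI instead of linking to a separate file
        encoded = base64.b64encode(image_bytes).decode("ascii")
        parts.append(f'<img src="data:{image_type};base64,{encoded}" alt="">')
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


# Placeholder bytes stand in for real image data
page = build_page(
    "Section 1",
    ["Tom & Jerry"],
    "p { margin: 0; }",
    image_bytes=b"\x89PNG",
)
print(page)
```

The trade-off of embedding is file size: base64 makes image data roughly a third larger, but each page can then be moved or shared as a single file.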
In addition to splitting a Word document into HTML pages, you can adjust the FileFormat parameter to split it into other formats, such as PDF, XPS, and Markdown.