SoFunction
Updated on 2025-04-08

How to read .doc format files across platforms in Python

Introduction

In cross-platform development, handling different file formats is a common problem, and the legacy .doc format is especially awkward. Microsoft Word's binary .doc format is far less widely supported by libraries than .docx, so we need a reliable way to read and convert .doc files. This post presents a cross-platform, Python-based solution: use LibreOffice to convert .doc files to .docx, then extract the document text with docx2txt.

Solution overview

  • Goal: read .doc files in a cross-platform environment.
  • Steps:
    1. Use LibreOffice to convert the .doc file to .docx format.
    2. Use docx2txt to extract the content of the .docx file.

The benefits of this approach are:

  • LibreOffice is cross-platform and supports Linux, macOS, and Windows.
  • docx2txt is a lightweight library for extracting text content from .docx files.

Installing LibreOffice on each platform

Before using LibreOffice to convert .doc files, you need to install it on the operating system. The installation methods for the different operating systems are as follows:

1. Windows installation

  • Visit the LibreOffice download page on the official LibreOffice website.
  • Select the Windows version, then download and run the installer.
  • After the installation is complete, make sure the soffice command can be called from the command line. The installer may not add it to PATH; if the command is not found, add the LibreOffice program directory (typically C:\Program Files\LibreOffice\program) to the PATH environment variable.

2. macOS installation

  • Open the terminal and install LibreOffice using Homebrew:
brew install --cask libreoffice
  • After the installation is complete, the soffice command should be available. Run soffice in the terminal to confirm that the installation succeeded.

3. Linux installation

  • On Ubuntu or Debian systems, you can install LibreOffice using the following command:
sudo apt update
sudo apt install libreoffice
  • On Fedora, use the following command:
sudo dnf install libreoffice
  • After the installation is complete, use the soffice command to confirm that the installation succeeded.

4. Verify the installation

On any operating system, you can check from the command line whether soffice works:

soffice --version

If the version information of LibreOffice is output, the installation is successful.
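
The same check can be done from Python before attempting a conversion. The following is a minimal sketch, not part of the original solution: the function names find_soffice and soffice_version are hypothetical, and the fallback paths are only the typical default install locations on Windows and macOS, so they may differ on your machine.

```python
import os
import shutil
import subprocess

# Typical default install locations (assumptions; adjust for your system).
FALLBACK_PATHS = (
    r"C:\Program Files\LibreOffice\program\soffice.exe",
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",
)


def find_soffice(names=("soffice", "libreoffice"), fallbacks=FALLBACK_PATHS):
    """Return the path of a LibreOffice executable, or None if none is found."""
    # First look on PATH, which is what the command line would do.
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    # Then try the known install locations.
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


def soffice_version(executable):
    """Run `<executable> --version` and return what it prints."""
    result = subprocess.run(
        [executable, "--version"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

A script could call find_soffice() at startup and stop with a clear message when it returns None, rather than failing later in the middle of a conversion.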

Steps in detail

  1. Convert .doc to .docx with LibreOffice: LibreOffice provides a command-line tool, soffice, that can convert files in batches. It supports converting .doc files to .docx format, so even an old .doc file can be turned into the more modern and easier-to-process .docx format.

  2. Read the .docx file content: after the conversion is complete, we use the docx2txt library to read the text in the .docx file. docx2txt provides a simple API that extracts the text content of a document and returns it.
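
To make step 1 concrete, the sketch below builds the soffice argument list for one or several input files. The helper name build_convert_command is hypothetical; the flags are the ones used in this article, and passing several input files in a single call relies on soffice accepting multiple file arguments after the options.

```python
import os


def build_convert_command(doc_files, output_directory, executable="soffice"):
    """Return the argument list that converts one or more .doc files to .docx."""
    # Accept a single path as well as a list of paths.
    if isinstance(doc_files, (str, os.PathLike)):
        doc_files = [doc_files]
    return [
        executable,
        "--headless",             # do not start the GUI
        "--convert-to", "docx",   # target format
        "--outdir", os.fspath(output_directory),
        *[os.fspath(f) for f in doc_files],
    ]
```

The resulting list can be handed to subprocess.run(command, check=True). Because no shell is involved, paths that contain spaces need no quoting.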

Complete code implementation

Here is the complete code to implement the above solution:

import os
import subprocess

import docx2txt


def convert_doc_to_docx(doc_file, output_directory):
    """
    Use LibreOffice to convert a .doc file to .docx format.

    :raises RuntimeError: if the conversion fails
    """
    try:
        # Create the output directory if it does not exist
        os.makedirs(output_directory, exist_ok=True)

        # Pass the arguments as a list and skip the shell entirely: this rules
        # out command injection and needs no platform-specific quoting
        command = [
            "soffice", "--headless", "--convert-to", "docx",
            "--outdir", output_directory, doc_file,
        ]
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )

        print(f"{doc_file} converted to .docx")
    except subprocess.CalledProcessError as e:
        error_msg = f"Conversion failed: {e.stderr.strip()}" if e.stderr else "Unknown error"
        raise RuntimeError(f"{error_msg}\nCommand: {e.cmd}") from e
    except Exception as e:
        raise RuntimeError(f"An unexpected error occurred during the conversion: {e}") from e


def main():
    # Input file path (the file name is cut off in the source; append your .doc file name)
    doc_file = "D:/extracodes/open-webui/backend/docs/"

    # Write the converted file next to the input file
    output_directory = os.path.dirname(doc_file)

    # Derive the .docx path by swapping only the file extension
    docx_file = os.path.splitext(doc_file)[0] + ".docx"

    # Delete any existing .docx file of the same name
    if os.path.exists(docx_file):
        os.remove(docx_file)

    # Step 1: convert the .doc file to .docx
    convert_doc_to_docx(doc_file, output_directory)

    # Step 2: extract the text of the converted .docx file
    page_content = docx2txt.process(docx_file)
    print(page_content)


if __name__ == "__main__":
    main()

Code explanation

  1. The convert_doc_to_docx function: this function takes the .doc file path and the output directory as input and uses LibreOffice's command-line tool soffice to convert the .doc file to .docx format. The --headless parameter means the GUI is not started, and --convert-to specifies the output format.

  2. The main function:

    • First, define the .doc file path doc_file.
    • Then generate the target .docx file path by replacing the file suffix.
    • Check whether a .docx file with the same name already exists, and delete it if so.
    • Call the convert_doc_to_docx function to perform the conversion.
    • After the conversion is complete, use docx2txt.process to extract the text content of the .docx file and print it.
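
One detail in the suffix-replacement step deserves care: a plain string substitution of ".doc" also rewrites any earlier ".doc" in the path, for example in a directory name. The sketch below derives the target path from the path components instead. The helper name docx_path_for is hypothetical, and it assumes LibreOffice keeps the input's base name and writes the result into the --outdir directory.

```python
from pathlib import Path


def docx_path_for(doc_file, output_directory=None):
    """Return the expected path of the converted .docx file."""
    src = Path(doc_file)
    # Default to the input file's own directory, as the article's main() does.
    out_dir = Path(output_directory) if output_directory is not None else src.parent
    # with_suffix swaps only the final extension of the file name.
    return out_dir / src.with_suffix(".docx").name


# String substitution touches every ".doc", including the one in the folder name:
naive = "my.docs/report.doc".replace(".doc", ".docx")   # "my.docxs/report.docx"
# Working on path components changes only the file extension:
safe = docx_path_for("my.docs/report.doc")              # my.docs/report.docx
```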

Applicable environment

  • Linux/macOS/Windows: LibreOffice and docx2txt are both cross-platform, so the solution can be used on most operating systems.
  • Python environment: the docx2txt library is required and can be installed with pip install docx2txt.

Conclusion

By combining LibreOffice with docx2txt, we can process .doc files across platforms: convert them to .docx format, then extract the text content. The method is simple and reliable, and it suits most scenarios that involve batch processing of .doc files. If you need to handle more complex document content or support more file formats, you can also explore other Python libraries such as python-docx or PyWin32 (Windows only).
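
For the batch scenario, the first task is collecting the input files. The following is a minimal sketch under the assumption that all .doc files sit directly in one directory; the function name find_doc_files is hypothetical.

```python
from pathlib import Path


def find_doc_files(directory):
    """Return the .doc files located directly inside `directory`."""
    return sorted(
        p for p in Path(directory).iterdir()
        # Compare the extension case-insensitively so REPORT.DOC also matches,
        # while .docx files are left out.
        if p.is_file() and p.suffix.lower() == ".doc"
    )
```

Each returned path can then be converted and read with the functions shown earlier in the article.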
