Introduction
In a cross-platform development environment, handling different file formats is a common problem, and the legacy .doc format is especially troublesome. Microsoft Word's .doc format is not as widely supported as .docx, so we often need a reliable way to read and convert .doc files. This post introduces a cross-platform, Python-based solution: use LibreOffice to convert .doc files to .docx, then extract the document content with docx2txt.
Program Overview
- Goal: read .doc files in a cross-platform environment.
- Steps:
  - Use LibreOffice to convert the .doc file to .docx format.
  - Use docx2txt to extract the content of the .docx file.
The benefits of this approach are:
- LibreOffice is cross-platform and supports Linux, macOS, and Windows.
- docx2txt is a lightweight library for extracting text content from .docx files.
LibreOffice cross-platform installation
Before using LibreOffice to convert .doc files, you need to install it on your operating system. Here are the installation methods for different operating systems:
1. Windows installation
- Visit the LibreOffice download page on the official LibreOffice website.
- Select the Windows version, then download and run the installer.
- After the installation is complete, make sure the soffice command can be called from the command line. The installer may not add it to PATH; if the command is not found, add the LibreOffice program directory (the folder containing soffice.exe) to the PATH environment variable.
2. macOS installation
- Open the terminal and install LibreOffice using Homebrew:
brew install --cask libreoffice
- After the installation is completed, the soffice command should be available. Run it in the terminal to confirm that the installation succeeded.
3. Linux installation
- On Ubuntu or Debian systems, you can install LibreOffice using the following commands:
sudo apt update
sudo apt install libreoffice
- On Fedora, use the following command:
sudo dnf install libreoffice
- After the installation is completed, you can use the soffice command to confirm that the installation succeeded.
4. Verify the installation
On any operating system, you can check from the command line whether soffice works normally:
soffice --version
If the LibreOffice version information is printed, the installation succeeded.
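The same check can be scripted from Python, which is handy to run before attempting a conversion. The following is a minimal sketch rather than part of the original program: find_soffice and soffice_version are hypothetical helper names, and the fallback paths are the usual default install locations on Windows and macOS, which is an assumption that may not hold on your machine.

```python
import os
import shutil
import subprocess

# Usual default install locations (assumptions; adjust for your system)
FALLBACK_PATHS = [
    r"C:\Program Files\LibreOffice\program\soffice.exe",
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",
]


def find_soffice(command="soffice", fallbacks=FALLBACK_PATHS):
    """Return the path to the soffice executable, or None if it is not found."""
    # First look on PATH, exactly as the shell would
    found = shutil.which(command)
    if found:
        return found
    # Then try the assumed default install locations
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


def soffice_version(executable):
    """Run `soffice --version` and return its output as text."""
    result = subprocess.run(
        [executable, "--version"],
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_soffice()
    if path is None:
        print("soffice not found; install LibreOffice or add it to PATH")
    else:
        print(path)
        print(soffice_version(path))
```

If find_soffice returns a full path instead of finding the command on PATH, that path can be used in place of the bare soffice command when running the conversion.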
Detailed steps
- Convert .doc to .docx with LibreOffice: LibreOffice provides a command-line tool, soffice, that can batch-process file conversions. It supports converting .doc files to .docx, so even files in the old .doc format can be converted to the more modern and easier-to-process .docx format.
- Read the .docx file content: after the conversion is complete, we use the docx2txt library to read the text in the .docx file. docx2txt provides a simple API that extracts the text content from a document and returns it.
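Since soffice accepts several input files in one call, the batch case can be sketched as follows. This is an illustration under stated assumptions, not the article's program: build_convert_command and convert_directory are hypothetical helpers, and the sketch assumes soffice is on PATH and that all .doc files sit directly in one directory.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_files, output_directory):
    """Build the soffice argument list for converting .doc files to .docx."""
    return [
        "soffice", "--headless",
        "--convert-to", "docx",
        "--outdir", str(output_directory),
        *[str(f) for f in doc_files],
    ]


def convert_directory(input_directory, output_directory):
    """Convert every .doc file found directly in input_directory.

    Returns the paths where the converted .docx files are expected.
    """
    doc_files = sorted(Path(input_directory).glob("*.doc"))
    if not doc_files:
        return []
    # One soffice process handles the whole batch
    subprocess.run(build_convert_command(doc_files, output_directory), check=True)
    return [Path(output_directory) / (f.stem + ".docx") for f in doc_files]
```

Building the command as a list, instead of a single shell string, means paths containing spaces need no extra quoting.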
Complete code implementation
Here is the complete code to implement the above solution:
import os
import subprocess

import docx2txt


def convert_doc_to_docx(doc_file, output_directory):
    """
    Use LibreOffice to convert a .doc file to .docx format.

    :raises RuntimeError: if the conversion fails
    """
    try:
        # Create the output directory if it does not exist
        os.makedirs(output_directory, exist_ok=True)

        # Pass the arguments as a list (no shell), so paths containing
        # spaces or special characters are safe on every platform
        command = [
            "soffice", "--headless",
            "--convert-to", "docx",
            "--outdir", output_directory,
            doc_file,
        ]
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        print(f"{doc_file} converted to .docx format")
    except subprocess.CalledProcessError as e:
        error_msg = f"Conversion failed: {e.stderr.strip()}" if e.stderr else "Unknown error"
        raise RuntimeError(f"{error_msg}\nCommand: {e.cmd}") from e
    except Exception as e:
        raise RuntimeError(f"An unexpected error occurred during the conversion: {e}") from e


def main():
    # Input file path (point this at the .doc file to convert)
    doc_file = "D:/extracodes/open-webui/backend/docs/"
    # Write the output to the directory that contains doc_file
    output_directory = os.path.dirname(doc_file)
    # Derive the .docx path by swapping the file extension
    docx_file = os.path.splitext(doc_file)[0] + ".docx"
    # Delete any existing .docx file with the same name
    if os.path.exists(docx_file):
        os.remove(docx_file)

    # Step 1: convert the .doc file to .docx
    convert_doc_to_docx(doc_file, output_directory)

    # Step 2: extract the text from the converted .docx file
    page_content = docx2txt.process(docx_file)
    print(page_content)


if __name__ == "__main__":
    main()
Code explanation
- convert_doc_to_docx function: accepts the .doc file path and the output directory as input, and uses LibreOffice's command-line tool soffice to convert the .doc file to .docx format. The --headless parameter means the GUI is not started, and --convert-to specifies the output format.
- main function:
  - First, define the .doc file path doc_file.
  - Then generate the target .docx file path by replacing the file suffix.
  - Check whether a .docx file with the same name already exists, and delete it if so.
  - Call the convert_doc_to_docx function to perform the conversion.
  - After the conversion is completed, use docx2txt.process to extract the text content from the .docx file and print it.
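When deriving the .docx path, it matters that only the file extension changes. The short sketch below compares a plain string replacement with pathlib's with_suffix. The helper name docx_path_for and the example path are made up for illustration.

```python
from pathlib import Path


def docx_path_for(doc_file):
    """Return doc_file's path with its extension changed to .docx."""
    return Path(doc_file).with_suffix(".docx")


# str.replace rewrites every ".doc" in the path, not just the extension
naive = "reports.docs/summary.doc".replace(".doc", ".docx")
print(naive)  # reports.docxs/summary.docx

# with_suffix only touches the final extension
print(docx_path_for("reports.docs/summary.doc").as_posix())  # reports.docs/summary.docx
```

The difference only shows up when ".doc" appears elsewhere in the path, but a directory name like the one above is enough to make the conversion step look for the wrong file.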
Applicable environment
- Linux/macOS/Windows: LibreOffice and docx2txt are both cross-platform, so the solution can be used on most operating systems.
- Python environment: the docx2txt library must be installed, which can be done with pip install docx2txt.
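Both requirements can be checked up front so that a missing piece is reported clearly instead of surfacing as an error halfway through a conversion. This is a rough sketch with a hypothetical helper, missing_prerequisites, and it assumes soffice is expected on PATH.

```python
import importlib.util
import shutil


def missing_prerequisites(module="docx2txt", command="soffice"):
    """Return a list describing which prerequisites are not available."""
    missing = []
    # Is the Python package importable in this environment?
    if importlib.util.find_spec(module) is None:
        missing.append(f"Python package: {module}")
    # Is the LibreOffice command-line tool on PATH?
    if shutil.which(command) is None:
        missing.append(f"command: {command}")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("All prerequisites found")
```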
Conclusion
By combining LibreOffice with docx2txt, we can conveniently process .doc files across platforms: convert them to .docx format, then extract the text content. This method is simple and reliable, and is suitable for most scenarios that batch-process .doc files. If you need to deal with more complex file content or support more file formats, you can also explore other Python libraries such as python-docx or PyWin32 (Windows only).