Using pypandoc in Python to convert Markdown files and LaTeX formulas to Word

Most answers generated by large language models such as Tongyi Qianwen are in Markdown, and they often need to be converted into Word files.

Introduction to pypandoc

1. Project introduction

pypandoc is a lightweight Python wrapper for pandoc. Pandoc is a general-purpose document converter that supports many formats, such as Markdown, HTML, LaTeX, and DocBook. pypandoc provides a simple Python interface that makes it convenient to call pandoc from Python scripts.

2. Installation

Install using pip

pip install pypandoc_binary

This installs Pandoc automatically along with the package.

Note: pypandoc provides two packages:

pypandoc: requires you to install pandoc yourself before you can use it.

pypandoc_binary: bundles a precompiled pandoc binary, which makes it convenient to get started quickly.

Manual installation

Alternatively, install pandoc manually and then install the pypandoc library:

pip install pypandoc

You can also install pypandoc first and then call pypandoc.download_pandoc() from Python, which downloads and installs Pandoc automatically and stores it in a directory that pypandoc can access.
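The installation options above can be checked from Python. The following is a minimal sketch, assuming the plain pypandoc package is installed: find_system_pandoc and ensure_pandoc are hypothetical helper names, and the fallback relies on pypandoc.get_pandoc_version() raising OSError when no pandoc executable is found.

```python
import shutil


def find_system_pandoc():
    """Return the path of a pandoc executable on PATH, or None if there is none."""
    return shutil.which("pandoc")


def ensure_pandoc():
    """Return the pandoc version, downloading pandoc through pypandoc if needed."""
    import pypandoc  # imported lazily so the PATH check above works without it

    try:
        return pypandoc.get_pandoc_version()
    except OSError:
        # Assumption: OSError here means pypandoc could not locate pandoc
        pypandoc.download_pandoc()
        return pypandoc.get_pandoc_version()


print(find_system_pandoc())
```

With pypandoc_binary installed, the fallback branch should not normally be needed, because the binary ships with the package.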

2. Use Python to convert markdown to Word

This script does three things:

1. Converts a Markdown file to a Word file.

2. Replaces the "- " at the beginning of a paragraph in the Markdown with a line break, so that Word does not render it as a black dot, hollow circle, or other unwanted bullet symbol.

3. Applies a custom template to format the output.

import pypandoc
import time
import re

# Define paths
path1 = r""
path2 = r".docx"
template_path = r"D:\aTools\ytemplates\templates_s.docx"

# Read the contents of the original Markdown file
with open(path1, 'r', encoding='utf-8') as file:
    content = file.read()

# Use a regular expression to replace every '- ' with a newline
processed_content = re.sub(r'- ', '\n', content)

# Record the start time
t1 = time.time()

# Convert the processed content to a Word document
pypandoc.convert_text(
    processed_content,
    'docx',
    format='md',
    outputfile=path2,
    extra_args=['--reference-doc', template_path]
)

# Print the elapsed time
print(time.time() - t1)
print("The conversion is complete!")
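Note that the pattern r'- ' in the script matches every hyphen followed by a space, including ones in the middle of a sentence. If only list markers at the start of a line should be replaced, an anchored pattern is one option. This is a sketch under that assumption, and strip_list_markers is a hypothetical helper name:

```python
import re

# Match '- ' only at the start of a line, allowing leading spaces or tabs
LIST_MARKER = re.compile(r'^[ \t]*- ', flags=re.MULTILINE)


def strip_list_markers(markdown_text):
    """Replace each leading list marker with a newline, leaving other hyphens alone."""
    return LIST_MARKER.sub('\n', markdown_text)


sample = "Intro - not a list\n- first\n  - nested"
print(repr(strip_list_markers(sample)))
```

The hyphen inside "Intro - not a list" is left untouched, while both list items lose their markers.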

3. Directly specify Word format

Read the file directly (it can be a .txt or .md file) and convert it to a Word document with the specified formatting.

The formatting rules are:

1. Replace the "- " at the beginning of a paragraph in the Markdown with a line break, so that Word does not render it as a black dot, hollow circle, or other unwanted bullet symbol.

2. Keep the originally bold text bold, and left-align it.

3. The font is 仿宋_GB2312 (FangSong_GB2312), in black.

Note: when headings are replaced with regular expressions one level at a time, the level-4 marker #### must be handled first. Otherwise a shorter pattern matches part of it, which is a logic error that leaves an odd number of # characters that cannot be replaced.
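One way to sidestep the ordering problem described in the note is to capture the whole run of # characters with a single pattern and derive the level from its length. A minimal sketch of that idea (parse_heading is a hypothetical helper name used only for illustration):

```python
import re

# One pattern for all heading levels: 1 to 4 '#' characters, then the title text
HEADING = re.compile(r'^\s*(#{1,4})\s*(.*?)\s*$')


def parse_heading(line):
    """Return (level, title) for a Markdown heading line, or None otherwise."""
    match = HEADING.match(line)
    if not match or not match.group(2):
        return None  # not a heading, or a heading with no text
    return len(match.group(1)), match.group(2)


for line in ["#### Details", "## **Bold** title", "plain text"]:
    print(parse_heading(line))
```

Because the quantifier is greedy, "####" is consumed as one level-4 marker rather than as four level-1 markers.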

When setting the Chinese font, you cannot rely on style.font.name = '仿宋_GB2312'; instead, use style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋_GB2312') to set the East Asian font.

import re
from docx import Document
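from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn

def set_font_color(run):
    """Apply the font, size, and color settings to a single run"""
    run.font.name = 'Times New Roman'
    run._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋_GB2312')
    run.font.size = Pt(12)
    run.font.color.rgb = RGBColor(0, 0, 0)
    run.italic = False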

def process_content(line, paragraph):
    """General content processing function"""
    bold_pattern = re.compile(r'\*\*(.*?)\*\*')
    matches = list(bold_pattern.finditer(line))

    if not matches:
        run = paragraph.add_run(line)
        set_font_color(run)
    else:
        start = 0
        for match in matches:
            # Plain text before the bold segment
            if match.start() > start:
                run = paragraph.add_run(line[start:match.start()])
                set_font_color(run)
            # The bold segment itself, without the ** markers
            run = paragraph.add_run(match.group(1))
            run.bold = True
            set_font_color(run)
            start = match.end()
        if start < len(line):
            run = paragraph.add_run(line[start:])
            set_font_color(run)

def mdtxt2word(txt_path, docx_path):
    with open(txt_path, 'r', encoding='utf-8') as file:
        content = re.sub(r'- ', '\n', file.read())

    doc = Document()
    style = doc.styles['Normal']
    style.font.name = 'Times New Roman'
    style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋_GB2312')
    style.font.size = Pt(12)
    style.font.color.rgb = RGBColor(0, 0, 0)

    # Merged heading regular expression
    heading_pattern = re.compile(
        r'^\s*(#{1,4})\s*(.*?)\s*$'  # Match headings that start with 1-4 '#'
    )

    for line in content.split('\n'):
        # Handle all heading levels
        heading_match = heading_pattern.match(line)
        if heading_match:
            level = len(heading_match.group(1))  # The level is the number of '#'
            title_text = heading_match.group(2).strip()

            if not title_text:
                continue  # Skip empty headings

            # Create a heading at the corresponding level
            heading = doc.add_heading(level=min(level, 4))  # Limit to level 4
            heading.alignment = WD_ALIGN_PARAGRAPH.LEFT

            # Handle bold marks in the heading text
            process_content(title_text, heading)
            continue

        # Handle ordinary paragraphs
        paragraph = doc.add_paragraph()
        paragraph.alignment = WD_ALIGN_PARAGRAPH.LEFT
        process_content(line, paragraph)

    doc.save(docx_path)
    print("The conversion is complete!")

if __name__ == "__main__":
    txt_path = r"C:\Users\xueshifeng\Desktop\"
    docx_path = r"C:\Users\xueshifeng\Desktop\"
    mdtxt2word(txt_path, docx_path)
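The bold handling in process_content can be tried on its own, without python-docx. The sketch below uses the same regular expression but returns (text, is_bold) pairs instead of adding runs to a paragraph; split_bold is a hypothetical name introduced for illustration.

```python
import re

BOLD = re.compile(r'\*\*(.*?)\*\*')


def split_bold(line):
    """Split a line into (text, is_bold) segments based on **...** markers."""
    segments = []
    start = 0
    for match in BOLD.finditer(line):
        if match.start() > start:
            segments.append((line[start:match.start()], False))
        segments.append((match.group(1), True))
        start = match.end()
    if start < len(line):
        segments.append((line[start:], False))
    return segments


print(split_bold("Use **pypandoc** for conversion"))
```

Each pair corresponds to one run in the Word paragraph, which makes the splitting logic easy to check before any document is written.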

4. Convert a LaTeX formula to Word

Replace the expression between the $ delimiters in the latex_content string with your own formula, or paste the code into GPT and let it modify the code for you.

import pypandoc

# Define a LaTeX string containing a specific formula.
# Replace the text between the $ delimiters with your formula, or paste the code
# into GPT and let it generate the final code.
latex_content = r"""
\documentclass{article}
\usepackage{amsmath} % Make sure to include packages for mathematical typography
\begin{document}

$ L(y_i, f(x_i)) = \max(0, 1 - y_if(x_i)) $


\end{document}
"""

# Convert the LaTeX content to a Word document
output_file = r""

output = pypandoc.convert_text(
    latex_content,           # Input string
    'docx',                  # Output format
    format='latex',          # Input format (LaTeX)
    outputfile=output_file,  # Output file path
    extra_args=['--mathml']  # Extra argument requesting MathML rendering of formulas
)

# Check whether the conversion succeeded
if output != '':
    print(f"An error occurred during the conversion process: {output}")
else:
    print(f"Word Document generated: {output_file}")
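Instead of editing the string by hand each time, the LaTeX wrapper can be generated from the formula. This is a small sketch, not part of the original script: build_latex_document is a hypothetical helper, and the display flag (which switches from inline $...$ to display \[...\] math) is an added assumption.

```python
def build_latex_document(formula, display=False):
    """Wrap a formula in a minimal LaTeX document, inline by default."""
    body = rf"\[ {formula} \]" if display else f"${formula}$"
    return "\n".join([
        r"\documentclass{article}",
        r"\usepackage{amsmath}",
        r"\begin{document}",
        body,
        r"\end{document}",
    ])


print(build_latex_document(r"L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))"))
```

The returned string can be passed to pypandoc.convert_text in place of latex_content.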

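5. Convert a LaTeX formula to Word and append it to an existing Word document

The difficulty lies in managing file handles: while the target document is open in Word, the file is locked and cannot be written. I found no clean solution, so the script first closes the open document, appends the content, and then opens the document again.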
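The "wait until the file is released" part of this approach can be sketched separately with the standard library only. The helper names can_open_for_append and wait_until_writable are hypothetical, and the retry count and delay are arbitrary defaults. Unlike a bare open(path, 'a'), this variant checks for existence first so that it never creates a missing file.

```python
import os
import tempfile
import time


def can_open_for_append(filepath):
    """Return True if the file exists and can be opened for appending."""
    if not os.path.exists(filepath):
        return False
    try:
        with open(filepath, 'a'):
            return True
    except PermissionError:
        return False  # typically means another program holds a lock on the file


def wait_until_writable(filepath, retries=3, delay=0.5):
    """Poll the file a few times and return True as soon as it is writable."""
    for attempt in range(retries):
        if can_open_for_append(filepath):
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False


# Usage with a throwaway file
with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp:
    demo_path = tmp.name
print(wait_until_writable(demo_path))
os.remove(demo_path)
```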

import os
import pypandoc
from docx import Document
import tempfile
import time
import pythoncom
from win32com.client import Dispatch  # Requires the pywin32 library

def is_file_locked(filepath):
    try:
        with open(filepath, 'a'):
            return False
    except PermissionError:
        return True
    except FileNotFoundError:
        return False

def close_word_document(filepath):
    try:
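        word = Dispatch("Word.Application")
        for doc in word.Documents:
            # Compare full paths case-insensitively to find the target document
            if doc.FullName.lower() == os.path.abspath(filepath).lower():
                doc.Save()
                doc.Close()
                print("Saved and closed the Word document")
                return True
        word.Quit()
    except Exception as e:
        print(f"Failed to close the Word document: {str(e)}")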
    return False

def generate_latex_content(formula):
    """Generate complete LaTeX document content"""
    return fr"""
    \documentclass{{article}}
    \usepackage{{amsmath}}
    \begin{{document}}

    Start:

    ${formula}$

    End.
    \end{{document}}
    
    """
    
def doc_creat(user_formula, output_file):
    # Check whether the file exists
    if not os.path.exists(output_file):
        # Create a new document object
        doc = Document()
        # Save the document
        doc.save(output_file)
        print(f"File created: {output_file}")
        document = Document(output_file)
    else:
        print("File opened")

    retry_count = 3
    for _ in range(retry_count):
        if is_file_locked(output_file):
            print("The file is in use, trying to close the Word document...")
            if close_word_document(output_file):
                time.sleep(0.5)  # Wait for the system to release the file
                continue
            else:
                print("Error: The file is occupied by other programs. Please close it manually and try again!")
                break

        try:
            with tempfile.NamedTemporaryFile(delete=False, suffix=".tex") as temp_tex_file:
                latex_content = generate_latex_content(user_formula)
                temp_tex_file.write(latex_content.encode('utf-8'))
                temp_tex_file_name = temp_tex_file.name

            with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as temp_docx_file:
                temp_docx_file_name = temp_docx_file.name

            # Convert LaTeX to Word
            pypandoc.convert_file(
                temp_tex_file_name, 'docx',
                outputfile=temp_docx_file_name, extra_args=['--mathjax']
            )

            # Create or open the target document
            target_doc = Document(output_file) if os.path.exists(output_file) else Document()
            temp_doc = Document(temp_docx_file_name)

            # Copy all body elements into the target document
            for element in temp_doc.element.body:
                target_doc.element.body.append(element)

            # Save the target document
            target_doc.save(output_file)
            print(f"Content has been successfully added to: {output_file}")

            # Automatically open the document in Word
            os.startfile(output_file)
            break

        except PermissionError:
            print("File permissions are incorrect, please check whether the file is occupied by other programs")
            break
        except Exception as e:
            print(f"Operation failed: {str(e)}")
            break
        finally:
            # Remove the temporary files if they were created
            if 'temp_tex_file_name' in locals() and os.path.exists(temp_tex_file_name):
                os.remove(temp_tex_file_name)
            if 'temp_docx_file_name' in locals() and os.path.exists(temp_docx_file_name):
                os.remove(temp_docx_file_name)
    else:
        print("The number of retry has reached the upper limit, please check the file status")

if __name__ == '__main__':
    # User input formula (example)
    user_formula = r"\frac{\sqrt{x^2 + y^2}}{z}"
    # Output file path
    output_file = r"C:\Users\xueshifeng\Desktop\"
    
    doc_creat(user_formula, output_file)
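The script tracks its temporary files by name and removes them in the finally block. An alternative worth considering is tempfile.TemporaryDirectory, which deletes everything inside it when the with block exits. The sketch below only writes the .tex file; the file name formula.tex and the helper write_temp_tex are hypothetical, and the pypandoc call is left as a comment so the example runs without pandoc.

```python
import os
import tempfile


def write_temp_tex(latex_content, workdir):
    """Write LaTeX source into workdir and return the path of the .tex file."""
    tex_path = os.path.join(workdir, "formula.tex")
    with open(tex_path, "w", encoding="utf-8") as tex_file:
        tex_file.write(latex_content)
    return tex_path


with tempfile.TemporaryDirectory() as workdir:
    tex_path = write_temp_tex(r"$\frac{a}{b}$", workdir)
    docx_path = os.path.join(workdir, "formula.docx")
    # pypandoc.convert_file(tex_path, 'docx', outputfile=docx_path) would go here
    exists_inside = os.path.exists(tex_path)

exists_after = os.path.exists(tex_path)
print(exists_inside, exists_after)
```

Any content that should survive, such as the elements copied into the target document, has to be read before the with block ends.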
