Large language models such as Tongyi Qianwen usually return their answers in Markdown, and these answers often need to be converted into Word files.
A pypandoc introduction
1. Project introduction
pypandoc is a lightweight Python wrapper for pandoc. pandoc is a general document conversion tool that supports document conversion in multiple formats, such as Markdown, HTML, LaTeX, DocBook, etc. pypandoc makes calling pandoc in Python scripts more convenient by providing a simple Python interface.
2. Installation
Install using pip
pip install pypandoc_binary
With this package, Pandoc is downloaded and installed automatically.
Note: pypandoc provides two packages:
pypandoc: requires you to install the pandoc software yourself before use.
pypandoc_binary: contains precompiled pandoc binary files, which is convenient for users to get started quickly.
Manual installation
You can install pandoc manually and then install the pypandoc library:
pip install pypandoc
You can also install pypandoc first and then call pypandoc.download_pandoc() in Python, which downloads and installs Pandoc automatically and stores it in a directory that pypandoc can access.
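Before converting anything, it can help to confirm which pandoc binary pypandoc will actually use. The following is a minimal sketch, assuming pypandoc.get_pandoc_version() and pypandoc.get_pandoc_path() raise OSError when no pandoc is found; the helper name pandoc_status is made up for illustration.

```python
def pandoc_status():
    """Describe which pandoc binary, if any, pypandoc would use."""
    try:
        import pypandoc  # imported lazily so the check also works without it
    except ImportError:
        return "pypandoc is not installed"
    try:
        version = pypandoc.get_pandoc_version()
        path = pypandoc.get_pandoc_path()
    except OSError:
        # No pandoc binary found; pypandoc.download_pandoc() can fetch one
        return "pypandoc is installed, but no pandoc binary was found"
    return f"pandoc {version} found at {path}"


if __name__ == "__main__":
    print(pandoc_status())
```

If the second message appears, either install pandoc manually or switch to the pypandoc_binary package.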
2. Use Python to convert markdown to Word
This script does three things:
1. Converts a Markdown file to a Word file.
2. Replaces the "- " list marker at the start of a paragraph with a line break, so that Word does not render it as a solid dot, a hollow circle, or another unusual symbol.
3. Uses a custom template to control the output format.
import re
import time

import pypandoc

# Define paths
path1 = r""
path2 = r".docx"
template_path = r"D:\aTools\ytemplates\templates_s.docx"

# Read the contents of the original Markdown file
with open(path1, 'r', encoding='utf-8') as file:
    content = file.read()

# Use a regular expression to replace every '- ' with a newline
processed_content = re.sub(r'- ', '\n', content)

# Record the start time
t1 = time.time()

# Convert the processed content to a Word document
pypandoc.convert_text(
    processed_content,
    'docx',
    format='md',
    outputfile=path2,
    extra_args=['--reference-doc', template_path]
)

# Print the elapsed time in seconds
print(time.time() - t1)
print("The conversion is complete!")
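The pattern `- ` used in the script matches every hyphen followed by a space, including ones in the middle of a sentence. If only list markers at the start of a line should be replaced, a line-anchored pattern is one option. This is a sketch, not part of the original script; the function name strip_list_markers is hypothetical.

```python
import re

# Match "- " only at the start of a line (optionally indented)
LIST_MARKER = re.compile(r'^[ \t]*- ', flags=re.MULTILINE)


def strip_list_markers(markdown_text):
    """Replace leading "- " list markers with a newline, leaving other hyphens alone."""
    return LIST_MARKER.sub('\n', markdown_text)


sample = "Intro - with a dash\n- first item\n  - nested item\n"
print(repr(strip_list_markers(sample)))
```

The hyphen in "Intro - with a dash" is left untouched, while both list items lose their marker.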
3. Directly specify Word format
Read the file directly (it can be txt or md) and convert it to a Word document with a specified format.
The format rules are:
1. Replace the "- " list marker at the start of a paragraph with a line break, so that Word does not render it as a solid dot, a hollow circle, or another unusual symbol.
2. Keep the originally bold text bold, and left-align all text.
3. The font is 仿宋_GB2312 (FangSong_GB2312), in black.
Note: when replacing heading markers with regular expressions, handle #### (level-4 headings) first. Otherwise a shorter pattern consumes part of a longer marker, and the leftover # characters can no longer be matched.
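The ordering problem in this note can be reproduced with plain `re`. The sketch below uses a made-up `[H4]`-style tag as the replacement purely for illustration; the point is only that longer markers must be handled before shorter ones, or that all levels are matched by a single pattern such as `#{1,4}`.

```python
import re


def tag_headings(text, levels):
    """Replace '#' heading markers with [Hn] tags, processing levels in the given order."""
    for level in levels:
        pattern = re.compile(r'^#{%d}[ \t]*' % level, flags=re.MULTILINE)
        text = pattern.sub('[H%d] ' % level, text)
    return text


sample = "#### Details\n# Title"

# Shortest first: the level-1 pattern eats one '#' of '####' and strands the rest
print(tag_headings(sample, [1, 2, 3, 4]))
# Longest first: every marker is replaced as a whole
print(tag_headings(sample, [4, 3, 2, 1]))
```

With the shortest-first order, the first line ends up as "[H1] ### Details", which none of the later patterns can match.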
When setting a Chinese font, assigning style.font.name = '仿宋_GB2312' alone does not work; the East Asian font has to be set with style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋_GB2312').
import re

from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn


def set_font_color(run):
    """Apply the common font, size and color to a run."""
    run.font.name = 'Times New Roman'
    run._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋_GB2312')
    run.font.size = Pt(12)
    run.font.color.rgb = RGBColor(0, 0, 0)
    run.font.italic = False  # Assumed to be italic; bold is set by the caller


def process_content(line, paragraph):
    """Add a line to the paragraph, turning **bold** marks into bold runs."""
    bold_pattern = re.compile(r'\*\*(.*?)\*\*')
    matches = list(bold_pattern.finditer(line))
    if not matches:
        run = paragraph.add_run(line)
        set_font_color(run)
    else:
        start = 0
        for match in matches:
            # Plain text before the bold part
            if match.start() > start:
                run = paragraph.add_run(line[start:match.start()])
                set_font_color(run)
            # The bold part itself, without the ** marks
            run = paragraph.add_run(match.group(1))
            run.bold = True
            set_font_color(run)
            start = match.end()
        # Plain text after the last bold part
        if start < len(line):
            run = paragraph.add_run(line[start:])
            set_font_color(run)


def mdtxt2word(txt_path, docx_path):
    with open(txt_path, 'r', encoding='utf-8') as file:
        content = re.sub(r'- ', '\n', file.read())

    doc = Document()
    style = doc.styles['Normal']
    style.font.name = 'Times New Roman'
    style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋_GB2312')
    style.font.size = Pt(12)
    style.font.color.rgb = RGBColor(0, 0, 0)

    # Single regular expression for all heading levels
    heading_pattern = re.compile(
        r'^\s*(#{1,4})\s*(.*?)\s*$'  # Match headings that start with 1-4 '#'
    )

    for line in content.split('\n'):
        # Handle all heading levels
        heading_match = heading_pattern.match(line)
        if heading_match:
            level = len(heading_match.group(1))  # Level is the number of '#'
            title_text = heading_match.group(2).strip()
            if not title_text:
                continue  # Skip empty headings
            # Create a heading at the corresponding level
            heading = doc.add_heading(level=min(level, 4))  # Limit to level 4
            heading.alignment = WD_ALIGN_PARAGRAPH.LEFT
            # Handle bold marks in the heading text
            process_content(title_text, heading)
            continue

        # Handle ordinary paragraphs
        paragraph = doc.add_paragraph()
        paragraph.alignment = WD_ALIGN_PARAGRAPH.LEFT
        process_content(line, paragraph)

    doc.save(docx_path)
    print("The conversion is complete!")


if __name__ == "__main__":
    # Append the input and output file names to these paths before running
    txt_path = "C:\\Users\\xueshifeng\\Desktop\\"
    docx_path = "C:\\Users\\xueshifeng\\Desktop\\"
    mdtxt2word(txt_path, docx_path)
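The bold handling in process_content is easier to check when it is separated from python-docx. The sketch below reuses the same regular expression but returns plain (text, is_bold) pairs instead of adding runs; the function name split_bold is hypothetical.

```python
import re

BOLD = re.compile(r'\*\*(.*?)\*\*')


def split_bold(line):
    """Split a line into (text, is_bold) segments based on **bold** marks."""
    segments = []
    start = 0
    for match in BOLD.finditer(line):
        if match.start() > start:
            segments.append((line[start:match.start()], False))
        segments.append((match.group(1), True))
        start = match.end()
    if start < len(line):
        segments.append((line[start:], False))
    return segments


print(split_bold("Use **pypandoc** to call **pandoc** from Python"))
```

Each pair maps to one run in the Word paragraph: the text goes to add_run and the flag decides whether run.bold is set.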
4. Convert a LaTeX formula to Word
Replace the text between the two $ signs in the latex_content string with your formula, or paste the code into GPT and let it modify the code for you.
import pypandoc

# Define a LaTeX string containing a specific formula.
# Replace the text between the $ signs with your formula, or paste the code
# into GPT and let it generate the final code.
latex_content = r"""
\documentclass{article}
\usepackage{amsmath} % Make sure to include packages for mathematical typesetting
\begin{document}
$ L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)) $
\end{document}
"""

# Convert the LaTeX content to a Word document
output_file = r""
output = pypandoc.convert_text(
    latex_content,           # Input string
    'docx',                  # Output format
    format='latex',          # Input format (LaTeX)
    outputfile=output_file,  # Output file path
    extra_args=['--mathml']  # Extra argument: ask pandoc for MathML math rendering
)

# Check whether the conversion succeeded
if output != '':
    print(f"An error occurred during the conversion process: {output}")
else:
    print(f"Word document generated: {output_file}")
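When outputfile is given, pypandoc.convert_text returns an empty string, so the check on the return value will not normally report a failure. As far as I know, pypandoc signals problems by raising exceptions instead: RuntimeError when pandoc reports an error and OSError when no pandoc is found. A hedged sketch of that pattern follows; the helper name latex_to_docx and the demo file name are hypothetical.

```python
import os
import tempfile


def latex_to_docx(latex_content, output_file):
    """Convert a LaTeX string to a .docx file; return (ok, message)."""
    try:
        import pypandoc  # imported here so a missing package is reported, not fatal
        pypandoc.convert_text(
            latex_content,
            'docx',
            format='latex',
            outputfile=output_file,
        )
    except ImportError:
        return False, "pypandoc is not installed"
    except (RuntimeError, OSError) as exc:
        # RuntimeError: pandoc reported an error; OSError: pandoc was not found
        return False, f"Conversion failed: {exc}"
    return True, f"Word document generated: {output_file}"


if __name__ == "__main__":
    target = os.path.join(tempfile.gettempdir(), "formula_demo.docx")
    print(latex_to_docx(r"$ \max(0, 1 - y_i f(x_i)) $", target))
```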
5. Convert a LaTeX formula to Word and append it to an existing Word document
The difficult part is managing file handles. I found no clean solution, so the script first closes the target document if it is already open in Word, appends the content, and then opens the document again.
import os
import tempfile
import time

import pypandoc
from docx import Document
from win32com.client import Dispatch  # Requires the pywin32 library


def is_file_locked(filepath):
    """Return True if the file cannot be opened for appending."""
    try:
        with open(filepath, 'a'):
            return False
    except PermissionError:
        return True
    except FileNotFoundError:
        return False


def close_word_document(filepath):
    """Save and close the document in Word if it is currently open there."""
    try:
        word = Dispatch("Word.Application")
        for doc in word.Documents:
            if doc.FullName.lower() == os.path.abspath(filepath).lower():
                doc.Save()
                doc.Close()
                print("Saved and closed the Word document")
                return True
        # No matching document was found: quit this Word instance
        word.Quit()
    except Exception as e:
        print(f"Failed to close the Word document: {e}")
    return False


def generate_latex_content(formula):
    """Generate complete LaTeX document content"""
    return fr"""
\documentclass{{article}}
\usepackage{{amsmath}}
\begin{{document}}
Start:
${formula}$
End.
\end{{document}}
"""


def doc_creat(user_formula, output_file):
    # Check whether the file exists
    if not os.path.exists(output_file):
        # Create a new, empty document and save it
        doc = Document()
        doc.save(output_file)
        print(f"File created: {output_file}")
    else:
        print("File opened")

    retry_count = 3
    for _ in range(retry_count):
        if is_file_locked(output_file):
            print("The file is in use; trying to close the Word document...")
            if close_word_document(output_file):
                time.sleep(0.5)  # Wait for the system to release the file
                continue
            else:
                print("Error: The file is occupied by another program. "
                      "Please close it manually and try again!")
                break
        try:
            # Write the LaTeX source to a temporary .tex file
            with tempfile.NamedTemporaryFile(delete=False, suffix=".tex") as temp_tex_file:
                latex_content = generate_latex_content(user_formula)
                temp_tex_file.write(latex_content.encode('utf-8'))
                temp_tex_file_name = temp_tex_file.name
            # Reserve a temporary .docx file for pandoc's output
            with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as temp_docx_file:
                temp_docx_file_name = temp_docx_file.name

            # Convert LaTeX to Word
            pypandoc.convert_file(
                temp_tex_file_name,
                'docx',
                outputfile=temp_docx_file_name,
                extra_args=['--mathjax']
            )

            # Create or open the target document
            target_doc = Document(output_file) if os.path.exists(output_file) else Document()
            temp_doc = Document(temp_docx_file_name)

            # Copy all body elements into the target document
            for element in temp_doc.element.body:
                target_doc.element.body.append(element)

            # Save the target document
            target_doc.save(output_file)
            print(f"Content has been successfully added to: {output_file}")

            # Open the document with Word automatically (Windows only)
            os.startfile(output_file)
            break
        except PermissionError:
            print("Permission denied; please check whether the file is occupied by another program")
            break
        except Exception as e:
            print(f"Operation failed: {e}")
            break
        finally:
            # Remove the temporary files
            if 'temp_tex_file_name' in locals() and os.path.exists(temp_tex_file_name):
                os.remove(temp_tex_file_name)
            if 'temp_docx_file_name' in locals() and os.path.exists(temp_docx_file_name):
                os.remove(temp_docx_file_name)
    else:
        print("The retry limit has been reached; please check the file status")


if __name__ == '__main__':
    # User input formula (example)
    user_formula = r"\frac{\sqrt{x^2 + y^2}}{z}"
    # Output file path: append the target file name before running
    output_file = "C:\\Users\\xueshifeng\\Desktop\\"
    doc_creat(user_formula, output_file)
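The retry loop in doc_creat relies on Python's for/else construct: the else branch runs only when the loop finishes without reaching break. A stripped-down sketch of that control flow, with a hypothetical attempt callable standing in for the lock check and the conversion:

```python
import time


def run_with_retries(attempt, retry_count=3, delay=0.0):
    """Call attempt() until it returns True; report whether it ever succeeded."""
    for _ in range(retry_count):
        if attempt():
            result = "succeeded"
            break          # success: the else branch is skipped
        time.sleep(delay)  # give the system time to release the file
    else:
        # Reached only when every iteration ended without break
        result = "retry limit reached"
    return result


calls = []


def succeeds_on_second_try():
    calls.append(1)
    return len(calls) >= 2


print(run_with_retries(succeeds_on_second_try))
print(run_with_retries(lambda: False))
```

The first call stops after two attempts, while the second exhausts all three and falls through to the else branch.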
That covers how to use pypandoc in Python to convert Markdown files and LaTeX formulas to Word.