Introduction
Markdown (MD) has become a preferred format for developers, technical writers, and content creators because of its concise syntax and good readability. Many documents, however, are still stored in Microsoft Word's DOCX format. With Java and a few libraries, the conversion from DOCX to Markdown can be automated.
This article shows how to parse a DOCX document and convert it to Markdown using Java and related libraries, with complete code examples.
1. Introduction to tools and libraries
In order to implement the DOCX to Markdown conversion, we need the following tools and libraries:
- Java: A widely used programming language suitable for handling text and document conversion tasks.
- Apache POI: A Java library for processing Microsoft Office documents (such as DOCX, XLSX).
- CommonMark (commonmark-java): A Java library for working with Markdown. It parses Markdown text and renders it, for example to HTML.
- Pandoc (optional): A powerful document conversion tool that supports conversion between many formats. Java can drive it by calling the command-line tool.
This article will focus on parsing DOCX documents using Apache POI and converting them to Markdown format.
2. Add the dependencies
Before we start, the required libraries must be added to the project. If you build the project with Maven, add the following dependencies to pom.xml:

    <dependencies>
        <!-- Apache POI for DOCX parsing -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.3</version>
        </dependency>
        <!-- CommonMark for Markdown processing -->
        <dependency>
            <groupId>org.commonmark</groupId>
            <artifactId>commonmark</artifactId>
            <version>0.21.0</version>
        </dependency>
    </dependencies>
3. Use Apache POI to parse DOCX documents
Apache POI is a powerful Java library that can read and write Microsoft Office documents. Its XWPFDocument class gives access to the content of a DOCX file, including paragraphs, headings, tables, and pictures.
Here is a simple example showing how to read text content in a DOCX file using Apache POI:
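    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;

    public class DocxParser {

        public static String parseDocx(String filePath) throws IOException {
            StringBuilder text = new StringBuilder();
            // try-with-resources closes both the stream and the document
            try (FileInputStream fis = new FileInputStream(filePath);
                 XWPFDocument document = new XWPFDocument(fis)) {
                // Traverse the paragraphs in the document
                List<XWPFParagraph> paragraphs = document.getParagraphs();
                for (XWPFParagraph paragraph : paragraphs) {
                    text.append(paragraph.getText()).append("\n");
                }
            }
            return text.toString();
        }

        public static void main(String[] args) {
            if (args.length < 1) {
                System.err.println("Usage: java DocxParser <file.docx>");
                return;
            }
            try {
                // The DOCX path is taken from the first command-line argument
                String docxText = parseDocx(args[0]);
                System.out.println(docxText);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }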
4. Convert parsed content to Markdown format
After parsing the DOCX document, we need to convert its contents to Markdown format. Markdown's syntax is relatively simple, for example:
- Heading: # Title 1, ## Title 2
- Paragraph: write the text directly
- List: - List item
- Table: use the | and - symbols
- Picture: ![alt text](image path)
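The syntax rules listed above can be captured in a few small string helpers. The following is only a minimal sketch: the class name MarkdownSyntax and its method names are hypothetical and not part of any library, and the sample values in main (such as images/logo.png) are made up for illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper: one small method per Markdown construct.
final class MarkdownSyntax {

    // "# Title" through "###### Title"; the level is clamped to the range 1..6.
    static String heading(int level, String text) {
        int clamped = Math.max(1, Math.min(6, level));
        return "#".repeat(clamped) + " " + text;
    }

    // "- item"
    static String listItem(String text) {
        return "- " + text;
    }

    // "| a | b |"
    static String tableRow(List<String> cells) {
        return "| " + String.join(" | ", cells) + " |";
    }

    // "| --- | --- |", the row that separates a table header from its body.
    static String tableSeparator(int columns) {
        return tableRow(Collections.nCopies(columns, "---"));
    }

    // "![alt](path)"
    static String image(String altText, String path) {
        return "![" + altText + "](" + path + ")";
    }

    public static void main(String[] args) {
        // Sample values below are illustrative only.
        System.out.println(heading(2, "Title 2"));
        System.out.println(listItem("List item"));
        System.out.println(tableRow(List.of("Name", "Value")));
        System.out.println(tableSeparator(2));
        System.out.println(image("logo", "images/logo.png"));
    }
}
```

Keeping the syntax in one place like this means the conversion code only decides what each DOCX element is, not how it is spelled in Markdown.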

Based on what Apache POI has parsed, we can build the Markdown ourselves: the style of each paragraph tells us whether it is a heading or ordinary text. Here is an example:
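    import org.apache.poi.xwpf.usermodel.*;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;

    public class DocxToMarkdown {

        public static String convertToMarkdown(String filePath) throws IOException {
            StringBuilder markdown = new StringBuilder();
            try (FileInputStream fis = new FileInputStream(filePath);
                 XWPFDocument document = new XWPFDocument(fis)) {
                // Traverse the paragraphs in the document
                List<XWPFParagraph> paragraphs = document.getParagraphs();
                for (XWPFParagraph paragraph : paragraphs) {
                    String text = paragraph.getText();
                    if (text.trim().isEmpty()) {
                        continue;
                    }
                    // Determine the paragraph style (heading, list, etc.)
                    String style = paragraph.getStyle();
                    if (style != null && style.toLowerCase().contains("heading")) {
                        // Heading: the digits in the style ID give the level, e.g. "Heading2" -> 2
                        String digits = style.replaceAll("\\D", "");
                        // Fall back to level 1 when the style ID has no digits; keep the level within 1..6
                        int level = digits.isEmpty() ? 1 : Math.max(1, Math.min(6, Integer.parseInt(digits)));
                        markdown.append("#".repeat(level)).append(" ").append(text).append("\n\n");
                    } else {
                        // Ordinary paragraph; the blank line keeps paragraphs separate in Markdown
                        markdown.append(text).append("\n\n");
                    }
                }
            }
            return markdown.toString();
        }

        public static void main(String[] args) {
            if (args.length < 1) {
                System.err.println("Usage: java DocxToMarkdown <file.docx>");
                return;
            }
            try {
                // The DOCX path is taken from the first command-line argument
                String markdown = convertToMarkdown(args[0]);
                System.out.println(markdown);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }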
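Deriving the heading level from the style ID is the fragile part of this approach, because a style ID may be missing or may contain no digits at all. The sketch below isolates that step so it can be checked without a DOCX file. It assumes style IDs of the form "Heading1" to "Heading9", which is what Word commonly writes for English templates; other templates may use different IDs. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps a paragraph style ID to a Markdown heading level.
final class HeadingLevels {

    // Accepts "Heading1", "heading 2", "HEADING3", ... (case-insensitive, optional space).
    private static final Pattern HEADING = Pattern.compile("(?i)heading\\s*([1-9])");

    // Returns 1..6 for heading styles and 0 for everything else (including null).
    static int levelOf(String styleId) {
        if (styleId == null) {
            return 0;
        }
        Matcher matcher = HEADING.matcher(styleId.trim());
        if (!matcher.matches()) {
            return 0;
        }
        // Markdown only has six heading levels, so deeper Word headings are capped at 6.
        return Math.min(6, Integer.parseInt(matcher.group(1)));
    }

    // Renders the paragraph text as a Markdown heading or leaves it unchanged.
    static String toMarkdown(String styleId, String text) {
        int level = levelOf(styleId);
        return level == 0 ? text : "#".repeat(level) + " " + text;
    }

    public static void main(String[] args) {
        System.out.println(toMarkdown("Heading2", "Install the dependencies"));
        System.out.println(toMarkdown("Normal", "An ordinary paragraph."));
    }
}
```

Returning 0 for anything that is not a recognizable heading lets the caller fall back to an ordinary paragraph instead of failing with a NumberFormatException.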
5. Handle complex formats (tables, pictures, etc.)
DOCX documents may contain complex formats such as tables and pictures. Apache POI provides corresponding classes to handle these contents:
- Table: use the XWPFTable class to parse the table content and convert it to the Markdown table format.
- Picture: use the XWPFPictureData class to extract the image, save it as a file, and insert an image link in the Markdown.
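For the picture case, the save-and-link step can be sketched with the standard library alone. The sketch assumes the image bytes and file name come from Apache POI, for example from XWPFPictureData.getData() and XWPFPictureData.getFileName() on the entries of XWPFDocument.getAllPictures(). It also assumes the Markdown file is written next to the image directory, so the link is relative to that location. The class name PictureExporter and the sample values are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: stores one extracted image and returns its Markdown link.
final class PictureExporter {

    // Writes the image bytes into imageDir and returns "![alt](dir/name)".
    static String saveAndLink(byte[] data, String fileName, Path imageDir, String altText)
            throws IOException {
        Files.createDirectories(imageDir);
        // Keep only the last path segment so the name cannot point outside imageDir.
        String safeName = Path.of(fileName).getFileName().toString();
        Files.write(imageDir.resolve(safeName), data);
        // Markdown links use forward slashes on every platform.
        String link = imageDir.getFileName() + "/" + safeName;
        return "![" + altText + "](" + link + ")";
    }

    public static void main(String[] args) throws IOException {
        // Illustrative data only; real bytes would come from the DOCX file.
        Path imageDir = Files.createTempDirectory("docx-md").resolve("images");
        byte[] fakeImage = {1, 2, 3};
        System.out.println(saveAndLink(fakeImage, "image1.png", imageDir, "figure"));
    }
}
```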
Here is an example of processing tables:
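    import org.apache.poi.xwpf.usermodel.*;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;

    public class DocxToMarkdown {

        public static String convertToMarkdown(String filePath) throws IOException {
            StringBuilder markdown = new StringBuilder();
            try (FileInputStream fis = new FileInputStream(filePath);
                 XWPFDocument document = new XWPFDocument(fis)) {
                // Process the tables
                List<XWPFTable> tables = document.getTables();
                for (XWPFTable table : tables) {
                    boolean headerWritten = false;
                    for (XWPFTableRow row : table.getRows()) {
                        List<XWPFTableCell> cells = row.getTableCells();
                        for (XWPFTableCell cell : cells) {
                            markdown.append("| ").append(cell.getText()).append(" ");
                        }
                        markdown.append("|\n");
                        if (!headerWritten) {
                            // A Markdown table needs a separator row after its first (header) row
                            markdown.append("| --- ".repeat(cells.size())).append("|\n");
                            headerWritten = true;
                        }
                    }
                    // Blank line after each table
                    markdown.append("\n");
                }
            }
            return markdown.toString();
        }

        public static void main(String[] args) {
            if (args.length < 1) {
                System.err.println("Usage: java DocxToMarkdown <file.docx>");
                return;
            }
            try {
                // The DOCX path is taken from the first command-line argument
                String markdown = convertToMarkdown(args[0]);
                System.out.println(markdown);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }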
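Cell text taken straight from a document can break a Markdown table: a | inside a cell starts a new column, and a line break ends the row. The sketch below renders rows that have already been extracted (for example with XWPFTableCell.getText()) and escapes both cases. It assumes GitHub-style pipe tables and treats the first row as the header. The class name MarkdownTable and the sample data are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: renders rows of cell text as a pipe table.
final class MarkdownTable {

    // Escapes pipes and turns line breaks into <br> so a cell stays on one row.
    static String escapeCell(String text) {
        return text.strip().replace("|", "\\|").replaceAll("\\R", "<br>");
    }

    private static String renderRow(List<String> cells) {
        return cells.stream()
                .map(MarkdownTable::escapeCell)
                .collect(Collectors.joining(" | ", "| ", " |\n"));
    }

    // The first row is treated as the header and followed by a "---" separator row.
    static String render(List<List<String>> rows) {
        if (rows.isEmpty()) {
            return "";
        }
        StringBuilder table = new StringBuilder();
        int columns = rows.get(0).size();
        for (int i = 0; i < rows.size(); i++) {
            table.append(renderRow(rows.get(i)));
            if (i == 0) {
                table.append(renderRow(Collections.nCopies(columns, "---")));
            }
        }
        return table.toString();
    }

    public static void main(String[] args) {
        // Illustrative rows only.
        System.out.print(render(List.of(
                List.of("Name", "Note"),
                List.of("a|b", "first line\nsecond line"))));
    }
}
```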
6. Advanced conversion with Pandoc (optional)
If more complex format conversion is needed (such as supporting mathematical formulas, footnotes, etc.), you can use the Pandoc tool. Pandoc supports converting DOCX to Markdown via the command line. We can use Java to call command line tools to achieve the conversion:
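    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class PandocConverter {

        public static void convertDocxToMarkdown(String docxPath, String mdPath) {
            try {
                // Pass every argument separately so that paths containing spaces stay intact
                ProcessBuilder builder = new ProcessBuilder(
                        "pandoc", "-s", docxPath, "-t", "markdown", "-o", mdPath);
                // Merge stderr into stdout so error messages are printed as well
                builder.redirectErrorStream(true);
                Process process = builder.start();

                // Read the command output before waiting, so a full output buffer cannot block pandoc
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(process.getInputStream()))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                }

                int exitCode = process.waitFor();
                if (exitCode != 0) {
                    System.err.println("pandoc exited with code " + exitCode);
                }
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                // Restore the interrupt flag for the caller
                Thread.currentThread().interrupt();
            }
        }

        public static void main(String[] args) {
            if (args.length < 2) {
                System.err.println("Usage: java PandocConverter <input.docx> <output.md>");
                return;
            }
            convertDocxToMarkdown(args[0], args[1]);
        }
    }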
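One detail deserves care when calling Pandoc from Java: Runtime.exec(String) splits the command on whitespace with a StringTokenizer and knows nothing about quoting, so a path that contains a space is cut into two arguments. The sketch below makes the difference visible without running Pandoc. The class name, method names, and the file name "My Report.docx" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper: compares the two ways of building the pandoc command.
final class PandocArguments {

    // Splits a command line the way Runtime.exec(String) does: on whitespace only.
    static List<String> splitLikeRuntimeExec(String command) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(command);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Builds the argument list explicitly; each path stays one argument.
    static List<String> asArgumentList(String docxPath, String mdPath) {
        return List.of("pandoc", "-s", docxPath, "-t", "markdown", "-o", mdPath);
    }

    public static void main(String[] args) {
        String docx = "My Report.docx"; // illustrative file name with a space
        String md = "out.md";
        String command = String.format("pandoc -s %s -t markdown -o %s", docx, md);
        System.out.println(splitLikeRuntimeExec(command)); // the path is split in two
        System.out.println(asArgumentList(docx, md));      // the path stays whole
    }
}
```

A list built this way can be handed directly to new ProcessBuilder(list), which avoids the tokenizing step entirely.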
7. Summary
With Apache POI and Java, we can parse DOCX documents and convert them to Markdown. The approach works for plain text and also covers structured content such as headings, tables, and pictures.
If more advanced conversion is required, it can be combined with the Pandoc tool.