Using the Docling Library in Python for Document Processing

1. Background

Document processing is a recurring headache in day-to-day development. Technical documents, design documents, and files in formats such as PDF, DOCX, and PPTX all take real effort to parse, convert, and mine for information. Docling offers an elegant solution: it supports a range of mainstream document formats and can analyze PDFs in depth, extracting complex information such as page layout and table structure. It also provides a unified document representation format and a convenient CLI, which keeps document processing simple and efficient.

Next, we will look at what Docling can do and show, with practical code examples, how it helps us process documents efficiently.

2. What is Docling

Docling is a third-party Python library focused on document processing and conversion. It supports a variety of document formats, including PDF, DOCX, PPTX, HTML, and images. Its core feature is in-depth PDF analysis: it recognizes page layout, reading order, and table structure, and it supports OCR for scanned documents. Docling also provides a unified document representation format (DoclingDocument) that makes downstream processing easier.

3. Install Docling

Docling is a third-party library, and installing it is simple. A single pip command is enough:

pip install docling

If you want the CPU-only build of PyTorch (for example, on a machine without a GPU), point pip at the PyTorch CPU wheel index:

pip install docling --extra-index-url /whl/cpu

Once installed, Docling is ready to use.
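To confirm that the installation worked, we can ask Python whether the package is importable and which version is installed. This is a minimal sketch that uses only the standard library; the helper name installed_version is our own and is not part of Docling.

```python
from importlib import metadata, util
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it cannot be imported."""
    if util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but without distribution metadata (e.g. a stdlib module).
        return "unknown"


if __name__ == "__main__":
    version = installed_version("docling")
    print(f"docling: {version or 'not installed'}")
```

If this reports "not installed", the pip command most likely ran against a different Python environment than the one running the script.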

4. How to use library functions

Here are five commonly used Docling functions and classes and how to use them:

1. convert()

This method converts a document. The source can be a local path or a URL.

from docling.document_converter import DocumentConverter

source = "/pdf/2408.09869"  # Document path or URL
converter = DocumentConverter()
result = converter.convert(source)

source: The path or URL of the document.

convert(): Converts the document to Docling's internal representation format.
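Because the source can be either a local path or a URL, it can help to check which one we have before starting a conversion. The sketch below is one possible approach, not part of Docling: describe_source and convert_to_document are hypothetical helper names, and the Docling import is done lazily so the check itself runs without the library.

```python
from pathlib import Path
from urllib.parse import urlparse


def describe_source(source: str) -> str:
    """Classify a source string as 'url', 'file', or 'missing'."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    return "file" if Path(source).is_file() else "missing"


def convert_to_document(source: str):
    """Convert `source` with Docling, failing early on a missing local file."""
    if describe_source(source) == "missing":
        raise FileNotFoundError(f"No such file: {source}")
    # Imported lazily so describe_source() works without Docling installed.
    from docling.document_converter import DocumentConverter

    return DocumentConverter().convert(source).document
```

Failing early on a missing file gives a clearer error than letting the converter discover the problem later.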

2. export_to_markdown()

Export the document to Markdown format.

markdown_content = result.document.export_to_markdown()
print(markdown_content)

export_to_markdown(): Convert document content to Markdown format.
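In practice the Markdown string is usually written to disk rather than printed. The following is a small sketch using pathlib; save_markdown is a hypothetical helper of our own, and the output path is only an example.

```python
from pathlib import Path


def save_markdown(markdown: str, out_path: str) -> Path:
    """Write a Markdown string to `out_path`, creating parent folders if needed."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path
```

A typical call would be save_markdown(markdown_content, "output/paper.md"), where markdown_content is the string returned by export_to_markdown().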

3. export_to_json()

Export the document to JSON format.

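json_content = result.document.export_to_json()
print(json_content)

export_to_json(): Converts the document content to JSON format. If your installed Docling version does not provide this method, result.document.export_to_dict() combined with json.dumps() produces an equivalent JSON string.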
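When the document is available as a plain dictionary, the standard json module controls how it is formatted. This is a minimal sketch under that assumption; the sample dictionary is a made-up stand-in, and real Docling output contains many more fields.

```python
import json


def to_pretty_json(doc_dict: dict) -> str:
    """Serialize a document dictionary as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(doc_dict, ensure_ascii=False, indent=2)


# Hypothetical stand-in for an exported document dictionary.
sample = {"name": "example", "texts": [{"text": "Café menu"}]}
print(to_pretty_json(sample))
```

Setting ensure_ascii=False keeps accented and CJK characters as-is instead of escaping them, which makes the output easier to inspect.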

4. chunk()

The document is split into chunks, and each chunk carries text content and metadata.

from docling_core.transforms.chunker import HierarchicalChunker

chunks = list(HierarchicalChunker().chunk(result.document))
print(chunks[0])

HierarchicalChunker(): Creates a chunker.

chunk(): Splits the document into chunks.

5. PdfPipelineOptions

Customize PDF conversion options.

from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False

PdfPipelineOptions: Customize PDF conversion options.

do_table_structure: Whether to parse the table structure.

do_cell_matching: Whether to match the predicted table cells back to the cells found in the PDF.

5. Examples of usage scenarios

Here are five practical usage scenarios and their code examples:

Scenario 1: Convert PDF to Markdown

from docling.document_converter import DocumentConverter

source = "/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
markdown_content = result.document.export_to_markdown()
print(markdown_content)

convert(): Convert PDF to Docling's internal representation format.

export_to_markdown(): Export the document to Markdown format.

Scenario 2: Limit document size

from docling.document_converter import DocumentConverter

source = "/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)

max_num_pages: Limit the maximum number of pages of a document.

max_file_size: Limits the maximum file size of the document.
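The value 20971520 in the example is 20 MiB expressed in bytes. A small helper makes such limits easier to read, and the same limit can be checked against a local file before conversion. This is only a sketch; mib and within_size_limit are our own hypothetical helpers, not Docling functions.

```python
from pathlib import Path


def mib(n: int) -> int:
    """Convert mebibytes to bytes."""
    return n * 1024 * 1024


def within_size_limit(path: str, max_file_size: int) -> bool:
    """Return True if the local file at `path` is no larger than the limit."""
    return Path(path).stat().st_size <= max_file_size


MAX_FILE_SIZE = mib(20)  # 20 * 1024 * 1024 = 20971520 bytes
```

The constant can then be passed as max_file_size=MAX_FILE_SIZE in the convert() call.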

Scenario 3: Custom PDF conversion options

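from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False
# Assumes Docling 2.x, where pipeline options are passed per input format.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/your/")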

PdfPipelineOptions: Customize PDF conversion options.

do_table_structure and do_cell_matching: Control how table structure is parsed.

Scenario 4: Chunking a document

from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

converter = DocumentConverter()
result = converter.convert("/pdf/2206.01062")
chunks = list(HierarchicalChunker().chunk(result.document))
print(chunks[0])

chunk(): Splits the document into chunks.

The output contains text content and metadata for easy subsequent processing.
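For downstream use such as indexing or retrieval, chunks are often flattened into simple records. The sketch below assumes only that each chunk object exposes a text attribute, as the chunks printed above do. The record fields, the helper names, and the fake chunks are our own illustration rather than Docling output.

```python
import json
from types import SimpleNamespace
from typing import Iterable, Iterator


def chunk_records(chunks: Iterable) -> Iterator[dict]:
    """Yield one record per non-empty chunk, keeping the original chunk index."""
    for index, chunk in enumerate(chunks):
        text = chunk.text.strip()
        if text:  # skip chunks that contain only whitespace
            yield {"id": index, "text": text, "n_chars": len(text)}


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical stand-ins for real chunk objects.
fake_chunks = [
    SimpleNamespace(text="Introduction "),
    SimpleNamespace(text="  "),
    SimpleNamespace(text="Table 1: results"),
]
print(to_jsonl(chunk_records(fake_chunks)))
```

With real data, chunk_records(chunks) can be called directly on the list produced by HierarchicalChunker().chunk(...).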

Scenario 5: Use OCR to process a scanned PDF

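from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions()
# Assumes Docling 2.x, where pipeline options are passed per input format.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/scanned_document.pdf")

PdfPipelineOptions and TesseractOcrOptions: Configure the OCR options.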

do_ocr: Enable OCR function.

6. Frequently Asked Questions and Solutions

Here are three common problems and solutions when using Docling:

Issue 1: TensorFlow-related warnings

Warning message:

This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

Solution: This message is informational rather than an error. To avoid it, install a TensorFlow build suited to your CPU, for example in a fresh conda environment:

conda create --name py11 python==3.11
conda activate py11
conda install tensorflow
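If the message is only noise in your logs, another option is to lower TensorFlow's log verbosity instead of reinstalling. The sketch below uses TensorFlow's TF_CPP_MIN_LOG_LEVEL environment variable. Whether it silences the message in your setup is an assumption to verify, since it depends on which dependency imports TensorFlow and when.

```python
import os

# Must be set before TensorFlow is imported, directly or through a dependency.
# "1" hides INFO messages, "2" also hides warnings, "3" also hides errors.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"
```

Place these lines at the very top of the script, before any Docling or TensorFlow import.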

Issue 2: Tesseract OCR installation issues

Error message: Tesseract OCR is not installed or is configured incorrectly.

Solution: Install Tesseract OCR and set TESSDATA_PREFIX.

# macOS
brew install tesseract leptonica pkg-config
export TESSDATA_PREFIX=/opt/homebrew/share/tessdata/

# Linux
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
export TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
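Before running an OCR conversion, a quick check from Python can tell us whether the Tesseract setup is visible to the process. This is a minimal sketch using only the standard library; tesseract_problems is a hypothetical helper, and it checks only the binary and the TESSDATA_PREFIX variable, not the language data itself.

```python
import os
import shutil
from typing import List, Mapping


def tesseract_problems(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of configuration problems; an empty list means the basics look fine."""
    problems = []
    # Look for the tesseract executable on the PATH given in `env`.
    if shutil.which("tesseract", path=env.get("PATH", "")) is None:
        problems.append("tesseract binary not found on PATH")
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append(f"TESSDATA_PREFIX is not a directory: {prefix}")
    return problems


if __name__ == "__main__":
    for problem in tesseract_problems():
        print("problem:", problem)
```

If TESSDATA_PREFIX shows up as unset here even though it was assigned in the shell, the variable was probably not exported to child processes.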

Issue 3: Tesserocr installation failed

Error message: Tesserocr installation failed.

Solution: Uninstall tesserocr and rebuild it from source so that it links against the locally installed Tesseract.

pip uninstall tesserocr
pip install --no-binary :all: tesserocr

7. Summary

Docling is a document processing library that supports multiple document formats and in-depth parsing. It provides a unified document representation format and several export options, which cover a wide range of development needs. Installation and basic usage are simple, so developers can integrate document processing into their projects with little effort. For technical documentation processing and AI application development alike, Docling is a solid choice.
