1. Background
Document processing is a recurring pain point in day-to-day development. Whether the input is technical documentation, design documents, or files in formats such as PDF, DOCX, and PPTX, parsing, converting, and extracting information from them efficiently takes considerable effort. Docling offers an elegant solution. It supports a range of mainstream document formats and can analyze PDFs in depth, extracting complex information such as page layout and table structure. It also provides a unified document representation format and a convenient CLI, which keeps document processing simple and efficient.
Next, we look at what Docling can do and use code examples to show how it helps process documents efficiently.
2. What is Docling
Docling is a third-party Python library for document processing and conversion. It supports a variety of document formats, including PDF, DOCX, PPTX, HTML, and images. Its core feature is in-depth PDF analysis: it recognizes page layout, reading order, and table structure, and supports OCR for scanned documents. Docling also provides a unified document representation format (DoclingDocument) that simplifies downstream processing.
3. Install Docling
Docling is installed with pip:
pip install docling
To use the CPU-only build of PyTorch, pass an extra index URL:
pip install docling --extra-index-url /whl/cpu
Once installed, Docling is ready to use.
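A quick way to confirm that the installation succeeded is to ask Python which version of the package it can see. The sketch below uses only the standard library; the helper name installed_version is hypothetical and not part of Docling.

```python
from importlib import metadata, util


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    # find_spec returns None when a top-level package cannot be imported.
    if util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but without distribution metadata (e.g. a stdlib module).
        return None


if __name__ == "__main__":
    version = installed_version("docling")
    print(f"docling {version}" if version else "docling is not installed")
```

If the script reports that docling is not installed, check that pip and the Python interpreter belong to the same environment.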
4. How to use library functions
Here are five commonly used functions of Docling and how to use them:
1. convert()
This method converts a document and accepts either a local path or a URL.
from docling.document_converter import DocumentConverter
source = "/pdf/2408.09869"  # Document path or URL
converter = DocumentConverter()
result = converter.convert(source)
source: The path or URL of the document.
convert(): Converts the document to Docling's internal representation format.
2. export_to_markdown()
Export the document to Markdown format.
markdown_content = result.document.export_to_markdown()
print(markdown_content)
export_to_markdown(): Convert document content to Markdown format.
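In practice the Markdown is usually written to disk rather than printed. The following is a minimal sketch, assuming the conversion result exposes document.export_to_markdown() as described above; save_markdown is a hypothetical helper name, not a Docling function.

```python
from pathlib import Path


def save_markdown(result, out_path):
    """Write result.document.export_to_markdown() to out_path and return the path."""
    out_path = Path(out_path)
    # Create missing parent directories so nested output paths work.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return out_path


# Usage with Docling (requires docling to be installed):
# from docling.document_converter import DocumentConverter
# result = DocumentConverter().convert(source)
# save_markdown(result, "out/document.md")
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the document contains non-ASCII text.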
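3. export_to_dict()
Export the document to a JSON-serializable dictionary, then serialize it with the standard json module.
import json
json_content = json.dumps(result.document.export_to_dict())
print(json_content)
export_to_dict(): Converts the document content to a dictionary that can be written out as JSON.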
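For readable JSON output, the dictionary form of the document can be passed through the standard json module. This is a sketch under the assumption that the document object exposes export_to_dict() returning JSON-serializable data; document_to_json is a hypothetical helper name.

```python
import json


def document_to_json(document, indent=2):
    """Serialize a document that exposes export_to_dict() to a JSON string."""
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps(document.export_to_dict(), ensure_ascii=False, indent=indent)


# Usage with Docling (requires docling to be installed):
# json_text = document_to_json(result.document)
```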
4. chunk()
Split the document into chunks, each carrying text content and metadata.
from docling_core.transforms.chunker import HierarchicalChunker
chunks = list(HierarchicalChunker().chunk(result.document))
print(chunks[0])
HierarchicalChunker(): Creates a chunker.
chunk(): Splits the document into chunks.
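Chunks are typically flattened into plain records before being indexed or embedded. The sketch below assumes each chunk has a text attribute and, optionally, a meta attribute; chunks_to_records is a hypothetical helper, and the exact shape of meta depends on the docling-core version.

```python
def chunks_to_records(chunks):
    """Turn chunk objects into plain dicts with an index, the text, and its length."""
    records = []
    for index, chunk in enumerate(chunks):
        records.append({
            "index": index,
            "text": chunk.text,
            "n_chars": len(chunk.text),
            # Kept as-is; convert it further if a plain dict is required.
            "meta": getattr(chunk, "meta", None),
        })
    return records


# Usage with Docling (requires docling and docling-core to be installed):
# records = chunks_to_records(HierarchicalChunker().chunk(result.document))
```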
5. PdfPipelineOptions
Customize PDF conversion options.
from docling.datamodel.pipeline_options import PdfPipelineOptions
pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False
PdfPipelineOptions: Customize PDF conversion options.
do_table_structure: Whether to parse the table structure.
do_cell_matching: Whether to map the recognized table cells back to the text cells of the PDF.
5. Examples of usage scenarios
Here are five practical usage scenarios and their code examples:
Scenario 1: Convert PDF to Markdown
from docling.document_converter import DocumentConverter
source = "/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
markdown_content = result.document.export_to_markdown()
print(markdown_content)
convert(): Convert PDF to Docling's internal representation format.
export_to_markdown(): Export the document to Markdown format.
Scenario 2: Limit document size
from docling.document_converter import DocumentConverter
source = "/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)  # 20 MiB
max_num_pages: Limits the maximum number of pages in the document.
max_file_size: Limits the maximum file size of the document, in bytes (20971520 bytes is 20 MiB).
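The value 20971520 is 20 × 1024 × 1024, that is, 20 MiB expressed in bytes. A small sketch of that arithmetic, with an optional pre-check for local files, follows; both helper names are hypothetical and use only the standard library.

```python
import os

BYTES_PER_MIB = 1024 * 1024


def mib_to_bytes(mib):
    """Convert a size in MiB to a whole number of bytes."""
    return int(mib * BYTES_PER_MIB)


def within_size_limit(path, max_file_size):
    """Return True if the local file at `path` is no larger than max_file_size bytes."""
    return os.path.getsize(path) <= max_file_size


# Usage with Docling (requires docling to be installed):
# result = converter.convert(source, max_file_size=mib_to_bytes(20))
```

Deriving the limit from a readable unit keeps the intent visible and avoids mistyped byte counts.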
Scenario 3: Custom PDF conversion options
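from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False
# In Docling v2, pipeline options are attached to a specific input format.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/your/")
PdfPipelineOptions: Customize PDF conversion options.
PdfFormatOption: Attaches the pipeline options to the PDF input format.
do_table_structure and do_cell_matching: Control how table structure is analyzed.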
Scenario 4: Chunking a document
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker
converter = DocumentConverter()
result = converter.convert("/pdf/2206.01062")
chunks = list(HierarchicalChunker().chunk(result.document))
print(chunks[0])
chunk(): Splits the document into chunks.
The output contains text content and metadata for easy subsequent processing.
Scenario 5: Use OCR to process scanned PDFs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions()  # Use Tesseract as the OCR engine
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/scanned_document.pdf")
PdfPipelineOptions and TesseractOcrOptions: Configure the OCR options.
do_ocr: Enables OCR.
6. Frequently Asked Questions and Solutions
Here are three common problems and solutions when using Docling:
Issue 1: TensorFlow-related warnings
Error message:
This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
Solution: Install the TensorFlow version suitable for the CPU.
conda create --name py11 python==3.11
conda activate py11
conda install tensorflow
Issue 2: Tesseract OCR installation issues
Error message: Tesseract OCR is not installed or is configured incorrectly.
Solution: Install Tesseract OCR and set TESSDATA_PREFIX.
# macOS
brew install tesseract leptonica pkg-config
TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
# Linux
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
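Before enabling OCR, it can help to verify from Python that the Tesseract setup is visible to the interpreter. The following is a minimal sketch using only the standard library; tesseract_problems is a hypothetical helper, and the checks it performs (executable on PATH, TESSDATA_PREFIX pointing to a directory) are assumptions about what a working setup needs.

```python
import os
import shutil


def tesseract_problems(env=None, which=shutil.which):
    """Return a list of human-readable problems with the Tesseract setup."""
    env = os.environ if env is None else env
    problems = []
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append(f"TESSDATA_PREFIX does not point to a directory: {prefix}")
    return problems


if __name__ == "__main__":
    for problem in tesseract_problems():
        print("warning:", problem)
```

Note that a shell variable assigned without export is not passed to child processes, so TESSDATA_PREFIX must be exported for Python to see it.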
Issue 3: Tesserocr installation failed
Error message: Tesserocr installation failed.
Solution: Reinstall Tesserocr.
pip uninstall tesserocr
pip install --no-binary :all: tesserocr
7. Summary
Docling is a document processing library that supports multiple document formats and in-depth parsing. It provides a unified document representation format and several export options, which covers a wide range of development needs. Because installation and basic usage are simple, developers can integrate document processing into their projects with little effort. For both technical documentation processing and AI application development, Docling is a solid choice.