Calling the olmOCR model from Python to extract complex PDF files

The video corresponding to this note: /video/BV1gPXXYiETE/

allenai/olmocr is an open-source toolkit developed by the Allen Institute for AI (AI2) that efficiently converts PDFs and other documents into structured plain text while preserving the natural reading order. The main features and functions of the project are as follows:

Core Technology

Uses a vision language model (VLM) called olmOCR-7B-0225-preview, which is fine-tuned from Qwen2-VL-7B-Instruct.
The model was trained on approximately 250,000 pages of diverse PDF content (both scanned and text-based), annotated with GPT-4o and published as the olmOCR-mix-0225 dataset.

Main functions

Efficient batch processing: The inference pipeline is optimized with SGLang, so large numbers of documents can be processed at very low cost.
Document anchoring: Extracts the coordinates of prominent elements (such as text blocks and images) on each page and injects them into the prompt together with the raw text extracted from the PDF binary.
Local and cluster usage: Runs on a single machine with a GPU, and also supports multi-node parallel processing coordinated through AWS S3.
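
The document-anchoring idea above can be illustrated with a small sketch. It is not olmOCR's actual implementation: the `PageElement` class, the `build_anchor_text` function, and the output format are hypothetical, chosen only to show how element coordinates and raw PDF text might be combined into one prompt string.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    """A prominent element on a page (hypothetical structure for illustration)."""
    kind: str       # "text" or "image"
    x: float        # horizontal position in PDF points
    y: float        # vertical position in PDF points
    text: str = ""  # raw text taken from the PDF binary, empty for images


def build_anchor_text(width: float, height: float, elements: list[PageElement]) -> str:
    """Combine page size, element coordinates and raw text into one prompt string."""
    lines = [f"Page dimensions: {width:.1f}x{height:.1f}"]
    for el in elements:
        if el.kind == "image":
            lines.append(f"[Image {el.x:.0f}x{el.y:.0f}]")
        else:
            lines.append(f"[{el.x:.0f}x{el.y:.0f}]{el.text}")
    return "\n".join(lines)


elements = [
    PageElement("text", 72, 720, "Annual Report 2024"),
    PageElement("image", 72, 400),
    PageElement("text", 72, 380, "Figure 1: Revenue by quarter"),
]
anchor_text = build_anchor_text(612, 792, elements)
print(anchor_text)
```

A string of this kind would be sent to the model along with the rendered page image, so the model sees both where things are and what the embedded text says.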

Performance and Advantages

Cost-effective: Converting 1 million PDF pages costs only about $190, roughly 1/32 of the cost of using the GPT-4o API.
High accuracy: In human evaluation, olmOCR achieved the highest Elo rating among the PDF extraction tools compared.
Improved downstream tasks: Language models trained on text extracted by olmOCR improve by an average of 1.3 percentage points in accuracy across multiple AI benchmark tasks.
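
The cost figures above imply a per-page price and a rough GPT-4o equivalent. The short calculation below only restates that arithmetic. The GPT-4o total is derived from the "1/32" ratio quoted above, not from any published price list.

```python
# Figures quoted in the text above
OLMOCR_COST_PER_MILLION_PAGES = 190.0  # US dollars
GPT4O_COST_RATIO = 32                  # olmOCR is about 1/32 of the GPT-4o cost

# Cost of a single page with olmOCR
cost_per_page = OLMOCR_COST_PER_MILLION_PAGES / 1_000_000

# GPT-4o cost per million pages implied by the quoted ratio
implied_gpt4o_cost = OLMOCR_COST_PER_MILLION_PAGES * GPT4O_COST_RATIO

print(f"olmOCR cost per page: ${cost_per_page:.5f}")
print(f"Implied GPT-4o cost per million pages: ${implied_gpt4o_cost:,.0f}")
```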

How to use

olmOCR provides a Python API and command-line tools for converting PDFs into structured text. It also includes an evaluation toolkit, language filtering, SEO spam removal, and more. Users can get the code from GitHub or try it out in an online demo.

In short, the allenai/olmocr project provides an efficient, accurate and economical solution for large-scale document conversion, which is particularly suitable for research and application scenarios that require processing a large number of PDF documents.

Installation command

# Install Miniconda (if not already installed)
wget </miniconda/Miniconda3-latest-Linux-x86_64.sh> -O ~/ ; bash ~/ -b -p $HOME/miniconda ; eval "$($HOME/miniconda/bin/conda shell.bash hook)" ; echo 'export PATH="$HOME/miniconda/bin:$PATH"' >> ~/.bashrc ; source ~/.bashrc

conda create -n ai python=3.11 -y && conda activate ai

sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools

conda create -n olmocr python=3.11
conda activate olmocr

git clone </allenai/>
cd olmocr
pip install -e .

pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links </whl/cu124/torch2.4/flashinfer/>

pip install gradio pandas

python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/

cat localworkspace/results/output_*.jsonl
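
The same results can be read from Python instead of `cat`. This is a minimal sketch, assuming each line of an `output_*.jsonl` file is one JSON document with a `text` field (the field the Gradio code in this note reads). The helper name `read_results` is made up for illustration.

```python
import json
from pathlib import Path


def read_results(results_dir):
    """Yield one parsed JSON document per non-empty line of every output_*.jsonl file."""
    for jsonl_path in sorted(Path(results_dir).glob("output_*.jsonl")):
        with open(jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    for doc in read_results("localworkspace/results"):
        # Print the first 200 characters of each extracted document
        print(doc.get("text", "")[:200])
```

Parsing line by line keeps the helper working when a results file holds more than one document.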

Gradio UI code

import os
import json
import gradio as gr
import subprocess
import pandas as pd
from pathlib import Path
import shutil
import time
import re

# Create a working directory
WORKSPACE_DIR = "olmocr_workspace"
os.makedirs(WORKSPACE_DIR, exist_ok=True)

def modify_html_for_better_display(html_content):
    """Modify HTML for better display in Gradio"""
    if not html_content:
        return html_content
    
    # Increase container width
    html_content = html_content.replace('<div class="container">',
                                        '<div class="container" style="max-width: 100%; width: 100%;">')

    # Increase text size
    html_content = html_content.replace('<style>',
                                        '<style>\nbody {font-size: 16px; line-height: 1.5;}\n')

    # Adjust the size and proportion of the image and text parts
    html_content = html_content.replace('<div class="row">',
                                        '<div class="row" style="display: flex; flex-wrap: wrap;">')
    html_content = html_content.replace('<div class="col-md-6">',
                                        '<div class="col-md-6" style="flex: 0 0 50%; max-width: 50%; padding: 15px;">')

    # Increase the spacing between pages
    html_content = html_content.replace('<div class="page">',
                                        '<div class="page" style="margin-bottom: 30px; border-bottom: 1px solid #ccc; padding-bottom: 20px;">')

    # Increase image size by prepending sizing rules to each existing img style attribute
    html_content = re.sub(r'<img([^>]*)style="([^"]*)"',
                          r'<img\1style="max-width: 100%; height: auto; \2"',
                          html_content)

    # Add zoom controls
    zoom_controls = """
    <div style="position: fixed; bottom: 20px; right: 20px; background: #fff; padding: 10px; border-radius: 5px; box-shadow: 0 0 10px rgba(0,0,0,0.2); z-index: 1000;">
        <button onclick="document.body.style.zoom = parseFloat(document.body.style.zoom || 1) + 0.1;" style="margin-right: 5px;">Zoom in</button>
        <button onclick="document.body.style.zoom = parseFloat(document.body.style.zoom || 1) - 0.1;">Zoom out</button>
    </div>
    """
    html_content = html_content.replace('</body>', f'{zoom_controls}</body>')
    
    return html_content

def process_pdf(pdf_file):
    """Process a PDF file and return the log, extracted text, HTML preview and metadata"""
    if pdf_file is None:
        return "Please upload a PDF document", "", None, None

    # Create a unique working directory
    timestamp = int(time.time())
    work_dir = os.path.join(WORKSPACE_DIR, f"job_{timestamp}")
    os.makedirs(work_dir, exist_ok=True)

    # Copy the PDF file into the working directory under its original name
    pdf_path = os.path.join(work_dir, os.path.basename(pdf_file))
    shutil.copy(pdf_file, pdf_path)

    # Build the command and execute it
    cmd = ["python", "-m", "olmocr.pipeline", work_dir, "--pdfs", pdf_path]

    try:
        # Execute the command and wait for completion
        process = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            check=True
        )

        # Command output
        log_text = process.stdout

        # Check the results directory
        results_dir = os.path.join(work_dir, "results")
        if not os.path.exists(results_dir):
            return f"Processing completed, but no results directory was generated\n\nLog output:\n{log_text}", "", None, None

        # Find the output file
        output_files = list(Path(results_dir).glob("output_*.jsonl"))
        if not output_files:
            return f"Processing completed, but no output file was found\n\nLog output:\n{log_text}", "", None, None

        # Read JSONL file
        output_file = output_files[0]
        with open(output_file, "r", encoding="utf-8") as f:
            content = f.read().strip()
            if not content:
                return f"The output file is empty\n\nLog output:\n{log_text}", "", None, None

            # Parse the first JSON line (JSONL holds one document per line)
            result = json.loads(content.splitlines()[0])
            extracted_text = result.get("text", "Text content not found")

            # Generate HTML preview
            try:
                preview_cmd = ["python", "-m", "olmocr.viewer.dolmaviewer", str(output_file)]
                subprocess.run(preview_cmd, check=True)
            except Exception as e:
                log_text += f"\nGenerating the HTML preview failed: {str(e)}"

            # Find HTML files and use the most recently written one
            html_files = list(Path("dolma_previews").glob("*.html"))
            html_content = ""
            if html_files:
                try:
                    newest_html = max(html_files, key=os.path.getmtime)
                    with open(newest_html, "r", encoding="utf-8") as hf:
                        html_content = hf.read()
                        # Modify HTML to display better
                        html_content = modify_html_for_better_display(html_content)
                except Exception as e:
                    log_text += f"\nReading the HTML preview failed: {str(e)}"

            # Create a metadata table
            metadata = result.get("metadata", {})
            meta_rows = []
            for key, value in metadata.items():
                meta_rows.append([key, value])

            df = pd.DataFrame(meta_rows, columns=["property", "value"])

            return log_text, extracted_text, html_content, df

    except subprocess.CalledProcessError as e:
        return f"Command execution failed: {e.stderr}", "", None, None
    except Exception as e:
        return f"An error occurred during processing: {str(e)}", "", None, None

# Create Gradio interface
with gr.Blocks(title="olmOCR PDF Extraction Tool") as app:
    gr.Markdown("# olmOCR PDF text extraction tool")
    with gr.Row():
        with gr.Column(scale=1):
            pdf_input = gr.File(label="Upload PDF file", file_types=[".pdf"])
            process_btn = gr.Button("Process PDF", variant="primary")

        with gr.Column(scale=2):
            tabs = gr.Tabs()
            with tabs:
                with gr.TabItem("Extracted text"):
                    text_output = gr.Textbox(label="Extracted text", lines=20, interactive=True)
                with gr.TabItem("HTML Preview"):
                    # Use a larger HTML component
                    html_output = gr.HTML(label="HTML Preview", elem_id="html_preview_container")
                with gr.TabItem("Metadata"):
                    meta_output = gr.Dataframe(label="Document Metadata")
                with gr.TabItem("Log"):
                    log_output = gr.Textbox(label="Processing log", lines=15, interactive=False)

    # Use CSS to customize the size of the HTML preview tab and its content
    gr.HTML("""
    <style>
    #html_preview_container {
        height: 800px;
        width: 100%;
        overflow: auto;
        border: 1px solid #ddd;
        border-radius: 4px;
    }
    #html_preview_container iframe {
        width: 100%;
        height: 100%;
        border: none;
    }
    </style>
    """)

    # Add operation instructions
    gr.Markdown("""
    ## Instructions for use
    1. Upload a PDF file
    2. Click the "Process PDF" button
    3. Wait for processing to complete
    4. View the extracted text and the HTML preview

    ### About the HTML preview
    - The HTML preview shows the original PDF page next to the extracted text
    - This makes the accuracy of the OCR process easy to check
    - If the preview content is too small, adjust it with the zoom buttons in the lower right corner

    ## Notice
    - Processing may take several minutes, please be patient
    - The model is downloaded on the first run (approximately 7GB)
    """)

    # Bind the button event (blocking mode)
    process_btn.click(
        fn=process_pdf,
        inputs=pdf_input,
        outputs=[log_output, text_output, html_output, meta_output],
        api_name="process"
    )

# Start the application
if __name__ == "__main__":
    app.launch(share=True)
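
The blocking call in `process_pdf` can also be factored into small helpers, which makes the command easier to test without a GPU. This is a sketch under assumptions: `build_pipeline_command` and `run_and_capture` are hypothetical names, and the pipeline entry point is assumed to be the `olmocr.pipeline` module, taking the workspace directory as its first argument followed by `--pdfs`.

```python
import subprocess
import sys


def build_pipeline_command(work_dir, pdf_path, module="olmocr.pipeline"):
    """Return the argument list for one pipeline run (the module name is an assumption)."""
    # sys.executable keeps the subprocess in the same Python environment as the UI
    return [sys.executable, "-m", module, work_dir, "--pdfs", pdf_path]


def run_and_capture(cmd, timeout=None):
    """Run a command and return (return code, combined stdout and stderr)."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so the log is complete
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


# Demonstration with a harmless command instead of the real pipeline
code, log = run_and_capture([sys.executable, "-c", "print('ok')"])
print(code, log.strip())
```

Merging stderr into stdout is a design choice here: command-line tools often write their progress messages to stderr, and a single stream keeps the log tab in order.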
