The video corresponding to this note: /video/BV1gPXXYiETE/

allenai/olmocr is an open source toolkit developed by the Allen Institute for Artificial Intelligence (AI2) that efficiently converts PDFs and other documents into structured plain text while preserving natural reading order. Its main features and functions are:
Core Technology
- Uses a visual language model (VLM) called olmOCR-7B-0225-preview, fine-tuned from Qwen2-VL-7B-Instruct.
- The model was trained on approximately 250,000 pages of diverse PDF content (both scanned and text-based), labeled with GPT-4o and released as the olmOCR-mix-0225 dataset.
Main functions
- Efficient batch processing: an inference pipeline optimized with SGLang processes large document collections at very low cost.
- Document anchoring: the positions of salient elements on each page (such as text blocks and images) are extracted and injected into the model prompt together with the raw text pulled from the PDF binary.
- Local and cluster usage: runs on a single GPU, and also supports multi-node parallel processing coordinated through AWS S3.
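The document-anchoring idea can be illustrated with a toy prompt builder: it combines the raw PDF text with the coordinates of salient page elements into one context string that accompanies the page image. The format below is purely illustrative and is not olmOCR's actual prompt layout.

```python
# Illustrative sketch of document anchoring (NOT olmOCR's real prompt format):
# combine raw PDF text with the coordinates of salient page elements.

def build_anchor_prompt(page_width, page_height, raw_text, elements):
    """elements: list of (kind, x, y, snippet) tuples in page coordinates."""
    lines = [f"Page dimensions: {page_width}x{page_height}"]
    # List elements top-to-bottom, left-to-right (PDF y grows upward).
    for kind, x, y, snippet in sorted(elements, key=lambda e: (-e[2], e[1])):
        lines.append(f"[{kind}] ({x:.0f}, {y:.0f}) {snippet}")
    lines.append("RAW TEXT:")
    lines.append(raw_text)
    return "\n".join(lines)

prompt = build_anchor_prompt(
    612, 792,
    "Introduction\nPDFs are hard to parse...",
    [("text", 72, 700, "Introduction"), ("image", 72, 400, "figure1")],
)
print(prompt.splitlines()[0])  # Page dimensions: 612x792
```

The anchored string gives the VLM both the visual rendering and a machine-readable hint of where the important content sits, which is what lets it keep a natural reading order.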
Performance and Advantages
- Cost-effective: converting one million PDF pages costs only about $190, roughly 1/32 of the cost of using the GPT-4o API.
- High accuracy: in human evaluation, olmOCR achieved the highest Elo rating among the PDF extraction techniques compared.
- Better downstream tasks: language models trained on text extracted by olmOCR improve accuracy by an average of 1.3 percentage points across multiple AI benchmarks.
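The cost figures above imply the following per-page arithmetic (the GPT-4o total here is derived from the stated ~1/32 ratio, not an independently quoted price):

```python
# Per-page cost comparison derived from the figures quoted above.
pages = 1_000_000
olmocr_total = 190.0  # USD for 1M pages with olmOCR
ratio = 32            # olmOCR is ~1/32 the cost of GPT-4o
gpt4o_total = olmocr_total * ratio

print(f"olmOCR:  ${olmocr_total / pages * 1000:.3f} per 1k pages")  # $0.190 per 1k pages
print(f"GPT-4o: ~${gpt4o_total / pages * 1000:.2f} per 1k pages")   # ~$6.08 per 1k pages
```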
How to use
olmOCR provides a Python API and command-line tools to convert PDFs into structured text. It also includes an evaluation toolkit and features such as language filtering and SEO-spam removal. Users can get the code on GitHub or try the online demo.
In short, the allenai/olmocr project provides an efficient, accurate, and economical solution for large-scale document conversion, particularly suited to research and application scenarios that involve processing large numbers of PDF documents.
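The pipeline writes its output as JSONL files under `results/` in the workspace, so a small wrapper can drive a run and collect the extracted text afterwards. This is a sketch that shells out to the CLI rather than using olmOCR's internal Python API; the `"text"` field name matches the output format shown in this article, but treat it as an assumption if your version differs.

```python
import json
import subprocess
from pathlib import Path

def run_olmocr(workspace: str, pdf_glob: str) -> None:
    # Invoke the batch pipeline the same way the CLI does (requires olmocr installed).
    subprocess.run(
        ["python", "-m", "olmocr.pipeline", workspace, "--pdfs", pdf_glob],
        check=True,
    )

def collect_text(workspace: str) -> list[str]:
    # Each line of results/output_*.jsonl is one JSON record with a "text" field.
    texts = []
    for path in sorted(Path(workspace, "results").glob("output_*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                texts.append(json.loads(line)["text"])
    return texts
```

After `run_olmocr("localworkspace", "tests/gnarly_pdfs/*.pdf")`, calling `collect_text("localworkspace")` returns one string per converted document.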
Installation commands

```bash
# Install Miniconda (if not already installed)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
eval "$($HOME/miniconda/bin/conda shell.bash hook)"
echo 'export PATH="$HOME/miniconda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# System dependencies (poppler utilities and fonts)
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools

# Create and activate the environment
conda create -n olmocr python=3.11 -y
conda activate olmocr

# Install olmOCR and its inference dependencies
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
pip install gradio pandas

# Run the pipeline on the sample PDFs and inspect the results
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
cat localworkspace/results/output_*.jsonl
```
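After installing, a quick sanity check confirms that the external binaries the pipeline shells out to are on `PATH` (this helper is hypothetical, not part of olmOCR; `pdftoppm` and `pdfinfo` come from the poppler-utils package installed above):

```python
import shutil

def missing_tools(tools):
    # Return the subset of required external binaries not found on PATH.
    return [t for t in tools if shutil.which(t) is None]

# poppler-utils provides pdftoppm/pdfinfo, used for rendering PDF pages.
print(missing_tools(["pdftoppm", "pdfinfo"]))
```

An empty list means the poppler tools are installed and visible to the pipeline.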
Gradio UI code

```python
import json
import os
import re
import shutil
import subprocess
import time
from pathlib import Path

import gradio as gr
import pandas as pd

# Create a working directory
WORKSPACE_DIR = "olmocr_workspace"
os.makedirs(WORKSPACE_DIR, exist_ok=True)


def modify_html_for_better_display(html_content):
    """Modify the HTML for better display in Gradio."""
    if not html_content:
        return html_content
    # Increase container width
    html_content = html_content.replace(
        '<div class="container">',
        '<div class="container" style="max-width: 100%; width: 100%;">')
    # Increase text size
    html_content = html_content.replace(
        '<style>',
        '<style>\nbody {font-size: 16px;}\n.text-content {font-size: 16px; line-height: 1.5;}\n')
    # Adjust the layout of the image and text columns
    html_content = html_content.replace(
        '<div class="row">',
        '<div class="row" style="display: flex; flex-wrap: wrap;">')
    html_content = html_content.replace(
        '<div class="col-md-6">',
        '<div class="col-md-6" style="flex: 0 0 50%; max-width: 50%; padding: 15px;">')
    # Increase the spacing between pages
    html_content = html_content.replace(
        '<div class="page">',
        '<div class="page" style="margin-bottom: 30px; border-bottom: 1px solid #ccc; padding-bottom: 20px;">')
    # Make images scale to the column width
    html_content = re.sub(
        r'<img([^>]*)style="([^"]*)"',
        r'<img\1style="max-width: 100%; height: auto; \2"',
        html_content)
    # Add zoom controls
    zoom_controls = """
    <div style="position: fixed; bottom: 20px; right: 20px; background: #fff; padding: 10px;
                border-radius: 5px; box-shadow: 0 0 10px rgba(0,0,0,0.2); z-index: 1000;">
      <button onclick="document.body.style.zoom = parseFloat(document.body.style.zoom || 1) + 0.1;"
              style="margin-right: 5px;">Zoom in</button>
      <button onclick="document.body.style.zoom = parseFloat(document.body.style.zoom || 1) - 0.1;">Zoom out</button>
    </div>
    """
    html_content = html_content.replace('</body>', f'{zoom_controls}</body>')
    return html_content


def process_pdf(pdf_file):
    """Process a PDF file and return the results."""
    if pdf_file is None:
        return "Please upload a PDF file", "", None, None

    # Create a unique working directory
    timestamp = int(time.time())
    work_dir = os.path.join(WORKSPACE_DIR, f"job_{timestamp}")
    os.makedirs(work_dir, exist_ok=True)

    # Copy the uploaded PDF into the workspace
    pdf_path = os.path.join(work_dir, "input.pdf")
    shutil.copy(pdf_file, pdf_path)

    # Build the pipeline command and execute it
    cmd = ["python", "-m", "olmocr.pipeline", work_dir, "--pdfs", pdf_path]
    try:
        # Execute the command and wait for completion
        process = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            text=True, check=True)
        log_text = process.stdout

        # Check the results directory
        results_dir = os.path.join(work_dir, "results")
        if not os.path.exists(results_dir):
            return f"Processing finished, but no results directory was generated\n\nLog output:\n{log_text}", "", None, None

        # Find the output file
        output_files = list(Path(results_dir).glob("output_*.jsonl"))
        if not output_files:
            return f"Processing finished, but no output file was found\n\nLog output:\n{log_text}", "", None, None

        # Read the JSONL file
        output_file = output_files[0]
        with open(output_file, "r") as f:
            content = f.read().strip()
        if not content:
            return f"The output file is empty\n\nLog output:\n{log_text}", "", None, None

        # Parse the JSON record
        result = json.loads(content)
        extracted_text = result.get("text", "Text content not found")

        # Generate the HTML preview
        try:
            preview_cmd = ["python", "-m", "olmocr.viewer.dolmaviewer", str(output_file)]
            subprocess.run(preview_cmd, check=True)
        except Exception as e:
            log_text += f"\nGenerating the HTML preview failed: {str(e)}"

        # Find the generated HTML file
        html_files = list(Path("dolma_previews").glob("*.html"))
        html_content = ""
        if html_files:
            try:
                with open(html_files[0], "r", encoding="utf-8") as hf:
                    html_content = hf.read()
                # Modify the HTML so it displays better
                html_content = modify_html_for_better_display(html_content)
            except Exception as e:
                log_text += f"\nReading the HTML preview failed: {str(e)}"

        # Create a metadata table
        metadata = result.get("metadata", {})
        meta_rows = [[key, value] for key, value in metadata.items()]
        df = pd.DataFrame(meta_rows, columns=["property", "value"])

        return log_text, extracted_text, html_content, df
    except subprocess.CalledProcessError as e:
        return f"Command execution failed: {e.stderr}", "", None, None
    except Exception as e:
        return f"An error occurred during processing: {str(e)}", "", None, None


# Create the Gradio interface
with gr.Blocks(title="olmOCR PDF Extraction Tool") as app:
    gr.Markdown("# olmOCR PDF text extraction tool")
    with gr.Row():
        with gr.Column(scale=1):
            pdf_input = gr.File(label="Upload PDF file", file_types=[".pdf"])
            process_btn = gr.Button("Process PDF", variant="primary")
        with gr.Column(scale=2):
            tabs = gr.Tabs()
            with tabs:
                with gr.Tab("Extracted text"):
                    text_output = gr.Textbox(label="Extracted text", lines=20, interactive=True)
                with gr.Tab("HTML Preview"):
                    # Use a larger HTML component
                    html_output = gr.HTML(label="HTML Preview", elem_id="html_preview_container")
                with gr.Tab("Metadata"):
                    meta_output = gr.Dataframe(label="Document Metadata")
                with gr.Tab("Log"):
                    log_output = gr.Textbox(label="Processing logs", lines=15, interactive=False)

    # Use CSS to size the HTML preview tab and its content
    gr.HTML("""
    <style>
    #html_preview_container { height: 800px; width: 100%; overflow: auto;
                              border: 1px solid #ddd; border-radius: 4px; }
    #html_preview_container iframe { width: 100%; height: 100%; border: none; }
    </style>
    """)

    # Add usage instructions
    gr.Markdown("""
    ## Instructions for use
    1. Upload a PDF file
    2. Click the "Process PDF" button
    3. Wait for processing to complete
    4. View the extracted text and the HTML preview

    ### About the HTML preview
    - The HTML preview shows the original PDF page side by side with the extracted text
    - The accuracy of the OCR process can be inspected directly
    - If the preview content is too small, use the zoom buttons in the lower right corner

    ## Notes
    - Processing may take several minutes, please be patient
    - The model is downloaded on the first run (approximately 7 GB)
    """)

    # Bind the button event (blocking mode)
    process_btn.click(
        fn=process_pdf,
        inputs=pdf_input,
        outputs=[log_output, text_output, html_output, meta_output],
        api_name="process")

# Start the application
if __name__ == "__main__":
    app.launch(share=True)
```
The above is the detailed content of calling the olmOCR model from Python to extract complex PDF files. For more on extracting PDF content with Python and olmOCR, see my other related articles!