Converting a PDF to images and running OCR on them with Python

1. Convert the PDF to images first

Sample code

import os
from pdf2image import convert_from_path

# PDF file path
pdf_path = '/Users/xxx/'
# Folder for the output images
output_folder = './output_images2022'
# Name prefix of the output images
output_name = 'page'

# Create the output folder if it does not exist
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Convert the PDF to a list of images at a resolution of 300 DPI
images = convert_from_path(pdf_path, dpi=300)

# Save each page as a PNG image
for i, image in enumerate(images):
    image.save(f'{output_folder}/{output_name}_{i+1}.png', 'PNG')
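The 300 DPI setting determines how large each rendered page is. As a rough sketch of the arithmetic (PDF page sizes are measured in points, 72 per inch), the helpers below estimate the pixel dimensions and the uncompressed RGB memory footprint of one page. The names `page_pixels` and `rgb_megabytes` are hypothetical, the A4 size of 595 x 842 points is only an example, and the exact dimensions pdf2image produces may differ by a pixel depending on rounding.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def page_pixels(width_pt, height_pt, dpi):
    """Estimate the pixel size of a page rendered at the given DPI."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px, height_px


def rgb_megabytes(width_px, height_px):
    """Approximate uncompressed size of an 8-bit RGB image in MiB."""
    return width_px * height_px * 3 / (1024 * 1024)


if __name__ == "__main__":
    # Example only: an A4 page is about 595 x 842 points
    w, h = page_pixels(595, 842, dpi=300)
    print(w, h, round(rgb_megabytes(w, h), 1))
```

Because `convert_from_path` returns every page in one list, a long PDF at 300 DPI can hold a lot of image data in memory at once. Lowering `dpi`, or converting in batches with the `first_page` and `last_page` parameters, is one way to keep that in check.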

2. OCR recognition

Sample code

from PIL import ImageEnhance
import pytesseract
from PIL import Image
from openpyxl import Workbook

# Configure the path to Tesseract (if required)
# pytesseract.pytesseract.tesseract_cmd = r'/usr/local/bin/tesseract'  # Path on macOS
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\'  # Path on Windows

# Open the picture
# image_path = "/Users/xxx/page_3.png"  # Replace with your image path

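def enhance_image(img):
    img = img.convert('L')  # Convert to grayscale
    # Double the contrast (Contrast is an assumed choice; ImageEnhance also provides Sharpness and Brightness)
    img = ImageEnhance.Contrast(img).enhance(2.0)
    return img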


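def allimngs(image_path):
    # Open the page image and preprocess it
    image = Image.open(image_path)
    image = enhance_image(image)

    # OCR using pytesseract
    text = pytesseract.image_to_string(image, lang="chi_sim")  # Simplified Chinese

    # Print the extracted text
    # print("Extracted text:")
    # print(text.replace(' ', ''))

    # Remove the spaces that OCR inserts between characters
    return text.replace(' ', '')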


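# Count how many times each keyword occurs, using a trie
class TrieNode:
    def __init__(self):
        self.children = {}  # Maps a character to the next node
        self.keywords = []  # Keywords that end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, keyword):
        node = self.root
        for char in keyword:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.keywords.append(keyword)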


def count_keywords(text, keywords):
    # Deduplicate keywords to ensure uniqueness
    keywords = list(set(keywords))

    # Build the trie
    trie = Trie()
    for kw in keywords:
        trie.insert(kw)

    # Initialize the counters
    counters = {kw: 0 for kw in keywords}
    i = 0
    n = len(text)

    while i < n:
        current_node = trie.root
        max_len = 0
        current_len = 0
        end_pos = i
        matched_keywords = []

        # Find the longest keyword match starting at the current position
        for j in range(i, n):
            char = text[j]
            if char in current_node.children:
                current_node = current_node.children[char]
                current_len += 1
                if current_node.keywords:  # This node is the end of a keyword
                    max_len = current_len
                    end_pos = j + 1  # The end position is one past the current character
                    # Remember the match here: the walk may go on past it without reaching another keyword
                    matched_keywords = current_node.keywords
            else:
                break  # No further match, exit the loop

        if max_len > 0:
            # Update the counters of all matched keywords
            for kw in matched_keywords:
                counters[kw] += 1
            i = end_pos  # Jump to the end of the matched part
        else:
            i += 1  # No match, move to the next character

    return counters


if __name__ == "__main__":
    keywords = ['Short',
                'Sit at the status quo',
                'Hidden',
                'dim',
                'Dark']
    all_text = ''
    workbook = Workbook()
    sheet = workbook.active

    # OCR pages 1 to 108 and concatenate their text
    for i in range(1, 109):
        image_path = f"/Users/xxx/output_images2022/page_{i}.png"
        all_text = all_text + allimngs(image_path)

    all_text = all_text.replace(' ', '').replace('\n', '')

    result = count_keywords(all_text, keywords)

    num = 1

    for k, v in result.items():
        sheet[f'A{num}'] = k
        sheet[f'B{num}'] = v
        print(k, v, num)
        num = num + 1

    workbook.save(filename='')
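The counting routine above takes the longest keyword that matches at each position and then jumps past it, so counted matches never overlap. The compact sketch below reproduces that behaviour with a plain nested-dict trie, so it can be tried without any OCR output. The names `build_trie`, `count_longest_matches` and the `END` marker are hypothetical, and the sample strings are made up.

```python
END = "$end"  # hypothetical marker key; longer than one character, so it never clashes with a text character


def build_trie(keywords):
    """Build a nested-dict trie; a node holding END finishes a keyword."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[END] = kw
    return root


def count_longest_matches(text, keywords):
    """Count non-overlapping keyword hits, preferring the longest match."""
    keywords = set(keywords)
    keywords.discard("")  # an empty keyword would match everywhere
    root = build_trie(keywords)
    counts = {kw: 0 for kw in keywords}
    i = 0
    while i < len(text):
        node, best = root, None
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if END in node:
                best = (node[END], j + 1)  # remember the longest hit so far
        if best:
            counts[best[0]] += 1
            i = best[1]  # skip past the matched keyword
        else:
            i += 1
    return counts


if __name__ == "__main__":
    print(count_longest_matches("abcabab", ["ab", "abc"]))
```

With the keywords `ab` and `abc`, the text `abcabab` counts `abc` once and `ab` twice: the leading `abc` is consumed as a single match rather than as `ab` followed by a stray `c`.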

3. Supplementary knowledge

OCR libraries for recognizing text in images and PDFs with Python

1. PaddleOCR

Built on Baidu's PaddlePaddle framework, PaddleOCR ships a rich set of models and supports multilingual recognition, including Chinese and English. It performs well and suits text recognition in complex scenes.

Install the PaddleOCR library:

pip install paddleocr

Sample code

from paddleocr import PaddleOCR, draw_ocr
from PIL import Image
 
# Initialize PaddleOCR
# Parameters:
# `lang`: the language model, such as 'en' (English) or 'ch' (Chinese)
# `use_angle_cls`: whether to enable the text orientation classifier
ocr = PaddleOCR(use_angle_cls=True, lang='en')  # Can also be set to 'ch' for Chinese

# Specify the image path
img_path = ''  # Replace with your image path

# Run OCR
result = ocr.ocr(img_path, cls=True)  # `cls=True` enables the orientation classifier

# Print the recognition results
# Note: recent PaddleOCR 2.x releases return one list per input image; in that case use result[0] below
for line in result:
    print(line)

# Optional: draw the recognition results and save them
if result:
    image = Image.open(img_path).convert('RGB')
    boxes = [line[0] for line in result]  # Extract the text boxes
    txts = [line[1][0] for line in result]  # Extract the text content
    scores = [line[1][1] for line in result]  # Extract the confidence scores

    # Draw the results
    im_show = draw_ocr(image, boxes, txts, scores, font_path='path/to/PaddleOCR/doc/fonts/')
    im_show = Image.fromarray(im_show)
    im_show.save('')  # Save the drawn picture
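The sample above reads each recognized line as `[box, (text, score)]`. Assuming that layout, a small helper can drop low-confidence lines before the text is used further. The function name `filter_by_confidence`, the 0.8 threshold and the sample data are all illustrative and not part of PaddleOCR.

```python
def filter_by_confidence(lines, min_score=0.8):
    """Keep the text of lines whose confidence is at least min_score.

    Each line is assumed to look like [box, (text, score)], the layout
    the sample code above indexes into.
    """
    kept = []
    for _box, (text, score) in lines:
        if score >= min_score:
            kept.append(text)
    return kept


if __name__ == "__main__":
    # Made-up lines in the assumed layout, not real OCR output
    sample = [
        [[[0, 0], [50, 0], [50, 10], [0, 10]], ("Invoice", 0.98)],
        [[[0, 20], [50, 20], [50, 30], [0, 30]], ("T0tal", 0.41)],
    ]
    print(filter_by_confidence(sample))
```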

2. RapidOCR

First, make sure the ONNX Runtime version of RapidOCR is installed. ONNX Runtime is a lightweight and efficient inference engine:

pip install rapidocr_onnxruntime

Sample code: recognizing digits and letters

The following code uses RapidOCR to recognize the text in an image and prints only the results that consist solely of digits and letters:

from rapidocr_onnxruntime import RapidOCR
 
# Initialize the OCR engine
ocr = RapidOCR()

# Specify the image path
img_path = ''  # Replace with your image path

# Run recognition
result, _ = ocr(img_path)

# Extract and print the recognition results (digits and letters only)
if result:
    for line in result:
        text = line[1]  # Extract the text content
        # Keep only text that consists solely of digits and letters
        if text.isalnum():
            print(text)
else:
    print("No text is recognized")

Things to note

Image path: Make sure the image that img_path points to contains digits or letters.
Language settings: By default, RapidOCR supports mixed Chinese and English recognition. To recognize other languages, refer to its documentation for configuration.
Environment requirements: Make sure the Python version is 3.6 or higher.
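A filter for digits and letters needs some care: Python's `str.isalnum()` also returns True for letters outside ASCII, Chinese characters included, which matters because RapidOCR recognizes mixed Chinese and English by default. If only ASCII digits and letters should pass, a stricter check is one option. This is a sketch, and the helper name `is_ascii_alnum` is hypothetical.

```python
import re

# ASCII digits and letters only; fullmatch rejects empty strings and spaces
_ASCII_ALNUM = re.compile(r"[0-9A-Za-z]+")


def is_ascii_alnum(text):
    """Return True if text consists solely of ASCII digits and letters."""
    return _ASCII_ALNUM.fullmatch(text) is not None


if __name__ == "__main__":
    # Compare the built-in check with the stricter one
    for candidate in ["AB12", "中文", "A B"]:
        print(candidate, candidate.isalnum(), is_ascii_alnum(candidate))
```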

3. EasyOCR

Features: Easy to use, supports multiple languages (including Chinese, English, etc.), based on deep learning technology, suitable for beginners and fast integration.

Installation method:

pip install easyocr

Example of usage:

import easyocr
 
reader = easyocr.Reader(['en', 'ch_sim'])  # Load English and Simplified Chinese
img_path = ''
result = reader.readtext(img_path)
for line in result:
    print(line[1])  # Print recognition results
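EasyOCR's `readtext` returns one `(bounding box, text, confidence)` tuple per detected region, which is why the sample prints `line[1]`. Assuming that layout, with the bounding box given as four `[x, y]` corner points starting at the top-left, the sketch below sorts the regions roughly top to bottom and left to right before joining the text. The helper name `join_in_reading_order`, the `row_height` tolerance and the sample data in the usage lines are hypothetical.

```python
def join_in_reading_order(results, row_height=10):
    """Join recognized text top to bottom, then left to right.

    Each result is assumed to be (box, text, confidence), where box is
    four [x, y] corners starting at the top-left. Regions whose top
    edges fall in the same row_height band count as one row.
    """
    def sort_key(item):
        x, y = item[0][0]  # top-left corner of the bounding box
        return (int(y // row_height), x)

    ordered = sorted(results, key=sort_key)
    return " ".join(text for _box, text, _conf in ordered)


if __name__ == "__main__":
    # Made-up regions in the assumed layout, not real OCR output
    sample = [
        ([[60, 2], [100, 2], [100, 12], [60, 12]], "world", 0.90),
        ([[0, 0], [40, 0], [40, 10], [0, 10]], "hello", 0.95),
    ]
    print(join_in_reading_order(sample))
```

The fixed-height banding is deliberately crude: two regions on the same visual row can land in different bands if their top edges straddle a band boundary, so `row_height` should be tuned to the text size of the image.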

4. Pytesseract

Features: A Python wrapper for Tesseract. It supports multiple languages, is easy to use, and suits traditional OCR tasks.

Installation method:

pip install pytesseract

You need to install Tesseract OCR first, which can be downloaded from the Tesseract official website.

Example of usage:

from PIL import Image
import pytesseract
 
img_path = ''
text = pytesseract.image_to_string(Image.open(img_path), lang='eng')
print(text)  # Print recognition results
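Since pytesseract only wraps the Tesseract executable, it helps to confirm that the executable can be found before running OCR. The sketch below uses the standard library only; `find_tesseract` is a hypothetical helper, and the candidate path in the usage lines is just an illustration.

```python
import os
import shutil


def find_tesseract(candidates=()):
    """Return a path to the tesseract executable, or None if not found.

    Looks on PATH first, then tries each path in candidates.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


if __name__ == "__main__":
    # The candidate below is an example location, not a guaranteed one
    print(find_tesseract(["/usr/local/bin/tesseract"]))
```

If the function returns a path that is not on `PATH`, that path can be assigned to `pytesseract.pytesseract.tesseract_cmd` before calling `image_to_string`.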

5. DocTR

Features: Focuses on document analysis and table recognition, extracts structured information from documents, and suits documents with complex layouts.

Installation method:

pip install python-doctr

Example of usage:

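from doctr.models import ocr_predictor
from doctr.io import DocumentFile

img_path = ''
doc = DocumentFile.from_images(img_path)
model = ocr_predictor(pretrained=True)
result = model(doc)
# Walk the first page: blocks contain lines, lines contain words
for block in result.pages[0].blocks:
    for line in block.lines:
        for word in line.words:
            print(word.value)  # Print recognition results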
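docTR organizes its output as pages, blocks, lines and words, as the nested loops above show. Assuming the nested dictionary produced by `result.export()` mirrors that hierarchy with `pages`, `blocks`, `lines`, `words` and `value` keys, the sketch below rebuilds one string per text line. The helper name `lines_from_export` and the sample dictionary are hypothetical; check the keys against your docTR version.

```python
def lines_from_export(exported):
    """Rebuild one string per line from a docTR-style export dictionary."""
    lines = []
    for page in exported.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = [word["value"] for word in line.get("words", [])]
                if words:
                    lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    # Hand-written dictionary in the assumed shape, not real OCR output
    sample = {
        "pages": [
            {"blocks": [
                {"lines": [
                    {"words": [{"value": "Hello"}, {"value": "world"}]},
                    {"words": [{"value": "Second"}, {"value": "line"}]},
                ]}
            ]}
        ]
    }
    print(lines_from_export(sample))
```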

6. PyOCR

Features: Wraps multiple OCR engines (such as Tesseract and Cuneiform) behind a unified interface.

Installation method:

pip install pyocr

Example of usage:

import pyocr
from PIL import Image
 
tools = pyocr.get_available_tools()
ocr_tool = tools[0]
img_path = ''
text = ocr_tool.image_to_string(Image.open(img_path), lang='eng')
print(text)  # Print recognition results

Selection suggestions:

Speed first: RapidOCR or EasyOCR is recommended.

Accuracy first: PaddleOCR is recommended.

Ease of use first: EasyOCR is recommended.

Document analysis first: docTR is recommended.

Note: Choose the OCR library that best fits your specific needs, such as language support, application scenario and performance requirements.
