1. Convert the PDF to images first
Sample code
import os
from pdf2image import convert_from_path

# PDF file path
pdf_path = '/Users/xxx/'
# Folder for the output images
output_folder = './output_images2022'
# Naming format of the output images
output_name = 'page'

# Create the output folder if it does not exist
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Convert the PDF to a list of images at a resolution of 300 DPI
images = convert_from_path(pdf_path, dpi=300)

# Save each page as a PNG image
for i, image in enumerate(images):
    image.save(f'{output_folder}/{output_name}_{i+1}.png', 'PNG')
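The `dpi=300` setting decides how many pixels each page produces, which affects both OCR quality and the size of the saved images. As a rough sketch (the helper name `page_pixels` and the A4 page size are assumptions for illustration, not part of pdf2image), the pixel dimensions can be estimated from the physical page size:

```python
MM_PER_INCH = 25.4

def page_pixels(width_mm, height_mm, dpi):
    """Estimate the rendered image size in pixels for a page of the given size."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px

# An A4 page (210 mm x 297 mm) at 300 DPI and at 150 DPI
print(page_pixels(210, 297, 300))  # (2480, 3508)
print(page_pixels(210, 297, 150))  # (1240, 1754)
```

Halving the DPI cuts the pixel count to roughly a quarter, so a lower value may be worth trying when the PDF has many pages and the text is large.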
2. Recognition
Sample code
from PIL import ImageEnhance
import pytesseract
from PIL import Image
from openpyxl import Workbook

# Configure the path to Tesseract (if required)
# pytesseract.pytesseract.tesseract_cmd = r'/usr/local/bin/tesseract'  # Path on Mac
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\'  # Path on Windows

# Open the picture
# image_path = "/Users/xxx/page_3.png"  # Replace with your image path


def enhance_image(img):
    img = img.convert('L')  # Convert to grayscale
    # Double the contrast (the enhancer type is assumed here)
    img = ImageEnhance.Contrast(img).enhance(2.0)
    return img


def allimngs(image_path):
    image = Image.open(image_path)
    image = enhance_image(image)
    # OCR using pytesseract
    text = pytesseract.image_to_string(image, lang="chi_sim")  # Simplified Chinese
    # # Print extracted text
    # print("Extracted text:")
    # print(text.replace(' ', ''))
    return text.replace(' ', '')


# Count how many times each keyword occurs
class TrieNode:
    def __init__(self):
        self.children = {}
        self.keywords = []


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, keyword):
        node = self.root
        for char in keyword:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.keywords.append(keyword)


def count_keywords(text, keywords):
    # Deduplicate keywords to ensure uniqueness
    keywords = list(set(keywords))
    # Build the trie
    trie = Trie()
    for kw in keywords:
        trie.insert(kw)
    # Initialize the counters
    counters = {kw: 0 for kw in keywords}
    i = 0
    n = len(text)
    while i < n:
        current_node = trie.root
        matched_node = None  # Node where the longest match so far ends
        current_len = 0
        end_pos = i
        # Find the longest keyword that starts at the current position
        for j in range(i, n):
            char = text[j]
            if char in current_node.children:
                current_node = current_node.children[char]
                current_len += 1
                if current_node.keywords:  # The current node ends a keyword
                    matched_node = current_node
                    end_pos = j + 1  # Position right after the matched keyword
            else:
                break  # No further match, exit the loop
        if matched_node is not None:
            # Count from the node of the longest match, not the last node visited
            for kw in matched_node.keywords:
                counters[kw] += 1
            i = end_pos  # Jump to the end of the matched part
        else:
            i += 1  # No match, move to the next character
    return counters


if __name__ == "__main__":
    # Spaces are stripped from the OCR text below, so keywords must not rely on spaces
    keywords = ['Short', 'Sit at the status quo', 'Hidden', 'dim', 'Dark']
    all_text = ''
    workbook = Workbook()
    sheet = workbook.active
    for i in range(1, 109):  # Pages 1 to 108
        image_path = f"/Users/xxx/output_images2022/page_{i}.png"
        all_text = all_text + allimngs(image_path)
    all_text = all_text.replace(' ', '').replace('\n', '')
    result = count_keywords(all_text, keywords)
    num = 1
    for k, v in result.items():
        sheet[f'A{num}'] = k
        sheet[f'B{num}'] = v
        print(k, v, num)
        num = num + 1
    workbook.save(filename='')
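The counting step treats matches as non-overlapping and prefers the longest keyword at each position, which differs from calling `str.count` once per keyword. The following is a minimal stand-alone sketch of that behaviour without a trie (the function name `count_longest_match` and the sample strings are illustrative assumptions). It is meant to make the semantics easy to check, not to replace a trie for large keyword sets:

```python
def count_longest_match(text, keywords):
    """Count non-overlapping keyword hits, preferring the longest keyword at each position."""
    # Try longer keywords first so that 'abcd' wins over 'ab' at the same position
    ordered = sorted(set(k for k in keywords if k), key=len, reverse=True)
    counts = {kw: 0 for kw in ordered}
    i = 0
    while i < len(text):
        for kw in ordered:
            if text.startswith(kw, i):
                counts[kw] += 1
                i += len(kw)  # Skip past the match so hits never overlap
                break
        else:
            i += 1  # No keyword starts here
    return counts

sample = "abcdab"
print(count_longest_match(sample, ["ab", "abcd"]))  # {'abcd': 1, 'ab': 1}
print(sample.count("ab"))  # 2, because str.count also counts the 'ab' inside 'abcd'
```

Whether the longest-match rule or a plain per-keyword count is the right choice depends on whether shorter keywords contained in longer ones should be counted separately.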
3. Supplementary knowledge
OCR methods for recognizing text in images and PDFs with Python
1、PaddleOCR
Features: Developed on Baidu's PaddlePaddle framework, it offers a rich set of models and supports multilingual recognition, including Chinese and English. It performs well and is suitable for text recognition in complex scenarios.
Install the PaddleOCR library:
pip install paddleocr
Sample code
from paddleocr import PaddleOCR, draw_ocr
from PIL import Image

# Initialize PaddleOCR
# Parameter explanation:
# `lang`: Specify the language model, such as 'en' (English), 'ch' (Chinese), etc.
# `use_angle_cls`: Whether to enable the text orientation classifier.
ocr = PaddleOCR(use_angle_cls=True, lang='en')  # Can also be set to 'ch' for Chinese

# Specify the image path
img_path = ''  # Replace with your image path

# Perform OCR recognition
result = ocr.ocr(img_path, cls=True)  # `cls=True` enables the direction classifier

# Note: some PaddleOCR releases return one list per input image;
# in that case work with result[0] instead of result below.

# Print recognition results
for line in result:
    print(line)

# Optional: draw the recognition result and save it
if result:
    image = Image.open(img_path).convert('RGB')
    boxes = [line[0] for line in result]  # Extract text boxes
    txts = [line[1][0] for line in result]  # Extract text content
    scores = [line[1][1] for line in result]  # Extract confidence
    # Draw results
    im_show = draw_ocr(image, boxes, txts, scores, font_path='path/to/PaddleOCR/doc/fonts/')
    im_show = Image.fromarray(im_show)
    im_show.save('')  # Save the drawn picture
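Each recognized line in the sample above is indexed as `line[0]` (the text box) and `line[1]` (a text and confidence pair). Assuming that shape, low-confidence lines can be dropped before further processing. The sample data, the function name `confident_texts`, and the 0.8 threshold below are made up for illustration:

```python
def confident_texts(lines, min_score=0.8):
    """Return the text of every line whose confidence is at least min_score."""
    return [text for _box, (text, score) in lines if score >= min_score]

# Hand-written stand-in for OCR output: [box, (text, confidence)]
sample_lines = [
    [[[10, 10], [120, 10], [120, 40], [10, 40]], ("Invoice", 0.98)],
    [[[10, 50], [90, 50], [90, 80], [10, 80]], ("T0tal", 0.41)],
    [[[10, 90], [150, 90], [150, 120], [10, 120]], ("2022-01-31", 0.93)],
]

print(confident_texts(sample_lines))  # ['Invoice', '2022-01-31']
```

A suitable threshold depends on the image quality, so it is worth checking a few pages by eye before fixing a value.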
2、RapidOCR
First, make sure the ONNX Runtime version of RapidOCR is installed. ONNX Runtime is a lightweight and efficient inference engine:
pip install rapidocr_onnxruntime
Sample code: Recognize numbers and letters
The following code shows how to use RapidOCR to recognize numbers and letters in an image and print only the recognition results:
from rapidocr_onnxruntime import RapidOCR

# Initialize the OCR engine
ocr = RapidOCR()

# Specify the image path
img_path = ''  # Replace with your image path

# Run recognition
result, _ = ocr(img_path)

# Extract and print recognition results (numbers and letters only)
if result:
    for line in result:
        text = line[1]  # Extract text content
        # Keep only text that consists of numbers and letters
        if text.isalnum():
            print(text)
else:
    print("No text is recognized")
Things to note
- Image path: Make sure that the image pointed to by img_path contains numbers or letters.
- Language settings: By default, RapidOCR supports mixed recognition of Chinese and English. If you need to recognize other languages, refer to the documentation for configuration.
- Environment requirements: Ensure that the Python version is 3.6 or higher.
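One caveat for the numbers-and-letters filter: if it is implemented with Python's `str.isalnum()`, note that this method is Unicode-aware and also accepts Chinese characters. If only ASCII letters and digits should pass, a stricter check is needed. A minimal sketch using a regular expression (the helper name `is_ascii_alnum` is an assumption):

```python
import re

_ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")

def is_ascii_alnum(text):
    """True only if text is non-empty and made of ASCII letters and digits."""
    return _ASCII_ALNUM.fullmatch(text) is not None

# Compare the built-in check with the stricter one
for candidate in ["ABC123", "中文", "A1 B2", ""]:
    print(repr(candidate), candidate.isalnum(), is_ascii_alnum(candidate))
```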
3、EasyOCR
Features: Easy to use, supports multiple languages (including Chinese, English, etc.), based on deep learning technology, suitable for beginners and fast integration.
Installation method:
pip install easyocr
Example of usage:
import easyocr

reader = easyocr.Reader(['en', 'ch_sim'])  # Multilingual support
img_path = ''
result = reader.readtext(img_path)
for line in result:
    print(line[1])  # Print recognition results
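EasyOCR's `readtext` returns one entry per detected text region: index 0 holds the four corner points of the box, index 1 the text, and index 2 the confidence. When the output needs to be read top to bottom, sorting by box position can help. A rough sketch on hand-written data (the helper name `reading_order`, the sample boxes, and the 10-pixel row tolerance are assumptions):

```python
def reading_order(detections, row_tolerance=10):
    """Sort (box, text, confidence) entries top to bottom, then left to right."""
    def key(entry):
        box = entry[0]
        top = min(point[1] for point in box)
        left = min(point[0] for point in box)
        # Boxes whose tops fall in the same tolerance band count as one row
        return (top // row_tolerance, left)
    return sorted(detections, key=key)

# Hand-written stand-in for OCR output, deliberately out of order
sample = [
    ([[200, 52], [300, 52], [300, 80], [200, 80]], "world", 0.95),
    ([[10, 120], [150, 120], [150, 150], [10, 150]], "second line", 0.90),
    ([[10, 50], [120, 50], [120, 80], [10, 80]], "hello", 0.97),
]

print([text for _box, text, _conf in reading_order(sample)])
# ['hello', 'world', 'second line']
```

The fixed-band grouping is only a heuristic: two boxes on the same visual row can land in different bands when their tops straddle a band boundary, so skewed scans may need a more careful grouping.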
4、Pytesseract
Features: A Python wrapper for Tesseract that supports multiple languages, is easy to use, and is suitable for traditional OCR tasks.
Installation method:
pip install pytesseract
You need to install Tesseract OCR first, which can be downloaded from the Tesseract official website.
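Because pytesseract only wraps the Tesseract executable, a missing installation shows up only when OCR is first called. A small check with the standard library can surface this earlier. This is a sketch that assumes the executable is named `tesseract` and is on the PATH (the function name `find_tesseract` is made up for this example):

```python
import shutil

def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)

path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH; install it or set tesseract_cmd explicitly.")
else:
    print(f"Tesseract found at {path}")
```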
Example of usage:
from PIL import Image
import pytesseract

img_path = ''
text = pytesseract.image_to_string(Image.open(img_path), lang='eng')
print(text)  # Print recognition results
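Tesseract output usually contains spaces and line breaks that follow the page layout. For Chinese text, where spaces carry no meaning, the main script in section 2 strips them before counting keywords. A compact way to remove every kind of whitespace in one pass is sketched below (the helper name `strip_whitespace` and the sample string are assumptions):

```python
def strip_whitespace(text):
    """Remove all whitespace, including spaces, tabs, and line breaks."""
    # str.split() with no arguments splits on any run of whitespace
    return "".join(text.split())

raw = "第 一 行\n第二行\t 结束\n"
print(strip_whitespace(raw))  # 第一行第二行结束
```

This should only be applied to text where word boundaries do not matter; for English output it would glue words together.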
5、docTR
Features: Focuses on document analysis and table recognition, can extract structured information from documents, and is suitable for handling documents with complex layouts.
Installation method:
pip install python-doctr
Example of usage:
from doctr.models import ocr_predictor
from doctr.io import DocumentFile

img_path = ''
doc = DocumentFile.from_images(img_path)
model = ocr_predictor(pretrained=True)
result = model(doc)
for block in result.pages[0].blocks:
    for line in block.lines:
        for word in line.words:
            print(word.value)  # Print recognition results
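The nested loops in this sample print one word at a time. To rebuild whole text lines, the words of each line can be joined. The sketch below runs on stand-in objects built with `types.SimpleNamespace` that mimic the `pages`, `blocks`, `lines`, `words`, and `value` attributes of a docTR result; the stand-ins and the helper name `page_lines` are assumptions, not docTR classes:

```python
from types import SimpleNamespace as NS

def page_lines(page):
    """Join the words of every line on a page into one string per line."""
    return [
        " ".join(word.value for word in line.words)
        for block in page.blocks
        for line in block.lines
    ]

# Stand-in for a recognized document with one page, one block, two lines
fake_result = NS(pages=[NS(blocks=[NS(lines=[
    NS(words=[NS(value="Total"), NS(value="amount")]),
    NS(words=[NS(value="42.00")]),
])])])

print(page_lines(fake_result.pages[0]))  # ['Total amount', '42.00']
```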
6、PyOCR
Features: It encapsulates multiple OCR engines (such as Tesseract, Cuneiform, etc.) to provide a unified interface.
Installation method:
pip install pyocr
Example of usage:
import pyocr
from PIL import Image

tools = pyocr.get_available_tools()
ocr_tool = tools[0]
img_path = ''
text = ocr_tool.image_to_string(Image.open(img_path), lang='eng')
print(text)  # Print recognition results
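`tools[0]` raises an `IndexError` when no OCR engine is installed, since `get_available_tools()` then returns an empty list. A defensive variant is sketched below; the helper name `first_tool` is an assumption, and the demo calls it with plain lists rather than real PyOCR tools so that it runs without any engine installed:

```python
def first_tool(tools):
    """Return the first available OCR tool or raise a descriptive error."""
    if not tools:
        raise RuntimeError(
            "No OCR engine found; install Tesseract or another supported engine."
        )
    return tools[0]

# Stand-in lists instead of real PyOCR tool objects
print(first_tool(["tesseract-stand-in"]))
try:
    first_tool([])
except RuntimeError as exc:
    print(exc)
```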
Selection suggestions:
Speed first: RapidOCR or EasyOCR is recommended.
Accuracy first: PaddleOCR is recommended.
Ease of use first: EasyOCR is recommended.
Document analysis first: docTR is recommended.
Note: Choose the OCR library that best fits your specific needs, such as language support, application scenario, and performance requirements.
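Before settling on a library, it can help to see which of them are already installed in the current environment. A minimal sketch using only the standard library; the import names listed are the usual ones for these packages, and the function name `installed_ocr_libraries` is an assumption:

```python
from importlib.util import find_spec

# Display name -> top-level module name
OCR_MODULES = {
    "PaddleOCR": "paddleocr",
    "RapidOCR": "rapidocr_onnxruntime",
    "EasyOCR": "easyocr",
    "Pytesseract": "pytesseract",
    "docTR": "doctr",
    "PyOCR": "pyocr",
}

def installed_ocr_libraries():
    """Map each library name to True if its module can be found in this environment."""
    return {name: find_spec(module) is not None for name, module in OCR_MODULES.items()}

for name, available in installed_ocr_libraries().items():
    print(f"{name}: {'installed' if available else 'not installed'}")
```

This only checks that the Python package is present; engines such as Tesseract still have to be installed separately.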