Counting word frequencies is a quick way to preview a PDF report. This article covers two cases: extracting text from text-based PDFs and from image-based (scanned) PDFs.
Text-based PDFs allow direct text extraction, but many PDFs consist of page images, from which text cannot be extracted directly. For those, this article converts each PDF page to an image, recognizes the text with OCR, and then counts word frequencies in the resulting text for a quick preview.
1. Extract text from an image PDF
1. Convert the PDF to images
This article uses the PyMuPDF module for conversion.
1. Much of the information available online about PyMuPDF is outdated because the module's API has changed; the code in this article uses the updated API. Install the module with pip install PyMuPDF, but note that the import name is fitz. At the time of writing, the author found that this library did not support Python 3.10 or above.
2. This article adds file-path handling: the find_pdf_files(directory) function, and the following lines, which derive the image file name from the PDF file name.
filename = os.path.basename(pdf)
file_name, file_extension = os.path.splitext(filename)
image_path = os.path.join(imagePath, f"{file_name}{pg}.jpg")
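The path handling can be tried on its own, without opening any PDF. The sketch below is only an illustration: build_image_path is a hypothetical helper name, and the underscore before the page index is a choice made here for readability.

```python
import os


def build_image_path(pdf_path, image_dir, page_index):
    """Return the image path for one page of a PDF (hypothetical helper)."""
    # Strip the directory part, e.g. "report.pdf"
    filename = os.path.basename(pdf_path)
    # Split off the extension, e.g. ("report", ".pdf")
    file_name, _extension = os.path.splitext(filename)
    # os.path.join inserts the separator that fits the current platform
    return os.path.join(image_dir, f"{file_name}_{page_index}.jpg")


print(build_image_path(os.path.join("reports", "report.pdf"), "images", 0))
```

Using os.path.join instead of a hand-written backslash keeps the code portable and avoids a backslash being read as an escape character inside the string.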
Code
import datetime
import os

import fitz  # PyMuPDF


# Return the full paths of all files ending in .pdf under a directory
def find_pdf_files(directory):
    pdf_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.pdf'):
                pdf_file_path = os.path.join(root, file)
                pdf_files.append(pdf_file_path)
    return pdf_files


def pyMuPDF_fitz(pdf, imagePath):
    startTime_pdf2img = datetime.datetime.now()  # Start time
    print("imagePath=" + imagePath)
    pdfDoc = fitz.open(pdf)
    # Create the image folder if it does not exist
    if not os.path.exists(imagePath):
        os.makedirs(imagePath)
    # Extract the file name without its extension
    filename = os.path.basename(pdf)
    file_name, file_extension = os.path.splitext(filename)
    for pg in range(pdfDoc.page_count):
        page = pdfDoc[pg]
        rotate = int(0)
        # The zoom factor scales each dimension. Without it, a 792x612-point page
        # is rendered as 792x612 pixels (72 dpi). A factor of 1.33333333 gives 96 dpi.
        zoom_x = 1.33333333  # (1.33333333-->1056x816) (2-->1584x1224)
        zoom_y = 1.33333333
        mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        image_path = os.path.join(imagePath, f"{file_name}{pg}.jpg")
        pix.save(image_path)  # Write the image into the specified folder
    pdfDoc.close()
    endTime_pdf2img = datetime.datetime.now()  # End time
    print('pdf2imgtime=', (endTime_pdf2img - startTime_pdf2img).seconds)


if __name__ == "__main__":
    # 1. Folder that contains the PDFs
    path = r"xx"
    flist = find_pdf_files(path)
    # 2. Folder in which the images are stored
    imagePath = r"xx"
    for pdf in flist:
        pyMuPDF_fitz(pdf, imagePath)
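The zoom values in the code comments follow from simple arithmetic: a PDF point is 1/72 inch, so a zoom factor of dpi / 72 renders the page at that dpi. The sketch below only reproduces this arithmetic; zoom_for_dpi and scaled_size are hypothetical helper names, and the pixel sizes PyMuPDF actually produces may differ by a pixel because of rounding.

```python
def zoom_for_dpi(dpi, base_dpi=72):
    """Zoom factor that renders a page at the requested dpi."""
    return dpi / base_dpi


def scaled_size(width_pt, height_pt, zoom):
    """Approximate pixel size of a page of the given point size after zooming."""
    return round(width_pt * zoom), round(height_pt * zoom)


zoom = zoom_for_dpi(96)
print(zoom, scaled_size(792, 612, zoom))
print(scaled_size(792, 612, 2))
```

A higher zoom generally helps OCR on small print, at the cost of larger image files and slower recognition.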
2. OCR picture to text
This article uses paddleocr, Baidu's open-source OCR library.
First run pip install paddlepaddle, then pip install paddleocr. At the time of writing, the author found that these two libraries did not yet support Python 3.10 or above, mainly because the PyMuPDF version that paddleocr depends on did not support it.
Convert the specified area on the image to text
import os

from paddleocr import PaddleOCR


# Concatenate the recognized text inside a region into one string
def text_noposition(data, left, right, bottom, top):
    text_res = ""
    # data[0] contains the position and text of every recognized line
    for i in data[0]:
        # i[0][0] is the top-left corner of the text box:
        # i[0][0][0] is the horizontal position, i[0][0][1] is the vertical position
        x, y = i[0][0][0], i[0][0][1]
        if left < x < right and bottom < y < top:
            # i[1][0] is the text
            text_res = text_res + i[1][0]
    return text_res


def convert_png_to_txt(dir_path, output_path):
    # Initialize PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")
    output_text = ""
    # Traverse all .jpg files in the specified directory
    for filename in os.listdir(dir_path):
        if filename.endswith('.jpg'):
            file_path = os.path.join(dir_path, filename)
            # Use PaddleOCR to extract text from the picture
            result = ocr.ocr(file_path, cls=True)
            print(result)
            # Keep only the text inside the selected region
            processed_text = text_noposition(result, left=0, right=10000, bottom=500, top=2000)
            # Append the processed text to the output text
            output_text += processed_text + "\n"
    # Append the output text to the .txt file; the with block closes the file
    with open(output_path, "a") as file:  # You can change the file name and path as needed
        file.write(output_text)


# Directory of the pictures to be processed
dir_path = r'D:\data\2024\PDF\xx'
# Output file path
output_path = r'D:\data\2024\PDF\xx'
convert_png_to_txt(dir_path, output_path)
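The region filter can be checked without running OCR at all. The sketch below feeds a hand-made list with the same nesting that the code above indexes into (box corners first, then text and score); the sample values and the name text_in_region are made up for illustration. Image coordinates grow downward, so bottom is the smaller y limit and top the larger one.

```python
def text_in_region(data, left, right, bottom, top):
    """Join the text of every box whose top-left corner lies inside the region."""
    parts = []
    for box, (text, _score) in data[0]:
        x, y = box[0]  # top-left corner of the text box
        if left < x < right and bottom < y < top:
            parts.append(text)
    return "".join(parts)


# Hand-made sample in the nested shape used above (not real OCR output)
sample = [[
    [[[50, 100], [300, 100], [300, 140], [50, 140]], ("Page header", 0.99)],
    [[[50, 600], [400, 600], [400, 640], [50, 640]], ("Body text", 0.98)],
    [[[50, 2100], [200, 2100], [200, 2140], [50, 2140]], ("Footer", 0.97)],
]]

print(text_in_region(sample, left=0, right=10000, bottom=500, top=2000))
```

With bottom=500 and top=2000 the header and footer rows fall outside the region, which is the purpose of the filter when reports repeat the same header on every page.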
Convert all pictures to text
Here the region parameters are removed. Alternatively, you can keep the region-selecting function and set the region to extreme values, for example (0, 10000, 0, 10000).
import os

from paddleocr import PaddleOCR


def text_noposition(data):
    text_res = ""
    # data[0] contains the position and text of every recognized line
    for i in data[0]:
        # i[1][0] is the text
        text_res = text_res + i[1][0]
    return text_res


def convert_png_to_txt(dir_path, output_path):
    # Initialize PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")
    output_text = ""
    # Traverse all .jpg files in the specified directory
    for filename in os.listdir(dir_path):
        if filename.endswith('.jpg'):
            file_path = os.path.join(dir_path, filename)
            # Use PaddleOCR to extract text from the picture
            result = ocr.ocr(file_path, cls=True)
            print(result)
            # Join all recognized text; no region is passed any more
            processed_text = text_noposition(result)
            # Append the processed text to the output text
            output_text += processed_text + "\n"
    # Append the output text to the .txt file; the with block closes the file
    with open(output_path, "a") as file:  # You can change the file name and path as needed
        file.write(output_text)
3. Read high-frequency words
Read the OCR result, segment it into words with jieba, and count the word frequency.
Depending on the file encoding, a gbk decoding error may occur when the file is opened. In that case, change the line that opens the file to:
f = open(file_path, encoding='utf-8')
from collections import Counter

import jieba
import pandas as pd


def cut_with_filter(sentence, stopwords):
    # Segment the text with jieba's precise mode
    seg_list = jieba.cut(sentence, cut_all=False)
    # Remove stop words
    filtered_seg_list = [word for word in seg_list if word not in stopwords]
    return filtered_seg_list


if __name__ == '__main__':
    file_path = r"D:\data\2023\pdf\pdf\result.txt"
    # Text to be segmented
    with open(file_path) as f:
        text = f.read()
    # text = read_doc_file(file_path)
    # Stop word list; add or modify entries as needed
    stopwords = ["的", "了", "在", "是", "我", "有", "和", "就", "不", "人", "都", "一", "一个", "上", "也", "很", "到"]
    word_list = cut_with_filter(text, stopwords)
    # Keep only tokens made of letters or Chinese characters (drops punctuation and numbers)
    chinese_list = [word for word in word_list if isinstance(word, str) and word.isalpha()]
    # Count the frequency of each word
    counter = Counter(chinese_list)
    word_freq = dict(counter)
    keys = pd.Series(list(word_freq.keys()))
    values = pd.Series(list(word_freq.values()))
    # Save the words and their frequencies to a DataFrame
    df = pd.DataFrame({'word': keys, 'word frequency': values})
    print(df)
    # Save the DataFrame to an Excel file (the file name needs an .xlsx extension)
    df.to_excel('segmentation_result.xlsx', index=False)
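For a quick preview it is often enough to look at the few most frequent words instead of the full table. The sketch below assumes the text has already been segmented into tokens (the token list here is made up) and uses Counter.most_common; top_words is a hypothetical helper name.

```python
from collections import Counter


def top_words(tokens, stopwords, n=3):
    """Return the n most frequent alphabetic tokens that are not stop words."""
    counter = Counter(t for t in tokens if t.isalpha() and t not in stopwords)
    return counter.most_common(n)


# Made-up tokens standing in for the output of the segmentation step
tokens = ["report", "the", "budget", "report", ",", "2024", "budget", "report", "the"]
print(top_words(tokens, stopwords={"the"}))
```

most_common sorts by descending count, so the first entries are usually the ones that reveal what a report is about.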
2. Extract text from a text PDF
Use the PyMuPDF library to quickly extract text from text-based PDFs. The code is wrapped in two functions:
1. To convert a single file, use the pdf2txt function. The input path is the path of the PDF file.
2. To convert a folder, use the pdf2txt_multi function. The input path is the folder that contains the PDFs.
import os

import fitz  # PyMuPDF


# Convert a single file
def pdf2txt(input_file, output_file):
    with fitz.open(input_file) as doc:
        text = ""
        for page in doc.pages():
            text += page.get_text()  # Note that the get_text() method is used here
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Convert every PDF in a folder into one txt file
def pdf2txt_multi(input_folder, output_file):
    text = ""
    # Traverse all files in the input folder
    for file_name in os.listdir(input_folder):
        if file_name.endswith(".pdf"):
            print(file_name)
            # Build the input file path
            input_file = os.path.join(input_folder, file_name)
            # Open the PDF file and collect its text
            with fitz.open(input_file) as doc:
                for page in doc.pages():
                    text += page.get_text()  # Note that the get_text() method is used here
    # Write once after the loop so that later files do not overwrite earlier ones
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(text)
    return text


if __name__ == "__main__":
    # Input and output paths; to convert a folder, the input is the folder path
    input_file = r"xx"
    output_file = ""
    pdf2txt_multi(input_file, output_file)
3. Merge text PDF
Use Python's fitz library (PyMuPDF) to merge the PDFs in a folder into one PDF.
import os

import fitz  # PyMuPDF


def merge_pdfs(pdf_folder, output_pdf_path):
    # Get the names of all PDF files in the folder
    pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith(".pdf")]
    # Create a new, empty PDF document
    merged_pdf = fitz.open()
    # Traverse each PDF file and insert it into the merged document
    for pdf_file in pdf_files:
        pdf_path = os.path.join(pdf_folder, pdf_file)
        pdf = fitz.open(pdf_path)
        merged_pdf.insert_pdf(pdf)
        pdf.close()
    merged_pdf.save(output_pdf_path)
    merged_pdf.close()


if __name__ == '__main__':
    # Folder that contains the PDF files
    pdf_folder = r"D:\Work\Science and Technology Innovation Special Class\Data\Conference Report\2024 Seminar"
    # Path of the merged PDF file
    output_pdf_path = "output_merged_pdf.pdf"
    merge_pdfs(pdf_folder, output_pdf_path)
    print(f"Merge is completed, saved as: {output_pdf_path}")
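The merge follows the order in which os.listdir returns the file names, and that order is arbitrary. If the page order of the merged PDF matters, the names can be sorted first. The sketch below is an optional addition, not part of the original code: natural_key is a hypothetical helper that sorts embedded numbers by value, so that report2.pdf comes before report10.pdf.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as numbers and the rest as lowercase text."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


names = ["report10.pdf", "report2.pdf", "report1.pdf"]
print(sorted(names, key=natural_key))
```

In merge_pdfs, this would amount to iterating over sorted(pdf_files, key=natural_key) instead of pdf_files.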
4. Convert image PDFs to Word and merge
1. Single-process mode
import datetime
import os

import fitz  # PyMuPDF
from docx import Document
from paddleocr import PaddleOCR


# Return the full paths of all files ending in .pdf under a directory
def find_pdf_files(directory):
    pdf_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.pdf'):
                pdf_file_path = os.path.join(root, file)
                pdf_files.append(pdf_file_path)
    return pdf_files


def text_noposition(data, left, right, bottom, top):
    text_res = ""
    # data[0] contains the position and text of every recognized line
    for i in data[0]:
        # i[0][0][0] is the horizontal position, i[0][0][1] is the vertical position
        x, y = i[0][0][0], i[0][0][1]
        if left < x < right and bottom < y < top:
            # i[1][0] is the text
            text_res = text_res + i[1][0]
    return text_res


# word_path is a module-level variable set in the main block below
def pyMuPDF_fitz(pdf, image_folder_path):
    start_time_pdf2img = datetime.datetime.now()  # Start time
    pdf_doc = fitz.open(pdf)
    # Create the OCR model
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")
    # Create the Word document
    doc = Document()
    # Create the image folder if it does not exist
    if not os.path.exists(image_folder_path):
        os.makedirs(image_folder_path)
    # Extract the file name without its extension
    filename = os.path.basename(pdf)
    file_name, file_extension = os.path.splitext(filename)
    for pg in range(pdf_doc.page_count):
        page = pdf_doc[pg]
        rotate = int(0)
        # The zoom factor scales each dimension. Without it, a 792x612-point page
        # is rendered as 792x612 pixels (72 dpi). A factor of 1.33333333 gives 96 dpi.
        zoom_x = 1.33333333  # (1.33333333-->1056x816) (2-->1584x1224)
        zoom_y = 1.33333333
        mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        print(file_name, "page number", pg)
        image_path = os.path.join(image_folder_path, f"{file_name}_{pg}.jpg")
        pix.save(image_path)  # Write the image into the specified folder
        # Recognize the text in the picture
        result = ocr.ocr(image_path)
        # Write the recognition result to the Word document
        doc.add_paragraph(f"{file_name}")
        doc.add_paragraph(f"page number:{pg+1}")
        processed_text = text_noposition(result, left=0, right=10000, bottom=500, top=2000)
        doc.add_paragraph(processed_text)
    pdf_doc.close()
    # Create the Word folder if it does not exist
    if not os.path.exists(word_path):
        os.makedirs(word_path)
    # Save the Word document
    doc_name = f"{file_name}.docx"
    doc_path = os.path.join(word_path, doc_name)
    doc.save(doc_path)
    end_time_pdf2img = datetime.datetime.now()  # End time
    print('pdf2imgtime=', (end_time_pdf2img - start_time_pdf2img).seconds)
    return doc_name


if __name__ == '__main__':
    # test
    pdf_folder = r"D:\Work\Science and Technology Innovation Team\Data\Conference Report\24 Year Work Conference"
    image_folder = r"xx"
    word_path = r"xx"
    pdf_files = find_pdf_files(pdf_folder)
    for pdf_file in pdf_files:
        pyMuPDF_fitz(pdf_file, image_folder)
2. Multi-process mode
In multi-process mode, import the paddleocr package inside the function that each process runs. Otherwise the package cannot be serialized and the program fails to run.
def pyMuPDF_fitz(pdf, image_folder_path):
    from paddleocr import PaddleOCR
Do not import the package at the top of the file.
import datetime
import os

import fitz  # PyMuPDF
# from paddleocr import PaddleOCR  # imported inside the worker function instead
from docx import Document
from joblib import Parallel, delayed


# Return the full paths of all files ending in .pdf under a directory
def find_pdf_files(directory):
    pdf_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.pdf'):
                pdf_file_path = os.path.join(root, file)
                pdf_files.append(pdf_file_path)
    return pdf_files


def text_noposition(data, left, right, bottom, top):
    text_res = ""
    # data[0] contains the position and text of every recognized line
    for i in data[0]:
        # i[0][0][0] is the horizontal position, i[0][0][1] is the vertical position
        x, y = i[0][0][0], i[0][0][1]
        if left < x < right and bottom < y < top:
            # i[1][0] is the text
            text_res = text_res + i[1][0]
    return text_res


# word_path is a module-level variable set in the main block below
def pyMuPDF_fitz(pdf, image_folder_path):
    from paddleocr import PaddleOCR
    start_time_pdf2img = datetime.datetime.now()  # Start time
    pdf_doc = fitz.open(pdf)
    # Create the OCR model
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")
    # Create the Word document
    doc = Document()
    # Create the image folder if it does not exist
    if not os.path.exists(image_folder_path):
        os.makedirs(image_folder_path, exist_ok=True)
    # Extract the file name without its extension
    filename = os.path.basename(pdf)
    file_name, file_extension = os.path.splitext(filename)
    for pg in range(pdf_doc.page_count):
        page = pdf_doc[pg]
        rotate = int(0)
        # The zoom factor scales each dimension. Without it, a 792x612-point page
        # is rendered as 792x612 pixels (72 dpi). A factor of 1.33333333 gives 96 dpi.
        zoom_x = 1.33333333  # (1.33333333-->1056x816) (2-->1584x1224)
        zoom_y = 1.33333333
        mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        print(file_name, "page number", pg)
        image_path = os.path.join(image_folder_path, f"{file_name}_{pg}.jpg")
        pix.save(image_path)  # Write the image into the specified folder
        # Recognize the text in the picture
        result = ocr.ocr(image_path)
        # Write the recognition result to the Word document
        doc.add_paragraph(f"{file_name}")
        doc.add_paragraph(f"page number:{pg+1}")
        processed_text = text_noposition(result, left=0, right=10000, bottom=500, top=2000)
        doc.add_paragraph(processed_text)
    pdf_doc.close()
    # Create the Word folder if it does not exist
    if not os.path.exists(word_path):
        os.makedirs(word_path, exist_ok=True)
    # Save the Word document
    doc_name = f"{file_name}.docx"
    doc_path = os.path.join(word_path, doc_name)
    doc.save(doc_path)
    end_time_pdf2img = datetime.datetime.now()  # End time
    print('pdf2imgtime=', (end_time_pdf2img - start_time_pdf2img).seconds)
    return doc_name


if __name__ == '__main__':
    # test
    pdf_folder = r"D:\data\2024\Conference Report\24 Annual Work Conference"
    image_folder = r"D:\data\2024\picture"
    word_path = r"D:\data\2024\word"
    pdf_files = find_pdf_files(pdf_folder)
    # for pdf_file in pdf_files:
    #     pyMuPDF_fitz(pdf_file, image_folder)
    # paddleocr cannot be serialized, so it is imported and created inside each worker
    results = Parallel(n_jobs=-1)(
        delayed(pyMuPDF_fitz)(pdf_file, image_folder) for pdf_file in pdf_files)
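The serialization problem can be illustrated with the standard library alone. Work sent to another process is pickled, and some objects cannot be pickled. The sketch below is a generic illustration, not a reproduction of how paddleocr fails: FakeModel is a hypothetical stand-in that holds a thread lock, which pickle refuses to serialize.

```python
import pickle
import threading


class FakeModel:
    """Hypothetical stand-in for a heavy object that cannot be pickled."""

    def __init__(self):
        self._lock = threading.Lock()  # thread locks cannot be pickled


def is_picklable(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def worker(pdf_path):
    # Build the model inside the worker, so only the path string is sent across
    model = FakeModel()
    return f"processed {pdf_path}"


print(is_picklable("a.pdf"), is_picklable(FakeModel()))
print(worker("a.pdf"))
```

Passing only plain values such as file paths to each worker, and creating heavy objects inside it, keeps everything that crosses the process boundary serializable.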
5. Merge the Word documents in a folder
import os

from docx import Document


def merge_word_documents(folder_path, output_path):
    # Create a new Word document
    merged_document = Document()
    # Traverse each Word document in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith(".docx"):
            file_path = os.path.join(folder_path, filename)
            # Open the current Word document
            current_document = Document(file_path)
            # Move the body elements of the current document into the merged document.
            # list() takes a snapshot first, because moving an element changes the body
            # that is being iterated.
            for element in list(current_document.element.body):
                merged_document.element.body.append(element)
    # Save the merged document
    merged_document.save(output_path)


# Program entry
if __name__ == "__main__":
    folder_path = r"your_folder_path"
    output_path = r"output_folder/merged_document.docx"
    merge_word_documents(folder_path, output_path)
6. Convert the entire picture into text and write it to the pandas table
This is implemented by first writing the recognized text lines into a list and then turning the list into a DataFrame.
import os

import pandas as pd
from paddleocr import PaddleOCR
from pandasrw import view


# Collect the recognized text lines into a list
def text_noposition(data):
    data_res = []
    # data[0] contains the position and text of every recognized line
    for i in data[0]:
        # i[0][0][0] is the horizontal position, i[0][0][1] is the vertical position
        x, y = i[0][0][0], i[0][0][1]
        # i[1][0] is the text
        data_res.append(i[1][0])
    return data_res


def convert_png_to_txt(dir_path):
    # Initialize PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")
    list_text = []
    data_text = []
    # Traverse all .jpg files in the specified directory
    for filename in os.listdir(dir_path):
        if filename.endswith('.jpg'):
            file_path = os.path.join(dir_path, filename)
            # Use PaddleOCR to extract text from the picture
            result = ocr.ocr(file_path, cls=True)
            data_text = text_noposition(result)
            list_text = list_text + data_text
    print(list_text)
    # Create the DataFrame from the list
    df = pd.DataFrame(list_text)
    return df


# Directory of the pictures to be processed
dir_path = r"D:\data\2024\pic\d"
df = convert_png_to_txt(dir_path)
view(df)
7. Extract some PDF pages
extract_pages_to_new_pdf(input_path, output_path, [start page, end page]) extracts the continuous range of pages from the start page to the end page. To extract discontinuous pages, split the request into several continuous extractions.
import fitz  # PyMuPDF


def extract_pages_to_new_pdf(input_pdf_path, output_pdf_path, page_numbers):
    # Open the original PDF file
    pdf_doc = fitz.open(input_pdf_path)
    # Create a new, empty PDF document
    new_pdf = fitz.open()
    # Copy the continuous range from the smallest to the largest page number
    # (PyMuPDF page numbers are 0-based)
    new_pdf.insert_pdf(pdf_doc, from_page=min(page_numbers), to_page=max(page_numbers))
    # Save the new PDF file
    new_pdf.save(output_pdf_path)
    # Close the documents
    pdf_doc.close()
    new_pdf.close()


if __name__ == '__main__':
    input_path = r"file name"  # path of the source PDF file
    output_path = r""
    extract_pages_to_new_pdf(input_path, output_path, [25, 42])
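Splitting a discontinuous page list into continuous runs can be done with a few lines of plain Python. The sketch below is one possible approach; group_consecutive is a hypothetical helper name and is not part of the original code.

```python
def group_consecutive(pages):
    """Group page numbers into (start, end) pairs of consecutive runs."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current run by one page
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Start a new run
            ranges.append((page, page))
    return ranges


print(group_consecutive([25, 26, 27, 40, 42, 41, 50]))
```

Each (start, end) pair can then be handled as one continuous extraction, for example by passing [start, end] as the page_numbers argument.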