Extracting text from PDFs and counting word frequency with Python

Counting word frequency is a quick way to preview PDF reports. This article covers text extraction for two kinds of PDF: text PDFs and picture PDFs.

Text can be extracted quickly from text-based PDFs, but many PDFs consist of page images, from which text cannot be extracted directly. For those, this article converts each PDF page to a picture, uses OCR to recognize the text, and then counts word frequency in the resulting text for a quick preview.

1. Extract text from a picture PDF

1. PDF to picture

This article uses the PyMuPDF module for conversion.

1. There is a lot of material on PyMuPDF, but most of it is fairly old. The module's API has changed since then, and the code in this article uses the updated API. Install the module with pip install PyMuPDF, but import it as fitz. At the time of writing, the library did not support Python versions above 3.10.

2. This article adds file path handling: the find_pdf_files(directory) function and the following lines.

filename = os.path.basename(pdf)
file_name, file_extension = os.path.splitext(filename)
image_path = os.path.join(imagePath, f"{file_name}{pg}.jpg")
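The path handling above can be tried on its own with the standard library. This is a minimal sketch; the helper name build_image_path and the sample paths are hypothetical and only illustrate how the file name, page index and target folder are combined.

```python
import os


def build_image_path(pdf_path, image_dir, page_index):
    """Return image_dir/<PDF name without extension><page_index>.jpg."""
    filename = os.path.basename(pdf_path)               # "report.pdf"
    file_name, _extension = os.path.splitext(filename)  # ("report", ".pdf")
    return os.path.join(image_dir, f"{file_name}{page_index}.jpg")


# Hypothetical sample paths, for illustration only
print(build_image_path(os.path.join("pdfs", "report.pdf"), "images", 0))
```

Using os.path.join instead of a hard-coded backslash keeps the result valid on systems other than Windows.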

Code

import datetime
import os
import fitz  # PyMuPDF


# Return the full path of every file ending in .pdf under the directory
def find_pdf_files(directory):
    pdf_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.pdf'):
                pdf_file_path = os.path.join(root, file)
                pdf_files.append(pdf_file_path)
    return pdf_files

def pyMuPDF_fitz(pdf, imagePath):
    startTime_pdf2img = datetime.datetime.now()  # Start time
    print("imagePath=" + imagePath)
    pdfDoc = fitz.open(pdf)
    for pg in range(pdfDoc.page_count):
        page = pdfDoc[pg]
        rotate = int(0)
        # Zoom factor for each axis. Without a matrix the page is rendered at
        # its default size (792x612 here, 72 dpi); 1.33333333 gives 96 dpi.
        zoom_x = 1.33333333  # (1.33333333-->1056x816)   (2-->1584x1224)
        zoom_y = 1.33333333
        mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        if not os.path.exists(imagePath):  # Check whether the picture folder exists
            os.makedirs(imagePath)  # Create the picture folder if it does not exist
        # Extract the file name without its extension
        filename = os.path.basename(pdf)
        file_name, file_extension = os.path.splitext(filename)
        image_path = os.path.join(imagePath, f"{file_name}{pg}.jpg")
        pix.save(image_path)  # Write the image into the specified folder
    endTime_pdf2img = datetime.datetime.now()  # End time
    print('pdf2imgtime=', (endTime_pdf2img - startTime_pdf2img).seconds)


if __name__ == "__main__":
    # 1. Folder that contains the PDFs
    path = r"xx"
    flist = find_pdf_files(path)
    # 2. Folder where the pictures are stored
    imagePath = r"xx"
    for pdf in flist:
        pyMuPDF_fitz(pdf, imagePath)
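The numbers in the zoom comment can be checked with a little arithmetic. The sketch below assumes PyMuPDF's base resolution of 72 dpi and the 792x612 page size quoted in the comment. The helper name rendered_size is hypothetical, and the exact pixel size PyMuPDF produces may differ by a pixel because of its own rounding.

```python
BASE_DPI = 72  # resolution used when no zoom matrix is given


def rendered_size(width_pt, height_pt, zoom):
    """Approximate pixel width, pixel height and dpi for a given zoom."""
    return (round(width_pt * zoom),
            round(height_pt * zoom),
            round(BASE_DPI * zoom))


for zoom in (1, 1.33333333, 2):
    print(zoom, rendered_size(792, 612, zoom))
```

A larger zoom usually helps OCR accuracy, at the cost of bigger image files and slower recognition.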

2. OCR picture to text

This article uses Baidu's open-source paddleocr library.

First run pip3.10 install paddlepaddle, then pip install paddleocr. Note that, at the time of writing, these two libraries did not support Python versions above 3.10, mainly because the PyMuPDF version that paddleocr depends on did not support them.

Convert the specified area on the image to text

from paddleocr import PaddleOCR
import os
import re


# Join the text recognized inside the given region into one string
def text_noposition(data, left, right, bottom, top):
    text_res = ""
    # data[0] contains location and text information
    for i in data[0]:
        # i[0][0][0] is the horizontal and i[0][0][1] the vertical position
        x, y = i[0][0][0], i[0][0][1]
        if left < x < right and bottom < y < top:
            # i[1][0] is the text
            text_res = text_res + i[1][0]
    return text_res

def convert_png_to_txt(dir_path, output_path):
    # Initialize PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")

    output_text = ""

    # Traverse all .jpg files in the specified directory
    for filename in os.listdir(dir_path):
        if filename.endswith('.jpg'):
            file_path = os.path.join(dir_path, filename)

            # Use PaddleOCR to extract text from the picture
            result = ocr.ocr(file_path, cls=True)
            print(result)
            # Use the text_noposition function to process the extracted text
            processed_text = text_noposition(result, left=0, right=10000, bottom=500, top=2000)

            # Append the processed text to the output text
            output_text += processed_text + "\n"

    # Write the output text to the .txt file; the with block closes the file
    with open(output_path, "a") as file:  # You can change the file name and path as needed
        file.write(output_text)



# Directory of the pictures to be processed
dir_path = r'D:\data\2024\PDF\xx'
# File output path
output_path = r'D:\data\2024\PDF\xx'
convert_png_to_txt(dir_path, output_path)

Convert all pictures to text

Here the region selection parameters are removed. Alternatively, keep the region selection function and set the region to extreme values, for example (0, 10000, 0, 10000).

def text_noposition(data):
    text_res = ""
    # data[0] contains location and text information
    for i in data[0]:
        # i[1][0] is the text
        text_res = text_res + i[1][0]
    return text_res

def convert_png_to_txt(dir_path, output_path):
    # Initialize PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")

    output_text = ""

    # Traverse all .jpg files in the specified directory
    for filename in os.listdir(dir_path):
        if filename.endswith('.jpg'):
            file_path = os.path.join(dir_path, filename)

            # Use PaddleOCR to extract text from the picture
            result = ocr.ocr(file_path, cls=True)
            print(result)
            # text_noposition no longer takes region arguments here
            processed_text = text_noposition(result)

            # Append the processed text to the output text
            output_text += processed_text + "\n"

    # Write the output text to the .txt file; the with block closes the file
    with open(output_path, "a") as file:  # You can change the file name and path as needed
        file.write(output_text)
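The effect of the region bounds can be seen without running OCR. The sketch below uses a hand-written stand-in for the result (not real PaddleOCR output) that follows the nesting the code above indexes into. The name filter_by_region and the sample boxes are hypothetical. Pixel coordinates grow downward, so "bottom" is the smaller y value.

```python
def filter_by_region(data, left, right, bottom, top):
    """Join the text of every line whose first corner lies inside the region."""
    text_res = ""
    for box, (text, _confidence) in data[0]:
        x, y = box[0]  # first corner point of the detected box
        if left < x < right and bottom < y < top:
            text_res += text
    return text_res


# Hand-written stand-in for an OCR result, for illustration only
mock_result = [[
    [[[50, 100], [300, 100], [300, 130], [50, 130]], ("Page header", 0.99)],
    [[[50, 600], [400, 600], [400, 640], [50, 640]], ("Body text", 0.98)],
]]

# Region used in the article: lines above y=500 are skipped
print(filter_by_region(mock_result, 0, 10000, 500, 2000))
# Extreme values keep every line
print(filter_by_region(mock_result, 0, 10000, 0, 10000))
```

Skipping the top band of the page is a simple way to drop repeated headers before counting words.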

3. Read high-frequency words

Read the OCR result, segment it with jieba (the "stutter" word segmentation library), and count the word frequency.

If a gbk encoding error occurs, which depends on the file's encoding, change the line that opens the file to:

f = open(file_path, encoding='utf-8')
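When the encoding of the result file is not known in advance, one option is to try several encodings in turn. This is a minimal sketch, not part of the original script. The helper name read_text and the default candidate list are assumptions. utf-8 is tried first because gbk can decode many utf-8 files into wrong characters without raising an error.

```python
def read_text(file_path, encodings=("utf-8", "gbk")):
    """Return the file content decoded with the first encoding that works."""
    last_error = None
    for encoding in encodings:
        try:
            with open(file_path, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next one
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error
```

In the word frequency script, text = read_text(file_path) could then replace the open and read lines.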

from collections import Counter
import jieba
import pandas as pd


def cut_with_filter(sentence, stopwords):
    # Segment the text with jieba's accurate mode
    seg_list = jieba.cut(sentence, cut_all=False)

    # Remove stop words
    filtered_seg_list = [word for word in seg_list if word not in stopwords]

    return filtered_seg_list


if __name__ == '__main__':

    file_path = r"D:\data\2023\pdf\pdf\result.txt"

    # Text to be segmented
    with open(file_path) as f:
        text = f.read()
    #text = read_doc_file(file_path)
    # Stop word list; add or modify entries as needed (for Chinese text these should be Chinese words)
    stopwords = ["of", "It's gone", "exist", "yes", "I", "have", "and", "At once", "No", "people", "All", "one", "one", "superior", "also", "very", "arrive"]
    word_list = cut_with_filter(text, stopwords)
    # Keep only tokens made of letters or characters (drops punctuation and numbers)
    chinese_list = [word for word in word_list if isinstance(word, str) and word.isalpha()]
    # Count the frequency of each word
    counter = Counter(chinese_list)
    word_freq = dict(counter)

    keys = list(word_freq.keys())
    values = list(word_freq.values())
    # Save the segmented words and their frequencies to a DataFrame
    df = pd.DataFrame({'word': keys, 'word frequency': values})

    print(df)

    # Save the DataFrame to an Excel file (to_excel needs a file extension)
    df.to_excel('Partial Result.xlsx', index=False)
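For a quick preview it is often enough to look at the most frequent words only. The sketch below assumes the text has already been segmented into a list of words. The helper name top_words and the sample list are hypothetical.

```python
from collections import Counter


def top_words(words, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)


# Hypothetical segmented words, for illustration only
sample = ["report", "budget", "report", "plan", "budget", "report"]
print(top_words(sample, 2))
```

In the script above, top_words(chinese_list, 20) would give the twenty most frequent words without going through Excel.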

2. Extract text from a text PDF

Use the PyMuPDF library to quickly extract text from text-based PDFs. This article wraps the extraction in two functions.

1. To convert a single file, use the pdf2txt function; the input path is the path of the PDF file.

2. To convert a folder, use the pdf2txt_multi function; the input path is the folder that contains the PDFs.

import fitz  # PyMuPDF
import os

# Convert a single file
def pdf2txt(input_file, output_file):
    with fitz.open(input_file) as doc:
        text = ""
        for page in doc.pages():
            text += page.get_text()  # Note that the get_text() method is used here
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(text)

    return text


def pdf2txt_multi(input_folder, output_file):
    # Collect the text of every PDF; initializing once keeps earlier files
    text = ""
    # Traverse all files in the input folder
    for file_name in os.listdir(input_folder):
        if file_name.endswith(".pdf"):
            print(file_name)
            # Build the input file path
            input_file = os.path.join(input_folder, file_name)
            # Open the PDF file and append its text
            with fitz.open(input_file) as doc:
                for page in doc.pages():
                    text += page.get_text()  # Note that the get_text() method is used here

    # Write the collected text to the txt file
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(text)

    return text



if __name__ == "__main__":
    # Input and output paths; to convert a folder, the input is the folder path
    input_file = r"xx"
    output_file = ""
    pdf2txt_multi(input_file, output_file)

3. Merge text PDF

Use PyMuPDF (imported as fitz) to merge the PDFs in a folder into one PDF.

import os
import fitz

def merge_pdfs(pdf_folder, output_pdf_path):
    pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith(".pdf")]

    # Create a new, empty PDF
    merged_pdf = fitz.open()

    # Traverse each PDF file and insert it into the merged file
    for pdf_file in pdf_files:
        pdf_path = os.path.join(pdf_folder, pdf_file)
        pdf = fitz.open(pdf_path)
        merged_pdf.insert_pdf(pdf)
        pdf.close()

    # Save and close only after every file has been inserted
    merged_pdf.save(output_pdf_path)
    merged_pdf.close()

if __name__ == '__main__':

    # Folder that contains the PDF files
    pdf_folder = r"D:\Work\Science and Technology Innovation Special Class\Data\Conference Report\2024 Seminar"
    # Path of the merged PDF file
    output_pdf_path = "output_merged_pdf.pdf"
    merge_pdfs(pdf_folder, output_pdf_path)
    print(f"Merge completed, saved as: {output_pdf_path}")

4. Convert image PDF to word and merge

1. Single-process mode

import datetime
import os
import fitz
from paddleocr import PaddleOCR
from docx import Document


# Return the full path of every file ending in .pdf under the directory
def find_pdf_files(directory):
    pdf_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.pdf'):
                pdf_file_path = os.path.join(root, file)
                pdf_files.append(pdf_file_path)
    return pdf_files

def text_noposition(data, left, right, bottom, top):
    text_res = ""
    # data[0] contains location and text information
    for i in data[0]:
        # i[0][0][0] is the horizontal and i[0][0][1] the vertical position
        x, y = i[0][0][0], i[0][0][1]
        if left < x < right and bottom < y < top:
            # i[1][0] is the text
            text_res = text_res + i[1][0]
    return text_res

def pyMuPDF_fitz(pdf, image_folder_path):
    start_time_pdf2img = datetime.datetime.now()  # Start time
    pdf_doc = fitz.open(pdf)
    # Create an OCR model
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")

    # Create the Word document
    doc = Document()

    for pg in range(pdf_doc.page_count):
        page = pdf_doc[pg]
        rotate = int(0)
        # Zoom factor for each axis. Without a matrix the page is rendered at
        # its default size (792x612 here, 72 dpi); 1.33333333 gives 96 dpi.
        zoom_x = 1.33333333  # (1.33333333-->1056x816)   (2-->1584x1224)
        zoom_y = 1.33333333
        mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        if not os.path.exists(image_folder_path):  # Check whether the picture folder exists
            os.makedirs(image_folder_path)  # Create the picture folder if it does not exist
        # Extract the file name without its extension
        filename = os.path.basename(pdf)
        file_name, file_extension = os.path.splitext(filename)
        print(file_name, "page number", pg)
        image_path = os.path.join(image_folder_path, f"{file_name}_{pg}.jpg")
        pix.save(image_path)  # Write the image into the specified folder
        # Recognize the text in the picture
        result = ocr.ocr(image_path)

        # Write the recognition results to the Word document
        doc.add_paragraph(f"{file_name}")
        doc.add_paragraph(f"page number: {pg+1}")
        processed_text = text_noposition(result, left=0, right=10000, bottom=500, top=2000)
        doc.add_paragraph(processed_text)
    # word_path is the module-level variable set in the main block
    if not os.path.exists(word_path):  # Check whether the Word folder exists
        os.makedirs(word_path)  # Create the Word folder if it does not exist
    # Save the Word document
    doc_name = f"{file_name}.docx"
    doc_path = os.path.join(word_path, doc_name)

    doc.save(doc_path)

    end_time_pdf2img = datetime.datetime.now()  # End time
    print('pdf2imgtime=', (end_time_pdf2img - start_time_pdf2img).seconds)

    return doc_name

if __name__ == '__main__':

    # test
    pdf_folder = r"D:\Work\Science and Technology Innovation Team\Data\Conference Report\24 Year Work Conference"
    image_folder = r"xx"
    word_path = r"xx"
    pdf_files = find_pdf_files(pdf_folder)
    for pdf_file in pdf_files:
        pyMuPDF_fitz(pdf_file, image_folder)

2. Multi-process mode

In multi-process mode, import the paddleocr package inside the function that each process runs. Otherwise the package cannot be serialized and the program fails to run.

def pyMuPDF_fitz(pdf, image_folder_path):
      from paddleocr import PaddleOCR

Do not import paddleocr at the top of the file:

import datetime
import os
import fitz
#from paddleocr import PaddleOCR
from joblib import Parallel, delayed
from docx import Document


# Return the full path of every file ending in .pdf under the directory
def find_pdf_files(directory):
    pdf_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.pdf'):
                pdf_file_path = os.path.join(root, file)
                pdf_files.append(pdf_file_path)
    return pdf_files

def text_noposition(data, left, right, bottom, top):
    text_res = ""
    # data[0] contains location and text information
    for i in data[0]:
        # i[0][0][0] is the horizontal and i[0][0][1] the vertical position
        x, y = i[0][0][0], i[0][0][1]
        if left < x < right and bottom < y < top:
            # i[1][0] is the text
            text_res = text_res + i[1][0]
    return text_res

def pyMuPDF_fitz(pdf, image_folder_path):
    from paddleocr import PaddleOCR
    start_time_pdf2img = datetime.datetime.now()  # Start time
    pdf_doc = fitz.open(pdf)
    # Create an OCR model
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")

    # Create the Word document
    doc = Document()

    for pg in range(pdf_doc.page_count):
        page = pdf_doc[pg]
        rotate = int(0)
        # Zoom factor for each axis. Without a matrix the page is rendered at
        # its default size (792x612 here, 72 dpi); 1.33333333 gives 96 dpi.
        zoom_x = 1.33333333  # (1.33333333-->1056x816)   (2-->1584x1224)
        zoom_y = 1.33333333
        mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        if not os.path.exists(image_folder_path):  # Check whether the picture folder exists
            os.makedirs(image_folder_path)  # Create the picture folder if it does not exist
        # Extract the file name without its extension
        filename = os.path.basename(pdf)
        file_name, file_extension = os.path.splitext(filename)
        print(file_name, "page number", pg)
        image_path = os.path.join(image_folder_path, f"{file_name}_{pg}.jpg")
        pix.save(image_path)  # Write the image into the specified folder
        # Recognize the text in the picture
        result = ocr.ocr(image_path)
        # Write the recognition results to the Word document
        doc.add_paragraph(f"{file_name}")
        doc.add_paragraph(f"page number: {pg+1}")
        processed_text = text_noposition(result, left=0, right=10000, bottom=500, top=2000)
        doc.add_paragraph(processed_text)
    # word_path is the module-level variable set in the main block
    if not os.path.exists(word_path):  # Check whether the Word folder exists
        os.makedirs(word_path)  # Create the Word folder if it does not exist
    # Save the Word document
    doc_name = f"{file_name}.docx"
    doc_path = os.path.join(word_path, doc_name)

    doc.save(doc_path)

    end_time_pdf2img = datetime.datetime.now()  # End time
    print('pdf2imgtime=', (end_time_pdf2img - start_time_pdf2img).seconds)

    return doc_name

if __name__ == '__main__':

    # test
    pdf_folder = r"D:\data\2024\Conference Report\24 Annual Work Conference"
    image_folder = r"D:\data\2024\picture"
    word_path = r"D:\data\2024\word"
    pdf_files = find_pdf_files(pdf_folder)

    # for pdf_file in pdf_files:
    #     pyMuPDF_fitz(pdf_file, image_folder)
    # paddleocr cannot be serialized, so it is imported inside pyMuPDF_fitz
    results = Parallel(n_jobs=-1,)(
        delayed(pyMuPDF_fitz)(pdf_file, image_folder) for pdf_file in pdf_files)
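The idea of importing inside the worker can be shown in isolation with the standard library. This is only a sketch of the pattern, not the article's joblib code: it uses concurrent.futures in place of joblib, the names process_one and process_all are hypothetical, and zlib stands in for paddleocr so the example runs without it.

```python
from concurrent.futures import ProcessPoolExecutor


def process_one(pdf_path):
    # Import the heavy dependency inside the worker so that each process
    # loads it by itself. The article's code would have here:
    #     from paddleocr import PaddleOCR
    # zlib is only a stand-in so the sketch runs without PaddleOCR.
    import zlib
    return pdf_path, zlib.crc32(pdf_path.encode("utf-8"))


def process_all(pdf_paths, workers=2):
    """Run process_one on every path in separate worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_one, pdf_paths))
```

Call process_all(pdf_files) from under an if __name__ == '__main__': guard, as the article's script does, so that worker processes do not start the pool again when they load the module.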

5. Merge the word in the folder

from docx import Document
import os


def merge_word_documents(folder_path, output_path):
    # Create a new Word document
    merged_document = Document()

    # Traverse each Word document in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith(".docx"):
            file_path = os.path.join(folder_path, filename)

            # Open the current Word document
            current_document = Document(file_path)

            # Copy the contents of the current document into the merged document.
            # Iterate over a list copy: append() moves each element out of
            # current_document, which would otherwise disturb the iteration.
            for element in list(current_document.element.body):
                merged_document.element.body.append(element)

    # Save the merged document
    merged_document.save(output_path)


# Program entry
if __name__ == "__main__":
    folder_path = r"your_folder_path"
    output_path = r"output_folder/merged_document.docx"
    merge_word_documents(folder_path, output_path)

6. Convert the entire picture into text and write it to the pandas table

The recognized text is first collected into a list, and the list is then turned into a DataFrame.

import os
import pandas as pd
from paddleocr import PaddleOCR
from pandasrw import view

# Collect the recognized text lines into a list
def text_noposition(data):
    data_res = []
    # data[0] contains location and text information
    for i in data[0]:
        # i[0][0][0] is the horizontal and i[0][0][1] the vertical position
        x, y = i[0][0][0], i[0][0][1]
        # i[1][0] is the text
        data_res.append(i[1][0])
    return data_res

def convert_png_to_txt(dir_path):
    # Initialize PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")

    list_text = []
    data_text = []
    # Traverse all .jpg files in the specified directory
    for filename in os.listdir(dir_path):
        if filename.endswith('.jpg'):
            file_path = os.path.join(dir_path, filename)

            # Use PaddleOCR to extract text from the picture
            result = ocr.ocr(file_path, cls=True)
            data_text = text_noposition(result)
            # Append inside the if so other files do not repeat earlier text
            list_text = list_text + data_text
            print(list_text)
    # Create the DataFrame and write the data
    df = pd.DataFrame(list_text)

    return df


# Directory of the pictures to be processed
dir_path = r"D:\data\2024\pic\d"
df = convert_png_to_txt(dir_path)

view(df)

7. Extract some PDF pages

extract_pages_to_new_pdf(input_path, output_path, [start page, end page]) extracts a continuous range of pages, from the start page to the end page, into a new PDF. To extract discontinuous pages, split the request into several continuous ranges and extract each one.

import fitz  # PyMuPDF

def extract_pages_to_new_pdf(input_pdf_path, output_pdf_path, page_numbers):
    # Open the original PDF file
    pdf_doc = fitz.open(input_pdf_path)

    # Create a new PDF document object
    new_pdf = fitz.open()
    # Copy the page range; from_page and to_page are 0-based page indices
    new_pdf.insert_pdf(pdf_doc, from_page=min(page_numbers), to_page=max(page_numbers))

    # Save the new PDF file
    new_pdf.save(output_pdf_path)

    # Close the documents
    pdf_doc.close()
    new_pdf.close()

if __name__ == '__main__':
    input_path = r"file name"
    output_path = r""
    extract_pages_to_new_pdf(input_path, output_path, [25, 42])
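Splitting a discontinuous page list into continuous ranges can be automated. This is a minimal sketch using only the standard library. The helper name group_into_ranges is hypothetical and is not part of PyMuPDF.

```python
def group_into_ranges(page_numbers):
    """Group page numbers into sorted, continuous [start, end] pairs."""
    ranges = []
    for page in sorted(set(page_numbers)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1][1] = page         # extend the current run
        else:
            ranges.append([page, page])  # start a new run
    return ranges


print(group_into_ranges([25, 26, 27, 40, 41, 50]))
```

Each pair can be passed as the page_numbers argument of extract_pages_to_new_pdf. Since every call saves its own file, use a different output path for each range.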
