PDF (Portable Document Format) is a file format created by Adobe that lets documents display consistently across different software, hardware, and operating systems. Each PDF file contains a complete description of a fixed-layout document, including its text, fonts, graphics, and other elements needed for display. PDFs are widely used for document sharing because they preserve the original formatting. Parsing and interpreting PDF content programmatically, however, can be challenging: the format has a complex internal structure, varied text encodings, intricate layouts, compressed content, and embedded fonts.
We recently evaluated several popular Python PDF libraries, such as PyPDF/PyPDF2, PyMuPDF, and pdfplumber. Some are best suited to extracting text, some to extracting images, some are particularly fast, and so on. In this article, we will focus on getting started with pdfminer.six. Check the official documentation for the latest information.
Environment setup
Install the dependency package (the [image] extra pulls in Pillow, which is needed for image extraction):
pip install 'pdfminer.six[image]'
A sample PDF file can be found here; of course, you can also prepare your own. Let's look at how to use these APIs:
Extract text from PDF
Extract images from PDF
Iterate over all objects in PDF
Extract the table of contents (ToC) from a PDF
Extract text
The high-level API of pdfminer.six can be used to extract text from a PDF:
from os import path

from pdfminer.high_level import extract_text

base_dir = path.dirname(path.abspath(__file__))
print(base_dir)
pdf_file = path.join(base_dir, 'sample.pdf')  # placeholder name; point this at your PDF
text = extract_text(pdf_file)
print(text)
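Note that extract_text returns the whole document as a single string; in current versions of pdfminer.six, the text of each page ends with a form-feed character ('\x0c'). A rough per-page split can therefore be sketched as follows (the sample string below is made up for illustration, standing in for real extract_text output):

```python
# Made-up string imitating extract_text output: pages end with '\x0c'.
sample = "First page text.\x0cSecond page text.\x0c"

# Split on the form feed and drop empty trailing chunks.
pages = [p for p in sample.split("\x0c") if p.strip()]
print(pages)
```
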
Extract each page
from io import StringIO
from os import path

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.utils import open_filename

base_dir = path.dirname(path.abspath(__file__))
print(base_dir)


def iter_text_per_page(pdf_file, password='', page_numbers=None, maxpages=0,
                       caching=True, codec='utf-8', laparams=None):
    if laparams is None:
        laparams = LAParams()
    with open_filename(pdf_file, "rb") as fp:
        rsrcmgr = PDFResourceManager(caching=caching)
        idx = 1
        for page in PDFPage.get_pages(
            fp,
            page_numbers,
            maxpages=maxpages,
            password=password,
            caching=caching,
        ):
            with StringIO() as output_string:
                device = TextConverter(rsrcmgr, output_string, codec=codec,
                                       laparams=laparams)
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                interpreter.process_page(page)
                yield idx, output_string.getvalue()
            idx += 1


def main():
    pdf_file = path.join(base_dir, 'sample.pdf')  # placeholder name; use your own PDF
    for count, page_text in iter_text_per_page(pdf_file):
        print(f'page# {count}:\n{page_text}')
        print()


if __name__ == "__main__":
    main()
The output is excerpted below:
page# 1:
The main functions of the product include data collection, data governance and data product application. Typical application scenarios of enterprises use AI algorithms to realize business classification, clustering, regression prediction, time series prediction, etc. In the sales field, sales forecast is realized based on historical data, and customer classification is realized based on user characteristic data; in the procurement field, historical data is used to predict procurement prices, and comprehensive supplier evaluation model is realized based on multi-dimensional indicators.
page# 2:
Various policies and regulations are sorted and summarized to help users obtain the required policy information more conveniently and quickly. ...
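With per-page text in hand (for example from the iter_text_per_page generator above), locating the pages that mention a term is a simple filter. The (page_number, text) pairs below are made up to stand in for real extraction output:

```python
# Made-up (page_number, text) pairs standing in for real per-page extraction.
pages = [
    (1, 'data collection, data governance and data product application'),
    (2, 'policies and regulations are sorted and summarized'),
]

keyword = 'data'
# Keep the numbers of pages whose text contains the keyword.
hits = [num for num, text in pages if keyword in text.lower()]
print(hits)
```
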
Extract images
The easiest way to extract images is to use the pdf2txt.py command-line tool. It is installed along with pdfminer.six and lives in the same directory as the Python executable; `which python` (or `where python` on Windows) will show where that is.
Here is an example usage:
usage: pdf2txt.py [-h] [--version] [--debug] [--disable-caching]
                  [--page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]]
                  [--pagenos PAGENOS] [--maxpages MAXPAGES]
                  [--password PASSWORD] [--rotation ROTATION] [--no-laparams]
                  [--detect-vertical] [--line-overlap LINE_OVERLAP]
                  [--char-margin CHAR_MARGIN] [--word-margin WORD_MARGIN]
                  [--line-margin LINE_MARGIN] [--boxes-flow BOXES_FLOW]
                  [--all-texts] [--outfile OUTFILE] [--output_type OUTPUT_TYPE]
                  [--codec CODEC] [--output-dir OUTPUT_DIR]
                  [--layoutmode LAYOUTMODE] [--scale SCALE] [--strip-control]
                  files [files ...]

To extract all text from a PDF:
pdf2txt.py --all-texts ../samples/

To extract all images from a PDF:
pdf2txt.py --output-dir images ../
If you want to integrate this functionality into your own application, you can copy the relevant source code from pdf2txt.py.
Get the page count
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import resolve1

pdf_file = '../samples/'  # truncated in the original; point this at your PDF
with open(pdf_file, 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    parser.set_document(doc)
    pages = resolve1(doc.catalog['Pages'])
    pages_count = pages.get('Count', 0)
    print(pages_count)
Extract table data
pdfminer's output for tables looks much better than PyPDF2's, and we can often extract the required data with a regex or split(). But real-world PDF documents contain a lot of noise: identifiers can appear in different formats, and so on, and no single algorithm can anticipate everything. To simplify and speed up the work, I recommend converting the PDF file to HTML format:
from io import StringIO

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

output = StringIO()
with open('sample.pdf', 'rb') as pdf_file:  # placeholder names; use your own files
    extract_text_to_fp(pdf_file, output, laparams=LAParams(),
                       output_type='html', codec=None)
with open('sample.html', 'a') as html_file:
    html_file.write(output.getvalue())
Then use an HTML-parsing library to extract the text; this approach should give reliable accuracy.
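As a sketch of that post-processing step, Python's built-in html.parser module can pull the text back out of the generated HTML. The fragment below is made up to imitate pdfminer's HTML output rather than taken from a real conversion:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of every element, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-whitespace text nodes.
        if data.strip():
            self.chunks.append(data.strip())


# Made-up fragment imitating pdfminer's HTML output.
html = '<div><span style="font-size:12px">Supplier</span> <span>Price: 42</span></div>'
parser = TextExtractor()
parser.feed(html)
print(parser.chunks)
```

From a list of text chunks like this, a regex or split() can then pick out the table cells of interest.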
That concludes this walkthrough of parsing PDF data in Python with pdfminer.six. For more on parsing PDF content with Python, see the related articles.