Examples of converting PDF to text in C#

How to parse PDF files

The main ways to extract text from PDF files in .NET are:

1. Microsoft's IFilter interface with Adobe's IFilter implementation;

2. iTextSharp;

3. PDFBox.

Unfortunately, none of these PDF parsing approaches is perfect. Each one is discussed below.

Adobe PDF IFilter

In order to use the IFilter interface to parse PDF files, you need:

Windows 2000 or later

Adobe Acrobat or Reader 7.0.5+ (or the standalone Adobe PDF IFilter)

An IFilter COM wrapper class

Sample code:

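using IFilter;

// Assumes the IFilter COM wrapper exposes a DefaultParser class with a
// static Extract method; adjust the name to match the wrapper you use.
public static string ExtractTextFromPdf(string path) {
  return DefaultParser.Extract(path);
}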

Drawbacks:

The IFilter interface is accessed through COM interop, which is unreliable (and the combination of the IFilter COM wrapper with the Adobe PDF IFilter is particularly troublesome).

The Adobe IFilter has to be installed separately on the target system. This can be painful if you need to distribute your indexing solution to others.

iTextSharp

iTextSharp (/projects/itextsharp/) is a .NET port of iText, a Java PDF manipulation library. It focuses mainly on creating and editing PDFs rather than reading them, but it does support extracting text from PDFs (although it is a bit of overkill for that purpose alone).

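Sample code:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
    StringBuilder text = new StringBuilder();

    // Page numbers in iTextSharp are 1-based.
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }

    return text.ToString();
  }
}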

Credit: Member No. 10364982
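
The routine above relies on the extraction strategy that iTextSharp picks by default. As a rough sketch, assuming iTextSharp 5.x, a strategy can also be passed explicitly for each page. The class name PdfTextExtractorSketch and the strategyFactory parameter are illustrative and not part of the library; a factory is used because a strategy object accumulates text, so each page gets a fresh instance.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextExtractorSketch
{
    // Extracts text page by page, creating a new strategy for every page.
    public static string ExtractText(string path, Func<ITextExtractionStrategy> strategyFactory)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            StringBuilder text = new StringBuilder();

            // Page numbers in iTextSharp are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(
                    PdfTextExtractor.GetTextFromPage(reader, page, strategyFactory()));
            }

            return text.ToString();
        }
    }
}
```

For example, PdfTextExtractorSketch.ExtractText(path, () => new SimpleTextExtractionStrategy()) keeps text in the order it appears in the content stream, while passing () => new LocationTextExtractionStrategy() orders text by its position on the page. Which one reads better depends on how the PDF was produced.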

Drawbacks:

A commercial license is required if the AGPL license does not suit you.

PDFBox

PDFBox is another Java PDF library. It is also ready to use with the original Java Lucene (see Lucene PDFDocument).

Fortunately, there is a .NET version of PDFBox, built with IKVM.NET (available from the PDFBox download page).

Using PDFBox in .NET requires references to:

pdfbox-1.8.

You also need to copy the following files to the bin folder:

fontbox-1.8.

It is very easy to parse PDF using PDFBox:

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    doc = PDDocument.load(path);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    // Always release the document, even if extraction fails.
    if (doc != null) {
      doc.close();
    }
  }
}
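
When only part of a document is needed, the same PDFTextStripper can be restricted to a page range, which may also reduce parsing time for large files. The following is a minimal sketch, assuming the PDFBox 1.8 .NET build described above; the class name PdfBoxPageRangeSketch and the method ExtractPages are illustrative names, not part of PDFBox.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public static class PdfBoxPageRangeSketch
{
    // Extracts text from firstPage through lastPage (1-based, inclusive).
    public static string ExtractPages(string path, int firstPage, int lastPage)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage);
            stripper.setEndPage(lastPage);
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the document, even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

Calling PdfBoxPageRangeSketch.ExtractPages(path, 1, 1) would return only the text of the first page.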

The required assemblies add up to about 18 MB:

(4 MB)

(6 MB)

pdfbox-1.8. (4 MB)

(82 kB)

fontbox-1.8. (180 kB)

(2 MB)

(1 MB)

Speed is acceptable: parsing the Copyright Act PDF (5.1 MB) took 13 seconds.
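
To reproduce this kind of measurement on your own files, a small timing helper is enough. The sketch below uses only the .NET base class library; ExtractionTimer and Measure are hypothetical names, and the extractor argument can be any of the ExtractTextFromPdf methods shown in this article.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs the given extractor once and reports how long it took.
    public static TimeSpan Measure(Func<string, string> extractor, string path, out string text)
    {
        Stopwatch watch = Stopwatch.StartNew();
        text = extractor(path);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Running the measurement twice in the same process gives a rough idea of how much of the first call is startup cost rather than parsing time.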

Thanks to bobrien100 for the improvement suggestions.

Drawbacks:

Size of the dependencies (18 MB)

Speed (especially startup time)