A practical tutorial on using Tesseract-OCR in Java

Using Tesseract-OCR in Java

Optical Character Recognition (OCR) converts text in an image into editable text.

Tesseract is one of the most popular open source OCR engines at present, supporting multiple languages and efficient text recognition.

This article explains in detail how to extract text with Tesseract-OCR in Java: installing Tesseract-OCR, configuring the Chinese trained data, adding the dependencies, and writing the code. The worked example extracts text from video frames.

Tesseract-OCR installation

First, we need to install Tesseract-OCR on the system. Installation packages for Windows can be downloaded via the following link:

Download Tesseract-OCR installation package

After the download completes, run the installer and choose the installation directory. The remaining steps can be left at their defaults.

Configure the Chinese trained data

To recognize Chinese, Tesseract needs the Simplified Chinese trained data file `chi_sim.traineddata`. Download it and place it in the `tessdata` directory of the Tesseract installation.

For example:

D:\Program Files\Tesseract-OCR\tessdata
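
A missing or misplaced `chi_sim.traineddata` file is a common reason for recognition to fail at startup, so it can help to check for the file before running OCR. The following is a minimal sketch using only the Java standard library. The class name `TrainedDataCheck` and the method `hasTrainedData` are hypothetical names chosen for illustration, and the path in `main` is the example installation path from this article.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TrainedDataCheck {

    /** Returns true if <lang>.traineddata exists as a regular file in the given tessdata directory. */
    public static boolean hasTrainedData(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Assumption: the example installation path used in this article
        Path tessdata = Paths.get("D:\\Program Files\\Tesseract-OCR\\tessdata");
        System.out.println("chi_sim available: " + hasTrainedData(tessdata, "chi_sim"));
    }
}
```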

The Chinese trained data can be downloaded from the following link:

Download the Chinese trained data

More trained data files are available in the official Tesseract GitHub repository.

Add the dependencies

To use Tesseract from Java, we need the `tess4j` library. `tess4j` is a Java wrapper around the Tesseract API that makes Tesseract easy to use in Java projects. To process video frames, we also need the `javacv` library.

Add the following dependencies to the Maven project:

```xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.3.0</version>
</dependency>
<!-- JavaCV: Java interface to OpenCV, FFmpeg, and more -->
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>javacv-platform</artifactId>
    <version>1.5.7</version>
</dependency>
```

Code implementation

Next, we implement a Java class, `VideoTextExtractor`, that extracts text from a video.

The complete code is as follows:

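```java
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.bytedeco.javacv.FFmpegFrameGrabber;
import org.bytedeco.javacv.Frame;
import org.bytedeco.javacv.Java2DFrameUtils;

import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashSet;
import java.util.Set;

public class VideoTextExtractor {
    // tessdata folder of the Tesseract-OCR installation
    public static final String pathToTessdataFolder = "D:\\Program Files\\Tesseract-OCR\\tessdata\\";
    // Video to analyze
    public static final String pathToVideoFile = "C:\\Users\\lixiewen\\Documents\\oCam\\Recording_2023_05_31_09_39_51_172.mp4";
    // Analysis results (the path as given ends at a directory; append a file name before running)
    public static final String resultFile = "E:\\tmp\\";

    public static void main(String[] args) throws TesseractException {
        extracted();
    }

    private static void extracted() {
        // Tell Tesseract where its trained data lives
        File tessDataFolder = new File(pathToTessdataFolder);
        System.setProperty("TESSDATA_PREFIX", tessDataFolder.getAbsolutePath());
        FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(pathToVideoFile);
        try {
            grabber.start();
            // LinkedHashSet drops duplicate results while keeping first-seen order
            Set<String> set = new LinkedHashSet<>();
            // One engine is enough for the whole run
            Tesseract tesseract = getTesseract(tessDataFolder);
            // Traverse the video frames
            int lengthInFrames = grabber.getLengthInFrames();
            for (int i = 0; i < lengthInFrames; i++) {
                System.out.println("Progress " + i + " / " + lengthInFrames);
                try {
                    // grabImage() skips audio frames and returns only video frames
                    Frame frame = grabber.grabImage();
                    if (frame == null) continue;
                    BufferedImage bufferedImage = Java2DFrameUtils.toBufferedImage(frame);

                    // Convert the frame to a grayscale image
                    BufferedImage grayImage = new BufferedImage(bufferedImage.getWidth(), bufferedImage.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
                    Graphics2D graphics = grayImage.createGraphics();
                    graphics.drawImage(bufferedImage, 0, 0, null);
                    graphics.dispose();

                    // Save the image to a temporary file
                    File tempImageFile = File.createTempFile("frame", ".png");
                    ImageIO.write(grayImage, "png", tempImageFile);

                    String result = tesseract.doOCR(tempImageFile);
                    set.add(result);
                    // Delete the temporary file
                    tempImageFile.delete();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            File file = new File(resultFile);

            // Write one result per line; java.nio stands in for the project-local
            // FileUtils.write2File helper of the original listing, which is not shown
            Files.write(file.toPath(), set, StandardCharsets.UTF_8);

            grabber.stop();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static Tesseract getTesseract(File tessDataFolder) {
        // Create the Tesseract OCR engine
        Tesseract tesseract = new Tesseract();
        // Use the Simplified Chinese trained data
        tesseract.setLanguage("chi_sim");
        tesseract.setDatapath(tessDataFolder.getAbsolutePath());
        return tesseract;
    }
}
```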
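
The `LinkedHashSet` in the listing above only removes results that are character-for-character identical. OCR output for visually identical frames often differs in whitespace or stray control characters, so fewer duplicates are caught than one might expect. The following is a minimal sketch of normalizing the text before adding it to the set. The class `OcrTextCleaner` and its methods are hypothetical helpers, not part of tess4j, and collapsing all whitespace to single spaces is an assumption that discards line breaks.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class OcrTextCleaner {

    /** Removes control characters, collapses whitespace runs to one space, and trims. */
    public static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        // Drop ASCII control characters except tab, carriage return, and line feed
        String noControl = raw.replaceAll("[\\p{Cntrl}&&[^\\r\\n\\t]]", "");
        // Collapse every run of whitespace (including line breaks) to a single space
        return noControl.replaceAll("\\s+", " ").trim();
    }

    /** Adds the normalized text to the set; returns false for blank or already seen text. */
    public static boolean addIfNew(Set<String> seen, String raw) {
        String text = normalize(raw);
        return !text.isEmpty() && seen.add(text);
    }

    public static void main(String[] args) {
        Set<String> seen = new LinkedHashSet<>();
        addIfNew(seen, "Hello  world\n");
        addIfNew(seen, " Hello world ");
        System.out.println(seen.size());
    }
}
```

In the frame loop, `set.add(result)` could then be replaced by `OcrTextCleaner.addIfNew(set, result)`.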

Using Tesseract without installing it

If you do not want to install Tesseract-OCR, you can bundle the trained data with the project instead. This approach suits developers who prefer to manage everything through the project itself.

Add the Maven dependencies
Load the trained data in code

```java
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

public class OCRUtil {
    public static ITesseract getTesseract() throws Exception {
        // Create the Tesseract engine used to recognize text
        ITesseract tesseract = new Tesseract();
        // Set the path of the trained data folder
        tesseract.setDatapath("src/main/resources/traineddata");
        // Set the language to Simplified Chinese
        tesseract.setLanguage("chi_sim");
        return tesseract;
    }
}
```
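
The relative path `src/main/resources/traineddata` is resolved against the working directory, so it only works when the program is started from the project root, not from a packaged JAR. One possible workaround, sketched here with the standard library only, is to copy the bundled file from the classpath into a temporary directory and use that directory as the data path. The class `TrainedDataExtractor` is a hypothetical helper, and the resource path `/traineddata/chi_sim.traineddata` in the usage note assumes the usual Maven mapping of `src/main/resources` to the classpath root.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TrainedDataExtractor {

    /** Copies trained data from a stream into a new temporary directory and returns that directory. */
    public static Path extract(InputStream data, String fileName) throws IOException {
        Path dir = Files.createTempDirectory("tessdata");
        Files.copy(data, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        return dir;
    }

    /** Loads a classpath resource and extracts it; fails with IOException if the resource is missing. */
    public static Path extractFromClasspath(String resourcePath, String fileName) throws IOException {
        try (InputStream in = TrainedDataExtractor.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourcePath);
            }
            return extract(in, fileName);
        }
    }
}
```

With this helper, the directory returned by `extractFromClasspath("/traineddata/chi_sim.traineddata", "chi_sim.traineddata")` could be passed to `setDatapath` as `dir.toString()`.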

Optimization and improvement

In practice, both frame handling and recognition can be tuned to improve speed and accuracy. Some suggestions:

Image preprocessing: Denoise, binarize, and deskew the image before OCR to improve the recognition rate.
Multithreading: For long videos, process frames on multiple threads to increase throughput.
Custom training data: If the default trained data performs poorly, train custom data with Tesseract's training tools to improve accuracy in a specific scenario.
Post-processing: OCR output may contain stray characters; clean and correct it with techniques such as regular expressions.
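
The multithreading suggestion can be sketched with a fixed thread pool. The sketch below assumes, as is commonly advised, that a single OCR engine instance should not be shared between threads, so each worker thread creates its own engine through a `ThreadLocal`. `ParallelOcr` and `recognizeAll` are hypothetical names, and the engine type is left generic so the example runs without tess4j on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;
import java.util.function.Supplier;

public class ParallelOcr {

    /**
     * Runs one recognition task per input on a fixed thread pool and returns
     * the results in input order. engineFactory is invoked once per worker
     * thread, so engines are never shared between threads.
     */
    public static <I, E> List<String> recognizeAll(List<I> inputs,
                                                   Supplier<E> engineFactory,
                                                   BiFunction<E, I, String> recognize,
                                                   int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        ThreadLocal<E> engine = ThreadLocal.withInitial(engineFactory);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<String> task = () -> recognize.apply(engine.get(), input);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so results line up with the inputs
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With tess4j, the factory would create a configured `Tesseract` instance and the recognizer would call `doOCR` on it. Because `doOCR` throws the checked `TesseractException`, the lambda would need to wrap it in an unchecked exception.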

Here is an image preprocessing example that converts the image to grayscale and then binarizes it:

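```java
// Method name and signature are assumed; the original listing starts inside the method body.
// Requires java.awt.Graphics2D and java.awt.image.BufferedImage.
public static BufferedImage preprocessImage(BufferedImage image) {
    // Convert to a grayscale image
    BufferedImage grayImage = new BufferedImage(image.getWidth(), image.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
    Graphics2D graphics = grayImage.createGraphics();
    graphics.drawImage(image, 0, 0, null);
    graphics.dispose();

    // Binarize: pixels brighter than 128 become white, all others black
    for (int y = 0; y < grayImage.getHeight(); y++) {
        for (int x = 0; x < grayImage.getWidth(); x++) {
            int rgb = grayImage.getRGB(x, y);
            int gray = (rgb & 0xff);
            gray = gray > 128 ? 255 : 0;
            grayImage.setRGB(x, y, (gray << 16) | (gray << 8) | gray);
        }
    }
    return grayImage;
}
```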
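
The example above uses a fixed threshold of 128, which can fail on frames that are darker or brighter overall. One common alternative, swapped in here as an illustration and not part of the original code, is Otsu's method: it chooses the threshold that maximizes the variance between the dark and light pixel classes. The class `OtsuThreshold` is a hypothetical helper that assumes its input is already a `TYPE_BYTE_GRAY` image.

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;
import java.awt.image.WritableRaster;

public class OtsuThreshold {

    /** Returns the Otsu threshold t for a 256-bin histogram; pixels with gray <= t form the dark class. */
    public static int threshold(int[] histogram) {
        long total = 0;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            total += histogram[i];
            sumAll += (long) i * histogram[i];
        }
        long weightDark = 0;
        long sumDark = 0;
        double bestVariance = -1;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += histogram[t];
            sumDark += (long) t * histogram[t];
            if (weightDark == 0) continue;          // no dark pixels yet
            long weightLight = total - weightDark;
            if (weightLight == 0) break;            // no light pixels left
            double meanDark = (double) sumDark / weightDark;
            double meanLight = (double) (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            // Between-class variance, up to a constant factor
            double variance = (double) weightDark * weightLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best;
    }

    /** Binarizes a TYPE_BYTE_GRAY image with the Otsu threshold of its own histogram. */
    public static BufferedImage binarize(BufferedImage gray) {
        int width = gray.getWidth();
        int height = gray.getHeight();
        Raster src = gray.getRaster();
        int[] histogram = new int[256];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                histogram[src.getSample(x, y, 0)]++;
            }
        }
        int t = threshold(histogram);
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster dst = out.getRaster();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                dst.setSample(x, y, 0, src.getSample(x, y, 0) > t ? 255 : 0);
            }
        }
        return out;
    }
}
```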

Summary

This article explained in detail how to extract text with Tesseract-OCR in Java. It covered installing Tesseract-OCR, configuring the Chinese trained data, adding the dependencies, and the code itself, and offered some optimization suggestions.

These techniques should help you apply Tesseract-OCR for text recognition in real projects.

The above is based on my personal experience. I hope it serves as a useful reference, and thank you for your support.