Using Tesseract-OCR in Java
Optical Character Recognition (OCR) converts text in an image into editable text.
Tesseract is one of the most popular open-source OCR engines; it supports many languages and recognizes text efficiently.
This article explains in detail how to use Tesseract-OCR for text extraction in Java: installing Tesseract-OCR, configuring the Chinese training library, adding the dependency libraries, and implementing the code. The worked example extracts text from video frames.
Tesseract-OCR installation
First, we need to install Tesseract-OCR on the system. Installation packages for Windows can be downloaded via the following link:
Download Tesseract-OCR installation package
After the download completes, run the installer and choose the installation directory. The remaining steps can be left at their defaults.
Configure Chinese training library
For Tesseract to recognize Chinese, download the Simplified Chinese training library file chi_sim.traineddata and place it in Tesseract's tessdata directory.
For example:
D:\Program Files\Tesseract-OCR\tessdata
The Chinese training library can be downloaded from the following link:
Download Chinese training library
More training libraries can be found in the official Tesseract GitHub repository.
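A missing or misplaced chi_sim.traineddata file is an easy mistake to make at this step, so it can help to check for it before running any OCR. The following is a minimal sketch using only the JDK; the class name TessdataCheck and the method hasLanguage are hypothetical helpers, not part of Tesseract or Tess4J, and the path is the example directory shown above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: checks that a language's trained data is where Tesseract expects it. */
class TessdataCheck {

    /** Returns true if lang + ".traineddata" exists as a regular file in the tessdata directory. */
    static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Example path from this article; adjust it to your own installation directory.
        Path tessdata = Paths.get("D:\\Program Files\\Tesseract-OCR\\tessdata");
        if (hasLanguage(tessdata, "chi_sim")) {
            System.out.println("chi_sim.traineddata found");
        } else {
            System.out.println("chi_sim.traineddata is missing from " + tessdata);
        }
    }
}
```

The check only confirms that the file is present; it does not validate its contents.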
Add dependencies
To use Tesseract in Java, we need the tess4j library. tess4j is a Java wrapper around the Tesseract API that makes Tesseract easy to use in Java projects. In addition, to process video frames, we also need the javacv library.
Here are the dependencies that need to be introduced in the Maven project:
```xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.3.0</version>
</dependency>
<!-- JavaCV: Java interface to OpenCV, FFmpeg, and more -->
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>javacv-platform</artifactId>
    <version>1.5.7</version>
</dependency>
```
Code implementation
Next, we will implement a Java class, VideoTextExtractor, which extracts text from a video.
The complete code is as follows:
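```java
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.bytedeco.javacv.FFmpegFrameGrabber;
import org.bytedeco.javacv.Frame;
import org.bytedeco.javacv.Java2DFrameUtils;

import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashSet;
import java.util.Set;

public class VideoTextExtractor {
    // Path to the tessdata folder of the Tesseract-OCR installation
    public static final String pathToTessdataFolder = "D:\\Program Files\\Tesseract-OCR\\tessdata\\";
    // Video to analyze
    public static final String pathToVideoFile = "C:\\Users\\lixiewen\\Documents\\oCam\\Recording_2023_05_31_09_39_51_172.mp4";
    // Analysis results (this path ends at a directory; point it at a file before running)
    public static final String resultFile = "E:\\tmp\\";

    public static void main(String[] args) throws TesseractException {
        extracted();
    }

    private static void extracted() {
        // Set the path of the Tesseract OCR data
        File tessDataFolder = new File(pathToTessdataFolder);
        System.setProperty("TESSDATA_PREFIX", tessDataFolder.getAbsolutePath());
        FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(pathToVideoFile);
        try {
            grabber.start();
            // LinkedHashSet drops duplicate results while keeping them in frame order
            Set<String> set = new LinkedHashSet<>();
            Tesseract tesseract = getTesseract(tessDataFolder);
            // Traverse the video frames
            int lengthInFrames = grabber.getLengthInFrames();
            for (int i = 0; i < lengthInFrames; i++) {
                System.out.println("Progress " + i + " / " + lengthInFrames);
                try {
                    // grabImage() returns video frames only and skips audio frames
                    Frame frame = grabber.grabImage();
                    if (frame == null) continue;
                    BufferedImage bufferedImage = Java2DFrameUtils.toBufferedImage(frame);
                    // Convert the frame to a grayscale image
                    BufferedImage grayImage = new BufferedImage(bufferedImage.getWidth(), bufferedImage.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
                    Graphics2D graphics = grayImage.createGraphics();
                    graphics.drawImage(bufferedImage, 0, 0, null);
                    graphics.dispose();
                    // Create a temporary file to hold the image
                    File tempImageFile = File.createTempFile("frame", ".png");
                    ImageIO.write(grayImage, "png", tempImageFile);
                    String result = tesseract.doOCR(tempImageFile);
                    set.add(result);
                    // Delete the temporary file
                    tempImageFile.delete();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            File file = new File(resultFile);
            // The original listing calls a helper, FileUtils.write2File, that is not shown;
            // Files.write from the JDK is used here instead.
            Files.write(file.toPath(), set, StandardCharsets.UTF_8);
            grabber.stop();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static Tesseract getTesseract(File tessDataFolder) {
        // Use Tesseract OCR for text recognition
        Tesseract tesseract = new Tesseract();
        // Use the Simplified Chinese training library
        tesseract.setLanguage("chi_sim");
        tesseract.setDatapath(tessDataFolder.getAbsolutePath());
        return tesseract;
    }
}
```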
Using Tesseract without installing it
If you do not want to install Tesseract-OCR, you can add the training library directly to the project. This approach suits developers who prefer to manage everything as project dependencies.
- Add the Maven dependencies
- Load the training library in code
```java
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

public class OCRUtil {
    public static ITesseract getTesseract() throws Exception {
        // Use Tesseract to recognize text
        ITesseract tesseract = new Tesseract();
        // Set the path of the training data folder
        tesseract.setDatapath("src/main/resources/traineddata");
        // Set the language to Simplified Chinese
        tesseract.setLanguage("chi_sim");
        return tesseract;
    }
}
```
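The relative path src/main/resources/traineddata resolves only when the program is started from the project root. Once the application is packaged as a JAR, the file lives inside the archive, while setDatapath expects a directory on disk. One possible workaround, sketched below with JDK classes only, is to copy the file from the classpath into a temporary directory first. The class TrainedDataExtractor and its methods are hypothetical names introduced for illustration, and the resource location /traineddata/ is an assumption based on the folder used above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: makes bundled trained data available as a directory on disk. */
class TrainedDataExtractor {

    /** Copies a trained data stream into a fresh temporary directory and returns that directory. */
    static Path extract(InputStream trainedData, String fileName) throws IOException {
        Path dir = Files.createTempDirectory("tessdata");
        Files.copy(trainedData, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        return dir;
    }

    /** Looks the file up on the classpath under /traineddata/ (assumed location) and extracts it. */
    static Path extractFromClasspath(String fileName) throws IOException {
        String resource = "/traineddata/" + fileName;
        try (InputStream in = TrainedDataExtractor.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + resource);
            }
            return extract(in, fileName);
        }
    }
}
```

With this in place, the data path could be set with tesseract.setDatapath(TrainedDataExtractor.extractFromClasspath("chi_sim.traineddata").toString()). The temporary directory is not cleaned up in this sketch.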
Optimization and improvement
In practice, both the video frame processing and the OCR step can be tuned to improve speed and accuracy. Here are some suggestions:
- Image preprocessing: Before OCR, preprocess the image with denoising, binarization, rotation correction, and similar steps to improve the recognition rate.
- Multithreaded processing: For long videos, use multiple threads to speed up frame processing.
- Custom training data: If the default training data does not perform well, use Tesseract's training tools to build custom training data and improve accuracy in specific scenarios.
- Post-processing of results: OCR output may contain noise characters; clean and correct it with methods such as regular expressions.
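The post-processing suggestion can be sketched with plain regular expressions. The example below is one possible cleanup, not a fixed recipe: it removes spaces that appear between Chinese characters, collapses runs of spaces, and drops lines that contain no letter or digit at all. The class name OcrTextCleaner and these particular rules are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: regex-based cleanup of raw OCR output. */
class OcrTextCleaner {
    // Spaces or tabs that sit between two Han (Chinese) characters
    private static final Pattern SPACE_BETWEEN_HAN =
            Pattern.compile("(?<=\\p{IsHan})[ \\t]+(?=\\p{IsHan})");
    // Runs of spaces or tabs
    private static final Pattern SPACE_RUN = Pattern.compile("[ \\t]+");
    // A line is kept only if it has at least one letter or digit (Han characters count as letters)
    private static final Pattern HAS_CONTENT = Pattern.compile("[\\p{L}\\p{N}]");

    /** Splits the OCR text into lines, cleans each line, and drops lines that are pure noise. */
    static List<String> clean(String ocrText) {
        List<String> lines = new ArrayList<>();
        for (String line : ocrText.split("\\R")) {
            String s = SPACE_BETWEEN_HAN.matcher(line).replaceAll("");
            s = SPACE_RUN.matcher(s).replaceAll(" ").trim();
            if (HAS_CONTENT.matcher(s).find()) {
                lines.add(s);
            }
        }
        return lines;
    }
}
```

With these rules, a line such as "你 好 世 界" becomes "你好世界", while a line made up only of punctuation is discarded. Real projects usually need rules tailored to their own data.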
Here is an optimized image preprocessing example:
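```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Method name and signature are assumed; the original listing starts inside the method body.
private static BufferedImage preprocess(BufferedImage image) {
    // Convert to a grayscale image
    BufferedImage grayImage = new BufferedImage(image.getWidth(), image.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
    Graphics2D graphics = grayImage.createGraphics();
    graphics.drawImage(image, 0, 0, null);
    graphics.dispose();
    // Binarize: pixels brighter than 128 become white, all others black
    for (int y = 0; y < grayImage.getHeight(); y++) {
        for (int x = 0; x < grayImage.getWidth(); x++) {
            int rgb = grayImage.getRGB(x, y);
            int gray = (rgb & 0xff);
            gray = gray > 128 ? 255 : 0;
            grayImage.setRGB(x, y, (gray << 16) | (gray << 8) | gray);
        }
    }
    return grayImage;
}
```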
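A fixed threshold of 128 can fail on frames that are much darker or brighter than average. As an alternative that is not part of the original example, Otsu's method derives the threshold from the image histogram by maximizing the variance between the dark and bright pixel classes. The sketch below assumes a TYPE_BYTE_GRAY image as input; the class name OtsuBinarizer and its methods are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

/** Hypothetical helper: Otsu thresholding for TYPE_BYTE_GRAY images. */
class OtsuBinarizer {

    /** Returns the Otsu threshold (0-255); pixels at or below it form the dark class. */
    static int otsuThreshold(BufferedImage gray) {
        WritableRaster raster = gray.getRaster();
        int width = gray.getWidth(), height = gray.getHeight();
        int[] histogram = new int[256];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                histogram[raster.getSample(x, y, 0)]++;
            }
        }
        long total = (long) width * height;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * histogram[i];
        }
        long sumDark = 0, countDark = 0;
        double bestVariance = -1;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            countDark += histogram[t];
            sumDark += (long) t * histogram[t];
            if (countDark == 0) continue;      // no dark pixels yet
            long countBright = total - countDark;
            if (countBright == 0) break;       // no bright pixels left
            double meanDark = (double) sumDark / countDark;
            double meanBright = (double) (sumAll - sumDark) / countBright;
            double diff = meanDark - meanBright;
            // Between-class variance, up to a constant factor
            double variance = (double) countDark * countBright * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    /** Sets pixels above the threshold to white and all others to black, in place. */
    static void binarize(BufferedImage gray, int threshold) {
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                raster.setSample(x, y, 0, raster.getSample(x, y, 0) > threshold ? 255 : 0);
            }
        }
    }
}
```

A typical call would be OtsuBinarizer.binarize(grayImage, OtsuBinarizer.otsuThreshold(grayImage)) after the grayscale conversion. Whether this beats a fixed threshold depends on the video, so it is worth comparing both on sample frames.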
Summary
This article explained in detail how to use Tesseract-OCR for text extraction in Java. It covered installing Tesseract-OCR, configuring the Chinese training library, adding the dependency libraries, and the code implementation, and it offered some optimization suggestions.
This material should help you apply Tesseract-OCR to text recognition in real projects.
The above is based on personal experience; I hope it serves as a useful reference.