How to extract Chinese characters from a file

1. Source of the problem

In practice, files containing Chinese text often need processing such as word segmentation, text analysis, or text mining, and each of these starts by extracting the Chinese characters from the file. The same applies to data sources such as crawled Chinese web pages and collected Chinese articles, where the Chinese characters are extracted for routine processing such as keyword analysis or topic extraction. In general, natural language processing, text processing, and data analysis and mining all begin by pulling the Chinese characters out of a file.

Those are relatively advanced uses. A more everyday case is internationalization: when a project has to support multiple languages, we usually check whether it still contains Chinese text. For that we can build a small tool that finds the places containing Chinese characters and prints their line numbers, so that they can be reviewed or replaced.

2. Solution process

We use a familiar tool, Node.js, and read the file containing Chinese characters with the readFile method of the fs module.

For example, create a file with the following content:

console.log('测试文件，我是中文');
function onChange() {
  console.log('change');
  console.log('change 方法');
}
onChange();

Create a script file and copy the code from the steps below into it to verify the result.

To get Chinese characters in a file, you can use the following steps:

Read file: Use the readFile method of the fs module in Node.js to read a file containing Chinese characters. For example:

const fs = require('fs');
// Read the file content
fs.readFile('', 'utf8', (error, data) => {
  if (error) {
    console.error(error);
    return;
  }
  console.log(data);
});

In the above code, the first argument ('') is the path of the file containing Chinese characters, and the 'utf8' argument specifies that the file is decoded as UTF-8.

After running the script with node, the entire content of the file is printed.

Extract Chinese characters: You can use regular expressions to extract Chinese characters. For example:

const chineseRegex = /[\u4e00-\u9fa5]/g;
const chineseChars = data.match(chineseRegex);
console.log(chineseChars);

In the above code, chineseRegex specifies the Unicode range of common Chinese characters (U+4E00 to U+9FA5). The match() method extracts the Chinese characters from the file content that was read and stores them in the chineseChars variable as an array, or null if the content contains none.
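
As a minimal sketch of this step on its own, the snippet below wraps the match in a small helper that can be tried without a file. The helper name extractChinese and the sample strings are hypothetical, introduced only for illustration. Because match() with the g flag returns null when nothing matches, the helper falls back to an empty array.

```javascript
// Same range as in the article: U+4E00 to U+9FA5.
const chineseRegex = /[\u4e00-\u9fa5]/g;

// Hypothetical helper: returns every Chinese character in `text`, in order.
function extractChinese(text) {
  // String.prototype.match returns null (not []) when there is no match.
  return text.match(chineseRegex) || [];
}

console.log(extractChinese("console.log('change 方法');")); // [ '方', '法' ]
console.log(extractChinese('no Chinese here'));             // []
```

Returning an empty array instead of null lets callers use .length or .join('') without an extra check.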

Combining the two steps above prints every Chinese character in the file. Here is the code, followed by its current output:

const fs = require('fs');
const chineseRegex = /[\u4e00-\u9fa5]/g;
fs.readFile('./', 'utf8', (error, data) => {
  if (error) {
    console.error(error);
    return;
  }
  // console.log(data);
  const chineseChars = data.match(chineseRegex);
  console.log(chineseChars);
});

[
  '测', '试', '文',
  '件', '我', '是',
  '中', '文', '方',
  '法'
]

This result is still not what we want, because it does not show which line each character comes from.

Group the Chinese characters by line

Looking at the data, the lines of the file can be separated by splitting the content on the newline character. Step two did not do this, which is why all the Chinese characters ended up in a single array.

  // Split the file content by line
  const lines = data.split('\n');

Merge the line splitting into the existing code and run it again to see the line numbers printed.

const fs = require('fs');
const chineseRegex = /[\u4e00-\u9fa5]/g;
fs.readFile('./', 'utf8', (error, data) => {
  if (error) {
    console.error(error);
    return;
  }
  // Split the file content by line
  const lines = data.split('\n');
  // Traverse each line and find all Chinese characters
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const chineseCharacters = line.match(chineseRegex);
    if (chineseCharacters) {
      // If this line contains Chinese characters, print them out
      console.log(`Line ${i + 1}: ${chineseCharacters.join('')}`);
    }
  }
});
// Line 1: 测试文件我是中文
// Line 5: 方法
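
The per-line logic can also be pulled into a function that works on a string, which makes it easy to try without a file. This is only a sketch: findChineseByLine and the sample text are hypothetical, and splitting on /\r?\n/ rather than '\n' is an assumption added here so that files with Windows line endings are handled as well.

```javascript
const chineseRegex = /[\u4e00-\u9fa5]/g;

// Hypothetical helper: returns [{ line, text }] for each line containing Chinese characters.
function findChineseByLine(content) {
  const results = [];
  // /\r?\n/ accepts both Unix (\n) and Windows (\r\n) line endings.
  content.split(/\r?\n/).forEach((line, index) => {
    const chineseCharacters = line.match(chineseRegex);
    if (chineseCharacters) {
      // Line numbers are 1-based, matching what an editor shows.
      results.push({ line: index + 1, text: chineseCharacters.join('') });
    }
  });
  return results;
}

// Illustrative input, not the article's sample file.
const sample = "const a = '你好';\nconst b = 1;\nconst c = '世界';";
for (const { line, text } of findChineseByLine(sample)) {
  console.log(`Line ${line}: ${text}`);
}
// Line 1: 你好
// Line 3: 世界
```

Returning data instead of printing directly keeps the function reusable, for example when the results need to be written to a report.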

3. Summary and further thoughts

Use the fs (file system) module in Node.js to read the file
Use the regular expression /[\u4e00-\u9fa5]/g to match the Chinese characters
Split the content on newlines to separate the individual lines; alternatively, read the file line by line instead of loading it all at once
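
One possible way to read a file line by line, offered as an assumption rather than the article's own code, is Node's built-in readline module combined with fs.createReadStream. In the sketch below, findChineseInFile and the temporary demo file are hypothetical names used only for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const readline = require('readline');

const chineseRegex = /[\u4e00-\u9fa5]/g;

// Hypothetical helper: streams the file and resolves to [{ line, text }].
async function findChineseInFile(filePath) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  const results = [];
  let lineNumber = 0;
  for await (const line of rl) {
    lineNumber += 1;
    const chineseCharacters = line.match(chineseRegex);
    if (chineseCharacters) {
      results.push({ line: lineNumber, text: chineseCharacters.join('') });
    }
  }
  return results;
}

// Demo on a temporary file with illustrative content.
const demoPath = path.join(os.tmpdir(), 'chinese-readline-demo.js');
fs.writeFileSync(demoPath, "console.log('你好');\nconsole.log('ok');\n", 'utf8');
const demoResult = findChineseInFile(demoPath);
demoResult.then((rows) => {
  rows.forEach(({ line, text }) => console.log(`Line ${line}: ${text}`));
});
```

Streaming avoids holding the whole file in memory, which mainly matters for large files; for small source files the split approach is simpler.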

Here we only read the Chinese characters of a single file, but a real project contains many files. To cover a whole project, we need to traverse it and report each file name together with the matching line numbers. This can be done with readdir from the fs module combined with recursion. Interested readers can try implementing it.
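
A rough sketch of that idea follows, using fs.readdirSync with the withFileTypes option and a recursive call for subdirectories. The function name scanDirectory, the decision to skip node_modules, and the throwaway demo directory are all assumptions made for illustration, not part of the original tool.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const chineseRegex = /[\u4e00-\u9fa5]/g;

// Hypothetical helper: recursively collects { file, line, text } under `dir`.
function scanDirectory(dir, results = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Assumption: installed dependencies are not worth scanning.
      if (entry.name !== 'node_modules') scanDirectory(fullPath, results);
    } else if (entry.isFile()) {
      const lines = fs.readFileSync(fullPath, 'utf8').split(/\r?\n/);
      lines.forEach((line, index) => {
        const chineseCharacters = line.match(chineseRegex);
        if (chineseCharacters) {
          results.push({ file: fullPath, line: index + 1, text: chineseCharacters.join('') });
        }
      });
    }
  }
  return results;
}

// Demo on a throwaway directory tree with illustrative files.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'chinese-scan-'));
fs.mkdirSync(path.join(demoRoot, 'sub'));
fs.writeFileSync(path.join(demoRoot, 'a.js'), "let x = 1;\nlet y = '标题';\n");
fs.writeFileSync(path.join(demoRoot, 'sub', 'b.js'), "// 注释\n");
const demoHits = scanDirectory(demoRoot);
demoHits.forEach(({ file, line, text }) => console.log(`${file}:${line}: ${text}`));
```

In a real project it would probably make sense to filter by file extension as well, since this sketch reads every file as UTF-8 text, including binary files.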
