Extracting keywords from mixed Chinese and English text with C# and jieba.NET

Implementation steps

Create a Windows Forms application
Add the following controls:
- TextBox: Enter text (supports multiple lines)
- Button: Trigger word segmentation
- ListBox: Display keywords and word frequency
Install the NuGet package

Install-Package jieba.NET

Complete code implementation

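using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Windows.Forms;
using JiebaNet.Segmenter;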

public partial class MainForm : Form
{
    private TextBox inputBox;
    private Button analyzeButton;
    private ListBox resultList;
    private HashSet<string> stopWords;

    public MainForm()
    {
        InitializeComponent();
        InitializeStopWords(); // Initialize the stop word list
    }

    private void InitializeStopWords()
    {
        // Chinese and English stop word list (example)
        stopWords = new HashSet<string>
        {
            // Chinese stop words
            "的", "了", "在", "是", "我", "和", "有", "就", "不", "人",
            // English stop words
            "a", "an", "the", "is", "are", "and", "in", "on", "at"
        };
    }

    private void AnalyzeButton_Click(object sender, EventArgs e)
    {
        string inputText = inputBox.Text.Trim();
        if (string.IsNullOrEmpty(inputText))
        {
            MessageBox.Show("Please enter text!");
            return;
        }

        // Segment with jieba.NET (handles mixed Chinese and English text)
        var segmenter = new JiebaSegmenter();
        var segments = segmenter.Cut(inputText);

        // Extract English words (supplemented with a regular expression)
        var allWords = new List<string>();
        foreach (var seg in segments)
        {
            // Split mixed Chinese/English tokens (such as "C#编程" -> ["C#", "编程"])
            var words = Regex.Matches(seg, @"([A-Za-z0-9#+]+)|([\u4e00-\u9fa5]+)")
                .Cast<Match>()
                .Select(m => m.Value);
            allWords.AddRange(words);
        }

        // Filter out stop words and single-character tokens
        var filteredWords = allWords
            .Where(word => !stopWords.Contains(word) && word.Length >= 2);

        // Count word frequency and sort in descending order
        var keywordCounts = filteredWords
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => $"{g.Key} ({g.Count()})")
            .ToList();

        // Show results
        resultList.DataSource = keywordCounts;
    }

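    // Initialize the form controls by hand.
    // If the project also has a designer-generated InitializeComponent, keep only one of them.
    private void InitializeComponent()
    {
        this.inputBox = new TextBox();
        this.analyzeButton = new Button();
        this.resultList = new ListBox();

        // Lay out the controls
        this.inputBox.Multiline = true;
        this.inputBox.Location = new System.Drawing.Point(20, 20);
        this.inputBox.Size = new System.Drawing.Size(400, 150);

        this.analyzeButton.Text = "Extract keywords";
        this.analyzeButton.Location = new System.Drawing.Point(20, 180);
        this.analyzeButton.Click += AnalyzeButton_Click;

        this.resultList.Location = new System.Drawing.Point(20, 220);
        this.resultList.Size = new System.Drawing.Size(400, 200);

        this.ClientSize = new System.Drawing.Size(440, 440);
        this.Controls.Add(inputBox);
        this.Controls.Add(analyzeButton);
        this.Controls.Add(resultList);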
    }
}

Function description

Mixed Chinese and English word segmentation
- Use jieba.NET to segment the Chinese text.
- Use the regular expression ([A-Za-z0-9#+]+) to extract English words and numbers (such as C# and Python3).
Stop word filtering
- A built-in Chinese and English stop word list (such as "的" and "and") filters out words that carry no meaning.
- Tokens shorter than 2 characters (such as single Chinese characters) are discarded.
Word frequency statistics
- Count how often each keyword occurs and sort the results in descending order of frequency.
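
The regex split, filtering, and counting steps above can be tried on their own, without jieba.NET or a form. The following is a minimal sketch, assuming the segments have already been produced by a segmenter. The class KeywordCounter and its methods SplitMixed and Count are hypothetical names introduced for illustration; they are not part of the form code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only (not part of the form code).
public static class KeywordCounter
{
    // Same pattern as in the form code: runs of ASCII letters, digits, '#' and '+',
    // or runs of Chinese characters in the range U+4E00..U+9FA5.
    private static readonly Regex TokenPattern =
        new Regex(@"([A-Za-z0-9#+]+)|([\u4e00-\u9fa5]+)");

    // Splits one segment into its English and Chinese runs.
    public static List<string> SplitMixed(string segment)
    {
        return TokenPattern.Matches(segment)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    // Drops stop words and tokens shorter than 2 characters, then counts
    // occurrences and orders them by descending frequency.
    public static List<KeyValuePair<string, int>> Count(
        IEnumerable<string> segments, ISet<string> stopWords)
    {
        return segments
            .SelectMany(s => SplitMixed(s))
            .Where(w => !stopWords.Contains(w) && w.Length >= 2)
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToList();
    }
}
```

Keeping this logic in a static class separate from the click handler makes it possible to unit test the tokenization rules without starting the UI.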

Extension suggestions

Load an external stop word list
Load a more comprehensive stop word list from a file (for example, a text file with one word per line):

private void LoadStopWordsFromFile(string path)
{
    // Requires: using System.IO;
    stopWords = new HashSet<string>(File.ReadAllLines(path));
}
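
Stop word files often contain blank lines, stray whitespace, or English words in mixed case. A slightly more defensive loader might look like the sketch below. It assumes a UTF-8 file with one word per line; the class name StopWordLoader is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration only.
public static class StopWordLoader
{
    // Reads one stop word per line (assumed UTF-8), trims surrounding
    // whitespace, and skips blank lines. The case-insensitive comparer
    // makes "The" match "the"; it has no effect on Chinese words.
    public static HashSet<string> Load(string path)
    {
        var words = File.ReadLines(path, Encoding.UTF8)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0);
        return new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
    }
}
```

With a case-insensitive set, the stop word check in the filter step no longer needs to lowercase each English token first.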

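Part-of-speech filtering
Use jieba.NET's part-of-speech tagging to keep only nouns, verbs, and other meaningful words:

// Requires: using JiebaNet.Segmenter.PosSeg;
var posSegmenter = new PosSegmenter();
var posTags = posSegmenter.Cut(inputText);
// Keep tokens whose tag starts with "n" (nouns)
var nouns = posTags.Where(tag => tag.Flag.StartsWith("n"));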

TF-IDF algorithm
Compute more sophisticated keyword weights with TF-IDF instead of raw frequency (this requires a TF-IDF library or implementation).
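
To illustrate the idea, here is a minimal TF-IDF sketch that depends only on the base class library. It assumes each document has already been segmented and filtered into tokens, and it uses the plain, unsmoothed variant tf = count / document length and idf = ln(N / df). The class name TfIdfSketch is hypothetical, and a real TF-IDF library may use a different weighting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class TfIdfSketch
{
    // documents: each entry holds the tokens of one document.
    // Returns each distinct token of documents[docIndex] with its
    // TF-IDF weight, highest weight first.
    public static List<KeyValuePair<string, double>> Score(
        IReadOnlyList<string[]> documents, int docIndex)
    {
        var doc = documents[docIndex];
        if (doc.Length == 0)
            return new List<KeyValuePair<string, double>>();

        // Document frequency: in how many documents each term appears.
        var docFreq = new Dictionary<string, int>();
        foreach (var d in documents)
            foreach (var term in d.Distinct())
                docFreq[term] = docFreq.TryGetValue(term, out var n) ? n + 1 : 1;

        return doc
            .GroupBy(t => t)
            .Select(g =>
            {
                double tf = (double)g.Count() / doc.Length;
                double idf = Math.Log((double)documents.Count / docFreq[g.Key]);
                return new KeyValuePair<string, double>(g.Key, tf * idf);
            })
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

A word that appears in every document gets a weight of zero under this variant, which is the point: frequent but undistinctive words drop out even if they are missing from the stop word list.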

Using jieba.NET for Chinese word segmentation

Once the installation is complete, you can use jieba.NET in your .NET project for Chinese word segmentation. Here is a simple example:

using JiebaNet.Segmenter;
using System;

class Program
{
    static void Main(string[] args)
    {
        var segmenter = new JiebaSegmenter();
        string text = "I love Beijing * Square";
        var words = segmenter.Cut(text);
        foreach (var word in words)
        {
            Console.WriteLine(word);
        }
    }
}

In the example above, we first create a JiebaSegmenter instance, then call its Cut method to segment the string "I love Beijing * Square". The result is returned as an IEnumerable<string>, which we iterate over to print each word.

Choosing a segmentation mode

jieba.NET provides three segmentation modes: precise mode, full mode, and search engine mode. Choose the one that fits your use case.

Precise mode: Tries to cut the sentence as accurately as possible; suitable for text analysis.
Full mode: Scans out every word that can be formed from the sentence. It is very fast, but it cannot resolve ambiguity.
Search engine mode: Builds on precise mode and segments long words again to improve recall; suitable for search engine indexing.

You can select the mode through the optional parameters of the Cut method, for example:

var words = segmenter.Cut(text, cutAll: true); // Use full mode for word segmentation

Add a custom dictionary

jieba.NET also supports custom dictionary entries: you can add specific vocabulary to the dictionary to ensure that each entry is recognized as a single word. For example:

segmenter.AddWord("* Square"); // Add "* Square" to the dictionary

After you add custom entries, text that contains these words is segmented with each of them kept as a whole.
