Introduction
In Chinese Natural Language Processing (NLP), word segmentation is a basic and critical step. Because Chinese text has no spaces between words, segmentation is needed before text can be properly understood and processed. `jieba` is a popular Chinese word segmentation library that is both powerful and easy to use.
Install jieba
First, make sure the `jieba` module is installed. You can install it with the following command:
pip install jieba
Word segmentation modes
The `jieba` module supports three word segmentation modes:
- Precise mode: splits the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every word that could possibly be formed in the sentence; very fast, but cannot resolve ambiguity.
- Search engine mode: based on precise mode, long words are split again to improve recall; suitable for search engine indexing.
Basic word segmentation
```python
import jieba

text = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Full mode
full_mode = jieba.cut(text, cut_all=True)
print("Full Mode: " + "/ ".join(full_mode))

# Precise mode
exact_mode = jieba.cut(text, cut_all=False)
print("Precise Mode: " + "/ ".join(exact_mode))

# Default mode (precise mode)
default_mode = jieba.cut("他来到了网易杭研大厦")  # "He came to the NetEase Hangzhou Research Building"
print("Default Mode: " + "/ ".join(default_mode))
```
Search engine mode
Use the `cut_for_search` method, which is suitable for building inverted indexes for search engines.
```python
import jieba

# "Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences,
# and later studied at Kyoto University in Japan"
search_mode = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
print(", ".join(search_mode))
```
Custom Dictionary
Add a custom dictionary
`jieba` allows users to load a custom dictionary to improve segmentation accuracy (the file name in the example below is illustrative):
jieba.load_userdict("userdict.txt")  # replace "userdict.txt" with the path to your dictionary file
The format of the user dictionary is:
word  word frequency (optional)  part of speech (optional)
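For example, a hypothetical `userdict.txt` might contain entries like the following (the words, frequencies, and part-of-speech tags are illustrative):

```text
云计算 5
李小福 2 nr
创新办 3 i
```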
Adjust the dictionary
- Add words: use the `add_word(word, freq=None, tag=None)` method to add a word.
- Delete words: use the `del_word(word)` method to delete a word.
- Adjust word frequency: use the `suggest_freq(segment, tune=True)` method to adjust a word's frequency so that a specific word can (or cannot) be split out; a short example of all three calls follows this list.
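A minimal sketch of these three calls (the sample words and sentence are illustrative, borrowed from jieba's documentation examples):

```python
import jieba

# Add a domain-specific word so it is always kept as one token
jieba.add_word("石墨烯")

# Remove a word from the dictionary
jieba.del_word("自定义词")

# "中将" would otherwise be kept together in this sentence;
# suggest_freq raises the frequency of the split form so jieba cuts it into "中" / "将"
print("/".join(jieba.cut("如果放到post中将出错。", HMM=False)))
jieba.suggest_freq(("中", "将"), True)
print("/".join(jieba.cut("如果放到post中将出错。", HMM=False)))
```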
Keyword extraction
TF-IDF keyword extraction
The `jieba.analyse.extract_tags` method can be used to extract keywords based on the TF-IDF algorithm.
```python
import jieba.analyse

# "I love natural language processing. Chinese word segmentation is very interesting,
# and Chinese processing requires a lot of tools."
text = "我爱自然语言处理。中文分词很有趣，中文处理需要很多工具。"
keywords = jieba.analyse.extract_tags(text, topK=5)
print("Keywords:", keywords)
```
TextRank keyword extraction
The `jieba.analyse.textrank` method provides keyword extraction based on the TextRank algorithm.
```python
# reuses `text` and the jieba.analyse import from the previous example
keywords = jieba.analyse.textrank(text, topK=5)
print("Keywords:", keywords)
```
Part of speech annotation
`jieba` also supports part-of-speech tagging: with the `posseg` module you can tag the part of speech of each word.
```python
import jieba.posseg as pseg

# "I love Beijing Tiananmen Square"
words = pseg.cut("我爱北京天安门")
for word, flag in words:
    print(f"{word}, {flag}")
```
Get word location
Use the `tokenize` method to obtain the start and end positions of each word in the original text.
```python
import jieba

# "Yonghe Clothing and Accessories Co., Ltd."
result = jieba.tokenize("永和服装饰品有限公司")
for tk in result:
    print(f"word: {tk[0]}\t\t start: {tk[1]}\t\t end: {tk[2]}")
```
Keyword extraction in detail
TF-IDF keyword extraction
TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It evaluates the importance of a word by combining how frequently the word appears in a document (TF) with how rare the word is across all documents (IDF).
- Term Frequency (TF): The ratio of the number of times a word appears in a document to the total number of words in the document.
- Inverse Document Frequency (IDF): indicates how informative a word is, computed as
  \[ IDF(w) = \log\frac{N}{n(w)} \]
  - \( N \): total number of documents
  - \( n(w) \): number of documents containing the word \( w \)
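Combining the two gives the standard TF-IDF score used to rank candidate keywords:

\[ \text{TF-IDF}(w, d) = TF(w, d) \times IDF(w) \]

where \( TF(w, d) \) is the term frequency of \( w \) in document \( d \) as defined above.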
Sample code:
```python
import jieba.analyse

# same sample sentence as above
text = "我爱自然语言处理。中文分词很有趣，中文处理需要很多工具。"
keywords = jieba.analyse.extract_tags(text, topK=5, withWeight=True)
for word, weight in keywords:
    print(f"Keyword: {word}, Weight: {weight}")
```
TextRank keyword extraction
TextRank is an unsupervised graph-based algorithm that is often used for keyword extraction and summary generation. It builds a graph of words, computes the similarity between nodes, and uses the links between words to identify the important ones.
Sample code:
text = "In addition, the company plans to increase its capital by 430 million yuan in its wholly-owned subsidiary Jilin Eurasia Real Estate Co., Ltd. After the capital increase, Jilin Eurasia Real Estate's registered capital will increase from 70 million yuan to 500 million yuan." keywords = (text, topK=5, withWeight=True) for word, weight in keywords: print(f"Keywords: {word}, Weight: {weight}")
Performance comparison
In practical applications, `jieba`'s different segmentation modes have a significant impact on performance and accuracy. The following is a comparison of the modes:
| Mode | Speed | Accuracy | Typical use cases |
|---|---|---|---|
| Precise mode | Medium | High | Text analysis, content extraction |
| Full mode | Fast | Low | Keyword extraction, quick preliminary analysis |
| Search engine mode | Slower | Medium | Inverted indexes for search engines |
Example performance comparison code:
```python
import time
import jieba

text = "我来到北京清华大学"
jieba.initialize()  # load the dictionary up front so it does not skew the first timing

# Precise mode
start = time.time()
list(jieba.cut(text, cut_all=False))  # consume the generator so the segmentation actually runs
print("Precise mode time: ", time.time() - start)

# Full mode
start = time.time()
list(jieba.cut(text, cut_all=True))
print("Full mode time: ", time.time() - start)

# Search engine mode
start = time.time()
list(jieba.cut_for_search(text))
print("Search engine mode time: ", time.time() - start)
```
FAQ
Inaccurate segmentation
Question: certain words are split incorrectly, especially technical terms or proper names.
Solution: use the `add_word()` method to add the specific terms, or load a custom dictionary, to improve segmentation accuracy.
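A tiny sketch (the name, tag, and sentence are illustrative):

```python
import jieba

# Register a person's name with a part-of-speech tag so it is kept as one token
jieba.add_word("李小福", tag="nr")
print("/".join(jieba.cut("李小福是创新办主任")))
```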
Encoding issues
Question: garbled characters or segmentation errors occur when working with GBK-encoded text.
Solution: decode the text to a Unicode (UTF-8) string first, and avoid passing GBK-encoded data to jieba directly.
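For instance, a minimal sketch of reading a GBK-encoded file (the file name is hypothetical) and decoding it before segmentation:

```python
import jieba

# Decode the GBK bytes into a Python str (Unicode) before handing the text to jieba
with open("corpus_gbk.txt", encoding="gbk") as f:  # hypothetical file name
    text = f.read()

print("/ ".join(jieba.cut(text)))
```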
How to deal with ambiguity
Question: some phrases can be segmented in more than one way, and the default result is not ideal.
Solution: use the `suggest_freq()` method to adjust word frequencies and guide the segmenter toward the preferred split (see the dictionary adjustment example above).
Summary
`jieba` is a flexible and feature-rich Chinese word segmentation tool. With its different segmentation modes and custom dictionaries, you can tune it to your specific needs. Whether for text analysis or keyword extraction, `jieba` provides strong support.