Introduction
In Chinese Natural Language Processing (NLP), word segmentation is a basic and critical step. Because Chinese text has no spaces between words, segmentation is needed before text can be properly understood and processed. `jieba` is a popular Chinese word segmentation library that is both powerful and easy to use.
Install jieba
First, make sure the `jieba` module is installed. You can install it with the following command:
pip install jieba
Word segmentation modes
The `jieba` module supports three word segmentation modes:
- Precise mode: splits the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every word that could possibly be formed in the sentence; very fast, but cannot resolve ambiguity.
- Search engine mode: based on precise mode, long words are split again to improve recall; suitable for search engine indexing.
Basic word segmentation
```python
import jieba

text = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Full mode
full_mode = jieba.cut(text, cut_all=True)
print("Full Mode: " + "/ ".join(full_mode))

# Precise mode
exact_mode = jieba.cut(text, cut_all=False)
print("Precise Mode: " + "/ ".join(exact_mode))

# Default mode (precise mode)
default_mode = jieba.cut("他来到了网易杭研大厦")  # "He came to the NetEase Hangzhou Research Building"
print("Default Mode: " + "/ ".join(default_mode))
```
Search engine mode
Use the `cut_for_search` method, which is suitable for building inverted indexes for search engines.
```python
import jieba

# "Xiao Ming graduated from the Institute of Computing Technology, Chinese Academy of Sciences,
# and later studied at Kyoto University in Japan"
search_mode = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
print(", ".join(search_mode))
```
Custom Dictionary
Add a custom dictionary
`jieba` allows users to load a custom dictionary to improve segmentation accuracy (the file name in the example below is illustrative):
jieba.load_userdict("userdict.txt")  # replace "userdict.txt" with the path to your dictionary file
The format of the user dictionary is:
word  word frequency (optional)  part of speech (optional)
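For example, a hypothetical `userdict.txt` might contain entries like the following (the words, frequencies, and part-of-speech tags are illustrative):

```text
云计算 5
李小福 2 nr
创新办 3 i
```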
Adjust the dictionary
- Add words: use the `add_word(word, freq=None, tag=None)` method to add a word.
- Delete words: use the `del_word(word)` method to delete a word.
- Adjust word frequency: use the `suggest_freq(segment, tune=True)` method to adjust a word's frequency so that a specific word can (or cannot) be split out; a short example of all three calls follows this list.
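A minimal sketch of these three calls (the sample words and sentence are illustrative, borrowed from jieba's documentation examples):

```python
import jieba

# Add a domain-specific word so it is always kept as one token
jieba.add_word("石墨烯")

# Remove a word from the dictionary
jieba.del_word("自定义词")

# "中将" would otherwise be kept together in this sentence;
# suggest_freq raises the frequency of the split form so jieba cuts it into "中" / "将"
print("/".join(jieba.cut("如果放到post中将出错。", HMM=False)))
jieba.suggest_freq(("中", "将"), True)
print("/".join(jieba.cut("如果放到post中将出错。", HMM=False)))
```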
Keyword extraction
TF-IDF keyword extraction
The `jieba.analyse.extract_tags` method can be used to extract keywords based on the TF-IDF algorithm.
```python
import jieba.analyse

# "I love natural language processing. Chinese word segmentation is very interesting,
# and Chinese processing requires a lot of tools."
text = "我爱自然语言处理。中文分词很有趣，中文处理需要很多工具。"
keywords = jieba.analyse.extract_tags(text, topK=5)
print("Keywords:", keywords)
```
TextRank keyword extraction
The `jieba.analyse.textrank` method provides keyword extraction based on the TextRank algorithm.
```python
# reuses `text` and the jieba.analyse import from the previous example
keywords = jieba.analyse.textrank(text, topK=5)
print("Keywords:", keywords)
```
Part of speech annotation
`jieba` also supports part-of-speech tagging: with the `posseg` module you can tag the part of speech of each word.
```python
import jieba.posseg as pseg

# "I love Beijing Tiananmen Square"
words = pseg.cut("我爱北京天安门")
for word, flag in words:
    print(f"{word}, {flag}")
```
Get word location
Use the `tokenize` method to obtain the start and end positions of each word in the original text.
```python
import jieba

# "Yonghe Clothing and Accessories Co., Ltd."
result = jieba.tokenize("永和服装饰品有限公司")
for tk in result:
    print(f"word: {tk[0]}\t\t start: {tk[1]}\t\t end: {tk[2]}")
```
Keyword extraction in detail
TF-IDF keyword extraction
TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It evaluates the importance of a word by combining how frequently the word appears in a document (TF) with how rare the word is across all documents (IDF).
- Term Frequency (TF): The ratio of the number of times a word appears in a document to the total number of words in the document.
- Inverse Document Frequency (IDF): indicates how informative a word is, computed as
  \[ IDF(w) = \log\frac{N}{n(w)} \]
  - \( N \): total number of documents
  - \( n(w) \): number of documents containing the word \( w \)
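Combining the two gives the standard TF-IDF score used to rank candidate keywords:

\[ \text{TF-IDF}(w, d) = TF(w, d) \times IDF(w) \]

where \( TF(w, d) \) is the term frequency of \( w \) in document \( d \) as defined above.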
Sample code:
```python
import jieba.analyse

# same sample sentence as above
text = "我爱自然语言处理。中文分词很有趣，中文处理需要很多工具。"
keywords = jieba.analyse.extract_tags(text, topK=5, withWeight=True)
for word, weight in keywords:
    print(f"Keyword: {word}, Weight: {weight}")
```
TextRank keyword extraction
TextRank is an unsupervised graph-based algorithm that is often used for keyword extraction and summary generation. It builds a graph of words, computes the similarity between nodes, and uses the links between words to identify the important ones.
Sample code:
text = "In addition, the company plans to increase its capital by 430 million yuan in its wholly-owned subsidiary Jilin Eurasia Real Estate Co., Ltd. After the capital increase, Jilin Eurasia Real Estate's registered capital will increase from 70 million yuan to 500 million yuan." keywords = (text, topK=5, withWeight=True) for word, weight in keywords: print(f"Keywords: {word}, Weight: {weight}")
Performance comparison
In practical applications, `jieba`'s different segmentation modes have a significant impact on performance and accuracy. The following is a comparison of the modes:
| Mode | Speed | Accuracy | Typical use cases |
|---|---|---|---|
| Precise mode | Medium | High | Text analysis, content extraction |
| Full mode | Fast | Low | Keyword extraction, quick preliminary analysis |
| Search engine mode | Slower | Medium | Inverted indexes for search engines |
Example performance comparison code:
```python
import time
import jieba

text = "我来到北京清华大学"
jieba.initialize()  # load the dictionary up front so it does not skew the first timing

# Precise mode
start = time.time()
list(jieba.cut(text, cut_all=False))  # consume the generator so the segmentation actually runs
print("Precise mode time: ", time.time() - start)

# Full mode
start = time.time()
list(jieba.cut(text, cut_all=True))
print("Full mode time: ", time.time() - start)

# Search engine mode
start = time.time()
list(jieba.cut_for_search(text))
print("Search engine mode time: ", time.time() - start)
```
FAQ
Inaccurate segmentation
Question: certain words are split incorrectly, especially technical terms or proper names.
Solution: use the `add_word()` method to add the specific terms, or load a custom dictionary, to improve segmentation accuracy.
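A tiny sketch (the name, tag, and sentence are illustrative):

```python
import jieba

# Register a person's name with a part-of-speech tag so it is kept as one token
jieba.add_word("李小福", tag="nr")
print("/".join(jieba.cut("李小福是创新办主任")))
```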
Encoding issues
Question: garbled characters or segmentation errors occur when working with GBK-encoded text.
Solution: decode the text to a Unicode (UTF-8) string first, and avoid passing GBK-encoded data to jieba directly.
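For instance, a minimal sketch of reading a GBK-encoded file (the file name is hypothetical) and decoding it before segmentation:

```python
import jieba

# Decode the GBK bytes into a Python str (Unicode) before handing the text to jieba
with open("corpus_gbk.txt", encoding="gbk") as f:  # hypothetical file name
    text = f.read()

print("/ ".join(jieba.cut(text)))
```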
How to deal with ambiguity
Question: some phrases can be segmented in more than one way, and the default result is not ideal.
Solution: use the `suggest_freq()` method to adjust word frequencies and guide the segmenter toward the preferred split (see the dictionary adjustment example above).
Summary
`jieba` is a flexible and feature-rich Chinese word segmentation tool. With its different segmentation modes and custom dictionaries, you can tune it to your specific needs. Whether for text analysis or keyword extraction, `jieba` provides strong support.