SoFunction
Updated on 2024-10-30

Python Machine Learning NLP (Natural Language Processing) Basic Operations: Keywords

Overview

Today we begin our journey into Natural Language Processing (NLP). Natural Language Processing allows us to process, understand, and utilize human language, bridging the gap between machine language and human language.


Keywords

Keywords are the key words or phrases that capture the essence of an article. They have important applications in literature search, automatic summarization, and text clustering/classification.


Methods of keyword extraction

Keyword extraction: analyze a new document algorithmically and extract some of the words in the document as its keywords.

Keyword assignment: given an existing keyword database, assign a few words from the database to a new document as its keywords.
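The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the keyword database, the function names, and the use of raw word counts as the score are all assumptions made for the example, not something the methods above prescribe.

```python
import re
from collections import Counter

# Hypothetical stop-word list and keyword database, chosen only for this illustration.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}
KEYWORD_DATABASE = {"language", "retrieval", "clustering", "translation"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(text, top_k=3):
    """Keyword extraction: candidates come from the document itself."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_k)]


def assign_keywords(text, database, top_k=3):
    """Keyword assignment: candidates come only from an existing database."""
    counts = Counter(w for w in tokenize(text) if w in database)
    return [w for w, _ in counts.most_common(top_k)]


doc = "Language models help retrieval. Language is central to translation and to language study."
print(extract_keywords(doc))                   # words taken from the document
print(assign_keywords(doc, KEYWORD_DATABASE))  # words restricted to the database
```

Extraction can return any word that occurs in the document, while assignment can only return words that are already in the database.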

TF-IDF Keyword Extraction

TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting technique used in information retrieval and data mining. TF-IDF can help us mine the keywords of an article: through numerical statistics, it reflects how important a word is to a particular document in the corpus.

TF

TF (Term Frequency) represents how often a word occurs in a text.

Formula:

TF = number of occurrences of the word in the document / total number of words in the document

IDF

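IDF (Inverse Document Frequency) is based on the inverse of the fraction of documents in the corpus that contain the word.

Formula:

IDF = total number of documents in the corpus / number of documents containing the word

In practice, the logarithm of this ratio is usually taken.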

TF-IDF

Formula:

TF-IDF = TF × IDF = (number of occurrences of the word / total number of words in the document) × (total number of documents / number of documents containing the word)

If a word is very common across the corpus, its IDF will be low, and vice versa. TF-IDF therefore helps us filter out common words and extract keywords.
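As a minimal sketch of the formula above, the following pure-Python code computes TF, IDF, and their product on a toy corpus. The corpus, the function names, and the use_log switch are illustrative assumptions and are not part of jieba. With use_log=False the IDF is the plain ratio stated above; with use_log=True the logarithm of that ratio is taken, which is a common variant.

```python
import math


def tf(word, doc_tokens):
    """Term frequency: occurrences of the word / total words in the document."""
    return doc_tokens.count(word) / len(doc_tokens)


def idf(word, corpus, use_log=True):
    """Inverse document frequency: total documents / documents containing the word."""
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        return 0.0  # the word never occurs, so it carries no weight here
    ratio = len(corpus) / containing
    return math.log(ratio) if use_log else ratio


def tf_idf(word, doc_tokens, corpus, use_log=True):
    return tf(word, doc_tokens) * idf(word, corpus, use_log)


# Toy corpus of three already-tokenized documents (an assumption for the example).
corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["language", "is", "a", "tool"],
    ["computers", "process", "language"],
]

doc = corpus[0]
scores = {w: tf_idf(w, doc, corpus, use_log=False) for w in set(doc)}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

In this corpus "language" appears in every document, so it receives the lowest score in the first document, while "natural", which appears in only one document, scores higher.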

jieba TF-IDF keyword extraction

Format.

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

Parameters.

sentence: the text from which keywords are extracted

topK: the number of keywords to return; the default is 20

withWeight: whether to return each keyword's weight along with it; the default is False

allowPOS: include only words with the specified parts of speech; the default is empty, i.e., no filtering

jieba part-of-speech tags

Code  Part of speech             Description
Ag    adjectival morpheme        A morpheme with adjectival character. The adjective code is a; the morpheme code g is preceded by A.
a     adjective                  First letter of the English word "adjective".
ad    adverbial adjective        An adjective used directly as an adverbial. The adjective code a and the adverb code d are combined.
an    nominal adjective          An adjective that functions as a noun. The adjective code a and the noun code n are combined.
b     distinguishing word        Initial consonant of the Chinese character 别 (bie).
c     conjunction                First letter of the English word "conjunction".
dg    adverbial morpheme         A morpheme with adverbial character. The adverb code is d; the morpheme code g is preceded by D.
d     adverb                     Second letter of "adverb", since its first letter is already used for adjectives.
e     exclamation                First letter of the English word "exclamation".
f     locality word              From the Chinese character 方 (fang).
g     morpheme                   Most morphemes can serve as the "root" of a compound word; initial consonant of the Chinese character 根 (gen, "root").
h     prefix component           First letter of the English word "head".
i     idiom                      First letter of the English word "idiom".
j     abbreviation               Initial consonant of the Chinese character 简 (jian).
k     suffix component
l     customary expression       A customary expression that has not yet become an idiom and is somewhat "provisional"; initial consonant of 临 (lin).
m     numeral                    Third letter of the English word "numeral", since n and u are already used for other purposes.
Ng    nominal morpheme           A morpheme with noun character. The noun code is n; the morpheme code g is preceded by N.
n     noun                       First letter of the English word "noun".
nr    personal name              The noun code n combined with the initial consonant of 人 (ren, "person").
ns    place name                 The noun code n and the place-word code s are combined.
nt    organization               The initial consonant of 团 (tuan, "group") is t; the noun code n and t are combined.
nz    other proper noun          The first letter of the initial consonant of 专 (zhuan, "specialized") is z; the noun code n and z are combined.
o     onomatopoeia               First letter of the English word "onomatopoeia".
p     preposition                First letter of the English word "prepositional".
q     classifier (measure word)  First letter of the English word "quantity".
r     pronoun                    Second letter of the English word "pronoun", since p is already used for prepositions.
s     place word                 First letter of the English word "space".
tg    time morpheme              A morpheme with time-word character. The time word code is t; the morpheme code g is preceded by T.
t     time word                  First letter of the English word "time".
u     auxiliary particle         From the English word "auxiliary".
vg    verbal morpheme            A morpheme with verbal character. The verb code is v; the morpheme code g is preceded by V.
v     verb                       First letter of the English word "verb".
vd    adverbial verb             A verb used directly as an adverbial. The codes for verb and adverb are combined.
vn    nominal verb               A verb that functions as a noun. The codes for verb and noun are combined.
w     punctuation mark
x     non-morpheme character     A non-morpheme character is just a symbol; the letter x is commonly used to denote unknowns and symbols.
y     modal particle             Initial consonant of the Chinese character 语 (yu).
z     status word                First letter of the initial consonant of the Chinese character 状 (zhuang).
un    unknown word

Without keyword weights

Example.

import jieba.analyse

# Define text (the original sample is Chinese; it is shown here in translation)
text = "Natural language processing is an important direction in the field of computer science and artificial intelligence. " \
       "It investigates various theories and methods that enable effective communication between humans and computers in natural language. " \
       "Natural language processing is a science that integrates linguistics, computer science, and mathematics. " \
       "Thus, research in this area will involve natural language, the language that people use every day, " \
       "so it's closely related to the study of linguistics, but there are important differences. " \
       "Natural language processing is not the study of natural language in general; " \
       "rather, it lies in the development of computer systems, especially the software systems therein, that can effectively implement natural language communication, " \
       "and thus it is part of computer science."
# Extract keywords with TF-IDF (keywords only, no weights)
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=False)
# Print the list of keywords
print(keywords)

Output.

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\
Loading model cost 0.890 seconds.
Prefix dict has been built successfully.
['Natural language', 'Computer Science', 'Linguistics', 'Research', 'Fields', 'Processing', 'Communication', 'Effective', 'Software systems', 'Artificial Intelligence', 'Realization', 'Computer systems', 'Important', 'One', 'The One Door', 'Everyday', 'Computer', 'Close', 'Math', 'Development']

With keyword weights

import jieba.analyse

# Define text
text = "Natural language processing is an important direction in the field of computer science and artificial intelligence. " \
       "It investigates various theories and methods that enable effective communication between humans and computers in natural language. " \
       "Natural language processing is a science that integrates linguistics, computer science, and mathematics. " \
       "Thus, research in this area will involve natural language, the language that people use every day, " \
       "so it's closely related to the study of linguistics, but there are important differences. " \
       "Natural language processing is not the study of natural language in general; " \
       "rather, it lies in the development of computer systems, especially the software systems therein, that can effectively implement natural language communication, " \
       "and thus it is part of computer science."
# Extract keywords with TF-IDF (each result is a (keyword, weight) tuple)
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=True)
# Print the list of (keyword, weight) tuples
print(keywords)

Output.

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\
Loading model cost 1.110 seconds.
Prefix dict has been built successfully.
[('Natural language', 1.1237629576061539), ('Computer Science', 0.4503481350267692), ('Linguistics', 0.27566262244215384), ('Research', 0.2660770221507693), ('Fields', 0.24979825580353845), ('Processing', 0.24973179957046154), ('Communication', 0.2043557391963077), ('Effective', 0.16296019853692306), ('Software systems', 0.16102600688461538), ('Artificial Intelligence', 0.14550809839215384), ('Realization', 0.14389939312584615), ('Computer systems', 0.1402028601413846), ('Important', 0.12347581087876922), ('One', 0.11349408224353846), ('The One Door', 0.11300493477184616), ('Everyday', 0.10913612756276922), ('Computer', 0.1046889912443077), ('Close', 0.10181409957492307), ('Math', 0.10166677655076924), ('Development', 0.09868653898630769)]

TextRank

TextRank builds a graph from the adjacency (co-occurrence) relationships between words, then iteratively computes a rank value for each node using PageRank. Sorting the rank values yields the keywords.
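The idea can be sketched in pure Python as follows. This is a simplified illustration and not jieba's implementation: the function name, the span and damping parameters, the unweighted graph, and the fixed number of iterations are all assumptions made for the example.

```python
from collections import defaultdict


def textrank_keywords(tokens, span=2, damping=0.85, iterations=30, top_k=5):
    """Rank words by PageRank over an undirected word co-occurrence graph."""
    # Link each word to the `span` words that follow it.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + span]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update: a node receives an equal share
    # of the rank of each of its neighbors.
    rank = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        rank = {
            word: (1 - damping)
            + damping * sum(rank[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Sort by rank value, highest first.
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


tokens = ["language", "processing", "language", "science",
          "language", "model", "language", "theory"]
for word, score in textrank_keywords(tokens, span=1):
    print(f"{word}: {score:.3f}")
```

Here "language" is adjacent to every other word, so it becomes the best-connected node in the graph and receives the highest rank.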

import jieba.analyse

# Define text
text = "Natural language processing is an important direction in the field of computer science and artificial intelligence. " \
       "It investigates various theories and methods that enable effective communication between humans and computers in natural language. " \
       "Natural language processing is a science that integrates linguistics, computer science, and mathematics. " \
       "Thus, research in this area will involve natural language, the language that people use every day, " \
       "so it's closely related to the study of linguistics, but there are important differences. " \
       "Natural language processing is not the study of natural language in general; " \
       "rather, it lies in the development of computer systems, especially the software systems therein, that can effectively implement natural language communication, " \
       "and thus it is part of computer science."
# Extract keywords with TextRank
keywords = jieba.analyse.textrank(text, topK=20, withWeight=False)
# Print the list of keywords
print(keywords)

Output.

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\
['Research', 'Fields', 'Computer Science', 'Realization', 'Processing', 'Linguistics', 'Math', 'People', 'Computer', 'Involved', 'has', 'One', 'Methods', 'Language', 'Development', 'Use', 'Artificial Intelligence', 'lies', 'Contact', 'Science']
Loading model cost 1.062 seconds.
Prefix dict has been built successfully.
