Overview
Today we begin our journey into Natural Language Processing (NLP). Natural Language Processing allows us to process, understand, and utilize human language, bridging the gap between machine language and human language.
Keywords
Keywords are the key words or phrases that capture the essence of an article. They have important applications in literature search, automatic summarization, and text clustering and classification.
Methods of keyword extraction
Keyword extraction: analyze a new document algorithmically and select some of the words that occur in it as the document's keywords.
Keyword assignment: given an existing keyword database, assign a few words from that database to a new document as its keywords.
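To make the difference concrete, here is a minimal sketch of keyword assignment, assuming the keyword database is a plain set of lowercase English terms. The function name `assign_keywords` and the sample vocabulary are hypothetical and exist only for illustration; a real system would likely use a proper tokenizer and a much larger database.

```python
import re
from collections import Counter


def assign_keywords(document, vocabulary, top_k=3):
    """Return up to top_k vocabulary terms, most frequent in the document first."""
    # Naive tokenization: lowercase alphabetic runs only (an assumption of this sketch).
    tokens = re.findall(r"[a-z]+", document.lower())
    # Count only the tokens that belong to the predefined vocabulary.
    counts = Counter(token for token in tokens if token in vocabulary)
    return [term for term, _ in counts.most_common(top_k)]


vocabulary = {"language", "computer", "biology", "linguistics"}
doc = ("Natural language processing links computer science and linguistics. "
       "Language is central.")
print(assign_keywords(doc, vocabulary))
```

Keyword extraction, by contrast, is not limited to a fixed vocabulary: any word in the document is a candidate.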
TF-IDF Keyword Extraction
TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting technique in information retrieval and data mining. It is a numerical statistic that reflects how important a word is to one particular article in a corpus, which makes it useful for mining the keywords of articles.
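TF
TF (Term Frequency) is how often a word occurs in a document, normalized by the length of the document.
Formula.
$$\mathrm{TF}(w, d) = \frac{\text{number of times } w \text{ occurs in } d}{\text{total number of words in } d}$$
IDF
IDF (Inverse Document Frequency) measures how rare a word is across the corpus: the fewer documents contain the word, the larger its IDF. The ratio is usually taken on a logarithmic scale.
Formula.
$$\mathrm{IDF}(w) = \log \frac{\text{total number of documents}}{\text{number of documents containing } w}$$
TF-IDF
Formula.
$$\text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w) = \frac{\text{number of times } w \text{ occurs in } d}{\text{total number of words in } d} \times \log \frac{\text{total number of documents}}{\text{number of documents containing } w}$$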
If a word appears in most documents, its IDF is low, and vice versa. TF-IDF therefore filters out common words and surfaces the words that are frequent in one document but rare in the rest of the corpus.
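The calculation can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the usual log-scaled IDF and a toy corpus of whitespace-tokenized English sentences; the function name `tf_idf` and the corpus are made up for this example and are not part of jieba.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each word occurs in.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # TF (count / total) times IDF (log of N / document frequency).
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(corpus)
# Print the three highest-scoring words of the first document.
print(sorted(scores[0], key=scores[0].get, reverse=True)[:3])
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and it drops out, while words unique to one document score highest.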
jieba TF-IDF keyword extraction
Format.
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
Parameters.
sentence
: the text to extract keywords from
topK
: the number of keywords returned, default is 20
withWeight
: whether to return keyword weight, default is False
allowPOS
: include only words with the specified part-of-speech tags; defaults to empty, i.e., no filtering.
jieba part-of-speech tags
Tag | Part of speech | Description |
---|---|---|
Ag | adjectival morpheme | Adjectival morpheme. The adjective code is a, and the morpheme code g is preceded by A. |
a | adjective | From the first letter of the English word "adjective". |
ad | adverbial adjective | Adjectives used directly as adverbials. The adjective code a and the adverb code d are combined. |
an | nominal adjective | Adjectives that function as nouns. The adjective code a and the noun code n are combined. |
b | distinguishing word | From the initial consonant of the Chinese character 别 (bié). |
c | conjunction | From the first letter of the English word "conjunction". |
dg | adverbial morpheme | Adverbial morpheme. The adverb code is d, and the morpheme code g is preceded by D. |
d | adverb | From the second letter of "adverb", since its first letter is already used for adjectives. |
e | interjection | From the first letter of the English word "exclamation". |
f | locality word | From the Chinese character 方 (fāng). |
g | morpheme | Most morphemes can serve as the "root" of a compound word; from the initial consonant of the Chinese character 根 (gēn, "root"). |
h | prefix | From the first letter of the English word "head". |
i | idiom | From the first letter of the English word "idiom". |
j | abbreviation | From the initial consonant of the Chinese character 简 (jiǎn). |
k | suffix | |
l | fixed expression | Fixed expressions that have not yet become idioms and are somewhat "provisional"; from the initial consonant of 临 (lín). |
m | numeral | From the third letter of the English word "numeral", since n and u are already used for other tags. |
Ng | noun morpheme | Noun morpheme. The noun code is n, and the morpheme code g is preceded by N. |
n | noun | From the first letter of the English word "noun". |
nr | personal name | The noun code n combined with the initial consonant of 人 (rén, "person"). |
ns | place name | The noun code n and the place code s are combined. |
nt | organization name | 团 (tuán, "group") has the initial consonant t; the noun code n and t are combined. |
nz | other proper noun | The initial consonant of 专 (zhuān, "specialized") starts with z; the noun code n and z are combined. |
o | onomatopoeia | From the first letter of the English word "onomatopoeia". |
p | preposition | From the first letter of the English word "prepositional". |
q | classifier (measure word) | From the first letter of the English word "quantity". |
r | pronoun | From the second letter of the English word "pronoun", since p is already used for prepositions. |
s | place word | From the first letter of the English word "space". |
tg | time morpheme | Time-word morpheme. The time word code is t, placed before the morpheme code g. |
t | time word | From the first letter of the English word "time". |
u | particle | From the English word "auxiliary". |
vg | verbal morpheme | Verbal morpheme. The verb code is v, and the morpheme code g is preceded by V. |
v | verb | From the first letter of the English word "verb". |
vd | adverbial verb | Verbs used directly as adverbials. The verb code and the adverb code are combined. |
vn | nominal verb | Verbs that function as nouns. The verb code and the noun code are combined. |
w | punctuation mark | |
x | non-morpheme character | A non-morpheme character is just a symbol; the letter x is commonly used to stand for unknowns and symbols. |
y | modal particle | From the initial consonant of the Chinese character 语 (yǔ). |
z | status word | From the first letter of the initial consonant of the Chinese character 状 (zhuàng). |
un | unknown word | |
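
As a rough illustration of what a part-of-speech filter such as allowPOS does with these tags, the sketch below filters a list of hand-tagged words. The helper `filter_by_pos` and the sample data are hypothetical; this is not jieba's code, and jieba's own filtering may differ in detail.

```python
def filter_by_pos(tagged_words, allow_pos=()):
    """Keep words whose tag is in allow_pos; an empty tuple keeps everything."""
    if not allow_pos:
        return [word for word, _ in tagged_words]
    return [word for word, tag in tagged_words if tag in allow_pos]


# Hand-tagged tokens for illustration only (not produced by jieba).
tagged = [
    ("language", "n"), ("process", "v"), ("important", "a"),
    ("and", "c"), ("research", "vn"), ("Beijing", "ns"),
]

# Keep nouns, nominal verbs, and place names.
print(filter_by_pos(tagged, allow_pos=("n", "vn", "ns")))
```

Restricting keywords to noun-like tags is a common way to drop function words and adjectives from the result.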
Without keyword weights
Example.
    import jieba.analyse

    # Define text
    text = "Natural language processing is an important direction in the field of computer science and artificial intelligence. " \
           "It investigates various theories and methods that enable effective communication between humans and computers in natural language. " \
           "Natural language processing is a science that integrates linguistics, computer science, and mathematics. " \
           "Thus, research in this area will involve natural language, the language that people use every day, " \
           "So it's closely related to the study of linguistics, but there are important differences. " \
           "Natural language processing is not the study of natural language in general, " \
           "Rather, it lies in the development of computer systems, especially the software systems therein, that can effectively implement natural language communication. " \
           "and thus it is part of computer science."

    # Extract the top 20 keywords by TF-IDF (words only, no weights)
    keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=False)

    # Debug output
    print(keywords)
Output.
    Building prefix dict from the default dictionary ...
    Loading model from cache C:\Users\Windows\AppData\Local\Temp\
    Loading model cost 0.890 seconds.
    Prefix dict has been built successfully.
    ['Natural language', 'Computer Science', 'Linguistics', 'Research', 'Fields', 'Processing', 'Communication', 'Effective', 'Software systems', 'Artificial Intelligence', 'Realization', 'Computer systems', 'Important', 'One', 'The One Door', 'Everyday', 'Computer', 'Close', 'Math', 'Development']
With keyword weights
    import jieba.analyse

    # Alternative sample text (defined but not used below)
    content = "Natural Language Processing is a sub-discipline in the field of Artificial Intelligence and Linguistics. This field explores how natural language is processed and utilized; natural language processing includes multiple aspects and steps, basically cognitive, comprehension, and generative components."

    # Define text
    text = "Natural language processing is an important direction in the field of computer science and artificial intelligence. " \
           "It investigates various theories and methods that enable effective communication between humans and computers in natural language. " \
           "Natural language processing is a science that integrates linguistics, computer science, and mathematics. " \
           "Thus, research in this area will involve natural language, the language that people use every day, " \
           "So it's closely related to the study of linguistics, but there are important differences. " \
           "Natural language processing is not the study of natural language in general, " \
           "Rather, it lies in the development of computer systems, especially the software systems therein, that can effectively implement natural language communication. " \
           "and thus it is part of computer science."

    # Extract the top 20 keywords by TF-IDF as (word, weight) pairs
    keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=True)

    # Debug output
    print(keywords)
Output.
    Building prefix dict from the default dictionary ...
    Loading model from cache C:\Users\Windows\AppData\Local\Temp\
    Loading model cost 1.110 seconds.
    Prefix dict has been built successfully.
    [('Natural language', 1.1237629576061539), ('Computer Science', 0.4503481350267692), ('Linguistics', 0.27566262244215384), ('Research', 0.2660770221507693), ('Fields', 0.24979825580353845), ('Processing', 0.24973179957046154), ('Communication', 0.2043557391963077), ('Effective', 0.16296019853692306), ('Software systems', 0.16102600688461538), ('Artificial Intelligence', 0.14550809839215384), ('Realization', 0.14389939312584615), ('Computer systems', 0.1402028601413846), ('Important', 0.12347581087876922), ('One', 0.11349408224353846), ('The One Door', 0.11300493477184616), ('Everyday', 0.10913612756276922), ('Computer', 0.1046889912443077), ('Close', 0.10181409957492307), ('Math', 0.10166677655076924), ('Development', 0.09868653898630769)]
TextRank
TextRank builds a graph in which the nodes are words and the edges link words that occur near each other in the text, then iteratively computes a rank value for each node using PageRank. Sorting the words by rank value yields the keywords.
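The idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the graph is unweighted and undirected, the text is already tokenized, and the function name `textrank_keywords` and the parameters `window`, `damping`, and `iterations` are chosen for this example. jieba's own implementation may differ in details such as edge weights and part-of-speech filtering.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=30):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Start from a uniform rank and apply the PageRank update repeatedly.
    rank = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        rank = {
            word: (1 - damping) + damping * sum(
                rank[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }

    # Highest rank first.
    return sorted(rank, key=rank.get, reverse=True)


tokens = "nlp studies language nlp uses computers nlp needs data".split()
print(textrank_keywords(tokens, window=1)[:3])
```

A word that co-occurs with many different words, like "nlp" in this toy input, collects rank from all of its neighbors and ends up at the top of the list.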
    import jieba.analyse

    # Alternative sample text (defined but not used below)
    content = "Natural Language Processing is a sub-discipline in the field of Artificial Intelligence and Linguistics. This field explores how natural language is processed and utilized; natural language processing includes multiple aspects and steps, basically cognitive, comprehension, and generative components."

    # Define text
    text = "Natural language processing is an important direction in the field of computer science and artificial intelligence. " \
           "It investigates various theories and methods that enable effective communication between humans and computers in natural language. " \
           "Natural language processing is a science that integrates linguistics, computer science, and mathematics. " \
           "Thus, research in this area will involve natural language, the language that people use every day, " \
           "So it's closely related to the study of linguistics, but there are important differences. " \
           "Natural language processing is not the study of natural language in general, " \
           "Rather, it lies in the development of computer systems, especially the software systems therein, that can effectively implement natural language communication. " \
           "and thus it is part of computer science."

    # Extract the top 20 keywords with TextRank
    keywords = jieba.analyse.textrank(text, topK=20, withWeight=False)

    # Debug output
    print(keywords)
Output.
    Building prefix dict from the default dictionary ...
    Loading model from cache C:\Users\Windows\AppData\Local\Temp\
    ['Research', 'Fields', 'Computer Science', 'Realization', 'Processing', 'Linguistics', 'Math', 'People', 'Computer', 'Involved', 'has', 'One', 'Methods', 'Language', 'Development', 'Use', 'Artificial Intelligence', 'lies', 'Contact', 'Science']
    Loading model cost 1.062 seconds.
    Prefix dict has been built successfully.