1. Problem
Splitting text into sentences is, roughly speaking, a matter of cutting at Chinese end-of-sentence punctuation such as the full stop (。), the exclamation mark (!), and the question mark (?). The difficulty is that cutting directly at these marks can break a character's quoted speech into several separate pieces!
2. Steps
Paragraph segmentation
First, read the text. Once read, the whole text is a single string in which paragraphs are separated by whitespace, so the text is split on that whitespace and the pieces are stored in paragraph_list. Note that the index into the list is the sequence number of the paragraph! The remaining details are omitted here. (The complete code is at the end.)
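As a minimal sketch of this step, assuming the text is already in memory and that each non-empty line is one paragraph, the list can be built with plain string methods. The helper name split_paragraphs and the sample text are hypothetical and only serve the illustration; the complete code at the end reads the file with pandas instead.

```python
def split_paragraphs(text):
    """Return the non-empty lines of text, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample: three paragraphs separated by line breaks.
text = "The weather is good today!\n\n  Is the temperature high?\nHello。"
paragraph_list = split_paragraphs(text)

# The list index doubles as the paragraph number.
for index, paragraph in enumerate(paragraph_list):
    print(index, paragraph)
```

Blank lines and leading indentation are discarded, so the indices stay contiguous.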
Sentence splitting
First, take the paragraph_list produced above and loop over it to get each paragraph, then split each paragraph into sentences according to the sentence-splitting rules (regular expressions). The rules come from a referenced article.
import re

def cut_sent(para):
    # The marks are full-width because the rules target Chinese text
    # Split after a single end mark unless a closing quote follows
    para = re.sub(r'([。!??])([^”’])', r"\1\n\2", para)
    # Split after a six-dot English ellipsis
    para = re.sub(r'(\.{6})([^”’])', r"\1\n\2", para)
    # Split after a Chinese ellipsis (two … characters)
    para = re.sub(r'(…{2})([^”’])', r"\1\n\2", para)
    # When a closing quote follows the end mark, the quote ends the sentence
    para = re.sub(r'([。!??][”’])([^,。!??])', r'\1\n\2', para)
    para = para.rstrip()  # drop trailing whitespace so the last item is not empty
    return para.split("\n")

# Sample text (translated, with the Chinese punctuation kept) to run through the splitter
s = ("The weather is good today!"
     "Is the temperature high?Hello, I'm glad to meet you, it's really good。"
     "Xiao Ming met Xiaohong and said:“Your clothes look good!”"
     "Xiaohong said:“What?The clothes look so good?Really?”"
     "Xiao Ming replied:“Well, really!I want to buy it, too。”")

for i in cut_sent(s):
    print(i)

# The result splits the characters' speech into separate pieces
"""
The weather is good today!
Is the temperature high?
Hello, I'm glad to meet you, it's really good。
Xiao Ming met Xiaohong and said:“Your clothes look good!”
Xiaohong said:“What?
The clothes look so good?
Really?”
Xiao Ming replied:“Well, really!
I want to buy it, too。”
"""
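To see why the rules treat a closing quote specially, the sketch below applies only the first and the last rule to one short string. The sample sentence and the names END_MARK and END_QUOTE are hypothetical, and the patterns assume full-width Chinese punctuation.

```python
import re

# First rule: split after an end mark unless a closing quote follows it.
END_MARK = re.compile(r'([。!??])([^”’])')
# Last rule: split after an end mark that is followed by a closing quote.
END_QUOTE = re.compile(r'([。!??][”’])([^,。!??])')

text = "He said:“Fine!”Then he left。"

# The first rule leaves the text alone: "!" is followed by a closing quote,
# and the final "。" has no character after it.
after_first = END_MARK.sub(r"\1\n\2", text)

# The last rule puts the break after the closing quote instead.
after_last = END_QUOTE.sub(r"\1\n\2", after_first)

print(after_first == text)
print(after_last.split("\n"))
```

The closing quote therefore stays attached to the sentence it ends, and the break lands after it.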
Connection
The solution here is to loop over the sentences and check each one for two symbols, the opening :“ and the closing ”:
- If both symbols are present, the sentence is already a whole sentence and is added directly.
- If neither symbol is present, the sentence is also a whole sentence and is added directly.
- If only the opening symbol is present, record that sentence and keep splicing the following sentences onto it in order until one contains the closing ”, then add the spliced result as a single whole sentence.
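def connect(paragraph):
    sentence_before = []
    sentence_after = []
    for each_para in paragraph:
        # cut_sent is the splitting function from the previous step
        sentence_before.append(cut_sent(each_para))
    # Core code! (Connect the missed statements)
    for each in sentence_before:
        sentences = []
        sentence = ""
        # Very critical! FLAG is False while the pieces after an unclosed :“ are still being spliced
        FLAG = True
        for i in each:
            has_open = ':“' in i
            has_close = '”' in i
            # Both symbols or neither: the piece is already a whole sentence
            if has_open == has_close and FLAG:
                sentences.append(i)
            else:
                FLAG = False
                sentence = sentence + i
                if has_close:
                    sentences.append(sentence)
                    sentence = ""
                    FLAG = True
        if sentence:
            # Keep an unterminated quote instead of silently dropping it
            sentences.append(sentence)
        sentence_after.append(sentences)
    return sentence_after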
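The three rules can be checked on a small list of pieces that have already been split. This is a sketch under assumptions: join_quotes is a hypothetical helper written for the illustration, the markers are the full-width :“ and ”, and an unterminated quote at the end of a paragraph is kept rather than dropped.

```python
def join_quotes(pieces, open_mark=":“", close_mark="”"):
    """Merge split pieces so that each quoted speech stays in one sentence."""
    merged = []
    buffer = ""
    in_quote = False
    for piece in pieces:
        has_open = open_mark in piece
        has_close = close_mark in piece
        # Both markers or neither: the piece is already a whole sentence.
        if not in_quote and has_open == has_close:
            merged.append(piece)
            continue
        # Otherwise keep splicing until a piece carries the closing marker.
        buffer += piece
        in_quote = True
        if has_close:
            merged.append(buffer)
            buffer = ""
            in_quote = False
    if buffer:
        merged.append(buffer)  # unterminated quote: keep what was collected
    return merged


pieces = [
    "The weather is good today!",
    "Xiaohong said:“What?",
    "The clothes look so good?",
    "Really?”",
    "Xiao Ming replied:“Well, really!",
    "I want to buy it, too。”",
]
result = join_quotes(pieces)
for sentence in result:
    print(sentence)
```

The six pieces collapse into three sentences, and each speaker's words end up on a single line.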
3. The complete code
import re
import pandas as pd

# Segment the entire article
def segments(url):
    # sep='aaa' never occurs in the text, so every line lands in the single 'txt' column
    raw = pd.read_csv(url, names=['txt'], sep='aaa', encoding="GBK", engine='python')

    def m_head(tem_str):
        return tem_str[:1]

    def m_mid(tmp_str):
        return tmp_str.find("回")

    raw['head'] = raw.txt.apply(m_head)
    raw['mid'] = raw.txt.apply(m_mid)
    raw['len'] = raw.txt.apply(len)

    # A chapter title starts with "第", contains "回", and is shorter than 30 characters
    chap_num = 0
    for i in range(len(raw)):
        if raw['head'][i] == "第" and raw['mid'][i] > 0 and raw['len'][i] < 30:
            chap_num += 1
        # The title "Appendix 1: Genghis Khan's Family" marks the end of the chapters
        if chap_num >= 40 and raw['txt'][i] == "附录一:成吉思汗家族":
            chap_num = 0
        raw.loc[i, 'chap'] = chap_num

    del raw['head']
    del raw['mid']
    del raw['len']

    # Keep chapter 7 only; the new row index is the paragraph number
    tmp_chap = raw[raw['chap'] == 7].copy()
    tmp_chap.reset_index(drop=True, inplace=True)
    tmp_chap['paraidx'] = tmp_chap.index
    paragraph = tmp_chap['txt'].values.tolist()
    return paragraph

# Split each paragraph into sentences
def cut(para):
    # Related rules (full-width marks, because the text is Chinese)
    pattern = [r'([。!??])([^”’])', r'(\.{6})([^”’])', r'(…{2})([^”’])',
               r'([。!??][”’])([^,。!??])']
    for i in pattern:
        para = re.sub(i, r"\1\n\2", para)
    para = para.rstrip()
    return para.split("\n")

# Connect the missed statements (mainly for discourse)
def connect(paragraph):
    sentence_before = []
    sentence_after = []
    for each_para in paragraph:
        sentence_before.append(cut(each_para))
    # Core code! (Connect the missed statements)
    for each in sentence_before:
        sentences = []
        sentence = ""
        # Very critical! FLAG is False while the pieces after an unclosed :“ are still being spliced
        FLAG = True
        for i in each:
            has_open = ':“' in i
            has_close = '”' in i
            if has_open == has_close and FLAG:
                sentences.append(i)
            else:
                FLAG = False
                sentence = sentence + i
                if has_close:
                    sentences.append(sentence)
                    sentence = ""
                    FLAG = True
        if sentence:
            # Keep an unterminated quote instead of silently dropping it
            sentences.append(sentence)
        sentence_after.append(sentences)
    return sentence_after

# Save the last result to a DataFrame
def toDataFrame(list3):
    # Collect the rows first; DataFrame.append was removed in pandas 2.0
    rows = []
    for para_num, i in enumerate(list3):
        for sentence_num, j in enumerate(i):
            rows.append({"content": j, "paragraph": para_num, "sentence": sentence_num + 1})
    df = pd.DataFrame(rows, columns=["content", "paragraph", "sentence"])
    for i in df['content'].values.tolist():
        print(i)
    return df

def main():
    # URL = "/Users/dengzhao/Downloads/Jin Yong-The Legend of the Condor Heroes txt Precision Version.txt"
    URL = input("Please enter the file address:")
    para = segments(URL)
    result = connect(para)
    print(result)
    flag = input("Output the data as a DataFrame (Y/N):")
    if flag == 'Y':
        toDataFrame(result)
    elif flag == 'N':
        print("Thanks!!!!")
    else:
        print("The program ends! Please check your input!")

if __name__ == '__main__':
    main()