In this article, we will show how to implement poetry solitaire using a Python crawler.
The idea of the project is as follows:
Crawl poems with a crawler to build a corpus of poems;
Split each poem into verses to form a dictionary: the key is the pinyin of the first character of a verse, the value is the list of verses that start with that pinyin, and save the dictionary as a pickle file;
Read the pickle file, write the game program, and package the program as an exe file.
This project implements poetry solitaire with the following rule: the first character of the next verse must have the same pinyin (including the tone) as the last character of the previous verse. The following sections describe the implementation step by step.
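To make the rule concrete, here is a minimal sketch of the matching check. It assumes a tiny hand-written pinyin table (`TOY_PINYIN`) instead of a real pinyin library, and the helper names `strip_punctuation` and `can_follow` are hypothetical, chosen only for this illustration.

```python
# Hypothetical, hand-written table of toned pinyin for a few characters.
# The project itself looks pinyin up with the xpinyin package instead.
TOY_PINYIN = {
    "光": "guāng",
    "霜": "shuāng",
    "双": "shuāng",
    "乡": "xiāng",
}

PUNCTUATION = "，。？！、；："


def strip_punctuation(verse):
    """Drop trailing punctuation so the last real character is compared."""
    return verse.rstrip(PUNCTUATION)


def can_follow(previous, candidate, table=TOY_PINYIN):
    """Return True if the first character of `candidate` has the same toned
    pinyin as the last character of `previous`."""
    last_char = strip_punctuation(previous)[-1]
    first_char = candidate[0]
    if last_char not in table or first_char not in table:
        return False
    return table[last_char] == table[first_char]


print(can_follow("疑是地上霜。", "双兔傍地走，"))  # True: 霜 and 双 are both shuāng
print(can_follow("疑是地上霜。", "光阴似箭。"))    # False: 光 is guāng
```

Because the tone is part of the key, two characters with the same syllable but different tones do not match under this rule.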
Poetry Corpus
First, we use a Python crawler to crawl the poems and build a corpus. The URL for crawling is:, the page is as follows:
Since this article mainly aims to show the idea of the project, we only crawl four collection pages: 300 Tang poems, 300 ancient poems, 300 Song lyrics, and selected Song lyrics, about 1,100 poems in total. To speed up crawling, the crawler runs concurrently, and the results are saved to a file. The complete Python program is as follows:
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# Request header, shared by all requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}

# Poetry URLs to crawl
urls = ['/gushi/',
        '/gushi/',
        '/gushi/',
        '/gushi/'
        ]

# Links to the individual poem pages
poem_links = []
for url in urls:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, "lxml")
    content = soup.find_all('div', class_="sons")[0]
    links = content.find_all('a')
    for link in links:
        poem_links.append(''+link['href'])

poem_list = []

# Crawl the poetry page
def get_poem(url):
    #url = '/shiwenv_45c396367f59.aspx'
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, "lxml")
    poem = soup.find('div', class_='contson').text.strip()
    poem = poem.replace(' ', '')
    # Remove annotations in half-width, full-width and mixed parentheses
    poem = re.sub(re.compile(r"\([\s\S]*?\)"), '', poem)
    poem = re.sub(re.compile(r"（[\s\S]*?）"), '', poem)
    poem = re.sub(re.compile(r"。\([\s\S]*?）"), '', poem)
    # Normalize half-width ! and ? to their full-width forms
    poem = poem.replace('!', '！').replace('?', '？')
    poem_list.append(poem)

# Utilize concurrent crawling
executor = ThreadPoolExecutor(max_workers=10)  # You can adjust max_workers, the number of threads, yourself
# submit()'s arguments: the first is the function, followed by the arguments passed to it (more than one is allowed)
future_tasks = [executor.submit(get_poem, url) for url in poem_links]
# Wait for all threads to complete before continuing
wait(future_tasks, return_when=ALL_COMPLETED)

# Write the crawled poems to a txt file, deduplicated and sorted by length
poems = list(set(poem_list))
poems = sorted(poems, key=lambda x:len(x))
for poem in poems:
    poem = poem.replace('《','').replace('》','') \
               .replace(':', '').replace('“', '')
    print(poem)
    with open('F://', 'a') as f:
        f.write(poem)
        f.write('\n')
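The concurrent pattern can also be looked at in isolation. The sketch below is a variant, not the code used in the project: it returns the results from the workers instead of appending to a shared list, and it uses a stand-in function `fetch_poem` (hypothetical, no network access) with placeholder URLs so that it runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_poem(url):
    """Stand-in for a real page download (hypothetical): returns fake text."""
    return "poem text from " + url


def crawl_all(urls, max_workers=10):
    """Run fetch_poem over all URLs in a thread pool and collect the results.

    executor.map yields results in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_poem, urls))


sample_urls = ["/page/%d" % i for i in range(5)]  # placeholder URLs
results = crawl_all(sample_urls)
print(len(results))
```

One difference from submitting tasks and only waiting on them: with `executor.map`, an exception raised inside a worker surfaces when the results are read, so a failed download is harder to miss.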
The crawler collects more than 1,100 poems and saves them to a file, which forms our poetry corpus. These poems cannot be used directly, however; the data needs cleaning first. For example, some poems have non-standard punctuation, and some entries are not poems at all but only the preface to a poem. This cleaning has to be done by hand; it is somewhat tedious, but it is worth it for the quality of the verse splitting later.
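Manual cleaning can be made a little easier by flagging suspicious lines first. The following is only a sketch of such a pre-filter; the function name `needs_review`, the length threshold, and the set of closing punctuation marks are all assumptions made for this illustration, and the flagged lines still need a human decision.

```python
import re

# Assumed set of punctuation marks that may close a poem line in the corpus.
TERMINATORS = "。？！"


def needs_review(line, min_length=12):
    """Return a list of reasons why a corpus line looks suspicious.

    An empty list means the line passed all of the simple checks.
    """
    reasons = []
    text = line.strip()
    if len(text) < min_length:
        reasons.append("too short")
    if text and text[-1] not in TERMINATORS:
        reasons.append("no closing punctuation")
    if re.search(r"[A-Za-z0-9]", text):
        reasons.append("contains Latin letters or digits")
    return reasons


print(needs_review("床前明月光，疑是地上霜。举头望明月，低头思故乡。"))  # []
print(needs_review("序"))  # ['too short', 'no closing punctuation']
```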
Verse Subdivision
With a corpus of poems, we need to split each poem into verses, where a verse ends with 。, ？ or ！. This can be done with regular expressions. After that, we write the verses into a dictionary: the key is the pinyin of the first character of a verse, and the value is the list of verses that start with that pinyin. We then save the dictionary as a pickle file. The complete Python code is as follows:
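import re
import pickle
from xpinyin import Pinyin
from collections import defaultdict

def main():
    # Read the corpus, one poem per line
    with open('F://', 'r') as f:
        poems = f.readlines()

    # Split each poem into verses ending with 。 ？ or ！ (half-width ? and ! are accepted too)
    sents = []
    for poem in poems:
        parts = re.findall(r'[\s\S]*?[。？！?!]', poem.strip())
        for part in parts:
            # Skip fragments that are too short to be a verse
            if len(part) >= 5:
                sents.append(part)

    # Group the verses by the toned pinyin of their first character
    pinyin = Pinyin()
    poem_dict = defaultdict(list)
    for sent in sents:
        print(sent)
        head = pinyin.get_pinyin(sent, tone_marks='marks', splitter=' ').split()[0]
        poem_dict[head].append(sent)

    # Save the dictionary as a pickle file
    with open('./', 'wb') as f:
        pickle.dump(poem_dict, f)

main()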
We can look at the contents of that pickle file ():
Of course, one pinyin can correspond to more than one verse.
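The shape of the dictionary can be seen on a toy example. This is a sketch under stated assumptions: `FIRST_CHAR_PINYIN` is a hypothetical hand-written lookup that stands in for the xpinyin call, the three sample poems are typed in directly, and the pickle round trip is done in memory instead of through a file.

```python
import pickle
import re
from collections import defaultdict

# Hypothetical first-character pinyin table; the project uses xpinyin instead.
FIRST_CHAR_PINYIN = {"床": "chuáng", "举": "jǔ", "春": "chūn"}


def split_verses(poem, min_length=5):
    """Split a poem into verses that end with 。 ？ or ！."""
    parts = re.findall(r"[\s\S]*?[。？！]", poem.strip())
    return [part for part in parts if len(part) >= min_length]


def build_index(poems):
    """Map the toned pinyin of each verse's first character to its verses."""
    index = defaultdict(list)
    for poem in poems:
        for verse in split_verses(poem):
            key = FIRST_CHAR_PINYIN.get(verse[0])
            if key is not None:
                index[key].append(verse)
    return index


poems = [
    "床前明月光，疑是地上霜。举头望明月，低头思故乡。",
    "春眠不觉晓，处处闻啼鸟。",
    "春种一粒粟，秋收万颗子。",
]
index = build_index(poems)

# Round trip through pickle in memory, as a stand-in for saving to a file
restored = pickle.loads(pickle.dumps(index))

for key, verses in restored.items():
    print(key, verses)
```

Here the key `chūn` holds two verses, which is the case where the game later picks one of them at random.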
Poetry Solitaire
Finally, we read the pickle file, write the game program, and package it as an exe file.
For the exe file to build and run without errors, we need to rewrite the __init__.py file of the xpinyin module: copy all the code from that file into our own module (imported as mypinyin in the program below) and change the following line in the code
data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), '')
to
data_path = os.path.join(os.getcwd(), '')
This completes our modified pinyin module.
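The difference between the two lookups can be illustrated with a short sketch. The helper names `package_relative` and `cwd_relative`, the file name `data.dat`, and the module path are placeholders for illustration only. A likely reason for the change is that, in a one-file PyInstaller build, `__file__` points into a temporary extraction folder, whereas a path based on the working directory lets the data file simply sit in the folder the exe is run from.

```python
import os


def package_relative(module_file, filename):
    """Path next to a module's source file (the style of the original lookup)."""
    return os.path.join(os.path.dirname(os.path.abspath(module_file)), filename)


def cwd_relative(filename):
    """Path in the current working directory (the style of the rewritten lookup)."""
    return os.path.join(os.getcwd(), filename)


# 'data.dat' and the module path below are placeholder names.
module_path = os.path.join("site-packages", "xpinyin", "__init__.py")
print(package_relative(module_path, "data.dat"))
print(cwd_relative("data.dat"))
```

The trade-off is that the working-directory version only finds the data file when the program is started from the folder that contains it.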
Next, we need to write the code for Poetry Solitaire (Poem_Jielong.py). The complete code is as follows:
import pickle
from mypinyin import Pinyin
import random
import ctypes

STD_INPUT_HANDLE = -10
STD_OUTPUT_HANDLE = -11
STD_ERROR_HANDLE = -12

FOREGROUND_DARKWHITE = 0x07  # Dark white
FOREGROUND_BLUE = 0x09  # Blue
FOREGROUND_GREEN = 0x0a  # Green
FOREGROUND_SKYBLUE = 0x0b  # Sky blue
FOREGROUND_RED = 0x0c  # Red
FOREGROUND_PINK = 0x0d  # Pink
FOREGROUND_YELLOW = 0x0e  # Yellow
FOREGROUND_WHITE = 0x0f  # White

# ctypes.windll is only available on Windows
std_out_handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)

# Set CMD text color
def set_cmd_text_color(color, handle=std_out_handle):
    Bool = ctypes.windll.kernel32.SetConsoleTextAttribute(handle, color)
    return Bool

# Reset text color to dark white
def resetColor():
    set_cmd_text_color(FOREGROUND_DARKWHITE)

# Output text in CMD in the specified color
def cprint(mess, color):
    color_dict = {
                  'Blue': FOREGROUND_BLUE,
                  'Green': FOREGROUND_GREEN,
                  'Sky blue': FOREGROUND_SKYBLUE,
                  'Red': FOREGROUND_RED,
                  'Pink': FOREGROUND_PINK,
                  'Yellow': FOREGROUND_YELLOW,
                  'White': FOREGROUND_WHITE
                 }
    set_cmd_text_color(color_dict[color])
    print(mess)
    resetColor()

color_list = ['Blue','Green','Sky blue','Red','Pink','Yellow','White']

# Get the dictionary
with open('./', 'rb') as f:
    poem_dict = pickle.load(f)

#for key, value in poem_dict.items():
    #print(key, value)

MODE = str(input('Choose MODE(1 for manual solitaire, 2 for machine solitaire): '))

while True:
    try:
        if MODE == '1':
            # Manual mode: the user and the machine take turns
            enter = str(input('\n Please enter a poem or a word to start with:'))
            while enter != 'exit':
                test = Pinyin().get_pinyin(enter, tone_marks='marks', splitter=' ')
                tail = test.split()[-1]
                if tail not in poem_dict.keys():
                    cprint("Couldn't pick up the verse.\n", 'Red')
                    MODE = 0
                    break
                else:
                    cprint('\n machine reply: %s' % random.sample(poem_dict[tail], 1)[0], random.sample(color_list, 1)[0])
                    reply = str(input('Your response:'))
                    # Drop the trailing punctuation mark, but keep 'exit' intact so the loop can end
                    enter = reply if reply == 'exit' else reply[:-1]

            MODE = 0

        if MODE == '2':
            # Machine mode: the machine keeps chaining by itself, up to 10 verses
            enter = input('\n Please enter a poem or a word to start with:')

            for i in range(10):
                test = Pinyin().get_pinyin(enter, tone_marks='marks', splitter=' ')
                tail = test.split()[-1]
                if tail not in poem_dict.keys():
                    cprint("------>Can't take it anymore...", 'Red')
                    MODE = 0
                    break
                else:
                    answer = random.sample(poem_dict[tail], 1)[0]
                    cprint('(%d)--> %s' % (i+1, answer), random.sample(color_list, 1)[0])
                    # Drop the trailing punctuation mark before the next lookup
                    enter = answer[:-1]

            print('\n (***** shows up to the first 10 solitaires. *****)')
            MODE = 0

    except Exception as err:
        print(err)
    finally:
        if MODE not in ['1','2']:
            MODE = str(input('\nChoose MODE(1 for manual solitaire, 2 for machine solitaire): '))
The entire project is now structured as follows (the data file is copied from the folder of the xpinyin module):
Switch to that folder and enter the following command to generate the exe file:
pyinstaller -F Poem_jielong.py
There are two modes of poetry solitaire in this project. One is manual solitaire: you enter a verse or a word first, then the computer and you take turns replying with a verse, following the rules of poetry solitaire. The other is machine solitaire: you enter a verse or a word first, and the machine automatically outputs the verses that follow (up to 10).
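The machine mode boils down to a short loop, sketched below without console colors or a pinyin library. Everything here is toy data: `TOY_PINYIN`, `TOY_INDEX`, `toy_tail_pinyin` and `auto_chain` are hypothetical names for this illustration, and the real program obtains the pinyin from its pinyin module and the index from the pickle file.

```python
import random

PUNCTUATION = "，。？！、；："

# Hypothetical toned pinyin for the last characters used in this toy example
TOY_PINYIN = {"霜": "shuāng", "雌": "cí", "衣": "yī", "家": "jiā"}

# Hypothetical index: toned pinyin of the first character -> verses
TOY_INDEX = {
    "shuāng": ["双兔傍地走，安能辨我是雄雌？"],
    "cí": ["慈母手中线，游子身上衣。"],
    "yī": ["一去二三里，烟村四五家。"],
}


def toy_tail_pinyin(verse):
    """Toned pinyin of the last non-punctuation character, or None if unknown."""
    stripped = verse.rstrip(PUNCTUATION)
    return TOY_PINYIN.get(stripped[-1]) if stripped else None


def auto_chain(start, index, tail_pinyin, max_steps=10, rng=random):
    """Chain verses automatically until no verse matches or max_steps is reached."""
    chain = []
    current = start
    for _ in range(max_steps):
        candidates = index.get(tail_pinyin(current))
        if not candidates:
            break  # nothing starts with the required pinyin
        current = rng.choice(candidates)
        chain.append(current)
    return chain


for step, verse in enumerate(auto_chain("疑是地上霜。", TOY_INDEX, toy_tail_pinyin), 1):
    print("(%d)--> %s" % (step, verse))
```

The manual mode follows the same lookup, except that after each machine reply the next verse comes from the user's input instead of from the index.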
Test the manual solitaire mode first:
Then test the machine solitaire mode:
Summary
The GitHub address for this project is:/percent4/Shicijielong