Python Crawling Idioms Solitaire Class Website

Introduction

In this article, we will show how to implement poetry solitaire using a Python crawler.

The idea of the project is as follows:

Crawl poems with a crawler to build a poetry corpus;
Split each poem into clauses and build a dictionary: the key is the pinyin of the first character of a clause, and the value is the list of clauses that start with that pinyin; save the dictionary as a pickle file;
Read the pickle file and write the solitaire program, then package the program as an exe file.

This project implements poetry solitaire under the rule that the pinyin (including the tone) of the first character of the next line must match that of the last character of the previous line. The steps are described one by one below.
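The matching rule can be sketched as follows. This is only a minimal illustration, not the project's code: the `TONED_PINYIN` table is a hypothetical hand-written lookup that stands in for the xpinyin library used later, and `can_follow` is an illustrative helper name.

```python
# Hypothetical toy lookup: character -> pinyin with tone mark.
# The real project obtains this from the xpinyin library instead.
TONED_PINYIN = {
    "霜": "shuāng",
    "双": "shuāng",
    "爽": "shuǎng",
}

PUNCTUATION = "，。？！"

def can_follow(previous: str, candidate: str) -> bool:
    """Return True if candidate may follow previous under the solitaire rule."""
    # Ignore trailing punctuation when picking the last character.
    last_char = previous.rstrip(PUNCTUATION)[-1]
    first_char = candidate[0]
    # Both the syllable and the tone must match.
    return TONED_PINYIN[last_char] == TONED_PINYIN[first_char]

print(can_follow("疑是地上霜。", "双飞燕子几时回？"))  # same syllable, same tone
print(can_follow("疑是地上霜。", "爽快"))              # same syllable, different tone
```

Because the tone is part of the key, 霜 (shuāng) and 爽 (shuǎng) do not match even though they share a syllable.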

Poetry Corpus

First, we utilize a Python crawler to crawl the poems and create a corpus. The URL for crawling is:, the page is as follows:

Since this article mainly aims to show the idea of the project, only the pages for 300 Tang Poems, 300 Ancient Poems, 300 Song Lyrics, and Selected Song Lyrics were crawled, about 1,100 poems in total. To speed things up, the pages are fetched concurrently and the results are saved to a file. The complete Python program is as follows:

import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# Poetry URLs to crawl
urls = ['/gushi/',
  '/gushi/',
  '/gushi/',
  '/gushi/'
  ]

poem_links = []
# Poetry Web site
for url in urls:
    # Request header
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    req = requests.get(url, headers=headers)

    # Collect the links to the individual poem pages
    soup = BeautifulSoup(req.text, "lxml")
    content = soup.find_all('div', class_="sons")[0]
    links = content.find_all('a')

    for link in links:
        poem_links.append('' + link['href'])

poem_list = []
# Crawl the poetry page
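def get_poem(url):
    #url = '/shiwenv_45c396367f59.aspx'
    # Request header
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, "lxml")
    # Extract the poem text and remove spaces
    poem = soup.find('div', class_='contson').text.strip()
    poem = poem.replace(' ', '')
    # Remove parenthesized annotations (half-width and full-width brackets)
    poem = re.sub(re.compile(r"\([\s\S]*?\)"), '', poem)
    poem = re.sub(re.compile(r"（[\s\S]*?）"), '', poem)
    poem = re.sub(re.compile(r"。\([\s\S]*?）"), '', poem)
    # Normalize half-width punctuation to full-width
    poem = poem.replace('!', '！').replace('?', '？')
    poem_list.append(poem)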

# Utilize concurrent crawling
executor = ThreadPoolExecutor(max_workers=10) # You can adjust max_workers, the number of threads, by yourself.
# submit()'s arguments: the first is the function, then the incoming arguments to the function, more than one is allowed
future_tasks = [executor.submit(get_poem, url) for url in poem_links]
# Wait for all threads to complete before proceeding to subsequent executions
wait(future_tasks, return_when=ALL_COMPLETED)

# Write the crawled verses to a txt file
poems = list(set(poem_list))
poems = sorted(poems, key=lambda x:len(x))
for poem in poems:
    poem = poem.replace('《','').replace('》','') \
               .replace('：', '').replace('“', '')
    print(poem)
    with open('F://', 'a') as f:
        f.write(poem)
        f.write('\n')
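As a variation on the concurrent step, each worker can return its result instead of appending to a shared list, and `executor.map` collects the results in input order. The sketch below is only illustrative: `fetch_poem` is a hypothetical placeholder that does no network access, and the URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_poem(url: str) -> str:
    """Hypothetical placeholder for a real download; returns fake poem text."""
    return "poem from " + url

def crawl_all(urls, max_workers=10):
    """Run fetch_poem over all URLs in a thread pool, keeping input order."""
    # The with-block waits for all submitted work before leaving.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_poem, urls))

results = crawl_all(["/page-1", "/page-2", "/page-3"])
print(results)
```

Returning values from the workers avoids relying on a module-level list, which makes the function easier to test in isolation.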

The crawler collects more than 1,100 poems and saves them to a file, which forms our poetry corpus. These poems cannot be used directly, however: the data needs cleaning, because some poems have nonstandard punctuation and some entries are not poems at all but only the preface to a poem. This cleanup has to be done by hand. It is somewhat tedious, but it pays off in the quality of the clause splitting later on.
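Part of this review can be narrowed down automatically. The sketch below is an assumption about what a usable line looks like, not a rule from the project: it accepts only lines made of Chinese characters and a few full-width punctuation marks that end with 。, ？ or ！, and sets everything else aside for manual inspection. The names `VALID_LINE` and `split_clean_and_suspect` are illustrative.

```python
import re

# Assumption: a usable corpus line contains only Chinese characters and the
# punctuation marks ，。？！；、 and ends with 。, ？ or ！.
VALID_LINE = re.compile(r"^[\u4e00-\u9fff，。？！；、]+[。？！]$")

def split_clean_and_suspect(lines):
    """Separate lines that look like poems from lines needing manual review."""
    clean, suspect = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if VALID_LINE.match(line):
            clean.append(line)
        else:
            suspect.append(line)
    return clean, suspect

sample = [
    "床前明月光，疑是地上霜。举头望明月，低头思故乡。",
    "序：此诗作于某年",            # a preface, not a poem
    "白日依山尽,黄河入海流",        # half-width comma, no final mark
]
clean, suspect = split_clean_and_suspect(sample)
print(len(clean), len(suspect))
```

The suspect list still has to be checked by hand; the filter only reduces how many lines need a look.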

Verse Subdivision

With the corpus in place, we need to split each poem into clauses, where a clause is a run of text ending with 。, ？ or ！. This can be done with a regular expression. After that, we store the clauses in a dictionary: the key is the pinyin of the first character of a clause, and the value is the list of clauses that start with that pinyin. Finally, we save the dictionary as a pickle file. The complete Python code is as follows:

import re
import pickle
from xpinyin import Pinyin
from collections import defaultdict

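def main():
    # Read the corpus, one poem per line
    with open('F://', 'r') as f:
        poems = f.readlines()

    # Split each poem into clauses ending with 。, ？ or ！
    sents = []
    for poem in poems:
        parts = re.findall(r'[\s\S]*?[。？！]', poem.strip())
        for part in parts:
            if len(part) >= 5:
                sents.append(part)

    # Group the clauses by the toned pinyin of their first character
    poem_dict = defaultdict(list)
    for sent in sents:
        print(sent)
        head = Pinyin().get_pinyin(sent, tone_marks='marks', splitter=' ').split()[0]
        poem_dict[head].append(sent)

    # Save the dictionary as a pickle file
    with open('./', 'wb') as f:
        pickle.dump(poem_dict, f)

main()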

We can look at the contents of that pickle file ():

Of course, one pinyin can correspond to more than one poem.
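The splitting and grouping step can be tried on a few lines without any corpus file. In this sketch the regular expression is the one used in the program above, while `FIRST_CHAR_PINYIN` is a hypothetical hand-written table that replaces the xpinyin lookup so the example runs on its own.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for xpinyin: first character -> toned pinyin.
FIRST_CHAR_PINYIN = {"床": "chuáng", "举": "jǔ", "春": "chūn", "夜": "yè"}

def split_clauses(poem):
    """Split a poem into clauses ending with 。, ？ or ！."""
    return re.findall(r'[\s\S]*?[。？！]', poem.strip())

poems = [
    "床前明月光，疑是地上霜。举头望明月，低头思故乡。",
    "春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。",
    "春风又绿江南岸，明月何时照我还？",
]

poem_dict = defaultdict(list)
for poem in poems:
    for clause in split_clauses(poem):
        # Key each clause by the toned pinyin of its first character.
        poem_dict[FIRST_CHAR_PINYIN[clause[0]]].append(clause)

for key, clauses in poem_dict.items():
    print(key, clauses)
```

Here the key chūn ends up with two clauses, which is the one-to-many case mentioned above.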

Poetry Solitaire

In this step we read the pickle file, write the solitaire program, and package the program as an exe file.

To build the exe file without errors, we need to rewrite the __init__.py file of the xpinyin module: copy all the code from that file into a new module (imported below as mypinyin) and change the following line

data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
        '')

to

data_path = os.path.join(os.getcwd(), '')

This completes the modified module.

Next, we write the code for poetry solitaire (Poem_Jielong.py). The complete code is as follows:

import pickle
from mypinyin import Pinyin
import random
import ctypes

STD_INPUT_HANDLE = -10
STD_OUTPUT_HANDLE = -11
STD_ERROR_HANDLE = -12

FOREGROUND_DARKWHITE = 0x07 # Dark white
FOREGROUND_BLUE = 0x09 # Blue
FOREGROUND_GREEN = 0x0a # Green
FOREGROUND_SKYBLUE = 0x0b # Sky blue
FOREGROUND_RED = 0x0c # Red
FOREGROUND_PINK = 0x0d # Pink
FOREGROUND_YELLOW = 0x0e # Yellow
FOREGROUND_WHITE = 0x0f # White

std_out_handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)

# Set CMD text color
def set_cmd_text_color(color, handle=std_out_handle):
    Bool = ctypes.windll.kernel32.SetConsoleTextAttribute(handle, color)
    return Bool

# Reset text color to dark white
def resetColor():
 set_cmd_text_color(FOREGROUND_DARKWHITE)

# Output text in CMD in specified colors
def cprint(mess, color):
 color_dict = {
     'Blue': FOREGROUND_BLUE,
     'Green': FOREGROUND_GREEN,
     'Sky blue': FOREGROUND_SKYBLUE,
     'Red': FOREGROUND_RED,
     'Pink': FOREGROUND_PINK,
     'Yellow': FOREGROUND_YELLOW,
     'White': FOREGROUND_WHITE
     }
 set_cmd_text_color(color_dict[color])
 print(mess)
 resetColor()

color_list = ['Blue','Green','Sky blue','Red','Pink','Yellow','White']

# Get the dictionary
with open('./', 'rb') as f:
    poem_dict = pickle.load(f)

#for key, value in poem_dict.items():
 #print(key, value)

MODE = str(input('Choose MODE (1 for manual solitaire, 2 for machine solitaire): '))

while True:
    try:
        # Mode 1: the user and the machine take turns
        if MODE == '1':
            enter = str(input('\n Please enter a poem or a word to start with:'))
            while enter != 'exit':
                test = Pinyin().get_pinyin(enter, tone_marks='marks', splitter=' ')
                tail = test.split()[-1]
                if tail not in poem_dict.keys():
                    cprint("Couldn't pick up the verse. \n", 'Red')
                    MODE = 0
                    break
                else:
                    cprint('\n machine reply: %s' % random.sample(poem_dict[tail], 1)[0], random.sample(color_list, 1)[0])
                    # Drop the last character (the punctuation mark) of the reply
                    enter = str(input('Your response:'))[:-1]

            MODE = 0

        # Mode 2: the machine continues the chain by itself, up to 10 times
        if MODE == '2':
            enter = input('\n Please enter a poem or a word to start with:')

            for i in range(10):
                test = Pinyin().get_pinyin(enter, tone_marks='marks', splitter=' ')
                tail = test.split()[-1]
                if tail not in poem_dict.keys():
                    cprint("------>Can't take it anymore...", 'Red')
                    MODE = 0
                    break
                else:
                    answer = random.sample(poem_dict[tail], 1)[0]
                    cprint('（%d）--> %s' % (i+1, answer), random.sample(color_list, 1)[0])
                    enter = answer[:-1]

            print('\n (***** shows up to the first 10 solitaires. *****)')
            MODE = 0

    except Exception as err:
        print(err)
    finally:
        # Ask for the mode again once a round is over
        if MODE not in ['1','2']:
            MODE = str(input('\nChoose MODE (1 for manual solitaire, 2 for machine solitaire): '))

The entire project is now structured as follows (files copied from the folder corresponding to the xpinyin module):

Switch to that folder and enter the following command to generate the exe file:

pyinstaller -F Poem_jielong.py

This project offers two modes of poetry solitaire. The first is manual solitaire: you enter a line or a word, the computer replies with a line, you reply with a line, and so on, with every reply following the solitaire rule. The second is machine solitaire: you enter a line or a word, and the machine then outputs the following solitaire verses by itself (up to 10).
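The machine mode boils down to a short loop, sketched here with toy data. Everything in the tables is a hypothetical stand-in: `POEM_DICT` plays the role of the pickled dictionary, and `LAST_CHAR_PINYIN` replaces the xpinyin lookup. Unlike the program above, the sketch strips trailing punctuation with `rstrip` instead of cutting the last character.

```python
import random

# Hypothetical toy data: toned pinyin of the first character -> clauses.
POEM_DICT = {
    "shuāng": ["双飞燕子几时回？"],
    "huí": ["回首向来萧瑟处，归去，也无风雨也无晴。"],
    "qíng": ["晴川历历汉阳树，芳草萋萋鹦鹉洲。"],
}
# Hypothetical stand-in for xpinyin: last character -> toned pinyin.
LAST_CHAR_PINYIN = {"霜": "shuāng", "回": "huí", "晴": "qíng", "洲": "zhōu"}

def machine_chain(start, poem_dict, max_rounds=10, rng=random):
    """Extend the chain from start until no clause fits or max_rounds is hit."""
    chain = []
    current = start
    for _ in range(max_rounds):
        last_char = current.rstrip("，。？！")[-1]
        tail = LAST_CHAR_PINYIN.get(last_char)
        if tail not in poem_dict:
            break  # nothing starts with this pinyin
        current = rng.choice(poem_dict[tail])
        chain.append(current)
    return chain

chain = machine_chain("疑是地上霜。", POEM_DICT)
for i, clause in enumerate(chain, start=1):
    print("(%d) --> %s" % (i, clause))
```

With this toy data the chain stops after three clauses because no entry starts with zhōu; with a full corpus, `max_rounds` is what usually ends the loop.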

Test the manual solitaire mode first:

Test the machine solitaire mode again:

Summary

The GitHub address for this project is:/percent4/Shicijielong