Recently, while crawling a web page, I found that requests could not retrieve all of the page's content, so I used Selenium to simulate a browser opening the page, grabbed the page source, and then parsed it with BeautifulSoup to extract the example sentences. To keep the loop going, refresh() is called inside the loop body, so that each time the browser is given a new URL the page content is updated by refreshing. Note that, to capture the page content more reliably, the script waits 2 seconds after each refresh, which reduces the chance of missing the content. To lower the risk of being blocked, a Chrome User-Agent header is also set. The code is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time, re

# Location of the chromedriver executable (the filename was elided in the original; adjust to your setup)
path = Service("D:\\MyDrivers\\chromedriver.exe")
# Configure the browser to run headless (no visible window)
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')
# Create a Chrome instance.
driver = webdriver.Chrome(service=path, options=chrome_options)

lst = ["happy", "help", "evening", "great", "think", "adapt"]
for word in lst:
    # The site's domain was elided in the original post; prepend the full base URL here
    url = "/#result?lang=en&query=" + word + "&f=concordance"
    driver.get(url)
    # Refresh the page so the new query's results actually load
    driver.refresh()
    time.sleep(2)
    # page_source -> get the page source code
    resp = driver.page_source
    # Parse the source code
    soup = BeautifulSoup(resp, "html.parser")
    table = soup.find_all("td")
    # The output filename was elided in the original; "sentences.txt" is a placeholder
    with open("sentences.txt", 'a+', encoding='utf-8') as f:
        f.write(f"\n{word} examples\n")
    for i in table[0:6]:
        text = i.text
        # Replace runs of whitespace with single spaces
        new = re.sub(r"\s+", " ", text)
        # Write to the txt file, putting each numbered sentence on its own line
        with open("sentences.txt", 'a+', encoding='utf-8') as f:
            f.write(re.sub(r"^(\d+\.)", r"\n\1", new))
driver.quit()
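The parsing step can be checked without launching a browser at all. The article uses BeautifulSoup; as a dependency-free sketch of the same idea, the standard library's html.parser can pull the text of each td cell out of a saved page source. The HTML snippet below is made up purely for illustration:

```python
from html.parser import HTMLParser

class TdTextExtractor(HTMLParser):
    """Collect the text content of every <td> cell, mimicking soup.find_all('td')."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")  # start a new cell buffer

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data  # accumulate text inside the current cell

# A tiny made-up table standing in for the real page source
html = "<table><tr><td>1.  This happy\n mood lasted.</td><td>2. Another  cell.</td></tr></table>"
parser = TdTextExtractor()
parser.feed(html)
print(parser.cells)  # raw cell text; whitespace is not yet normalized
```

This only extracts raw cell text; the whitespace cleanup with re.sub still runs afterwards, exactly as in the main script.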
1. To speed up access, the browser is configured to run headless, so no window is displayed.
2. The extracted text is cleaned up with regular expressions (the re module).
3. table[0:6] takes the first three example sentences for each word; the final result is as follows.
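The whitespace cleanup mentioned in point 2 can be tried on its own. A minimal sketch, using a made-up raw string in place of real scraped cell text:

```python
import re

# Made-up example of messy cell text as it might come out of the page source
raw = "  1.  This happy\nmood   lasted roughly\tuntil last autumn. "

# Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
clean = re.sub(r"\s+", " ", raw).strip()
print(clean)  # -> "1. This happy mood lasted roughly until last autumn."

# Put each numbered sentence on its own line, as the script does before writing
numbered = re.sub(r"^(\d+\.)", r"\n\1", clean)
```

The second substitution only matches a leading "number dot" pattern, so sentences that do not start with a number are written through unchanged.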
happy examples
1. This happy mood lasted roughly until last autumn.
2. The lodging was neither convenient nor happy .
3. One big happy family "fighting communism".
help examples
1. Applying hot moist towels may help relieve discomfort.
2. The intense light helps reproduce colors more effectively.
3. My survival route are self help books.
evening examples
1. The evening feast costs another $10.
2. My evening hunt was pretty flat overall.
3. The area nightclubs were active during evenings .
great examples
1. The three countries represented here are three great democracies.
2. Our three different tour guides were great .
3. Your receptionist "crew" is great !
think examples
1. I said yes immediately without thinking everything through.
2. This book was shocking yet thought provoking.
3. He thought "disgusting" was more appropriate.
adapt examples
1. The novel has been adapted several times.
2. There are many ways plants can adapt .
3. They must adapt quickly to changing deadlines.
Supplement: after optimizing the code, the example sentences are crawled faster. The optimized version is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time, re
import os

# Location of the chromedriver executable (the filename was elided in the original; adjust to your setup)
path = Service("D:\\MyDrivers\\chromedriver.exe")
# Configure the browser to run headless
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')

def get_wordlist():
    # Read the words to query, one per line (the input filename was elided in the original;
    # "wordlist.txt" is a placeholder)
    wordlist = []
    with open("wordlist.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        word = line.strip()
        wordlist.append(word)
    return wordlist

def main(lst):
    # Create a single Chrome instance and reuse it for every word
    driver = webdriver.Chrome(service=path, options=chrome_options)
    for word in lst:
        # The site's domain was elided in the original post; prepend the full base URL here
        url = "/#result?lang=en&query=" + word + "&f=concordance"
        driver.get(url)
        driver.refresh()
        time.sleep(2)
        # page_source -> page source code
        resp = driver.page_source
        # Parse the source code
        soup = BeautifulSoup(resp, "html.parser")
        table = soup.find_all("td")
        # The output filename was elided in the original; "sentences.txt" is a placeholder
        with open("sentences.txt", 'a+', encoding='utf-8') as f:
            f.write(f"\n{word} examples\n")
        for i in table[0:6]:
            text = i.text
            new = re.sub(r"\s+", " ", text)
            with open("sentences.txt", 'a+', encoding='utf-8') as f:
                f.write(new)
                # f.write(re.sub(r"(\.\s)(\d+\.)", r"\1\n\2", new))
    driver.quit()

if __name__ == "__main__":
    lst = get_wordlist()
    main(lst)
    # Windows-only: open the results file when finished (the call's target was elided in the original)
    os.startfile("sentences.txt")
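One detail worth noting in both versions: the query URL is built by plain string concatenation, which breaks if a word contains a space or another character that is unsafe in a URL. A small sketch using the standard library's urllib.parse.quote; since the real site's domain is elided in the original post, a placeholder base URL stands in for it:

```python
from urllib.parse import quote

# Placeholder only: the real site's domain is elided in the original post
BASE = "https://example.invalid"

def build_query_url(word):
    # quote() percent-encodes spaces and other unsafe characters in the query value
    return BASE + "/#result?lang=en&query=" + quote(word) + "&f=concordance"

print(build_query_url("adapt"))    # a simple word passes through unchanged
print(build_query_url("give up"))  # the space becomes %20
```

Dropping this helper into the loop (url = build_query_url(word)) would make multi-word queries safe without changing anything else.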
Summary
This concludes this article on what to do when Python's requests cannot retrieve a page's full source code. For more on retrieving page source with requests, please search my earlier articles or browse the related articles below. I hope you will continue to support me!