This article introduces how to use BeautifulSoup in Python to crawl web page information. The sample code is explained in detail and should be a useful reference for study or work.
The general approach to crawling a web page for information is:
1. View the web page source code
2. Capture the web page information
3. Parse the web page content
4. Store the results in a file
Here, the BeautifulSoup parsing library is used to crawl the salaries of Python jobs on the Hedgehog Internship site.
I. Viewing the web page source code
The part of the page we need, and its corresponding source code, are shown below.
Analyzing the source code shows that:
1. The job information list is inside <section class="widget-job-list">.
2. Each job entry is inside <article class="widget item">.
3. For each entry, we need to extract the company name, position, and salary (see the sketch after this list).
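To make this structure concrete, here is a minimal sketch in Python. The HTML fragment is illustrative only, reconstructed from the class names listed above rather than copied from the site:

from bs4 import BeautifulSoup

# Illustrative fragment based on the class names described above
html = """
<section class="widget-job-list">
  <article class="widget item">
    <a class="crop">Some Company</a>
    <code>Python Intern</code>
    <span class="color-3">200/day</span>
  </article>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find('section', attrs={"class": "widget-job-list"})
for job in section.find_all('article', attrs={"class": "widget item"}):
    print(job.find('a', attrs={"class": "crop"}).get_text().strip(),
          job.find('code').get_text(),
          job.find('span', attrs={"class": "color-3"}).get_text())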
II. Crawling the web page information
Use requests.get() to fetch the page; the returned soup holds the parsed text of the page.

def get_one_page(url):
    response = requests.get(url)
    # The parser argument was omitted in the source; "html.parser" is assumed here
    soup = BeautifulSoup(response.text, "html.parser")
    return soup
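As a quick sanity check, the returned soup can be inspected before writing the parser. The URL below is a placeholder, since the site domain is not shown in this article:

# Placeholder URL: the real domain is omitted in the source
url = "https://example.com/search?key=python&page=1"
soup = get_one_page(url)
# Print the page title to confirm the page was fetched and parsed
print(soup.title.get_text() if soup.title else "no <title> found")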
III. Parsing the web page content
1. Find the starting tag <section class="widget-job-list">.
2. Match the information inside each <article> tag.
3. Return the list of information for storage.
def parse_page(soup):
    # List to store the extracted information
    return_list = []
    # Locate the starting position
    grid = soup.find('section', attrs={"class": "widget-job-list"})
    if grid:
        # Find all job listings
        job_list = soup.find_all('article', attrs={"class": "widget item"})
        # Extract the required contents from each listing
        for job in job_list:
            # find() returns the first tag that matches
            company = job.find('a', attrs={"class": "crop"}).get_text().strip()  # get_text() returns a string; strip() removes whitespace and newlines
            title = job.find('code').get_text()
            salary = job.find('span', attrs={"class": "color-3"}).get_text()
            # Save the information to the list to be returned
            return_list.append(company + " " + title + " " + salary)
    return return_list
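One thing to note: find() returns None when no matching tag exists, so calling get_text() on a missing field raises an AttributeError. A minimal defensive sketch, using a hypothetical safe_text() helper, could look like this:

def safe_text(tag):
    # Return the stripped text of a tag, or an empty string if the tag is missing
    return tag.get_text().strip() if tag else ""

# Inside the loop over job_list, the fields could then be read as:
#     company = safe_text(job.find('a', attrs={"class": "crop"}))
#     title = safe_text(job.find('code'))
#     salary = safe_text(job.find('span', attrs={"class": "color-3"}))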
IV. Storing to a file
Store the list of information in a file.
def write_to_file(content):
    # Open in append mode and set the encoding to prevent garbled characters
    # The filename was omitted in the source; "jobs.txt" is a placeholder
    with open("jobs.txt", "a", encoding="gb18030") as f:
        f.write("\n".join(content))
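A quick usage sketch, assuming the placeholder filename "jobs.txt" used above; the sample lines are made up for illustration:

# Append two sample lines, then read them back with the same encoding
write_to_file(["Company A Python Intern 200/day",
               "Company B Python Intern 150/day"])
with open("jobs.txt", "r", encoding="gb18030") as f:
    print(f.read())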
V. Crawling multiple pages of information
In the page URL, you can see that the last part represents the page number.
So we pass a page number into the main method and call main(page) in a loop to crawl multiple pages of information.
def main(page):
    # The site domain is omitted in the source; only the search path remains
    url = '/search?key=python&page=' + str(page)
    soup = get_one_page(url)
    return_list = parse_page(soup)
    write_to_file(return_list)


if __name__ == "__main__":
    for i in range(4):
        main(i)
VI. Running results
VII. Complete code
import requests
import re
from bs4 import BeautifulSoup


def get_one_page(url):
    response = requests.get(url)
    # The parser argument was omitted in the source; "html.parser" is assumed here
    soup = BeautifulSoup(response.text, "html.parser")
    return soup


def parse_page(soup):
    # List to store the extracted information
    return_list = []
    # Locate the starting position
    grid = soup.find('section', attrs={"class": "widget-job-list"})
    if grid:
        # Find all job listings
        job_list = soup.find_all('article', attrs={"class": "widget item"})
        # Extract the required contents from each listing
        for job in job_list:
            # find() returns the first tag that matches
            company = job.find('a', attrs={"class": "crop"}).get_text().strip()  # get_text() returns a string; strip() removes whitespace and newlines
            title = job.find('code').get_text()
            salary = job.find('span', attrs={"class": "color-3"}).get_text()
            # Save the information to the list to be returned
            return_list.append(company + " " + title + " " + salary)
    return return_list


def write_to_file(content):
    # Open in append mode and set the encoding to prevent garbled characters
    # The filename was omitted in the source; "jobs.txt" is a placeholder
    with open("jobs.txt", "a", encoding="gb18030") as f:
        f.write("\n".join(content))


def main(page):
    # The site domain is omitted in the source; only the search path remains
    url = '/search?key=python&page=' + str(page)
    soup = get_one_page(url)
    return_list = parse_page(soup)
    write_to_file(return_list)


if __name__ == "__main__":
    for i in range(4):
        main(i)
This is the whole content of this article.