
How to use BeautifulSoup in Python to crawl web page information

This article introduces how to use BeautifulSoup in Python to crawl web page information. The sample code is explained in detail and should be a useful reference for study or work.

The general approach to crawling a web page for information is:

1. View the web page source code

2. Fetch the web page information

3. Parse the web page content

4. Store the results to a file
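
Before walking through each step, here is a minimal sketch of how steps 2-4 fit together using requests and BeautifulSoup (step 1 is done by hand in the browser). The URL, tag selection, and file name below are placeholders, not the actual target site:

import requests
from bs4 import BeautifulSoup

# Step 2: fetch the page (placeholder URL)
html = requests.get("https://example.com/jobs").text

# Step 3: parse the page content (here we just grab all link texts as a stand-in)
soup = BeautifulSoup(html, "html.parser")
items = [tag.get_text(strip=True) for tag in soup.find_all("a")]

# Step 4: store the results to a file (placeholder file name)
with open("output.txt", "w", encoding="utf-8") as f:
  f.write("\n".join(items))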

Now let's use the BeautifulSoup parsing library to crawl the salaries of Python jobs on the Hedgehog Internship site.

I. Viewing the web page source code

This part of the page is what we need. Analyzing the corresponding source code, we can see the following:

1. The job information list is inside <section class="widget-job-list">.

2. Each piece of information is inside an <article class="widget item">.

3. From each piece of information we need to extract the company name, position, and salary.
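
Based on these observations, the markup has roughly the shape reconstructed below from the tag and class names used in the parsing code; the company, position, and salary text are made-up placeholders:

from bs4 import BeautifulSoup

# Rough shape of the markup, reconstructed from the class names used later;
# the text content is a placeholder
sample_html = """
<section class="widget-job-list">
  <article class="widget item">
    <a class="crop">Some Company</a>
    <code>Python Intern</code>
    <span class="color-3">150-200/day</span>
  </article>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")
article = soup.find("article", attrs={"class": "widget item"})
print(article.find("a", attrs={"class": "crop"}).get_text().strip())  # Some Company
print(article.find("code").get_text())                                # Python Intern
print(article.find("span", attrs={"class": "color-3"}).get_text())    # 150-200/day

The parsing code in section III simply walks this structure with find() and find_all().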

II. Crawling the web page information

Fetch the page with requests.get(); the returned soup is the parsed text of the page.

def get_one_page(url):
  response = requests.get(url)
  # Parse the response text; html.parser is Python's built-in HTML parser
  soup = BeautifulSoup(response.text, "html.parser")
  return soup
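
In practice the request can fail or return an error page. A minimal, hedged variant of get_one_page that adds a timeout and a status check (both are additions of mine, not part of the original code) might look like this:

import requests
from bs4 import BeautifulSoup

def get_one_page_safe(url):
  # Send the request with a timeout so a stalled connection doesn't hang the crawler
  response = requests.get(url, timeout=10)
  # Raise an exception for 4xx/5xx responses instead of parsing an error page
  response.raise_for_status()
  return BeautifulSoup(response.text, "html.parser")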

III. Parsing the web page content

1. Find the starting <section> tag.

2. Match the information inside each <article>.

3. Return the list of information for storage.

def parse_page(soup):
  # List of information to be stored
  return_list = []
  # Starting position
  grid = soup.find('section', attrs={"class": "widget-job-list"})
  if grid:
    # Find all job listings
    job_list = soup.find_all('article', attrs={"class": "widget item"})

    # Match the contents
    for job in job_list:
      # find() returns the first tag that matches
      # get_text() returns a string; strip() removes surrounding whitespace and newlines
      company = job.find('a', attrs={"class": "crop"}).get_text().strip()
      title = job.find('code').get_text()
      salary = job.find('span', attrs={"class": "color-3"}).get_text()
      # Save the information to the list and return it
      return_list.append(company + " " + title + " " + salary)
  return return_list
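
Note that find() returns None when a tag is not present, so a listing with a missing field would raise an AttributeError. A hedged defensive variant (an addition of mine, not part of the original article) could skip such entries:

def parse_page_safe(soup):
  return_list = []
  grid = soup.find('section', attrs={"class": "widget-job-list"})
  if grid:
    for job in soup.find_all('article', attrs={"class": "widget item"}):
      company_tag = job.find('a', attrs={"class": "crop"})
      title_tag = job.find('code')
      salary_tag = job.find('span', attrs={"class": "color-3"})
      # Skip listings that are missing any of the three fields
      if not (company_tag and title_tag and salary_tag):
        continue
      return_list.append(company_tag.get_text().strip() + " "
                         + title_tag.get_text() + " "
                         + salary_tag.get_text())
  return return_list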

IV. Storing to a file

Store the list of information in a file.

def write_to_file(content):
  # Open in append mode; set the encoding to prevent garbled characters
  # "jobs.txt" is a placeholder file name (the original name was not preserved)
  with open("jobs.txt", "a", encoding="gb18030") as f:
    f.write("\n".join(content))

V. Crawling multiple pages of information

Looking at the page URL, you can see that the final page parameter represents the page number.

So pass a page number into the main method and call main(page) in a loop to crawl multiple pages of information.

def main(page):
  # The site's base URL is omitted here; only the search path is shown
  url = '/search?key=python&page=' + str(page)
  soup = get_one_page(url)
  return_list = parse_page(soup)
  write_to_file(return_list)

if __name__ == "__main__":
  for i in range(4):
    main(i)
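
When crawling several pages in a row, it is polite (and less likely to get the crawler blocked) to pause between requests. A minimal sketch using time.sleep, as an addition of mine rather than part of the original article:

import time

if __name__ == "__main__":
  for i in range(4):
    main(i)
    # Pause between pages so the crawler does not hammer the server
    time.sleep(2)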

VI. Operational results

VII. Complete code

import requests
import re
from bs4 import BeautifulSoup

def get_one_page(url):
  response = requests.get(url)
  # Parse the response text; html.parser is Python's built-in HTML parser
  soup = BeautifulSoup(response.text, "html.parser")
  return soup

def parse_page(soup):
  # List of information to be stored
  return_list = []
  # Starting position
  grid = soup.find('section', attrs={"class": "widget-job-list"})
  if grid:
    # Find all job listings
    job_list = soup.find_all('article', attrs={"class": "widget item"})

    # Match the contents
    for job in job_list:
      # find() returns the first tag that matches
      # get_text() returns a string; strip() removes surrounding whitespace and newlines
      company = job.find('a', attrs={"class": "crop"}).get_text().strip()
      title = job.find('code').get_text()
      salary = job.find('span', attrs={"class": "color-3"}).get_text()
      # Save the information to the list and return it
      return_list.append(company + " " + title + " " + salary)
  return return_list

def write_to_file(content):
  # Open in append mode; set the encoding to prevent garbled characters
  # "jobs.txt" is a placeholder file name (the original name was not preserved)
  with open("jobs.txt", "a", encoding="gb18030") as f:
    f.write("\n".join(content))

def main(page):
  # The site's base URL is omitted here; only the search path is shown
  url = '/search?key=python&page=' + str(page)
  soup = get_one_page(url)
  return_list = parse_page(soup)
  write_to_file(return_list)

if __name__ == "__main__":
  for i in range(4):
    main(i)

That is all the content of this article.