SoFunction
Updated on 2024-10-30

Super-detailed tutorial on using the Beautiful Soup library in Python

1. Introduction to Beautiful Soup

Simply put, Beautiful Soup is a Python library whose main job is to extract data from web pages. The official description reads roughly as follows:

Beautiful Soup provides a handful of simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that extracts the data you need by parsing the document, and because it is so simple, a complete application takes very little code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't need to think about encodings unless the document doesn't declare one and Beautiful Soup cannot detect it automatically; in that case, you simply specify the original encoding yourself.

Beautiful Soup works alongside excellent Python parsers such as lxml and html5lib, giving users the flexibility to choose different parsing strategies or to trade flexibility for speed.

Without further ado, let's try it out~
2. Beautiful Soup Installation

Beautiful Soup 3 is no longer developed, and Beautiful Soup 4 is recommended for current projects. The project has been ported to the bs4 package, which means we import bs4 when using it. The version used here is Beautiful Soup 4.3.2 (BS4 for short). Note that BS4 runs on both Python 2 and Python 3, while BS3 only supports Python 2; I am using Python 2.7.7, so the examples below use Python 2 print syntax.

If you are using a newer version of Debian or Ubuntu, you can install it through the system package manager, though it won't be the latest version (currently 4.2.1):
 

sudo apt-get install python-bs4

If you want the latest version, download the release archive and install it manually, which is also very convenient. Here I installed Beautiful Soup 4.3.2.

Beautiful Soup 3.2.1
Beautiful Soup 4.3.2

Unzip the file after the download is complete

Run the following command in the unpacked directory to complete the installation
 

sudo python setup.py install

If the command finishes without errors, the installation was successful.


Then you need to install lxml
 

sudo apt-get install python-lxml

Beautiful Soup supports the HTML parser from the Python standard library as well as several third-party parsers. If none of these is installed, Python's default parser is used. The lxml parser is more powerful and faster, so installing it is recommended.
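As a quick check that a parser choice works, here is a minimal sketch (assuming bs4 is installed; "html.parser" needs no extra packages, while "lxml" requires the library above):

```python
from bs4 import BeautifulSoup

markup = "<p>Some <b>bold</b> text</p>"

# "html.parser" is Python's built-in parser and needs no extra install
soup = BeautifulSoup(markup, "html.parser")
print(soup.b.string)  # bold

# With lxml installed, the same call can use the faster parser instead:
# soup = BeautifulSoup(markup, "lxml")
```

Passing the parser name explicitly also silences the "no parser specified" warning that newer bs4 releases emit.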
3. Starting the Beautiful Soup Journey

First, here is the link to the official documentation. It covers a lot of material and is not organized very tightly, so this post arranges it to make reference easier for everyone.

official document
4. Creating Beautiful Soup Objects

First you must import the bs4 library
 

from bs4 import BeautifulSoup

We create a string, which we'll use in the examples that follow.
 

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" class="sister" ><!-- Elsie --></a>,
<a href="/lacie" class="sister" >Lacie</a> and
<a href="/tillie" class="sister" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Create a BeautifulSoup object
 

soup = BeautifulSoup(html)

Alternatively, we can create objects from native HTML files, for example
 

soup = BeautifulSoup(open('index.html'))  # any local HTML file

The above code opens the local file and uses it to create the soup object.

Let's print the contents of the soup object, formatting the output as follows
 

print soup.prettify()
 
<html>
 <head>
 <title>
  The Dormouse's story
 </title>
 </head>
 <body>
 <p class="title" name="dromouse">
  <b>
  The Dormouse's story
  </b>
 </p>
 <p class="story">
  Once upon a time there were three little sisters; and their names were
  <a class="sister" href="/elsie" >
  <!-- Elsie -->
  </a>
  ,
  <a class="sister" href="/lacie" >
  Lacie
  </a>
  and
  <a class="sister" href="/tillie" >
  Tillie
  </a>
  ;
and they lived at the bottom of a well.
 </p>
 <p class="story">
  ...
 </p>
 </body>
</html>

The above is the document pretty-printed. This prettify() function is used often, so it's worth remembering.
5. Four main object categories

Beautiful Soup transforms a complex HTML document into a complex tree structure. Every node is a Python object, and all objects fall into four categories:

  1.     Tag
  2.     NavigableString
  3.     BeautifulSoup
  4.     Comment
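The four types can be seen side by side in a short sketch (a minimal example; the sample markup here is made up for illustration):

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '<p class="title"><b>Bold text</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

print(type(soup))           # the BeautifulSoup object for the whole document
print(type(soup.p))         # Tag
print(type(soup.b.string))  # NavigableString
comment = soup.p.contents[1]
print(type(comment))        # Comment
```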

We will introduce them one by one below
(1)Tag

What is a Tag? In layman's terms, it's a tag in HTML, such as
 

<title>The Dormouse's story</title>
 
<a class="sister" href="/elsie" >Elsie</a>

HTML tags such as <title> and <a> above, together with the content they enclose, are Tags. Let's see how easily Beautiful Soup retrieves them.

The commented portion of each of the following code snippets is the result of running the program.
 

print soup.title
#<title>The Dormouse's story</title>
 
print soup.head
#<head><title>The Dormouse's story</title></head>
 
print soup.a
#<a class="sister" href="/elsie" ><!-- Elsie --></a>
 
print soup.p
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

We can fetch a tag simply with soup plus the tag name. Doesn't that feel much more convenient than regular expressions? Note, however, that this returns only the first matching tag in the document; querying all matching tags is introduced later.

We can verify the types of these objects
 

print type(soup.a)
#<class 'bs4.element.Tag'>

For Tag, it has two important attributes, which are name and attrs. Here's a taste of each

name
 
print soup.name
print soup.head.name
#[document]
#head

The soup object itself is special in that its name is [document], and for other internal tags, the output is the name of the tag itself.

attrs
 
print soup.p.attrs
#{'class': ['title'], 'name': 'dromouse'}

Here, we have printed out all the attributes of the p tag, and the resulting type is a dictionary.

If we want a single attribute on its own, we can index the tag; for example, to get the class:
 

print soup.p['class']
#['title']

You can also do this, using the get method and passing in the name of the property, both of which are equivalent
 

print soup.p.get('class')
#['title']
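The difference between indexing and get() matters when an attribute is missing; a small sketch (the sample markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser")

print(soup.p["class"])   # ['title'] -- class is multi-valued, so it is a list
print(soup.p.get("id"))  # None -- get() returns None instead of raising
# soup.p["id"] would raise a KeyError because the attribute is absent
```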

We can make changes to these attributes and contents, etc., for example
 

soup.p['class'] = "newClass"
print soup.p
#<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

It is also possible to remove this attribute, for example
 

del soup.p['class']
print soup.p
#<p name="dromouse"><b>The Dormouse's story</b></p>

Modifying and deleting are not our main use case, though, so we won't cover them in detail here; if you need them, see the official documentation linked earlier.
(2)NavigableString

Now that we have the content of the label, the question arises, what do we do if we want to get the text inside the label? It's easy, just use .string, for example
 

print soup.p.string
#The Dormouse's story

This makes it easy to get the text inside a tag; just imagine how much trouble that would be with regular expressions. The type of this value is NavigableString, literally a "navigable string".
 

print type(soup.p.string)
#<class 'bs4.element.NavigableString'>


(3)BeautifulSoup

The BeautifulSoup object represents the entire content of a document. Most of the time, you can think of it as a Tag object, which is a special kind of Tag, and we can get its type, name, and attributes to get a feel for it.
 

print type(soup.name)
#<type 'unicode'>
print soup.name
# [document]
print soup.attrs
#{} empty dictionary

(4)Comment

The Comment object is a special type of NavigableString. Its printed output does not include the comment markers, which, if not handled properly, can cause unexpected trouble in text processing.

Let's find a label with comments
 

print soup.a
print soup.a.string
print type(soup.a.string)

The results of the run are as follows
 

<a class="sister" href="/elsie" ><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we use .string to output it, we find that the comment markers have been stripped, which may cause unnecessary trouble.

In addition, printing its type shows that it is a Comment, so it is best to check the type before using the value. The check looks like this:
 

from bs4 import Comment

if type(soup.a.string) == Comment:
  print soup.a.string

In the above code, we first determine its type, whether it is of type Comment or not, and then perform other operations, such as printout.
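The same check can be written as a self-contained snippet (Python 3 style print; isinstance() is an equivalent alternative to comparing types directly):

```python
from bs4 import BeautifulSoup, Comment

html = '<a class="sister" href="/elsie"><!-- Elsie --></a>'
soup = BeautifulSoup(html, "html.parser")

text = soup.a.string
# .string silently drops the <!-- --> markers, so check the type first
if isinstance(text, Comment):
    print("comment node:", text)
else:
    print("plain text:", text)
```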
6. Traversing the document tree
(1) Direct child nodes

Essentials: .contents .children property

.contents

The .contents attribute of a tag outputs the tag's child nodes as a list.
 

print soup.head.contents
#[<title>The Dormouse's story</title>]

The output is a list, and we can use the list index to get one of its elements
 

print soup.head.contents[0]
#<title>The Dormouse's story</title>

.children

It doesn't return a list, but we can get all the children by iterating over them.

If we print .children, we can see that it is a list iterator, not a list.
 

print soup.head.children
#<listiterator object at 0x7f71457f5710>

How do we get the contents? It's easy, just iterate over it, the code and the result is as follows
 

for child in soup.body.children:
  print child
 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" ><!-- Elsie --></a>,
<a class="sister" href="/lacie" >Lacie</a> and
<a class="sister" href="/tillie" >Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
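The relationship between .contents and .children can be sketched like this (a minimal example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>",
                     "html.parser")

# .contents is a real list; .children is an iterator over the same nodes
print(soup.head.contents)
print(list(soup.head.children) == soup.head.contents)  # True
```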

(2) All descendant nodes

Knowledge Points: .descendants property

.descendants

The .contents and .children attributes contain only the direct children of the tag, and the .descendants attribute recursively loops through all of the tag's descendants, similar to children, which we need to iterate through to get the contents.
 

for child in soup.descendants:
  print child

The result is as follows: all nodes are printed out, first the outermost <html> tag, then its contents are peeled away layer by layer starting from the <head> tag, and so on.
 

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" ><!-- Elsie --></a>,
<a class="sister" href="/lacie" >Lacie</a> and
<a class="sister" href="/tillie" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
 
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" ><!-- Elsie --></a>,
<a class="sister" href="/lacie" >Lacie</a> and
<a class="sister" href="/tillie" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" ><!-- Elsie --></a>,
<a class="sister" href="/lacie" >Lacie</a> and
<a class="sister" href="/tillie" >Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
 
<a class="sister" href="/elsie" ><!-- Elsie --></a>
 Elsie 
,
 
<a class="sister" href="/lacie" >Lacie</a>
Lacie
 and
 
<a class="sister" href="/tillie" >Tillie</a>
Tillie
;
and they lived at the bottom of a well.
 
<p class="story">...</p>
...

(3) Node content

Knowledge Points: .string property

If a tag has exactly one child node and that child is a NavigableString, then .string on the tag returns that child. If a tag's only child is another tag, .string also works: the output equals the .string of that single child.

In layman's terms, if there are no more tags inside a label, then .string will return the contents of the label. If there is only one label inside a tag, then .string will also return the innermost content. For example
 

print soup.head.string
#The Dormouse's story
print soup.title.string
#The Dormouse's story

If a tag contains more than one child node, .string cannot determine which child's content to return, so its output is None.
 

print soup.html.string
# None
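When .string returns None, get_text() is a common way to collect all the text anyway; a small sketch (get_text() is a standard BS4 method; the sample markup is made up):

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time there were <b>three</b> sisters.</p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)      # None -- the <p> has more than one child node
print(soup.p.get_text())  # concatenates the text of all descendants
```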

(4) Multiple contents

Knowledge: .strings .stripped_strings property

.strings

To get multiple pieces of text at once, iterate over .strings, as in the following example
 

for string in soup.strings:
  print(repr(string))
  # u"The Dormouse's story"
  # u'\n\n'
  # u"The Dormouse's story"
  # u'\n\n'
  # u'Once upon a time there were three little sisters; and their names were\n'
  # u'Elsie'
  # u',\n'
  # u'Lacie'
  # u' and\n'
  # u'Tillie'
  # u';\nand they lived at the bottom of a well.'
  # u'\n\n'
  # u'...'
  # u'\n'

.stripped_strings

The output string may contain a lot of spaces or blank lines, use .stripped_strings to remove them.
 

for string in soup.stripped_strings:
  print(repr(string))
  # u"The Dormouse's story"
  # u"The Dormouse's story"
  # u'Once upon a time there were three little sisters; and their names were'
  # u'Elsie'
  # u','
  # u'Lacie'
  # u'and'
  # u'Tillie'
  # u';\nand they lived at the bottom of a well.'
  # u'...'
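A common pattern is joining .stripped_strings back together; a minimal sketch (the sample markup is illustrative):

```python
from bs4 import BeautifulSoup

html = "<p>\n  Hello,\n  <b>\n  world\n  </b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

print(list(soup.stripped_strings))      # ['Hello,', 'world']
print(" ".join(soup.stripped_strings))  # Hello, world
```

Each access to .stripped_strings produces a fresh generator, so it can be iterated more than once.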

(5) Parent node

Knowledge: .parent property

 

p = soup.p
print p.parent.name
#body
 
content = soup.head.title.string
print content.parent.name
#title

(6) All parent nodes

Knowledge: .parents property

All parents of an element can be obtained recursively through the element's .parents attribute, for example
 

content = soup.head.title.string
for parent in content.parents:
  print parent.name

 

title
head
html
[document]

(7) Sibling nodes

Knowledge: .next_sibling .previous_sibling properties

Sibling nodes are nodes at the same level of the tree as the current node. The .next_sibling attribute returns the node's next sibling and .previous_sibling the previous one; both return None when that sibling does not exist.

Note: The .next_sibling and .previous_sibling attributes of tags in the actual document are usually strings or whitespaces, and since a whitespace or newline can also be treated as a node, the result may be a whitespace or newline.
 

print soup.p.next_sibling
#    The actual output here is blank: the newline between tags counts as a node
print soup.p.previous_sibling
#None  Returns None because there is no previous sibling
print soup.p.next_sibling.next_sibling
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="/elsie" ><!-- Elsie --></a>,
#<a class="sister" href="/lacie" >Lacie</a> and
#<a class="sister" href="/tillie" >Tillie</a>;
#and they lived at the bottom of a well.</p>
# The next sibling's next sibling is the <p> node we can actually see

(8) All sibling nodes

Knowledge: .next_siblings .previous_siblings properties

The .next_siblings and .previous_siblings attributes allow you to iterate over the siblings of the current node.
 

for sibling in soup.a.next_siblings:
  print(repr(sibling))
  # u',\n'
  # <a class="sister" href="/lacie" >Lacie</a>
  # u' and\n'
  # <a class="sister" href="/tillie" >Tillie</a>
  # u'; and they lived at the bottom of a well.'
  # None

(9) Front and back nodes

Knowledge: .next_element .previous_element properties

Unlike .next_sibling and .previous_sibling, these are not restricted to siblings: they apply across all nodes, regardless of hierarchy.

For example, the head node is
 
<head><title>The Dormouse's story</title></head>

Then its next element is the <title> tag, one level down: .next_element ignores hierarchy.
 

print soup.head.next_element
#<title>The Dormouse's story</title>

(10) All front and back nodes

Knowledge: .next_elements .previous_elements property

The .next_elements and .previous_elements iterators allow you to access the parsed content of a document either forward or backward, as if the document were being parsed.
 

last_a_tag = soup.find_all("a")[-1]  # the last <a> tag in the document
for element in last_a_tag.next_elements:
  print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None

7. Search the document tree
(1)find_all( name , attrs , recursive , text , **kwargs )

The find_all() method searches all the tag children of the current tag and determines if the filter conditions are met.

1) name parameter

The name parameter finds all tags with the name, string objects are automatically ignored.

A.Pass String

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name exactly matches that string. The following example finds all the <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]
 
print soup.find_all('a')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>, <a class="sister" href="/lacie" >Lacie</a>, <a class="sister" href="/tillie" >Tillie</a>]

B. Passing Regular Expressions

If you pass in a regular expression, Beautiful Soup matches tag names against it with the expression's match() method. The following example finds all tags whose names start with b, which means both the <body> and <b> tags are found.
 

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b

C. Pass List

If you pass in a list parameter, Beautiful Soup will return the content that matches any of the elements in the list. The following code finds all <a> tags and <b> tags in the document
 

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="/elsie" >Elsie</a>,
# <a class="sister" href="/lacie" >Lacie</a>,
# <a class="sister" href="/tillie" >Tillie</a>]

D. Pass True

True matches any value; the following code finds all the tags but returns no string nodes.
 

for tag in soup.find_all(True):
  print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a

E. Pass a Function

If none of the other filters fits, you can define a function that takes a single tag element as its argument. The function returns True if the current element matches and should be kept, and False otherwise.

The following method checks the current element and returns True if it contains the class attribute but not the id attribute.
 

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')

Passing this as an argument to the find_all() method will get all <p> tags.
 

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]
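Here is the same function filter as a complete runnable snippet (the second <p> with an id is made up for illustration, so the filter has something to reject):

```python
from bs4 import BeautifulSoup

html = ('<p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story" id="intro">Once upon a time...</p>')
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    # the filter receives every Tag and returns True to keep it
    return tag.has_attr("class") and not tag.has_attr("id")

for tag in soup.find_all(has_class_but_no_id):
    print(tag.name, tag["class"])  # p ['title']
```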

2) keyword parameter

Note: if a named argument is not one of the built-in parameter names, the search treats it as a filter on a tag attribute with that name. For example, passing a parameter named id makes Beautiful Soup search each tag's "id" attribute.

 

soup.find_all(id='link2')
# [<a class="sister" href="/lacie" >Lacie</a>]

If you pass in the href parameter, Beautiful Soup will search for the "href" attribute of each tag.
 

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="/elsie" >Elsie</a>]

Multiple attributes of a tag can be filtered at the same time by using multiple parameters with the specified names.
 

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="/elsie" >three</a>]

If we want to filter by class, note that class is a Python reserved word; just append an underscore and use class_ instead.
 

soup.find_all("a", class_="sister")
# [<a class="sister" href="/elsie" >Elsie</a>,
# <a class="sister" href="/lacie" >Lacie</a>,
# <a class="sister" href="/tillie" >Tillie</a>]

Some tag attributes can't be used in search, such as the data-* attribute in HTML5.
 

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

However, it is possible to search for tags containing special attributes by defining a dictionary parameter in the attrs parameter of the find_all() method
 

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
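A self-contained sketch of the attrs-dictionary workaround (using the "html.parser" backend so no extra parser is needed):

```python
from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# data-foo is not a valid Python identifier, so it cannot be passed as a
# keyword argument; the attrs dictionary accepts any attribute name.
found = data_soup.find_all(attrs={"data-foo": "value"})
print(found)  # [<div data-foo="value">foo!</div>]
```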

3) text parameter

The text parameter can be used to search the contents of the document for strings. As with the optional name parameter, the text parameter accepts strings, regular expressions, lists, True
 

soup.find_all(text="Elsie")
# [u'Elsie']
 
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
 
soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

4) limit parameter

The find_all() method returns the entire search structure, which can be slow if the document tree is large. If we don't need all the results, we can use the limit parameter to limit the number of results returned. The effect is similar to the limit keyword in SQL; when the number of results reaches the limit, the search stops returning results.

There are 3 tags in the document tree that match the search criteria, but only 2 are returned because we limit the number of returns.
 

soup.find_all("a", limit=2)
# [<a class="sister" href="/elsie" >Elsie</a>,
# <a class="sister" href="/lacie" >Lacie</a>]

5) recursive parameter

When find_all() is called on a tag, Beautiful Soup searches all of that tag's descendants. If you only want to search the tag's direct children, pass the parameter recursive=False.

A simple piece of documentation.

 
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

Search results with or without recursive parameter.
 

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
 
soup.html.find_all("title", recursive=False)
# []
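The recursive=False behaviour can be verified with a self-contained snippet (same idea as above, using the stdlib parser):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.html.find_all("title"))                   # searches all descendants
print(soup.html.find_all("title", recursive=False))  # [] -- not a direct child
```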

(2)find( name , attrs , recursive , text , **kwargs )

The only difference from find_all() is the return value: find_all() returns a list of all matching elements, whereas find() returns only the first match directly
(3)find_parents() find_parent()

find_all() and find() search only the descendants (children, grandchildren, etc.) of the current node. find_parents() and find_parent() search the current node's ancestors instead; the filter arguments work the same way as for an ordinary tag search.
(4)find_next_siblings() find_next_sibling()

These two methods use the .next_siblings attribute to iterate over the sibling tag nodes that follow the current tag: find_next_siblings() returns all later siblings that match the criteria, while find_next_sibling() returns only the first one.
(5)find_previous_siblings() find_previous_sibling()

These two methods use the .previous_siblings attribute to iterate over the sibling tag nodes that precede the current tag: find_previous_siblings() returns all earlier siblings that match the criteria, and find_previous_sibling() returns only the first one.
(6)find_all_next() find_next()

These two methods iterate through the tags and strings after the current tag using the .next_elements attribute, the find_all_next() method returns all eligible nodes, and the find_next() method returns the first eligible node.
(7)find_all_previous() and find_previous()

These two methods iterate over the tags and strings in front of the current node using the .previous_elements attribute, the find_all_previous() method returns all eligible nodes, and the find_previous() method returns the first eligible node.

Note: the parameters of methods (2) through (7) above are used exactly as in find_all(); the principles are similar, so they are not repeated here.

8. CSS Selectors

When writing CSS, we write tag names bare, class names prefixed with a dot, and id names prefixed with a #. We can filter elements in a similar way here using the soup.select() method, which returns a list.
(1) Search by tag name
 

print soup.select('title')
#[<title>The Dormouse's story</title>]
 
print soup.select('a')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>, <a class="sister" href="/lacie" >Lacie</a>, <a class="sister" href="/tillie" >Tillie</a>]
 
print soup.select('b')
#[<b>The Dormouse's story</b>]

(2) Search by class name
 

print soup.select('.sister')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>, <a class="sister" href="/lacie" >Lacie</a>, <a class="sister" href="/tillie" >Tillie</a>]

(3) Search by id name
 

print soup.select('#link1')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]

(4) Combined search

Combined search works the same way as when writing a CSS file: tag names, class names, and id names can be combined on the same principle. For example, to find the element with id link1 inside a <p> tag, separate the two selectors with a space:
 

print soup.select('p #link1')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]

Direct child lookup uses >
 

print soup.select("head > title")
#[<title>The Dormouse's story</title>]

(5) Attribute Lookup

Lookups can also use attributes. The attribute must be enclosed in square brackets, and note that an attribute and its tag belong to the same node, so no space may be added between them; otherwise the selector will fail to match.
 

print soup.select('a[class="sister"]')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>, <a class="sister" href="/lacie" >Lacie</a>, <a class="sister" href="/tillie" >Tillie</a>]
 
print soup.select('a[href="/elsie"]')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]

Similarly, attribute selectors can be combined with the lookups above: use a space between parts that belong to different nodes, and no space within the same node.
 

print soup.select('p a[href="/elsie"]')
#[<a class="sister" href="/elsie" ><!-- Elsie --></a>]
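A runnable sketch of a combined tag + attribute selector (the markup and its id values are made up for illustration):

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a class="sister" href="/elsie" id="link1">Elsie</a>'
        '<a class="sister" href="/lacie" id="link2">Lacie</a>'
        '</p>')
soup = BeautifulSoup(html, "html.parser")

# tag name, descendant combinator, and attribute selector combined
for a in soup.select('p a[href="/lacie"]'):
    print(a.get_text())  # Lacie
```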

So that is select(), another lookup method similar to find_all(). Convenient, isn't it?
Summary

This article covers a lot of ground: it organizes and summarizes most of Beautiful Soup's methods, though not all of them. Beautiful Soup also supports modifying and deleting nodes, but those features are rarely used, so only the search and extraction methods are collected here. I hope it helps you!

Mastering Beautiful Soup is sure to save you a great deal of effort. Go for it!