1. Introduction to Beautiful Soup
Simply put, Beautiful Soup is a Python library whose main job is to extract data from web pages. The official description is as follows:
Beautiful Soup provides a handful of simple, Python-style functions for navigating, searching, and modifying the parse tree. It is a toolkit that extracts the data users need by parsing documents, and because it is so simple, a complete application takes very little code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't need to think about encodings unless the document doesn't declare one and Beautiful Soup cannot detect it automatically; in that case you only need to state the document's original encoding.
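As a quick illustration of this automatic Unicode conversion, here is a minimal sketch. The markup is invented, and 'html.parser' is the standard-library parser; the point is that we feed Beautiful Soup raw bytes and get Unicode back.

```python
# Sketch: feed Beautiful Soup raw bytes; it decodes them to Unicode for us.
from bs4 import BeautifulSoup

# Hypothetical document, as raw UTF-8 bytes with a declared charset.
raw = b'<html><head><meta charset="utf-8"></head><body><p>caf\xc3\xa9</p></body></html>'

soup = BeautifulSoup(raw, 'html.parser')
print(soup.p.string)             # already decoded to Unicode text
print(soup.original_encoding)    # the encoding Beautiful Soup worked out
```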
Together with parsers such as lxml and html5lib, Beautiful Soup gives users the flexibility to choose among different parsing strategies and to trade flexibility for speed.
Without further ado, let's try it out~
2. Beautiful Soup Installation
Beautiful Soup 3 is no longer maintained, and Beautiful Soup 4 is recommended for current projects. The project has been ported to the bs4 package, which means we need to import bs4 when using it. The version used here is Beautiful Soup 4.3.2 (BS4 for short). Note that BS3 runs only on Python 2, while BS4 supports both Python 2 and Python 3; I am using Python 2.7.7, so the examples below use Python 2 syntax.
If you are using a newer version of Debian or Ubuntu, you can install it through the system package manager, though it will not be the latest version (currently 4.2.1):
sudo apt-get install python-bs4
If you want the latest version, download the release package and install it manually, which is also very convenient. Here I installed Beautiful Soup 4.3.2.
- Beautiful Soup 3.2.1
- Beautiful Soup 4.3.2
After the download completes, unzip the package, then run the following command inside the unpacked directory to finish the installation:
sudo python setup.py install
If the command finishes without errors, the installation was successful.
Then you need to install lxml
sudo apt-get install python-lxml
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If none of these is installed, Python's default parser is used. The lxml parser is more powerful and faster, so installing it is recommended.
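As a hedged sketch of what choosing a parser looks like in practice (the document string is made up; 'html.parser' is the built-in parser, and "lxml" can be substituted once it is installed):

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# The second argument names the parser explicitly, so behaviour does not
# depend on which parsers happen to be installed on the machine.
soup = BeautifulSoup(doc, "html.parser")   # swap in "lxml" if installed
print(soup.title.string)
print(soup.p.string)
```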
3. Starting the Beautiful Soup Journey
First, here is a link to the official documentation. It covers a lot of material and is not organized very tightly, so this post arranges it a bit to make it easier to reference.
Official documentation
4. Creating Beautiful Soup Objects
First you must import the bs4 library
from bs4 import BeautifulSoup
We create a string, which we'll use in the examples that follow.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Create beautifulsoup object
soup = BeautifulSoup(html)
Alternatively, we can create the object from a local HTML file, for example (index.html here is a hypothetical file name):
soup = BeautifulSoup(open('index.html'))
The code above opens the local file and uses its contents to create the soup object.
Let's print the contents of the soup object, formatting the output as follows
print soup.prettify()
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
That is the output: prettify() formats the document and prints it nicely indented. This function is used very often, so it is worth remembering.
5. Four main object categories
Beautiful Soup transforms a complex HTML document into a tree of Python objects. Each node is a Python object, and all objects fall into four categories:
- Tag
- NavigableString
- BeautifulSoup
- Comment
We will introduce them one by one below
(1)Tag
What is a Tag? In layman's terms, it's a tag in HTML, such as
<title>The Dormouse's story</title>
<a class="sister" href="/elsie" id="link1">Elsie</a>
HTML tags like title and a above, together with the content they enclose, are Tags. Let's see how easily Beautiful Soup gets hold of them.
The commented portion of each of the following code snippets is the result of running the program.
print soup.title
#<title>The Dormouse's story</title>

print soup.head
#<head><title>The Dormouse's story</title></head>

print soup.a
#<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>

print soup.p
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
We can use soup plus a tag name to conveniently get that tag's content; doesn't it feel much easier than regular expressions? One caveat: this returns only the first tag in the whole document that matches the name. Querying for all matching tags is introduced later.
We can verify the types of these objects
print type(soup.a)
#<class 'bs4.element.Tag'>
Tag has two important attributes: name and attrs. Let's get a taste of each.
name

print soup.name
print soup.head.name
#[document]
#head
The soup object itself is special in that its name is [document], and for other internal tags, the output is the name of the tag itself.
attrs

print soup.p.attrs
#{'class': ['title'], 'name': 'dromouse'}
Here, we have printed out all the attributes of the p tag, and the resulting type is a dictionary.
If we want a single attribute by itself, we can do this; for example, getting the tag's class:
print soup.p['class']
#['title']
You can also do this, using the get method and passing in the name of the property, both of which are equivalent
print soup.p.get('class')
#['title']
We can make changes to these attributes and contents, etc., for example
soup.p['class'] = "newClass"
print soup.p
#<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
It is also possible to remove this attribute, for example
del soup.p['class']
print soup.p
#<p name="dromouse"><b>The Dormouse's story</b></p>
However, modifying and deleting are not our main use cases, so they are not covered in detail here; if you need them, consult the official documentation linked earlier.
(2)NavigableString
Now that we can get a tag, the next question is: how do we get the text inside it? Easy, just use .string, for example
print soup.p.string
#The Dormouse's story
This conveniently gets the text inside the tag; just imagine how much trouble a regular expression would be. Its type is NavigableString, literally a string that can be traversed.

Let's check its type:

print type(soup.p.string)
#<class 'bs4.element.NavigableString'>
(3)BeautifulSoup
The BeautifulSoup object represents the entire content of a document. Most of the time, you can think of it as a Tag object, which is a special kind of Tag, and we can get its type, name, and attributes to get a feel for it.
print type(soup.name)
#<type 'unicode'>

print soup.name
# [document]

print soup.attrs
#{} (an empty dictionary)
(4)Comment
The Comment object is a special type of NavigableString. Its output does not include the comment markers, and if it is not handled properly it may cause unexpected trouble in text processing.
Let's find a label with comments
print soup.a
print soup.a.string
print type(soup.a.string)
The results of the run are as follows
<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
The content of the a tag is actually a comment, but .string outputs it with the comment markers stripped, which may cause us unnecessary trouble.
If we print its type, we find that it is a Comment, so it is best to check the type before using the value. The check looks like this:
import bs4

if type(soup.a.string) == bs4.element.Comment:
    print soup.a.string
The code above first checks whether the value's type is Comment, and only then performs other operations, such as printing it.
6. Traversing the document tree
(1) Direct child nodes
Knowledge points: the .contents and .children attributes
.contents
The .contents attribute of a tag outputs the tag's child nodes as a list.
print soup.head.contents
#[<title>The Dormouse's story</title>]
The output is a list, and we can use the list index to get one of its elements
print soup.head.contents[0]
#<title>The Dormouse's story</title>

.children
It doesn't return a list, but we can get all the children by iterating over them.
If we print .children itself, we can see that it is a list iterator object.
print soup.head.children
#<listiterator object at 0x7f71457f5710>
How do we get the contents? It's easy, just iterate over it, the code and the result is as follows
for child in soup.body.children:
    print child

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="/lacie" id="link2">Lacie</a> and
<a class="sister" href="/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
(2) All descendant nodes
Knowledge points: the .descendants attribute
.descendants
The .contents and .children attributes contain only a tag's direct children; the .descendants attribute recursively iterates over all of a tag's descendants. Like .children, it must be iterated over to get the contents.
for child in soup.descendants:
    print child
The result is shown below. You can see that every node is printed: first the outermost html tag, then the tags peeled away one by one starting from head, and so on.
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="/lacie" id="link2">Lacie</a> and
<a class="sister" href="/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="/lacie" id="link2">Lacie</a> and
<a class="sister" href="/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="/lacie" id="link2">Lacie</a> and
<a class="sister" href="/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,
<a class="sister" href="/lacie" id="link2">Lacie</a>
Lacie
 and
<a class="sister" href="/tillie" id="link3">Tillie</a>
Tillie
; and they lived at the bottom of a well.
<p class="story">...</p>
...
(3) Node content
Knowledge points: the .string attribute
If a tag has only one child node of type NavigableString, the tag can use .string to get that child. If a tag has exactly one child tag, the tag can also use .string, and the output is the same as the .string result of that unique child.
In plain terms, if there are no further tags inside a tag, .string returns its text; if there is exactly one tag inside, .string returns the innermost content. For example
print soup.head.string
#The Dormouse's story
print soup.title.string
#The Dormouse's story
If a tag contains more than one child node, .string cannot determine which child it should refer to, so its output is None.
print soup.html.string
# None
(4) Multiple contents
Knowledge points: the .strings and .stripped_strings attributes
.strings
To get multiple pieces of content you need to iterate, as in the following example.
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
.stripped_strings
The output may contain many spaces or blank lines; use .stripped_strings to remove them.
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
(5) Parent node
Knowledge points: the .parent attribute
p = soup.p
print p.parent.name
#body

content = soup.head.title.string
print content.parent.name
#title
(6) All parent nodes
Knowledge points: the .parents attribute
All ancestors of an element can be obtained recursively through the element's .parents attribute, for example
content = soup.head.title.string
for parent in content.parents:
    print parent.name
title
head
html
[document]
(7) Sibling nodes
Knowledge points: the .next_sibling and .previous_sibling attributes
Sibling nodes are nodes at the same level as the current node. The .next_sibling attribute gets the node's next sibling, and .previous_sibling gets the previous one; if the sibling does not exist, None is returned.
Note: in a real document, a tag's .next_sibling or .previous_sibling is usually a string or whitespace; since whitespace and newlines also count as nodes, the result may be whitespace or a newline.
print soup.p.next_sibling
#       (the actual output here is a blank line)
print soup.p.previous_sibling
#None   (returns None when there is no previous sibling)
print soup.p.next_sibling.next_sibling
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="/lacie" id="link2">Lacie</a> and
#<a class="sister" href="/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
# The sibling after the next sibling is the <p> node we can actually see
(8) All sibling nodes
Knowledge points: the .next_siblings and .previous_siblings attributes
The .next_siblings and .previous_siblings attributes allow you to iterate over the siblings of the current node.
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="/tillie" id="link3">Tillie</a>
# u'; and they lived at the bottom of a well.'
(9) Front and back nodes
Knowledge points: the .next_element and .previous_element attributes
Unlike .next_sibling and .previous_sibling, these are not restricted to sibling nodes; they step through all nodes regardless of hierarchy.
For example, the head node is
<head><title>The Dormouse's story</title></head>
Then its next element is the title tag: not a sibling, but the next node in document order.
print soup.head.next_element
#<title>The Dormouse's story</title>
(10) All front and back nodes
Knowledge points: the .next_elements and .previous_elements attributes
The .next_elements and .previous_elements iterators let you access a document's content forward or backward, in the same order the document is parsed.
last_a_tag = soup.find("a", id="link3")
for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
7. Search the document tree
(1)find_all( name , attrs , recursive , text , **kwargs )
The find_all() method searches through all the descendants of the current tag and determines whether each one satisfies the filter conditions.
1) name parameter
The name parameter finds all tags whose name matches it; string objects are automatically ignored.
A. Passing a String
The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags that match that string exactly. The following example finds all the <b> tags in the document:
soup.find_all('b')
# [<b>The Dormouse's story</b>]

print soup.find_all('a')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="/lacie" id="link2">Lacie</a>, <a class="sister" href="/tillie" id="link3">Tillie</a>]
B. Passing a Regular Expression
If you pass a regular expression, Beautiful Soup matches tag names using the expression's match() method. The following example finds all tags whose names start with b; this means both the <body> and <b> tags are found.
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
C. Passing a List
If you pass a list, Beautiful Soup returns content matching any element of the list. The following code finds all the <a> tags and <b> tags in the document:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="/tillie" id="link3">Tillie</a>]
D. Passing True
True matches any value. The following code finds all the tags in the document, but returns no string nodes:
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
E. Passing a Method
If none of the filters above fits, you can define a method that takes exactly one element argument. If the method returns True, the current element matches and is kept; otherwise it is discarded.
The following method checks the current element and returns True if it contains the class attribute but not the id attribute.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
Passing this method as an argument to find_all() gets all tags that have a class attribute but no id (here, all the <p> tags):
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
2) keyword parameter
Note: if a named argument is not one of the built-in parameter names, the search treats it as a filter on the tag attribute of that name. For example, an argument named id makes Beautiful Soup search every tag's "id" attribute.
soup.find_all(id='link2')
# [<a class="sister" href="/lacie" id="link2">Lacie</a>]
If you pass in the href parameter, Beautiful Soup will search for the "href" attribute of each tag.
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="/elsie" id="link1">Elsie</a>]
Multiple attributes of a tag can be filtered at the same time by using multiple parameters with the specified names.
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="/elsie" id="link1">Elsie</a>]
Here we want to filter by class, but class is a Python keyword, so just add an underscore:
soup.find_all("a", class_="sister")
# [<a class="sister" href="/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="/tillie" id="link3">Tillie</a>]
Some tag attributes cannot be used as keyword arguments in searches, such as the data-* attributes in HTML5:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
However, tags with such special attributes can still be searched by passing a dictionary to find_all()'s attrs parameter:
data_soup.find_all(attrs={"data-foo": "value"}) # [<div data-foo="value">foo!</div>]
3) text parameter
The text parameter searches the document's string contents. Like the name parameter, it accepts a string, a regular expression, a list, or True:
soup.find_all(text="Elsie")
# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
4) limit parameter
The find_all() method returns the entire search structure, which can be slow if the document tree is large. If we don't need all the results, we can use the limit parameter to limit the number of results returned. The effect is similar to the limit keyword in SQL; when the number of results reaches the limit, the search stops returning results.
There are three <a> tags in the document tree that match, but only two are returned because we limited the count:
soup.find_all("a", limit=2)
# [<a class="sister" href="/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="/lacie" id="link2">Lacie</a>]
5) recursive parameter
When find_all() is called on a tag, Beautiful Soup searches all of that tag's descendants. If you only want to search the tag's direct children, pass recursive=False.
Take this simple document:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
...
Compare the search results with and without the recursive parameter:
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
(2)find( name , attrs , recursive , text , **kwargs )
The only difference from find_all() is the return value: find_all() returns a list of all matching elements, while find() returns the first match directly.
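A small sketch of the difference (the markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

doc = '<p class="a">one</p><p class="a">two</p>'
soup = BeautifulSoup(doc, "html.parser")

print(soup.find_all("p"))   # a list containing both <p> tags
print(soup.find("p"))       # only the first <p> tag
# When nothing matches, find_all() gives an empty list, find() gives None:
print(soup.find_all("div"))
print(soup.find("div"))
```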
(3)find_parents() find_parent()
find_all() and find() search only the descendants (children, grandchildren, and so on) of the current node. find_parents() and find_parent() search the current node's ancestors instead; their arguments work the same way as an ordinary tag search.
(4)find_next_siblings() find_next_sibling()
These two methods use the .next_siblings attribute to iterate over the siblings parsed after the current tag: find_next_siblings() returns all later siblings that meet the criteria, while find_next_sibling() returns only the first one.
(5)find_previous_siblings() find_previous_sibling()
These two methods use the .previous_siblings attribute to iterate over the siblings parsed before the current tag: find_previous_siblings() returns all earlier siblings that match the condition, and find_previous_sibling() returns the first matching one.
(6)find_all_next() find_next()
These two methods iterate through the tags and strings after the current tag using the .next_elements attribute, the find_all_next() method returns all eligible nodes, and the find_next() method returns the first eligible node.
(7)find_all_previous() and find_previous()
These two methods iterate over the tags and strings in front of the current node using the .previous_elements attribute, the find_all_previous() method returns all eligible nodes, and the find_previous() method returns the first eligible node.
Note: methods (2) through (7) take exactly the same parameters as find_all() and work on similar principles, so they are not repeated here.
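To make the directional variants concrete, here is a small sketch using an invented snippet; find_next_sibling() and find_parent() behave as described above:

```python
from bs4 import BeautifulSoup

doc = '<div id="box"><span id="a">A</span><span id="b">B</span></div>'
soup = BeautifulSoup(doc, "html.parser")

a = soup.find("span", id="a")
print(a.find_next_sibling("span"))   # the sibling <span id="b"> tag
print(a.find_parent("div")["id"])    # the enclosing div's id: box
```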
8. CSS Selectors
When writing CSS, tag names are written bare, class names are prefixed with a dot, and id names with a hash. Here we can filter elements in a similar way, using the soup.select() method, which returns a list.
(1) Search by tag name
print soup.select('title')
#[<title>The Dormouse's story</title>]

print soup.select('a')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="/lacie" id="link2">Lacie</a>, <a class="sister" href="/tillie" id="link3">Tillie</a>]

print soup.select('b')
#[<b>The Dormouse's story</b>]
(2) Search by class name
print soup.select('.sister')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="/lacie" id="link2">Lacie</a>, <a class="sister" href="/tillie" id="link3">Tillie</a>]
(3) Search by id name
print soup.select('#link1')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>]
(4) Combined search
Combined searches mix tag names with class or id names, on the same principle as combining them in a CSS file. For example, to find everything with id link1 inside p tags, separate the two selectors with a space:
print soup.select('p #link1')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>]
Direct children can be looked up with >:
print soup.select("head > title")
#[<title>The Dormouse's story</title>]
(5) Attribute Lookup
Lookups can also use attributes. Attributes must be enclosed in square brackets. Note that an attribute and its tag belong to the same node, so no space may appear between them; otherwise nothing will match.
print soup.select('a[class="sister"]')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="/lacie" id="link2">Lacie</a>, <a class="sister" href="/tillie" id="link3">Tillie</a>]

print soup.select('a[href="/elsie"]')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>]
Similarly, attributes can be combined with the lookups above: selectors for different nodes are separated by spaces, while those on the same node have no space between them.
print soup.select('p a[href="/elsie"]')
#[<a class="sister" href="/elsie" id="link1"><!-- Elsie --></a>]
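The selector forms above compose naturally. A sketch with invented markup, combining tag, id, class, and the child combinator in one query:

```python
from bs4 import BeautifulSoup

doc = '<ul id="menu"><li class="item">x</li><li class="item">y</li><li>z</li></ul>'
soup = BeautifulSoup(doc, "html.parser")

# Tag, id, class, and > work together, just as in a stylesheet.
for li in soup.select("ul#menu > li.item"):
    print(li.get_text())
```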
So this is another lookup method, similar to find_all. Doesn't it feel convenient?
Summary
This article covers a lot of ground: it organizes and summarizes most of Beautiful Soup's methods, though not all of them. Beautiful Soup can also modify and delete nodes, but those features are rarely needed, so only the lookup and extraction methods are collected here. I hope it helps you!
Mastering Beautiful Soup will certainly bring you great convenience. Go for it!