Project address: /jrainlau/wallpaper-downloader
Preamble
It's been a while since I've written an article, as I've been adapting to my new position and using my free time to learn Python. This article is a summary of a recent phase of that learning: developing a crawler to batch-download HD wallpapers from a wallpaper website.
Note: the project this article describes is for Python learning only, and is strictly forbidden from being used for any other purpose!
Initializing the Project
The project uses virtualenv to create a virtual environment and avoid polluting the global Python installation. virtualenv itself can be installed directly with pip3:

pip3 install virtualenv
Then create a new wallpaper-downloader directory, and use virtualenv to create a virtual environment named venv:

virtualenv venv
. venv/bin/activate
Next, create the dependency file (one package per line, so printf is used here rather than a plain echo):

printf 'bs4\nlxml\nrequests\n' > requirements.txt

Finally, download and install the dependencies:

pip3 install -r requirements.txt
Analyzing the Crawler's Workflow
For simplicity's sake, let's go directly to the wallpaper listing page for the "aero" category: /aer.....
As you can see, there are 10 wallpapers available for download on this page. But these are only thumbnails, far too low-resolution to use as wallpapers, so we need to enter each wallpaper's detail page and find the HD download link there. Clicking into the first wallpaper, we see a new page:
Since my machine has a Retina screen, I'm going to simply download the largest one for maximum clarity (size shown in the red circle).
Once you understand the steps, it's just a matter of using the developer tools to find the corresponding DOM nodes and extract the relevant URLs. I won't expand on that process here; readers can try it for themselves. Let's move on to the coding.
Accessing Pages
Create a new file and import two libraries:
from bs4 import BeautifulSoup
import requests
Next, write a function dedicated to visiting a URL and returning the parsed page HTML:
def visit_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'lxml')
    return soup
To avoid being caught by the website's anti-crawling mechanism, we disguise the crawler as a normal browser by adding a UA to the headers, then specify utf-8 encoding, and finally return the parsed HTML (as a BeautifulSoup object rather than a plain string, so we can query it later).
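As a quick sanity check, you can call visit_page directly; the URL below is just a placeholder for whatever listing page you are targeting:

# Quick sanity check; the URL is a placeholder for a real listing page
soup = visit_page('https://example.com/aero-desktop-wallpapers/page/1')
print(soup.title.get_text())  # should print the page title if the request succeeded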
Extracting Links
After getting the page's HTML, we need to extract the URLs of the wallpapers listed on it:
def get_paper_link(page):
    links = page.select('#content > div > ul > li > div > div a')
    collect = []
    for link in links:
        collect.append(link.get('href'))
    return collect
This function fetches the URLs of all the wallpaper detail pages on the list page.
Downloading Wallpapers
Once we have the address of the detail page, we can go in and pick the right size. After analyzing the DOM structure of the page, we can see that each size corresponds to a link:
So the first step is to extract the links corresponding to these sizes:
wallpaper_source = visit_page(link)
wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
size_list = []
for link in wallpaper_size_links:
    href = link.get('href')
    size_list.append({
        'size': eval(link.get_text().replace('x', '*')),
        'name': href.replace('/download/', ''),
        'url': href
    })
size_list is the collection of these links. To make it easier to pick the highest-definition (largest) wallpaper next, I used eval for the size: it simply evaluates a string like 5120x3200 (rewritten as 5120*3200) and uses the result as the size value.
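One caveat worth noting: eval executes whatever text the page happens to contain, so if you want to be safer you can parse the resolution string manually instead. A minimal alternative sketch, inside the same loop:

# Safer alternative to eval: parse the "5120x3200" text manually
width, height = link.get_text().strip().split('x')
size = int(width) * int(height)  # e.g. 5120 * 3200 -> 16384000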
Once you've got the whole collection, picking the clearest (largest) one is just a max() call:
biggest_one = max(size_list, key=lambda item: item['size'])
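To see what the key argument is doing, here is a tiny standalone illustration with made-up entries:

# Made-up entries just to illustrate max() with a key function
demo_list = [
    {'size': 1920 * 1080, 'name': 'a.jpg', 'url': '/download/a'},
    {'size': 5120 * 3200, 'name': 'b.jpg', 'url': '/download/b'},
]
print(max(demo_list, key=lambda item: item['size'])['name'])  # -> b.jpg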
The url in this biggest_one is the download link for the corresponding size; then you just need to download the linked resource via the requests library:
result = requests.get(PAGE_DOMAIN + biggest_one['url'])
if result.status_code == 200:
    open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
Note that you first need to create a wallpapers directory in the project root, otherwise the script will raise an error at runtime.
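If you would rather have the script create the directory itself, a one-time os.makedirs call before any download does the job (a small addition of mine, not part of the original script):

import os

# Create the output directory up front instead of relying on it existing
os.makedirs('wallpapers', exist_ok=True)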
The organized, complete download_wallpaper function looks like this:
def download_wallpaper(link):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = []
    for link in wallpaper_size_links:
        href = link.get('href')
        size_list.append({
            'size': eval(link.get_text().replace('x', '*')),
            'name': href.replace('/download/', ''),
            'url': href
        })
    biggest_one = max(size_list, key=lambda item: item['size'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])
    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
Batch Running
The steps above can only download the first wallpaper of the first wallpaper list page. If we want to download all the wallpapers across multiple list pages, we'll need to call these methods in a loop. First we define a few constants:
import sys

if len(sys.argv) != 4:
    print('3 arguments were required but only ' + str(len(sys.argv) - 1) + ' were found!')
    exit()

category = sys.argv[1]

try:
    page_start = [int(sys.argv[2])]
    page_end = int(sys.argv[3])
except ValueError:
    print('The second and third arguments must be numbers, not strings!')
    exit()
Here three constants are taken from the command-line arguments: category, page_start and page_end, corresponding to the wallpaper category, the start page number, and the end page number, respectively.
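You may notice that page_start is wrapped in a one-element list while page_end is a plain int. Presumably this is so that the start() function defined below can increment it without a global declaration: assigning to page_start[0] mutates the list rather than rebinding the name. A minimal illustration of the trick:

counter = [0]  # a one-element list can be mutated from inner scopes

def bump():
    counter[0] += 1  # mutates the list in place; no 'global' statement needed

bump()
bump()
print(counter[0])  # -> 2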
For convenience, define two more URL-related constants:
PAGE_DOMAIN = ''
PAGE_URL = '/' + category + '-desktop-wallpapers/page/'
The next step is to happily run the batch operations. Before that, let's define a start() function:
def start():
    if page_start[0] <= page_end:
        print('Preparing to download the ' + str(page_start[0]) + ' page of all the "' + category + '" wallpapers...')
        PAGE_SOURCE = visit_page(PAGE_URL + str(page_start[0]))
        WALLPAPER_LINKS = get_paper_link(PAGE_SOURCE)
        page_start[0] = page_start[0] + 1
        for index, link in enumerate(WALLPAPER_LINKS):
            download_wallpaper(link, index, len(WALLPAPER_LINKS), start)
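Note how pagination advances here: start() passes itself as a callback to download_wallpaper (as we'll see in the rewritten version below), which invokes it again once the last wallpaper of the page is done, and the page_start[0] <= page_end guard terminates the recursion. If you prefer a plain loop, an equivalent sketch of mine (same helpers assumed) would be:

def start_iterative():
    # Loop-based equivalent of the recursive start(); avoids deep call stacks
    for page in range(page_start[0], page_end + 1):
        print('Preparing to download the ' + str(page) + ' page of all the "' + category + '" wallpapers...')
        links = get_paper_link(visit_page(PAGE_URL + str(page)))
        for index, link in enumerate(links):
            download_wallpaper(link, index, len(links), lambda: None)  # no-op callback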
Then rewrite the earlier download_wallpaper function a bit:
def download_wallpaper(link, index, total, callback):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = []
    for link in wallpaper_size_links:
        href = link.get('href')
        size_list.append({
            'size': eval(link.get_text().replace('x', '*')),
            'name': href.replace('/download/', ''),
            'url': href
        })
    biggest_one = max(size_list, key=lambda item: item['size'])
    print('Downloading the ' + str(index + 1) + '/' + str(total) + ' wallpaper: ' + biggest_one['name'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])
    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
    if index + 1 == total:
        print('Download completed!\n\n')
        callback()
Finally, specify the startup rules:
if __name__ == '__main__':
    start()
Running the Project
Start a test run by entering the following at the command line:
python3 aero 1 2
The following output can then be seen:
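Based on the script's print statements, the output looks roughly like this (the filenames are placeholders):

Preparing to download the 1 page of all the "aero" wallpapers...
Downloading the 1/10 wallpaper: <wallpaper-name>.jpg
Downloading the 2/10 wallpaper: <wallpaper-name>.jpg
...
Download completed!

Preparing to download the 2 page of all the "aero" wallpapers...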
Fire up Charles to capture the traffic, and you can see that the script is running smoothly:
At this point, the download script is complete, and you finally don't have to worry about a wallpaper drought!
That's everything I've put together for you. If you have any questions, feel free to discuss them in the comments below. Thank you for your support.