
Some suggestions for learning Python crawlers

1. Learn the Python packages and implement the basic crawling process

Most crawlers follow the same workflow: send a request, obtain the page, parse the page, then extract and store the content. This simply simulates what a browser does when it fetches a web page. Python has many crawler-related packages: urllib, requests, bs4, scrapy, pyspider, etc. It is recommended to start with requests + XPath: requests handles connecting to the website and returning the page, while XPath is used to parse the page and extract the data.

If you have used BeautifulSoup, you will find that XPath saves a lot of work, since you no longer have to inspect element code layer by layer. With this, the basic routine is largely the same, and ordinary static websites pose no problem at all. Of course, if you need to crawl asynchronously loaded websites, you can learn to inspect network requests in the browser's developer tools to find the real request, or learn Selenium to automate a real browser.
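As a minimal sketch of the "request - parse - extract" flow, the snippet below uses requests together with lxml's XPath support. The URL, headers, and XPath expressions are placeholders for illustration and would need to match the actual page you want to crawl.

```python
# Minimal sketch: fetch a page with requests, parse it with lxml, extract with XPath.
# The URL and XPath expressions below are placeholders, not a real target site.
import requests
from lxml import etree

url = "https://example.com/articles"          # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}       # look like a normal browser

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                   # fail fast on HTTP errors

tree = etree.HTML(response.text)              # parse the HTML
titles = tree.xpath("//h2[@class='title']/a/text()")   # hypothetical elements
links = tree.xpath("//h2[@class='title']/a/@href")

for title, link in zip(titles, links):
    print(title.strip(), link)
```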

2. Understand the storage of unstructured data

The data you crawl back can be stored locally as files or in a database. When the data volume is small at the beginning, you can simply save it as a CSV file with plain Python or with pandas. You may also find that the crawled data is not clean: there can be missing values, errors, and so on, so the data needs cleaning. Learning the basic usage of the pandas package lets you preprocess the data and end up with something much cleaner.
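As a rough sketch, the snippet below shows a typical pandas cleanup of crawled records before saving them to CSV. The field names and sample values are made up for illustration.

```python
# Clean a small batch of crawled records with pandas, then save to CSV.
import pandas as pd

records = [
    {"title": "Item A", "price": "19.9"},
    {"title": "Item B", "price": None},       # missing value from the crawl
    {"title": "Item A", "price": "19.9"},     # duplicate row
]

df = pd.DataFrame(records)
df = df.drop_duplicates()                      # remove duplicate rows
df = df.dropna(subset=["price"])               # drop rows with missing prices
df["price"] = df["price"].astype(float)        # fix the column type

df.to_csv("items.csv", index=False, encoding="utf-8")
```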

3. Learn Scrapy and build engineered crawlers

With the techniques above, data of ordinary scale is basically not a problem, but in very complex situations you may still find yourself stuck. That is where the powerful Scrapy framework comes in. Scrapy is a very capable crawler framework: it makes building requests easy, it ships with powerful selectors for parsing responses, and, most impressively, it delivers very high performance and lets you engineer and modularize your crawlers. Once you learn Scrapy, you can build crawler projects of your own, and you will basically have the mindset of a Python crawler engineer.
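Below is a minimal Scrapy spider sketch, modeled on the well-known quotes.toscrape.com practice site. In a real project the spider would live inside a project generated with `scrapy startproject`, and the selectors would be adapted to your target site.

```python
# Minimal Scrapy spider sketch: extract quotes and follow pagination links.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]   # a common practice site

    def parse(self, response):
        # Scrapy's built-in selectors support both CSS and XPath
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the "next page" link and keep crawling
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Inside a Scrapy project, you would run this spider with `scrapy crawl quotes -o quotes.json` to dump the items to a file.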

4. Learn database knowledge to deal with large-scale data storage and extraction

When the amount of crawled data is small, storing it as files works; once the volume grows, that becomes unworkable. So it is necessary to master a database, and learning the currently mainstream MongoDB is enough. MongoDB makes it convenient to store unstructured data such as comment text and image links, and PyMongo lets you operate MongoDB conveniently from Python. The database knowledge needed here is actually quite simple, mainly how to insert data and how to query it back out, so you can learn it as the need arises.
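Here is a minimal PyMongo sketch, assuming a MongoDB server running locally on the default port. The database, collection, and field names are arbitrary examples.

```python
# Store crawled items in MongoDB with PyMongo and query them back out.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # local MongoDB instance
collection = client["crawler_db"]["comments"]        # arbitrary names

# Insert an unstructured record, e.g. a comment with an image link
item = {"user": "alice", "comment": "Nice article!", "image": "https://example.com/a.jpg"}
collection.insert_one(item)

# Extract data later with a simple query
for doc in collection.find({"user": "alice"}):
    print(doc["comment"])
```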

5. Master various skills to deal with the anti-crawling measures of special websites

Of course, crawling also has its frustrating moments: your IP gets blocked by a website, strange CAPTCHAs appear, the User-Agent is restricted, content is loaded dynamically, and so on. Dealing with these anti-crawling measures takes some advanced skills, such as controlling access frequency, using a proxy IP pool, capturing packets, and OCR for CAPTCHAs. When trading off efficient development against anti-crawling, websites usually lean toward the former, which leaves room for crawlers. Once you master these counter-measures, most websites will no longer be difficult for you.
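The sketch below illustrates a few of these counter-measures with requests: rotating the User-Agent header, sending traffic through a proxy, and throttling the request rate. The proxy address and URLs are placeholders only.

```python
# A few common counter-measures: rotate User-Agent, use a proxy, throttle requests.
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
# Placeholder proxy address; in practice this would come from a proxy IP pool.
PROXIES = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}

def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}   # rotate User-Agent
    resp = requests.get(url, headers=headers, proxies=PROXIES, timeout=10)
    resp.raise_for_status()
    return resp.text

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    html = fetch(url)
    time.sleep(random.uniform(1, 3))   # control access frequency
```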

6. Distributed crawlers for large-scale concurrent acquisition and higher efficiency

Once crawling basic data is no longer a problem, your bottleneck becomes the efficiency of crawling massive amounts of data. At that point you will naturally come across a very powerful term: the distributed crawler. Distributed systems sound scary, but the idea is essentially the same as multi-threading: have multiple crawlers work at the same time. You need to master three tools: Scrapy, MongoDB, and Redis. Scrapy, which we covered earlier, handles the basic page crawling; MongoDB stores the crawled data; and Redis stores the queue of pages waiting to be crawled, that is, the task queue. So some things look scary, but once broken down they are quite manageable. When you can write a distributed crawler, you can try building some basic crawler architectures and achieve more automated data acquisition.
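As a sketch of how the pieces fit together, the settings fragment below assumes the third-party scrapy-redis extension, which lets several Scrapy processes share one Redis-backed request queue. The Redis and MongoDB addresses are placeholders, and the MongoPipeline class is a hypothetical pipeline you would write in your own project.

```python
# settings.py fragment for a distributed crawl (assumes the scrapy-redis extension):
# Scrapy does the crawling, Redis holds the shared request queue,
# and MongoDB (via a custom pipeline) stores the results.

# Redis-backed scheduler and duplicate filter so that several crawler
# processes share one task queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                      # keep the queue between runs
REDIS_URL = "redis://localhost:6379"          # placeholder Redis address

# Hypothetical MongoDB pipeline for persisting scraped items.
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,
}
MONGO_URI = "mongodb://localhost:27017"       # placeholder MongoDB address
MONGO_DATABASE = "crawler_db"
```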

If you follow the Python crawler learning route above and complete it step by step, even a complete beginner can become an experienced hand, and things will feel easy and smooth once learned. So when starting out, try not to systematically chew through everything first; find a practical project and get hands-on right away.

These are some suggestions for learning Python crawlers. For more information about Python crawlers, please check out my other related articles!