
Analyzing Python General-Purpose Crawlers and Focused Crawlers

I. A simple understanding of crawlers

1. What is a crawler?

A web crawler is also called a web spider: if the Internet is compared to a spider's web, then the crawler is the spider crawling around on that web. A crawler program requests a URL and analyzes the response content to collect data. For example, if the response content is HTML, it analyzes the DOM structure, either parsing the DOM or using regular-expression matching; if the response content is XML/JSON data, it converts the data into an object and then parses it.
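
The request-and-parse cycle described above can be illustrated with a minimal standard-library sketch. The URL and the title-extracting regular expression below are only placeholders for illustration, not part of any particular site.

    import json
    import re
    import urllib.request

    def fetch_and_parse(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            content_type = resp.headers.get("Content-Type", "")
            body = resp.read().decode("utf-8", errors="replace")

        if "json" in content_type:
            # JSON responses can be turned into Python objects directly.
            return json.loads(body)
        # For HTML, either walk the DOM with a parser or fall back to a regex;
        # here a simple regex pulls out the page title as an example.
        match = re.search(r"<title>(.*?)</title>", body, re.S | re.I)
        return match.group(1).strip() if match else None

    print(fetch_and_parse("https://example.com"))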

2. What is a crawler used for?

Batch data collection through effective crawling can reduce labor costs, increase the amount of useful data, provide data support for operations and sales, and accelerate product development.

3. The state of the crawler industry

Competition among Internet products is fierce at present. Most of the industry uses crawler technology to mine, collect, and analyze data on competing products and big data; it has become a necessary capability, and many companies have established dedicated crawler-engineer positions.

4. Legitimacy

Crawlers are programs used to bulk-collect publicly available information from web pages, that is, the data displayed on the front end. Because the information is completely public, collecting it is legal. In fact, a crawler works much like a browser: the browser parses the response content and renders it as a page, while the crawler parses the response content and extracts the desired data for storage.

5. Anti-crawlers

Crawlers are difficult to stop completely; as the saying goes, "as the Dao rises one foot, the demon rises ten." This is a war without gunpowder smoke: programmer versus programmer.

Some common anti-crawler measures (a minimal header-setting sketch follows this list):

  • Legitimacy detection: request verification (User-Agent, Referer, signed interface parameters, etc.)
  • The "small black room": limiting request frequency per IP or per user, or banning them outright
  • Poisoning: the highest level of anti-crawling is not to intercept at all, since interception only works for a moment; returning false data instead can mislead a competitor's decision-making
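
The sketch below shows how a crawler typically deals with basic legitimacy checks: it sends browser-like User-Agent and Referer headers and pauses between requests to stay under frequency limits. The header values and the delay are illustrative assumptions, not values required by any real site.

    import time
    import urllib.request

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "https://example.com/",
    }

    def polite_get(url, delay=1.0):
        request = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(request, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        time.sleep(delay)  # throttle requests to avoid the "small black room" (IP bans)
        return html

    print(len(polite_get("https://example.com")))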

II. General-Purpose Crawlers

Depending on the usage scenario, web crawlers can be categorized into general-purpose crawlers and focused crawlers.

1. General-Purpose Web Crawlers

A general-purpose web crawler is an important component of the crawling system of search engines (Baidu, Google, Yahoo). Its main purpose is to download web pages from the Internet to the local machine, forming a mirror backup of Internet content.

The basic workflow of a general-purpose crawler is as follows (a minimal sketch follows the list):

  1. First, select a set of carefully chosen seed URLs;
  2. Place these URLs in the queue of URLs to be crawled;
  3. Take a URL from the queue of URLs to be crawled, resolve its DNS to get the host's IP, download the corresponding web page, and store it in the downloaded page library. In addition, put the URL into the queue of crawled URLs.
  4. Analyze the URLs in the queue of crawled URLs, extract other URLs from the downloaded pages, place them in the queue of URLs to be crawled, and proceed to the next loop...
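
Here is a minimal sketch of that four-step workflow: seed URLs go into a to-be-crawled queue, each page is downloaded and stored, its links are extracted, and new URLs are fed back into the queue. The seed URL, the in-memory page store, and the regex-based link extractor are simplifying assumptions.

    import re
    import socket
    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    def crawl(seed_urls, max_pages=10):
        to_crawl = deque(seed_urls)      # queue of URLs to be crawled
        crawled = set()                  # set of crawled URLs
        page_store = {}                  # stands in for the downloaded page library

        while to_crawl and len(crawled) < max_pages:
            url = to_crawl.popleft()
            if url in crawled:
                continue
            try:
                host = urlparse(url).hostname
                socket.gethostbyname(host)   # DNS resolution -> host IP
                with urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                     # skip URLs that fail to resolve or download
            page_store[url] = html
            crawled.add(url)
            # extract links and feed them back into the to-be-crawled queue
            for href in re.findall(r'href="(.*?)"', html):
                to_crawl.append(urljoin(url, href))
        return page_store

    pages = crawl(["https://example.com"])
    print(list(pages))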

2. How Search Engines Work

With the rapid development of the Web, the World Wide Web has become the carrier of a large amount of information, and how to effectively extract and utilize this information has become a great challenge. Users usually use search engines (Yahoo, Google, Baidu, and so on) as the access point to the World Wide Web.

The web crawler is a very important component of a search-engine system: it gathers web pages from the Internet and collects the information used to build the search engine's index, and it determines the richness and timeliness of the whole engine's content, so its performance directly affects the search engine's effectiveness.

Step 1: Crawling Web Pages

Search engine crawlers are called "spiders" or "robots" because they use specialized software to follow links on the web, crawling from one link to another, just like a spider crawling on its web.

However, search engine spiders are bound by certain rules; they need to follow certain commands or the contents of certain files.
The full name of the Robots Protocol (also called the crawler protocol, robots protocol, etc.) is the "Robots Exclusion Protocol". Through the Robots Protocol, a website tells search engines which pages may be crawled and which may not.

/robots.txt

It is only an agreement, and whether to comply is entirely up to the crawler's author. For example, a bus may carry a sticker saying "Please give up your seat for the elderly, the sick, the disabled, and pregnant women", yet many people do not follow it. Generally speaking, only the major search engine crawlers will follow your site's protocol, while others will not even look at what you have written.
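
For a crawler that does want to honour the Robots Exclusion Protocol, the standard library already provides a parser. The site and user-agent name below are placeholders for illustration.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's /robots.txt

    # Only fetch the page if the rules allow this user agent to crawl it.
    if rp.can_fetch("MyCrawler", "https://example.com/some/page"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")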

Step 2: Data Storage

Search engines crawl web pages by following links with their spiders and store the crawled data in a database of original pages. The stored page data is exactly the same as the HTML a user's browser would receive from the site. While crawling pages, search engine spiders also perform a certain amount of duplicate-content detection; if they encounter large amounts of plagiarized, scraped, or duplicated content on a site with very low weight, they are likely to stop crawling that site.
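
The sketch below illustrates this storage step under simplifying assumptions: SQLite stands in for the "original page database", and a hash of the page body provides a crude form of duplicate-content detection, skipping pages whose content has already been seen.

    import hashlib
    import sqlite3

    conn = sqlite3.connect("pages.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, "
        "content_hash TEXT, html TEXT)"
    )

    def store_page(url, html):
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        duplicate = conn.execute(
            "SELECT 1 FROM pages WHERE content_hash = ?", (digest,)
        ).fetchone()
        if duplicate:
            return False  # duplicated content: skip storing it again
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, digest, html)
        )
        conn.commit()
        return True

    print(store_page("https://example.com", "<html>demo</html>"))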

Step 3: Pre-processing

Search engines perform a variety of pre-processing steps on the pages fetched by their spiders:

  • Extract text
  • Chinese word segmentation
  • Remove stop words
  • Eliminate noise (search engines need to recognize and eliminate noise such as copyright notices, navigation bars, advertisements, etc. ...)
  • Forward indexing (see the toy sketch after this list)
  • Inverted indexing
  • Link relationship calculation
  • Special file handling
  • ....
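
The forward-index and inverted-index steps can be illustrated with a toy sketch. It assumes the pages have already been reduced to plain text and split into words; a real engine would use a proper Chinese word segmenter and stop-word lists.

    from collections import defaultdict

    documents = {
        "page1": "python crawler collects data",
        "page2": "focused crawler filters pages by topic",
    }

    # Forward index: document -> list of the terms it contains.
    forward_index = {doc_id: text.split() for doc_id, text in documents.items()}

    # Inverted index: term -> set of documents containing that term.
    inverted_index = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted_index[term].add(doc_id)

    print(sorted(inverted_index["crawler"]))  # -> ['page1', 'page2']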

In addition to HTML files, search engines can usually crawl and index a wide range of text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files; we often see these file types in search results as well.

However, search engines cannot yet handle non-text content such as images, videos, and Flash, nor can they execute scripts and programs.

Step 4: Ranking and Providing Search Services

A search engine is a system that, following a certain strategy and using specific computer programs, collects information from the Internet, organizes and processes it, provides search services for users, and displays information relevant to the user's query.

However, these general-purpose search engines have limitations:

  1. Users in different fields and with different backgrounds often have different search goals and needs, and the results returned by general search engines contain a large number of web pages that a given user is not interested in.
  2. The goal of a general search engine is to achieve the greatest possible web coverage, so the contradiction between the limited resources of search-engine servers and the unlimited resources of web data will deepen further.
  3. The richness of data forms on the World Wide Web and the continuous development of web technologies have produced large amounts of varied data such as images, databases, audio, and video/multimedia; general search engines are often powerless to discover and access such information-dense data with a certain structure.
  4. Most general search engines provide keyword-based search and do not support queries based on semantic information.

III. Focused Crawler (Focused Crawler)

  • A focused crawler, also known as a topic crawler (or specialized crawler), is a web crawler "oriented to a specific topic". It differs from the general-purpose crawler in that a focused crawler filters by topic while crawling, trying to ensure that only page information related to the topic is collected.
  • Rather than aiming for broad coverage, a focused crawler aims to crawl pages whose content is related to a specific topic, preparing data resources for user queries on that topic.
  • Focused crawling is a complex process: according to a web-page analysis algorithm, links unrelated to the topic are filtered out, while useful links are kept and placed in the queue of URLs to be crawled. The crawler then selects the next URL from the queue according to a search strategy and repeats the process until a stopping condition of the system is reached (a minimal sketch follows this list).
  • In addition, all crawled pages are stored, analyzed, filtered, and indexed for later querying and retrieval; for a focused crawler, the results of this process may also provide feedback and guidance for future crawling.
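
Below is a hedged sketch of that focused-crawling loop: each page is scored against a set of topic keywords, off-topic pages are discarded, and only links from relevant pages go back into the queue. The keywords, the crude keyword-counting score standing in for a web-page analysis algorithm, and the seed URL are illustrative assumptions, not a standard algorithm.

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    TOPIC_KEYWORDS = {"python", "crawler", "spider"}

    def relevance(text):
        # crude stand-in for a web-page analysis algorithm: count topic keywords
        words = re.findall(r"[a-zA-Z]+", text.lower())
        return sum(1 for w in words if w in TOPIC_KEYWORDS)

    def focused_crawl(seed, max_pages=10, min_score=1):
        to_crawl, seen, results = deque([seed]), set(), {}
        while to_crawl and len(results) < max_pages:
            url = to_crawl.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                with urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                  # skip URLs that fail to download
            if relevance(html) < min_score:
                continue                  # filter out pages unrelated to the topic
            results[url] = html
            for href in re.findall(r'href="(.*?)"', html):
                to_crawl.append(urljoin(url, href))
        return results

    print(list(focused_crawl("https://example.com")))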

The above is a detailed analysis of Python general-purpose crawlers and focused crawlers. For more information about Python crawlers, please pay attention to my other related articles!