
Ten suggestions for preventing website content collection (scraping)

The author has written collectors (scrapers) himself, so he has some experience with preventing site scraping. Since this was written during working hours, each method is only described briefly.
When implementing many anti-collection methods you have to consider whether they interfere with search engines crawling the site, so let us first look at the differences between a typical collector and a search engine crawler.

Similarities:

a. Both need to fetch the page's source code directly in order to work;

b. Both will fetch a large amount of the target site's content many times within a short period;

c. Viewed over time, both change their IP addresses;

d. Both are too impatient to crack your page-level encryption (or verification), such as content encrypted through a js file, a verification code required before the content can be viewed, or a login required to access the content.

Differences:

A search engine crawler first strips the scripts, styles, and HTML tag code from the page source, then performs a series of complex operations such as word segmentation and grammatical/syntactic analysis on the remaining text. A collector, by contrast, generally uses features of the HTML tags to capture the data it needs: when creating a collection rule, you fill in a start marker and an end marker for the target content in order to locate it, or you write a regular expression tailored to the specific page to filter out what is needed. Whether start/end markers or regular expressions are used, HTML tags are involved (that is, analysis of the page structure).
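
As a concrete illustration, here is a minimal sketch of such a collection rule in PHP (the URL, markers, and class names below are invented for the example):

    <?php
    // Hypothetical collector rule: grab the text between a start marker and an end marker.
    $html  = file_get_contents('http://example.com/article.html');  // placeholder target URL
    $start = '<div class="content">';   // start marker from the collection rule
    $end   = '</div>';                  // end marker from the collection rule
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    if (preg_match($pattern, $html, $m)) {
        $body = $m[1];                  // the piece of content the collector wanted
    }
    ?>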

Now for some anti-collection methods.

1. Limit the number of visits per IP address per unit of time

Analysis: No ordinary person can visit the same website five times within one second unless it is a program doing the visiting, and visitors with that habit are pretty much only search engine crawlers and annoying collectors. (A sketch of such a limit is given at the end of this point.)

Disadvantages: This is a one-size-fits-all measure, so it also stops search engines from indexing the site.

Applicable websites: websites that do not rely too much on search engines

What the collector will do: reduce the number of visits per unit of time, at the cost of collection efficiency.
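
A minimal sketch of a per-IP rate limit, assuming the APCu cache is available and a threshold of 5 requests per second (both are assumptions, not part of the original text):

    <?php
    // Hypothetical sketch: reject an IP that makes more than 5 requests within one second.
    $ip   = $_SERVER['REMOTE_ADDR'];
    $key  = 'hits_' . $ip . '_' . time();      // one counter bucket per IP per second
    $hits = apcu_fetch($key) ?: 0;
    if ($hits >= 5) {
        http_response_code(429);               // Too Many Requests
        exit('Too many requests');
    }
    apcu_store($key, $hits + 1, 2);            // keep the counter for 2 seconds
    ?>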

2. Block IP

Analysis: A background counter records each visitor's IP and access frequency; the webmaster then reviews the access log by hand and blocks any suspicious IPs. (A sketch of the check appears at the end of this point.)

Disadvantages: There seem to be no real disadvantages, except that it keeps the webmaster a little busy.

Applicable websites: all websites, provided the webmaster can tell which robots belong to Google or Baidu.

What the collector will do: fight a guerrilla war! Collect through IP proxies and switch them from time to time, although this reduces both the collector's efficiency and its network speed.
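
A minimal sketch of such a block-list check and access log (the file names blocked_ips.txt and access.log are placeholders, and the block list is assumed to exist with one IP per line):

    <?php
    // Hypothetical sketch: deny access to IPs the webmaster has put on a block list,
    // and log every visit so the access frequency can be reviewed later.
    $blocked = file('blocked_ips.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if (in_array($_SERVER['REMOTE_ADDR'], $blocked, true)) {
        http_response_code(403);
        exit('Access denied');
    }
    file_put_contents('access.log', $_SERVER['REMOTE_ADDR'] . ' ' . date('c') . PHP_EOL, FILE_APPEND);
    ?>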

3. Use js to encrypt web content

Note: I have never used this method myself; I have only seen it described elsewhere.

Analysis: Nothing to analyze here; it shuts out search engine crawlers and collectors alike.

Applicable websites: Websites that hate search engines and collectors extremely

What the collector will do: you are that determined and willing to give it all up, so he simply will not collect from you.

4. Hide the website's copyright notice, or some random junk text, in the page, with the styles that hide it kept in the CSS file

Analysis: Although this cannot prevent collection, it will fill the collected content with your site's copyright statement or junk text, because the collector generally does not fetch your CSS files at the same time, so that text loses its hiding style and gets displayed. (A sketch is given at the end of this point.)

Applicable websites: All websites

What the collector will do: the copyright text is easy to deal with; he just replaces it. Against random junk text there is not much to be done, so be diligent with it.
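
A minimal sketch of hiding a copyright sentence inside the article body, assuming a CSS class named hidden-note (the class name, the sentence, and the variable $content are invented for the example):

    <?php
    // Hypothetical sketch. In the CSS file (which the collector does not fetch):
    //   .hidden-note { display: none; }
    // Scatter a hidden copyright sentence at a random position inside the article body.
    $note = '<span class="hidden-note">This article was originally published on example.com.</span>';
    $paragraphs = explode('</p>', $content);               // $content holds the article HTML
    $paragraphs[rand(0, count($paragraphs) - 1)] .= $note;
    echo implode('</p>', $paragraphs);
    ?>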

5. Allow users to access the content only after logging in

Analysis: Search engine crawlers will not write a login routine for every website of this kind. I have heard, though, that a collector can be set up for a given site to simulate the user logging in and submitting the form. (A sketch of the login gate follows at the end of this point.)

Applicable websites: Websites that hate search engines extremely and want to block most collectors

What the collector will do: build a module that simulates a user logging in and submitting the form.
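
A minimal sketch of such a login gate, assuming PHP sessions and a login.php page (both placeholders):

    <?php
    // Hypothetical sketch: only show the article body to logged-in users.
    session_start();
    if (empty($_SESSION['user_id'])) {
        header('Location: /login.php');   // redirect anonymous visitors to the login page
        exit;
    }
    // ... output the protected content here ...
    ?>
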
6. Use a scripting language to generate the paging (hide the paging links)

Analysis: The same point applies here: a search engine crawler will not analyze every site's hidden paging, which hurts the site's inclusion in search engines. A collector, however, has to analyze the target page's code anyway when writing collection rules, and anyone who knows a little scripting will work out the real link addresses of the pages. (A sketch follows at the end of this point.)

Applicable websites: websites that do not depend heavily on search engines, and whose would-be collectors do not understand scripting.

What the collector will do: it has to be said that he will simply do the work. He has to analyze your page code anyway, and analyzing your paging script along the way does not cost much extra time.
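
A minimal sketch of script-built paging, where the real page URL only comes together in the browser (the URL pattern is invented for the example):

    <?php
    // Hypothetical sketch: emit the next-page link in pieces that only a browser reassembles,
    // so a fixed start/end marker or a simple URL regex never sees the full "/article_p2.html".
    $nextPage = 2;
    echo '<a href="javascript:void(0)" onclick="location.href=\'/article\' + \'_p' . $nextPage . '\' + \'.html\'">Next page</a>';
    ?>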

7. Anti-leech (hotlink-protection) measures (content may only be viewed through this site's own pages, e.g. by checking HTTP_REFERER)

Analysis: ASP and PHP can read the HTTP_REFERER header of the request to judge whether it comes from this site, and thereby restrict the collector; but this also restricts search engine crawlers and seriously affects how much of the hotlink-protected content gets included by search engines. (A sketch of the check follows at the end of this point.)

Applicable websites: sites that do not care about being included in search engines.

What the collector will do: forging HTTP_REFERER is not difficult.
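
A minimal sketch of such a referer check in PHP (the example.com domain is a placeholder):

    <?php
    // Hypothetical sketch: only serve the content when the request appears to come from our own pages.
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    if (strpos($referer, 'http://example.com/') !== 0) {
        http_response_code(403);
        exit('Hotlinking is not allowed');
    }
    // ... output the protected content here ...
    ?>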

8. Present the website content entirely in Flash, images, or PDF

Analysis: Both search engine crawlers and collectors handle these formats poorly, as anyone who knows a little SEO is aware.

Applicable websites: media or design sites that do not care about being included in search engines.

What the collector will do: stop collecting and leave.

9. Have the website randomly switch between different templates

Analysis: Because the collector locates the required content based on the page structure, its collection rules break as soon as the template changes, which is exactly what we want; and this has no effect on search engine crawlers. (A sketch of random template selection follows at the end of this point.)

Applicable websites: dynamic websites where user experience is not a concern.

What the collector will do: a single website rarely has more than 10 templates, so he simply writes one collection rule per template and applies a different rule to each. If there really are more than 10 templates, then since the target site has gone to such lengths to keep switching templates, he will do it the favor of withdrawing.
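
A minimal sketch of random template selection (the template file names and the $article variable are placeholders):

    <?php
    // Hypothetical sketch: render the same article data through one of several templates, chosen at random.
    $templates = ['article_a.tpl.php', 'article_b.tpl.php', 'article_c.tpl.php'];
    $tpl = $templates[array_rand($templates)];
    // $article is assumed to hold the page data; each template prints it with a different HTML structure.
    include $tpl;
    ?>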

10. Use dynamic, irregular HTML tags

Analysis: This one is rather twisted. An HTML tag renders the same whether or not it contains extra spaces, so <div> and <div   > look identical on the page, but to a collection rule they are two different markers. If the number of spaces inside the HTML tags is randomized from page to page, the collection rules become invalid; yet this has little impact on search engine crawlers. (A sketch follows at the end of this point.)

Applicable websites: all dynamic websites whose owners are willing to ignore web design standards.

What the collector will do: there are still countermeasures. Plenty of HTML cleaners exist nowadays; he simply normalizes the HTML tags before applying the collection rules, and can still get the data he needs.
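
A minimal sketch of one way to inject random whitespace into the page's tags before output (the filter pads the space before each closing ">"; the function name and $page_html are invented, and the regex only handles simple tags):

    <?php
    // Hypothetical sketch: pad every tag with a random number of spaces before its closing ">".
    // The page renders identically, but fixed start/end markers in collection rules no longer match.
    function randomize_tags($html) {
        return preg_replace_callback('/<(\/?[a-z][a-z0-9]*)([^>]*)>/i', function ($m) {
            return '<' . $m[1] . $m[2] . str_repeat(' ', rand(1, 4)) . '>';
        }, $html);
    }
    echo randomize_tags($page_html);   // $page_html is assumed to hold the rendered page
    ?>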

    

Summary:

Having to face search engine crawlers and collectors at the same time is a rather helpless situation, because the first step a search engine takes is to fetch the content of the target page, which works on the same principle as a collector. Many anti-collection methods therefore also hinder search engines from indexing the site. Frustrating, isn't it? Although the ten suggestions above cannot protect you from collection 100%, applying several of them together will turn away the majority of collectors.