Libraries used for parsing URLs.
Python 2 version:
from urlparse import urlparse
import urllib
Python 3 version:
from urllib.parse import urlparse, unquote
After studying the URL patterns of different sites, I found that when the search keyword is attached with "=" (for example wd=python), the keyword ends up in the query component of the parsed URL.
When it is not attached with "=" (for example list_python), the keyword ends up in the path component.
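The distinction can be seen directly in what `urlparse` returns. The snippet below is a minimal sketch for Python 3; both URLs and their host names are made up for illustration and do not come from any real site.

```python
from urllib.parse import urlparse

# Hypothetical URLs, used only to illustrate the two cases.
query_style = "https://search.example.com/s?ie=utf-8&wd=python&rn=10"
path_style = "https://jobs.example.com/jobs/list_python?fromSearch=true"

q = urlparse(query_style)
p = urlparse(path_style)

print(q.query)  # ie=utf-8&wd=python&rn=10   (the "wd=" marker is in the query)
print(p.path)   # /jobs/list_python          (the "list_" marker is in the path)
```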
The matching rule is the same for every site: a single regular expression that combines six cases of plain and percent-encoded text.
In addition, the host '' URL-encodes its keyword differently from the others, so it has to be handled separately.
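Judging from the `replace("%25", "%")` call in the code and from the sample paths that start with /weibo/, the difference appears to be double percent-encoding, where each "%" has itself been encoded as "%25". A minimal Python 3 sketch of undoing it, using a shortened fragment of one sample path:

```python
from urllib.parse import unquote

# Fragment of a double-encoded keyword: "%25E5" stands for "%E5", and so on.
double_encoded = "%25E5%2594%2590%25E4%25BA%25BA"

# Approach used in the article: turn "%25" back into "%", then decode once.
once = double_encoded.replace("%25", "%")
keyword = unquote(once)

# Equivalent for this input: decode twice.
keyword_twice = unquote(unquote(double_encoded))

print(once)     # %E5%94%90%E4%BA%BA
print(keyword)  # 唐人
```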
The code is as follows. The rules of some sites are not obvious and take a lot of time to work out; the clearer the rule, the cleaner the extracted keyword. The rules below already fit the vast majority of sites, so use them as a reference at your discretion.
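# -*- coding: utf-8 -*-
import re

try:  # Python 3
    from urllib.parse import urlparse, unquote
except ImportError:  # Python 2
    from urlparse import urlparse
    from urllib import unquote

# File containing the URLs to analyse, one URL per line
source_txt = "E:\\python_Anaconda_code\\"

# Keyword pattern: six combinations of word characters and %XX escapes
regular = r'(\w+(%\w\w)+\w+|(%\w\w)+\w+(%\w\w)+|\w+(%\w\w)+|(%\w\w)+\w+|(%\w\w)+|\w+)'

# Store keywords
kw_list = list()

# key: host of the site being studied (left blank here; each key must be a real host name)
# value: the marker that immediately precedes the keyword
dict = {
    "": "wd=",
    "": "word=",
    "": "query=",
    "": "kw=",
    "": "word=",
    "": "k=",
    "": "q=",
    "": "list_",
    "": "query=",
    "": "weibo/"
}


def Main():
    with open(source_txt, 'r') as f_source_txt:
        for url in f_source_txt:
            # Host is the part between "//" and the next "/"
            host = url.split("//")[1].split("/")[0]
            if host in dict:
                flag = dict[host]
                if flag.find("=") != -1:
                    # Marker ends with "=": the keyword is in the query string
                    query = urlparse(url).query.replace('+', '')
                    kw = re.search(flag + regular, query)
                    if kw:
                        kw = unquote(kw.group(0).split(flag)[1])
                        print(kw)
                else:
                    # Otherwise the keyword is in the path; undo double encoding first
                    path = urlparse(url).path.replace('+', '')
                    kw = re.search(flag + regular, path.replace("%25", "%"))
                    if kw:
                        kw = unquote(kw.group(0).split(flag)[1])
                        print(kw)


if __name__ == '__main__':
    Main()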
The source file contains the following URLs, one per line:
/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&ch=&tn=baidu&bar=&wd=python&rn=&oq=&rsv_pq=ece0867c0002c793&rsv_t=edeaQq7DDvZnxq%2FZVra5K%2BEUanlTIUXhGIhvuTaqdfOECLuXR25XKDp%2Bi0I&rqlang=cn&rsv_enter=1&inputT=218
/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python%E9%87%8C%E7%9A%84%E5%AD%97%E5%85%B8dict&oq=python&rsv_pq=96c160e70003f332&rsv_t=0880NkOvMIr3TvOdDP1t8EbloD8qwr4yeP6CfPjQihQNNhdExfuwyOFMrx0&rqlang=cn&rsv_enter=0&inputT=10411
/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python%E9%87%8C%E7%9A%84urlprese&oq=python%25E9%2587%258C%25E7%259A%2584re%25E9%2587%258C%25E7%259A%2584%257C%25E6%2580%258E%25E4%25B9%2588%25E7%2594%25A8&rsv_pq=d1d4e7b90003d391&rsv_t=5ff4Vok4EELK1PgJ4oSk8L0VvKAn51%2BL8ns%2FjSubexg7Lb7znKcTvnVtn8M&rqlang=cn&rsv_enter=1&inputT=2797
/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python++wo+%E7%88%B1urlprese&oq=python%25E9%2587%258C%25E7%259A%2584urlprese&rsv_pq=eecf45e900033e87&rsv_t=1c70xAYhrvw5JOZA7lpVgt4pw%2BW1TO8hqTejTh67JgEQfqAGyDydd25HAmU&rqlang=cn&rsv_enter=0&inputT=10884
/ns?word=%E8%B6%B3%E7%90%83&tn=news&from=news&cl=2&rn=20&ct=1
/ns?ct=1&rn=20&ie=utf-8&bs=%E8%B6%B3%E7%90%83&rsv_bp=1&sr=0&cl=2&f=8&prevct=no&tn=news&word=++++++%E8%B6%B3++%E7%90%83+++++%E4%BD%A0%E5%A5%BD+%E5%98%9B%EF%BC%9F&rsv_sug3=14&rsv_sug4=912&rsv_sug1=4&inputT=8526
/f?ie=utf-8&kw=%E7%BA%A2%E6%B5%B7%E8%A1%8C%E5%8A%A8&fr=search&red_tag=q0224393377
/web?query=ni+zai+%E6%88%91+%E5%BF%83li&_asf=&_ast=1520388441&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=9493&sst0=1520388440692&lkt=8%2C1520388431200%2C1520388436842&sugsuv=1498714959961744&sugtime=1520388440692
/jobs/list_python%E5%A4%A7%E6%95%B0%E6%8D%AEmr?labelWords=&fromSearch=true&suginput=
/pc/search/?query=%E6%85%A2%E6%80%A7%E4%B9%99%E8%82%9D%
/weibo/%25E5%2594%2590%25E4%25BA%25BA%25E8%25A1%2597%25E6%258E%25A2%25E6%25A1%25882&Refer=index
/weibo/%25E4%25BD%25A0%25E5%25A5%25BD123mm%2520%25E5%2597%25AF%2520mm11&Refer=STopic_box
The results are as follows:
To study other hosts, add them to the dictionary dict together with the marker that precedes their keyword.
Note: the code and ideas above are for reference only. If you know a better way, please leave a comment!
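One possible alternative for the "=" case, offered only as a sketch and not as the method used above, is to let the standard library split the query string with `parse_qs` instead of searching it with a regular expression. A plain search for "q=" would also match inside "oq=", whereas `parse_qs` looks up the exact parameter name. The function name `keyword_from_query` and the example URL are assumptions made for illustration. Note that `parse_qs` turns "+" into a space rather than dropping it, which differs from the code above.

```python
from urllib.parse import urlparse, parse_qs


def keyword_from_query(url, param):
    """Hypothetical helper: return the decoded value of query parameter `param`, or None."""
    values = parse_qs(urlparse(url).query).get(param)
    return values[0] if values else None


# Made-up URL: "oq=" precedes "q=", which would confuse a plain search for "q=".
url = "https://search.example.com/s?oq=old&q=python+%E5%AD%97%E5%85%B8"

print(keyword_from_query(url, "q"))   # python 字典
print(keyword_from_query(url, "wd"))  # None
```

This only covers keywords carried in the query string; keywords embedded in the path still need a marker-based rule such as the one in the code above.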