Goal
Two years ago I built a room-hunting robot for my own rental search: it crawled Douban rental groups and pushed new listings to WeChat at regular intervals. After a period of maintenance I abandoned it.
The code was fairly simple, so I never open-sourced it at the time. It recently occurred to me that open-sourcing it might help some people find rental information more easily, so I cleaned it up and published it on GitHub: /facert/zufang
Below is a brief introduction to how it works, written at the time:
Anyone renting in Beijing knows how exhausting the search is. Douban rental groups are a comparatively reliable source of listings, but the posts are noisy and the groups have no search function, so getting rental information in real time is hard. So I recently set myself a little project: a WeChat room-hunting robot. The screenshot below shows the general effect:
Implementation
First, a Scrapy crawler scrapes the Douban Beijing rental groups in real time. For full-text search, the titles and descriptions are tokenized and indexed with jieba and whoosh, and search is exposed as an API. Next comes the client side: there is an open-source WeChat robot, [wxBot](/liuwons/wxBot), which I modified to support timed push and persistence. Finally, I gave a WeChat public account the same capability, so it also supports real-time rental search.
Partial code
Scrapy supports custom item pipelines, which makes it easy to build the search index in real time as items come in. See the code:
```python
from datetime import datetime

from whoosh.writing import AsyncWriter


class IndexPipeline(object):
    def __init__(self, index):
        self.index = index

    @classmethod
    def from_crawler(cls, crawler):
        # read the index directory from the Scrapy settings
        return cls(
            index=crawler.settings.get('WHOOSH_INDEX', 'indexes')
        )

    def process_item(self, item, spider):
        # AsyncWriter queues the write if the index is locked,
        # so indexing doesn't block the spider
        writer = AsyncWriter(get_index(self.index, zufang_schema))
        create_time = datetime.strptime(item['create_time'], "%Y-%m-%d %H:%M:%S")
        writer.update_document(
            url=item['url'].decode('utf-8'),
            title=item['title'],
            description=item['description'],
            create_time=create_time
        )
        writer.commit()
        return item
```
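For the pipeline to run, it has to be registered in the project settings. The post doesn't show this part; a minimal sketch, assuming the module path `zufang.pipelines`, would be:

```python
# settings.py -- the module path 'zufang.pipelines' is an assumption
# about the project layout; adjust to match your own project.
ITEM_PIPELINES = {
    'zufang.pipelines.IndexPipeline': 300,  # 0-1000; lower numbers run earlier
}

# Directory holding the whoosh index, read by IndexPipeline.from_crawler
WHOOSH_INDEX = 'indexes'
```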
The search API code is simple:
```python
from whoosh.qparser import MultifieldParser


def zufang_query(keywords, limit=100):
    ix = get_index('indexes', zufang_schema)
    # search across both the title and the description fields
    content = ["title", "description"]
    query = MultifieldParser(content, ix.schema).parse(keywords)
    result_list = []
    with ix.searcher() as searcher:
        # newest listings first
        results = searcher.search(query, sortedby="create_time",
                                  reverse=True, limit=limit)
        for i in results:
            result_list.append({'url': i['url'], 'title': i['title'],
                                'create_time': i['create_time']})
    return result_list
```
Summary
That is the whole of this article. I hope it has some reference value for your study or work; thank you for your support.