
Example analysis of processing project data with Scrapy in Python

After processing data, we usually put it back in its original location, but this carries a hidden risk: when new data is added, or we need to use the file again for some other reason, it can be hard to dig the data back out, and there is no better way to gather it; re-collecting and re-organizing everything is clearly unrealistic. Here we look at how a Python crawler built with Scrapy handles project data.

1. Pulling items

$ git clone https://github.com/jonbakerfish/TweetScraper.git

$ cd TweetScraper/

$ pip install -r requirements.txt  # add '--user' if you are not root

$ scrapy list

$ #If the output is 'TweetScraper', then you are ready to go.
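
Once `scrapy list` prints TweetScraper, the crawler is ready to run. The invocation below follows the project's README, where a crawl is started by passing a search query to the spider; the query string here is only a placeholder example.

$ scrapy crawl TweetScraper -a query="foo,#bar"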

2. Data persistence

Reading the documentation, we find that the project offers three ways to persist the data: saving it to a file, saving it to MongoDB, or saving it to a MySQL database. Since the data we capture needs to be analyzed at a later stage, it is saved in MySQL.

The captured data is saved by default in JSON format on disk under ./Data/tweet/, so the configuration file TweetScraper/settings.py needs to be modified.

ITEM_PIPELINES = {
  # 'TweetScraper.pipelines.SaveToFilePipeline': 100,
  # 'TweetScraper.pipelines.SaveToMongoPipeline': 100, # replace `SaveToFilePipeline` with this to use MongoDB
  'TweetScraper.pipelines.SavetoMySQLPipeline': 100,   # replace `SaveToFilePipeline` with this to use MySQL
}

# settings for MySQL
MYSQL_SERVER = "18.126.219.16"
MYSQL_DB     = "scraper"
MYSQL_TABLE  = "tweets"       # the table will be created automatically
MYSQL_USER   = "root"         # MySQL user to use (should have INSERT access granted to the Database/Table)
MYSQL_PWD    = "admin123456"  # MySQL user's password
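
For readers who want to see what such a pipeline does under the hood, the sketch below shows the general shape of a Scrapy item pipeline that writes items to MySQL. It is only an illustration of the mechanism enabled above, not the project's actual SavetoMySQLPipeline; the pymysql dependency, the table layout, and the item field names ('ID', 'text') are assumptions.

# Illustrative sketch only -- not TweetScraper's real pipeline code.
# Assumes `pip install pymysql` and a `tweets` table with `id` and `text` columns.
import pymysql

class MySQLPipelineSketch(object):
    def open_spider(self, spider):
        # Connect using the values defined in settings.py
        self.conn = pymysql.connect(
            host=spider.settings.get('MYSQL_SERVER'),
            user=spider.settings.get('MYSQL_USER'),
            password=spider.settings.get('MYSQL_PWD'),
            db=spider.settings.get('MYSQL_DB'),
            charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one row per scraped tweet; skip duplicates on the primary key
        self.cursor.execute(
            "INSERT IGNORE INTO tweets (id, text) VALUES (%s, %s)",
            (item.get('ID'), item.get('text')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()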

Content extension:

settings.py is the project's configuration file. For reference, a minimal spider from the classic Scrapy tutorial looks like this:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the second-to-last segment of the URL path
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
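
With the parse method above, each start URL is saved to a file named after the second-to-last segment of its path (here, Books and Resources). Assuming the spider file sits inside a Scrapy project, it is run from the project root with:

$ scrapy crawl dmoz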

This concludes the example analysis of processing project data with Scrapy in Python. For more on how a Python Scrapy crawler handles project data, please search my earlier posts or continue browsing the related articles below. I hope you will keep supporting me in the future!