Python Basics Tutorial: A Distributed Bilibili Crawler with scrapy-redis

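Scrapy computes a fingerprint for every request and uses it to decide whether that URL has already been requested. Fingerprint-based deduplication is enabled by default, so each URL is fetched only once. When a crawl is distributed across several machines, the workers need the same kind of fingerprint so that they do not collect duplicate data. Scrapy's default fingerprints, however, are kept locally by each process. We can therefore store the fingerprints in Redis instead and use a Redis set to check for duplicates.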
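The idea above can be sketched in a few lines. This is a simplified illustration, not the actual Scrapy or scrapy-redis code: the helper `request_fingerprint`, the classes `InMemorySetStore` and `SharedDupeFilter`, and the key name `bilibili:dupefilter` are all hypothetical, and the real fingerprint also takes the request body into account. The in-memory store stands in for a Redis set so the example runs without a server; its `sadd` copies the behavior of the Redis `SADD` command, which returns 1 when the member is new and 0 when it already exists.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def request_fingerprint(method: str, url: str) -> str:
    """Hash the HTTP method and a normalized URL into a hex fingerprint."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    normalized = urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))
    return hashlib.sha1(f"{method.upper()} {normalized}".encode("utf-8")).hexdigest()


class InMemorySetStore:
    """Stand-in for a Redis set (assumption: a real setup would use a Redis client)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        # Mimic Redis SADD: 1 if the member was added, 0 if it was already there.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


class SharedDupeFilter:
    """Duplicate filter whose state lives in a store shared by all workers."""

    def __init__(self, store, key="bilibili:dupefilter"):
        self.store = store
        self.key = key

    def request_seen(self, method, url):
        # The request is a duplicate when its fingerprint was already in the set.
        return self.store.sadd(self.key, request_fingerprint(method, url)) == 0
```

Because every worker talks to the same store, a URL that one machine has already requested is reported as seen on all the others. With local fingerprints, each machine would only know about its own requests.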

settings.py

			
    # -*- coding: utf-8 -*-

    # Scrapy settings for bilibili project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'bilibili'

    SPIDER_MODULES = ['bilibili.spiders']
    NEWSPIDER_MODULE = 'bilibili.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'bilibili (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    # ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 1
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }

    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'bilibili.middlewares.BilibiliSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
        'bilibili.middlewares.randomUserAgentMiddleware': 400,
    }

    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    ITEM_PIPELINES = {
        'bilibili.pipelines.BilibiliPipeline': 300,
        # Lower values run first; RedisPipeline then pushes the processed items to Redis
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

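    # Use the scrapy-redis scheduler so all workers share one request queue in Redis
    SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

    # Store request fingerprints in a Redis set so deduplication is shared across machines
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

    # Redis server that holds the request queue and the fingerprint set
    REDIS_URL = 'redis://127.0.0.1:6379'

    # Hand out pending requests in priority order
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'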
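
The settings above register `bilibili.middlewares.randomUserAgentMiddleware`, but the article does not show that class. A minimal sketch of what such a downloader middleware might look like follows. The `USER_AGENTS` pool and the constructor argument are assumptions made for illustration, not the author's implementation; the only thing relied on from Scrapy is that a downloader middleware may define `process_request(request, spider)` and that returning `None` lets processing continue.

```python
import random

# Hypothetical pool of user-agent strings; a real project would supply its own.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class randomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=None):
        # Fall back to the module-level pool when no list is passed in.
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        # Overwrite the header before the request reaches the downloader.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing this request.
        return None
```

The class name keeps the lowercase spelling used in `DOWNLOADER_MIDDLEWARES` so that the import path in the settings resolves; it would live in `bilibili/middlewares.py`.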
