2016-2020 Chinese-Weather-Analysis (Part 2)
1. Building the Tianqihoubao Crawler
Before we start coding, we first analyze the Tianqihoubao website against the project requirements. The goal is to extract each city's daily temperature, weather condition, and wind force/direction data for 2016-2020. Start from the historical-weather index page (http://www.tianqihoubao.com/lishi/), shown in Figure 1.
Figure 1
The list shows the cities under each province. Taking Beijing as an example, clicking it leads to the second-level page.
Figure 2
Taking Beijing's weather in January 2011 as an example, clicking through leads to the third-level (detail) page, which contains the date, weather condition, temperature, and wind force/direction information we need.
Figure 3
That completes the analysis of the crawling workflow, so coding can begin. First, switch on the command line to the directory where the project should live, then run the following commands to create the crawler project and the spider module:
scrapy startproject tqhbCrawl
cd tqhbCrawl
scrapy genspider -t crawl tqhb_spider "tianqihoubao.com/lishi/"
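For orientation, the commands above produce Scrapy's standard project skeleton, roughly as sketched below (the spider file is added by genspider; exact names follow the arguments you pass to startproject and genspider):

tqhbCrawl/
├── scrapy.cfg
└── tqhbCrawl/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── tqhb_spider.py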
2. Defining the Item
After creating the project, the first step is to define the Item, i.e. the structured data we want to extract. The code is as follows:
import scrapy


class TqhbItem(scrapy.Item):
    # City name
    city_name = scrapy.Field()
    # Date
    date = scrapy.Field()
    # Weather condition
    state = scrapy.Field()
    # Wind force and direction
    wind = scrapy.Field()
    # Temperature
    temp = scrapy.Field()
3. Writing the Spider Module
The genspider command already created a spider template based on the CrawlSpider class, named TqhbSpiderSpider. Next we implement the page parsing, which involves two methods. The detail_url method parses the listing shown in Figure 2 and extracts the links to the third-level pages; the parse method extracts the basic information shown in Figure 3. The second-level links are extracted by the rule defined in rules (the rule can only extract links that match it from the start_urls pages, which is why detail_url is needed to build the third-level URLs). The complete TqhbSpiderSpider code is as follows:
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import TqhbItem


class TqhbSpiderSpider(CrawlSpider):
    name = 'tqhb_spider'
    allowed_domains = ['tianqihoubao.com']
    start_urls = ['http://tianqihoubao.com/lishi']

    rules = (
        Rule(LinkExtractor(allow='.+lishi.+html'), callback="detail_url", follow=False),
    )

    def detail_url(self, response):
        # Second-level page: collect the monthly detail-page links and request them
        base_url = "http://tianqihoubao.com"
        divs = response.xpath("//div[@class='box pcity']")[5:9]
        detail_urls = divs.xpath(".//a/@href").getall()
        for detail_url in detail_urls:
            yield scrapy.Request(base_url + detail_url, callback=self.parse)

    def parse(self, response):
        # Third-level page: extract city, date, weather condition, temperature and wind
        city_name = response.xpath('//div[@id="s-calder"]/h2/text()').get()
        # Strip digits and the trailing suffix from the heading, leaving the city name
        city_name = ''.join(re.findall(r'[^0-9]', city_name))[:-9]
        trs = response.xpath("//tr")[1:]
        for tr in trs:
            tds = tr.xpath(".//td")
            date = tds[0].xpath(".//text()").getall()
            date = "".join(''.join(date).split())
            state = tds[1].xpath(".//text()").getall()
            state = "".join(''.join(state).split())
            temp = tds[2].xpath(".//text()").getall()
            temp = "".join(''.join(temp).split())
            wind = tds[3].xpath(".//text()").getall()
            wind = "".join(''.join(wind).split())
            item = TqhbItem(
                city_name=city_name,
                date=date,
                state=state,
                temp=temp,
                wind=wind)
            yield item
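To sanity-check the row-parsing logic before running the full crawl, you can exercise the same XPath expressions against a hand-written HTML fragment with Scrapy's Selector. The fragment below is made up for illustration only; the real markup on tianqihoubao.com may differ:

from scrapy import Selector

# A made-up detail-page table fragment, shaped like the one parse() expects.
SAMPLE_HTML = """
<table>
  <tr><td>日期</td><td>天气状况</td><td>气温</td><td>风力风向</td></tr>
  <tr>
    <td>2016年01月01日</td>
    <td>晴 / 多云</td>
    <td>3℃ / -5℃</td>
    <td>北风 3-4级 / 无持续风向 微风</td>
  </tr>
</table>
"""

sel = Selector(text=SAMPLE_HTML)
for tr in sel.xpath("//tr")[1:]:  # skip the header row, just like the spider does
    tds = tr.xpath(".//td")
    date = "".join(''.join(tds[0].xpath(".//text()").getall()).split())
    temp = "".join(''.join(tds[2].xpath(".//text()").getall()).split())
    print(date, temp)  # e.g. 2016年01月01日 3℃/-5℃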
4. The Pipeline
Next we write the pipeline, which stores each Item into a CSV file.
from scrapy.exporters import CsvItemExporter


class TqhbPipeline(object):
    def __init__(self):
        self.fp = open("tqhb.csv", 'wb')
        self.exporter = CsvItemExporter(
            self.fp, encoding='utf-8')

    def open_spider(self, spider):
        print("Spider started....")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished....")
Finally, uncomment the following lines in settings.py:
ITEM_PIPELINES = {
    'tqhb.pipelines.TqhbPipeline': 300,
}
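As an aside, recent Scrapy versions (2.1 and later) can export CSV without a custom pipeline through the built-in feed exports. A minimal sketch, assuming you are happy with Scrapy's default CSV output, would be to put this in settings.py instead:

# Built-in feed export alternative to the custom CSV pipeline (Scrapy >= 2.1).
FEEDS = {
    "tqhb.csv": {"format": "csv", "encoding": "utf-8"},
}

The same effect can also be had ad hoc by running scrapy crawl tqhb_spider -o tqhb.csv.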
5. Dealing with Anti-Crawling Mechanisms
To avoid being flagged by anti-crawling mechanisms, the project mainly relies on faking a random 'User-Agent', auto-throttling, and disabling robots.txt compliance.
1. Fake a random User-Agent by writing middlewares.py:
import random


class TqhbDownloaderMiddleware(object):
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.23 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.27 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4280.87 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.57",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.54",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.50",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.6",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.8",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.14",
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.user_agents)
        request.headers["User-Agent"] = user_agent
Then enable the middleware and set DEFAULT_REQUEST_HEADERS in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'tqhb.middlewares.TqhbDownloaderMiddleware': 543,
}

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
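The priority of 543 matters here: it is higher than that of Scrapy's built-in UserAgentMiddleware, so our process_request runs after the built-in one and the random header overrides any globally configured USER_AGENT.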
2. Auto-throttling settings:
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
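If you want to watch how AutoThrottle adjusts the delay between requests while the crawl runs, Scrapy also provides an optional debug switch:

# Print AutoThrottle stats for every response received (optional).
AUTOTHROTTLE_DEBUG = True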
3. Disable robots.txt compliance:
ROBOTSTXT_OBEY = False
6. Running the Project
Create start.py in the project directory with the following code:
from scrapy import cmdline

cmdline.execute("scrapy crawl tqhb_spider".split())
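Running python start.py from the directory that contains scrapy.cfg is equivalent to typing scrapy crawl tqhb_spider on the command line; it is just more convenient when launching the crawl from an IDE.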
The stored result is the tqhb.csv file containing all crawled records.
All of the code above can be downloaded from my GitHub account: https://github.com/chyhoo/2016-2020Chinese-Weather-Analysis
Source: https://www.cnblogs.com/chyhoo/p/14581518.html