Crawler (18): Scrapy Framework (5) Scrapy Generic Spiders (3)

                score = score.replace(',"', ".0")    # normalize a raw score such as '9,"' to '9.0'
            else:
                score = score
        else:
            score = 'NULL'
        return score

    def get_story(self, response):
        # book introduction paragraph
        story = response.xpath('//div[@class="book-intro"]/p/text()').extract()[0]
        if len(story) > 0:
            story = story.strip()
        else:
            story = 'NULL'
        return story

    def get_news(self, response):
        # link text of the latest chapter
        news = response.xpath('//div[@class="detail"]/p[@class="cf"]/a/text()').extract()[0]
        if len(news) > 0:
            news = news.strip()
        else:
            news = 'NULL'
        return news
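Note that each helper indexes extract()[0], which raises an IndexError when the XPath matches nothing, so the len() check only ever guards against an empty string. If you want the 'NULL' fallback to cover both cases, Scrapy's extract_first() accepts a default value. The following is my own variant of get_story(), not the tutorial's code, showing that approach:

    def get_story(self, response):
        # extract_first() returns the default instead of raising when nothing matches
        story = response.xpath('//div[@class="book-intro"]/p/text()').extract_first(default='NULL')
        # strip whitespace; fall back to 'NULL' if only whitespace was extracted
        return story.strip() or 'NULL'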
Nothing else in the project changes; the only other addition is a set of default request headers in settings.py:

    
    	
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv 11.0) like Gecko',
    }
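
DEFAULT_REQUEST_HEADERS in settings.py applies to every spider in the project. If you only want these headers for the read spider, Scrapy also supports a per-spider custom_settings attribute that overrides the project settings; the following is just a sketch of that alternative, not part of the original tutorial:

    from scrapy.spiders import CrawlSpider

    class ReadSpider(CrawlSpider):
        name = 'read'
        # overrides the project-wide settings for this spider only
        custom_settings = {
            'DEFAULT_REQUEST_HEADERS': {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv 11.0) like Gecko',
            },
        }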

    1.3.4 Running the program

    scrapy crawl read
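
If you also want the scraped items written to a file, the same command can enable Scrapy's feed export with the -o flag (the filename below is just an example):

    scrapy crawl read -o books.json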

    Result of the run:

    (screenshot of the console output)

    1.3.5 Complete code

    read.py:

    
    	
    # -*- coding: utf-8 -*-

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from qd.items import QdItem
    import requests


    class ReadSpider(CrawlSpider):
        name = 'read'
        # allowed_domains = ['qidian.com']
        start_urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1']

        rules = (
            # rule for the listing pages: follow them so deeper pages get crawled
            Rule(LinkExtractor(allow=r'https://www.qidian.com/all\?orderId=\&style=1\&pageSize=20\&siteid=1\&pubflag=0\&hiddenField=0\&page=(\d+)'), follow=True),
            # rule for the book detail pages: parse them, but do not follow further
            Rule(LinkExtractor(allow=r'https://book.qidian.com/info/(\d+)'), callback='parse_item', follow=False),
        )

        def parse_item(self, response):
            item = QdItem()
            item['book_name'] = self.get_book_name(response)
            item['author'] = self.get_author(response)
            item['state'] = self.get_state(response)
            item['type'] = self.get_type(response)
            item['about'] = self.get_about(response)
            item['score'] = self.get_score(response)
            item['story'] = self.get_story(response)
            item['news'] = self.get_news(response)
            yield item

        def get_book_name(self, response):
            book_name = response.xpath('//h1/em/text()').extract()[0]
            if len(book_name) > 0:
                book_name = book_name.strip()
            else:
                book_name = 'NULL'
            return book_name

        def get_author(self, response):
            author = response.xpath('//h1/span/a/text()').extract()[0]
            if len(author) > 0:
                author = author.strip()
            else:
                author = 'NULL'
            return author
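
For reference, the QdItem imported from qd.items has to declare every field assigned in parse_item. The items module is not shown in this part, so the following is only a sketch based on the field names used above:

    # qd/items.py (sketch, derived from the fields used in parse_item)
    import scrapy

    class QdItem(scrapy.Item):
        book_name = scrapy.Field()
        author = scrapy.Field()
        state = scrapy.Field()
        type = scrapy.Field()
        about = scrapy.Field()
        score = scrapy.Field()
        story = scrapy.Field()
        news = scrapy.Field()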
    