Crawler (17): Scrapy Framework (4): Scraping JD product data with Selenium (3)

pipelines.py defines a MongoPipeline that writes each scraped item into MongoDB. (The import, class header, and constructor below are reconstructed from the MONGO_URL, MONGO_DB, and COLLECTION settings this project uses.)

    import pymongo

    class MongoPipeline(object):
        # Reconstructed: pull the MongoDB settings in via from_crawler
        def __init__(self, mongo_url, mongo_db, collection):
            self.mongo_url = mongo_url
            self.mongo_db = mongo_db
            self.collection = collection

        @classmethod
        def from_crawler(cls, crawler):
            return cls(mongo_url=crawler.settings.get('MONGO_URL'),
                       mongo_db=crawler.settings.get('MONGO_DB'),
                       collection=crawler.settings.get('COLLECTION'))

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_url)
            self.db = self.client[self.mongo_db]

        def process_item(self, item, spider):
            # name = item.__class__.collection
            name = self.collection
            self.db[name].insert_one(dict(item))
            return item

        def close_spider(self, spider):
            self.client.close()
1.6 Configure the settings file

In the settings file, define the configuration items this project uses: KEYWORDS, MAX_PAGE, SELENIUM_TIMEOUT (the page-load timeout), MONGO_URL, MONGO_DB, and COLLECTION.

    KEYWORDS = ['iPad']
    MAX_PAGE = 2

    MONGO_URL = 'localhost'
    MONGO_DB = 'test'
    COLLECTION = 'ProductItem'

    SELENIUM_TIMEOUT = 30
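SELENIUM_TIMEOUT is consumed by the Selenium downloader middleware built in an earlier part of this series. As a minimal sketch of how that setting can reach the middleware, the constructor and internals below are assumptions; only the class name matches the project:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    class SeleniumMiddleware(object):
        def __init__(self, timeout):
            self.browser = webdriver.Chrome()
            # SELENIUM_TIMEOUT bounds how long the middleware waits
            # for page elements to load before raising a timeout
            self.wait = WebDriverWait(self.browser, timeout)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))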

Also update the configuration to activate the downloader middleware and the item pipeline. The numbers are priorities: for downloader middlewares, lower values sit closer to the engine; for item pipelines, lower values run first.

    DOWNLOADER_MIDDLEWARES = {
        'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
    }

    ITEM_PIPELINES = {
        'scrapyseleniumtest.pipelines.MongoPipeline': 300,
    }

1.7 Execution results

With all of the project's code and configuration in place, run the project:

    scrapy crawl jd

After the run completes, check the data in MongoDB to confirm that the crawl succeeded.
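A quick way to inspect the stored data, assuming MongoDB is running locally with the MONGO_DB and COLLECTION values configured above:

    import pymongo

    client = pymongo.MongoClient('localhost')
    db = client['test']
    # Count and preview the documents written by MongoPipeline
    print(db['ProductItem'].count_documents({}))
    print(db['ProductItem'].find_one())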

1.8 Complete code

items.py:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    from scrapy import Item, Field

    class ProductItem(Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # dp = Field()
        # title = Field()
        # price = Field()
        # comment = Field()
        # url = Field()
        # type = Field()
        pass
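ProductItem is left empty because the spider below yields plain dicts. If you instead preferred to yield Item instances and use the commented-out `item.__class__.collection` lookup in MongoPipeline, the class might look like this (a sketch, not the tutorial's code):

    from scrapy import Item, Field

    class ProductItem(Item):
        # lets MongoPipeline resolve the target collection via
        # item.__class__.collection instead of the COLLECTION setting
        collection = 'ProductItem'

        dp = Field()
        title = Field()
        price = Field()
        comment = Field()
        url = Field()
        type = Field()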

jd.py:

    # -*- coding: utf-8 -*-
    from scrapy import Request, Spider
    from urllib.parse import quote
    from bs4 import BeautifulSoup

    class JdSpider(Spider):
        name = 'jd'
        allowed_domains = ['www.jd.com']
        base_url = 'https://search.jd.com/Search?keyword='

        def start_requests(self):
            for keyword in self.settings.get('KEYWORDS'):
                for page in range(1, self.settings.get('MAX_PAGE') + 1):
                    url = self.base_url + quote(keyword)
                    # dont_filter=True disables duplicate filtering, since the
                    # same search URL is requested once per page
                    yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)

        def parse(self, response):
            soup = BeautifulSoup(response.text, 'lxml')
            lis = soup.find_all(name='li', class_="gl-item")
            for li in lis:
                proc_dict = {}
                dp = li.find(name='span', class_="J_im_icon")
                if dp:
                    proc_dict['dp'] = dp.get_text().strip()
                else:
                    continue
                id = li.attrs['data-sku']
                title = li.find(name=
    