Crawler (17): Scrapy Framework (4): Integrating Selenium to Scrape JD.com Product Data (2)


page)
            submit.click()  # click the page-jump button

            time.sleep(5)

            # Wait until the current-page indicator shows the requested page number.
            # EC.text_to_be_present_in_element checks that the given string appears in the element's text.
            self.wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'), str(page)))

            # Wait for #J_goodsList (the product data) to load, then return the page source.
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList')))

            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8', status=200)
        except TimeoutException:
            # The page did not load in time, so hand the spider a 500 response instead.
            return HtmlResponse(url=request.url, status=500, request=request)

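In __init__() I first initialize several objects, including the WebDriverWait instance, and set the browser window size and the page-load timeout. In process_request(), we read the page number to crawl from the Request's meta attribute, bind the page-number input box to the input variable and the page-jump button to the submit variable, type the page number into the input box, click the button, and wait for the page to load. We then return an HtmlResponse directly to the spider for parsing. The request never passes through the downloader, because we construct the HtmlResponse (a subclass of Response) ourselves. (When a downloader middleware's process_request() returns a Response object, the process_request() methods of lower-priority middlewares are no longer executed, and the process_response() methods run instead. This example defines no other process_response(), so the result goes straight to the spider for parsing.)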
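The short-circuit behaviour described in the parenthesis can be illustrated with a toy model. This is only a sketch in plain Python, not Scrapy's actual implementation: `Response`, `SeleniumLikeMiddleware`, `LaterMiddleware`, and `run_chain` are hypothetical names, and the real `process_request()` also takes a `spider` argument.

```python
class Response:
    """Stand-in for scrapy.http.Response (hypothetical, illustration only)."""

    def __init__(self, url, status, body=""):
        self.url = url
        self.status = status
        self.body = body


class SeleniumLikeMiddleware:
    """Builds the response itself, as the Selenium middleware above does."""

    def process_request(self, request_url):
        return Response(request_url, 200, "<html>rendered</html>")


class LaterMiddleware:
    """A lower-priority middleware; records whether it was ever called."""

    def __init__(self):
        self.called = False

    def process_request(self, request_url):
        self.called = True
        return None  # None means "carry on with the next middleware"


def run_chain(middlewares, request_url):
    """Call process_request in priority order; stop at the first response."""
    for middleware in middlewares:
        result = middleware.process_request(request_url)
        if result is not None:
            return result  # short-circuit: the downloader is never reached
    # Nothing short-circuited, so a real framework would download here.
    return Response(request_url, 200, "<html>downloaded</html>")
```

Running `run_chain([SeleniumLikeMiddleware(), LaterMiddleware()], url)` returns the rendered response and never calls the second middleware, which mirrors why the page source produced by the browser reaches the spider without a second download.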

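1.4 Parsing the page

The Response object is then passed back to the callback function in the Spider, so the next step is to implement that callback and parse the page.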

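# Requires `from bs4 import BeautifulSoup` at the top of the spider module.
def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    # Each product on the results page is an <li class="gl-item">.
    lis = soup.find_all(name='li', class_="gl-item")
    for li in lis:
        proc_dict = {}
        # 'dp' is read from span.J_im_icon; items without it are skipped.
        dp = li.find(name='span', class_="J_im_icon")
        if dp:
            proc_dict['dp'] = dp.get_text().strip()
        else:
            continue
        # The SKU id is reused to build the price class, the comment id and the URL.
        # Named sku_id to avoid shadowing the built-in id().
        sku_id = li.attrs['data-sku']
        title = li.find(name='div', class_="p-name p-name-type-2")
        proc_dict['title'] = title.get_text().strip()
        price = li.find(name='strong', class_="J_" + sku_id)
        proc_dict['price'] = price.get_text()
        comment = li.find(name='a', id="J_comment_" + sku_id)
        # '条评论' means "comments"; the suffix is kept as in the original output.
        proc_dict['comment'] = comment.get_text() + '条评论'
        url = 'https://item.jd.com/' + sku_id + '.html'
        proc_dict['url'] = url
        proc_dict['type'] = 'JINGDONG'
        yield proc_dict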

Here we parse the page with BeautifulSoup: we match all product entries, iterate over the results, and pick out each product's fields in turn.
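The fields selected above are raw strings. If numeric values are wanted later, they can be normalized first. The following is a minimal sketch under assumptions that the original does not state: that price text looks like `￥5999.00` and that comment counts look like `500+` or `2万+` (万 means ten thousand). The helper names `parse_price` and `parse_comment_count` are hypothetical.

```python
import re


def parse_price(text):
    """Return the first decimal number in a price string, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def parse_comment_count(text):
    """Convert a count such as '500+' or '2万+' to an int, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(万)?", text)
    if not match:
        return None
    number = float(match.group(1))
    if match.group(2):
        number *= 10_000  # 万 is a unit of ten thousand
    # round() guards against float artifacts such as 22999.999999999996
    return int(round(number))
```

Both helpers return None when no number is found, so the caller can decide whether to drop the item or keep the raw text.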

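1.5 Storing the results

Once the page data has been extracted, it is sent to the item pipeline for processing, cleaning, and storage, so we now need to define an item pipeline. Here we store the data in a MongoDB database.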

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    # from_crawler is a class method, marked with @classmethod. It is a form of
    # dependency injection whose argument is the crawler.
    # Through the crawler we can read every entry of the global configuration,
    # so any setting defined in settings.py is available here.
    # This method is therefore mainly used to fetch configuration from settings.py.
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )
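To see the dependency injection in from_crawler at work without starting a crawl, the crawler can be replaced by a stand-in object. This is only a sketch: `SettingsDrivenPipeline` and `fake_crawler` are hypothetical names, the setting values are placeholders rather than values from this project, and a plain dict stands in for Scrapy's settings object because both offer `.get()`.

```python
from types import SimpleNamespace


class SettingsDrivenPipeline:
    """Hypothetical pipeline that only stores the configuration it is given."""

    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Same pattern as MongoPipeline: read each value from the settings.
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION'),
        )


# Stand-in for the crawler Scrapy would pass in (placeholder values).
fake_crawler = SimpleNamespace(settings={
    'MONGO_URL': 'mongodb://localhost:27017',
    'MONGO_DB': 'jd_demo',
    'COLLECTION': 'products',
})

pipeline = SettingsDrivenPipeline.from_crawler(fake_crawler)
```

In a real project the three keys would be defined in settings.py, and the pipeline would be enabled through the ITEM_PIPELINES setting mentioned in the header comment.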
