Scraping 30,000+ reviews with Python: is Aquaman, rated 9.5 on Maoyan, worth watching?


Preface

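On December 7, 2018, Aquaman, the last blockbuster of the year, opened as scheduled. Its Maoyan rating currently stands at 9.5, and on a production budget of 150 million US dollars it has punched above its weight, with a box office now close to 900 million. This article scrapes more than 30,000 Maoyan reviews to look, from several angles, at whether the film is worth watching. The real reason, of course, is that I haven't seen it myself, because I'm broke.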


Aquaman

Data scraping

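Maoyan's movie pages now appear to be entirely server-side rendered, and no comment endpoint shows up when inspecting the page. Following the approach used in earlier articles on scraping Maoyan data, I found the comment endpoint: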
http://m.maoyan.com/mmdb/comments/movie/249342.json?v=yes&offset=15&startTime=2018-12-08%2019%3A17%3A16
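
As a rough illustration of how that URL is put together, the sketch below builds it from a movie ID and a timestamp. The helper name `build_comment_url` is hypothetical, and the query parameters are simply the ones visible in the URL above; the endpoint is undocumented, so what each parameter really means is an assumption.

```python
from urllib.parse import quote

# Path taken from the URL in the article; only the movie ID varies.
BASE = "http://m.maoyan.com/mmdb/comments/movie/{movie_id}.json"


def build_comment_url(movie_id, start_time, offset=15):
    """Return the comment URL for one page of comments (hypothetical helper)."""
    # quote() percent-encodes the space as %20 and the colons as %3A.
    encoded_time = quote(start_time, safe="")
    return (BASE.format(movie_id=movie_id)
            + "?v=yes&offset={}&startTime={}".format(offset, encoded_time))


print(build_comment_url(249342, "2018-12-08 19:17:16"))
```

Letting `quote` do the encoding avoids typing `%20` and `%3A` into the timestamp by hand.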

Figure: inspecting the page shows no link to the comments

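We now have the endpoint but not the movie ID. That is no real obstacle: by capturing the Maoyan app's traffic with Charles, we found the movie ID for Aquaman.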

Figure: obtaining the movie ID

Next, scrape the comments:

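# Fetch the raw response body for a URL
from requests import request

def get_data(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    }
    # The timeout keeps a stalled connection from hanging the crawl
    response = request(method='GET', url=url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.content
    return None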
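
Since `get_data` returns `None` on a non-200 response and can raise on a network error, the fetch step could be wrapped in a small retry helper. The following is only a sketch: `fetch_with_retry` is a hypothetical name, `fetch` stands for any callable such as `get_data`, and the retry count and delay are arbitrary assumptions.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.5):
    """Call fetch(url) until it returns a non-None body or retries run out."""
    for attempt in range(retries):
        try:
            body = fetch(url)
        except Exception:
            body = None  # treat a network error like a failed response
        if body is not None:
            return body
        if attempt < retries - 1:
            time.sleep(delay)  # brief pause before trying again
    return None
```

It would be called as `fetch_with_retry(get_data, url)`, which keeps the retry policy in one place.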

Parsing the data returned by the endpoint

# Parse the JSON returned by the endpoint into a list of comment dicts
import json

def parse_data(html):
    comments = []
    try:
        # json.loads no longer accepts an encoding argument (removed in Python 3.9)
        json_data = json.loads(html)['cmts']
        for item in json_data:
            comment = {
                'nickName':item['nickName'],
                'cityName':item['cityName'] if 'cityName' in item else '',
                'content':item['content'].strip().replace('\n',''),
                'score':item['score'],
                'startTime': item['startTime']
            }
            comments.append(comment)
    except Exception as e:
        print(e)
    # Always return a list, so callers never receive None after an error
    return comments
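
To see the parsing step in isolation, here is a small self-contained sketch run against a made-up payload. The field names follow `parse_data` above; the values, and the helper name `extract_comments`, are invented for illustration and do not come from the real endpoint.

```python
import json

# Hypothetical payload: field names follow parse_data, values are invented.
sample = json.dumps({
    "cmts": [
        {"nickName": "user_a", "cityName": "Beijing",
         "content": " Great\nvisuals ", "score": 5,
         "startTime": "2018-12-08 19:17:16"},
        {"nickName": "user_b", "content": "Too long", "score": 3.5,
         "startTime": "2018-12-08 19:17:10"},  # no cityName on purpose
    ]
}).encode("utf-8")


def extract_comments(raw):
    """Flatten the 'cmts' list into plain dicts, tolerating missing keys."""
    records = []
    for item in json.loads(raw).get("cmts") or []:
        records.append({
            "nickName": item.get("nickName", ""),
            "cityName": item.get("cityName", ""),
            "content": item.get("content", "").strip().replace("\n", ""),
            "score": item.get("score"),
            "startTime": item.get("startTime", ""),
        })
    return records


for record in extract_comments(sample):
    print(record)
```

Using `dict.get` with a default handles a missing `cityName` the same way as the conditional expression in `parse_data`.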

Building the URLs and storing the data

import datetime
import time

def change_url_and_save():
    time_format = '%Y-%m-%d %H:%M:%S'
    start_time = datetime.datetime.now()
    end_time = datetime.datetime(2018, 12, 7)  # release day: stop paging here
    while start_time > end_time:
        # Compare datetimes, not strings; the space is sent as %20 in the URL
        url = ("http://m.maoyan.com/mmdb/comments/movie/249342.json?v=yes&offset=15&startTime="
               + start_time.strftime(time_format).replace(' ', '%20'))
        try:
            html = get_data(url)
        except Exception as e:
            # One retry after a short pause if the request itself failed
            time.sleep(0.5)
            html = get_data(url)
        else:
            time.sleep(0.1)
        comments = parse_data(html)
        if not comments:
            break  # nothing returned: stop instead of indexing an empty result
        # Continue from one second before the last comment on this page
        oldest = comments[-1]['startTime']
        print(oldest)
        start_time = datetime.datetime.strptime(oldest, time_format) - datetime.timedelta(seconds=1)
        # Open the output file once per page rather than once per comment
        with open('/Users/mac/Desktop/H5DOC/H5learn/REPTILE/comments.txt', 'a', encoding='utf-8') as f:
            for item in comments:
                print(item)
                f.write(item['nickName'] + ',' + item['cityName'] + ',' + item['content'] + ',' + str(item['score']) + ',' + item['startTime'] + '\n')
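
Two details of the loop above can be sketched separately: stepping the `startTime` cursor back by one second, and writing rows so that a comma inside a comment does not break the columns. This is a sketch, not the article's method: `previous_second` and `write_comments` are hypothetical names, and using the standard `csv` module is a swapped-in alternative to joining fields with `','`.

```python
import csv
import io
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
FIELDS = ["nickName", "cityName", "content", "score", "startTime"]


def previous_second(timestamp):
    """Return the timestamp one second earlier, in the same format."""
    moment = datetime.strptime(timestamp, TIME_FORMAT) - timedelta(seconds=1)
    return moment.strftime(TIME_FORMAT)


def write_comments(comments, stream):
    """Write comment dicts as CSV rows; csv quotes fields containing commas."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, lineterminator="\n")
    writer.writerows(comments)


# StringIO stands in for a file opened with open(path, "a", newline="").
buffer = io.StringIO()
write_comments([{"nickName": "user_a", "cityName": "Beijing",
                 "content": "Loud, but fun", "score": 4.5,
                 "startTime": "2018-12-08 19:17:16"}], buffer)
print(buffer.getvalue())
print(previous_second("2018-12-08 00:00:00"))
```

Working with `datetime` objects directly avoids the round trip through `time.mktime`, and the quoted output can be read back with `csv.reader` or pandas without misaligned columns.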