Preface

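The text and images in this article come from the internet and are for learning and discussion only, with no commercial use. Copyright belongs to the original authors; if there is any problem, please contact us promptly so that we can deal with it.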

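When we browse a web page, the browser renders HTML, JS, CSS and other resources, and through these we see the news, images, films, comments, products and so on that we are looking for. Usually, when we find something we need, we copy the text and download the images by hand. That does not scale to large volumes of text and images. A search engine such as Baidu, for example, has to fetch and index the latest articles from a huge number of sites on a schedule every day. Neither that volume of data nor that daily schedule can be handled manually, and this is where a web crawler proves its worth.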

Overview:

Without further ado, let's begin our forum-crawling journey.

1. Importing the modules

# encoding:utf8
import requests
from bs4 import BeautifulSoup

requests is the HTTP request module the crawler uses to fetch data. BeautifulSoup is the web page parsing module used to process the fetched HTML.

2. Fetching the resource behind a URL


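Given a URL, this function (called getHtmlText in the code that follows) fetches the page with requests.get(). It is the building block that retrieves the resource behind a URL.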
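The later functions call getHtmlText(url), use the result as a BeautifulSoup object, and compare it with an empty string to detect a failed fetch, but its definition does not appear in the text. A minimal sketch consistent with those call sites might look like this. The timeout parameter, the status check, and the choice of 'html.parser' are assumptions, not taken from the article.

```python
import requests
from bs4 import BeautifulSoup


def getHtmlText(url, timeout=10):
    """Fetch `url` and return it parsed as a BeautifulSoup object.

    Returns "" on any request error, which matches the `soup == ""`
    check used by the caller. The `timeout` default is an assumption.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()               # treat HTTP 4xx/5xx as failures
        r.encoding = r.apparent_encoding   # guess the charset from the body
        return BeautifulSoup(r.text, 'html.parser')
    except requests.RequestException:
        print("Failed to fetch", url)
        return ""
```

Returning an empty string rather than raising keeps the crawl loop simple: the caller can skip a bad page and move on to the next URL.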

3. Getting the list of threads

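def getHtmlList(url_list, url, main_url):
    """Append the URL of every thread linked from the board page at `url`."""
    try:
        soup = getHtmlText(url)
        # Each thread title sits in a <td class="td-title faceblue"> cell
        managesInfo = soup.find_all('td', attrs={'class': 'td-title faceblue'})
        for cell in managesInfo:
            for a in cell.find_all('a'):  # links pointing at the threads
                href = a.attrs.get('href')
                if href:  # skip anchors without an href
                    url_list.append(main_url + href)  # store the thread URL
        print(url_list)
    except Exception:
        print("Failed to fetch the page")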

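Given a URL, this function calls the first function, getHtmlText, to fetch and parse the finance board page, extracts the URLs of the threads listed on it, and stores them in the list. The result is the set of links to every thread on that page, ready for the scraping step that follows.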
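To see the extraction step in isolation, the sketch below runs the same find_all lookup against a small inline HTML string instead of the live site. The sample markup and the thread paths in it are made up for illustration, and extract_thread_urls is a hypothetical helper. It also uses urllib.parse.urljoin in place of string concatenation, which is a substitution on my part that copes with both relative and absolute href values.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified board markup; the real page has more columns.
SAMPLE = """
<table>
  <tr><td class="td-title faceblue"><a href="/post-1109-1-1.shtml">First thread</a></td></tr>
  <tr><td class="td-title faceblue"><a href="/post-1109-2-1.shtml">Second thread</a></td></tr>
  <tr><td class="td-title faceblue"><a>Anchor without a link</a></td></tr>
</table>
"""


def extract_thread_urls(html, main_url):
    """Return absolute URLs for every linked thread title in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for cell in soup.find_all('td', attrs={'class': 'td-title faceblue'}):
        # href=True keeps only <a> tags that actually carry an href
        for a in cell.find_all('a', href=True):
            urls.append(urljoin(main_url, a['href']))
    return urls


thread_urls = extract_thread_urls(SAMPLE, 'http://bbs.tianya.cn')
for u in thread_urls:
    print(u)
```

Filtering with href=True replaces the inner try/except: anchors with no href are never visited, so there is nothing to catch.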

4. Parsing the pages

def getHtmlInfo(url_list, fpath):
    """Parse every thread in `url_list` and append its content to `fpath`."""
    for thread_url in url_list:
        infoDict = {}    # all the information collected for this thread
        authorInfo = []  # author info of each post in the thread
        comment = []     # text of each post in the thread
        try:
            soup = getHtmlText(thread_url)
            if not soup:  # page could not be fetched: skip it and carry on
                continue
            Info = soup.find('span', attrs={'style': 'font-weight:400;'})
            title = Info.text  # thread title
            infoDict.update({'Forum topic:  ': title})  # store the title in the dict
            for m in soup.find_all('div', attrs={'class': 'atl-info'}):
                authorInfo.append(m.text)  # author info of each post
            for m in soup.find_all('div', attrs={'class': 'bbs-content'}):
                comment.append(m.text)  # text of each post
            # Pair each author block with its post text. zip stops at the
            # shorter list, so a count mismatch does not raise IndexError.
            for author, text in zip(authorInfo, comment):
                infoDict[author + '\n'] = text + '\n'
            # Append everything collected to the chosen output file
            with open(fpath, 'a', encoding='utf-8') as f:
                for key, value in infoDict.items():
                    f.write(str(key) + '\n')
                    f.write(str(value) + '\n')
        except Exception:
            continue  # skip threads that fail to parse

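A for loop goes through the URLs in the list one by one, parses each page, extracts the content we want, and appends it to a file at the specified location on disk.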
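The pairing of author blocks with post bodies can be tried out offline as well. The sketch below is an illustration under assumptions: the sample HTML is invented, parse_thread is a hypothetical helper, and it returns a list of (author, text) tuples rather than a dict so that two posts with identical author info cannot overwrite each other.

```python
from bs4 import BeautifulSoup

# Invented sample thread; the third author block has no matching body.
SAMPLE_THREAD = """
<span style="font-weight:400;">Sample thread title</span>
<div class="atl-info">Author: alice</div>
<div class="bbs-content">First post</div>
<div class="atl-info">Author: bob</div>
<div class="bbs-content">A reply</div>
<div class="atl-info">Author: carol</div>
"""


def parse_thread(html):
    """Return (title, [(author_info, post_text), ...]) for one thread page."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('span', attrs={'style': 'font-weight:400;'})
    title = title_tag.get_text(strip=True) if title_tag else ''
    authors = [d.get_text(strip=True)
               for d in soup.find_all('div', attrs={'class': 'atl-info'})]
    bodies = [d.get_text(strip=True)
              for d in soup.find_all('div', attrs={'class': 'bbs-content'})]
    # zip truncates to the shorter list, so the unmatched block is dropped
    return title, list(zip(authors, bodies))


title, posts = parse_thread(SAMPLE_THREAD)
print(title)
for author, text in posts:
    print(author, '|', text)
```

Whether the two lists line up one-to-one on the real site depends on its markup, so it is worth comparing their lengths on a few live pages before trusting the pairing.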

5. Passing in the parameters

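def main():
    main_url = 'http://bbs.tianya.cn'  # site root, prepended to relative thread links
    develop_url = 'http://bbs.tianya.cn/list-1109-1.shtml'  # board page to crawl
    # develop_url = 'http://bbs.tianya.cn/list-develop-1.shtml'
    ulist = []  # filled with thread URLs by getHtmlList

    fpath = r'E:\tianya.txt'  # output file
    getHtmlList(ulist, develop_url, main_url)
    getHtmlInfo(ulist, fpath)


if __name__ == '__main__':
    main()  # run the crawler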

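Set the URL of the board page to crawl and the path where the data should be saved. This article does not process the scraped data any further. The output file contains the main post of each thread together with its replies.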
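The two inputs, board URL and output path, can also be built rather than hard-coded. The sketch below is only an illustration: build_list_url and default_output_path are hypothetical helpers, and the URL pattern list-<board>-<page>.shtml is inferred from the two example URLs in main(), including the guess that the trailing number is a page index.

```python
from pathlib import Path

MAIN_URL = 'http://bbs.tianya.cn'


def build_list_url(board, page=1):
    """Build a board URL, assuming the pattern <site>/list-<board>-<page>.shtml."""
    return f'{MAIN_URL}/list-{board}-{page}.shtml'


def default_output_path(filename='tianya.txt'):
    """Place the output in the current working directory instead of E:\\."""
    return Path.cwd() / filename


develop_url = build_list_url('1109')
fpath = default_output_path()
print(develop_url)
print(fpath)
```

A pathlib path can be passed straight to open(), so fpath works with getHtmlInfo unchanged and avoids tying the script to a Windows drive letter.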
