  • Crawling all article links from this blog

I'm new to Python, so I'm trying my hand at a crawler, using my own blog as the test subject.
import pprint

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.cnblogs.com/zyqgold/"

# Download every paginated list page of the blog (7 pages at the time of writing).
def download_all_htmls():
    htmls = []
    for i in range(7):
        url = f"{BASE_URL}default.html?page={i + 1}"
        r = requests.get(url)
        if r.status_code != 200:
            raise Exception(f"request failed: {url} ({r.status_code})")
        htmls.append(r.text)
    return htmls

# Extract the article links from one list page.
def parse_single_html(html):
    soup = BeautifulSoup(html, "html.parser")
    # Post titles on cnblogs are <a class="postTitle2 vertical-middle"> anchors.
    articles = soup.find_all("a", class_="postTitle2 vertical-middle")
    nodes = []
    for article in articles:
        nodes.append({"name": article.span.string, "link": article.attrs["href"]})
    return nodes

htmls = download_all_htmls()

all_articles = []
for html in htmls:
    all_articles.extend(parse_single_html(html))
pprint.pprint(all_articles)
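The parsing step can be tried without any network access. The snippet below runs the same `find_all` call against a minimal hand-written HTML fragment (the anchor markup is an assumption modeled on the class name the script searches for); note that passing a space-separated string to `class_` makes Beautiful Soup match the exact class attribute value.

```python
from bs4 import BeautifulSoup

# Minimal HTML mimicking one post-title anchor on a cnblogs list page
sample_html = (
    '<a class="postTitle2 vertical-middle" '
    'href="https://www.cnblogs.com/zyqgold/p/14612501.html">'
    '<span>Sample post</span></a>'
)

soup = BeautifulSoup(sample_html, "html.parser")
# class_ with a space-separated string matches the full class attribute value
articles = soup.find_all("a", class_="postTitle2 vertical-middle")
nodes = [{"name": a.span.string, "link": a["href"]} for a in articles]
print(nodes)
# → [{'name': 'Sample post',
#     'link': 'https://www.cnblogs.com/zyqgold/p/14612501.html'}]
```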

Source: https://www.cnblogs.com/zyqgold/p/14612501.html
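The hard-coded `range(7)` breaks as soon as the blog gains an eighth page. One possible refinement, sketched below with hypothetical helpers `parse_links` and `crawl_until_empty` (not from the original post), is to keep requesting pages until one comes back with no article links. The demo drives the loop with a canned fake fetcher so it runs offline; a real fetcher would return `requests.get(f"https://www.cnblogs.com/zyqgold/default.html?page={page}").text`.

```python
from bs4 import BeautifulSoup

def parse_links(html):
    """Extract {"name", "link"} dicts from one page of post-title anchors."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"name": a.get_text(strip=True), "link": a["href"]}
            for a in soup.find_all("a", class_="postTitle2 vertical-middle")]

def crawl_until_empty(fetch_page, max_pages=100):
    """Request page 1, 2, ... until a page yields no article links."""
    results = []
    for page in range(1, max_pages + 1):
        links = parse_links(fetch_page(page))
        if not links:
            break
        results.extend(links)
    return results

# Stand-in fetcher for demonstration: two populated pages, then empty ones.
fake_pages = {
    1: '<a class="postTitle2 vertical-middle" href="/p/1.html"><span>First</span></a>',
    2: '<a class="postTitle2 vertical-middle" href="/p/2.html"><span>Second</span></a>',
}
result = crawl_until_empty(lambda page: fake_pages.get(page, ""))
print(result)
# → [{'name': 'First', 'link': '/p/1.html'},
#    {'name': 'Second', 'link': '/p/2.html'}]
```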