Anti-anti-crawler strategies

GitHub repository for this project: https://github.com/wangqifan/ZhiHu

Related GitHub projects:
Zhihu crawler
Self-built proxy pool

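I. Limits on the request IP and similar restrictions

   Take Zhihu as an example: once our request rate reaches a certain threshold, the anti-crawler mechanism is triggered.

   While crawling the profiles of a million Zhihu users, I ran into 429 errors (Too Many Requests). For details, see my blog post: http://www.cnblogs.com/zuin/p/6227834.html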

 

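Countermeasures

1. Lower the crawl rate so that it stays slightly below the threshold

Run tests to find out where the threshold is.

With 6 threads crawling, the server returns 429: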

 for (int i = 0; i < 6; i++)
 {
     ThreadPool.QueueUserWorkItem(GetUser);
 }

With 5 threads, the crawl runs smoothly and is not blocked:

 for (int i = 0; i < 5; i++)
 {
     ThreadPool.QueueUserWorkItem(GetUser);
 }

So if the workload is fairly small, this strategy is good enough.
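Once the threshold is known, the cap can be made an explicit parameter instead of a hard-coded loop bound. The following is only a minimal sketch: it swaps the ThreadPool loop above for a SemaphoreSlim gate, and the names ThrottledCrawler and RunAsync are hypothetical, not part of the original project.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledCrawler
{
    // Runs `work` once per item, with at most `maxConcurrency` calls in flight.
    // (Hypothetical helper; the original post uses ThreadPool.QueueUserWorkItem.)
    public static async Task RunAsync<T>(IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = new List<Task>();
            foreach (var item in items)
            {
                // Wait for a free slot before starting the next request
                await gate.WaitAsync();
                tasks.Add(Task.Run(async () =>
                {
                    try { await work(item); }
                    finally { gate.Release(); } // free the slot even if the request fails
                }));
            }
            await Task.WhenAll(tasks);
        }
    }
}
```

With the value measured above, a call would look like `ThrottledCrawler.RunAsync(userIds, 5, id => CrawlUserAsync(id))`, where userIds and CrawlUserAsync stand in for your own queue and download routine.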

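2. Build a proxy pool

Implementing a proxy service in C# is simple.

For details, see my blog post: http://www.cnblogs.com/zuin/p/6261677.html

Each request picks a random proxy from the pool, so no single IP reaches the threshold. The drawback is that only a small share of the proxies collected from the web actually work, and any of them may stop working at any time.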
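The random selection described above can be sketched as follows. This is an illustration under assumptions, not the pool from the linked project: the ProxyPool class and its members are hypothetical, and the addresses passed in are whatever proxies you have collected. Because free proxies die often, the sketch also lets the caller drop one that has failed.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public class ProxyPool
{
    private readonly List<string> _addresses;
    private readonly Random _random = new Random();
    private readonly object _lock = new object();

    // addresses: proxy URLs such as "http://host:port" (supplied by the caller)
    public ProxyPool(IEnumerable<string> addresses)
    {
        _addresses = new List<string>(addresses);
    }

    // Returns a randomly chosen proxy address, or null when the pool is empty
    public string Next()
    {
        lock (_lock)
        {
            if (_addresses.Count == 0) return null;
            return _addresses[_random.Next(_addresses.Count)];
        }
    }

    // Drops a proxy that has stopped working
    public bool Remove(string address)
    {
        lock (_lock) { return _addresses.Remove(address); }
    }

    // Converts an address into a WebProxy that can be assigned to request.Proxy
    public static WebProxy ToWebProxy(string address)
    {
        return new WebProxy(new Uri(address));
    }
}
```

A request would then set `request.Proxy = ProxyPool.ToWebProxy(pool.Next())` and call `pool.Remove(...)` when that proxy times out.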

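3. Use a cloud proxy service

Proxies from a commercial provider are stable and of high quality. Take Abuyun as an example.

Official website: https://www.abuyun.com/

All that is needed is to modify the method that downloads a resource: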

// Requires: using System; using System.Diagnostics; using System.IO;
//           using System.IO.Compression; using System.Net; using System.Text;
public static string DownLoadString(string url)
{
    string Source = string.Empty;
    try
    {
        string proxyHost = "http://proxy.abuyun.com";
        string proxyPort = "9020";
        // Proxy tunnel credentials
        string proxyUser = "H71T6AMK7GRE";
        string proxyPass = "D3F01F";
        var proxy = new WebProxy();
        proxy.Address = new Uri(string.Format("{0}:{1}", proxyHost, proxyPort));
        proxy.Credentials = new NetworkCredential(proxyUser, proxyPass);

        ServicePointManager.Expect100Continue = false;

        Stopwatch watch = Stopwatch.StartNew();
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0";
        request.Accept = "*/*";
        request.Method = "GET";
        request.Referer = "https://www.zhihu.com/";
        // Advertise only the encodings decompressed below ("br" is not handled here)
        request.Headers.Add("Accept-Encoding", "gzip, deflate");
        request.KeepAlive = true; // enable keep-alive connections
        request.Proxy = proxy;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream dataStream = response.GetResponseStream())
        {
            // Pick a decompressing wrapper based on the Content-Encoding header
            string encoding = (response.ContentEncoding ?? string.Empty).ToLower();
            Stream stream = dataStream; // raw, uncompressed body by default
            if (encoding.Contains("gzip"))
            {
                stream = new GZipStream(dataStream, CompressionMode.Decompress);
            }
            else if (encoding.Contains("deflate"))
            {
                stream = new DeflateStream(dataStream, CompressionMode.Decompress);
            }

            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                Source = reader.ReadToEnd();
            }
        }
        watch.Stop();
        Console.WriteLine("Request took {0} ms", watch.ElapsedMilliseconds);
    }
    catch (Exception ex)
    {
        Console.WriteLine("Request failed for URL {0}: {1}", url, ex.Message);
    }
    return Source;
}

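Other: according to tips from fellow crawler authors, the server may be limiting by account rather than by IP. Even with proxies, every request still goes out under the same account. If the limit is per account, you can register a large number of accounts and build a cookie pool, picking a random cookie for each request so that every account stays below the threshold. Besides a cookie pool, you can also build a User-Agent pool, depending on the situation.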
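A minimal sketch of rotating cookies and User-Agent strings per request is shown below. It assumes you already hold the cookie strings of several logged-in accounts; the IdentityRotation class and its Pick and Build methods are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class IdentityRotation
{
    // Returns a random element of the pool
    public static T Pick<T>(IList<T> pool, Random random)
    {
        return pool[random.Next(pool.Count)];
    }

    // Creates a request that carries a randomly chosen cookie and User-Agent.
    // cookies: raw Cookie header values, one per account (assumed to be collected beforehand)
    public static HttpWebRequest Build(string url, IList<string> cookies, IList<string> userAgents, Random random)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = Pick(userAgents, random);
        request.Headers.Add("Cookie", Pick(cookies, random));
        return request;
    }
}
```

Combined with a proxy, each request then appears to come from a different account, browser, and IP, so none of the three counters climbs toward its threshold as quickly.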

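II. Encrypted request parameters

  Modern web applications rely heavily on AJAX. If the data you want is carried in an AJAX response, you can simply analyze the returned data directly. But sites do not always let you off that easily.

Take a look at NetEase Cloud Music.

 NetEase encrypts the request parameters, and cracking the encryption algorithm is rarely practical. For this kind of parameter encryption, the approach is to embed a browser in the application.

The WebBrowser control is used here.

Import the namespace: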

using System.Windows.Forms;

 

Wrap the page download in a method:

private static string htmlstr;

// Loads the page in an embedded WebBrowser so that its scripts run,
// then reads the rendered HTML of the first frame.
private static void GetHtmlWithBrowser(object url)
{
    htmlstr = string.Empty;

    using (WebBrowser wb = new WebBrowser())
    {
        wb.AllowNavigation = true;
        wb.Url = new Uri(url.ToString());

        // Pump the message loop until the page has finished loading
        while (wb.ReadyState != WebBrowserReadyState.Complete)
        {
            Application.DoEvents();
        }

        HtmlDocument doc = wb.Document;
        htmlstr = doc.Window.Frames[0].Document.Body.InnerHtml;
        Console.WriteLine(htmlstr);
    }
}

 

 As a test, search by song title for "烟霞".

 This way we obtain the search data.

 



Source: https://www.cnblogs.com/zuin/p/6323533.html

