Concurrently validating proxy pool addresses in Python 3
2016-04-14 16:27
【Validating proxies concurrently with a thread pool】
Related reading:
1. concurrent.futures.ThreadPoolExecutor (official Python documentation)
2. 12.7 Creating a Thread Pool (python3-cookbook)
3. What is the best way to send multiple HTTP requests in Python 3? (stackoverflow)
4. requests documentation (Chinese translation)
5. aiohttp documentation
*** walker ***
# encoding=utf-8
# author: walker
# date: 2016-04-14
# summary: validate proxies concurrently with a thread pool
import os
import requests
from concurrent import futures

cur_dir_fullpath = os.path.dirname(os.path.abspath(__file__))

Headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E)',
}

# Check a single proxy.
# Returns the proxy if it works; otherwise returns an empty string.
def Check(desturl, proxy, feature):
    # Assumes proxy is given as 'host:port' (as in the coroutine version below),
    # so the scheme is added here; the same HTTP proxy serves both schemes.
    proxy_url = 'http://' + proxy
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        r = requests.get(url=desturl, headers=Headers, proxies=proxies, timeout=3)
        try:
            if r.status_code != 200:
                return ''
            if feature not in r.content.decode('utf8'):
                return ''
        finally:
            r.close()
    except Exception:
        # Any failure (connection error, timeout, bad encoding) marks the proxy as invalid.
        return ''
    return proxy

# Takes the raw proxies (set/list) and returns the list of valid ones.
def GetValidProxyPool(rawProxyPool, desturl, feature):
    validProxyList = list()  # valid proxies
    # The with-block shuts the pool down once all checks have finished.
    with futures.ThreadPoolExecutor(8) as pool:
        futureList = list()
        for proxy in rawProxyPool:
            futureList.append(pool.submit(Check, desturl, proxy, feature))
        print('\n submit done, waiting for responses\n')
        # as_completed yields each future as soon as its check finishes.
        for future in futures.as_completed(futureList):
            proxy = future.result()
            print('proxy:' + proxy)
            if proxy:  # valid proxy
                validProxyList.append(proxy)
    print('validProxyList size:' + str(len(validProxyList)))
    return validProxyList

# Get the raw proxy pool.
def GetRawProxyPool():
    rawProxyPool = set()
    # Obtain the raw proxy pool by some means ......
    return rawProxyPool

if __name__ == "__main__":
    rawProxyPool = GetRawProxyPool()
    desturl = 'http://...'  # target URL to be reached through the proxy
    feature = 'xxx'  # signature string expected in the target page
    validProxyPool = GetValidProxyPool(rawProxyPool, desturl, feature)

【Validating proxies concurrently with coroutines】
# encoding=utf-8
# author: walker
# date: 2017-03-28
# summary: validate proxies concurrently with coroutines
# Python sys.version: 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)]
import time
import asyncio
import aiohttp

Headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E)',
}

# Check a single proxy.
# Returns the proxy (with the 'http://' prefix) if it works; otherwise returns an empty string.
async def Check(desturl, proxy, feature):
    proxy = 'http://' + proxy
    # ClientTimeout requires aiohttp 3.3+; older releases took a plain number here.
    timeout = aiohttp.ClientTimeout(total=10)
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(desturl, headers=Headers, proxy=proxy, timeout=timeout) as resp:
                if resp.status != 200:
                    return ''
                html = await resp.text(encoding='utf-8')
    except Exception:
        # Any failure (connection error, timeout, bad encoding) marks the proxy as invalid.
        return ''
    if feature not in html:
        return ''
    return proxy

# Takes the raw proxies (set/list) and returns the list of valid ones.
async def GetValidProxyPool(rawProxyPool, desturl, feature):
    print('GetValidProxyPool ...')
    validProxyList = list()  # valid proxies
    coroList = list()
    for proxy in rawProxyPool:
        coroList.append(asyncio.ensure_future(Check(desturl, proxy, feature)))
    # as_completed yields each check as soon as it finishes.
    for f in asyncio.as_completed(coroList):
        proxy = await f
        if proxy:
            validProxyList.append(proxy)
    print('validProxyList size: %d' % len(validProxyList))
    return validProxyList

# Get the raw proxy pool.
def GetRawProxyPool():
    rawProxyPool = set()
    # Obtain the raw proxy pool by some means ......
    return rawProxyPool

if __name__ == "__main__":
    startTime = time.time()
    rawProxyPool = GetRawProxyPool()
    desturl = 'http://...'  # target URL to be reached through the proxy
    feature = 'xxx'  # signature string expected in the target page
    print('rawProxyPool size:%d' % len(rawProxyPool))
    # An explicitly created loop also works on Python 3.6; on 3.7+ asyncio.run() is an alternative.
    loop = asyncio.new_event_loop()
    try:
        validProxyList = loop.run_until_complete(GetValidProxyPool(rawProxyPool, desturl, feature))
    finally:
        loop.close()
    print('validProxyList size:%d' % len(validProxyList))
    print('time cost:%.2fs' % (time.time()-startTime))
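One difference between the two versions is worth noting: the thread pool is capped at 8 workers, while the coroutine version starts one check per proxy all at once. If the raw pool is large, it may be desirable to cap the number of checks in flight. The following is only a minimal sketch of one way to do that with asyncio.Semaphore. The names fake_check and check_all, the limit parameter, and the "port 8080 is valid" rule are all hypothetical stand-ins so the example runs without network access; in the real script the body of the `async with sem:` block would be the aiohttp request.

```python
import asyncio

async def fake_check(proxy, sem, stats):
    # Hypothetical stand-in for Check(): no network access, just a short sleep.
    async with sem:  # at most `limit` checks get past this point at a time
        stats['running'] += 1
        stats['peak'] = max(stats['peak'], stats['running'])
        await asyncio.sleep(0.01)
        stats['running'] -= 1
    # Assumption for the demo only: proxies on port 8080 count as valid.
    return proxy if proxy.endswith(':8080') else ''

async def check_all(raw_pool, limit=8):
    sem = asyncio.Semaphore(limit)
    stats = {'running': 0, 'peak': 0}  # records how many checks overlapped
    tasks = [asyncio.ensure_future(fake_check(p, sem, stats)) for p in raw_pool]
    valid = []
    for f in asyncio.as_completed(tasks):
        proxy = await f
        if proxy:
            valid.append(proxy)
    return valid, stats['peak']

if __name__ == '__main__':
    demo_pool = ['10.0.0.%d:%d' % (i, 8080 if i % 2 else 3128) for i in range(20)]
    valid, peak = asyncio.run(check_all(demo_pool, limit=4))
    print('valid: %d, peak concurrency: %d' % (len(valid), peak))
```

All tasks are still created up front, so memory use grows with the pool size; the semaphore only limits how many requests would be active simultaneously.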