Python code for scraping Google result links
2013-12-06 17:33
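A quick introduction: the program is written in Python 2.7.2. If something is incompatible under Python 3, please adapt it yourself with the 2-to-3 porting guide. Because it uses the msvcrt module, it only runs on Windows.
The program is built on Google's JSON API, for example: https://ajax.googleapis.com/ajax
... p;rsz=8&start=1
As shown in the screenshot below (image not preserved):
1. line is the number of threads.
2. Key is the search keyword; Google search syntax is supported.
3. How many is the number of results to fetch. A JSON page holds only 8 results, so each thread fetches 8 at a time.
4. Press q at any time to exit immediately.
5. Feel free to modify it however you like.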
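The arithmetic behind "How many" and the shape of the request URL can be sketched as below. This is a Python 3 sketch and not the author's code: the helper names `page_count`, `start_offsets` and `build_url` are hypothetical, and treating `start` as a zero-based result offset that advances in steps of 8 is an assumption of this sketch.

```python
from urllib.parse import quote

RESULTS_PER_PAGE = 8  # per the post: one JSON page holds 8 results
BASE_URL = "https://ajax.googleapis.com/ajax/services/search/web"


def page_count(wanted):
    """Number of 8-result pages needed to cover `wanted` results (rounded up)."""
    if wanted <= 0:
        return 0
    return (wanted + RESULTS_PER_PAGE - 1) // RESULTS_PER_PAGE


def start_offsets(wanted):
    """Assumed `start` values, one per page: 0, 8, 16, ..."""
    return [page * RESULTS_PER_PAGE for page in range(page_count(wanted))]


def build_url(keyword, start):
    """Build the request URL for one page; the keyword is percent-encoded."""
    return "%s?v=1.0&q=%s&rsz=%d&start=%d" % (
        BASE_URL, quote(keyword), RESULTS_PER_PAGE, start)


if __name__ == "__main__":
    for offset in start_offsets(20):
        print(build_url("site:example.com python", offset))
```

Asking for 20 results gives three pages here, so the last page over-fetches slightly. The author's full Python 2 program follows.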
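#!/usr/bin/env python
# coding=utf-8
# Python 2.7, Windows only (msvcrt is used for the "press q to quit" key).
import urllib2, urllib, threading, Queue, os
import msvcrt
import simplejson
import sys

RESULTS_PER_PAGE = 8  # the JSON API returns 8 results per request (rsz=8)

searchstr = raw_input("Key?:")
wanted = int(raw_input("How many?:"))
# Round up to whole pages of 8 results.
pagenum = (wanted + RESULTS_PER_PAGE - 1) // RESULTS_PER_PAGE
line = 5  # number of worker threads


class googlesearch(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        # Daemon threads do not keep the process alive once the main thread ends.
        self.daemon = True

    def run(self):
        while 1:
            start = queue.get()  # blocks until an offset is available
            try:
                self.catchURL(start)
            finally:
                # Always mark the item done so queue.join() can return.
                queue.task_done()

    def catchURL(self, start):
        # Console input on Chinese Windows is usually GBK; the URL needs UTF-8.
        key = searchstr.decode(sys.stdin.encoding or 'gbk').encode('utf-8')
        url = ('https://ajax.googleapis.com/ajax/services/search/web'
               '?v=1.0&q=%s&rsz=8&start=%s') % (urllib.quote(key), start)
        try:
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            results = simplejson.load(response)
            URLinfo = results['responseData']['results']
        except Exception as e:
            print e
        else:
            for info in URLinfo:
                print info['url']


class ThreadGetKey(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.daemon = True

    def run(self):
        while 1:
            try:
                key = msvcrt.getch()
                if key == 'q':
                    print "stopped by your action ( q )"
                    os._exit(1)
            except:
                os._exit(1)


if __name__ == '__main__':
    queue = Queue.Queue()
    # start is treated here as a zero-based result offset, so step by 8
    # instead of 1 to avoid fetching overlapping pages.
    for start in range(0, pagenum * RESULTS_PER_PAGE, RESULTS_PER_PAGE):
        queue.put(start)
    ThreadGetKey().start()
    for p in range(line):
        googlesearch().start()
    # Wait until every queued offset has been processed, then exit.
    queue.join()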
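For readers on Python 3 or outside Windows, the thread-plus-queue pattern the program relies on can be sketched without msvcrt. This is only an illustration under stated assumptions: `run_workers` is a hypothetical helper, and `fake_fetch` is a hypothetical stand-in for the HTTP request so the sketch runs offline. Only `queue.Queue` and `threading.Thread` from the standard library are used.

```python
import queue
import threading


def fake_fetch(start):
    """Hypothetical stand-in for the HTTP request: returns 8 placeholder URLs."""
    return ["http://example.com/result/%d" % (start + i) for i in range(8)]


def run_workers(offsets, fetch, thread_count=5):
    """Process every offset with `thread_count` threads; return collected URLs."""
    tasks = queue.Queue()
    for offset in offsets:
        tasks.put(offset)

    collected = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # get_nowait() raises queue.Empty instead of blocking forever,
                # so the thread can finish once the work runs out.
                start = tasks.get_nowait()
            except queue.Empty:
                return
            try:
                urls = fetch(start)
            except Exception as exc:  # keep going if one page fails
                print(exc)
                urls = []
            with lock:
                collected.extend(urls)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return collected


if __name__ == "__main__":
    for url in run_workers([0, 8, 16], fake_fetch):
        print(url)
```

Because each worker returns when the queue is empty, the program ends on its own after the last page, and no keypress thread is needed to stop it. The order of the collected URLs depends on thread scheduling.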
Reposted from: http://sb.f4ck.org/forum.php?mod=viewthread&tid=6205&highlight=python