Python project: a crawler for the 妹子图 (girl pics) on jandan.net - Part 1
2016-03-08 14:20
This is a practice project: scraping the girl pics. The page URLs follow the pattern
http://jandan.net/ooxx/page-1777#comments — only the page number (1777 here) needs to change.
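As a rough sketch (the variable names here are mine, not from the original script), the full list of page URLs can be built like this:

base = "http://jandan.net/ooxx/page-"
# The script below walks pages 1530 through 1882.
page_urls = [base + str(i) + "#comments" for i in range(1530, 1883)]
print(page_urls[0])  # http://jandan.net/ooxx/page-1530#comments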
Inspecting the page source shows that each picture appears in two places:
a thumbnail, <img src="http://ww1.sinaimg.cn/mw600/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" /></p>
and the full-size original,
<a href="http://ww1.sinaimg.cn/large/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" target="_blank" class="view_img_link">[查看原图]</a>
Here we grab the originals, locating the links by their class and target attributes.
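A minimal sketch of that lookup with BeautifulSoup, run here against the single anchor tag shown above (the html string is just sample input):

from bs4 import BeautifulSoup

html = '<a href="http://ww1.sinaimg.cn/large/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" target="_blank" class="view_img_link">[查看原图]</a>'
soup = BeautifulSoup(html, 'lxml')
for a in soup.find_all("a", target="_blank", class_="view_img_link"):
    print(a['href'])  # the full-size image URL

The same find_all call is what the full script below runs on each downloaded page.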
The result is one TXT file of image URLs per page; the next part covers merging the files and downloading the images.
The source code is below. You will need to supply the proxy IP file (ip2.txt) yourself :-D

# coding:utf-8
####################################################
# coding by 刘云飞
####################################################
import requests
import os
import time
import random
from bs4 import BeautifulSoup
import threading  # imported but not used in this part

url = "http://jandan.net/ooxx/page-"
img_lists = []      # all full-size image URLs collected so far
url_lists = []      # gallery page URLs to crawl
not_url_lists = []  # pages that failed and should be retried
ips = []            # proxy addresses loaded from ip2.txt
thread_list = []    # not used in this part

# Load the proxy list, one address per line.
with open('ip2.txt', 'r') as f:
    lines = f.readlines()
    for line in lines:
        ip_one = "http://" + line.strip()
        ips.append(ip_one)

headers = {
    'Host': 'jandan.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/42.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Referer': 'http://jandan.net/ooxx/',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
}

# Build the list of gallery pages to crawl.
for i in range(1530, 1883):
    url_lists.append(url + str(i) + '#comments')

def writeToTxt(name, urls):
    # Write a list of URLs to a text file, one per line.
    with open(name, 'w+') as f:
        for urlOne in urls:
            f.write(urlOne + "\n")

def get_img_url(url):
    # Fetch one gallery page through a random proxy and collect its image URLs.
    single_ip_addr = random.choice(ips)
    lists_tmp = []
    page = int(url[28:32])  # assumes a four-digit page number
    filename = str(page) + ".txt"
    proxies = {'http': single_ip_addr}
    try:
        res = requests.get(url, headers=headers, proxies=proxies)
        print(res.status_code)
        if res.status_code == 200:
            text = res.text
            Soup = BeautifulSoup(text, 'lxml')
            # The full-size links carry target="_blank" and class="view_img_link".
            results = Soup.find_all("a", target="_blank", class_="view_img_link")
            for img in results:
                lists_tmp.append(img['href'])
                img_lists.append(img['href'])  # also record it in the global list of all image URLs
            print(url + " --->>>>抓取完毕!!")
            writeToTxt(filename, lists_tmp)
        else:
            not_url_lists.append(url)
            print("not ok")
    except Exception:
        not_url_lists.append(url)
        print("not ok")

for url in url_lists:
    page = int(url[28:32])
    filename = str(page) + ".txt"
    if os.path.exists(filename):
        # Already scraped in an earlier run, skip it.
        print(url + " is pass")
    else:
        # time.sleep(1)
        get_img_url(url)

print(img_lists)

with open("img_url.txt", 'w+') as f:
    for url in img_lists:
        f.write(url + "\n")

print("共有 " + str(len(img_lists)) + " 张图片。")
print("all done!!!")

# Keep the failed pages so they can be retried later.
with open("not_url_lists.txt", 'w+') as f:
    for url in not_url_lists:
        f.write(url + "\n")
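A note on ip2.txt: the script reads one proxy per line and prefixes it with http:// before handing it to requests, so the file is presumably just a plain list of host:port entries, for example (made-up addresses, for illustration only):

1.2.3.4:8080
5.6.7.8:3128

If you do not want to deal with proxies at all, dropping the proxies argument from requests.get should also work, with requests then going out from your own IP.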