Second-hand housing listings crawler: https://github.com/oubahe/WuhanReptile
Description: a simple crawler for a static website.
Goal: scrape all two-bedroom second-hand housing listings in Wuhan, Hubei and save them to the desktop.
Highlights: basic use of requests, BeautifulSoup, and json; crawling with multiprocessing and multithreading.
Audience: a small project suited to beginners.
For the source code or to suggest improvements, see https://github.com/oubahe/WuhanReptile
For fun: scraping two-bedroom second-hand listings from Lianjia Wuhan.
import requests
import bs4
import json
from multiprocessing import Pool
from multiprocessing import Process
import threading

# Build the list of page URLs: page 1 has no pg segment, pages 2-100 use pgNl2
def get_url():
    url = []
    url.append('https://wh.lianjia.com/ershoufang/l2/')
    for i in range(2, 101):
        url.append('https://wh.lianjia.com/ershoufang/pg' + str(i) + 'l2/')
    return url
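A quick offline sanity check of the URL pattern that get_url() builds (no network needed; the list below mirrors the scheme used above):

```python
# Page 1 is /ershoufang/l2/, pages 2-100 are /ershoufang/pgNl2/
base = 'https://wh.lianjia.com/ershoufang/'
urls = [base + 'l2/'] + [base + 'pg' + str(i) + 'l2/' for i in range(2, 101)]
print(len(urls))   # 100
print(urls[1])     # https://wh.lianjia.com/ershoufang/pg2l2/
```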
Fetch and parse the listing information:
def get_infor(url):
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    # Listing title (name and tagline)
    titles = soup.select('body > div.content > div.leftContent > ul > li > div.info.clear > div.title > a')
    # Location
    locations = soup.select('body > div.content > div.leftContent > ul > li > div.info.clear > div.flood > div > a')
    # Residential community (xiaoqu)
    addresses = soup.select('body > div.content > div.leftContent > ul > li > div.info.clear > div.address > div > a')
    # House details
    houseinfoes = soup.select('body > div.content > div.leftContent > ul > li > div.info.clear > div.address > div')
    # Floor and year built
    positioninfoes = soup.select('body > div.content > div.leftContent > ul > li > div.info.clear > div.flood > div')
    # Unit price
    prices = soup.select('body > div.content > div.leftContent > ul > li > div.info.clear > div.priceInfo > div.unitPrice > span')
    # Total price
    totals = soup.select('body > div.content > div.leftContent > ul > li > div.info.clear > div.priceInfo > div.totalPrice > span')
    # Assemble the fields for each listing
    for title, location, address, houseinfo, positioninfo, price, total in zip(titles, locations, addresses, houseinfoes, positioninfoes, prices, totals):
        data = {
            '房源名称': title.get_text(),
            '位置': location.get_text(),
            '小区': address.get_text(),
            '房源信息': houseinfo.get_text(),
            '楼层年代': positioninfo.get_text(),
            '单价': price.get_text()[2:],
            '总价': total.get_text() + '万'
        }
        print(data)
        write_to_file(data)
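The long CSS selectors above can be exercised offline against a minimal HTML fragment. The markup below is a simplified, hypothetical stand-in mimicking the class names of a Lianjia listing card, not the actual page source:

```python
import bs4

# Hypothetical, simplified listing card using the same class names as above
html = '''
<div class="info clear">
  <div class="title"><a>南北通透精装两居</a></div>
  <div class="priceInfo">
    <div class="totalPrice"><span>120</span></div>
    <div class="unitPrice"><span>单价15000元/平米</span></div>
  </div>
</div>
'''

# html.parser is used so this sketch does not depend on lxml being installed
soup = bs4.BeautifulSoup(html, 'html.parser')
title = soup.select_one('div.info.clear > div.title > a').get_text()
total = soup.select_one('div.totalPrice > span').get_text() + '万'
unit = soup.select_one('div.unitPrice > span').get_text()[2:]  # drop the "单价" prefix
print(title, total, unit)  # 南北通透精装两居 120万 15000元/平米
```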
Write the results to a txt file:
def write_to_file(content):
    # Raw string: otherwise \U in C:\Users starts a Unicode escape and is a
    # syntax error in Python 3
    with open(r'C:\Users\Administrator\Desktop\house_information.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n\n')
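The write step can be tried out portably with a temporary directory in place of the hard-coded desktop path (the path and sample record below are illustrative only):

```python
import json
import os
import tempfile

def append_record(path, record):
    # One JSON object per line, UTF-8; ensure_ascii=False keeps Chinese readable
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'house_information.txt')
    append_record(path, {'位置': '光谷', '总价': '120万'})
    with open(path, encoding='utf-8') as f:
        line = f.read().strip()
print(line)  # {"位置": "光谷", "总价": "120万"}
```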
Fetch all the listings and save them to the txt file:
def main():
    urls = get_url()
    for url in urls:
        get_infor(url)
if __name__ == '__main__':
    # Single worker process
    p = Process(target=main)
    p.start()
    p.join()
    print('Process Completed')
    # Process pool
    # p = Pool(5)
    # for i in range(5):
    #     p.apply_async(main)
    # p.close()
    # p.join()
    # print('Done fetching')
    # Single worker thread
    # t = threading.Thread(target=main)
    # t.start()
    # t.join()
    # print('Done fetching')
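One practical caveat: sites like Lianjia often reject bare requests, so in practice it helps to send a browser-like User-Agent and set a timeout. A hedged sketch follows; the header string and timeout value are arbitrary choices, not requirements from the original project:

```python
import requests

# Hypothetical browser-like header; any recent UA string would do
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def fetch(url):
    # Fail fast instead of hanging, and raise on 4xx/5xx instead of
    # silently parsing an error page
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text
```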