Learning web scraping with Python and BeautifulSoup
2016-12-01 21:03
Scraping movie keywords from IMDB
Source HTML: see the page source of the target page.
# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and the print statement do not exist in Python 3).
import urllib2
import unicodedata

from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.imdb.com/title/tt1619029/keywords?ref_=tt_stry_kw")
soup = BeautifulSoup(page, "lxml")

# Two ways to select the keyword tags:
print soup.find_all(attrs={"class": "sodatext"})  # match on the class attribute
print soup.select('div[class="sodatext"]')        # CSS attribute selector

kwinfo = []  # list that collects the cleaned keywords
for keyw in soup.select('div[class="sodatext"]'):
    # strip() returns a new string, so its result has to be assigned
    kw = keyw.get_text().strip()
    # Convert unicode to an ASCII str, dropping characters that cannot be encoded
    line = unicodedata.normalize('NFKD', kw).encode('ascii', 'ignore')
    print type(line)  # debug: confirm the result is a str
    kwinfo.append(line)
    print line

print kwinfo  # print the whole list

# Write all keywords to the file on one line, each followed by "|"
with open('F:\\keyw.txt', 'w') as f:
    for item in kwinfo:
        f.write(item + "|")
loosely based on real events
widow
period film
1960s
american politics
jackie kennedy
title spoken by character
character name in title
['first lady', 'american history', 'kubrickian', '34 year old', 'forename as title', 'one word title', 'female lead', 'kennedy assassination', 'death of husband', 'year 1963', 'loosely based on real events', 'widow', 'period film', '1960s', 'american politics', 'jackie kennedy', 'title spoken by character', 'character name in title']
Written to file: first lady|american history|kubrickian|34 year old|forename as title|one word title|female lead|kennedy assassination|death of husband|year 1963|loosely based on real events|widow|period film|1960s|american politics|jackie kennedy|title spoken by character|character name in title|
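The line written to the file ends with a trailing "|", so splitting it again yields an empty last item. Below is a minimal Python 3 sketch of the cleanup and write steps only, without any network access. The helper names to_ascii, write_keywords and read_keywords are hypothetical, the sample strings (including the accented one) are made up for illustration, and a temporary file stands in for F:\\keyw.txt.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch of the cleanup and write steps (assumed names, sample data).
import os
import tempfile
import unicodedata


def to_ascii(text):
    """Strip surrounding whitespace, then drop anything that is not ASCII
    after NFKD decomposition (accented letters lose their accents)."""
    decomposed = unicodedata.normalize("NFKD", text.strip())
    return decomposed.encode("ascii", "ignore").decode("ascii")


def write_keywords(path, keywords):
    """Write the keywords on one line, separated by '|', no trailing '|'."""
    with open(path, "w", encoding="ascii") as f:
        f.write("|".join(keywords))


def read_keywords(path):
    """Read the line back and split it into a list of keywords."""
    with open(path, encoding="ascii") as f:
        return f.read().split("|")


# Hypothetical raw strings standing in for the text of the scraped tags.
raw = ["\nfirst lady\n", "\nkubrickian\n", "\ncaf\u00e9 scene\n"]
keywords = [to_ascii(k) for k in raw]

path = os.path.join(tempfile.gettempdir(), "keyw.txt")
write_keywords(path, keywords)
restored = read_keywords(path)
print(restored)  # ['first lady', 'kubrickian', 'cafe scene']
```

Joining with "|" instead of appending it after every item means the file can be split back into exactly the original list.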