CS109 Lecture 7

2016-07-29 17:07

Data Scraping

Sources

From a website

With an API

Copyright and permissions

Be careful and polite

Give credit

Pay attention to media law

Don’t be evil
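
One common way to put "be careful and polite" into practice is to check a site's robots.txt before fetching a page. The sketch below is only an illustration of that idea, not something taken from the lecture: the robots.txt content, the user-agent name, and the example.com URLs are all hypothetical, and the rules are parsed from an inline string so that nothing is downloaded.

```python
import urllib.robotparser

# Hypothetical robots.txt content (an assumption for illustration).
# A real scraper would load it from the site's own /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "cs109-notes-bot"  # hypothetical user-agent name


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this agent may fetch each (hypothetical) URL.
allowed_public = parser.can_fetch(AGENT, "https://example.com/events.html")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/data.html")

# Requested pause between requests in seconds, or None if not specified.
delay = parser.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

A scraper built this way would skip any URL for which `can_fetch` returns False and sleep for `delay` seconds between requests.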

Useful tags

<h1></h1>
<p></p>
<br>
<a href='url'>Link</a>

Useful Libraries for Scraping

urllib

BeautifulSoup

pattern

lxml

Get Data From a Website

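import urllib2  # Python 2; in Python 3 use urllib.request instead
import bs4

url = 'url'
source = urllib2.urlopen(url).read()  # raw HTML of the page

soup = bs4.BeautifulSoup(source, 'html.parser')
soup.findAll('a')  # find all <a></a> tags

tag = soup.find('a')  # first <a> tag only
tag.get('href')       # value of its href attribute

C = soup.findAll('p', {'class': 'Event'})  # <p> tags with class "Event"
t = C[0]
t.findNextSiblings()  # siblings that follow t in the document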
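
To make the tag and attribute lookup above concrete without any third-party install, here is a rough standard-library sketch in Python 3. It swaps BeautifulSoup for `html.parser.HTMLParser` and collects the `href` of every `<a>` tag, roughly what `findAll('a')` followed by `get('href')` does. The sample HTML, the `LinkCollector` class, and the `extract_links` helper are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content, inlined so the sketch runs without network access.
SAMPLE_HTML = """
<h1>Events</h1>
<p class="Event">Talk: <a href="/talks/1">Scraping 101</a></p>
<p class="Event">Lab: <a href="/labs/2">APIs</a></p>
<p>Contact: <a href="mailto:staff@example.com">staff</a></p>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs


links = extract_links(SAMPLE_HTML)
print(links)
```

For real pages, BeautifulSoup or lxml is usually the more convenient choice, since they tolerate malformed markup and support searching by class, as in the `{'class': 'Event'}` filter above.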

Get Data With An API

import json  # JavaScript Object Notation
import requests

api_key = 'mykey'
url = 'url' + api_key
data = requests.get(url).text  # response body as a JSON string

#---simple example--------
a = {'a': 1, 'b': 2}
s = json.dumps(a)   # dict -> JSON string
a2 = json.loads(s)  # JSON string -> dict
#-------------------------
dataDict = json.loads(data)  # parse the API response into a dict
dataDict.keys()
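
The same flow can be tried offline. The sketch below, in Python 3, builds a request URL and parses a JSON response body, under several assumptions: the endpoint, the `api_key` query parameter name, the `build_url` helper, and the response content are all hypothetical, and the response is an inline string rather than a real download.

```python
import json
from urllib.parse import urlencode


def build_url(base_url, api_key, **params):
    """Append the API key and any extra query parameters to base_url."""
    query = dict(params, api_key=api_key)
    # Sort the parameters so the resulting URL is deterministic.
    return base_url + "?" + urlencode(sorted(query.items()))


# Hypothetical endpoint and key, for illustration only.
url = build_url("https://api.example.com/v1/events", "mykey", city="Boston")

# Hypothetical response body; a real call would download this text from `url`.
data = (
    '{"events": [{"name": "Talk", "seats": 40},'
    ' {"name": "Lab", "seats": 25}], "total": 2}'
)

dataDict = json.loads(data)  # JSON text -> Python dict
names = [event["name"] for event in dataDict["events"]]

# Round trip: dict -> JSON text -> dict yields an equal dict.
roundtrip = json.loads(json.dumps(dataDict))

print(url)
print(sorted(dataDict.keys()), names)
```

How the key is actually passed (query parameter, header, or path segment) depends on the API, so the provider's documentation should be checked before reusing this pattern.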
