CS109 Lecture 7

2016-07-29 17:07

Data Scraping

Sources

From a web site

With An API

Copyright and permissions

Be careful and polite

Give credit

Care about media law

Don’t be evil

Useful tags

<h1></h1>
<p></p>
<br>
<a href = 'url'>Link</a>
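
To make the tags above concrete, here is a minimal sketch that feeds a tiny hand-written page through Python's standard-library `html.parser` and records the tags and link targets it sees. The page string, the `TagCollector` class, and the address `http://example.com` are illustrative assumptions, not part of the lecture.

```python
from html.parser import HTMLParser

# Hypothetical sample page built from the four tags listed above.
PAGE = """
<h1>Heading</h1>
<p>First line<br>second line</p>
<a href='http://example.com'>Link</a>
"""


class TagCollector(HTMLParser):
    """Record every opening tag and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            # attrs arrives as a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = TagCollector()
collector.feed(PAGE)
print(collector.tags)
print(collector.links)
```

Note that `<br>` has no closing tag, so it only ever shows up as an opening tag.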


Useful Libraries for Scraping

urllib

BeautifulSoup

pattern

lxml

Get Data From Website

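import urllib2  # Python 2; in Python 3 use urllib.request instead

url = 'url'  # placeholder for the page address
source = urllib2.urlopen(url).read()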
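
The snippet above targets Python 2, which is where `urllib2` lives. As a rough sketch of the same fetch in Python 3, `urllib.request.urlopen` plays that role. The helper name `fetch` and the `data:` URL are assumptions made so the example runs without network access; a real `http://` or `https://` address would take its place.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the body at `url` decoded as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Stand-in for a real page: a data: URL carries its content inline,
# so no network connection is needed for this demonstration.
html = "<h1>Title</h1><p>Body</p>"
demo_url = "data:text/html," + quote(html)

source = fetch(demo_url)
print(source)
```

Decoding to text up front is a design choice; the raw bytes from `read()` can also be handed to a parser directly.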


import bs4

soup = bs4.BeautifulSoup(source, 'html.parser')
soup.findAll('a')  # find every <a>...</a> tag


tag = soup.find('a')  # first <a> tag only
tag.get('href')       # value of its href attribute


C = soup.findAll('p',{'class':'Event'})  # every <p class="Event"> tag
t = C[0]
t.findNextSiblings()  # tags that follow t at the same nesting level
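
The calls above can be tried end to end on a small inline page. The following is a sketch only: the HTML string and its contents are made up for illustration, and it uses `find_all` and `find_next_siblings`, the current spellings of `findAll` and `findNextSiblings` in BeautifulSoup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical page; the "Event" class mirrors the lookup used above.
source = """
<html><body>
  <a href="/first">First</a>
  <p class="Event">Talk</p>
  <p>Room 101</p>
  <p>3 pm</p>
  <a href="/second">Second</a>
</body></html>
"""

# Naming the parser explicitly avoids depending on whichever one is installed.
soup = BeautifulSoup(source, "html.parser")

# Link targets of every <a> tag on the page.
hrefs = [a.get("href") for a in soup.find_all("a")]

# First paragraph with class "Event", then the <p> tags that follow it
# at the same nesting level.
event = soup.find_all("p", {"class": "Event"})[0]
details = [p.get_text() for p in event.find_next_siblings("p")]

print(hrefs)
print(event.get_text())
print(details)
```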


Get Data With An API

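import json  # JavaScript Object Notation
import requests  # alternative way to fetch: requests.get(url).text
import urllib2  # Python 2; in Python 3 use urllib.request instead
api_key = 'mykey'
url = 'url' + api_key
data = urllib2.urlopen(url).read()  # raw JSON text returned by the API


#---simple example--------
a = {'a':1,'b':2}
s = json.dumps(a)   # dict -> JSON string
a2 = json.loads(s)  # JSON string -> dict
#-------------------------
dataDict = json.loads(data)
dataDict.keys()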
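
As a self-contained sketch of the parsing step, the example below decodes a hard-coded JSON string instead of a live API response. The payload, its field names (`status`, `results`, `title`, `year`), and the variable names are all invented for illustration; a real API defines its own structure.

```python
import json

# Hypothetical response body, standing in for what an API might return.
data = (
    '{"status": "ok", "results": ['
    '{"title": "A", "year": 2014}, '
    '{"title": "B", "year": 2015}]}'
)

# JSON text -> nested Python dicts and lists.
data_dict = json.loads(data)
top_level_keys = sorted(data_dict.keys())
titles = [item["title"] for item in data_dict["results"]]

# Round trip: dumps() serializes to a string, loads() parses it back.
original = {"a": 1, "b": 2}
round_tripped = json.loads(json.dumps(original))

print(top_level_keys)
print(titles)
print(round_tripped == original)
```

Inspecting the keys first, as the notes do with `keys()`, is a quick way to learn the layout of an unfamiliar response before digging into nested fields.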