
A Python data-collection example: crawl_and_parse

2016-12-16 23:52
These are Jupyter notebooks covering the basics of data collection with Python, including crawl_and_parse, different_format_data_processing, feature_engineering_example, and python_regular_expression. The material was provided in an earlier course and has been ported to a Python 3 + Windows environment. The code has been uploaded to CSDN resources: ABC of data_collection

For easier reading, the code is split across four blog posts.

Below is the md file exported from the Jupyter notebook.

1.crawl_and_parse

Crawl and parse HTML with Beautiful Soup

寒小阳(hanxiaoyang.ml@gmail.com)

2016-08

# Load modules
import requests
from bs4 import BeautifulSoup
import pandas as pd


### Create a dataframe and print it out, to prepare for the crawl below

# Build a dictionary
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}

# Create the dataframe
raw_df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])

# Take a look
raw_df
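As an aside (not in the original notebook), `DataFrame.to_html()` shows why the scraper further down looks for `class='dataframe'`: that is the CSS class pandas assigns to the `<table>` it renders, so a notebook's displayed dataframe is findable by that class. A minimal sketch:

```python
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly'],
            'last_name': ['Miller', 'Jacobson']}
df = pd.DataFrame(raw_data)

# pandas tags the rendered <table> with class="dataframe",
# which is exactly what the BeautifulSoup lookup below targets.
html = df.to_html()
print('class="dataframe"' in html)
```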


|   | first_name | last_name | age | preTestScore | postTestScore |
|---|------------|-----------|-----|--------------|---------------|
| 0 | Jason      | Miller    | 42  | 4            | 25            |
| 1 | Molly      | Jacobson  | 52  | 24           | 94            |
| 2 | Tina       | Ali       | 36  | 31           | 57            |
| 3 | Jake       | Milner    | 24  | 2            | 62            |
| 4 | Amy        | Cooze     | 73  | 3            | 70            |
### Download the HTML and create a Beautiful Soup object

# Target URL
url = 'http://nbviewer.ipython.org/github/HanXiaoyang/python_and_data_easy_examples/crawl_and_parse.ipynb'

# Fetch the page content with requests
r = requests.get(url)

# Parse it with BeautifulSoup
soup = BeautifulSoup(r.text, "lxml")
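One robustness tweak worth making (an addition, not part of the original notebook): check the HTTP status before parsing, since an error page still parses "successfully" but contains no dataframe table. A sketch wrapping the fetch in a hypothetical helper `fetch_soup`:

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed as a BeautifulSoup object.

    raise_for_status() turns 4xx/5xx responses into an exception
    instead of silently handing an error page to the parser.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    return BeautifulSoup(r.text, "lxml")
```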


### Parse the Beautiful Soup object

# Create five variables to store the scraped data in
first_name = []
last_name = []
age = []
preTestScore = []
postTestScore = []

# Grab the first element on the page whose class is "dataframe"
table = soup.find(class_='dataframe')

# Find all the <tr> tag pairs, skip the first one, then loop over the rest.
for row in table.find_all('tr')[1:]:
    # Create a variable of all the <td> tag pairs in each <tr> tag pair,
    col = row.find_all('td')

    # Create a variable of the string inside 1st <td> tag pair,
    column_1 = col[0].string.strip()
    # and append it to first_name variable
    first_name.append(column_1)

    # Create a variable of the string inside 2nd <td> tag pair,
    column_2 = col[1].string.strip()
    # and append it to last_name variable
    last_name.append(column_2)

    # Create a variable of the string inside 3rd <td> tag pair,
    column_3 = col[2].string.strip()
    # and append it to age variable
    age.append(column_3)

    # Create a variable of the string inside 4th <td> tag pair,
    column_4 = col[3].string.strip()
    # and append it to preTestScore variable
    preTestScore.append(column_4)

    # Create a variable of the string inside 5th <td> tag pair,
    column_5 = col[4].string.strip()
    # and append it to postTestScore variable
    postTestScore.append(column_5)

# Create a variable of the value of the columns
columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}

# Create a dataframe from the columns variable
df = pd.DataFrame(columns)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)

in ()
     10
     11 # Find all the <tr> tag pairs, skip the first one, then loop over the rest.
---> 12 for row in table.find_all('tr')[1:]:
     13     # Create a variable of all the <td> tag pairs in each <tr> tag pair,
     14     col = row.find_all('td')

AttributeError: 'NoneType' object has no attribute 'find_all'
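The AttributeError above happens because `soup.find(class_='dataframe')` returned `None`: the fetched page contained no table with that class. A defensive version (a sketch, not part of the original notebook; it uses the stdlib `html.parser` so it runs without lxml) checks before iterating:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>no table here</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

table = soup.find(class_='dataframe')
if table is None:
    # find() returns None when no matching tag exists;
    # bail out early instead of crashing on table.find_all('tr')
    print("no <table class='dataframe'> found on this page")
else:
    for row in table.find_all('tr')[1:]:
        cols = [td.get_text(strip=True) for td in row.find_all('td')]
        print(cols)
```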

# View the dataframe
df


|   | age | first_name | last_name | postTestScore | preTestScore |
|---|-----|------------|-----------|---------------|--------------|
| 0 | 42  | Jason      | Miller    | 25            | 4            |
| 1 | 52  | Molly      | Jacobson  | 94            | 24           |
| 2 | 36  | Tina       | Ali       | 57            | 31           |
| 3 | 24  | Jake       | Milner    | 62            | 2            |
| 4 | 73  | Amy        | Cooze     | 70            | 3            |

5 rows × 5 columns
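For well-formed HTML tables like this one, `pandas.read_html` can replace the whole hand-rolled loop: it parses every `<table>` in a document and returns a list of DataFrames. A sketch on a small inline table (an alternative technique, not from the original notebook):

```python
from io import StringIO
import pandas as pd

html = """
<table class="dataframe">
  <thead><tr><th></th><th>first_name</th><th>age</th></tr></thead>
  <tbody>
    <tr><th>0</th><td>Jason</td><td>42</td></tr>
    <tr><th>1</th><td>Molly</td><td>52</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> found;
# index_col=0 reuses the first column as the row index.
tables = pd.read_html(StringIO(html), index_col=0)
df = tables[0]
print(df)
```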