
A Python data-collection example: crawl_and_parse

2016-12-16 23:52
These are Jupyter notebooks covering the basics of data collection with Python, including crawl_and_parse, different_format_data_processing, feature_engineering_example, and python_regular_expression. The material was provided in an earlier course and has been ported to a Python 3 + Windows environment. The code has been uploaded to CSDN resources: ABC of data_collection

For easier reading, the code is split across four blog posts.

Below is the md file exported from the Jupyter notebook.

1.crawl_and_parse

Crawl and parse HTML with Beautiful Soup

寒小阳(hanxiaoyang.ml@gmail.com)

2016-08

# Load modules
import requests
from bs4 import BeautifulSoup
import pandas as pd


### Create a dataframe and print it out, to prepare for the crawl below

# Build a dictionary
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}

# Create the dataframe
raw_df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])

# Take a look
raw_df
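As an aside (not in the original notebook), `DataFrame.to_html()` shows why the scraper further down looks for `class='dataframe'`: that is the CSS class pandas assigns to the `<table>` it renders, so a notebook's displayed dataframe is findable by that class. A minimal sketch:

```python
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly'],
            'last_name': ['Miller', 'Jacobson']}
df = pd.DataFrame(raw_data)

# pandas tags the rendered <table> with class="dataframe",
# which is exactly what the BeautifulSoup lookup below targets.
html = df.to_html()
print('class="dataframe"' in html)
```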


|   | first_name | last_name | age | preTestScore | postTestScore |
|---|------------|-----------|-----|--------------|---------------|
| 0 | Jason      | Miller    | 42  | 4            | 25            |
| 1 | Molly      | Jacobson  | 52  | 24           | 94            |
| 2 | Tina       | Ali       | 36  | 31           | 57            |
| 3 | Jake       | Milner    | 24  | 2            | 62            |
| 4 | Amy        | Cooze     | 73  | 3            | 70            |
### Download the HTML and create a Beautiful Soup object

# Target URL
url = 'http://nbviewer.ipython.org/github/HanXiaoyang/python_and_data_easy_examples/crawl_and_parse.ipynb'

# Fetch the page content with requests
r = requests.get(url)

# Parse it with BeautifulSoup
soup = BeautifulSoup(r.text, "lxml")
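One robustness tweak worth making (an addition, not part of the original notebook): check the HTTP status before parsing, since an error page still parses "successfully" but contains no dataframe table. A sketch wrapping the fetch in a hypothetical helper `fetch_soup`:

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed as a BeautifulSoup object.

    raise_for_status() turns 4xx/5xx responses into an exception
    instead of silently handing an error page to the parser.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    return BeautifulSoup(r.text, "lxml")
```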


### Parse the Beautiful Soup object

# Create five variables to store the scraped data in
first_name = []
last_name = []
age = []
preTestScore = []
postTestScore = []

# Grab the first element on the page whose class is "dataframe"
table = soup.find(class_='dataframe')

# Find all the <tr> tag pairs, skip the first one, then loop over the rest.
for row in table.find_all('tr')[1:]:
    # Create a variable of all the <td> tag pairs in each <tr> tag pair,
    col = row.find_all('td')

    # Create a variable of the string inside 1st <td> tag pair,
    column_1 = col[0].string.strip()
    # and append it to first_name variable
    first_name.append(column_1)

    # Create a variable of the string inside 2nd <td> tag pair,
    column_2 = col[1].string.strip()
    # and append it to last_name variable
    last_name.append(column_2)

    # Create a variable of the string inside 3rd <td> tag pair,
    column_3 = col[2].string.strip()
    # and append it to age variable
    age.append(column_3)

    # Create a variable of the string inside 4th <td> tag pair,
    column_4 = col[3].string.strip()
    # and append it to preTestScore variable
    preTestScore.append(column_4)

    # Create a variable of the string inside 5th <td> tag pair,
    column_5 = col[4].string.strip()
    # and append it to postTestScore variable
    postTestScore.append(column_5)

# Create a variable of the value of the columns
columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}

# Create a dataframe from the columns variable
df = pd.DataFrame(columns)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)

in ()
     10
     11 # Find all the <tr> tag pairs, skip the first one, then loop over the rest.
---> 12 for row in table.find_all('tr')[1:]:
     13     # Create a variable of all the <td> tag pairs in each <tr> tag pair,
     14     col = row.find_all('td')

AttributeError: 'NoneType' object has no attribute 'find_all'
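The AttributeError above happens because `soup.find(class_='dataframe')` returned `None`: the fetched page contained no table with that class. A defensive version (a sketch, not part of the original notebook; it uses the stdlib `html.parser` so it runs without lxml) checks before iterating:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>no table here</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

table = soup.find(class_='dataframe')
if table is None:
    # find() returns None when no matching tag exists;
    # bail out early instead of crashing on table.find_all('tr')
    print("no <table class='dataframe'> found on this page")
else:
    for row in table.find_all('tr')[1:]:
        cols = [td.get_text(strip=True) for td in row.find_all('td')]
        print(cols)
```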

# View the dataframe
df


|   | age | first_name | last_name | postTestScore | preTestScore |
|---|-----|------------|-----------|---------------|--------------|
| 0 | 42  | Jason      | Miller    | 25            | 4            |
| 1 | 52  | Molly      | Jacobson  | 94            | 24           |
| 2 | 36  | Tina       | Ali       | 57            | 31           |
| 3 | 24  | Jake       | Milner    | 62            | 2            |
| 4 | 73  | Amy        | Cooze     | 70            | 3            |

5 rows × 5 columns
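For well-formed HTML tables like this one, `pandas.read_html` can replace the whole hand-rolled loop: it parses every `<table>` in a document and returns a list of DataFrames. A sketch on a small inline table (an alternative technique, not from the original notebook):

```python
from io import StringIO
import pandas as pd

html = """
<table class="dataframe">
  <thead><tr><th></th><th>first_name</th><th>age</th></tr></thead>
  <tbody>
    <tr><th>0</th><td>Jason</td><td>42</td></tr>
    <tr><th>1</th><td>Molly</td><td>52</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> found;
# index_col=0 reuses the first column as the row index.
tables = pd.read_html(StringIO(html), index_col=0)
df = tables[0]
print(df)
```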