Data collection with Python, example: crawl_and_parse
2016-12-16 23:52
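This is a set of Jupyter notebooks covering the basics of data collection with Python: crawl_and_parse, different_format_data_processing, feature_engineering_example, and python_regular_expression. The material was provided in an earlier course, and I ported it to Python 3 on Windows. The code is uploaded to CSDN resources: ABC of data_collection.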
For easier reading, the code is split across four blog posts.
Below is the Markdown file exported from the Jupyter notebook.
1.crawl_and_parse
Crawling and parsing HTML with Beautiful Soup
寒小阳 (hanxiaoyang.ml@gmail.com), 2016-08
    # Load modules
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
### Create a DataFrame and display it, in preparation for the scraping below
    # Build a dictionary
    raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
                'age': [42, 52, 36, 24, 73],
                'preTestScore': [4, 24, 31, 2, 3],
                'postTestScore': [25, 94, 57, 62, 70]}
    # Create the dataframe
    raw_df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
    # Take a look at the output
    raw_df
| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
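The scraping step in this post looks for an element with `class=dataframe`. That is the class pandas puts on the `<table>` it renders for a DataFrame, which is what a notebook shows as cell output. A minimal sketch to check this locally, where the shortened `demo_df` is a hypothetical stand-in for `raw_df`:

```python
import pandas as pd

# Hypothetical two-row stand-in for raw_df (not part of the original notebook)
demo_df = pd.DataFrame(
    {"first_name": ["Jason", "Molly"], "age": [42, 52]},
    columns=["first_name", "age"],
)

# to_html() renders the DataFrame as an HTML table string
html = demo_df.to_html()

# The opening <table> tag carries the "dataframe" class
print(html.splitlines()[0])
```

In the rendered table, the index values sit in `<th>` cells and the data values in `<td>` cells, so collecting only `<td>` cells per row yields the five data columns.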
### Download the HTML and create a Beautiful Soup object

    # URL
    url = 'http://nbviewer.ipython.org/github/HanXiaoyang/python_and_data_easy_examples/crawl_and_parse.ipynb'
    # Fetch the content with requests
    r = requests.get(url)
    # Parse it with BeautifulSoup
    soup = BeautifulSoup(r.text, "lxml")
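The download step does not check whether the request succeeded, so an error page would be parsed as if it were the notebook. A minimal sketch of a more defensive version follows. The helper names `fetch_html` and `make_soup` and the 10-second timeout are assumptions made for illustration, and the built-in `"html.parser"` is swapped in for `"lxml"` so the sketch has no extra dependency:

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def make_soup(html_text):
    """Parse HTML text with the parser bundled with Python."""
    return BeautifulSoup(html_text, "html.parser")


# Offline check on a literal snippet instead of a live request
soup_demo = make_soup("<html><body><p class='x'>hi</p></body></html>")
print(soup_demo.find(class_="x").string)
```

With these helpers, the download step would read `soup = make_soup(fetch_html(url))`, and a failed request would stop there instead of surfacing later as a parsing error.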
### Parse the Beautiful Soup object
    # Create five lists to store the scraped data in
    first_name = []
    last_name = []
    age = []
    preTestScore = []
    postTestScore = []

    # Find the first element with class=dataframe
    table = soup.find(class_='dataframe')

    # Find all the <tr> tag pairs, skip the first one, then for each.
    for row in table.find_all('tr')[1:]:
        # Create a variable of all the <td> tag pairs in each <tr> tag pair,
        col = row.find_all('td')

        # Create a variable of the string inside 1st <td> tag pair,
        column_1 = col[0].string.strip()
        # and append it to first_name variable
        first_name.append(column_1)

        # Create a variable of the string inside 2nd <td> tag pair,
        column_2 = col[1].string.strip()
        # and append it to last_name variable
        last_name.append(column_2)

        # Create a variable of the string inside 3rd <td> tag pair,
        column_3 = col[2].string.strip()
        # and append it to age variable
        age.append(column_3)

        # Create a variable of the string inside 4th <td> tag pair,
        column_4 = col[3].string.strip()
        # and append it to preTestScore variable
        preTestScore.append(column_4)

        # Create a variable of the string inside 5th <td> tag pair,
        column_5 = col[4].string.strip()
        # and append it to postTestScore variable
        postTestScore.append(column_5)

    # Create a variable of the value of the columns
    columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}

    # Create a dataframe from the columns variable
    df = pd.DataFrame(columns)
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    in ()
         10
         11 # Find all the <tr> tag pairs, skip the first one, then for each.
    ---> 12 for row in table.find_all('tr')[1:]:
         13     # Create a variable of all the <td> tag pairs in each <tr> tag pair,
         14     col = row.find_all('td')

    AttributeError: 'NoneType' object has no attribute 'find_all'
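The `AttributeError` means `soup.find(class_='dataframe')` returned `None`: the downloaded page contained no element with that class, so there was nothing to call `find_all` on. A minimal sketch of a guarded version of the parsing step, which fails with a clearer message and can be tried offline. The function name `parse_dataframe_table`, the `FIELDS` list, and `SAMPLE_HTML` are hypothetical names introduced for illustration, and `"html.parser"` is used instead of `"lxml"`:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Column names in the order the <td> cells appear in each row
FIELDS = ["first_name", "last_name", "age", "preTestScore", "postTestScore"]


def parse_dataframe_table(html_text):
    """Extract the first <table class="dataframe"> into a DataFrame of strings."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table", class_="dataframe")
    if table is None:
        # Fail early with a readable message instead of an AttributeError
        raise ValueError("no <table class='dataframe'> found in the page")

    records = []
    # Skip the header row, then read the <td> cells of each remaining row
    for row in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == len(FIELDS):
            records.append(dict(zip(FIELDS, cells)))
    return pd.DataFrame(records, columns=FIELDS)


# Hypothetical sample shaped like the HTML pandas renders for a DataFrame
SAMPLE_HTML = """
<table border="1" class="dataframe">
  <thead>
    <tr><th></th><th>first_name</th><th>last_name</th><th>age</th>
        <th>preTestScore</th><th>postTestScore</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>Jason</td><td>Miller</td><td>42</td><td>4</td><td>25</td></tr>
    <tr><th>1</th><td>Molly</td><td>Jacobson</td><td>52</td><td>24</td><td>94</td></tr>
  </tbody>
</table>
"""

demo = parse_dataframe_table(SAMPLE_HTML)
print(demo)
```

Printing `r.status_code` and the first few hundred characters of `r.text` is a quick way to see what the server actually returned when the table is missing.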
    # View the dataframe
    df
| | age | first_name | last_name | postTestScore | preTestScore |
|---|---|---|---|---|---|
| 0 | 42 | Jason | Miller | 25 | 4 |
| 1 | 52 | Molly | Jacobson | 94 | 24 |
| 2 | 36 | Tina | Ali | 57 | 31 |
| 3 | 24 | Jake | Milner | 62 | 2 |
| 4 | 73 | Amy | Cooze | 70 | 3 |