Google Python Class: Extracting Data Fields from HTML Pages with Regular Expressions
2015-09-06 16:14
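Format of the content to extract:
Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...
Output requirements:
Given a file name for baby.html, return a list starting with the year string followed by the name-rank strings in alphabetical order: ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
Approach:
Parse the command-line arguments in main(), read the files one by one by filename, match each line against the regular expressions, store the names in a dict, sort by name, and write the results to a file.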
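The line-by-line matching step can be tried on an in-memory sample before touching any files. The following is a minimal sketch, assuming the sample lines from the exercise description are representative of a real baby.html; the names `SAMPLE_HTML`, `YEAR_RE`, `ROW_RE`, and `parse_lines` are made up for this illustration.

```python
import re

# Sample lines copied from the exercise description (not a real baby.html file).
SAMPLE_HTML = """\
<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
"""

# Hypothetical pattern names; the patterns mirror the ones used in the script.
YEAR_RE = re.compile(r'Popularity in\s+(\d{4})')
ROW_RE = re.compile(r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>')


def parse_lines(lines):
    """Return (year, rows); each row is a (rank, name_1, name_2) tuple of strings."""
    year = None
    rows = []
    for line in lines:
        year_match = YEAR_RE.search(line)
        if year_match:
            year = year_match.group(1)
        row_match = ROW_RE.search(line)
        if row_match:
            rows.append(row_match.groups())
    return year, rows


year, rows = parse_lines(SAMPLE_HTML.splitlines())
print(year)
for rank, name_1, name_2 in rows:
    print(rank, name_1, name_2)
```

Compiling the patterns once and searching each line keeps the year and the table rows independent, so the order in which they appear in the file does not matter.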
#!/usr/bin/env python3
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

"""Baby Names exercise

Define the extract_names() function below and change main()
to call it.

For writing regex, it's nice to include a copy of the target
text for inspiration.

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...

Suggested milestones for incremental development:
 -Extract the year and print it
 -Extract the names and rank numbers and just print them
 -Get the names data into a dict and print it
 -Build the [year, 'name rank', ... ] list and print it
 -Fix main() to use the extract_names list
"""

import re
import sys


def extract_names(filename):
    """
    Given a file name for baby.html, returns a list starting with the year string
    followed by the name-rank strings in alphabetical order.
    ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
    """
    year = ''   # stays empty if the file has no "Popularity in ..." heading
    ranks = {}  # name -> rank string
    with open('reTestFile/' + filename) as file_raw:  # input single file
        for a_line in file_raw:
            # extract year
            # <h3 align="center">Popularity in 1990</h3>
            match_year = re.search(r'>Popularity in\s(\d+)<', a_line)
            if match_year:
                year = match_year.group(1)  # 1990
            match_name_and_rank = re.search(
                r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>',
                a_line)
            if match_name_and_rank:
                rank = match_name_and_rank.group(1)
                # Both name columns share the rank. Assuming rows appear in
                # rank order, setdefault keeps the better rank for a name
                # that shows up in both columns.
                for name in match_name_and_rank.group(2, 3):
                    ranks.setdefault(name, rank)
    return [year] + [name + ' ' + ranks[name] for name in sorted(ranks)]


def main():
    # This command-line parsing code is provided.
    # Make a list of command line arguments, omitting the [0] element
    # which is the script itself.
    args = sys.argv[1:]

    if not args:
        print('usage: [--summaryfile] file [file ...]')
        sys.exit(1)

    # Notice the summary flag and remove it from args if it is present.
    summary = False
    if args[0] == '--summaryfile':
        summary = True
        del args[0]

    # For each filename, get the names, then either print the text output
    # or write it to a summary file
    for a_file in args:
        names = extract_names(a_file)
        if summary:
            # Append to the shared summary output file, one list per input file.
            with open('reTestFile/output.txt', 'a') as file_output:
                print(names, file=file_output)
                file_output.write('\n')
        else:
            print(names)


if __name__ == '__main__':
    main()
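The dict storage and name sorting steps can also be checked in isolation, without any files on disk. This is a small sketch rather than the exercise's reference solution: `build_name_list` is a hypothetical helper, the rows are the three sample rows from the exercise text, and keeping the first rank seen for a repeated name is an assumption about how duplicates should be handled.

```python
def build_name_list(year, rows):
    """Build [year, 'Name rank', ...] with the names in alphabetical order."""
    ranks = {}
    for rank, *names in rows:
        for name in names:
            # Assumption: if a name appears more than once, keep the first rank seen.
            ranks.setdefault(name, rank)
    return [year] + [f'{name} {ranks[name]}' for name in sorted(ranks)]


# Rows as (rank, name, name) tuples, taken from the sample HTML in the exercise.
rows = [
    ('1', 'Michael', 'Jessica'),
    ('2', 'Christopher', 'Ashley'),
    ('3', 'Matthew', 'Brittany'),
]
result = build_name_list('1990', rows)
print(result)
# ['1990', 'Ashley 2', 'Brittany 3', 'Christopher 2', 'Jessica 1', 'Matthew 3', 'Michael 1']
```

Sorting the dict keys rather than the finished 'name rank' strings keeps the ordering purely alphabetical by name, independent of the rank digits.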