
Google Python Class: Extracting Data Fields from HTML Pages with Regular Expressions

2015-09-06 16:14
Format of the content to extract:

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...
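
The two kinds of line shown above can each be matched with one pattern. The following is a minimal sketch, assuming the sample HTML is held in a string; the names `SAMPLE_HTML`, `year`, and `rows` are illustrative and not part of the exercise.

```python
import re

# Sample text copied from the excerpt above (illustrative, not a full baby.html file)
SAMPLE_HTML = '''
<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
'''

# The year follows the fixed text "Popularity in"
year_match = re.search(r'Popularity in\s(\d+)', SAMPLE_HTML)
year = year_match.group(1) if year_match else None

# Each table row carries three cells: rank, boy's name, girl's name
rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', SAMPLE_HTML)

print(year)
print(rows)
```

`re.findall` with several groups returns one tuple per row, which is convenient when the whole file is read at once instead of line by line.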

Output requirements:

"""
Given a file name for baby.html, returns a list starting with the year string
followed by the name-rank strings in alphabetical order.
['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
"""

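Approach:

     Read the file names from the command-line arguments in main, read each file in turn, match it line by line, collect the names in a dict, sort them, and write the result to a file.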
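
The dict-and-sort step of this approach can be sketched on its own. This is a minimal illustration, assuming the rows have already been extracted as `(rank, boy, girl)` tuples; the helper name `build_name_list` is hypothetical and does not appear in the exercise.

```python
def build_name_list(year, rows):
    """Return [year, 'name rank', ...] with the name-rank strings sorted alphabetically."""
    ranks = {}
    for rank, boy, girl in rows:
        for name in (boy, girl):
            # Keep the first rank seen for a name that appears in both columns
            if name not in ranks:
                ranks[name] = rank
    return [year] + sorted(name + ' ' + rank for name, rank in ranks.items())

rows = [('1', 'Michael', 'Jessica'),
        ('2', 'Christopher', 'Ashley'),
        ('3', 'Matthew', 'Brittany')]
result = build_name_list('1990', rows)
print(result)
```

Using a dict keyed by name means a name that shows up in both the boys' and girls' columns is listed only once.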

<span style="font-size:18px;">#!/usr/bin/python
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/
import sys
import re

"""Baby Names exercise

Define the extract_names() function below and change main()
to call it.

For writing regex, it's nice to include a copy of the target
text for inspiration.

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...

Suggested milestones for incremental development:
-Extract the year and print it
-Extract the names and rank numbers and just print them
-Get the names data into a dict and print it
-Build the [year, 'name rank', ... ] list and print it
-Fix main() to use the extract_names list
"""

def extract_names(filename):
    """
    Given a file name for baby.html, returns a list starting with the year string
    followed by the name-rank strings in alphabetical order.
    ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
    """
    # +++your code here+++
    year = None
    ranks = {}  # name -> rank string

    file_raw = open('reTestFile/' + filename, 'r')  # single input file
    for a_line in file_raw:
        # extract year, e.g. <h3 align="center">Popularity in 1990</h3>
        match_year = re.search(r'>Popularity in\s(\d+)<', a_line)
        if match_year:
            year = match_year.group(1)  # 1990

        # extract rank and names, e.g.
        # <tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
        match_name_and_rank = re.search(
            r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', a_line)
        if match_name_and_rank:
            rank = match_name_and_rank.group(1)
            # group(2) is the boy's name, group(3) the girl's name. A name that
            # appears in both columns keeps the first rank seen for it.
            for name in match_name_and_rank.group(2, 3):
                if name not in ranks:
                    ranks[name] = rank
    file_raw.close()

    if year is None:
        raise ValueError('no "Popularity in <year>" heading found in ' + filename)

    dict_show = sorted(name + ' ' + rank for name, rank in ranks.items())
    dict_show.insert(0, year)
    return dict_show

def main():
    # This command-line parsing code is provided.
    # Make a list of command line arguments, omitting the [0] element
    # which is the script itself.
    args = sys.argv[1:]

    if not args:
        print('usage: [--summaryfile] file [file ...]')
        sys.exit(1)

    # Notice the summary flag and remove it from args if it is present.
    summary = False
    if args[0] == '--summaryfile':
        summary = True
        del args[0]

    # +++your code here+++
    # For each filename, get the names, then either print the text output
    # or write it to a summary file
    for a_file in args:
        names = extract_names(a_file)
        if summary:
            file_output = open('reTestFile/output.txt', 'a+')  # summary output file
            file_output.write(str(names) + '\n\n')
            file_output.close()
        else:
            print('\n'.join(names))

if __name__ == '__main__':
    main()
</span>