
Python Crawler 6: Page Parsing and Data Extraction Methods, Regular Expressions

2019-08-20 15:20, 1281 views

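Page parsing and data extraction

Structured data: the structure comes first, then the data

JSON files

JSON Path
Convert to Python types for processing (the json module)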
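
As a minimal sketch of the "convert to Python types" step, the standard-library json module can parse a JSON string into dicts and lists. The sample payload and its field names are made up for illustration.

```python
import json

# Hypothetical JSON payload, e.g. the body of an API response
raw = '{"name": "crawler", "pages": [1, 2, 3], "done": false}'

data = json.loads(raw)       # JSON object -> dict
print(type(data))
print(data["pages"][0])      # JSON array -> list
print(data["done"])          # JSON false -> Python False

# Going the other way: Python object -> JSON string
text = json.dumps(data, ensure_ascii=False)
print(text)
```

Once the data is a plain dict or list, ordinary indexing and loops are enough for simple extraction; JSON Path only becomes useful for deeply nested documents.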

XML files

XPath
CSS selectors
Regular expressions
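
A small sketch of XPath-style extraction from XML, using only the standard library. Note that xml.etree.ElementTree supports just a limited subset of XPath; the document and its element names below are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; element and attribute names are illustrative
xml_text = """
<bookstore>
    <book category="web"><title>Learning XML</title><price>39.95</price></book>
    <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
</bookstore>
""".strip()

root = ET.fromstring(xml_text)

# Path expression: every <title> under a <book> child of the root
titles = [t.text for t in root.findall("./book/title")]
print(titles)

# Predicate on an attribute value
web_books = root.findall("./book[@category='web']")
print(web_books[0].find("title").text)
```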

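Unstructured data: the data comes first, then the structure

Phone numbers
Email addresses
Data of this kind is usually handled with regular expressions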
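
To illustrate pulling phone numbers and email addresses out of free text, here is a rough sketch. Both patterns are deliberately simplified assumptions (the phone pattern assumes 11-digit numbers starting with 1) and will not cover every real-world format.

```python
import re

# Simplified patterns for illustration only; real phone and email
# formats vary far more than these expressions cover.
phone_re = re.compile(r"\b1\d{10}\b")                   # assumed: 11 digits starting with 1
email_re = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Made-up sample text
text = "Contact Li at 13812345678 or li.ming@example.com; backup: admin@mail.example.org."

phones = phone_re.findall(text)
emails = email_re.findall(text)
print(phones)
print(emails)
```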

HTML files

XPath
CSS selectors

Regular expressions

A set of rules for searching, replacing, and similar operations on string text
Basic workflow for using re

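'''
Python's regular expression module is re.
Typical steps:
1. compile() compiles the regular expression string into a Pattern object
2. The Pattern object's methods match against text; a successful match is returned as a Match object
3. The Match object's methods are used to work with the result

'''

import re

# \d matches a single digit
# the trailing + means the digit may occur one or more times
s = r"\d+"  # r marks a raw string, so backslashes need no extra escaping

# Returns a Pattern object
pattern = re.compile(s)

# Returns a Match object, or None if nothing matches
# Stops at the first match found
m = pattern.match("one12two2three3")

print(type(m))
# match() starts at the beginning of the string by default, so the result here is None
print(m)

# Returns a Match object
# The positional arguments give the index where matching starts and the index where it ends
m = pattern.match("one12two2three3", 3, 10)

print(type(m))
# Matching now starts at index 3, where "12" begins, so this match succeeds
print(m)

print(m.group())

print(m.start(0))
print(m.end(0))
print(m.span(0))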

Basic use of Match

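'''
Example of working with a Match result
'''

import re

# The pattern below contains two groups, each delimited by parentheses
s = r'([a-z]+) ([a-z]+)'
pattern = re.compile(s, re.I)  # re.I makes matching case-insensitive

m = pattern.match("Hello world wide web")

# group(0) returns the entire matched substring
s = m.group(0)
print(s)

a = m.span(0)  # span of the entire matched substring
print(a)

# group(1) returns the substring matched by the first group
s = m.group(1)
print(s)

a = m.span(1)  # span of the substring matched by the first group
print(a)

s = m.groups()  # equivalent to (m.group(1), m.group(2), ...)
print(s)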

Common regex methods:
match: matches only at the start position, returns a single match
search: searches from any position, returns a single match

'''
search
'''

import re

s = r'\d+'

pattern = re.compile(s)

m = pattern.search("one12two34three56")
print(m.group())

# The arguments give the start and end index of the range to search
m = pattern.search("one12two34three56", 10, 40)
print(m.group())
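
The difference between match and search is easiest to see on the same input. A short sketch, with an invented sample string:

```python
import re

pattern = re.compile(r"\d+")
text = "one12two34"

m1 = pattern.match(text)      # anchored at index 0, where there is no digit
m2 = pattern.search(text)     # scans forward until the first run of digits
m3 = pattern.match(text, 3)   # anchored at index 3 instead of 0

print(m1)
print(m2.group())
print(m3.group())
```

Because match returns None on failure, it is worth checking the result before calling group() on it.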

findall: finds all matches, returns a list
finditer: finds all matches, returns an iterator

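'''
findall example
'''
import re

pattern = re.compile(r'\d+')

# findall returns a list of the matched strings
s = pattern.findall("i am 18 years old and 185 high")

print(s)

# finditer returns an iterator that yields Match objects
s = pattern.finditer("i am 18 years old and 185 high")

print(type(s))

for i in s:
    print(i.group())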

split: splits a string, returns a list
sub: replaces matches
Matching Chinese: the Unicode range for Chinese characters is mainly [\u4e00-\u9fa5]

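'''
Chinese Unicode example
'''

import re

hello = u'你好，世界'

# Matches runs of characters in the main Chinese range; the full-width comma falls outside it
pattern = re.compile(r'[\u4e00-\u9fa5]+')

m = pattern.findall(hello)
print(m)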
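
The split and sub methods listed earlier have no example of their own, so here is a minimal sketch. The sample strings and the separator pattern are chosen only for illustration.

```python
import re

# split: break a string on any run of commas, semicolons, or whitespace
sep = re.compile(r"[,;\s]+")
parts = sep.split("a,b;; c   d")
print(parts)

# sub: replace every run of digits with a fixed placeholder
digits = re.compile(r"\d+")
masked = digits.sub("#", "i am 18 years old and 185 high")
print(masked)

# sub also accepts a function that receives each Match object
doubled = digits.sub(lambda m: str(int(m.group()) * 2),
                     "i am 18 years old and 185 high")
print(doubled)
```

Passing a function to sub is handy when the replacement depends on the matched text, as in the last call, which doubles each number.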
