
Natural Language 16.1: Python NLP Study Notes on Information Extraction Steps and Chunking

2016-11-21 19:43

http://www.cnblogs.com/undercurrent/p/4754944.html
1. Information Extraction Model

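　　Information extraction proceeds in five steps. The raw input is an unprocessed string.

Step 1: Sentence segmentation, done with nltk.sent_tokenize(text), which yields a list of strings.

Step 2: Tokenization, done with [nltk.word_tokenize(sent) for sent in sentences], which yields a list of lists of strings.

Step 3: Part-of-speech tagging, done with [nltk.pos_tag(sent) for sent in sentences], which yields a list of lists of tuples.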

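The first three steps can be wrapped in a single function:

>>> import nltk
>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)                       # step 1: split into sentences
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]   # step 2: split into words
...     sentences = [nltk.pos_tag(sent) for sent in sentences]         # step 3: tag parts of speech
...     return sentences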

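Step 4: Entity detection. This step has to recognize both entities that are already defined (established expressions and proper nouns) and entities that are not. It yields a list of trees.

Step 5: Relation detection. This step looks for relations between the entities and records each one as a tuple, so the final result is a list of tuples.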
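To make the data shape at each step concrete, here is a minimal sketch that uses hand-written values instead of running the tokenizer and tagger (so no NLTK models are required). The sentence, the tags, and in particular the relation tuple in step 5 are illustrative assumptions, not output from NLTK.

```python
import nltk

# Step 1 output: a list of strings, one per sentence.
sentences = ["Rapunzel let down her long golden hair"]

# Step 2 output: a list of lists of strings (the tokens of each sentence).
tokenized = [["Rapunzel", "let", "down", "her", "long", "golden", "hair"]]

# Step 3 output: a list of lists of (word, tag) tuples. Tags are hand-assigned.
tagged = [[("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
           ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]]

# Step 4 output: a list of trees, one per sentence, built by hand here.
chunked = [nltk.Tree("S", [
    nltk.Tree("NP", [("Rapunzel", "NNP")]),
    ("let", "VBD"),
    ("down", "RP"),
    nltk.Tree("NP", [("her", "PP$"), ("long", "JJ"),
                     ("golden", "JJ"), ("hair", "NN")]),
])]

# Step 5 output: a list of tuples. This particular tuple layout is hypothetical.
relations = [("Rapunzel", "let down", "her long golden hair")]

for name, value in [("sentences", sentences), ("tokenized", tokenized),
                    ("tagged", tagged), ("chunked", chunked),
                    ("relations", relations)]:
    print(name, "->", type(value).__name__, "of", type(value[0]).__name__)
```

Note that the leaves of the step 4 tree are exactly the tagged tokens from step 3; chunking only adds grouping on top of them.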

2. Chunking

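　　Chunking is the basis for entity detection in step 4. This post covers only one kind of chunk, the noun phrase chunk (NP-chunk). An NP-chunk is usually smaller than a complete noun phrase. For example, "the market for system-management software" is a noun phrase, but it is split into two NP-chunks: "the market" and "system-management software". Prepositional phrases and subordinate clauses are never included in an NP-chunk, because they contain further noun phrases of their own.

　　Here, chunks are extracted from a sentence with regular expressions over part-of-speech tags. The sample code comes first: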

import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
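The same kind of parser can be tried on the noun phrase "the market for system-management software" mentioned earlier. This is a minimal sketch: the POS tags are assigned by hand, and the one-rule grammar (optional determiner, any adjectives, one or more nouns) is an assumption chosen for this example rather than a grammar from the original notes.

```python
import nltk

# Hand-tagged tokens; a real pipeline would get these from nltk.pos_tag.
tagged = [("the", "DT"), ("market", "NN"), ("for", "IN"),
          ("system-management", "NN"), ("software", "NN")]

# Assumed grammar: optional determiner, any number of adjectives, 1+ nouns.
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>+}")
tree = cp.parse(tagged)

# Collect the words of every NP subtree, in sentence order.
np_chunks = [" ".join(word for word, tag in subtree.leaves())
             for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(np_chunks)
```

The preposition "for" is tagged IN, which the rule does not accept, so it stays outside both chunks and the noun phrase falls apart into two NP-chunks.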

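　　A grammar is written in the following form, where each pair of braces holds one rule made of tag patterns in angle brackets:

"""
chunk-name: {<tag-pattern>...<tag-pattern>}
            {...}
"""

For example:

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""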

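　　Each pair of braces contains one chunking rule, and a grammar may have one or more of them. When there is more than one rule, RegexpParser applies them in order, updating the chunk structure after each rule, until every rule has been applied. nltk.RegexpParser(grammar) builds a chunk parser from the chunking rules, and cp.parse() runs that parser on the target sentence. The result is a tree structure, which can be printed with print, or drawn by storing it in a variable such as result and calling result.draw().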
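Besides printing or drawing the tree, it can also be walked directly. The sketch below reuses the Rapunzel sentence and grammar and separates the top-level children into chunks (nltk.Tree objects) and tokens left outside any chunk (plain tuples). The variable names chunks and outside are only illustrative.

```python
import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

result = nltk.RegexpParser(grammar).parse(sentence)

chunks = []    # (label, leaves) for every chunk found
outside = []   # (word, tag) tuples that belong to no chunk
for child in result:
    if isinstance(child, nltk.Tree):
        chunks.append((child.label(), child.leaves()))
    else:
        outside.append(child)

print(chunks)
print(outside)
```

Iterating over the tree yields its direct children only, which is enough here because this grammar produces a flat, one-level chunk structure.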

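　　Chunking rules also support another kind of expression, the chink, which defines patterns that we do not want inside a chunk. A chink is written as '}pattern{'. Chinking has three possible outcomes: (1) if the chink pattern matches an entire chunk, the whole chunk is removed; (2) if the matched sequence is in the middle of a chunk, the chunk is split into two smaller chunks; (3) if it is at the edge of a chunk, the chunk becomes smaller. It is used as follows: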

import nltk

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)

>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
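The three chinking outcomes can be checked on a short hand-tagged sentence. In this sketch everything is first chunked into a single NP and then one chink pattern is applied; the helper np_chunks_after_chink and the three-word sentence are assumptions made for the illustration.

```python
import nltk

sentence = [("a", "DT"), ("little", "JJ"), ("dog", "NN")]

def np_chunks_after_chink(chink_pattern):
    """Chunk the whole sentence as one NP, apply one chink, return NP texts."""
    # Two rules for the NP stage: chunk everything, then chink the pattern.
    grammar = "NP:\n  {<.*>+}\n  }%s{" % chink_pattern
    tree = nltk.RegexpParser(grammar).parse(sentence)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

# Case 1: the chink matches the entire chunk, so the chunk disappears.
whole = np_chunks_after_chink("<DT><JJ><NN>")
# Case 2: the chink is in the middle, so the chunk splits in two.
middle = np_chunks_after_chink("<JJ>")
# Case 3: the chink is at the edge, so the chunk shrinks.
edge = np_chunks_after_chink("<NN>")

print(whole, middle, edge)
```

Starting from "chunk everything" and carving pieces out is the same strategy as in the dog and cat example; only the chink pattern changes between the three cases.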
