NLTK Chunking, Named Entity Recognition and Relation Extraction (AttributeError: module 'nltk.sem' has no attribute 'show_raw_rtuple')
2018-01-10 21:05
1. Chunking is the basic technique used for entity recognition
Example: noun phrase chunking
#A simple regex-based NP chunking example
import nltk

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
grammar = "NP: {<DT>?<JJ>*<NN>}"  # an optional determiner, any adjectives, then a noun
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
(S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
result.draw()
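Once you have the parse tree, the chunks can also be pulled out programmatically; for example (a small illustration, not part of the original post), iterating over the NP subtrees:

```python
import nltk

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
result = cp.parse(sentence)

# Walk the tree and print only the NP chunks as plain text
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, pos in subtree.leaves()))
# the little yellow dog
# the cat
```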
2. Chinking: chinking is the process of removing a sequence of tokens from a chunk
(1) If the matched token sequence spans an entire chunk, the whole chunk is removed
(2) If the token sequence appears in the middle of a chunk, those tokens are removed, leaving two chunks where there was only one before
(3) If the sequence is at the edge of a chunk, those tokens are removed, leaving a smaller chunk
grammar = r"""
NP:
    {<.*>+}        # chunk everything
    }<VBD|IN>+{    # chink sequences of VBD and IN
"""
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
(S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
3. Chunk representation
#Representing chunks
#The most widely used representation is IOB tagging: I (inside) marks a token inside a chunk,
# B (begin) marks the first token of a chunk, and O (outside) marks all other tokens
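The IOB encoding of a chunked tree can be inspected directly with nltk.chunk.tree2conlltags (a quick illustration, not in the original post):

```python
import nltk

sentence = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(sentence)

# Each token becomes a (word, POS, IOB-chunk-tag) triple
print(nltk.chunk.tree2conlltags(tree))
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]
```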
#Developing and evaluating chunkers
import nltk
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)
Regular-expression rules can improve chunking performance
#A simple evaluation baseline: the empty grammar creates no chunks, so every token is tagged O;
#the IOB accuracy below just reflects the share of O tags in the gold standard, while chunk
#precision and recall are zero
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%
grammar = r"NP: {<[CDJNP].*>+}"  # chunk any run of tags beginning with C, D, J, N or P
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
Using a unigram tagger
#Using a unigram tagger to chunk noun phrases
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # convert each chunked tree into (POS, chunk-tag) training pairs
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        # conlltags is a list of triples, so use conlltags2tree (not conllstr2tree,
        # which expects a CoNLL-format string)
        return nltk.chunk.conlltags2tree(conlltags)
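The same tagger-as-chunker pattern generalizes to richer contexts. A bigram version (a sketch along the lines of the NLTK book's BigramChunker, trained here on a tiny hand-made example so it runs without downloading the corpus) differs only in the tagger it plugs in:

```python
import nltk

class BigramChunker(nltk.ChunkParserI):
    """Like UnigramChunker, but conditions each chunk tag on the
    previous chunk tag as well as the current POS tag."""
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

# A tiny hand-made chunked sentence standing in for conll2000 training data
train = [nltk.chunk.conllstr2tree("the DT B-NP\ndog NN I-NP\nbarked VBD O\n")]
chunker = BigramChunker(train)
print(chunker.parse([('the', 'DT'), ('cat', 'NN')]))
# (S (NP the/DT cat/NN))
```

In practice you would train it on conll2000's train.txt exactly as for the unigram chunker.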
test_sents = conll2000.chunked_sents('test.txt', chunk_types=["NP"])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=["NP"])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))
4. Relation extraction: note that the original book's nltk.sem.show_raw_rtuple(rel) now fails, because show_raw_rtuple has been renamed to rtuple
AttributeError: module 'nltk.sem' has no attribute 'show_raw_rtuple'
#One approach is to first find all triples of the form (X, a, Y), where X and Y are named
#entities of the required types and a is the string expressing the relation between X and Y.
#Then use a regular expression to pull out just the relation we are looking for from a.
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
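The negative lookahead in this pattern is meant to skip gerund contexts such as "in creating", which would otherwise match the locative preposition; a quick check (not in the original post):

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# A plain locative "in" matches...
print(bool(IN.match("a firm based in")))         # True
# ...but "in" followed by a gerund is rejected by the lookahead
print(bool(IN.match("has a hand in creating")))  # False
```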
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']