
Using the OpenNLP POSTagger (Part 1): Training a Model

2014-11-25 21:05
I used to blog on Sina, mostly for my own amusement. I did not choose CSDN back then because there are so many experts here that a beginner like me felt embarrassed to take up space. Later I found that I look up most of my reference material here anyway, and switching back and forth between two blogs was a real nuisance, so I finally decided to keep my ideas and notes here instead.

Enough preamble. Today I will write down the steps I followed to use the OpenNLP part-of-speech (POS) tagger.

Recently I needed to compare the accuracy of different algorithms for POS tagging. Mature POS tagger implementations are based on HMM, ME (maximum entropy), CRFs, the perceptron, and so on.

For an introduction to ME and related implementations, see Maxent (https://homepages.inf.ed.ac.uk/lzhang10/maxent.html).

Back to the topic: below is a short introduction to the OpenNLP POSTagger.

Main reference: opennlp (http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.postagger)

The OpenNLP POS Tagger uses a probability model to predict the correct POS tag. To restrict the candidate tags for a token (word), a tag dictionary can be used, which improves both tagging and runtime performance.

Before tagging text, you first need to train a model.

The format of the training set is:

Each sentence is on its own line.

A word and its tag are joined by "_", for example "春装_n", "新款_b".

The "word_tag" pairs are separated by spaces, for example "春装_n 新款_b".

A line of the training set should look like this:

你_r 永远_d 想象_v 不_d 到_v 上_v 一秒_t 对_p 你_r 好_a 的_u 人_n 下_f 一秒_t 会_v 变成_v 什么样_r

Once the training set is ready, you can call the OpenNLP training API to train the model.

The statements that open the training set are:

InputStream dataIn = null;
dataIn = new FileInputStream(fileRead); // fileRead is the path of the training set

The whole code block is as follows:

POSModel model = null;

InputStream dataIn = null;
try {
    dataIn = new FileInputStream(fileRead.toString());
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
    model = POSTaggerME.train("ch", sampleStream, TrainingParameters.defaultParams(), new POSTaggerFactory());
}
catch (IOException e) {
    // Failed to read or parse training data, training failed
    e.printStackTrace();
}
finally {
    if (dataIn != null) {
        try {
            dataIn.close();
        }
        catch (IOException e) {
            // Not an issue, training already finished.
            // The exception should be logged and investigated
            // if part of a production system.
            e.printStackTrace();
        }
    }
}
A few notes on the train() method:

train() is a static method of the POSTaggerME class. It has three overloads, which differ in their parameters:

(1) train(String languageCode, ObjectStream<POSSample> samples, ModelType modelType, POSDictionary tagDictionary, Dictionary ngramDictionary, int cutoff, int iterations)

(2) train(String languageCode, ObjectStream<POSSample> samples, TrainingParameters trainParams, POSDictionary tagDictionary, Dictionary ngramDictionary)

(3) train(String languageCode, ObjectStream<POSSample> samples, TrainingParameters trainParams, POSTaggerFactory posFactory)

Overloads (1) and (2) are deprecated, so (3) is the one to use, which also simplifies our code.

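Next, a brief description of the parameters of train():

String languageCode: the language of the training set. For English the value should be "en"; for Chinese this post uses "ch" (note that the ISO 639-1 code for Chinese is "zh").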

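ObjectStream<POSSample> samples: ObjectStream is an interface for reading objects from a stream. The PlainTextByLineStream class implements ObjectStream; in this example its constructor takes an InputStream and the encoding of the file (the training set). The statement is:

ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");

This object returns each line of the training set as a String.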
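To make the "one line, one String" behaviour concrete, here is a minimal sketch that uses only the Java standard library. It is not OpenNLP code: the class LineStreamSketch and the method readLines are hypothetical names chosen for illustration. It only mirrors the idea that lines are decoded as UTF-8 and that a null result marks the end of the stream, which is also the contract of ObjectStream.read().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a line-by-line text stream (not part of OpenNLP).
public class LineStreamSketch {

    // Reads every line from the stream, decoding the bytes as UTF-8.
    // The loop stops when readLine() returns null, i.e. at end of input.
    public static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Two made-up training sentences in word_tag format.
        String corpus = "you_r never_d imagine_v\nspring_n style_n\n";
        InputStream in = new ByteArrayInputStream(corpus.getBytes(StandardCharsets.UTF_8));
        for (String sentence : readLines(in)) {
            System.out.println(sentence);
        }
    }
}
```

Passing the encoding explicitly matters for a Chinese corpus: relying on the platform default charset can silently corrupt the words on machines where that default is not UTF-8.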

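The inheritance chain of WordTagSampleStream is:

WordTagSampleStream->FilterObjectStream->ObjectStream;

WordTagSampleStream is a stream filter: it reads sentences containing words and POS tags line by line and outputs one POSSample object per line.

FilterObjectStream<S, T> wraps an existing stream and converts its output into something else. S is the type of the source (input) stream and T is the type of the resulting stream.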
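The conversion from a text line to a list of (word, tag) pairs can be sketched in plain Java as follows. This is an illustration of the training format only, not OpenNLP's implementation: WordTagLineSketch, TaggedWord and parseLine are hypothetical names, and splitting each token at its last underscore is an assumption made here so that a word containing "_" stays intact.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for the "word_tag word_tag ..." training format (not part of OpenNLP).
public class WordTagLineSketch {

    // One token of a training sentence: the word and its POS tag.
    public static final class TaggedWord {
        public final String word;
        public final String tag;

        TaggedWord(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // Splits a line on whitespace, then splits each token at its LAST underscore.
    // Leading and trailing spaces are ignored; a token with no word or no tag is rejected.
    public static List<TaggedWord> parseLine(String line) {
        List<TaggedWord> result = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        for (String token : trimmed.split("\\s+")) {
            int split = token.lastIndexOf('_');
            if (split <= 0 || split == token.length() - 1) {
                throw new IllegalArgumentException("Expected word_tag but found: " + token);
            }
            result.add(new TaggedWord(token.substring(0, split), token.substring(split + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (TaggedWord tw : parseLine("you_r never_d imagine_v")) {
            System.out.println(tw.word + " -> " + tw.tag);
        }
    }
}
```

Running a check like this over the training file before calling train() is a cheap way to find malformed tokens, since a single token without "_" is enough to make parsing of the training data fail.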

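TrainingParameters trainParams: the parameters used for training the model. Here we use the default parameters.

POSTaggerFactory posFactory: provides the default implementation and resources for the POS Tagger. Since we have neither an n-gram dictionary nor a POS (tag) dictionary, we use the no-argument constructor new POSTaggerFactory();

Finally, train() returns a value of type POSModel, which is stored here in the variable model.

Once the model has been created we need to save it to disk. The code for the write operation is shown below.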

OutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
    model.serialize(modelOut);
}
catch (IOException e) {
    // Failed to save model
    e.printStackTrace();
}
finally {
    if (modelOut != null) {
        try {
            modelOut.close();
        }
        catch (IOException e) {
            // Failed to correctly save model.
            // Written model might be invalid.
            e.printStackTrace();
        }
    }
}


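In the statement

modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));

the variable modelFile is the path of the file you want to write to.

model.serialize(): serializes the model to the given output stream, i.e. writes the trained model out in its serialized form.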
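On Java 7 or later, the nested try/finally used for closing the stream can be replaced by try-with-resources. The sketch below shows the pattern with the standard library only; SaveSketch, StreamWriter and save are hypothetical names introduced for illustration. The assumption is that anything with a "write yourself to this OutputStream" method, such as the model.serialize(modelOut) call above, can be passed in as the writer.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper for writing to a file with automatic stream closing (not part of OpenNLP).
public class SaveSketch {

    // Hypothetical callback: anything that can write itself to an OutputStream.
    public interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    // Opens the target file, lets the writer fill it, and always closes the stream.
    // Closing the BufferedOutputStream also flushes any buffered bytes to disk.
    public static void save(Path target, StreamWriter writer) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            writer.writeTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("model", ".bin");
        // Stand-in payload; a real caller would pass the model's serialize method instead.
        save(tmp, out -> out.write(new byte[] {1, 2, 3}));
        System.out.println(Files.size(tmp) + " bytes written to " + tmp);
    }
}
```

With a trained model in scope, the call would presumably look like save(Paths.get(modelFile), model::serialize), with an IOException from either the write or the close propagating to the caller.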

The complete code used in this post to train the model is as follows:

package org.openNLP.openNLPTest;

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class MaxentTry {
    // Directory that holds the training set and receives the model file.
    String path = "D:/Eclipse/workspace/openNLPTest/corpus/";
    StringBuffer fileRead;
    StringBuffer fileWrite;

    public void train1() {
        fileRead = new StringBuffer(path);
        fileRead.append("train.txt");
        fileWrite = new StringBuffer(path);
        fileWrite.append("model.bin");

        POSModel model = null;

        InputStream dataIn = null;
        try {
            dataIn = new FileInputStream(fileRead.toString());
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
            model = POSTaggerME.train("ch", sampleStream, TrainingParameters.defaultParams(), new POSTaggerFactory());
            // Deprecated overload, kept for reference:
            // model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
        }
        catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        }
        finally {
            if (dataIn != null) {
                try {
                    dataIn.close();
                }
                catch (IOException e) {
                    // Not an issue, training already finished.
                    // The exception should be logged and investigated
                    // if part of a production system.
                    e.printStackTrace();
                }
            }
        }

        if (model == null) {
            // Training failed, so there is nothing to save.
            return;
        }

        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(fileWrite.toString()));
            model.serialize(modelOut);
        }
        catch (IOException e) {
            // Failed to save model
            e.printStackTrace();
        }
        finally {
            if (modelOut != null) {
                try {
                    modelOut.close();
                }
                catch (IOException e) {
                    // Failed to correctly save model.
                    // Written model might be invalid.
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String args[]) {
        MaxentTry mt = new MaxentTry();
        mt.train1();
    }
}


To be continued...