
Using the openNLP POSTagger (Part 1): Training a Model

2014-11-25 21:05


I used to blog on Sina, mostly for my own amusement. I stayed away from CSDN at first because there are so many experts here that a beginner like me felt embarrassed to take up space. Later I found that most of what I look up ends up being here anyway, and switching back and forth between two blogs was a real nuisance, so I finally decided to keep my thoughts and notes here.

Enough preamble. Today I mainly want to write down the steps I follow when using the openNLP part-of-speech tagger.

Recently I have needed to compare the accuracy of different algorithms for part-of-speech (POS) tagging. Mature POSTagger implementations are based on HMM, ME (maximum entropy), CRFs, Perceptron and so on.

For an introduction to ME, an implementation of it and related implementations, see Maxent: https://homepages.inf.ed.ac.uk/lzhang10/maxent.html

Back to the topic: below is a brief introduction to openNLP's POSTagger.

The main reference is the opennlp manual: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.postagger

The OpenNLP POS Tagger uses a probability model to predict the correct POS tag. To restrict the candidate tags for a token (word), a tag dictionary can be used, which improves both tagging and runtime performance.

Before tagging any text, you first need to train a model.

The training set format is:

One sentence per line.

A word and its tag are joined with "_", for example "春装_n" and "新款_b".

The "word_tag" pairs are separated by spaces, for example "春装_n 新款_b".

A line of the training set should therefore look like this:

你_r 永远_d 想象_v 不_d 到_v 上_v 一秒_t 对_p 你_r 好_a 的_u 人_n 下_f 一秒_t 会_v 变成_v 什么样_r

Once the training set is ready, you can call openNLP's training API to train the model.

The statements that open the training set are:

    InputStream dataIn = null;
    dataIn = new FileInputStream(fileRead); // fileRead is the path of the training set

The whole code block is as follows (the imports are in the full listing at the end of this post):

    POSModel model = null;

    InputStream dataIn = null;
    try {
        dataIn = new FileInputStream(fileRead.toString());
        // Read the training file line by line as UTF-8 text.
        ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        // Turn each "word_tag word_tag ..." line into a POSSample.
        ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
        // "ch" is the language code used in this post.
        model = POSTaggerME.train("ch", sampleStream, TrainingParameters.defaultParams(), new POSTaggerFactory());
    }
    catch (IOException e) {
        // Failed to read or parse training data, training failed
        e.printStackTrace();
    }
    finally {
        if (dataIn != null) {
            try {
                dataIn.close();
            }
            catch (IOException e) {
                // Not an issue, training already finished.
                // The exception should be logged and investigated
                // if part of a production system.
                e.printStackTrace();
            }
        }
    }
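Since a single malformed token can make training fail, it can help to check the training file before handing it to openNLP. The following is only a minimal sketch of the format described above (one sentence per line, "word_tag" pairs separated by whitespace). The class TaggedLineChecker and its method parseLine are hypothetical names, not part of openNLP, and the sketch assumes the tag is whatever follows the last underscore of a token. The sample line is English only to keep the source ASCII; Chinese lines such as the one above are handled the same way.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of openNLP): checks one line of a word_tag training file. */
final class TaggedLineChecker {

    /** Splits a line such as "I_PRP like_VBP tea_NN" into {word, tag} pairs. */
    static List<String[]> parseLine(String line) {
        List<String[]> pairs = new ArrayList<String[]>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return pairs; // blank line: no tokens
        }
        for (String token : trimmed.split("\\s+")) {
            // Assumption: the tag is whatever follows the LAST underscore,
            // so a word that itself contains '_' is still split correctly.
            int split = token.lastIndexOf('_');
            if (split <= 0 || split == token.length() - 1) {
                throw new IllegalArgumentException("Expected word_tag but found: " + token);
            }
            pairs.add(new String[] { token.substring(0, split), token.substring(split + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Print each word and its tag, separated by a tab.
        for (String[] pair : parseLine("I_PRP like_VBP tea_NN")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Running parseLine over every line of train.txt and reporting the line number of the first exception is usually enough to locate a bad token.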

A few notes on the train() method:

train() is a static method of the POSTaggerME class. It has three overloads, which differ in their parameters:

(1) train(String languageCode, ObjectStream<POSSample> samples, ModelType modelType, POSDictionary tagDictionary, Dictionary ngramDictionary, int cutoff, int iterations)

(2) train(String languageCode, ObjectStream<POSSample> samples, TrainingParameters trainParams, POSDictionary tagDictionary, Dictionary ngramDictionary)

(3) train(String languageCode, ObjectStream<POSSample> samples, TrainingParameters trainParams, POSTaggerFactory posFactory)

In OpenNLP 1.5.3 (the version of the manual referenced above), (1) and (2) are deprecated, which leaves (3) as the one to use. This also simplifies our code.

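Next, a brief description of the parameters of train():

String languageCode: the language of the training set, for example "en" for English. This post passes "ch" for Chinese, although the ISO 639-1 code for Chinese is "zh".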

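ObjectStream<POSSample> samples: ObjectStream is an interface that defines how objects are read from a stream. The PlainTextByLineStream class implements the ObjectStream interface; in this example its constructor takes an InputStream and the encoding of the file (the training set). The statement is:

    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");

This object returns each line of the training set as a String.

The class hierarchy of WordTagSampleStream is:

WordTagSampleStream -> FilterObjectStream -> ObjectStream

WordTagSampleStream is a stream filter: it reads sentences containing words and POS tags line by line and outputs objects of type POSSample.

FilterObjectStream<S, T> takes an existing stream and converts its output into something else. S is the type of the source (input) stream and T is the type of the current stream.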

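TrainingParameters trainParams: the parameters used to train the model. Here we use the defaults, TrainingParameters.defaultParams().

POSTaggerFactory posFactory: supplies the default implementations and resources for the POS tagger. Since we have neither an n-gram dictionary nor a POS (tag) dictionary, we use the default no-argument constructor, new POSTaggerFactory().

Finally, train() returns a value of type POSModel. In this example it is stored in the variable model.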
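To make the filter-stream idea behind the samples parameter more concrete, here is a small sketch that uses only the JDK. It is not openNLP code: SimpleStream, LineSource and TokenCountStream are hypothetical stand-ins chosen for illustration. The only convention it borrows from ObjectStream is that read() returns null once the stream is exhausted. The filter wraps a stream of lines (S = String) and emits one value per line (T = Integer), in the same way that WordTagSampleStream wraps a stream of lines and emits POSSample objects.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the idea behind ObjectStream: read() returns null when exhausted. */
interface SimpleStream<T> {
    T read();
}

/** Source stream: hands out one line per read() call. */
final class LineSource implements SimpleStream<String> {
    private final Iterator<String> lines;

    LineSource(List<String> lines) {
        this.lines = lines.iterator();
    }

    public String read() {
        return lines.hasNext() ? lines.next() : null;
    }
}

/** Filter stream: consumes lines (S = String) and emits their token counts (T = Integer). */
final class TokenCountStream implements SimpleStream<Integer> {
    private final SimpleStream<String> source;

    TokenCountStream(SimpleStream<String> source) {
        this.source = source;
    }

    public Integer read() {
        String line = source.read();
        if (line == null) {
            return null; // propagate end-of-stream to the caller
        }
        String trimmed = line.trim();
        // Number of whitespace-separated tokens; a blank line has none.
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

final class FilterStreamDemo {
    public static void main(String[] args) {
        SimpleStream<Integer> counts = new TokenCountStream(
                new LineSource(Arrays.asList("I_PRP like_VBP tea_NN", "Tea_NN is_VBZ good_JJ")));
        Integer count;
        // Keep reading until the filter reports end-of-stream with null.
        while ((count = counts.read()) != null) {
            System.out.println(count);
        }
    }
}
```

The caller only ever sees the outer stream, which is why train() can accept an ObjectStream<POSSample> without knowing whether the samples come from a file or from somewhere else.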

After the model has been produced, we need to save it to disk. The code for writing it out is as follows.

    OutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
        model.serialize(modelOut);
    }
    catch (IOException e) {
        // Failed to save model
        e.printStackTrace();
    }
    finally {
        if (modelOut != null) {
            try {
                modelOut.close();
            }
            catch (IOException e) {
                // Failed to correctly save model.
                // Written model might be invalid.
                e.printStackTrace();
            }
        }
    }

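In the statement

    modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));

the variable modelFile is the path of the file you want to write to.

model.serialize(): serializes the model to the given output stream, in other words it writes the model out in serialized form.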
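On Java 7 or later, the try/finally boilerplate around the output stream can be shortened with try-with-resources. The sketch below is only an illustration under that assumption. Serializer and ModelWriter are hypothetical names, with Serializer standing in for any object that has a serialize(OutputStream) method, such as the trained model. With the real POSModel you would call model.serialize(out) inside the try block instead.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical stand-in for anything with a serialize(OutputStream) method, such as a trained model. */
interface Serializer {
    void serialize(OutputStream out) throws IOException;
}

final class ModelWriter {

    /** Writes the model to the target file; the stream is closed automatically, even on failure. */
    static void save(Serializer model, File target) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
            model.serialize(out);
        }
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("model", ".bin");
        save(new Serializer() {
            public void serialize(OutputStream out) throws IOException {
                out.write(new byte[] { 1, 2, 3 }); // placeholder bytes, not a real model
            }
        }, target);
        System.out.println(target.length() + " bytes written to " + target);
    }
}
```

Closing the BufferedOutputStream also flushes it, so no explicit flush() call is needed before the file is complete.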

The complete code used in this post to train the model is shown below:

    package org.openNLP.openNLPTest;

    import java.io.BufferedOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerFactory;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class MaxentTry {
        String path = "D:/Eclipse/workspace/openNLPTest/corpus/";
        StringBuffer fileRead;
        StringBuffer fileWrite;

        public void train1() {
            // Training set and model file both live in the corpus directory.
            fileRead = new StringBuffer(path);
            fileRead.append("train.txt");
            fileWrite = new StringBuffer(path);
            fileWrite.append("model.bin");

            POSModel model = null;

            InputStream dataIn = null;
            try {
                dataIn = new FileInputStream(fileRead.toString());
                ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
                ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
                model = POSTaggerME.train("ch", sampleStream, TrainingParameters.defaultParams(), new POSTaggerFactory());
                // Deprecated overload (2), kept for reference:
                // model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
            }
            catch (IOException e) {
                // Failed to read or parse training data, training failed
                e.printStackTrace();
            }
            finally {
                if (dataIn != null) {
                    try {
                        dataIn.close();
                    }
                    catch (IOException e) {
                        // Not an issue, training already finished.
                        // The exception should be logged and investigated
                        // if part of a production system.
                        e.printStackTrace();
                    }
                }
            }

            if (model == null) {
                // Training failed, so there is nothing to save.
                return;
            }

            OutputStream modelOut = null;
            try {
                modelOut = new BufferedOutputStream(new FileOutputStream(fileWrite.toString()));
                model.serialize(modelOut);
            }
            catch (IOException e) {
                // Failed to save model
                e.printStackTrace();
            }
            finally {
                if (modelOut != null) {
                    try {
                        modelOut.close();
                    }
                    catch (IOException e) {
                        // Failed to correctly save model.
                        // Written model might be invalid.
                        e.printStackTrace();
                    }
                }
            }
        }

        public static void main(String args[]) {
            MaxentTry mt = new MaxentTry();
            mt.train1();
        }
    }

To be continued...
