
How to modify the source code so that the paoding (庖丁) analyzer supports the latest Lucene 3.0.2

2010-07-09 14:48
Everyone likes open source, but it comes with one recurring problem: version compatibility.

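Last year I was still using Lucene 2.3, and now it is already at 3.0.2. qieqie, paoding's developer, is presumably very busy, so a paoding release compatible with Lucene 3 has still not appeared.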

An issue on Google Code, http://code.google.com/p/paoding/issues/detail?id=49#makechanges, provides three files that overwrite the original sources to make paoding compatible with Lucene 3.

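We took a different route: we synced the latest paoding code directly from svn and embedded it in our project. A colleague running it on Lucene 3.0.1 has not reported any errors, but when I embedded it under Lucene 3.0.2 it threw: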

Exception in thread "main" java.lang.NullPointerException
at net.paoding.analysis.analyzer.PaodingTokenizer.close(PaodingTokenizer.java:164)
at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:571)
at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1362)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1250)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1178)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1167)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:182)
at test.TestBoBo.main(TestBoBo.java:44)
Eventually I came across an article about making paoding support solr1.4:

Reposted from: http://www.odoc.info/?p=185



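Because solr1.4 uses Lucene 2.9.1, the paoding source file net.paoding.analysis.analyzer.PaodingTokenizer has to be modified.

There are two points to note.

1. The parent class changes from TokenStream to Tokenizer. Tokenizer already declares the input reader, so the following field has to be deleted: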

private final Reader input;

Delete the corresponding close method as well:

public void close() throws IOException {
    super.close();
    input.close();
}
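The NullPointerException in the stack trace above can be reproduced without Lucene. The following is a minimal sketch under one assumption: the base class clears its input reference inside close(), which appears to be what Tokenizer does in Lucene 3.0.2. BaseTokenizer, OldTokenizer and FixedTokenizer are hypothetical stand-ins, not Lucene or paoding classes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class CloseOrderDemo {

    // Hypothetical stand-in for a base class that drops its reader on close (assumption).
    static class BaseTokenizer {
        protected Reader input;

        BaseTokenizer(Reader input) {
            this.input = input;
        }

        public void close() throws IOException {
            if (input != null) {
                input.close();
                input = null; // the reference is cleared once the reader is closed
            }
        }
    }

    // Mirrors the removed override: it touches input after super.close() has cleared it.
    static class OldTokenizer extends BaseTokenizer {
        OldTokenizer(Reader input) {
            super(input);
        }

        @Override
        public void close() throws IOException {
            super.close();
            input.close(); // input is null at this point
        }
    }

    // Mirrors the fix: no override, the inherited close() already closes the reader.
    static class FixedTokenizer extends BaseTokenizer {
        FixedTokenizer(Reader input) {
            super(input);
        }
    }

    // Returns true if closing the old-style tokenizer throws a NullPointerException.
    public static boolean oldCloseThrowsNpe() {
        try {
            new OldTokenizer(new StringReader("text")).close();
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if closing the fixed tokenizer completes without an exception.
    public static boolean fixedCloseSucceeds() {
        try {
            new FixedTokenizer(new StringReader("text")).close();
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("old close() throws NPE: " + oldCloseThrowsNpe());
        System.out.println("fixed close() succeeds: " + fixedCloseSucceeds());
    }
}
```

If the assumption holds, the override is redundant and also harmful, so removing it is the smallest possible fix.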

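2. Because the underlying implementation of highlighting has changed, the reset method has to be overridden. Previously reset only replaced input; now several values have to be reset.

public void reset(Reader input) throws IOException {
    this.input = input;
    this.inputLength = 0;
    this.offset = 0;
    this.dissected = 0;
    this.tokenIteractor = null;
    this.beef.set(0, 0);
}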
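Why every piece of state has to be cleared can be shown with a toy tokenizer. This is a minimal sketch, not paoding code: CharTokenizer and its methods are hypothetical, and it emits one token per character only to make the offsets easy to follow. If only the reader is swapped, the offsets reported for the new text continue from the old text, which is the kind of error a highlighter would turn into misplaced highlights.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ResetDemo {

    // Hypothetical tokenizer: one token per character, tracking the token's start offset.
    static class CharTokenizer {
        private Reader input;
        private int offset; // offset of the next character within the current input

        CharTokenizer(Reader input) {
            this.input = input;
        }

        // Returns the start offset of the next token, or -1 when the input is exhausted.
        int nextOffset() {
            try {
                return input.read() < 0 ? -1 : offset++;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // Incomplete reset: swaps the reader but keeps the offset from the previous text.
        void resetInputOnly(Reader input) {
            this.input = input;
        }

        // Full reset: swaps the reader and clears the per-text state.
        void resetAll(Reader input) {
            this.input = input;
            this.offset = 0;
        }
    }

    // Consumes "abc", resets onto "xyz", and returns the offset reported for 'x'.
    public static int firstOffsetAfterReset(boolean fullReset) {
        CharTokenizer tokenizer = new CharTokenizer(new StringReader("abc"));
        while (tokenizer.nextOffset() >= 0) {
            // drain the first text
        }
        Reader next = new StringReader("xyz");
        if (fullReset) {
            tokenizer.resetAll(next);
        } else {
            tokenizer.resetInputOnly(next);
        }
        return tokenizer.nextOffset();
    }

    public static void main(String[] args) {
        System.out.println("input-only reset: " + firstOffsetAfterReset(false));
        System.out.println("full reset:       " + firstOffsetAfterReset(true));
    }
}
```

With the input-only reset, the first token of the second text is reported at offset 3 instead of 0, because the counter still holds the length of the first text.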

The adjusted code is as follows:

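package net.paoding.analysis.analyzer;

import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import net.paoding.analysis.knife.Beef;
import net.paoding.analysis.knife.Collector;
import net.paoding.analysis.knife.Knife;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class PaodingTokenizer extends Tokenizer implements Collector {

    // -------------------------------------------------

    /**
     * Total number of characters read from input
     */
    private int inputLength;

    /**
     * Size of the character buffer
     */
    private static final int bufferLength = 128;

    /**
     * Receives the text characters coming from {@link #input}
     *
     * @see #incrementToken()
     */
    private final char[] buffer = new char[bufferLength];

    /**
     * Offset of {@link #buffer}[0] within {@link #input}
     *
     * @see #collect(String, int, int)
     * @see #incrementToken()
     */
    private int offset;

    /**
     * View over buffer that is handed to the knife
     */
    private final Beef beef = new Beef(buffer, 0, 0);

    /**
     * Value returned by the last knife.dissect() call; a negative value -n
     * means the characters from buffer[n] onward are kept for the next round
     */
    private int dissected;

    /**
     * Dissects the text characters in beef; supplied by PaodingAnalyzer
     *
     * @see #incrementToken()
     */
    private Knife knife;

    /**
     * Collects the words reported through collect()
     */
    private TokenCollector tokenCollector;

    /**
     * Token iterator, used by incrementToken() to read the collected
     * Token objects in order
     *
     * @see #incrementToken()
     */
    private Iterator<Token> tokenIteractor;

    private TermAttribute termAtt;
    private OffsetAttribute offsetAtt;
    private TypeAttribute typeAtt;
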
    // -------------------------------------------------

    /**
     *
     * @param input
     * @param knife
     * @param tokenCollector
     */
    public PaodingTokenizer(Reader input, Knife knife,
            TokenCollector tokenCollector) {
        this.input = input;
        this.knife = knife;
        this.tokenCollector = tokenCollector;
        init();
    }

    private void init() {
        termAtt = addAttribute(TermAttribute.class);
        offsetAtt = addAttribute(OffsetAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    // -------------------------------------------------

    public TokenCollector getTokenCollector() {
        return tokenCollector;
    }

    public void setTokenCollector(TokenCollector tokenCollector) {
        this.tokenCollector = tokenCollector;
    }

    // -------------------------------------------------

    public void collect(String word, int offset, int end) {
        tokenCollector.collect(word, this.offset + offset, this.offset + end);
    }

    // -------------------------------------------------
    public int getInputLength() {
        return inputLength;
    }

    // -------------------------------------------------

    // update by stt 2010-07-09
    // reason: this override threw a NullPointerException
    /*@Override
    public void close() throws IOException {
        super.close();
        input.close();
    }*/

    @Override
    public boolean incrementToken() throws IOException {
        // The tokens in tokenIteractor are exhausted, so pull more data from the reader
        while (tokenIteractor == null || !tokenIteractor.hasNext()) {
            // System.out.println(dissected);
            int read = 0;
            // Characters still left in buffer before reading from the reader again;
            // a negative value means no read is needed yet
            int remaining = -1;
            if (dissected >= beef.length()) {
                remaining = 0;
            } else if (dissected < 0) {
                remaining = bufferLength + dissected;
            }
            if (remaining >= 0) {
                if (remaining > 0) {
                    // Move the unprocessed tail to the front of the buffer
                    System.arraycopy(buffer, -dissected, buffer, 0, remaining);
                }
                read = input.read(buffer, remaining, bufferLength - remaining);
                if (read > 0) {
                    // Do not count the -1 that read() returns at end of stream
                    inputLength += read;
                }
                int charCount = remaining + read;
                if (charCount < 0) {
                    // The reader is exhausted: return false to signal the end of the stream
                    return false;
                }
                if (charCount < bufferLength) {
                    buffer[charCount++] = 0;
                }
                // Build the "beef" and let the knife "dissect" it
                beef.set(0, charCount);
                offset += Math.abs(dissected);
                // offset -= remaining;
                dissected = 0;
            }
            dissected = knife.dissect((Collector) this, beef, dissected);
            // offset += read;// !!!
            tokenIteractor = tokenCollector.iterator();
        }
        // Emit the next Token from tokenIteractor
        Token token = tokenIteractor.next();
        termAtt.setTermBuffer(token.term());
        offsetAtt.setOffset(correctOffset(token.startOffset()),
                correctOffset(token.endOffset()));
        typeAtt.setType("paoding");
        return true;
    }

    @Override
    public void reset(Reader input) throws IOException {
        // update by stt 2010-07-09
        // reason: the previous version led to a NullPointerException
        /*super.reset();
        offset = 0;
        inputLength = 0;*/

        this.input = input;
        this.inputLength = 0;
        this.offset = 0;
        this.dissected = 0;
        this.tokenIteractor = null;
        this.beef.set(0, 0);
    }
}

After this change, the error no longer occurs.
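The refill step inside incrementToken() (copy the unprocessed tail to the front of the buffer with System.arraycopy, then read into the space behind it) can be tried in isolation. The sketch below is an illustration under simplifying assumptions, not paoding's algorithm: it splits on spaces instead of calling a Knife, and RefillDemo, splitOnSpaces and the 8-character buffer are made up for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class RefillDemo {

    // Deliberately tiny so that words straddle buffer boundaries.
    static final int BUFFER_LENGTH = 8;

    // Splits the reader's content on spaces using one small, reused buffer.
    // A word longer than the buffer is emitted in buffer-sized pieces.
    public static List<String> splitOnSpaces(Reader input) {
        char[] buffer = new char[BUFFER_LENGTH];
        List<String> words = new ArrayList<>();
        int remaining = 0; // unprocessed characters carried over at the front of the buffer
        try {
            while (true) {
                // Fill the space behind the carried-over tail.
                int read = input.read(buffer, remaining, BUFFER_LENGTH - remaining);
                boolean endOfInput = read < 0;
                int count = remaining + (endOfInput ? 0 : read);

                // Emit every complete word, i.e. every run terminated by a space.
                int start = 0;
                for (int i = 0; i < count; i++) {
                    if (buffer[i] == ' ') {
                        if (i > start) {
                            words.add(new String(buffer, start, i - start));
                        }
                        start = i + 1;
                    }
                }

                if (endOfInput) {
                    // No more data will arrive, so the tail is a complete word.
                    if (count > start) {
                        words.add(new String(buffer, start, count - start));
                    }
                    return words;
                }

                remaining = count - start;
                if (remaining == BUFFER_LENGTH) {
                    // The buffer holds no space at all: emit it to make room.
                    words.add(new String(buffer, 0, remaining));
                    remaining = 0;
                } else if (remaining > 0) {
                    // Carry the incomplete word to the front for the next round.
                    System.arraycopy(buffer, start, buffer, 0, remaining);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces(new StringReader("pao ding fen ci")));
    }
}
```

The input in main is longer than the buffer, so "ding" and "fen" are each completed only after a refill, which is the situation the carry-over exists for.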