
A Simple Program That Lets Lucene Use ICTCLAS, the Chinese Academy of Sciences' Chinese Word Segmentation System (C# Version)

2010-02-07 20:00
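I took the .NET port of ICTCLAS that 吕震宇 adapted from the free version of ICTCLAS and made its word segmentation usable from Lucene. The program I wrote is below; it is fairly simple. Please take a look and point out anything that could be improved.
Analyzer class: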

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace ICTCLASForLucene
{
    public class ICTCLASAnalyzer : Analyzer
    {
        // Stop words to filter out (Chinese and English), loaded by the constructor.
        // A list is used while reading so the array never contains null entries.
        public readonly string[] CHINESE_ENGLISH_STOP_WORDS;

        // Stop-word file: one word per line, UTF-8 encoded.
        public string NoisePath = Path.Combine(Path.Combine(Environment.CurrentDirectory, "data"), "sNoise.txt");

        public ICTCLASAnalyzer()
        {
            List<string> words = new List<string>();
            // "using" closes the file even if reading fails.
            using (StreamReader reader = new StreamReader(NoisePath, Encoding.UTF8))
            {
                // Reading stops at end of file or at the first empty line.
                string noise = reader.ReadLine();
                while (!string.IsNullOrEmpty(noise))
                {
                    words.Add(noise);
                    noise = reader.ReadLine();
                }
            }
            CHINESE_ENGLISH_STOP_WORDS = words.ToArray();
        }

        /// <summary>Constructs an ICTCLASTokenizer filtered by a StandardFilter,
        /// a LowerCaseFilter and a StopFilter.
        /// </summary>
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream result = new ICTCLASTokenizer(reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new StopFilter(result, CHINESE_ENGLISH_STOP_WORDS);
            return result;
        }
    }
}

Tokenizer class:
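using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

using Lucene.Net.Analysis;
using SharpICTCLAS;

namespace ICTCLASForLucene
{
    class ICTCLASTokenizer : Tokenizer
    {
        // Passed to WordSegment.Segment; presumably the number of candidate segmentations requested.
        int nKind = 2;
        // Candidate segmentations; only result[0] is used.
        List<WordResult[]> result;
        int startIndex = 0;
        int endIndex = 0;
        // Starts at 1 and stops before the last entry, so the first and last
        // entries of result[0] are skipped (they appear to be sentence boundary markers).
        int i = 1;
        /// <summary>
        /// The sentence to segment
        /// </summary>
        private string sentence;

        /// <summary>Constructs a tokenizer for this Reader.</summary>
        public ICTCLASTokenizer(System.IO.TextReader reader)
        {
            this.input = reader;
            sentence = input.ReadToEnd();
            // Strip Windows line breaks before segmenting.
            sentence = sentence.Replace("\r\n", "");
            string DictPath = Path.Combine(Environment.CurrentDirectory, "Data") + Path.DirectorySeparatorChar;
            //Console.WriteLine("Initializing the dictionary, please wait");
            // Note: the dictionary is loaded again for every tokenizer instance.
            WordSegment wordSegment = new WordSegment();
            wordSegment.InitWordSegment(DictPath);
            result = wordSegment.Segment(sentence, nKind);
        }

        /// <summary>Returns the next token in the stream, or null when the stream is exhausted.
        /// </summary>
        public override Token Next()
        {
            if (result == null || result.Count == 0 || i >= result[0].Length - 1)
            {
                return null;
            }

            string word = result[0][i].sWord;
            // endIndex is the index of the word's last character (inclusive);
            // Lucene's own tokenizers report an exclusive end offset instead.
            endIndex = startIndex + word.Length - 1;
            Token token = new Token(word, startIndex, endIndex);
            startIndex = endIndex + 1;

            i++;
            return token;
        }
    }
}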
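
The offset bookkeeping in Next() can be tried out without Lucene or the ICTCLAS dictionaries. The following is only a minimal sketch: OffsetSketch and AssignOffsets are hypothetical names that belong to neither Lucene.Net nor SharpICTCLAS, and the sketch assumes the words are contiguous in the source text (as the tokenizer above does). It computes the (word,start,end) triples either with an inclusive end offset, as the tokenizer above does, or with an exclusive one, which is the convention Lucene documents for Token.

```csharp
using System;
using System.Collections.Generic;

public static class OffsetSketch
{
    // Hypothetical helper for illustration only.
    // Formats each word as "(word,start,end)", assuming the words follow
    // one another in the source text with nothing in between.
    public static List<string> AssignOffsets(IList<string> words, bool inclusiveEnd)
    {
        List<string> spans = new List<string>();
        int start = 0;
        foreach (string word in words)
        {
            // Index just past the last character of the word.
            int endExclusive = start + word.Length;
            // Inclusive end = index of the last character itself.
            int end = inclusiveEnd ? endExclusive - 1 : endExclusive;
            spans.Add("(" + word + "," + start + "," + end + ")");
            // The next word starts where this one stopped.
            start = endExclusive;
        }
        return spans;
    }
}
```

For the words "***", ",", "周恩来", calling AssignOffsets with inclusiveEnd set to true gives (***,0,2)(,,3,3)(周恩来,4,6), and with false it gives (***,0,3)(,,3,4)(周恩来,4,7). The inclusive variant matches what the tokenizer above emits; the exclusive one is what code that relies on Lucene's offsets (highlighting, for example) would normally expect.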


Segmentation demo:

Sentence to segment: ***,周恩来,中华人民共和国在1949年建立,从此开始了新中国的伟大篇章.长春市长春节发表致词汉字abc iphone 1265325.98921 fee1212@tom.com http://news.qq.com 100%
Tokens produced, as (word,start,end):
(***,0,2)(周恩来,4,6)(中华人民共和国,8,14)(1949年,16,20)(建立,21,22)(从此,24,25)(新,29,29)(中国,30,31)(伟大,33,34)(篇章,35,36)(长春市,38,40)(春节,42,43)(发表,44,45)(致词,46,47)(汉字,48,49)(abc,50,52)(iphone,53,58)(1265325.98921,59,71)(fee1212@tom,72,82)(com,84,86)(http://news,87,97)(qq,99,100)(com,102,104)(100%,105,108)
Elapsed time: 00:00:00.0937500


 