
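Extending the IK Chinese analyzer with a custom dictionary [Source walkthrough: the article refers to a Configuration class, but my copy has a Configuration interface and a DefaultConfig class; the IK versions probably differ]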

2015-08-12 15:07

Source: http://blog.csdn.net/iamaboyy/article/details/7569977

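1. Requirements and workflow design for custom word segmentation in a distributed system

(See figure) E:\plan\readingnote\分词与索引\分词\2012-4-20

2. How segmentation works: the dictionary loading process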

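2.1. Loading the segmentation dictionary involves three classes: Configuration, Dictionary, and DictSegment.
The first two read the configuration file and locate the dictionary files, which prepares for loading the dictionary content. DictSegment is the class that actually loads the words into memory.
2.2. When the analyzer runs, the Dictionary object is called first. Its loadMainDict() method includes the loading of the custom dictionary content.
2.2.1. To load the custom dictionary content, Dictionary first calls a Configuration method that returns the custom dictionary file paths configured in IKAnalyzer.cfg.xml (the file that configures custom dictionary paths): List<String> extDictFiles = Configuration.getExtDictionarys(); Before that call can return anything, the configuration file itself has to be located and loaded into memory as a stream. Both tasks are implemented in Configuration; Dictionary only calls the interface that Configuration exposes.
2.2.2. Now look at the two things Configuration does.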

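private Configuration(){
    props = new Properties();
    // path, path2 and path3 are not used below; they only show how getResource resolves names.
    // URL of the configuration file (getResource returns null if it is missing, so toString() would throw)
    String path = Configuration.class.getResource(FILE_NAME).toString();
    // "" resolves relative to the package of Configuration
    String path2 = Configuration.class.getResource("").toString();
    // "/" resolves to the classpath root
    String path3 = Configuration.class.getResource("/").toString();
    // Open the configuration file from the classpath as a stream
    InputStream input = Configuration.class.getResourceAsStream(FILE_NAME);
    if(input != null){
        try {
            props.loadFromXML(input);
        } catch (InvalidPropertiesFormatException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}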

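(1) Initialization. IKAnalyzer.cfg.xml is loaded into memory as a stream. Note the Configuration.class.getResourceAsStream(FILE_NAME) call (bold in the original post): unless the code is modified, only files on the classpath of the Configuration class can be loaded. (For more on this call, see E:\plan\readingnote\分词与索引\分词\分词自定义词典路径)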
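
The classpath restriction follows from how Class.getResourceAsStream turns a name into a resource path. The sketch below mirrors the rule documented for java.lang.Class: a leading "/" means "from the classpath root", anything else is relative to the package of the class. The class name ResourceNameSketch and the method resolve are made up for illustration and are not part of IK; array classes and module rules are ignored.

```java
final class ResourceNameSketch {

    // Simplified re-statement of the name resolution rule documented for
    // Class.getResource / Class.getResourceAsStream (hypothetical helper, not IK code).
    static String resolve(Class<?> owner, String name) {
        if (name.startsWith("/")) {
            // Absolute name: looked up from the classpath root.
            return name.substring(1);
        }
        String className = owner.getName();
        int lastDot = className.lastIndexOf('.');
        if (lastDot < 0) {
            // Class in the default package: the name is used as-is.
            return name;
        }
        // Relative name: prefixed with the package of the owning class.
        return className.substring(0, lastDot).replace('.', '/') + "/" + name;
    }

    public static void main(String[] args) {
        // String.class stands in for Configuration.class here.
        System.out.println(resolve(String.class, "IKAnalyzer.cfg.xml"));   // java/lang/IKAnalyzer.cfg.xml
        System.out.println(resolve(String.class, "/IKAnalyzer.cfg.xml"));  // IKAnalyzer.cfg.xml
    }
}
```

In both cases the resulting name is handed to the class loader, so a file system path such as E:\dict\my.dic is never consulted.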

public static List<String> getExtDictionarys(){
    List<String> extDictFiles = new ArrayList<String>(2);
    String extDictCfg = CFG.props.getProperty(EXT_DICT);
    if(extDictCfg != null){
        // Multiple extension dictionaries are separated by ";"
        String[] filePaths = extDictCfg.split(";");
        if(filePaths != null){
            for(String filePath : filePaths){
                // Skip empty entries and strip surrounding whitespace
                if(filePath != null && !"".equals(filePath.trim())){
                    extDictFiles.add(filePath.trim());
                    //System.out.println(filePath.trim());
                }
            }
        }
    }
    return extDictFiles;
}

(2) This code reads the custom dictionary paths configured in IKAnalyzer.cfg.xml, collects them in a list, and returns that list.
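
As a worked example of steps (1) and (2) together, the sketch below loads a configuration in the java.util.Properties XML format from an in-memory string and splits the dictionary entry the same way getExtDictionarys() does. The key name ext_dict, the two dictionary paths, and the class ExtDictConfigSketch are assumptions for illustration; the article only refers to the key through the constant EXT_DICT and does not show the file content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class ExtDictConfigSketch {

    // Assumed key name; the article only shows the constant EXT_DICT.
    static final String EXT_DICT_KEY = "ext_dict";

    // Hypothetical configuration content in the Properties XML format.
    static final String SAMPLE_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>extension configuration</comment>\n"
        + "  <entry key=\"ext_dict\">/mydict.dic; /dict/extra.dic;</entry>\n"
        + "</properties>\n";

    // Step (1): parse the XML into a Properties object.
    static Properties loadProps(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Step (2): split on ";", trim each entry, and drop empty ones.
    static List<String> parsePaths(String extDictCfg) {
        List<String> paths = new ArrayList<>();
        if (extDictCfg == null) {
            return paths;
        }
        for (String raw : extDictCfg.split(";")) {
            String path = raw.trim();
            if (!path.isEmpty()) {
                paths.add(path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Properties props = loadProps(SAMPLE_XML);
        System.out.println(parsePaths(props.getProperty(EXT_DICT_KEY)));
    }
}
```

With the sample content above, the result is a two-element list containing /mydict.dic and /dict/extra.dic; the space after the first separator and the trailing separator are both discarded.
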
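2.2.3. Back to the Dictionary class. Once it has the custom dictionary file paths, it locates each dictionary file by its path and then calls DictSegment to load the words into memory.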

// "is" is an InputStream declared earlier in the method (not shown here)
List<String> extDictFiles = Configuration.getExtDictionarys();
if(extDictFiles != null){
    for(String extDictName : extDictFiles){
        // Read the extension dictionary file from the classpath
        is = Dictionary.class.getResourceAsStream(extDictName);
        // If the extension dictionary cannot be found, skip it
        if(is == null){
            continue;
        }
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(is , "UTF-8"), 512);
            String theWord = null;
            do {
                theWord = br.readLine();
                if (theWord != null && !"".equals(theWord.trim())) {
                    // Load the extension dictionary word into the in-memory main dictionary
                    //System.out.println(theWord);
                    _MainDict.fillSegment(theWord.trim().toCharArray());
                }
            } while (theWord != null);

        } catch (IOException ioe) {
            System.err.println("Extension Dictionary loading exception.");
            ioe.printStackTrace();

        }finally{
            try {
                if(is != null){
                    is.close();
                    is = null;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

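Note the two calls that were bold in the original post. The first, Dictionary.class.getResourceAsStream(extDictName), means that only custom dictionary files on the classpath of the Dictionary class can be found; a file outside that path is not found and is silently skipped. The second, _MainDict.fillSegment(...), calls DictSegment to load the words of the custom dictionary.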
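
One possible way around the classpath restriction, sketched here as an assumption and not something the IK source or the original article does, is to try the classpath first and fall back to the file system when the resource is not found. The class ExtDictLoaderSketch and its methods are hypothetical; in a patched Dictionary, each word returned by readWords would be passed to fillSegment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class ExtDictLoaderSketch {

    // Classpath first, then the file system; returns null if neither has the file.
    static InputStream open(String name) throws IOException {
        InputStream is = ExtDictLoaderSketch.class.getResourceAsStream(name);
        if (is != null) {
            return is;
        }
        Path file = Paths.get(name);
        return Files.isRegularFile(file) ? Files.newInputStream(file) : null;
    }

    // Reads one word per line as UTF-8, trimming whitespace and skipping blank lines.
    static List<String> readWords(String name) {
        List<String> words = new ArrayList<>();
        try {
            InputStream is = open(name);
            if (is == null) {
                return words; // missing dictionaries are skipped, as in the original code
            }
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(is, StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    String word = line.trim();
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    // Demo: writes a temporary dictionary outside the classpath and reads it back.
    static List<String> demo() {
        try {
            Path tmp = Files.createTempFile("ext", ".dic");
            try {
                Files.write(tmp, "alpha\n\n  \u5206\u8bcd  \n".getBytes(StandardCharsets.UTF_8));
                return readWords(tmp.toString());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The order of the two lookups is a design choice: checking the classpath first keeps existing configurations working unchanged, while absolute file paths only take effect when no classpath resource has the same name.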
