
Jieba (结巴分词) Java Version Source Code Analysis (Part 3)

2018-01-14 15:45
The previous part computed the segmentation path over the dictionary-based DAG with dynamic programming. For out-of-vocabulary words, an HMM model is used, decoded with the Viterbi algorithm:
for (int i = 0; i < sentence.length(); ++i) {
    char ch = sentence.charAt(i);
    if (CharacterUtil.isChineseLetter(ch)) { // a Chinese character ends any pending non-Chinese run
        if (other.length() > 0) {
            processOtherUnknownWords(other.toString(), tokens);
            other = new StringBuilder();
        }
        chinese.append(ch);
    }
    else {
        if (chinese.length() > 0) { // a non-Chinese character ends any pending Chinese run
            viterbi(chinese.toString(), tokens);
            chinese = new StringBuilder();
        }
        other.append(ch);
    }
}
if (chinese.length() > 0)
    viterbi(chinese.toString(), tokens);
else {
    processOtherUnknownWords(other.toString(), tokens);
}
The code above separates Chinese from non-Chinese characters; as you can see, jieba handles English letters, digits, and other characters as well.
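To make the buffering above concrete, here is a minimal, self-contained sketch. The isChinese check is a simplified stand-in for CharacterUtil.isChineseLetter, assumed here to be a CJK range test; the real implementation may differ.

import java.util.ArrayList;
import java.util.List;

public class RunSplitDemo {
    // Simplified stand-in for CharacterUtil.isChineseLetter (assumed range check).
    static boolean isChinese(char ch) {
        return ch >= 0x4E00 && ch <= 0x9FA5;
    }

    public static void main(String[] args) {
        String sentence = "今天weather很好123";
        List<String> runs = new ArrayList<>();
        StringBuilder chinese = new StringBuilder(), other = new StringBuilder();
        for (char ch : sentence.toCharArray()) {
            if (isChinese(ch)) {
                if (other.length() > 0) {   // flush a pending non-Chinese run
                    runs.add("OTHER:" + other);
                    other = new StringBuilder();
                }
                chinese.append(ch);
            } else {
                if (chinese.length() > 0) { // flush a pending Chinese run
                    runs.add("CHINESE:" + chinese);
                    chinese = new StringBuilder();
                }
                other.append(ch);
            }
        }
        if (chinese.length() > 0) runs.add("CHINESE:" + chinese);
        else if (other.length() > 0) runs.add("OTHER:" + other);
        System.out.println(runs);
        // prints [CHINESE:今天, OTHER:weather, CHINESE:很好, OTHER:123]
    }
}

Chinese runs then go to viterbi(), and non-Chinese runs to processOtherUnknownWords(), shown next.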
private void processOtherUnknownWords(String other, List<String> tokens) { // split a non-Chinese run with a regex
    Matcher mat = CharacterUtil.reSkip.matcher(other);
    int offset = 0;
    while (mat.find()) {
        if (mat.start() > offset) {
            tokens.add(other.substring(offset, mat.start()));
        }
        tokens.add(mat.group());
        offset = mat.end();
    }
    if (offset < other.length())
        tokens.add(other.substring(offset));
}
public static Pattern reSkip = Pattern.compile("(\\d+\\.\\d+|[a-zA-Z0-9]+)");


Non-Chinese runs are thus handled with a regular expression: decimal numbers and alphanumeric runs become tokens of their own, and any text between matches is emitted as-is.
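To see what this produces, here is a small standalone demo of the same matching loop on a made-up input:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReSkipDemo {
    static final Pattern reSkip = Pattern.compile("(\\d+\\.\\d+|[a-zA-Z0-9]+)");

    public static void main(String[] args) {
        String other = "price:12.5kg!";
        List<String> tokens = new ArrayList<>();
        Matcher mat = reSkip.matcher(other);
        int offset = 0;
        while (mat.find()) {
            if (mat.start() > offset)                 // text between matches
                tokens.add(other.substring(offset, mat.start()));
            tokens.add(mat.group());                  // the decimal/alphanumeric run itself
            offset = mat.end();
        }
        if (offset < other.length())                  // trailing remainder
            tokens.add(other.substring(offset));
        System.out.println(tokens); // prints [price, :, 12.5, kg, !]
    }
}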
Below, the Viterbi algorithm decodes the unknown-word portion. The BMES sequence labeling can be viewed as a first-order Markov model, similar to a bigram: each state is constrained only by the state immediately before it.
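The method below references several model tables (states, prevStatus, start, trans, emit) that are loaded elsewhere from the trained probability files. Note that all values are log probabilities, which is why the code adds rather than multiplies them. Here is a hedged sketch of the tables' shapes, inferred from how viterbi() uses them; the prevStatus constraints and the MIN_FLOAT value are taken from the Python jieba implementation and assumed to match this Java port:

import java.util.HashMap;
import java.util.Map;

public class HmmModelSketch {
    // B = begin of word, M = middle, E = end, S = single-character word
    static final char[] states = { 'B', 'M', 'E', 'S' };

    // Which states may legally precede each state (assumption: same as Python jieba):
    static final Map<Character, char[]> prevStatus = new HashMap<>();
    static {
        prevStatus.put('B', new char[] { 'E', 'S' }); // a word can only start after E or S
        prevStatus.put('M', new char[] { 'M', 'B' });
        prevStatus.put('S', new char[] { 'S', 'E' });
        prevStatus.put('E', new char[] { 'B', 'M' });
    }

    // start: log P(first state)          -> Map<Character, Double>
    // trans: log P(state | prev state)   -> Map<Character, Map<Character, Double>>
    // emit:  log P(char | state)         -> Map<Character, Map<Character, Double>>
    // MIN_FLOAT stands in for log(0) when a character or transition is unseen
    static final double MIN_FLOAT = -3.14e100; // value from Python jieba; assumed here
}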
public void viterbi(String sentence, List<String> tokens) {
    Vector<Map<Character, Double>> v = new Vector<Map<Character, Double>>(); // v.get(i) holds, per state, the best log probability ending at position i
    Map<Character, Node> path = new HashMap<Character, Node>();

    v.add(new HashMap<Character, Double>());
    for (char state : states) {
        Double emP = emit.get(state).get(sentence.charAt(0)); // emission probability of the first character under each of the four states
        if (null == emP)
            emP = MIN_FLOAT;
        v.get(0).put(state, start.get(state) + emP); // four candidate ways the sentence can begin
        path.put(state, new Node(state, null));
    }

    for (int i = 1; i < sentence.length(); ++i) {
        Map<Character, Double> vv = new HashMap<Character, Double>();
        v.add(vv);
        Map<Character, Node> newPath = new HashMap<Character, Node>();
        for (char y : states) {
            Double emp = emit.get(y).get(sentence.charAt(i));
            if (emp == null)
                emp = MIN_FLOAT;
            Pair<Character> candidate = null;
            for (char y0 : prevStatus.get(y)) {
                Double tranp = trans.get(y0).get(y);
                if (null == tranp)
                    tranp = MIN_FLOAT;
                tranp += (emp + v.get(i - 1).get(y0));
                if (null == candidate)
                    candidate = new Pair<Character>(y0, tranp);
                else if (candidate.freq <= tranp) {
                    candidate.freq = tranp;
                    candidate.key = y0;
                }
            }
            vv.put(y, candidate.freq); // best score for state y at position i over all legal predecessors
            newPath.put(y, new Node(y, path.get(candidate.key))); // extend the winning predecessor's path
        }
        path = newPath;
    }
    // a valid labeling must end in E or S
    double probE = v.get(sentence.length() - 1).get('E');
    double probS = v.get(sentence.length() - 1).get('S');
    Vector<Character> posList = new Vector<Character>(sentence.length());
    Node win;
    if (probE < probS)
        win = path.get('S');
    else
        win = path.get('E');

    while (win != null) {
        posList.add(win.value);
        win = win.parent;
    }
    Collections.reverse(posList);

    int begin = 0, next = 0;
    for (int i = 0; i < sentence.length(); ++i) {
        char pos = posList.get(i);
        if (pos == 'B')
            begin = i;
        else if (pos == 'E') {
            tokens.add(sentence.substring(begin, i + 1));
            next = i + 1;
        }
        else if (pos == 'S') {
            tokens.add(sentence.substring(i, i + 1));
            next = i + 1;
        }
    }
    if (next < sentence.length())
        tokens.add(sentence.substring(next));
}
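The tail of the method converts the recovered BMES labels back into tokens. Here is a tiny standalone demo of just that conversion, using a hand-written label sequence instead of a decoded one:

import java.util.ArrayList;
import java.util.List;

public class BmesDemo {
    // Turn a BMES label sequence into tokens, mirroring the last loop of viterbi().
    static List<String> toTokens(String sentence, char[] pos) {
        List<String> tokens = new ArrayList<>();
        int begin = 0, next = 0;
        for (int i = 0; i < sentence.length(); ++i) {
            if (pos[i] == 'B')
                begin = i;                                    // a multi-char word starts here
            else if (pos[i] == 'E') {
                tokens.add(sentence.substring(begin, i + 1)); // close the multi-char word
                next = i + 1;
            } else if (pos[i] == 'S') {
                tokens.add(sentence.substring(i, i + 1));     // single-char word
                next = i + 1;
            }
        }
        if (next < sentence.length())  // trailing B/M with no E: emit the remainder
            tokens.add(sentence.substring(next));
        return tokens;
    }

    public static void main(String[] args) {
        // 我=S 爱=S 北京=BE 天安门=BME
        System.out.println(toTokens("我爱北京天安门",
                new char[] { 'S', 'S', 'B', 'E', 'B', 'M', 'E' }));
        // prints [我, 爱, 北京, 天安门]
    }
}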
And that completes the segmentation. I won't belabor further details; the overall approach is as above. Next, a few of my own thoughts:

Jieba first builds a DAG by dictionary matching and selects the best path with dynamic programming. Chinese segmentation is inherently hard, and the biggest strength of this approach is that it is simple and fast; it also has drawbacks of its own, such as the memory the dictionary occupies. Out-of-vocabulary words are handled by the HMM above, whose emission table maps each within-word position state (B/M/E/S) to individual characters. I see this as a limitation of the algorithm: it takes no wider context into account, so in concrete applications some ambiguities may go unresolved, and segmentation accuracy therefore still depends heavily on the DAG. Moreover, because the probabilities of several short words multiply into a very small product (for example, two words at 10^-4 each give 10^-8, whereas a single longer word might stand at 10^-6), the segmenter is biased toward longer words. Even so, as a segmentation tool it gives good results in the vast majority of cases.