smartcn is the Chinese word segmentation tool that ships with Lucene; it is derived from the ICTCLAS Chinese segmentation system from the Chinese Academy of Sciences. For the research behind the ICTCLAS algorithm, see here. A good entry point for analyzing the behavior of SmartChineseAnalyzer is the reusableTokenStream or tokenStream method. The former can be reused to improve performance (from a quick look, some instances appear to be kept in a ThreadLocal so that the next call avoids rebuilding them). Below are my notes from tracing the code, taking reusableTokenStream as the starting point.
reusableTokenStream checks whether a SavedStreams instance already exists; if not, it creates one, otherwise it resets some state and returns it directly. I will only follow the creation path. Creating the streams does not perform any real work yet:
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new SentenceTokenizer(reader);
streams.filteredTokenStream = new WordTokenFilter(streams.tokenStream);
streams.filteredTokenStream = new PorterStemFilter(streams.filteredTokenStream);
if (!stopWords.isEmpty()) {
streams.filteredTokenStream = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
streams.filteredTokenStream, stopWords, false);
}
This method returns a TokenStream instance. Calling incrementToken() on it retrieves the segmented words one by one. Following incrementToken leads to PorterStemFilter.incrementToken(), which in turn calls WordTokenFilter.incrementToken().
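Before going further into the filters, here is a minimal sketch of how the whole chain might be driven and its tokens printed (assuming the Lucene 3.0-era contrib API; the Version constant, the field name "content" and the sample sentence are placeholders of mine):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class SmartcnDemo {
    public static void main(String[] args) throws Exception {
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_30);
        // reusableTokenStream builds the SentenceTokenizer / WordTokenFilter / ... chain described above
        TokenStream stream = analyzer.reusableTokenStream("content", new StringReader("我是一名程序员"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(term.term()); // one segmented word per call
        }
    }
}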
WordTokenFilter.incrementToken() first calls SentenceTokenizer.incrementToken() to split the text into sentences. A sentence ends when a full stop or another sentence-ending punctuation mark (defined in Utility.PUNCTION) is encountered, or when the end of the text is reached. Whitespace characters (Utility.SPACES) are ignored.
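To watch the sentence splitting in isolation, SentenceTokenizer can also be fed directly; a small sketch (the sample text is arbitrary, and I am only assuming TermAttribute is enough to read the output):

import java.io.StringReader;
import org.apache.lucene.analysis.cn.smart.SentenceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class SentenceDemo {
    public static void main(String[] args) throws Exception {
        // Each call to incrementToken() should emit one whole sentence as a single token.
        SentenceTokenizer sentences =
                new SentenceTokenizer(new StringReader("今天天气不错。我们出去走走吧!"));
        TermAttribute term = sentences.addAttribute(TermAttribute.class);
        while (sentences.incrementToken()) {
            System.out.println(term.term());
        }
    }
}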
After a sentence has been delimited, segmentSentence() is called to segment it. The most important step inside segmentSentence() is the call to HHMMSegmenter.process(), which is the core of the algorithm. As the class name suggests, it is based on a Hidden Markov Model (HMM). Let's look at this method:
/**
* Return a list of {@link SegToken} representing the best segmentation of a sentence
* @param sentence input sentence
* @return best segmentation as a {@link List}
*/
public List<SegToken> process(String sentence) {
SegGraph segGraph = createSegGraph(sentence);
BiSegGraph biSegGraph = new BiSegGraph(segGraph);
List<SegToken> shortPath = biSegGraph.getShortPath();
return shortPath;
}
There are three steps here: first build the SegGraph, then build a BiSegGraph from it, and finally find the shortest path through it as the result. For these two graphs, see the corresponding part of the ICTCLAS algorithm description. Roughly speaking, the first graph holds every possible word, while the second holds the possible joins between adjacent words (the edges of a directed graph). Nodes in both graphs carry a weight; in the second graph the weight is derived from the frequency of the two-word combination and run through a smoothing formula. Finding the shortest path is really just finding the best combination of words. The relevant code is annotated below.
Step 1: org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.createSegGraph(String). The main field of SegGraph is a Map whose keys are integers and whose values are lists of SegToken:
private Map<Integer, List<SegToken>> tokenListTable = new HashMap<Integer, List<SegToken>>();
The keys are in fact the start offsets of the candidate words within the source string. Another field, maxStart = -1, records the largest start offset seen so far.
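From that description alone, a minimal sketch of what addToken has to do looks roughly like this (my own illustration, not the real SegGraph source):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.cn.smart.hhmm.SegToken;

// Sketch: group tokens by their start offset and remember the largest start offset seen.
class SegGraphSketch {
    private Map<Integer, List<SegToken>> tokenListTable = new HashMap<Integer, List<SegToken>>();
    private int maxStart = -1;

    public void addToken(SegToken token) {
        int start = token.startOffset;
        List<SegToken> tokens = tokenListTable.get(start);
        if (tokens == null) {
            tokens = new ArrayList<SegToken>();
            tokenListTable.put(start, tokens);
        }
        tokens.add(token);
        if (start > maxStart) {
            maxStart = start;
        }
    }

    public boolean isStartExist(int start) { return tokenListTable.containsKey(start); }
    public List<SegToken> getStartList(int start) { return tokenListTable.get(start); }
    public int getMaxStart() { return maxStart; }
}

With that structure in mind, here is createSegGraph itself: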
/**
* Create the {@link SegGraph} for a sentence.
*
* @param sentence
* input sentence, without start and end markers
* @return {@link SegGraph} corresponding to the input sentence.
*/
private SegGraph createSegGraph(String sentence) {
int i = 0, j;
int length = sentence.length();
int foundIndex;
int[] charTypeArray = getCharTypes(sentence);
StringBuilder wordBuf = new StringBuilder();
SegToken token;
int frequency = 0; // the number of times word appears.
boolean hasFullWidth;
int wordType;
char[] charArray;
SegGraph segGraph = new SegGraph();
/* walk the sentence from start to end */
while (i < length) {
hasFullWidth = false;
switch (charTypeArray[i]) {
case CharType.SPACE_LIKE:
i++;
/* skip whitespace */
break;
case CharType.HANZI:
/* handle Chinese characters (Hanzi) */
j = i + 1;
wordBuf.delete(0, wordBuf.length());
// It doesn't matter if a single Chinese character (Hanzi) can
// form a phrase or not,
// it will store that single Chinese character (Hanzi) in the
// SegGraph. Otherwise, it will
// cause word division.
wordBuf.append(sentence.charAt(i));
charArray = new char[] { sentence.charAt(i) };
frequency = wordDict.getFrequency(charArray);
token = new SegToken(charArray, i, j, WordType.CHINESE_WORD,
frequency);
/* always add the single Hanzi as a token */
segGraph.addToken(token);
foundIndex = wordDict.getPrefixMatch(charArray);
/* extend to the right and add every dictionary word that can be formed; foundIndex != -1 means the current prefix may still grow into a word */
while (j <= length && foundIndex != -1) {
/* the characters already form a dictionary word ("may" above, because they could also be just the prefix of a longer word) */
if (wordDict.isEqual(charArray, foundIndex)
&& charArray.length > 1) {
// It is the phrase we are looking for; In other words,
// we have found a phrase SegToken
// from i to j. It is not a monosyllabic word (single
// word).
frequency = wordDict.getFrequency(charArray);
token = new SegToken(charArray, i, j,
WordType.CHINESE_WORD, frequency);
segGraph.addToken(token);
}
/* skip whitespace while extending (smartcn can still correctly segment words that are separated by spaces) */
while (j < length && charTypeArray[j] == CharType.SPACE_LIKE)
j++;
/* if the next char is a Hanzi, append it and keep testing whether the buffer forms a word */
if (j < length && charTypeArray[j] == CharType.HANZI) {
wordBuf.append(sentence.charAt(j));
charArray = new char[wordBuf.length()];
wordBuf.getChars(0, charArray.length, charArray, 0);
// idArray has been found (foundWordIndex!=-1) as a
// prefix before.
// Therefore, idArray after it has been lengthened can
// only appear after foundWordIndex.
// So start searching after foundWordIndex.
foundIndex = wordDict.getPrefixMatch(charArray,
foundIndex);
j++;
} else {
break; /* end of input or not a Hanzi */
}
}
i++;
break;
/* the remaining cases handle other character types; I won't analyze them here */
case CharType.FULLWIDTH_LETTER:
hasFullWidth = true;
case CharType.LETTER:
j = i + 1;
while (j < length && (charTypeArray[j] == CharType.LETTER || charTypeArray[j] == CharType.FULLWIDTH_LETTER)) {
if (charTypeArray[j] == CharType.FULLWIDTH_LETTER)
hasFullWidth = true;
j++;
}
// Found a Token from i to j. Type is LETTER char string.
charArray = Utility.STRING_CHAR_ARRAY;
frequency = wordDict.getFrequency(charArray);
wordType = hasFullWidth ? WordType.FULLWIDTH_STRING
: WordType.STRING;
token = new SegToken(charArray, i, j, wordType, frequency);
segGraph.addToken(token);
i = j;
break;
case CharType.FULLWIDTH_DIGIT:
hasFullWidth = true;
case CharType.DIGIT:
j = i + 1;
while (j < length && (charTypeArray[j] == CharType.DIGIT || charTypeArray[j] == CharType.FULLWIDTH_DIGIT)) {
if (charTypeArray[j] == CharType.FULLWIDTH_DIGIT)
hasFullWidth = true;
j++;
}
// Found a Token from i to j. Type is NUMBER char string.
charArray = Utility.NUMBER_CHAR_ARRAY;
frequency = wordDict.getFrequency(charArray);
wordType = hasFullWidth ? WordType.FULLWIDTH_NUMBER
: WordType.NUMBER;
token = new SegToken(charArray, i, j, wordType, frequency);
segGraph.addToken(token);
i = j;
break;
case CharType.DELIMITER:
j = i + 1;
// No need to search the weight for the punctuation. Picking the
// highest frequency will work.
frequency = Utility.MAX_FREQUENCE;
charArray = new char[] { sentence.charAt(i) };
token = new SegToken(charArray, i, j, WordType.DELIMITER,
frequency);
segGraph.addToken(token);
i = j;
break;
default:
j = i + 1;
// Treat the unrecognized char symbol as unknown string.
// For example, any symbol not in GB2312 is treated as one of
// these.
charArray = Utility.STRING_CHAR_ARRAY;
frequency = wordDict.getFrequency(charArray);
token = new SegToken(charArray, i, j, WordType.STRING,
frequency);
segGraph.addToken(token);
i = j;
break;
}
}
// Add two more Tokens: "beginning xx beginning"
/* prepend a sentence-begin marker to simplify later processing (see the ICTCLAS algorithm) */
charArray = Utility.START_CHAR_ARRAY;
frequency = wordDict.getFrequency(charArray);
token = new SegToken(charArray, -1, 0, WordType.SENTENCE_BEGIN,
frequency);
segGraph.addToken(token);
// "end xx end"
/* likewise, append a sentence-end marker */
charArray = Utility.END_CHAR_ARRAY;
frequency = wordDict.getFrequency(charArray);
token = new SegToken(charArray, length, length + 1,
WordType.SENTENCE_END, frequency);
segGraph.addToken(token);
return segGraph;
}
Step 2: org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.generateBiSegGraph(SegGraph), which builds the BiSegGraph.
In its constructor, BiSegGraph first builds an index over the segGraph, assigning every token an index number that is used later, and then calls generateBiSegGraph() to build the path graph.
/*
* Generate a BiSegGraph based upon a SegGraph
*/
private void generateBiSegGraph(SegGraph segGraph) {
double smooth = 0.1;
int wordPairFreq = 0;
int maxStart = segGraph.getMaxStart();
double oneWordFreq, weight, tinyDouble = 1.0 / Utility.MAX_FREQUENCE;
int next;
char[] idBuffer;
// get the list of tokens ordered and indexed
segTokenList = segGraph.makeIndex(); /* isn't this redundant? the constructor already did it */
// Because the beginning position of startToken is -1, therefore
// startToken can be obtained when key = -1
int key = -1;
List<SegToken> nextTokens = null;
/* walk the start offsets from beginning to end */
while (key < maxStart) {
if (segGraph.isStartExist(key)) {
List<SegToken> tokenList = segGraph.getStartList(key);
/**
* For every token at the given key (i.e. start offset), walk the tokens that can
* immediately follow it and store each adjacency (token pair) in the graph. A token
* pair is effectively an edge between two tokens.
*/
// Calculate all tokens for a given key.
for (SegToken t1 : tokenList) {
oneWordFreq = t1.weight;
next = t1.endOffset;
nextTokens = null;
// Find the next corresponding Token.
// For example: "Sunny seashore", the present Token is
// "sunny", next one should be "sea" or "seashore".
// If we cannot find the next Token, then go to the end and
// repeat the same cycle.
/* find the start offset of the tokens that can follow t1 */
while (next <= maxStart) {
// Because the beginning position of endToken is
// sentenceLen, so equal to sentenceLen can find
// endToken.
if (segGraph.isStartExist(next)) {
nextTokens = segGraph.getStartList(next);
break;
}
next++;
}
if (nextTokens == null) {
break;
}
/* iterate over the tokens that can follow t1 */
for (SegToken t2 : nextTokens) {
idBuffer = new char[t1.charArray.length
+ t2.charArray.length + 1];
System.arraycopy(t1.charArray, 0, idBuffer, 0,
t1.charArray.length);
idBuffer[t1.charArray.length] = BigramDictionary.WORD_SEGMENT_CHAR;
System.arraycopy(t2.charArray, 0, idBuffer,
t1.charArray.length + 1, t2.charArray.length);
// Two linked Words frequency
wordPairFreq = bigramDict.getFrequency(idBuffer);
/* I haven't studied the smoothing formula used for this weight yet; my math is rusty. I'll come back to it later. */
// Smoothing
// -log{a*P(Ci-1)+(1-a)P(Ci|Ci-1)} Note 0 < a < 1
weight = -Math.log(smooth
* (1.0 + oneWordFreq)
/ (Utility.MAX_FREQUENCE + 0.0)
+ (1.0 - smooth)
* ((1.0 - tinyDouble) * wordPairFreq
/ (1.0 + oneWordFreq) + tinyDouble));
SegTokenPair tokenPair = new SegTokenPair(idBuffer,
t1.index, t2.index, weight);
this.addSegTokenPair(tokenPair);
}
}
}
key++;
}
}
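To make the smoothing expression above a bit more concrete before I study it properly, here is the same computation pulled out into a standalone snippet (the frequencies and the stand-in value for Utility.MAX_FREQUENCE are made-up numbers, purely for illustration):

// Standalone illustration of the edge-weight smoothing used in generateBiSegGraph().
public class SmoothingDemo {
    public static void main(String[] args) {
        double smooth = 0.1;                     // the mixing factor a, 0 < a < 1
        double maxFrequence = 2000000;           // assumed stand-in for Utility.MAX_FREQUENCE
        double tinyDouble = 1.0 / maxFrequence;
        double oneWordFreq = 5000;               // frequency of the first word (t1.weight)
        double wordPairFreq = 20;                // frequency of the bigram "t1@t2" in BigramDictionary

        // weight = -log{ a * P(Ci-1) + (1 - a) * P(Ci|Ci-1) }
        double weight = -Math.log(smooth * (1.0 + oneWordFreq) / maxFrequence
                + (1.0 - smooth)
                * ((1.0 - tinyDouble) * wordPairFreq / (1.0 + oneWordFreq) + tinyDouble));

        System.out.println("edge weight = " + weight); // smaller weight = more likely word pair
    }
}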
Step 3: org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath(), which finds the best path. This part is essentially a dynamic programming algorithm; any data structures and algorithms textbook covers the idea.
/**
* Find the shortest path with the Viterbi algorithm.
*
* @return {@link List}
*/
public List<SegToken> getShortPath() {
int current;
int nodeCount = getToCount();
List<PathNode> path = new ArrayList<PathNode>();
PathNode zeroPath = new PathNode();
zeroPath.weight = 0;
zeroPath.preNode = 0;
path.add(zeroPath); /* insert a head node to serve as the starting "origin" */
/**
* From start to end, compute the best path to each node and record it. Later nodes
* reuse the results of earlier ones, so once this loop finishes, the best path can be
* recovered by walking backwards from the last node.
*/
for (current = 1; current <= nodeCount; current++) {
double weight;
List<SegTokenPair> edges = getToList(current);
double minWeight = Double.MAX_VALUE;
SegTokenPair minEdge = null;
for (SegTokenPair edge : edges) {
weight = edge.weight;
PathNode preNode = path.get(edge.from);
if (preNode.weight + weight < minWeight) {
minWeight = preNode.weight + weight;
minEdge = edge;
}
}
PathNode newNode = new PathNode();
newNode.weight = minWeight;
newNode.preNode = minEdge.from;
path.add(newNode);
}
// Calculate PathNodes
int preNode, lastNode;
lastNode = path.size() - 1;
current = lastNode;
List<Integer> rpath = new ArrayList<Integer>();
List<SegToken> resultPath = new ArrayList<SegToken>();
rpath.add(current);
while (current != 0) {
PathNode currentPathNode = path.get(current);
preNode = currentPathNode.preNode;
rpath.add(Integer.valueOf(preNode));
current = preNode;
}
/* why not build resultPath directly in the previous loop? this step seems a bit redundant */
for (int j = rpath.size() - 1; j >= 0; j--) {
Integer idInteger = (Integer) rpath.get(j);
int id = idInteger.intValue();
SegToken t = segTokenList.get(id);
resultPath.add(t);
}
return resultPath;
}
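The same forward-pass-plus-backtracking pattern, stripped of the Lucene classes, looks like this on a tiny hand-built graph (my own toy example, not smartcn code):

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Node 0 is the start, node 4 the end; edgesTo.get(n) lists the incoming edges of node n.
public class ShortPathDemo {
    static class Edge { int from; double weight; Edge(int f, double w) { from = f; weight = w; } }

    public static void main(String[] args) {
        List<List<Edge>> edgesTo = Arrays.asList(
                Collections.<Edge>emptyList(),                      // node 0: no incoming edges
                Arrays.asList(new Edge(0, 1.0)),                    // 0 -> 1
                Arrays.asList(new Edge(0, 2.5), new Edge(1, 0.5)),  // 0 -> 2, 1 -> 2
                Arrays.asList(new Edge(1, 2.0), new Edge(2, 0.7)),  // 1 -> 3, 2 -> 3
                Arrays.asList(new Edge(3, 0.3)));                   // 3 -> 4

        // Forward pass: for each node keep the cheapest incoming edge, like the loop over getToList().
        double[] best = new double[edgesTo.size()];
        int[] pre = new int[edgesTo.size()];
        for (int node = 1; node < edgesTo.size(); node++) {
            best[node] = Double.MAX_VALUE;
            for (Edge e : edgesTo.get(node)) {
                if (best[e.from] + e.weight < best[node]) {
                    best[node] = best[e.from] + e.weight;
                    pre[node] = e.from;
                }
            }
        }

        // Backtrack from the last node and reverse, just as getShortPath() does with rpath.
        Deque<Integer> path = new ArrayDeque<Integer>();
        for (int node = edgesTo.size() - 1; node != 0; node = pre[node]) {
            path.addFirst(node);
        }
        path.addFirst(0);
        System.out.println(path + ", total weight = " + best[edgesTo.size() - 1]); // [0, 1, 2, 3, 4], 2.5
    }
}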
That is the main algorithmic flow of smartcn. The HMM approach is said to be rather academic and complicated, but seen this way the main flow is actually quite easy to follow, thanks of course to the ICTCLAS explanation of the underlying principles; reading the code alone would have been much harder. Next I want to study its smoothing formula and analyze its word dictionary (org.apache.lucene.analysis.cn.smart.hhmm.WordDictionary) and the bigram dictionary of word pairs (org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary).