Word Similarity Computation Based on the Extended Tongyici Cilin (同义词词林扩展版)

Word similarity computation

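Word-sense similarity computation is widely used in many fields, for example information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation. In China, word similarity is currently computed mainly with HowNet (知网) and Tongyici Cilin (同义词词林).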

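This article implements the layered algorithm proposed in Tian Jiule's paper 《基于同义词词林的词语相似度计算方法》 (A Word Similarity Computation Method Based on Tongyici Cilin). The program is written in Java.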

Tongyici Cilin, extended edition

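Tongyici Cilin (《同义词词林》) was compiled by Mei Jiaju et al. in 1983. The dictionary contains not only the synonyms of a word but also a certain number of same-class words, that is, related words in a broad sense. The extended edition (《同义词词林扩展版》) is a revision by the Information Retrieval Laboratory of Harbin Institute of Technology. It contains nearly 70,000 entries, all arranged by meaning, which makes it a thesaurus of synonym classes.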

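Cilin organizes all entries into a tree-like hierarchy. The vocabulary is divided into major, medium, and minor classes: 12 major classes, 97 medium classes, and 1,400 minor classes. Each minor class contains many words, which are further divided into word groups (paragraphs) according to how close and how related their meanings are. The words in each paragraph are then split into lines; the words on one line either have the same meaning or are strongly related in meaning.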

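Cilin provides a five-level code: level 1 is an uppercase English letter; level 2 is a lowercase English letter; level 3 is a two-digit decimal integer; level 4 is an uppercase English letter; level 5 is a two-digit decimal integer. For example:
Cb30A01= 这里 这边 此地 此间 此处 此
Cb30A02# 该地 该镇 该乡 该站 该区 该市 该村
Cb30A03@ 这方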
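
As a rough illustration of this code layout, the following minimal sketch splits an 8-character code such as Cb30A01= into its five levels and the trailing flag. The class name `CilinCodeSketch` and the method `split` are hypothetical names made up for this example and are not part of the implementation later in this article; the sketch assumes every code is exactly 8 characters long.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the article's WordSimilarity class.
class CilinCodeSketch {

    /**
     * Splits an 8-character Cilin code into {level1, level2, level3, level4, level5, flag}.
     * Assumes the layout 1 + 1 + 2 + 1 + 2 characters followed by one flag character.
     */
    static String[] split(String code) {
        if (code == null || code.length() != 8) {
            throw new IllegalArgumentException("expected an 8-character code, got: " + code);
        }
        return new String[] {
            code.substring(0, 1), // level 1: uppercase letter
            code.substring(1, 2), // level 2: lowercase letter
            code.substring(2, 4), // level 3: two digits
            code.substring(4, 5), // level 4: uppercase letter
            code.substring(5, 7), // level 5: two digits
            code.substring(7, 8)  // flag: "=", "#" or "@"
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("Cb30A01="))); // prints [C, b, 30, A, 01, =]
    }
}
```

Returning a plain array keeps the sketch short; a real implementation might prefer a small value class with named fields.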

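The levels and the layout of the code are summarized in the following table:

Code position    1              2               3-4            5             6-7     8
Example          C              b               30             A             01      = / # / @
Meaning          major class    medium class    minor class    word group    line    flag
Level            1              2               3              4             5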

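Because a level-5 line may consist of synonyms, of related words, or of a single word, the classification needs one more distinction, with three possible cases. A special symbol marks each case, so the flag in position 8 takes one of three values: "=" means "equal", i.e. synonymous; "#" means "unequal" but "same class", i.e. related words; "@" means "self-contained" or "independent", i.e. a word that has neither synonyms nor related words in the dictionary.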
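
The three flags can be modeled as a small enum. This is only a sketch: the names `CilinFlag`, `SYNONYM`, `RELATED`, `INDEPENDENT`, and `ofCode` are invented here for illustration and do not appear in the implementation later in this article.

```java
// Hypothetical enum for illustration; the names are not taken from the article's code.
enum CilinFlag {
    SYNONYM('='),     // words on the line are synonyms
    RELATED('#'),     // words on the line are related (same class) but not synonyms
    INDEPENDENT('@'); // the word has neither synonyms nor related words

    private final char symbol;

    CilinFlag(char symbol) {
        this.symbol = symbol;
    }

    /** Returns the flag encoded by the last character of a Cilin code. */
    static CilinFlag ofCode(String code) {
        char last = code.charAt(code.length() - 1);
        for (CilinFlag flag : values()) {
            if (flag.symbol == last) {
                return flag;
            }
        }
        throw new IllegalArgumentException("unknown flag in code: " + code);
    }

    public static void main(String[] args) {
        System.out.println(ofCode("Cb30A01=")); // SYNONYM
        System.out.println(ofCode("Cb30A02#")); // RELATED
        System.out.println(ofCode("Cb30A03@")); // INDEPENDENT
    }
}
```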

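To compute similarity, we use the formula given in the paper. The similarity of two senses A and B is written Sim(A, B).
If the two senses are not in the same tree: $\mathrm{Sim}(A,B) = f$
If the two senses are in the same tree:
If they branch at level 2, the coefficient is a: $\mathrm{Sim}(A,B) = 1 \times a \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}$
If they branch at level 3, the coefficient is b: $\mathrm{Sim}(A,B) = 1 \times 1 \times b \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}$
If they branch at level 4, the coefficient is c: $\mathrm{Sim}(A,B) = 1 \times 1 \times 1 \times c \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}$
If they branch at level 5, the coefficient is d: $\mathrm{Sim}(A,B) = 1 \times 1 \times 1 \times 1 \times d \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}$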
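
The branch formulas differ only in the coefficient, so they can be sketched as a single function. The class and method names (`SimFormulaSketch`, `sim`) are hypothetical and are not taken from the implementation later in this article; `coefficient` stands for a, b, c, or d, depending on the level at which the two codes branch, and the inputs in `main` are purely illustrative.

```java
// Hypothetical sketch of the branch formula; not the article's WordSimilarity class.
class SimFormulaSketch {

    /**
     * Sim = coefficient * cos(n * pi / 180) * (n - k + 1) / n
     *
     * @param coefficient a, b, c or d, chosen by the level at which the codes branch
     * @param n           number of branches at the branching level (must be positive)
     * @param k           distance between the two branches
     */
    static double sim(double coefficient, int n, int k) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // Cast to double so that (n - k + 1) / n is not truncated by integer division.
        return coefficient * Math.cos(n * Math.PI / 180) * ((double) (n - k + 1) / n);
    }

    public static void main(String[] args) {
        // Illustrative inputs: coefficient 0.65 with n = 14 branches and distance k = 1.
        System.out.println(sim(0.65, 14, 1));
    }
}
```

Taking n and k as integers and casting inside the function is one way to avoid accidental integer division; declaring them as `double` at the call site works equally well.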

When the two codes are identical and the flag at the end is "#", the similarity is taken to be e.
For example, for Ad02B04# 非洲人 亚洲人 the similarity is e.

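Here n is the total number of branches at the branching level, and k is the distance between the two branches.
For example, take 人 (Aa01A01=) and 少儿 (Ab04B01=).
There are only 14 sub-branches that start with A, namely Aa* through An*; n counts these sub-branches, not all the nodes whose code starts with A. At level 2 the code of 人 is a and that of 少儿 is b, so k = 1.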
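
To make the definitions of n and k concrete, here is a small sketch that counts the distinct sub-branches under a common prefix and measures the distance between two level-2 letters. The code list is a made-up toy sample, not the real Cilin data, so it yields n = 3 rather than 14; the class and method names are hypothetical as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch on toy data; not the article's WordSimilarity class.
class BranchCountSketch {

    /** Counts the distinct child branches of prefix; a child is identified by the first childEnd characters. */
    static int countBranches(List<String> codes, String prefix, int childEnd) {
        Set<String> children = new HashSet<>();
        for (String code : codes) {
            if (code.startsWith(prefix)) {
                children.add(code.substring(0, childEnd));
            }
        }
        return children.size();
    }

    /** Distance between two branches labeled by single letters, e.g. 'a' and 'b' give 1. */
    static int letterDistance(char first, char second) {
        return Math.abs(first - second);
    }

    public static void main(String[] args) {
        // Toy sample: four codes under A, but only three distinct level-2 branches (Aa, Ab, Ac).
        List<String> codes = Arrays.asList("Aa01A01=", "Aa01A02=", "Ab04B01=", "Ac03A01=", "Ba01A01=");
        System.out.println(countBranches(codes, "A", 2)); // 3
        System.out.println(letterDistance('a', 'b'));     // 1
    }
}
```

The set collapses codes that share the same child prefix, which is why the two Aa codes count as one branch.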

The parameter values given in the paper are a = 0.65, b = 0.8, c = 0.9, d = 0.96, e = 0.5, f = 0.1.

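Continuing the example of 人 (Aa01A01=) and 少儿 (Ab04B01=): there are 14 branches that start with A, so n = 14; at level 2 the code of 人 is a and that of 少儿 is b, so k = 1. The two senses branch at level 2, so $\mathrm{Sim} = 0.65 \times \cos\left(14 \times \frac{\pi}{180}\right) \times \frac{14-1+1}{14} \approx 0.6307$.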

Java implementation

package edu.shu.similarity.cilin;

import com.google.common.base.Preconditions;
import lombok.extern.log4j.Log4j2;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.junit.Test;

import java.io.IOException;
import java.io.InputStream;
import java.util.*;

import static java.lang.Math.PI;
import static java.lang.Math.cos;

/**
 * Created with IntelliJ IDEA. 2015/8/2 21:54
 *
 * ClassName: WordSimilarity
 *
 * Description:
 * "=" means equal, i.e. synonymous
 * "#" means unequal but same class, i.e. related words
 * "@" means self-contained / independent: the word has neither synonyms nor related words in the dictionary
 *
 * @author Wang Xu
 * @version V1.0.0
 * @since V1.0.0
 */
@Log4j2
// Note: the @Log4j2 annotation requires log4j 2.x in the pom; the @Log4j annotation requires log4j 1.x.
// The configuration file must match as well: log4j2.xml for 2.x, log4j.xml for 1.x.
public class WordSimilarity {
    /**
     * When we put one of Lombok's logging annotations, such as {@code @Log4j2}, on a class
     *
     * public class LogExample {
     * }
     *
     * Lombok generates:
     * public class LogExample {
     * private static final org.apache.logging.log4j.Logger log = org.apache.logging.log4j.LogManager.getLogger(LogExample.class);
     * }
     */

    // Constants: the parameter values given in the paper
    private final double a = 0.65;
    private final double b = 0.8;
    private final double c = 0.9;
    private final double d = 0.96;
    private final double e = 0.5;
    private final double f = 0.1;

    private final double degrees = 180;


    // Maps each word to the list of its codes; one word may have several codes
    private static Map<String, ArrayList<String>> wordsEncode = new HashMap<String, ArrayList<String>>();
    // Maps each code to the list of its words; one code may cover several words
    private static Map<String, ArrayList<String>> encodeWords = new HashMap<String, ArrayList<String>>();

    /**
     * Reads Tongyici Cilin and fills wordsEncode and encodeWords.
     */
    private static void readCiLin() {
        InputStream input = WordSimilarity.class.getClassLoader().getResourceAsStream("cilin.txt");
        List<String> contents = null;
        try {
            contents = IOUtils.readLines(input);
        } catch (IOException e) {
            e.printStackTrace();
            log.error(e.getMessage());
        }
        if (contents == null) {
            return; // reading failed, so there is nothing to load
        }
        for (String content : contents) {
            content = Preconditions.checkNotNull(content);
            String[] strs = content.split(" ");
            int length = strs.length;
            if (length <= 1) {
                continue; // skip blank lines and lines that carry no word
            }
            String encode = strs[0]; // the code
            ArrayList<String> encodeWords_values = new ArrayList<String>();
            for (int i = 1; i < length; i++) {
                encodeWords_values.add(strs[i]);
            }
            encodeWords.put(encode, encodeWords_values); // key: the code, value: all words that follow it
            for (int i = 1; i < length; i++) {
                String key = strs[i];
                if (wordsEncode.containsKey(key)) {
                    // The list is stored by reference, so adding to it updates the map entry
                    wordsEncode.get(key).add(encode);
                } else {
                    ArrayList<String> temp = new ArrayList<String>();
                    temp.add(encode);
                    wordsEncode.put(key, temp); // key: the word, value: all of its codes
                }
            }
        }
    }

    /**
     * Public entry point: returns the similarity of two words.
     *
     * @param word1 the first word
     * @param word2 the second word
     * @return the similarity value
     */
    public double getSimilarity(String word1, String word2) {
        // If either word does not appear in Cilin, the similarity is 0
        if (!wordsEncode.containsKey(word1) || !wordsEncode.containsKey(word2)) {
            return 0;
        }
        // Codes of the first word
        ArrayList<String> encode1 = getEncode(word1);
        // Codes of the second word
        ArrayList<String> encode2 = getEncode(word2);

        double maxValue = 0; // final result: the largest similarity over all code pairs
        for (String e1 : encode1) {
            for (String e2 : encode2) {
                log.info(e1);
                log.info(e2);
                String commonStr = getCommonStr(e1, e2);
                int length = StringUtils.length(commonStr);
                double k = getK(e1, e2);
                double n = getN(commonStr);
                log.info("k--" + k);
                log.info("n--" + n);
                log.info("length--" + length);
                double res = 0;
                // A code ending in "@" is self-contained, and an empty common prefix means different trees: use f
                if (e1.endsWith("@") || e2.endsWith("@") || 0 == length) {
                    if (f > maxValue) {
                        maxValue = f;
                    }
                    continue;
                }
                if (1 == length) {
                    // The codes branch at level 2
                    res = a * cos(n * PI / degrees) * ((n - k + 1) / n);
                } else if (2 == length) {
                    // The codes branch at level 3
                    res = b * cos(n * PI / degrees) * ((n - k + 1) / n);
                } else if (4 == length) {
                    // The codes branch at level 4
                    res = c * cos(n * PI / degrees) * ((n - k + 1) / n);
                } else if (5 == length) {
                    // The codes branch at level 5
                    res = d * cos(n * PI / degrees) * ((n - k + 1) / n);
                } else {
                    // No two codes share the first seven characters but differ in the flag,
                    // so in this branch all 8 characters are equal and only the flag needs checking
                    if (e1.endsWith("=") && e2.endsWith("=")) {
                        // Identical codes flagged as synonyms
                        res = 1;
                    } else if (e1.endsWith("#") && e2.endsWith("#")) {
                        // Identical codes flagged "#": related words
                        res = e;
                    }
                }
                log.info("res :" + res);
                if (res > maxValue) {
                    maxValue = res;
                }
            }
        }
        return maxValue;
    }

    /**
     * Checks whether a word is self-contained (independent) in Cilin.
     *
     * @param source the word to check
     * @return true if any code of the word ends with "@"
     */
    private boolean isIndependent(String source) {
        Iterator<String> iter = wordsEncode.keySet().iterator();
        while (iter.hasNext()) {
            String key = iter.next();
            if (StringUtils.equalsIgnoreCase(key, source)) {
                ArrayList<String> values = wordsEncode.get(key);
                for (String value : values) {
                    if (value.endsWith("@")) {
                        return true;
                    }
                }
            }

        }
        return false;
    }

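    /**
     * Returns the codes that belong to a word.
     *
     * @param word the word to look up
     * @return the list of codes of the word, or null if the word is not in Cilin
     */
    public ArrayList<String> getEncode(String word) {
        return wordsEncode.get(word);
    }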

    /**
     * Computes N, the number of branches at the branching level. For example, for
     * 人 Aa01A01= and 少儿 Ab04B01= there are only 14 sub-branches that start with A.
     * The paper is very unclear on this point, which made it painful to implement.
     *
     * @param encodeHead the common prefix of the two codes
     * @return the computed value of N
     */
    public int getN(String encodeHead) {
        int length = StringUtils.length(encodeHead);
        switch (length) {
            case 1:
                return getCount(encodeHead, 2);
            case 2:
                return getCount(encodeHead, 4);
            case 4:
                return getCount(encodeHead, 5);
            case 5:
                return getCount(encodeHead, 7);
            default:
                return 0;
        }
    }

    // Counts the distinct prefixes of length end among all codes that start with encodeHead
    public int getCount(String encodeHead, int end) {
        Set<String> res = new HashSet<String>();
        Iterator<String> iter = encodeWords.keySet().iterator();
        while (iter.hasNext()) {
            String curr = iter.next();
            if (curr.startsWith(encodeHead)) {
                // A set ignores duplicates, so no contains() check is needed
                res.add(curr.substring(0, end));
            }
        }
        return res.size();
    }

    /**
     * @param encode1 the first code
     * @param encode2 the second code
     * @return the distance between the branches of the two codes, denoted k
     */
    public int getK(String encode1, String encode2) {
        // Level 1: one uppercase letter
        String temp1 = encode1.substring(0, 1);
        String temp2 = encode2.substring(0, 1);
        if (StringUtils.equalsIgnoreCase(temp1, temp2)) {
            // Level 2: one lowercase letter
            temp1 = encode1.substring(1, 2);
            temp2 = encode2.substring(1, 2);
        } else {
            return Math.abs(temp1.charAt(0) - temp2.charAt(0));
        }
        if (StringUtils.equalsIgnoreCase(temp1, temp2)) {
            // Level 3: two digits
            temp1 = encode1.substring(2, 4);
            temp2 = encode2.substring(2, 4);
        } else {
            return Math.abs(temp1.charAt(0) - temp2.charAt(0));
        }
        if (StringUtils.equalsIgnoreCase(temp1, temp2)) {
            // Level 4: one uppercase letter
            temp1 = encode1.substring(4, 5);
            temp2 = encode2.substring(4, 5);
        } else {
            return Math.abs(Integer.valueOf(temp1) - Integer.valueOf(temp2));
        }
        if (StringUtils.equalsIgnoreCase(temp1, temp2)) {
            // Level 5: two digits
            temp1 = encode1.substring(5, 7);
            temp2 = encode2.substring(5, 7);
        } else {
            return Math.abs(temp1.charAt(0) - temp2.charAt(0));
        }
        return Math.abs(Integer.valueOf(temp1) - Integer.valueOf(temp2));
    }

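    /**
     * Returns the common prefix of two codes.
     *
     * @param encode1 the first code
     * @param encode2 the second code
     * @return the common prefix, cut back to a level boundary
     */
    public String getCommonStr(String encode1, String encode2) {
        // Compare only up to the shorter length so that charAt never runs past the end of a code
        int length = Math.min(StringUtils.length(encode1), StringUtils.length(encode2));
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < length; i++) {
            if (encode1.charAt(i) == encode2.charAt(i)) {
                sb.append(encode1.charAt(i));
            } else {
                break;
            }
        }
        int sbLen = StringUtils.length(sb);
        // Levels 3 and 5 each take two characters, so a prefix of length 3 or 6 ends
        // in the middle of a level; drop the last character to get back to a level boundary
        if (sbLen == 3 || sbLen == 6) {
            sb.deleteCharAt(sbLen - 1);
        }

        return String.valueOf(sb);
    }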

    @Test
    public void testGetN() {
        readCiLin();
        int a = getN("A");
        System.out.println(a);
    }

    @Test
    public void testGetK() {
        int k = getK("Aa01A01=", "Aa01A01=");
        System.out.println(k);
    }

    @Test
    public void testGetCommonStr() {
        String commonStr = getCommonStr("Aa01A01=", "Aa01A03=");
        System.out.println(commonStr);
    }

    @Test
    public void testGetSimilarity() {
        readCiLin();
        double similarity = getSimilarity("人民", "国民");
        System.out.println("人民--" + "国民:" + similarity);
        similarity = getSimilarity("人民", "群众");
        System.out.println("人民--" + "群众:" + similarity);
        similarity = getSimilarity("人民", "党群");
        System.out.println("人民--" + "党群:" + similarity);
        similarity = getSimilarity("人民", "良民");
        System.out.println("人民--" + "良民:" + similarity);
        similarity = getSimilarity("人民", "同志");
        System.out.println("人民--" + "同志:" + similarity);
        similarity = getSimilarity("人民", "成年人");
        System.out.println("人民--" + "成年人:" + similarity);
        similarity = getSimilarity("人民", "市民");
        System.out.println("人民--" + "市民:" + similarity);
        similarity = getSimilarity("人民", "亲属");
        System.out.println("人民--" + "亲属:" + similarity);
        similarity = getSimilarity("人民", "志愿者");
        System.out.println("人民--" + "志愿者:" + similarity);
        similarity = getSimilarity("人民", "先锋");
        System.out.println("人民--" + "先锋:" + similarity);
    }

    @Test
    public void testGetSimilarity2() {
        readCiLin();
        double similarity = getSimilarity("非洲人", "亚洲人");
        System.out.println(similarity);
        double similarity1 = getSimilarity("骄傲", "仔细");
        System.out.println(similarity1);
    }

}

Note: word similarity is a number, usually in the range [0, 1]. In the original paper, the cos function mainly serves to normalize the value into [0, 1]; it can be regarded as an adjusting factor.

The output of testGetSimilarity is shown below:

人民--国民:1.0
人民--群众:0.9576614882494312
人民--党群:0.8978076452338418
人民--良民:0.7182461161870735
人民--同志:0.6630145969121822
人民--成年人:0.6306922220793977
人民--市民:0.5405933332109123
人民--亲属:0.36039555547394153
人民--志愿者:0.22524722217121346
人民--先锋:0.18019777773697077