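I have been looking through the nlp tag on SO for the past couple of hours and am fairly confident I have not missed anything, but if I did, please point me to the relevant question.
In the meantime, I will describe what I am trying to do. A common notion I have observed in many posts is that semantic similarity is hard. For instance, in this post, the accepted solution suggests the following: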
First of all, neither from the perspective of computational
linguistics nor of theoretical linguistics is it clear what
the term 'semantic similarity' means exactly. ....
Consider these examples:
1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-4 are similar to 1? 2 is the exact
opposite of 1, still it is about Pete and Rob (not) finding a
dog.
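My high-level requirement is to use k-means clustering to categorize text based on semantic similarity, so all I need to know is whether two sentences are an approximate match. For instance, in the example above, I am fine with putting 1, 2, 4 and 5 into one category and 3 into another (of course, 3 would be backed up by some more sentences similar to it). Something like finding related articles, where they do not have to be 100% related.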
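I think I ultimately need to construct a vector representation of each sentence, something like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?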
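To make the "fingerprint" idea concrete, here is a minimal sketch of the simplest candidate: word n-gram counts compared with cosine similarity. It is plain Python with no NLP library, the helper names `fingerprint` and `cosine` are my own for illustration, and it is only a surface-overlap baseline, not a real semantic measure.

```python
import math
import re
from collections import Counter

# The five example sentences from the quoted answer.
SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]


def fingerprint(sentence, n=1):
    """Return a sparse count vector of word n-grams (hypothetical helper)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Compare every sentence against sentence 1 using unigram counts.
vectors = [fingerprint(s) for s in SENTENCES]
similarity_to_first = [cosine(vectors[0], v) for v in vectors]

for number, score in enumerate(similarity_to_first, start=1):
    print(f"sentence {number}: {score:.3f}")
```

With unigrams, sentence 3 gets the lowest score against sentence 1, which matches the grouping I want, but the negated sentence 2 gets the highest score of the remaining four. That is exactly the weakness the quoted answer points out: word overlap says nothing about meaning.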
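This thread did a fantastic job of enumerating all the related techniques, but unfortunately it stopped just when the post got to what I wanted. Any suggestions on what the current state of the art is in this area?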