热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

Python自然语言处理学习笔记(13):2.5WordNet

原创翻译,如需转载,请与博主联系:yuxcer126.com新手上路,翻译不恰之处,恳请指出,不

原创翻译,如需转载,请与博主联系:yuxcer@126.com

新手上路,翻译不恰之处,恳请指出,不胜感谢

2.5 WordNet

WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus(辞典)but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym(同义词)sets. We’ll begin by looking at synonyms and how they are accessed in WordNet.

Senses and Synonyms 意义和同义词

Consider the sentence in (1a). If we replace the word motorcar in (1a) with automobile, to get (1b), the meaning of the sentence stays pretty much the same:

(1) a. Benz is credited with the invention of the motorcar.

b. Benz is credited with the invention of the automobile.

Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e., they are synonyms.

We can explore these words with the help of WordNet:

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('motorcar')

[Synset('car.n.01')]

Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a synset, or “synonym set,”(同义词集)a collection of synonymous words (or “lemmas”):

>>> wn.synset('car.n.01').lemma_names

['car', 'auto', 'automobile', 'machine', 'motorcar']

Each word of a synset can have several meanings, e.g., car can also signify a train carriage, a gondola(货车), or an elevator car. However, we are only interested in the single meaning that is common to all words of this synset. Synsets also come with a prose(平凡的) definition and some example sentences:

>>> wn.synset('car.n.01').definition

'a motor vehicle with four wheels; usually propelled by an internal combustion engine(内燃机)'

>>> wn.synset('car.n.01').examples

['he needs a car to get to work']

Although definitions help humans to understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a lemma(一个同义词集的单词配对称为词条). We can get all the lemmas for a given synset①, look up a particular lemma②, get the synset corresponding to a lemma③, and get the “name” of a lemma④:

>>> wn.synset('car.n.01').lemmas ①

[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'),

Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]

>>> wn.lemma('car.n.01.automobile') ②

Lemma('car.n.01.automobile')

>>> wn.lemma('car.n.01.automobile').synset ③

Synset('car.n.01')

>>> wn.lemma('car.n.01.automobile').name ④

'automobile'

Unlike the words automobile and motorcar, which are unambiguous and have one synset, the word car is ambiguous, having five synsets:

>>> wn.synsets('car')

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'),

Synset('cable_car.n.01')]

>>> for synset in wn.synsets('car'):

... print synset.lemma_names

...

['car', 'auto', 'automobile', 'machine', 'motorcar']

['car', 'railcar', 'railway_car', 'railroad_car']

['car', 'gondola']

['car', 'elevator_car']

['cable_car', 'car']

For convenience, we can access all the lemmas involving the word car as follows:

>>> wn.lemmas('car')

[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'),

Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

Your Turn: Write down all the senses of the word dish that you can think of. Now, explore this word with the help of WordNet, using the same operations shown earlier.

The WordNet Hierarchy WordNet层次结构

WordNet synsets correspond to abstract concepts(抽象的概念), and they don’t always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are called unique beginnersor root synsets(根同义词集). Others, such as gas guzzler(油老虎) and hatchback(带掀式后背的小轿车), are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2-8.

clip_image002

Figure 2-8. Fragment of WordNet concept hierarchy: Nodes correspond to synsets; edges indicate the hypernym(上位词)/hyponym(下位词) relation, i.e., the relation between superordinate(同hypernym) and subordinate(从属的) concepts

WordNet makes it easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific—the (immediate) hyponyms.

>>> motorcar = wn.synset('car.n.01')

>>> types_of_motorcar = motorcar.hyponyms()

>>> types_of_motorcar[26]

Synset('ambulance.n.01')

>>> sorted([lemma.name for synset in types_of_motorcar for lemma in synset.lemmas])

['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon',

'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible',

'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car',

'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap',

'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover',

'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car',

'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer',

'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan',

'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car',

'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car',

'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon',

'wagon']

We can also navigate up(浏览) the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container.

>>> motorcar.hypernyms()

[Synset('motor_vehicle.n.01')]

>>> paths = motorcar.hypernym_paths()

>>> len(paths)

2

>>> [synset.name for synset in paths[0]]

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',

'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01',

'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

>>> [synset.name for synset in paths[1]]

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',

'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01',

'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

We can get the most general hypernyms (or root hypernyms) of a synset as follows:

>>> motorcar.root_hypernyms()

[Synset('entity.n.01')]

Your Turn:Try out NLTK’s convenient graphical WordNet browser: nltk.app.wordnet(). Explore the WordNet hierarchy by following the hypernym and hyponym links.

More Lexical Relations 更多词汇关系

Hypernyms and hyponyms are called lexical relations(词汇关系) because they relate one synset to another. These two relations navigate up and down the “is-a” hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms 部分) or to the things they are contained in (holonyms整体). For example, the parts of a tree are its trunk(树干), crown(树冠), and so on; these are the part_meronyms(). The substance(实质) a tree is made of includes heartwood(心材) and sapwood(边材), i.e., the substance_meronyms(). A collection of trees forms a forest, i.e., the member_holonyms():

>>> wn.synset('tree.n.01').part_meronyms()

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('stump.n.01'),

Synset('trunk.n.01'), Synset('limb.n.02')]

>>> wn.synset('tree.n.01').substance_meronyms()

[Synset('heartwood.n.01'), Synset('sapwood.n.01')]

>>> wn.synset('tree.n.01').member_holonyms()

[Synset('forest.n.01')]

To see just how intricate(复杂的) things can get, consider the word mint, which has several closely related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

>>> for synset in wn.synsets('mint', wn.NOUN):

... print synset.name + ':', synset.definition

...

batch.n.02: (often followed by `of') a large number or amount or extent

mint.n.02: any north temperate(北温带)plant of the genus Mentha with aromatic leaves and

small mauve(淡紫色) flowers

mint.n.03: any member of the mint family of plants

mint.n.04: the leaves of a mint plant used fresh or candied

mint.n.05: a candy(糖果) that is flavored(加味)with a mint oil

mint.n.06: a plant where money is coined by authority of the government

>>> wn.synset('mint.n.04').part_holonyms()

[Synset('mint.n.02')]

>>> wn.synset('mint.n.04').substance_holonyms()

[Synset('mint.n.05')]

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails(蕴涵) stepping. Some verbs have multiple entailments:

>>> wn.synset('walk.v.01').entailments()

[Synset('step.v.01')]

>>> wn.synset('eat.v.01').entailments()

[Synset('swallow.v.01'), Synset('chew.v.01')]

>>> wn.synset('tease.v.03').entailments()

[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Some lexical relationships hold between lemmas, e.g., antonymy(反义词组):

>>> wn.lemma('supply.n.02.supply').antonyms()

[Lemma('demand.n.02.demand')]

>>> wn.lemma('rush.v.01.rush').antonyms()

[Lemma('linger.v.04.linger')]

>>> wn.lemma('horizontal.a.01.horizontal').antonyms()

[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]

>>> wn.lemma('staccato.r.01.staccato').antonyms() 不连续的

[Lemma('legato.r.01.legato')] 连奏的

You can see the lexical relations, and the other methods defined on a synset, using

dir(). For example, try dir(wn.synset('harmony.n.02')).

Semantic Similarity 语义相似度

We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term such as vehicle will match documents containing specific terms such as limousine(豪华轿车).

Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common(共同之处) (see Figure 2-8). If two synsets share a very specific hypernym—one that is low down(实情?) in the hypernym hierarchy—they must be closely related.

>>> right = wn.synset('right_whale.n.01')

>>> orca = wn.synset('orca.n.01')

>>> minke = wn.synset('minke_whale.n.01')

>>> tortoise = wn.synset('tortoise.n.01')

>>> novel = wn.synset('novel.n.01')

>>> right.lowest_common_hypernyms(minke)

[Synset('baleen_whale.n.01')]

>>> right.lowest_common_hypernyms(orca)

[Synset('whale.n.02')]

>>> right.lowest_common_hypernyms(tortoise)

[Synset('vertebrate.n.01')]

>>> right.lowest_common_hypernyms(novel)

[Synset('entity.n.01')]

Of course we know that whale is very specific (and baleen whale even more so), whereas vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

>>> wn.synset('baleen_whale.n.01').min_depth()

14

>>> wn.synset('whale.n.02').min_depth()

13

>>> wn.synset('vertebrate.n.01').min_depth()

8

>>> wn.synset('entity.n.01').min_depth()

0

Similarity measures have been defined over the collection of WordNet synsets that incorporate this insight. For example, path_similarityassigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found 没有路径就返回-1). Comparing a synset with itself will return 1(与自己比较返回1). Consider the following similarity scores, relating right whale(露脊鲸) to minke whale(小须鲸

), orca(逆戟鲸), tortoise, and novel. Although the numbers won’t mean much, they decrease as we move away from the semantic space(语义空间) of sea creatures to inanimate objects(静物).

>>> right.path_similarity(minke)

0.25

>>> right.path_similarity(orca)

0.16666666666666666

>>> right.path_similarity(tortoise)

0.076923076923076927

>>> right.path_similarity(novel)

0.043478260869565216

Several other similarity measures are available; you can type help(wn) for more information. NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with nltk.corpus.verb net.

转:https://www.cnblogs.com/yuxc/archive/2011/07/27/2118655.html



推荐阅读
  • Iamtryingtomakeaclassthatwillreadatextfileofnamesintoanarray,thenreturnthatarra ... [详细]
  • 本文介绍了Python爬虫技术基础篇面向对象高级编程(中)中的多重继承概念。通过继承,子类可以扩展父类的功能。文章以动物类层次的设计为例,讨论了按照不同分类方式设计类层次的复杂性和多重继承的优势。最后给出了哺乳动物和鸟类的设计示例,以及能跑、能飞、宠物类和非宠物类的增加对类数量的影响。 ... [详细]
  • JDK源码学习之HashTable(附带面试题)的学习笔记
    本文介绍了JDK源码学习之HashTable(附带面试题)的学习笔记,包括HashTable的定义、数据类型、与HashMap的关系和区别。文章提供了干货,并附带了其他相关主题的学习笔记。 ... [详细]
  • 微软头条实习生分享深度学习自学指南
    本文介绍了一位微软头条实习生自学深度学习的经验分享,包括学习资源推荐、重要基础知识的学习要点等。作者强调了学好Python和数学基础的重要性,并提供了一些建议。 ... [详细]
  • 本文介绍了设计师伊振华受邀参与沈阳市智慧城市运行管理中心项目的整体设计,并以数字赋能和创新驱动高质量发展的理念,建设了集成、智慧、高效的一体化城市综合管理平台,促进了城市的数字化转型。该中心被称为当代城市的智能心脏,为沈阳市的智慧城市建设做出了重要贡献。 ... [详细]
  • Java容器中的compareto方法排序原理解析
    本文从源码解析Java容器中的compareto方法的排序原理,讲解了在使用数组存储数据时的限制以及存储效率的问题。同时提到了Redis的五大数据结构和list、set等知识点,回忆了作者大学时代的Java学习经历。文章以作者做的思维导图作为目录,展示了整个讲解过程。 ... [详细]
  • 本文介绍了OC学习笔记中的@property和@synthesize,包括属性的定义和合成的使用方法。通过示例代码详细讲解了@property和@synthesize的作用和用法。 ... [详细]
  • 本文讨论了一个关于cuowu类的问题,作者在使用cuowu类时遇到了错误提示和使用AdjustmentListener的问题。文章提供了16个解决方案,并给出了两个可能导致错误的原因。 ... [详细]
  • ZSI.generate.Wsdl2PythonError: unsupported local simpleType restriction ... [详细]
  • Go语言实现堆排序的详细教程
    本文主要介绍了Go语言实现堆排序的详细教程,包括大根堆的定义和完全二叉树的概念。通过图解和算法描述,详细介绍了堆排序的实现过程。堆排序是一种效率很高的排序算法,时间复杂度为O(nlgn)。阅读本文大约需要15分钟。 ... [详细]
  • 第四章高阶函数(参数传递、高阶函数、lambda表达式)(python进阶)的讲解和应用
    本文主要讲解了第四章高阶函数(参数传递、高阶函数、lambda表达式)的相关知识,包括函数参数传递机制和赋值机制、引用传递的概念和应用、默认参数的定义和使用等内容。同时介绍了高阶函数和lambda表达式的概念,并给出了一些实例代码进行演示。对于想要进一步提升python编程能力的读者来说,本文将是一个不错的学习资料。 ... [详细]
  • 先看官方文档TheJavaTutorialshavebeenwrittenforJDK8.Examplesandpracticesdescribedinthispagedontta ... [详细]
  • 本文介绍了Codeforces Round #321 (Div. 2)比赛中的问题Kefa and Dishes,通过状压和spfa算法解决了这个问题。给定一个有向图,求在不超过m步的情况下,能获得的最大权值和。点不能重复走。文章详细介绍了问题的题意、解题思路和代码实现。 ... [详细]
  • Iamtryingtocreateanarrayofstructinstanceslikethis:我试图创建一个这样的struct实例数组:letinstallers: ... [详细]
  • Java编程实现邻接矩阵表示稠密图的方法及实现类介绍
    本文介绍了Java编程如何实现邻接矩阵表示稠密图的方法,通过一个名为AMWGraph.java的类来构造邻接矩阵表示的图,并提供了插入结点、插入边、获取邻接结点等功能。通过使用二维数组来表示结点之间的关系,并通过元素的值来表示权值的大小,实现了稠密图的表示和操作。对于对稠密图的表示和操作感兴趣的读者可以参考本文。 ... [详细]
author-avatar
仇顺嵐
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有