点互信息在自然语言处理中的应用与优化

作者：小王儿 | 来源：互联网 | 2024-11-02 16:01

点互信息（PointwiseMutualInformation,PMI）是一种用于评估两个事件之间关联强度的统计量，在自然语言处理领域具有广泛应用。本文探讨了PMI在词共现分析、语义关系提取和情感分析等任务中的具体应用，并提出了几种优化方法，以提高其在大规模数据集上的计算效率和准确性。通过实验验证，这些优化策略显著提升了模型的性能。

点互信息

Pointwise mutual information (PMI), or point mutual information, is a measure of association used in information theory andstatistics.

The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence.

来自

The mutual information (MI) of the random variables X and Y is the expected value of the PMI over all possible outcomes (w.r.t. the joint distribution

来自

http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/philip-pmi.pdf

Information-theory approach to find

collocations

– Measure of how much one word tells us about the

other. How much information we gain

– Can be negative or positive

Problems with PMI

• Bad with sparse data

– Suppose some words only occur once, but appear

together

– Get very high score PMI score

– Consider our word clouds. High PMI score might

not necessarily indicate importance of bigram

来自

点互信息由互信息而来

来自

Finally,

will increase if

is fixed but

decreases.

这就是一个不好的地方如果联系紧密必然一同出现 p(x|y) 那么取决于p(x)的值大小越不常见的x 值越大假设 p(y|x)&＃61;1 完全相同共现就就取决于变量的出现频度了只出现一次分数最高偏爱稀有低频情况

Bad with word dependence

– Suppose two words are perfectly dependent on

eachother

– Whenever one occurs, the other occurs

– I(x, y) &＃61; log (1 / P(y))

– So the rarer the word is, the higher the PMI is

– High PMI score doesn&＃39;t mean high word

dependence (could just mean rarer words)

– Threshold on word frequencies

来自

可以看做局部一个点的互信息

考虑互信息

来自

It can take positive or negative values, but is zero if X and Y areindependent. PMI maximizes when X and Y are perfectly associated, yielding the following bounds:

来自

例子

x	y	p(x, y)
0	0	0.1
0	1	0.7
1	0	0.15
1	1	0.05

Using this table we can marginalize to get the following additional table for the individual distributions:

	p(x)	p(y)
0	.8	0.25
1	.2	0.75

With this example, we can compute four values for

. Using base-2 logarithms:

pmi(x&＃61;0;y&＃61;0)	−1
pmi(x&＃61;0;y&＃61;1)	0.222392421
pmi(x&＃61;1;y&＃61;0)	1.584962501
pmi(x&＃61;1;y&＃61;1)	−1.584962501

(For reference, the mutual information

would then be 0.214170945)

来自

和互信息的相似处

Where

is the self-information, or

来自

正规化的pmi npmi

Pointwise mutual information can be normalized between [-1,&＃43;1] resulting in -1 (in the limit) for never occurring together, 0 for independence, and &＃43;1 for complete co-occurrence.

完全共现的时候可以认为 p(x,y) &＃61; p(x)&＃61;p(y) 结合

来自

Chain-rule for pmi

来自

没太明白这个TODO

This is easily proven by:

来自

推荐阅读

list
Transforming the Future of Virtual Worlds

Explore how Matterverse is redefining the metaverse experience, creating immersive and meaningful virtual environments that foster genuine connections and economic opportunities. ... [详细]

蜡笔小新 2024-12-28 09:44:49
text
Python 的 10 个开发技巧！太实用了

1.如何在运行状态查看源代码？查看函数的源代码，我们通常会使用IDE来完成。比如在PyCharm中，你可以Ctrl+鼠标点击进入函数的源代码。那如果没有IDE呢？当我们想使用一个函 ... [详细]

蜡笔小新 2024-12-27 18:36:54
text
使用 SQLiteJDBC 和 HikariCP 实现 Java 程序连接 SQLite 数据库

本文介绍了如何通过 Maven 依赖引入 SQLiteJDBC 和 HikariCP 包，从而在 Java 应用中高效地连接和操作 SQLite 数据库。文章提供了详细的代码示例，并解释了每个步骤的实现细节。 ... [详细]

蜡笔小新 2024-12-26 17:34:42
text
解析JSON格式文本并处理数据

本文介绍如何使用阿里云的fastjson库解析包含时间戳、IP地址和参数等信息的JSON格式文本，并进行数据处理和保存。 ... [详细]

蜡笔小新 2024-12-26 16:06:09
list
深入理解org.neo4j.helpers.collection.Iterators.single()方法及其应用

本文详细介绍了Java中org.neo4j.helpers.collection.Iterators.single()方法的功能、使用场景及代码示例，帮助开发者更好地理解和应用该方法。 ... [详细]

蜡笔小新 2024-12-28 10:51:55
list
Handling Null Object Encoding in OAuth 1.0a API Implementation

Explore a common issue encountered when implementing an OAuth 1.0a API, specifically the inability to encode null objects and how to resolve it. ... [详细]

蜡笔小新 2024-12-28 08:54:34
int
Python配置文件读写指南

本文详细介绍如何使用Python进行配置文件的读写操作，涵盖常见的配置文件格式（如INI、JSON、TOML和YAML），并提供具体的代码示例。 ... [详细]

蜡笔小新 2024-12-28 08:39:55
text
技术分享：从动态网站提取站点密钥的解决方案

本文探讨了如何从动态网站中提取站点密钥，特别是针对验证码（reCAPTCHA）的处理方法。通过结合Selenium和requests库，提供了详细的代码示例和优化建议。 ... [详细]

蜡笔小新 2024-12-28 04:11:47
text
java编写的简易计算器

主要用了2个类来实现的，话不多说，直接看运行结果，然后在奉上源代码1.Index.javaimportjava.awt.Color;im ... [详细]

蜡笔小新 2024-12-27 18:18:10
list
Java 序列化接口详解

本文深入探讨了 Java 中的 Serializable 接口，解释了其实现机制、用途及注意事项，帮助开发者更好地理解和使用序列化功能。 ... [详细]

蜡笔小新 2024-12-27 15:06:12
int
ImmutableX Poised to Pioneer Web3 Gaming Revolution

ImmutableX is set to spearhead the evolution of Web3 gaming, with its innovative technologies and strategic partnerships driving significant advancements in the industry. ... [详细]

蜡笔小新 2024-12-27 08:55:17
list
深入理解Python的os和sys模块

本文详细解析了Python中的os和sys模块，介绍了它们的功能、常用方法及其在实际编程中的应用。 ... [详细]

蜡笔小新 2024-12-26 22:04:19
int
macOS系统及其关键功能解析

本文详细介绍了macOS系统的核心组件，包括如何管理其安全特性——系统完整性保护（SIP），并探讨了不同版本的更新亮点。对于使用macOS系统的用户来说，了解这些信息有助于更好地管理和优化系统性能。 ... [详细]

蜡笔小新 2024-12-26 18:05:04
int
Python学习笔记：使用pydoc工具查询文档

本文介绍了在Windows环境下使用pydoc工具的方法，并详细解释了如何通过命令行和浏览器查看Python内置函数的文档。此外，还提供了关于raw_input和open函数的具体用法和功能说明。 ... [详细]

蜡笔小新 2024-12-26 17:05:56
list
Python 爬虫基础教程及代码实例

根据最新发布的《互联网人才趋势报告》，尽管大量IT从业者已转向Python开发，但随着人工智能和大数据领域的迅猛发展，仍存在巨大的人才缺口。本文将详细介绍如何使用Python编写一个简单的爬虫程序，并提供完整的代码示例。 ... [详细]

蜡笔小新 2024-12-26 10:42:40

小王儿

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章