
Does Python have an equivalent of MATLAB's lwt? Simple implementations of n-grams, tf-idf, and cosine similarity in Python

Author: zpcbb80569 | Source: Internet | 2023-06-28 13:38

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine similarity:

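import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    # Despite the name, the value returned is the cosine similarity, not a distance.
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))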
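When documents are stored as sparse term-weight dictionaries rather than NumPy arrays, the same formula can be applied directly to the dictionaries. The following is a minimal sketch using only the standard library; the helper name `sparse_cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, not part of NLTK.

```python
import math

def sparse_cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)

doc1 = {"cat": 1.0, "sat": 1.0}
doc2 = {"cat": 1.0, "mat": 1.0}
similarity = sparse_cosine(doc1, doc2)  # one shared term out of two each
print(similarity)
```

The two example vectors share one of their two terms, so the result is close to 0.5, up to floating-point rounding.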

For n-grams:

from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    # Padding adds n-1 copies of pad_symbol at the chosen end(s).
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    # Number of full windows of length n; 0 if the sequence is too short.
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
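For text, the input sequence is usually a list of tokens. As a rough illustration, here is a compact unpadded variant applied to a tokenized sentence; `word_ngrams` is a hypothetical helper written for this example (it is not an NLTK function), and splitting on whitespace is only a stand-in for a real tokenizer.

```python
def word_ngrams(tokens, n):
    """Return the n-grams of a token list as a list of tuples (no padding)."""
    # zip stops at the shortest slice, so inputs shorter than n yield [].
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox".split()
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

This slides a window of width n over the tokens, which is the same idea as the list-slicing loop above, only without the padding options.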

For tf-idf, you first have to compute the term distribution. I use Lucene for that, but you could just as well do something similar with NLTK, using FreqDist.

If you prefer PyLucene, the following shows how to compute tf.idf:

# reader = lucene.IndexReader(FSDirectory.open(index_loc))
# reader, searcher, sim, tmap, fieldname, maxtokensperdoc, max_doc and
# cosine_normalization are not defined in this snippet.
docs = reader.numDocs()
for i in range(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        # zipping with range(maxtokensperdoc) caps the number of terms per document
        for (t, f, x) in zip(terms, frequencies, range(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()])**0.5
            for k, v in rec.items():
                rec[k] = v / denom
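The same two steps, weighting each term by tf.idf and then applying cosine normalization, can be sketched without Lucene. The version below uses `collections.Counter` as a standard-library stand-in for NLTK's FreqDist and a plain `tf * log(N / df)` weighting; both are assumptions made for illustration, and the weighting is not necessarily what Lucene's `sim.tf` and `sim.idf` compute. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        rec = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        # Cosine normalization: scale the vector to unit length.
        denom = math.sqrt(sum(w * w for w in rec.values()))
        if denom > 0:
            rec = {t: w / denom for t, w in rec.items()}
        vectors.append(rec)
    return vectors

docs = [["cat", "sat", "mat"], ["cat", "ate", "fish"], ["dog", "sat"]]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print(vec)
```

Because every vector is scaled to unit length, the cosine similarity between two documents reduces to the dot product of their weight dictionaries. A term that occurs in every document gets weight 0 under this weighting.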
