
Tokenizer in Python


As we all know, there is an enormous amount of text data on the internet. However, most of us may not be familiar with how to begin working with this text data. Moreover, we know that making sense of the letters of our language is a tricky part of machine learning, since machines recognize numbers, not letters.

So, how do we manipulate and clean text data in order to build a model? To answer this question, let us explore some of the wonderful concepts that fall under Natural Language Processing (NLP).

Solving an NLP problem is a multi-stage process. First, we have to clean the unstructured text data before we can move on to the modeling stage. Data cleaning includes a few key steps. These steps are as follows:


  1. Word tokenization

  2. Part-of-speech prediction for each token

  3. Text lemmatization

  4. Stop word identification and removal, and so on.

In the following tutorial, we will learn more about the very first of these steps, known as tokenization. We will see what tokenization is and why it is necessary for Natural Language Processing. Moreover, we will discover some unique ways of performing tokenization in Python.

Understanding Tokenization

Tokenization is the splitting of a large amount of text into smaller fragments known as tokens. These fragments or tokens are quite useful for finding patterns and are considered the base step for stemming and lemmatization. Tokenization also supports replacing sensitive data elements with non-sensitive ones.

Natural Language Processing (NLP) is used to build applications such as text classification, sentiment analysis, intelligent chatbots, language translation, and more. It is therefore important to understand the patterns in text in order to achieve the purposes mentioned above.

For now, consider stemming and lemmatization to be the primary steps for cleaning text data with the help of NLP. Tasks such as text classification or spam filtering make use of NLP along with deep learning libraries such as [Tensorflow](https://www.javatpoint.com/tensorflow).

Understanding the Significance of Tokenization in NLP

To understand the significance of tokenization, let us take the English language as an example. Pick any sentence and keep it in mind while reading through the following section.

Before processing natural language, we have to identify the words that make up a string. This is why tokenization appears to be the most fundamental step in natural language processing.

This step is necessary because the actual meaning of a text can be interpreted by analyzing each word present in it.

Now, let us take the following string as an example:

"My name is Jamie Clark."

After performing tokenization on the above string, we would get the output shown below:

['My', 'name', 'is', 'Jamie', 'Clark']

There are several uses for performing this operation. We can make use of tokenization to:


  • Count the total number of words in a text.

  • Count the frequency of a word, i.e., the total number of times a particular word appears, and more (a minimal sketch of both counts follows this list).
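
For instance, here is a minimal sketch of both counts using Python's built-in collections.Counter (the sample text and variable names are illustrative, not taken from the article's examples):


from collections import Counter

# An illustrative sample, tokenized with the basic split() approach
sample_text = "the quick brown fox jumps over the lazy dog the fox"
tokens = sample_text.split()

print(len(tokens))             # total number of words: 11
print(Counter(tokens)["the"])  # occurrences of a particular word: 3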

Now, let us look at several ways of performing tokenization for NLP in Python.

Some Methods to Perform Tokenization in Python

There are various unique methods of performing tokenization on text data. Some of these unique methods are described below.

Tokenization Using the split() Function in Python

The split() function is one of the basic methods for splitting a string. This function returns a list of strings after splitting the provided string by the specified separator. By default, the split() function breaks a string at every whitespace. However, we can specify the separator as needed.

Let us consider the following examples:

Example 1.1: Word Tokenization Using the split() Function


my_text = """Let's play a game, Would You Rather! It's simple, you have to pick one or the other. Let's get started. Would you rather try Vanilla Ice Cream or Chocolate one? Would you rather be a bird or a bat? Would you rather explore space or the ocean? Would you rather live on Mars or on the Moon? Would you rather have many good friends or one very best friend? Isn't it easy though? When we have less choices, it's easier to decide. But what if the options would be complicated? I guess, you pretty much not understand my point, neither did I, at first place and that led me to a Bad Decision."""
print(my_text.split())

Output:

["Let's", 'play', 'a', 'game,', 'Would', 'You', 'Rather!', "It's", 'simple,', 'you', 'have', 'to', 'pick', 'one', 'or', 'the', 'other.', "Let's", 'get', 'started.', 'Would', 'you', 'rather', 'try', 'Vanilla', 'Ice', 'Cream', 'or', 'Chocolate', 'one?', 'Would', 'you', 'rather', 'be', 'a', 'bird', 'or', 'a', 'bat?', 'Would', 'you', 'rather', 'explore', 'space', 'or', 'the', 'ocean?', 'Would', 'you', 'rather', 'live', 'on', 'Mars', 'or', 'on', 'the', 'Moon?', 'Would', 'you', 'rather', 'have', 'many', 'good', 'friends', 'or', 'one', 'very', 'best', 'friend?', "Isn't", 'it', 'easy', 'though?', 'When', 'we', 'have', 'less', 'choices,', "it's", 'easier', 'to', 'decide.', 'But', 'what', 'if', 'the', 'options', 'would', 'be', 'complicated?', 'I', 'guess,', 'you', 'pretty', 'much', 'not', 'understand', 'my', 'point,', 'neither', 'did', 'I,', 'at', 'first', 'place', 'and', 'that', 'led', 'me', 'to', 'a', 'Bad', 'Decision.']

Explanation:

In the above example, we used the split() method to break the paragraph into smaller fragments, i.e., words. Likewise, we can also break the paragraph into sentences by passing a separator as the argument to the split() function. As we know, a sentence usually ends with a full stop; this means we can use ". " (a full stop followed by a space) as the separator for splitting the string.

Let us consider this in the following example:

Example 1.2: Sentence Tokenization Using the split() Function


my_text = """Dreams. Desires. Reality. There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality. Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us. We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try."""
print(my_text.split('. '))

Output:

['Dreams', 'Desires', 'Reality', 'There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality', 'Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us', 'We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try.']

Explanation:

In the above example, we used ". " (a full stop followed by a space) as the argument so that the paragraph breaks at the end of each sentence. A major drawback of using the split() function is that it accepts only one separator at a time, so we can only split the string by a single delimiter. Moreover, the split() function does not treat punctuation marks as separate fragments.
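
A quick illustration of this limitation (an illustrative snippet, not part of the original article):


# split() keeps punctuation attached to the neighboring words
print("Hello, world! How are you?".split())
# Output: ['Hello,', 'world!', 'How', 'are', 'you?']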

Tokenization Using Regular Expressions (RegEx) in Python

Before moving on to the next method, let us briefly go over regular expressions. A regular expression, also known as a regex, is a special sequence of characters that allows users to find or match other strings or sets of strings, using that sequence as a pattern.

To start working with regular expressions, Python provides a library named re. The re library is part of Python's standard library, so no separate installation is needed.

Let us consider the following examples of word tokenization and sentence tokenization using the RegEx method in Python.

Example 2.1: Word Tokenization Using the RegEx Method in Python


import re
my_text = """Joseph Arthur was a young businessman. He was one of the shareholders at Ryan Cloud's Start-Up with James Foster and George Wilson. The Start-Up took its flight in the mid-90s and became one of the biggest firms in the United States of America. The business was expanded in all major sectors of livelihood, starting from Personal Care to Transportation by the end of 2000\. Joseph was used to be a good friend of Ryan."""
my_tokens = re.findall

Output:

['Joseph', 'Arthur', 'was', 'a', 'young', 'businessman', 'He', 'was', 'one', 'of', 'the', 'shareholders', 'at', 'Ryan', 'Cloud', 's', 'Start', 'Up', 'with', 'James', 'Foster', 'and', 'George', 'Wilson', 'The', 'Start', 'Up', 'took', 'its', 'flight', 'in', 'the', 'mid', '90s', 'and', 'became', 'one', 'of', 'the', 'biggest', 'firms', 'in', 'the', 'United', 'States', 'of', 'America', 'The', 'business', 'was', 'expanded', 'in', 'all', 'major', 'sectors', 'of', 'livelihood', 'starting', 'from', 'Personal', 'Care', 'to', 'Transportation', 'by', 'the', 'end', 'of', '2000', 'Joseph', 'was', 'used', 'to', 'be', 'a', 'good', 'friend', 'of', 'Ryan']

Explanation:

In the above example, we imported the re library in order to use its functionality. We then used the findall() function of the re library. This function helps the user find all the substrings matching the pattern given in its argument and stores them in a list.

Furthermore, \w is used to represent any word character, i.e., alphanumeric characters (letters and digits) and the underscore (_). The + means one or more occurrences. So we followed the [\w']+ pattern, so that the program finds every run of alphanumeric characters (and apostrophes) until any other character is encountered.
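
As a quick check of how this pattern behaves (an illustrative snippet), note that the apostrophe inside the character class keeps contractions together, while characters outside the class, such as hyphens, act as break points:


import re

print(re.findall(r"[\w']+", "It's the mid-90s era."))
# Output: ["It's", 'the', 'mid', '90s', 'era']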

Now, let us look at sentence tokenization using the RegEx method.

Example 2.2: Sentence Tokenization Using the RegEx Method in Python


import re
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""
my_sentences = re.compile('[.!?] ').split(my_text)
print(my_sentences)

Output:

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America', 'The product became so successful among the people that the production was increased', 'Two new plant sites were finalized, and the construction was started', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories', 'Many popular magazines were started publishing Critiques about him.']

Explanation:

In the above example, we used the compile() function of the re library with the pattern "[.!?] " as its argument, and then used the split() method to break the string at the specified separators. As a result, the program splits the paragraph into sentences as soon as any of these characters is encountered.
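
Note that the same result can be obtained without compile() by calling re.split() directly, which takes the pattern and the string in a single call (a minor stylistic alternative, not used in the article's example):


import re

text = "First sentence. Second one! A third? The last one."
print(re.split('[.!?] ', text))
# Output: ['First sentence', 'Second one', 'A third', 'The last one.']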

Tokenization Using the Natural Language Toolkit in Python

The Natural Language Toolkit, also known as NLTK, is a library written in Python. The NLTK library is generally used for symbolic and statistical natural language processing and works well with text data.

The Natural Language Toolkit (NLTK) is a third-party library that can be installed from a command shell or terminal with the following command:


$ pip install --user -U nltk


To verify the installation, import the nltk library in a program and execute it, as shown below:


import nltk

If the program does not raise an error, the library has been installed successfully. Otherwise, it is recommended to follow the above installation procedure again and to read the official documentation for more details.

The Natural Language Toolkit (NLTK) has a module named tokenize. This module is further divided into two sub-categories: word tokenization and sentence tokenization.


  1. Word Tokenize: the word_tokenize() method is used to split a string into tokens, i.e., words.

  2. Sentence Tokenize: the sent_tokenize() method is used to split a string or paragraph into sentences. (A one-time model download may be required; see the note after this list.)
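
Note: on a fresh NLTK installation, both methods depend on NLTK's pre-trained Punkt tokenizer models, which have to be downloaded once (the exact resource name may vary across NLTK versions):


import nltk

# One-time download of the Punkt models used by word_tokenize() and sent_tokenize()
nltk.download('punkt')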

Let us consider some examples based on these two methods:

Example 3.1: Word Tokenization Using the NLTK Library in Python


from nltk.tokenize import word_tokenize
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""
print(word_tokenize(my_text))

Output:

['The', 'Advertisement', 'was', 'telecasted', 'nationwide', ',', 'and', 'the', 'product', 'was', 'sold', 'in', 'around', '30', 'states', 'of', 'America', '.', 'The', 'product', 'became', 'so', 'successful', 'among', 'the', 'people', 'that', 'the', 'production', 'was', 'increased', '.', 'Two', 'new', 'plant', 'sites', 'were', 'finalized', ',', 'and', 'the', 'construction', 'was', 'started', '.', 'Now', ',', 'The', 'Cloud', 'Enterprise', 'became', 'one', 'of', 'America', "'s", 'biggest', 'firms', 'and', 'the', 'mass', 'producer', 'in', 'all', 'major', 'sectors', ',', 'from', 'transportation', 'to', 'personal', 'care', '.', 'Director', 'of', 'The', 'Cloud', 'Enterprise', ',', 'Ryan', 'Cloud', ',', 'was', 'now', 'started', 'getting', 'interviewed', 'over', 'his', 'success', 'stories', '.', 'Many', 'popular', 'magazines', 'were', 'started', 'publishing', 'Critiques', 'about', 'him', '.']

Explanation:

In the above program, we imported the word_tokenize() method from the tokenize module of the NLTK library. The method breaks the string into separate tokens and stores them in a list, which we then printed. Moreover, this method treats full stops and other punctuation marks as separate tokens.

Example 3.2: Sentence Tokenization Using the NLTK Library in Python


from nltk.tokenize import sent_tokenize
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""
print(sent_tokenize(my_text))

Output:

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America.', 'The product became so successful among the people that the production was increased.', 'Two new plant sites were finalized, and the construction was started.', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care.", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories.', 'Many popular magazines were started publishing Critiques about him.']

Explanation:

In the above program, we imported the sent_tokenize() method from the tokenize module of the NLTK library. The method breaks the paragraph into separate sentences and stores them in a list, which we then printed.

Conclusion

In this tutorial, we explored the concept of tokenization and its role in the overall Natural Language Processing (NLP) pipeline. We also discussed several methods of tokenization (both word tokenization and sentence tokenization) for a given text or string in Python.


