
NLP Hands-On Road (4): Using the pipeline Function


✅ Study notes from an incoming NLP graduate student



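Table of Contents

  • 1. Required Environment
  • 2. Introduction to pipeline
  • 3. Using pipeline
    • 3.1 Sentiment Classification
    • 3.2 Fill-Mask (Cloze)
    • 3.3 Text Generation
    • 3.4 Named Entity Recognition
    • 3.5 Summarization
    • 3.6 Text Translation
    • 3.7 Reading Comprehension
  • 4. Summary
  • 5. Additional Notes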



Previous article: NLP Hands-On Road (3): Using Evaluation and Metric Functions (Metric, with BLEU and GLUE as Examples)



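1. Required Environment

Python 3.7+ and PyTorch 1.10+ are required.

● The library used in this article is Hugging Face Transformers. Official documentation: https://huggingface.co/docs/transformers/index (an excellent open-source project that brings together a large collection of Transformer-based models; 72.3k ⭐️ on GitHub at the time of writing)

● To install Hugging Face Transformers, run pip install transformers in a terminal (the pip method); if you use conda, run conda install -c huggingface transformers instead.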
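Before moving on, it can help to confirm that the interpreter and packages listed above are actually available. The following is a minimal sketch using only the standard library; the helper name check_environment is hypothetical, and it only checks that torch and transformers can be found, not that PyTorch is at least 1.10.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 7), packages=("torch", "transformers")):
    """Report whether Python is new enough and whether each package is importable.

    Hypothetical helper for illustration; it does not verify package versions.
    """
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


REPORT = check_environment()
print(REPORT)
```

If either package is reported as False, install it with the pip or conda command above and run the check again.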





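2. Introduction to pipeline

● Hugging Face provides a lightweight, simple tool called pipeline, which we can use to solve some simple NLP tasks. pipeline offers a simple API dedicated to several tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering.

● Learning and using pipeline gives us a direct, intuitive feel for what handling NLP tasks is like.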





3. Using pipeline

3.1 Sentiment Classification

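● If we do not specify a model, pipeline automatically downloads distilbert-base-uncased-finetuned-sst-2-english into the local cache (under ~/.cache/huggingface in recent versions of transformers; older versions used ~/.cache/torch).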

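● If pip downloads are too slow, you can configure the Tsinghua mirror and run again: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple (this setting affects pip package installs, not model downloads from the Hugging Face Hub).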

● Task: binary sentiment classification of the given text.

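from transformers import pipeline

my_classifier = pipeline("sentiment-analysis")
result = my_classifier("This restaurant is good")
print(result)
# Chinese input meaning "I think the food at this restaurant is not good";
# note that the default model was fine-tuned on English data
result = my_classifier("我觉得这家餐馆不好吃")
print(result)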

● When run, it first downloads the sentiment-analysis model and then prints the classification result for each input.



3.2 Fill-Mask (Cloze)

● If we do not specify a model, pipeline automatically downloads distilroberta-base into the local Hugging Face cache directory.

● Task: the model fills in the <mask> position; each score is the probability of that word filling the blank.

from transformers import pipeline
from pprint import pprint

my_unmasker = pipeline("fill-mask")
# <mask> marks the position the model should fill in
sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'
result = my_unmasker(sentence)
pprint(result)

Output:
[{'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.',
  'score': 0.17927534878253937,
  'token': 3944,
  'token_str': ' tool'},
 {'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.',
  'score': 0.11349416524171829,
  'token': 7208,
  'token_str': ' framework'},
 {'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.',
  'score': 0.05243571847677231,
  'token': 5560,
  'token_str': ' library'},
 {'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.',
  'score': 0.034935351461172104,
  'token': 8503,
  'token_str': ' database'},
 {'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.',
  'score': 0.028602460399270058,
  'token': 17715,
  'token_str': ' prototype'}]
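The fill-mask result is an ordinary list of dictionaries, so it can be post-processed with plain Python. As a small sketch (the helper name top_fillers and the 0.1 threshold are illustrative assumptions, and the scores are copied from the output above with the 'sequence' and 'token' fields left out), the following keeps only candidates whose probability reaches a threshold:

```python
# Scores and tokens copied from the fill-mask output shown above.
RESULT = [
    {"score": 0.17927534878253937, "token_str": " tool"},
    {"score": 0.11349416524171829, "token_str": " framework"},
    {"score": 0.05243571847677231, "token_str": " library"},
    {"score": 0.034935351461172104, "token_str": " database"},
    {"score": 0.028602460399270058, "token_str": " prototype"},
]


def top_fillers(result, min_score):
    """Return (word, score) pairs whose score is at least min_score.

    Hypothetical helper: 'token_str' carries a leading space for this
    tokenizer, so it is stripped before being returned.
    """
    return [
        (item["token_str"].strip(), item["score"])
        for item in result
        if item["score"] >= min_score
    ]


CONFIDENT = top_fillers(RESULT, min_score=0.1)
for word, score in CONFIDENT:
    print(f"{word}: {score:.1%}")
```

With this threshold only "tool" and "framework" remain, which shows that even the best candidate here has a fairly low probability.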



3.3 Text Generation

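● If we do not specify a model, pipeline automatically downloads gpt2 into the local Hugging Face cache directory.

● Task: given a passage or a sentence, the model continues it with generated text; max_length caps the total length in tokens, prompt included.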

from transformers import pipeline

text_generator = pipeline("text-generation")
result = text_generator("As far as I am concerned, I will", max_length=50, do_sample=False)
print(result)

Output:
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]



3.4 Named Entity Recognition

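● If we do not specify a model, pipeline automatically downloads dbmdz/bert-large-cased-finetuned-conll03-english into the local Hugging Face cache directory.

● Task: given a passage, the model identifies the person names, place names, city names, company names, and so on that appear in it.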

from transformers import pipeline

ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

for entity in ner_pipe(sequence):
    print(entity)

Output:
{'entity': 'I-ORG', 'score': 0.9995786, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982225, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.999488, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994345, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993196, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514269, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9336589, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.97616535, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9914629, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
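The output above is reported per word piece ("Hu", "##gging", ...), while what we usually want is whole entities. Below is a rough sketch of one way to merge neighbouring pieces, using the 'index', 'start' and 'end' fields copied from the output above (scores and words are left out for brevity). The helper merge_entities is hypothetical and is not how the library does it; recent versions of transformers also let you pass aggregation_strategy="simple" to pipeline("ner") for built-in grouping.

```python
TEXT = (
    "Hugging Face Inc. is a company based in New York City. "
    "Its headquarters are in DUMBO,\n"
    "therefore very close to the Manhattan Bridge which is visible from the window."
)

# Fields copied from the NER output shown above.
TOKENS = [
    {"entity": "I-ORG", "index": 1, "start": 0, "end": 2},
    {"entity": "I-ORG", "index": 2, "start": 2, "end": 7},
    {"entity": "I-ORG", "index": 3, "start": 8, "end": 12},
    {"entity": "I-ORG", "index": 4, "start": 13, "end": 16},
    {"entity": "I-LOC", "index": 11, "start": 40, "end": 43},
    {"entity": "I-LOC", "index": 12, "start": 44, "end": 48},
    {"entity": "I-LOC", "index": 13, "start": 49, "end": 53},
    {"entity": "I-LOC", "index": 19, "start": 79, "end": 80},
    {"entity": "I-LOC", "index": 20, "start": 80, "end": 82},
    {"entity": "I-LOC", "index": 21, "start": 82, "end": 84},
    {"entity": "I-LOC", "index": 28, "start": 114, "end": 123},
    {"entity": "I-LOC", "index": 29, "start": 124, "end": 130},
]


def merge_entities(text, tokens):
    """Merge tokens with the same label and consecutive 'index' into one span."""
    spans = []  # each entry: [label, start, end, last_token_index]
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        if spans and spans[-1][0] == label and tok["index"] == spans[-1][3] + 1:
            spans[-1][2] = tok["end"]    # extend the current span
            spans[-1][3] = tok["index"]
        else:
            spans.append([label, tok["start"], tok["end"], tok["index"]])
    # Slice the original text so spaces between words are preserved.
    return [(label, text[start:end]) for label, start, end, _ in spans]


ENTITIES = merge_entities(TEXT, TOKENS)
for label, surface in ENTITIES:
    print(label, surface)
```

Slicing the original text with the character offsets, rather than joining the 'word' fields, avoids having to undo the "##" word-piece markers by hand.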



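3.5 Summarization

● If we do not specify a model, pipeline automatically downloads sshleifer/distilbart-cnn-12-6 into the local Hugging Face cache directory.

● Task: given a long passage, the model produces a short summary of it.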

from transformers import pipeline

summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
result = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
print(result)

Output:
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]



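3.6 Text Translation

● If we do not specify a model, pipeline automatically downloads t5-base into the local Hugging Face cache directory.

● Task: translate an English sentence into German.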

from transformers import pipeline

# Translation (English to German)
translator = pipeline("translation_en_to_de")
sentence = "Hugging Face is a technology company based in New York and Paris"
result = translator(sentence, max_length=40)
print(result)

Output:
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]



3.7 Reading Comprehension

● If we do not specify a model, pipeline automatically downloads distilbert-base-cased-distilled-squad into the local Hugging Face cache directory.

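● This code may fail to run, possibly because of a bug in the default model.

● Task: given a passage of text and a question about it, the model returns the corresponding answer.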

from transformers import pipeline

question_answerer = pipeline("question-answering")
# The r prefix makes this a raw string, so backslashes are not treated as escape characters
# and the text is kept exactly as written
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is extractive question answering?", context=context)
print(result)
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(result)

Output:
{'score': 0.6177279353141785, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5152313113212585, 'start': 148, 'end': 161, 'answer': 'SQuAD dataset'}
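The 'start' and 'end' fields in the result are character offsets into context, which is why this kind of question answering is called extractive: the answer is a slice of the input, not newly generated text. A minimal check, using the context and the first result copied from above (no model is run here):

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

# First result copied from the output above.
result = {
    "score": 0.6177279353141785,
    "start": 34,
    "end": 95,
    "answer": "the task of extracting an answer from a text given a question",
}

# Slicing the context with the reported offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```

Because the offsets count every character, including the leading newline of the triple-quoted string, any change in whitespace inside context shifts them.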





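4. Summary

● This installment is not the main focus of the series, but it gives us an intuitive taste of how accessible NLP has become. More in-depth exploration will follow, so let's keep going together!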



5. Additional Notes


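● For the task you specify, pipeline downloads the default model that the transformers library has chosen for that task (not a live ranking of the most popular models on the Hub). These defaults can change between library versions, so a few years from now the downloaded models may differ from the ones listed above.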

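● If anything here is wrong, or if you have questions, you are welcome to discuss in the comments.

● Reference video: HuggingFace简明教程,BERT中文模型实战示例.NLP预训练模型,Transformers类库,datasets类库快速入门. (A concise HuggingFace tutorial with a hands-on Chinese BERT example and a quick start for the Transformers and datasets libraries.)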
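Since the default model behind a task is not guaranteed to stay the same, one way to keep results reproducible is to pass the model name explicitly through the model argument of pipeline. The sketch below pins the checkpoints named in sections 3.1 to 3.6; the mapping PINNED_MODELS and the helper build_pipeline are hypothetical names, and transformers is imported lazily so the mapping can be inspected without the library installed.

```python
# Checkpoints named in sections 3.1 to 3.6 of this article.
PINNED_MODELS = {
    "sentiment-analysis": "distilbert-base-uncased-finetuned-sst-2-english",
    "fill-mask": "distilroberta-base",
    "text-generation": "gpt2",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "translation_en_to_de": "t5-base",
}


def build_pipeline(task):
    """Create a pipeline for `task` with an explicitly pinned model.

    Raises KeyError for tasks that are not in PINNED_MODELS.
    """
    model_name = PINNED_MODELS[task]
    # Imported here so that looking up the mapping does not require transformers.
    from transformers import pipeline
    return pipeline(task, model=model_name)
```

For example, build_pipeline("fill-mask") would behave like pipeline("fill-mask") does in section 3.2, but it keeps using distilroberta-base even if the library's default changes later.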

