NLP Hands-On Journey (4): Using the pipeline Function

Author: | Source: Internet | 2023-08-17 19:30

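✅ Study notes from an incoming NLP graduate student

Table of Contents

1. Requirements
2. Introduction to pipeline
3. Using pipeline
- 3.1 Sentiment Classification
- 3.2 Fill-Mask (Cloze)
- 3.3 Text Generation
- 3.4 Named Entity Recognition
- 3.5 Summarization
- 3.6 Translation
- 3.7 Reading Comprehension (Question Answering)
4. Summary
5. Additional Notes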

● Previous article: NLP Hands-On Journey (3): Using Evaluation and Metric Functions (Metric, with BLEU and GLUE as examples)

1. Requirements

● Python 3.7+ and PyTorch 1.10+ are required.

● The library used in this article is Hugging Face Transformers. Official documentation: https://huggingface.co/docs/transformers/index [A very good open-source project that bundles a great deal of functionality around the transformer architecture; 72.3k ⭐️ on GitHub at the time of writing]

● To install Hugging Face Transformers, run pip install transformers in a terminal (the pip method). If you use conda, run conda install -c huggingface transformers instead.
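Before installing anything else, it can help to confirm that the environment meets the version requirements above. The following is a minimal sketch using only the standard library; the helper names (parse_version, meets_minimum, check_environment) are made up for this note, and it assumes Python 3.8+ because importlib.metadata is not available in 3.7.

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Return up to three leading numeric components of a version string."""
    # "1.10.2+cu113" -> (1, 10, 2); the local build tag after "+" is ignored.
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(installed, minimum):
    """Compare an installed version string against a minimum version tuple."""
    return parse_version(installed) >= minimum


def check_environment():
    """Report whether Python and PyTorch satisfy the versions listed above."""
    report = {"python": sys.version_info[:2] >= (3, 7)}
    try:
        report["torch"] = meets_minimum(metadata.version("torch"), (1, 10))
    except metadata.PackageNotFoundError:
        report["torch"] = False  # PyTorch is not installed in this environment
    return report


if __name__ == "__main__":
    print(check_environment())
```

A False entry in the report means that component should be installed or upgraded before running the examples in section 3.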

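2. Introduction to pipeline

● Hugging Face provides a lightweight, easy-to-use tool called pipeline that we can use to solve simple NLP tasks. pipeline exposes a simple API dedicated to several tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering.

● Learning and using pipeline gives us a more direct, intuitive feel for what working on NLP tasks is like.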

3. Using pipeline

3.1 Sentiment Classification

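● If we do not specify a model, pipeline automatically downloads distilbert-base-uncased-finetuned-sst-2-english to the local model cache. The original notes give ~/.cache/torch as the location; recent transformers releases default to a folder under ~/.cache/huggingface instead.

● If installing packages with pip is slow, you can configure the Tsinghua PyPI mirror and run again: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple (this speeds up pip installs only; it does not change where models are downloaded from).

● Task: classify the sentiment of a given text as positive or negative (binary classification).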

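from transformers import pipeline

# With no model argument, pipeline falls back to the default model for the task.
my_classifier = pipeline("sentiment-analysis")

result = my_classifier("This restaurant is good")
print(result)

# Chinese input meaning "I think the food at this restaurant is not good".
result = my_classifier("我觉得这家餐馆不好吃")
print(result)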

● On the first run, pipeline downloads the sentiment-analysis model and then classifies each input.

[Figure: screenshot of the run output]
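To make the result format more concrete: as far as I understand, the sentiment pipeline returns a label together with a score, and that score is a softmax probability over the model's two output logits. The sketch below reproduces that last step with the standard library only. The logits are hypothetical values chosen for illustration, not real model output, and to_prediction is a made-up helper name.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Build a dict shaped like one pipeline result: top label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits for a clearly positive sentence (not real model output).
print(to_prediction([-3.0, 3.0]))
```

The further apart the two logits are, the closer the reported score gets to 1, which is why confident predictions show scores such as 0.99.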

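3.2 Fill-Mask (Cloze)

● If we do not specify a model, pipeline automatically downloads distilroberta-base to the local model cache.

● Task: the model fills in the <mask> position; each score is the probability the model assigns to that word.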

from pprint import pprint

from transformers import pipeline

my_unmasker = pipeline("fill-mask")
# <mask> is the mask token used by the default distilroberta-base model.
sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'
result = my_unmasker(sentence)
pprint(result)

Output:
[{'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.',
  'score': 0.17927534878253937,
  'token': 3944,
  'token_str': ' tool'},
 {'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.',
  'score': 0.11349416524171829,
  'token': 7208,
  'token_str': ' framework'},
 {'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.',
  'score': 0.05243571847677231,
  'token': 5560,
  'token_str': ' library'},
 {'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.',
  'score': 0.034935351461172104,
  'token': 8503,
  'token_str': ' database'},
 {'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.',
  'score': 0.028602460399270058,
  'token': 17715,
  'token_str': ' prototype'}]
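The result is a plain list of dictionaries, so it can be post-processed with ordinary Python. Here is a small sketch that works on a trimmed copy of the output shown above (scores rounded to four decimals, only two keys kept); top_candidates is a made-up helper name.

```python
# Trimmed copy of the fill-mask output above, with scores rounded for brevity.
results = [
    {"token_str": " tool", "score": 0.1793},
    {"token_str": " framework", "score": 0.1135},
    {"token_str": " library", "score": 0.0524},
    {"token_str": " database", "score": 0.0349},
    {"token_str": " prototype", "score": 0.0286},
]


def top_candidates(results, k=3):
    """Return the k most probable fill-in words with their scores."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    # token_str carries a leading space from the tokenizer, so strip it.
    return [(r["token_str"].strip(), r["score"]) for r in ranked[:k]]


# Probability mass covered by the five returned candidates.
covered = sum(r["score"] for r in results)

print(top_candidates(results))
print(round(covered, 4))
```

The five candidates together cover only about 41% of the probability mass, which is a reminder that the model spreads the remaining probability over many other words.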

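3.3 Text Generation

● If we do not specify a model, pipeline automatically downloads gpt2 to the local model cache.

● Task: given a passage or a single sentence, the model continues it with generated text. max_length caps the total length in tokens, prompt included.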

from transformers import pipeline

text_generator = pipeline("text-generation")
# do_sample=False turns off sampling, so decoding is greedy and deterministic.
result = text_generator("As far as I am concerned, I will", max_length=50, do_sample=False)
print(result)

Output:
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
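The repeated "I think that the idea" in the output is typical of greedy decoding, which is what do_sample=False selects: at every step the single most probable next token is chosen. The toy sketch below illustrates the idea with a hand-written table of next-word probabilities. The table and the function name greedy_generate are invented for illustration, and it counts words rather than tokens.

```python
# Hypothetical next-word probabilities; a real model scores its whole vocabulary.
NEXT_WORD = {
    "I": {"think": 0.6, "will": 0.4},
    "think": {"that": 0.9, "so": 0.1},
    "that": {"the": 0.7, "I": 0.3},
    "the": {"idea": 0.8, "market": 0.2},
    "idea": {"of": 0.6, "is": 0.4},
    "of": {"the": 0.9, "a": 0.1},
}


def greedy_generate(prompt, max_length):
    """Extend the prompt one word at a time, always taking the likeliest word."""
    words = prompt.split()
    while len(words) < max_length:  # max_length counts the prompt words too
        options = NEXT_WORD.get(words[-1])
        if not options:
            break  # no known continuation for this word
        words.append(max(options, key=options.get))
    return " ".join(words)


print(greedy_generate("I think", max_length=10))
# prints "I think that the idea of the idea of the"
```

Because the choice at each step is deterministic, the same prompt always yields the same text, and the generator can fall into a loop as it does here.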

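3.4 Named Entity Recognition

● If we do not specify a model, pipeline automatically downloads dbmdz/bert-large-cased-finetuned-conll03-english to the local model cache.

● Task: given a passage, the model identifies the named entities in it, such as people, places, cities, and companies.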

from transformers import pipeline

ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

# Each result describes one word piece: its entity tag, score, and character offsets.
for entity in ner_pipe(sequence):
    print(entity)

Output:
{'entity': 'I-ORG', 'score': 0.9995786, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982225, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.999488, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994345, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993196, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514269, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9336589, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.97616535, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9914629, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
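The output lists word pieces such as 'Hu' and '##gging' rather than whole entities. One way to recover the full names is to merge neighbouring pieces that share an entity type, using the character offsets. The sketch below does this with the tags and offsets copied from the output above; merge_entities and max_gap are names chosen for this note. Recent transformers versions also accept an aggregation_strategy argument on the ner pipeline that performs a similar grouping, which is probably the better choice in practice.

```python
SEQUENCE = (
    "Hugging Face Inc. is a company based in New York City. "
    "Its headquarters are in DUMBO, therefore very close to the "
    "Manhattan Bridge which is visible from the window."
)

# (entity tag, start, end) triples copied from the NER output above.
TOKENS = [
    ("I-ORG", 0, 2), ("I-ORG", 2, 7), ("I-ORG", 8, 12), ("I-ORG", 13, 16),
    ("I-LOC", 40, 43), ("I-LOC", 44, 48), ("I-LOC", 49, 53),
    ("I-LOC", 79, 80), ("I-LOC", 80, 82), ("I-LOC", 82, 84),
    ("I-LOC", 114, 123), ("I-LOC", 124, 130),
]


def merge_entities(tokens, text, max_gap=1):
    """Join adjacent pieces of the same entity type into whole entity spans."""
    spans = []  # each item is [kind, start, end]
    for tag, start, end in tokens:
        kind = tag.split("-")[-1]  # "I-ORG" -> "ORG"
        # Extend the previous span when the type matches and the pieces touch
        # or are separated by at most max_gap characters (a single space).
        if spans and spans[-1][0] == kind and start - spans[-1][2] <= max_gap:
            spans[-1][2] = end
        else:
            spans.append([kind, start, end])
    return [(kind, text[s:e]) for kind, s, e in spans]


for kind, name in merge_entities(TOKENS, SEQUENCE):
    print(kind, name)
```

Run on the data above, this groups the twelve pieces into four entities: Hugging Face Inc, New York City, DUMBO, and Manhattan Bridge.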

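3.5 Summarization

● If we do not specify a model, pipeline automatically downloads sshleifer/distilbart-cnn-12-6 to the local model cache.

● Task: generate a short summary of a long article; min_length and max_length bound the length of the summary in tokens.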

from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""

# do_sample=False makes the summary deterministic for a given model and input.
result = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
print(result)

Output:
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]

3.6 Translation

● If we do not specify a model, pipeline automatically downloads t5-base to the local model cache.

● Task: translate an English sentence into German.

from transformers import pipeline

# Translation: the task name encodes the language pair (English to German).
translator = pipeline("translation_en_to_de")
sentence = "Hugging Face is a technology company based in New York and Paris"
result = translator(sentence, max_length=40)
print(result)

Output:
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

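3.7 Reading Comprehension (Question Answering)

● If we do not specify a model, pipeline automatically downloads distilbert-base-cased-distilled-squad to the local model cache.

● This code may fail to run; the cause may be a bug in the original model.

● Task: given a passage of text and a question about it, the model returns the corresponding answer.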

from transformers import pipeline

question_answerer = pipeline("question-answering")

# The r prefix makes this a raw string, so backslashes are not treated as
# escape sequences and the text is kept exactly as written.
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is extractive question answering?", context=context)
print(result)
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(result)

Output:
{'score': 0.6177279353141785, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5152313113212585, 'start': 148, 'end': 161, 'answer': 'SQuAD dataset'}
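The start and end fields are character offsets into the context, so the answer can be located in the original text. The sketch below checks this for the first result above. It uses only the first sentence of the context and assumes the context begins with a line break, as a triple-quoted string opened before a newline would; highlight is a made-up helper name.

```python
# First sentence of the context, assumed to start with a newline character.
context = (
    "\nExtractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)

# First result from the output above.
result = {
    "score": 0.6177279353141785,
    "start": 34,
    "end": 95,
    "answer": "the task of extracting an answer from a text given a question",
}


def highlight(context, result):
    """Mark the answer span inside the context using the returned offsets."""
    start, end = result["start"], result["end"]
    if context[start:end] != result["answer"]:
        raise ValueError("offsets do not match the answer text")
    return context[:start] + "[" + context[start:end] + "]" + context[end:]


print(highlight(context, result))
```

This is what makes the task extractive: the model does not write new text, it only selects a span that already exists in the context.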

4. Summary

● This installment is not the core of the series, but it gives a direct feel for how accessible NLP has become. Deeper explorations will follow, so let's keep at it!

5. Additional Notes

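● When no model is specified, pipeline uses a default checkpoint that the transformers library assigns to each task and downloads it from the Hugging Face Hub. The default is not picked by ranking or popularity, and it can change between library releases, so the models named above may be different a few years from now.

● If anything here is wrong or unclear, comments and discussion are welcome.

● Reference video: "HuggingFace简明教程,BERT中文模型实战示例.NLP预训练模型,Transformers类库,datasets类库快速入门" (a concise HuggingFace tutorial with hands-on Chinese BERT examples and a quick start to NLP pretrained models, the Transformers library, and the datasets library).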
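Because the model that pipeline picks when none is specified can change over time, it can help to pin the checkpoint explicitly. The sketch below collects the checkpoints named in sections 3.1 to 3.6 and builds the keyword arguments for pipeline, which accepts task and model parameters. DEFAULT_MODELS and pinned_kwargs are names made up for this note, and the commented-out call needs transformers installed plus network access on first use.

```python
# Task name -> checkpoint, as listed in sections 3.1 to 3.6 of this article.
DEFAULT_MODELS = {
    "sentiment-analysis": "distilbert-base-uncased-finetuned-sst-2-english",
    "fill-mask": "distilroberta-base",
    "text-generation": "gpt2",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "translation_en_to_de": "t5-base",
}


def pinned_kwargs(task):
    """Return keyword arguments that pin the checkpoint for a given task."""
    if task not in DEFAULT_MODELS:
        raise KeyError(f"no pinned model recorded for task {task!r}")
    return {"task": task, "model": DEFAULT_MODELS[task]}


print(pinned_kwargs("fill-mask"))

# Usage sketch (requires transformers and a network connection on first run):
# from transformers import pipeline
# classifier = pipeline(**pinned_kwargs("sentiment-analysis"))
```

Pinning the checkpoint this way keeps results reproducible even if a later library release changes its defaults.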

⭐️ ⭐️
