ELMo Code Walkthrough (1): Data Preparation

Author: 歪友46300606 | Source: Internet | 2023-09-17 17:11


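ELMo Code Reading Notes

1. Data preparation

Data preparation covers: 1. a vocabulary class for words; 2. a vocabulary class for characters; 3. a batch generator class that takes word ids as input; 4. a batch generator class that takes char ids as input; 5. dataset classes that produce the language-model input.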

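1.1 Word vocabulary class (Vocabulary)

Given a vocabulary file, this class builds the two-way mapping between words and indices: _id_to_word, which is a list, and _word_to_id, which is a dict. The vocabulary also needs three special tokens, <S>, </S> and <UNK>, which mark the beginning of a sentence, the end of a sentence, and an unknown word respectively. The main code is: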

def __init__(self, filename, validate_file=False):
    '''
    filename = the vocabulary file. It is a flat text file with one
    (normalized) token per line. In addition, the file should also
    contain the special tokens <S>, </S>, <UNK> (case sensitive).
    '''
    self._id_to_word = []
    self._word_to_id = {}
    self._unk = -1
    self._bos = -1
    self._eos = -1

    with open(filename) as f:
        idx = 0
        for line in f:  # one word per line of the vocabulary file
            word_name = line.strip()
            if word_name == '<S>':
                self._bos = idx
            elif word_name == '</S>':
                self._eos = idx
            elif word_name == '<UNK>':
                self._unk = idx
            if word_name == '!!!MAXTERMID':
                continue

            self._id_to_word.append(word_name)
            self._word_to_id[word_name] = idx
            idx += 1

    # check to ensure file has special tokens
    if validate_file:
        if self._bos == -1 or self._eos == -1 or self._unk == -1:
            raise ValueError("Ensure the vocabulary file has "
                             "<S>, </S>, <UNK> tokens")
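The two mappings built by the constructor can be tried out on a toy file. The following is a minimal sketch, not the library's code: `load_vocab` and the five-line vocabulary are hypothetical, and plain Python containers are used throughout.

```python
import os
import tempfile


def load_vocab(path):
    """Return (id_to_word, word_to_id) from a one-token-per-line file."""
    id_to_word, word_to_id = [], {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word == "!!!MAXTERMID":  # skipped, as in the listing above
                continue
            word_to_id[word] = len(id_to_word)
            id_to_word.append(word)
    return id_to_word, word_to_id


# Hypothetical toy vocabulary file, written to a temporary location.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<S>\n</S>\n<UNK>\nthe\ncat\n")
    vocab_path = tmp.name

id_to_word, word_to_id = load_vocab(vocab_path)
os.remove(vocab_path)

print(id_to_word)         # ['<S>', '</S>', '<UNK>', 'the', 'cat']
print(word_to_id["cat"])  # 4
```

Because ids are assigned in file order, the list index of a word and its dict value always agree, which is what makes the two structures inverses of each other.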

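The class also has two handy methods: encode and decode. encode turns a sentence into a list of word ids, adding the beginning-of-sentence and end-of-sentence tokens. It also has a reverse option for the backward direction of the bidirectional LSTM: the sentence is assumed to be reversed already, and the method only swaps the BOS/EOS tokens. decode turns a list of word ids back into the corresponding words.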

def encode(self, sentence, reverse=False, split=True):
    """Convert a sentence to a list of ids, with special tokens added.
    Sentence is a single string with tokens separated by whitespace.

    If reverse, then the sentence is assumed to be reversed, and
    this method will swap the BOS/EOS tokens appropriately.
    """
    if split:
        word_ids = [
            self.word_to_id(cur_word) for cur_word in sentence.split()
        ]
    else:
        word_ids = [self.word_to_id(cur_word) for cur_word in sentence]

    if reverse:
        # reversed sentence: </S> goes first and <S> goes last
        return np.array([self.eos] + word_ids + [self.bos], dtype=np.int32)
    else:
        # wrap the sentence with <S> and </S>
        return np.array([self.bos] + word_ids + [self.eos], dtype=np.int32)

def decode(self, cur_ids):
    """Convert a list of ids to a sentence, with space inserted."""
    return ' '.join([self.id_to_word(cur_id) for cur_id in cur_ids])
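A small sketch of the encode/decode round trip may help here. It assumes a hypothetical six-word vocabulary and assumes that unknown words fall back to the <UNK> id (the lookup method itself is not shown in the listing); plain lists stand in for NumPy arrays.

```python
# Hypothetical toy vocabulary; ids follow file order.
word_to_id = {"<S>": 0, "</S>": 1, "<UNK>": 2, "the": 3, "cat": 4, "sat": 5}
id_to_word = {i: w for w, i in word_to_id.items()}
BOS, EOS, UNK = 0, 1, 2


def encode(sentence, reverse=False):
    """Map a whitespace-separated sentence to ids, adding BOS/EOS."""
    ids = [word_to_id.get(w, UNK) for w in sentence.split()]  # assumed fallback
    if reverse:
        # The caller has already reversed the tokens; only the markers swap.
        return [EOS] + ids + [BOS]
    return [BOS] + ids + [EOS]


def decode(ids):
    """Map ids back to a space-joined string."""
    return " ".join(id_to_word[i] for i in ids)


forward = encode("the cat sat")
backward = encode(" ".join(reversed("the cat sat".split())), reverse=True)

print(forward)            # [0, 3, 4, 5, 1]
print(backward)           # [1, 5, 4, 3, 0]
print(decode(forward))    # <S> the cat sat </S>
print(encode("the dog"))  # [0, 3, 2, 1]
```

Note that `reverse=True` does not reverse anything by itself; with the tokens reversed by the caller, the backward sequence is exactly the forward sequence read right to left.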

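1.2 Character vocabulary (UnicodeCharsVocabulary)

Note that this class is a subclass of the word vocabulary class Vocabulary above, so it inherits all of Vocabulary's attributes and methods.
The id of each character comes from its UTF-8 encoding: a word is encoded to UTF-8 bytes and every byte value is used as an id, which gives a direct conversion between ids and characters. Because ids are byte values, the character vocabulary has at most 256 regular ids (0-255). Five extra special ids are added on top: beginning of sentence, end of sentence, beginning of word, end of word, and padding. The code that builds the _word_char_ids table from the vocabulary file is: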

# convert a word to its char ids
def _convert_word_to_char_ids(self, word):
    code = np.zeros([self.max_word_length], dtype=np.int32)
    code[:] = self.pad_char
    # encode the word as UTF-8 bytes and store them in the array, e.g.
    # "english" -> e:101, n:110, g:103, l:108, i:105, s:115, h:104
    word_encoded = word.encode('utf-8', 'ignore')[:(self.max_word_length-2)]
    code[0] = self.bow_char  # begin-of-word marker
    for k, chr_id in enumerate(word_encoded, start=1):
        code[k] = chr_id
    code[k + 1] = self.eow_char  # end-of-word marker
    return code

def __init__(self, filename, max_word_length, **kwargs):
    # call the parent Vocabulary to build the word <-> id mappings
    super(UnicodeCharsVocabulary, self).__init__(filename, **kwargs)
    self._max_word_length = max_word_length  # max number of chars per word

    # char ids 0-255 come from utf-8 encoding bytes
    # assign 256-300 to special chars
    self.bos_char = 256  # <begin sentence>
    self.eos_char = 257  # <end sentence>
    self.bow_char = 258  # <begin word>
    self.eow_char = 259  # <end word>
    self.pad_char = 260  # <padding>

    num_words = len(self._id_to_word)  # number of words, set by the parent

    # every word gets one row of char ids
    self._word_char_ids = np.zeros([num_words, max_word_length],
                                   dtype=np.int32)

    # the character representation of the begin/end of sentence characters
    def _make_bos_eos(c):
        r = np.zeros([self.max_word_length], dtype=np.int32)
        r[:] = self.pad_char
        r[0] = self.bow_char  # begin of word
        r[1] = c
        r[2] = self.eow_char  # end of word
        return r
    self.bos_chars = _make_bos_eos(self.bos_char)  # char ids for <S>
    self.eos_chars = _make_bos_eos(self.eos_char)  # char ids for </S>

    # walk the id-to-word list and compute the char ids of every word
    for i, word in enumerate(self._id_to_word):
        self._word_char_ids[i] = self._convert_word_to_char_ids(word)

    # <S> and </S> are treated as single-character words
    self._word_char_ids[self.bos] = self.bos_chars
    self._word_char_ids[self.eos] = self.eos_chars
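To make the byte-level encoding concrete, here is a minimal stand-alone sketch of the word-to-char-ids step. The constants mirror the special ids in the listing; the function is a free function over plain lists rather than a NumPy-based method, and `k = 0` is initialised up front (an addition of this sketch) so that an empty word does not fail.

```python
BOW_CHAR, EOW_CHAR, PAD_CHAR = 258, 259, 260


def word_to_char_ids(word, max_word_length):
    """Encode one word as a fixed-length list of UTF-8 byte ids."""
    code = [PAD_CHAR] * max_word_length
    # Two slots are reserved for the begin/end-of-word markers.
    word_encoded = word.encode("utf-8", "ignore")[: max_word_length - 2]
    code[0] = BOW_CHAR
    k = 0
    for k, byte in enumerate(word_encoded, start=1):
        code[k] = byte
    code[k + 1] = EOW_CHAR
    return code


print(word_to_char_ids("english", 10))
# [258, 101, 110, 103, 108, 105, 115, 104, 259, 260]
print(word_to_char_ids("é", 6))
# [258, 195, 169, 259, 260, 260]
print(word_to_char_ids("english", 5))
# [258, 101, 110, 103, 259]
```

The second call shows why the ids are byte values rather than code points: a single non-ASCII character takes two slots. The third call shows truncation when a word is longer than the available slots.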

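With these two functions we get the char-id sequence of every word, including the char-id representation of the beginning-of-sentence and end-of-sentence tokens.
The class can also convert a whole sentence into an array of char ids. It first looks up each word in the _word_char_ids table (words missing from the vocabulary are converted on the fly), then stacks the rows into a sentence, so the result is a two-dimensional array. The implementation is: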

# return the char-id array of a word
def word_to_char_ids(self, word):
    if word in self._word_to_id:
        return self._word_char_ids[self._word_to_id[word]]
    else:
        return self._convert_word_to_char_ids(word)

def encode_chars(self, sentence, reverse=False, split=True):
    '''
    Encode the sentence as a white space delimited string of tokens.
    '''
    if split:  # sentence is a string: split it on whitespace
        chars_ids = [self.word_to_char_ids(cur_word)
                     for cur_word in sentence.split()]
    else:  # sentence is already a list of tokens
        chars_ids = [self.word_to_char_ids(cur_word)
                     for cur_word in sentence]
    if reverse:
        # reversed sentence: </S> goes first and <S> goes last
        return np.vstack([self.eos_chars] + chars_ids + [self.bos_chars])
    else:
        # wrap the sentence with <S> and </S>
        return np.vstack([self.bos_chars] + chars_ids + [self.eos_chars])
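The lookup-then-stack behaviour can be sketched as follows. Everything here is illustrative: the cache holds two hypothetical in-vocabulary words, `MAX_WORD_LENGTH` is set to 6 to keep the output short, and a list of lists stands in for the stacked 2-D array.

```python
BOS_CHAR, EOS_CHAR, BOW_CHAR, EOW_CHAR, PAD_CHAR = 256, 257, 258, 259, 260
MAX_WORD_LENGTH = 6  # hypothetical, kept small for readability


def make_bos_eos(marker):
    """Row for a sentence marker: treated as a one-character word."""
    row = [PAD_CHAR] * MAX_WORD_LENGTH
    row[0], row[1], row[2] = BOW_CHAR, marker, EOW_CHAR
    return row


def convert_word(word):
    """Row for an ordinary word, padded to MAX_WORD_LENGTH."""
    data = word.encode("utf-8", "ignore")[: MAX_WORD_LENGTH - 2]
    row = [BOW_CHAR] + list(data) + [EOW_CHAR]
    return row + [PAD_CHAR] * (MAX_WORD_LENGTH - len(row))


# Precomputed rows for in-vocabulary words (hypothetical vocabulary).
cache = {w: convert_word(w) for w in ["hi", "you"]}


def encode_chars(tokens, reverse=False):
    """Return one row per token, plus one row each for <S> and </S>."""
    rows = [cache[t] if t in cache else convert_word(t) for t in tokens]
    first, last = make_bos_eos(BOS_CHAR), make_bos_eos(EOS_CHAR)
    if reverse:
        first, last = last, first
    return [first] + rows + [last]


matrix = encode_chars(["hi", "yo"])  # "yo" is out of vocabulary
for row in matrix:
    print(row)
# [258, 256, 259, 260, 260, 260]
# [258, 104, 105, 259, 260, 260]
# [258, 121, 111, 259, 260, 260]
# [258, 257, 259, 260, 260, 260]
```

An out-of-vocabulary word still gets a full character representation, which is the main advantage of character input over plain word ids.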

1.3 Batch class for word-id input (TokenBatcher)

This class converts a batch of sentences into their word-id form. The main code is:

def batch_sentences(self, sentences: List[List[str]]):
    '''
    Batch the sentences as character ids
    (despite this docstring, the method returns word ids)
    Each sentence is a list of tokens without <s> or </s>, e.g.
    [['The', 'first', 'sentence', '.'], ['Second', '.']]
    '''
    n_sentences = len(sentences)
    max_length = max(len(sentence) for sentence in sentences) + 2

    # word ids are two-dimensional: [batch_size, max_len]
    X_ids = np.zeros((n_sentences, max_length), dtype=np.int64)

    for k, sent in enumerate(sentences):
        length = len(sent) + 2
        ids_without_mask = self._lm_vocab.encode(sent, split=False)
        # add one so that 0 is the mask value
        X_ids[k, :length] = ids_without_mask + 1

    return X_ids
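The padding scheme (shift every real id by one so that 0 can mean "no token") is easy to check on the two example sentences from the docstring. This is a sketch under assumptions: the vocabulary ids are hypothetical, unknown words are assumed to map to <UNK>, and nested lists replace the NumPy array.

```python
BOS, EOS, UNK = 0, 1, 2
# Hypothetical vocabulary covering the docstring example.
word_to_id = {
    "<S>": 0, "</S>": 1, "<UNK>": 2,
    "The": 3, "first": 4, "sentence": 5, ".": 6, "Second": 7,
}


def batch_sentences(sentences):
    """Pad tokenised sentences into a [n_sentences][max_length] grid."""
    max_length = max(len(s) for s in sentences) + 2  # room for <S> and </S>
    batch = [[0] * max_length for _ in sentences]
    for k, sent in enumerate(sentences):
        ids = [BOS] + [word_to_id.get(w, UNK) for w in sent] + [EOS]
        # Shift by one so that 0 is free to act as the mask value.
        batch[k][: len(ids)] = [i + 1 for i in ids]
    return batch


x_ids = batch_sentences([["The", "first", "sentence", "."], ["Second", "."]])
for row in x_ids:
    print(row)
# [1, 4, 5, 6, 7, 2]
# [1, 8, 7, 2, 0, 0]
```

Without the shift, the id 0 of <S> would be indistinguishable from padding; after the shift, any position holding 0 is known to be padding.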

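1.4 Batch class for char-id input (Batcher)

This is similar to the class above, except that it produces the char-id representation of a batch of sentences, which is a three-dimensional array. The main code is: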

def batch_sentences(self, sentences: List[List[str]]):
    '''
    Batch the sentences as character ids
    Each sentence is a list of tokens without <s> or </s>, e.g.
    [['The', 'first', 'sentence', '.'], ['Second', '.']]
    '''
    n_sentences = len(sentences)  # number of sentences
    # longest sentence, plus 2 for <S> and </S>
    max_length = max(len(sentence) for sentence in sentences) + 2

    # three-dimensional: the char ids of every word of every sentence
    X_char_ids = np.zeros(
        (n_sentences, max_length, self._max_token_length),
        dtype=np.int64
    )

    for k, sent in enumerate(sentences):
        length = len(sent) + 2
        # char-id array of this sentence
        char_ids_without_mask = self._lm_vocab.encode_chars(
            sent, split=False)
        # add one so that 0 is the mask value; positions past the end of
        # the sentence keep the value 0
        X_char_ids[k, :length, :] = char_ids_without_mask + 1

    return X_char_ids
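The following sketch shows the resulting three-dimensional layout and how a token-level mask could be recovered from it. It is an illustration only: `MAX_TOKEN_LENGTH` is set to 5, the helper names are hypothetical, and nested lists replace the NumPy array.

```python
BOS_CHAR, EOS_CHAR, BOW_CHAR, EOW_CHAR, PAD_CHAR = 256, 257, 258, 259, 260
MAX_TOKEN_LENGTH = 5  # hypothetical, kept small for readability


def char_row(payload):
    """Wrap byte ids with word markers and pad to MAX_TOKEN_LENGTH."""
    row = [BOW_CHAR] + list(payload)[: MAX_TOKEN_LENGTH - 2] + [EOW_CHAR]
    return row + [PAD_CHAR] * (MAX_TOKEN_LENGTH - len(row))


def batch_sentences(sentences):
    """Return a [n_sentences][max_length][MAX_TOKEN_LENGTH] grid."""
    max_length = max(len(s) for s in sentences) + 2
    batch = [
        [[0] * MAX_TOKEN_LENGTH for _ in range(max_length)] for _ in sentences
    ]
    for k, sent in enumerate(sentences):
        rows = (
            [char_row([BOS_CHAR])]
            + [char_row(tok.encode("utf-8")) for tok in sent]
            + [char_row([EOS_CHAR])]
        )
        for j, row in enumerate(rows):
            batch[k][j] = [c + 1 for c in row]  # shift so 0 means "no token"
    return batch


x_char_ids = batch_sentences([["ab", "c"], ["d"]])
# A token position is real if any of its char ids is non-zero.
mask = [[any(tok) for tok in sent] for sent in x_char_ids]

print(x_char_ids[0][1])  # [259, 98, 99, 260, 261]
print(x_char_ids[1][3])  # [0, 0, 0, 0, 0]
print(mask)              # [[True, True, True, True], [True, True, True, False]]
```

Padding inside a word (id 260, shifted to 261) stays non-zero; only whole token positions beyond the end of a sentence are all zeros.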

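Next comes a function that produces batches of all the data. Each call reads one batch from the input; every entry in the batch is a sentence, and each entry holds the sentence's word ids, its char ids, and the targets (for each word of the sentence, the next word to predict). The function takes a generator that yields the data of one sentence at a time, including the word-id and char-id representations, so a batch is built by calling next on that generator until all batch_size rows are filled. The code is: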

def _get_batch(generator, batch_size, num_steps, max_word_length):
    """Read batches of input."""
    # one pending sentence per batch row; None means nothing loaded yet
    cur_stream = [None] * batch_size

    no_more_data = False
    while True:
        # word ids of the batch
        inputs = np.zeros([batch_size, num_steps], np.int32)
        if max_word_length is not None:
            # char ids of every word of every row in the batch
            char_inputs = np.zeros([batch_size, num_steps, max_word_length],
                                   np.int32)
        else:
            char_inputs = None
        # ELMo is trained by predicting the next word, so the targets are
        # the inputs shifted by one token
        targets = np.zeros([batch_size, num_steps], np.int32)

        for i in range(batch_size):  # each row of the batch
            cur_pos = 0  # write position within the current row
            while cur_pos
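The listing above breaks off inside the inner loop. As an illustration of the idea described in the text (fill each row with `num_steps` tokens and use the same sequence shifted by one as the target), here is a simplified sketch. It is not the original function: it handles word ids only, takes a finite list of id sequences, and stops when the data runs out, discarding a partially filled batch. The name `get_batches` is hypothetical.

```python
def get_batches(sentences, batch_size, num_steps):
    """Yield (inputs, targets), each a [batch_size][num_steps] grid."""
    generator = iter(sentences)
    streams = [None] * batch_size  # leftover ids for each batch row
    while True:
        inputs = [[0] * num_steps for _ in range(batch_size)]
        targets = [[0] * num_steps for _ in range(batch_size)]
        for i in range(batch_size):
            pos = 0
            while pos < num_steps:
                # A stream with a single id left has no next word to predict.
                if streams[i] is None or len(streams[i]) <= 1:
                    try:
                        streams[i] = list(next(generator))
                    except StopIteration:
                        return  # out of data: drop the partial batch
                take = min(len(streams[i]) - 1, num_steps - pos)
                inputs[i][pos:pos + take] = streams[i][:take]
                targets[i][pos:pos + take] = streams[i][1:take + 1]
                streams[i] = streams[i][take:]
                pos += take
        yield inputs, targets


# Two encoded sentences: <S> w3 w4 w5 </S> and <S> w6 w7 </S>.
data = [[0, 3, 4, 5, 1], [0, 6, 7, 1]]
batches = list(get_batches(data, batch_size=1, num_steps=3))
for inputs, targets in batches:
    print(inputs, targets)
# [[0, 3, 4]] [[3, 4, 5]]
# [[5, 0, 6]] [[1, 6, 7]]
```

In the second batch the row continues where the first sentence stopped and then moves on to the next sentence, so sentences are packed back to back rather than padded to `num_steps`.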

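1.5 Dataset class for the language model (LMDataset)

The dataset class supplies the input data for language-model training. It randomly picks one file from the list of data files (the data is spread over many files rather than stored in a single one), reads that whole file into memory, and exposes a sentence generator; the _get_batch() function defined above is then called to produce one batch at a time. The implementation is: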

def get_sentence(self):
    """A generator that yields one sentence at a time."""
    while True:
        if self._i == self._nids:
            # current shard is used up: load another file
            self._ids = self._load_random_shard()
        ret = self._ids[self._i]  # one sentence per yield
        self._i += 1
        yield ret

def iter_batches(self, batch_size, num_steps):
    """An iterator over batches of data."""
    for X in _get_batch(self.get_sentence(), batch_size, num_steps,
                        self.max_word_length):

        # token_ids = (batch_size, num_steps)
        # char_inputs = (batch_size, num_steps, 50) of character ids
        # targets = word ID of next word (batch_size, num_steps)
        yield X
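The shard-reloading generator can be sketched in isolation. This is a toy version under stated assumptions: shards are in-memory lists instead of files, a shard is drawn at random with replacement, and the position counter is reset right where the shard is reloaded. The class name `ShardedSentences` is hypothetical.

```python
import itertools
import random


class ShardedSentences:
    """Endless sentence generator over randomly chosen shards."""

    def __init__(self, shards, seed=0):
        self._shards = shards
        self._rng = random.Random(seed)
        self._ids = self._load_random_shard()
        self._i = 0

    def _load_random_shard(self):
        # Stand-in for reading one whole data file into memory.
        return list(self._rng.choice(self._shards))

    def get_sentence(self):
        while True:
            if self._i == len(self._ids):
                # Current shard exhausted: load another and start over.
                self._ids = self._load_random_shard()
                self._i = 0
            ret = self._ids[self._i]
            self._i += 1
            yield ret


dataset = ShardedSentences([["a", "b"], ["c", "d", "e"]])
first_seven = list(itertools.islice(dataset.get_sentence(), 7))
print(first_seven)
```

Because the generator never raises StopIteration, whatever consumes it decides when training stops; `itertools.islice` is used here only to take a finite sample.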

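The dataset above only provides the input of an ordinary (forward) language model. To build a bidirectional LSTM model, the normal data has to be reversed to obtain the input of the backward LSTM. That is the job of the BidirectionalLMDataset class, whose core code is: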

def __init__(self, filepattern, vocab, test=False, shuffle_on_load=False):
    '''
    bidirectional version of LMDataset
    The forward LSTM reads the data as is; the backward LSTM only needs
    the same data reversed.
    '''
    self._data_forward = LMDataset(  # forward dataset
        filepattern, vocab, reverse=False, test=test,
        shuffle_on_load=shuffle_on_load)
    self._data_reverse = LMDataset(  # reversed dataset
        filepattern, vocab, reverse=True, test=test,
        shuffle_on_load=shuffle_on_load)

def iter_batches(self, batch_size, num_steps):
    """Merge the forward and reversed batches into one."""
    max_word_length = self._data_forward.max_word_length

    for X, Xr in zip(
        _get_batch(self._data_forward.get_sentence(), batch_size,
                   num_steps, max_word_length),
        _get_batch(self._data_reverse.get_sentence(), batch_size,
                   num_steps, max_word_length)
        ):

        # merge everything into X, producing token_ids_reverse,
        # tokens_characters_reverse and so on
        for k, v in Xr.items():
            X[k + '_reverse'] = v

        yield X
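The merging step is just a key rename on the reversed batch. A minimal sketch, assuming each batch is a dict and using two illustrative key names (`token_ids` and `next_token_id`); the function name `merge_bidirectional` is hypothetical, and the forward dict is copied rather than modified in place.

```python
def merge_bidirectional(forward_batches, reverse_batches):
    """Pair forward and reversed batches; suffix reversed keys."""
    for X, Xr in zip(forward_batches, reverse_batches):
        merged = dict(X)  # copy so the forward batch is left untouched
        for key, value in Xr.items():
            merged[key + "_reverse"] = value
        yield merged


# One toy batch in each direction for the sentence <S> w3 w4 </S>.
forward = [{"token_ids": [[0, 3, 4]], "next_token_id": [[3, 4, 1]]}]
backward = [{"token_ids": [[1, 4, 3]], "next_token_id": [[4, 3, 0]]}]

merged_batches = list(merge_bidirectional(forward, backward))
print(sorted(merged_batches[0]))
# ['next_token_id', 'next_token_id_reverse', 'token_ids', 'token_ids_reverse']
```

Since `zip` stops at the shorter input, the merged stream ends as soon as either direction runs out of batches.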
