
What Is the Embedding Layer For in Deep Learning?


[Image: word embeddings]
This post is a translation of the fourth article in a deep learning series; to read the other articles, follow the links below. Human translation is work too: if you found this useful, please consider tipping, and please tip when reposting:

  1. Setting up AWS & Image Recognition
  2. Convolutional Neural Networks
  3. More on CNNs & Handling Overfitting

The Embedding layer comes up constantly in deep learning experiments, yet explanations of it online are remarkably vague. The Keras Chinese documentation, for example, says little more about it than "the embedding layer turns positive integers (indices) into dense vectors of fixed size." So why do we use an Embedding layer at all? There are two main reasons:

  1. Vectors produced by one-hot encoding are high-dimensional and sparse. Suppose in a natural language processing (NLP) task we have a dictionary of 2,000 words. With one-hot encoding, every word is represented by a vector of 2,000 integers, 1,999 of which are zeros. With an even larger dictionary, this approach becomes very computationally inefficient (the sketch after this list makes the size difference concrete).

  2. During training, every embedding vector gets updated. If you look at the images in this post, you can see how similar words end up close to each other in a multi-dimensional space, which lets us visualize the relationships between words. And not just words: anything that can be turned into a vector through an Embedding layer can be explored this way.
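To make point 1 concrete, here is a minimal numpy sketch contrasting the storage cost of a one-hot vector with an embedding lookup; the 2,000-word dictionary and the 6 latent factors are the numbers used in this post, the index 42 is an arbitrary example:

import numpy as np

vocab_size = 2000   # the 2,000-word dictionary from point 1
embed_dim = 6       # latent factors per word, as chosen later in this post

# One-hot: every word costs a vector of 2,000 numbers, 1,999 of them zero.
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0   # the word with (arbitrary) index 42

# Embedding: the same word is just a row lookup in a (2000, 6) matrix.
embedding_matrix = np.random.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))
dense_vector = embedding_matrix[42]

print(one_hot.size)       # 2000 numbers per word
print(dense_vector.size)  # 6 numbers per word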

The above may still sound a little abstract, so let's look at what an Embedding layer does to the following sentence. The idea comes from word embeddings; if you are interested in reading more, look up word2vec.

“deep learning is very deep”

The first step in using an embedding layer is to encode this sentence by indices: every unique word is assigned an index. The sentence above then becomes:

1 2 3 4 1
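A minimal sketch of this indexing step; the word-to-index mapping is built by hand here, but any tokenizer would do:

sentence = "deep learning is very deep"

# Assign an index to each unique word, in order of first appearance.
word_to_index = {}
for word in sentence.split():
    if word not in word_to_index:
        word_to_index[word] = len(word_to_index) + 1  # start indices at 1

encoded = [word_to_index[w] for w in sentence.split()]
print(encoded)  # [1, 2, 3, 4, 1]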

Next we create the embedding matrix. We have to decide how many "latent factors" each index gets, which essentially means how long we want each vector to be; common choices are 32 and 50. To keep this post readable, we will give each index 6 latent factors. The embedding matrix then looks like this:

[Image: the embedding matrix]

This way we can use the embedding matrix instead of huge one-hot vectors, keeping every vector small. In short, all the embedding layer does here is represent the word "deep" by the vector [.32, .02, .48, .21, .56, .15]. Note that a word is not replaced by its vector directly but by an index that is used to look up the vector in the embedding matrix, and this lookup stays computationally efficient on large datasets. Because the embedding vectors are updated while the deep neural network trains, we can then explore which words are similar to each other in the high-dimensional space, and visualize those similarities with a dimensionality-reduction technique such as t-SNE.
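As a sketch of what that lookup amounts to: the row for "deep" uses the numbers quoted above, while the other rows are placeholder values for illustration:

import numpy as np

# One row of 6 latent factors per index; row 0 is unused padding here.
embedding_matrix = np.array([
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],  # index 0 (padding)
    [0.32, 0.02, 0.48, 0.21, 0.56, 0.15],  # index 1: "deep"
    [0.11, 0.30, 0.47, 0.05, 0.82, 0.09],  # index 2: "learning" (placeholder)
    [0.65, 0.12, 0.33, 0.74, 0.20, 0.41],  # index 3: "is"       (placeholder)
    [0.28, 0.53, 0.19, 0.60, 0.07, 0.36],  # index 4: "very"     (placeholder)
])

encoded_sentence = [1, 2, 3, 4, 1]            # "deep learning is very deep"
vectors = embedding_matrix[encoded_sentence]  # fancy indexing is the lookup
print(vectors[0])  # [0.32 0.02 0.48 0.21 0.56 0.15] -> the vector for "deep"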

[Image: t-SNE visualization of word embeddings]
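A minimal sketch of that visualization step with scikit-learn's TSNE, reusing the embedding_matrix and words from the previous sketch; a real visualization would of course use many more words and a trained embedding:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["deep", "learning", "is", "very"]
# Rows 1..4 of embedding_matrix correspond to the words above.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(
    embedding_matrix[1:]
)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()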


Not Just Word Embeddings

These previous examples showed that word embeddings are very important in the world of Natural Language Processing. They allow us to capture relationships in language that are very difficult to capture otherwise. However, embedding layers can be used to embed many more things than just words. In my current research project I am using embedding layers to embed online user behavior. In this case I am assigning indices to user behavior like ‘page view on page type X on portal Y’ or ‘scrolled X pixels’. These indices are then used for constructing a sequence of user behavior.

In a comparison of ‘traditional’ machine learning models (SVM, Random Forest, Gradient Boosted Trees) with deep learning models (deep neural networks, recurrent neural networks) I found that this embedding approach worked very well for deep neural networks.

The ‘traditional’ machine learning models rely on a tabular input that is feature engineered. This means that we, as researchers, decide what gets turned into a feature. In these cases features could be: number of homepages visited, number of searches performed, total number of pixels scrolled. However, it is very difficult to capture the sequential (time) dimension when doing feature engineering. By using deep learning and embedding layers we can efficiently capture this sequential dimension by supplying a sequence of user behavior (as indices) as input for the model.
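As a sketch of how such a behavior sequence could be fed to a deep model: the event names come from the examples above, but the vocabulary size, sequence length, and layer sizes are hypothetical placeholders, not the models from the research itself:

from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

# Hypothetical mapping from behavior events to indices.
event_to_index = {
    "page view on page type X on portal Y": 1,
    "scrolled X pixels": 2,
    # ... one index per distinct behavior
}

n_events = 500     # hypothetical number of distinct behaviors
max_seq_len = 100  # hypothetical length of one user's behavior sequence

model = Sequential()
model.add(Embedding(n_events + 1, 32, input_length=max_seq_len))  # index 0 = padding
model.add(GRU(16))                         # captures the order of events
model.add(Dense(1, activation='sigmoid'))  # e.g. predict a binary outcome
model.compile(loss='binary_crossentropy', optimizer='rmsprop')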

In my research the Recurrent Neural Networks with Gated Recurrent Units/Long Short-Term Memory performed best, though the results were very close. Of the ‘traditional’ feature-engineered models, Gradient Boosted Trees performed best. I will write a blog post about this research in more detail in the future. I think my next blog post will explore Recurrent Neural Networks in more detail.

Other research has explored the use of embedding layers to encode student behavior in MOOCs (Piech et al., 2015) and users’ path through an online fashion store (Tamhane et al., 2017).


Recommender Systems

Embedding layers can even be used to deal with the sparse matrix problem in recommender systems. Since the deep learning course (fast.ai) uses recommender systems to introduce embedding layers I want to explore them here as well.

Recommender systems are being used everywhere and you are probably being influenced by them every day. The most common examples are Amazon’s product recommendation and Netflix’s program recommendation systems. Netflix actually held a $1,000,000 challenge to find the best collaborative filtering algorithm for their recommender system. You can see a visualization of one of these models here.

There are two main types of recommender systems and it is important to distinguish between the two.

  1. Content-based filtering: this type of filtering is based on data about the item/product. For example, we have our users fill out a survey on what movies they like. If they say that they like sci-fi movies, we recommend them sci-fi movies. In this case a lot of meta-information has to be available for all items.
  2. Collaborative filtering: let’s find other people like you, see what they liked, and assume you like the same things. People like you = people who rated movies that you watched in a similar way. In a large dataset this has proven to work a lot better than the meta-data approach. Essentially, asking people about their behavior tells you less than observing their actual behavior; discussing this further is something for the psychologists among us.

In order to solve this problem we can create a huge matrix of the ratings of all users against all movies. However, in many cases this will create an extremely sparse matrix. Just think of your Netflix account. What percentage of their total supply of series and movies have you watched? It’s probably a pretty small percentage. Then, through gradient descent we can train a neural network to predict how high each user would rate each movie. Let me know if you would like to know more about the use of deep learning in recommender systems and we can explore it further together. In conclusion, embedding layers are amazing and should not be overlooked.
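A minimal sketch of that idea in the spirit of the fast.ai lesson (not its actual notebook), using the Keras 2 functional API with placeholder sizes: one embedding per user, one per movie, and a dot product as the predicted rating:

from keras.models import Model
from keras.layers import Input, Embedding, Flatten, dot

n_users, n_movies, n_factors = 1000, 1700, 50  # placeholder sizes

user_in = Input(shape=(1,))
movie_in = Input(shape=(1,))
user_vec = Flatten()(Embedding(n_users, n_factors)(user_in))
movie_vec = Flatten()(Embedding(n_movies, n_factors)(movie_in))

rating = dot([user_vec, movie_vec], axes=1)  # predicted rating = dot product
model = Model([user_in, movie_in], rating)
model.compile(loss='mse', optimizer='adam')
# model.fit([user_ids, movie_ids], ratings) with arrays of user/movie indices

Gradient descent then fills in the two embedding tables so that the dot products reproduce the known ratings, which is exactly the dense version of factorizing the sparse rating matrix.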

If you liked this post, be sure to recommend it so others can see it. You can also follow this profile to keep up with my progress in the Fast AI course. See you there!

References

Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. In Advances in Neural Information Processing Systems (pp. 505–513).

Tamhane, A., Arora, S., & Warrier, D. (2017, May). Modeling Contextual Changes in User Behaviour in Fashion e-Commerce. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 539–550). Springer, Cham.

Embedding layers in Keras

An Embedding layer is used at the start of a network to turn your input into vectors, so before using one, first ask whether your data needs to be turned into vectors at all. If you have categorical data, or data consisting only of integers from a fixed-size set (like a dictionary), an Embedding layer is worth trying.
If your data is multi-dimensional, you can share one embedding layer across the inputs or try a separate embedding layer per input (see the sketch below).
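A sketch of those two options with the functional API; the sizes and names are illustrative only. Applying the same Embedding instance to both inputs shares one weight matrix, while two instances learn separate tables:

from keras.layers import Input, Embedding

input_a = Input(shape=(10,))
input_b = Input(shape=(10,))

# Option 1: shared embedding -- one weight matrix, applied to both inputs.
shared = Embedding(input_dim=1000, output_dim=32)
emb_a = shared(input_a)
emb_b = shared(input_b)

# Option 2: separate embeddings -- each input gets its own weight matrix.
emb_a2 = Embedding(input_dim=1000, output_dim=32)(input_a)
emb_b2 = Embedding(input_dim=1000, output_dim=32)(input_b)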

from keras.layers.embeddings import Embedding

Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)

- input_dim: int >= 0. Size of the vocabulary, i.e. 1 + the maximum integer index occurring in the input data. This is the first value of the Embedding constructor: the range of values in the input. In the example below it’s 2 because we give a binary vector as input.
- output_dim: the target dimension, i.e. the length of the embedding vectors. This is the second value of the constructor.
- input_length: the length of the input sequences, given as the third value in the example below.

Translated from: https://medium.com/towards-data-science/deep-learning-4-embedding-layers-f9a02d55ac12

How does embedding work? An example best demonstrates what is going on.

Assume you have a sparse vector [0,1,0,1,1,0,0] of dimension seven. You can turn it into a sequence of dense 2-d vectors like so:

import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding

model = Sequential()
model.add(Embedding(2, 2, input_length=7))  # 2 possible values, 2-d output, 7 inputs
model.compile('rmsprop', 'mse')
model.predict(np.array([[0,1,0,1,1,0,0]]))
array([[[ 0.03005414, -0.02224021],
        [ 0.03396987, -0.00576888],
        [ 0.03005414, -0.02224021],
        [ 0.03396987, -0.00576888],
        [ 0.03396987, -0.00576888],
        [ 0.03005414, -0.02224021],
        [ 0.03005414, -0.02224021]]], dtype=float32)

Where do these numbers come from? It’s a simple map from the given range to a 2d space:

model.layers[0].W.get_value()
array([[ 0.03005414, -0.02224021],
       [ 0.03396987, -0.00576888]], dtype=float32)

The 0-value is mapped to the first index and the 1-value to the second as can be seen by comparing the two arrays. The first value of the Embedding constructor is the range of values in the input. In the example it’s 2 because we give a binary vector as input. The second value is the target dimension. The third is the length of the vectors we give.
So, there is nothing magical in this, merely a mapping from integers to floats.
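Indeed, the same output can be reproduced with a plain numpy lookup into the weight matrix; a small sketch reusing the weights printed above:

import numpy as np

# The layer's weight matrix, as printed by model.layers[0].W.get_value() above.
W = np.array([[ 0.03005414, -0.02224021],
              [ 0.03396987, -0.00576888]], dtype=np.float32)

x = np.array([0, 1, 0, 1, 1, 0, 0])
print(W[x])  # identical to model.predict(...): each 0/1 is replaced by its row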

Now back to our ‘shining’ detection. The training data looks like sequences of bits:

X
array([[ 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [ 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.]])

If you want to use the embedding it means that the output of the embedding layer will have dimension (5, 19, 10). This works well with LSTM or GRU (see below) but if you want a binary classifier you need to flatten this to (5, 19*10):

model = Sequential()
model.add(Embedding(3, 10, input_length=X.shape[1]))  # output shape: (5, 19, 10)
model.add(Flatten())                                  # flatten to (5, 190)
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X, y=y, batch_size=200, nb_epoch=700, verbose=0, validation_split=0.2, show_accuracy=True, shuffle=True)

It detects ‘shining’ flawlessly:

model.predict(X)
array([[ 1.00000000e+00],
       [ 8.39483363e-08],
       [ 9.71878720e-08],
       [ 7.35597965e-08],
       [ 9.91844118e-01]], dtype=float32)

An LSTM layer has historical memory, so the 3-d output of the embedding works directly in this case; no need to flatten things:

model = Sequential()
model.add(Embedding(vocab_size, 10))  # vocab_size: number of distinct input values (3 above)
model.add(LSTM(5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X, y=y, nb_epoch=500, verbose=0, validation_split=0.2, show_accuracy=True, shuffle=True)

Obviously, it predicts things as well:

model.predict(X)
array([[ 0.96855599],
       [ 0.01917232],
       [ 0.01917362],
       [ 0.01917258],
       [ 0.02341695]], dtype=float32)

Translated from: http://www.orbifold.net/default/2017/01/10/embedding-and-tokenizer-in-keras/

