
Deep Learning with CNTK (Part 2): Training an RNN-Based Language Model


The previous post, Deep Learning with CNTK (Part 1): Getting Started, introduced building a simple feed-forward neural network with CNTK, so I will assume the reader already knows the basics of using CNTK. This time we tackle something a bit more complex, and one of the hottest models in natural language mining: building a language model with a recurrent neural network.

A recurrent neural network (RNN), drawn graphically, is a network whose hidden layer connects back to itself (this is, of course, only one kind of RNN):

[Figure: an RNN whose hidden layer feeds back into itself]

Unlike an ordinary neural network, an RNN does not assume that samples are independent of one another. For example, to predict which character follows "上", the characters that appeared before it matter a lot: if "工作" (work) showed up earlier, the next word is probably "上班" (go to work); if "家乡" (hometown) showed up earlier, it is probably "上海" (Shanghai). An RNN is good at learning exactly this kind of temporal pattern. Put simply, an RNN treats the hidden-layer values from the previous time step as an extra set of features and feeds them in as part of the input at the next time step.

The language model we build here is of this kind: given a word, predict which word is likely to come next.

The input to this RNN is dim-dimensional, where dim equals the vocabulary size. The input vector is 1 only in the component that represents the current word and 0 everywhere else, i.e. [0,0,0,...,0,1,0,...,0]. The output is also a dim-dimensional vector, giving the probability of each word appearing next.
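To make the input encoding concrete, here is a minimal NumPy sketch; the toy vocabulary is invented purely for illustration, and dim plays the role of confVocabSize in the config further below:

import numpy as np

# toy vocabulary, invented for this example; in the real setup dim = vocabulary size (confVocabSize)
vocab = ["上", "工作", "上班", "家乡", "上海"]
dim = len(vocab)

def one_hot(word):
    # dim-dimensional vector that is 1 at the word's index and 0 everywhere else
    x = np.zeros(dim)
    x[vocab.index(word)] = 1.0
    return x

print(one_hot("上"))   # [1. 0. 0. 0. 0.]
# the model's output is likewise dim-dimensional: one probability per candidate next word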

Building an RNN model in CNTK differs from an ordinary neural network in two main ways:

(1) The input format. The input is now text split into sentences, and the words within a sentence are ordered, so the input has to be specified in the LMSequenceReader format. This format is quite a hassle (to grumble once more: I do not fully understand it myself, so I will not explain it in detail; you can work it out from the config below). A sketch of the assumed file layout follows after point (2).

(2) The model: a recurrent model is required, chiefly through the use of the Delay() function.
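For reference, the training text fed to LMSequenceReader is assumed here to be plain tokenized text: one sentence per line, with words separated by spaces (the file name review_tokens_split_... in the config suggests exactly that). The two lines below are invented only to show the shape of the data; sentence boundaries are handled by the beginSequence/endSequence settings in the reader sections:

今天 去 上海 出差
明天 还 要 上班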

A working configuration is shown below (once again the official tutorial tripped me up for a long time; this code is adapted from CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\Config):

# Parameters can be overwritten on the command line
# for example: cntk configFile=myConfigFile RootDir=../..
# For running from Visual Studio add
# currentDirectory=$(SolutionDir)/ 
RootDir = ".."

ConfigDir = "$RootDir$/Config"
DataDir = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir = "$OutputDir$/Models"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available
deviceId = "-1"

command = writeWordAndClassInfo:train
#command = write

precision = "float"
traceLevel = 1
modelPath = "$ModelDir$/rnn.dnn"

# write logs to a file
stderr=$OutputDir$/rnnOutput

type = double
numCPUThreads = 4

confVocabSize = 3000
confClassSize = 50

#trainFile = "ptb.train.txt"
trainFile = "review_tokens_split_first5w_lines.txt"
#validFile = "ptb.valid.txt"
testFile = "review_tokens_split_first10_lines.txt"

writeWordAndClassInfo = [
    action = "writeWordAndClass"
    inputFile = "$DataDir$/$trainFile$"
    outputVocabFile = "$ModelDir$/vocab.txt"
    outputWord2Cls = "$ModelDir$/word2cls.txt"
    outputCls2Index = "$ModelDir$/cls2idx.txt"
    vocabSize = "$confVocabSize$"
    nbrClass = "$confClassSize$"
    cutoff = 1
    printValues = true
]

#######################################
#  TRAINING CONFIG                    #
#######################################

train = [
    action = "train"
    minibatchSize = 10
    traceLevel = 1
    epochSize = 0
    recurrentLayer = 1
    defaultHiddenActivity = 0.1
    useValidation = true
    rnnType = "CLASSLM"

    # use NDL (rnnlm.ndl, listed below) to define and train the RNN LM
    NDLNetworkBuilder = [
        networkDescription = "D:\tools\Deep Learning\CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\AdditionalFiles\RNNLM\rnnlm.ndl"
    ]

    SGD = [
        learningRatesPerSample = 0.1
        momentumPerMB = 0
        gradientClippingWithTruncation = true
        clippingThresholdPerSample = 15.0
        maxEpochs = 6
        unroll = false
        numMBsToShowResult = 100
        gradUpdateType = "none"
        loadBestModel = true

        # settings for Auto Adjust Learning Rate
        AutoAdjust = [
            autoAdjustLR = "adjustAfterEpoch"
            reduceLearnRateIfImproveLessThan = 0.001
            continueReduce = false
            increaseLearnRateIfImproveMoreThan = 1000000000
            learnRateDecreaseFactor = 0.5
            learnRateIncreaseFactor = 1.382
            numMiniBatch4LRSearch = 100
            numPrevLearnRates = 5
            numBestSearchEpoch = 1
        ]

        dropoutRate = 0.0
    ]

    reader = [
        readerType = "LMSequenceReader"
        randomize = "none"
        nbruttsineachrecurrentiter = 16

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType=BinaryReader

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        
        # wsize - inital size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000
        
        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$trainFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]
      
        # sequence break table, list indexes into sequence records, so we know when a sequence starts/stops
        sequence = [
            dim = 1
            wrecords = 2
            # write definition
            sectionType = "data"
        ]
        
        #labels sections
        labelIn = [
            dim = 1
            labelType = "Category"
            beginSequence = ""
            endSequence = ""

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"
            
            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            mapping = [
                # redefine number of records for this section, since we dont need to save it for each data record
                wrecords = 11                
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
        
        # labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"
            
            # Write definition 
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            mapping = [
                # redefine number of records for this section, since we dont need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ] 
]



write = [
    action = "write"

    outputPath = "$OutputDir$/Write"
    #outputPath = "-"                    # "-" will write to stdout; useful for debugging
    outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word" # write out the Out node plus the three weight matrices
    #format = [
        #sequencePrologue = "log P(W)="    # (using this to demonstrate some formatting strings)
        #type = "real"
    #]

    minibatchSize = 1              # process one sentence at a time
    traceLevel = 1
    epochSize = 0

    reader = [
        # reader to use
        readerType = "LMSequenceReader"
        randomize = "none"              # BUGBUG: This is ignored.
        nbruttsineachrecurrentiter = 1  # one sentence per minibatch
        cacheBlockSize = 1              # workaround to disable randomization

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType = "BinaryReader"

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        # wsize - inital size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000
        
        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$testFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]
        
        #labels sections
        labelIn = [
            dim = 1

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"
            
            labelType = "Category"
            beginSequence = ""
            endSequence = ""

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            
            mapping = [
                # redefine number of records for this section, since we dont need to save it for each data record
                wrecords = 11
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
        
        #labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"
            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            
            mapping = [
                # redefine number of records for this section, since we dont need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]
            
            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ]
]    

rnnlm.ndl:

run=ndlCreateNetwork

ndlCreateNetwork=[
    # vocabulary size
    featDim=3000
    # vocabulary size
    labelDim=3000
    # hidden layer size
    hiddenDim=200
    # number of classes
    nbrClass=50
    
    initScale=6
    
    features=SparseInput(featDim, tag="feature")
    
    # the label input for class-based cross entropy is dense and contains 4 values per sample
    labels=Input(4, tag="label")

    # define network
    WFeat2Hid=Parameter(hiddenDim, featDim, init="uniform", initValueScale=initScale)
    WHid2Hid=Parameter(hiddenDim, hiddenDim, init="uniform", initValueScale=initScale)

    # WHid2Word is special that it is hiddenSize X labelSize
    WHid2Word=Parameter(hiddenDim, labelDim, init="uniform", initValueScale=initScale)
    WHid2Class=Parameter(nbrClass, hiddenDim, init="uniform", initValueScale=initScale)
   
    PastHid = Delay(hiddenDim, HidAfterSig, delayTime=1, needGradient=true)    
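    # (PastHid above is the hidden layer's activation delayed by one time step; this Delay node is what makes the network recurrent)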
    HidFromFeat = Times(WFeat2Hid, features)
    HidFromRecur = Times(WHid2Hid, PastHid)
    HidBeforeSig = Plus(HidFromFeat, HidFromRecur)
    HidAfterSig = Sigmoid(HidBeforeSig)
    
    Out = TransposeTimes(WHid2Word, HidAfterSig)  #word part
    
    ClassProbBeforeSoftmax=Times(WHid2Class, HidAfterSig)
    
    cr = ClassBasedCrossEntropyWithSoftmax(labels, HidAfterSig, WHid2Word, ClassProbBeforeSoftmax, tag="criterion")
    EvalNodes=(cr)
    OutputNodes=(cr)
]

As the code shows, CNTK makes you spend a large part of your effort on the data reader.

writeWordAndClassInfo simply collects statistics over the vocabulary and clusters the words. A class-based RNN is used here mainly to speed up computation: the words are first partitioned into a number of disjoint classes. The file this step produces has four columns: word index, frequency, the word itself, and class ID.
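For illustration, the first few lines of vocab.txt might look like this (the counts and class assignments are made up; only the four-column layout of index / frequency / word / class matters):

0   5013   的    0
1   3190   很    0
2   2877   好    1
3   1052   上海  2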
train, naturally, trains the model; with a large amount of text, training is still quite slow.
write is the output step; pay particular attention to this line:
outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word"

I suspect what most people care about is: given a sentence, how do you get the hidden-layer values after running the trained RNN over it? My approach is to save the trained RNN's parameters, and then... then whether you work in Java or Python, you can rebuild an RNN from those parameters and do whatever you like with it.
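For example, a minimal NumPy sketch of that reconstruction could look like the following. It assumes the three weight matrices dumped by the write command have been saved as plain whitespace-separated text files (the file names and the loading step are hypothetical; adapt them to whatever format your CNTK build actually writes), and it simply replays the forward pass defined in rnnlm.ndl:

import numpy as np

# hypothetical file names for the matrices exported by the "write" command,
# assumed converted to whitespace-separated text; shapes follow rnnlm.ndl
WFeat2Hid = np.loadtxt("WFeat2Hid.txt")   # hiddenDim x featDim
WHid2Hid  = np.loadtxt("WHid2Hid.txt")    # hiddenDim x hiddenDim
WHid2Word = np.loadtxt("WHid2Word.txt")   # hiddenDim x labelDim

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_sentence(word_indices, default_activity=0.1):
    # replay the recurrence from rnnlm.ndl:
    #   h_t = sigmoid(WFeat2Hid * x_t + WHid2Hid * h_{t-1})
    h = np.full(WHid2Hid.shape[0], default_activity)   # defaultHiddenActivity in the config
    hidden_states = []
    for idx in word_indices:
        x = np.zeros(WFeat2Hid.shape[1])
        x[idx] = 1.0                                    # one-hot vector for the current word
        h = sigmoid(WFeat2Hid @ x + WHid2Hid @ h)
        hidden_states.append(h)
    word_scores = WHid2Word.T @ h                       # the "Out" node: scores for the next word
    return hidden_states, word_scores

# word_indices are the vocabulary indices (first column of vocab.txt) of the words in the sentence

Once you have the hidden states, you can use them as sentence features for whatever downstream task you have in mind.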

In the train section I used my own model definition via NDLNetworkBuilder. You can also use the generic recurrent model, in which case you only need to set a few parameters, for example:

SimpleNetworkBuilder=[
        trainingCriterion=classcrossentropywithsoftmax
        evalCriterion=classcrossentropywithsoftmax
        nodeType=Sigmoid
        initValueScale=6.0
        layerSizes=10000:200:10000
        addPrior=false
        addDropoutNodes=false
        applyMeanVarNorm=false
        uniformInit=true;

        # these are for the class information for class-based language modeling
        vocabSize=10000
        nbrClass=50
    ]

I define the network myself here mainly because I want to turn it into an LSTM structure later.

Original post; please do not repost without permission.


