Research Topic
Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms.
Two terms are worth noting here: goal-directed behavior and sparse feedback.
This paper proposes hierarchical-DQN (h-DQN), a framework that integrates hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning.
The model makes decisions over two levels of hierarchy:
- the top-level module (meta-controller) takes in the state and picks a new goal
- the lower-level module (controller) uses both the state and the chosen goal to select actions, either until the goal is reached or the episode terminates.
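A minimal sketch of this two-level decision loop, assuming hypothetical `env`, `meta_controller`, `controller`, and `critic` objects (the names are placeholders, not from the paper's code):

```python
# Minimal sketch of the two-level decision loop; all objects are hypothetical placeholders.
def run_episode(env, meta_controller, controller, critic):
    state = env.reset()
    done = False
    total_extrinsic_reward = 0.0
    while not done:
        # Top level: the meta-controller looks at the state and picks a goal.
        goal = meta_controller.select_goal(state)
        goal_reached = False
        while not (done or goal_reached):
            # Lower level: the controller acts on (state, goal) until the goal
            # is reached or the episode terminates.
            action = controller.select_action(state, goal)
            state, extrinsic_reward, done = env.step(action)
            total_extrinsic_reward += extrinsic_reward
            goal_reached = critic.goal_reached(state, goal)
    return total_extrinsic_reward
```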
In their work, they propose a scheme for temporal abstraction that involves simultaneously learning options and a control policy to compose options in a deep reinforcement learning setting.
It is worth explaining intrinsic motivation and extrinsic motivation here; both are originally terms from psychology:
- intrinsic motivation
People who rely on an internal frame of evaluation care little about how others judge them; their drive to act comes from within. Intrinsic motivation provides a natural force that promotes learning and development, and it can trigger behavior even in the absence of external rewards or pressure.
- extrinsic motivation
People who rely on an external frame of evaluation care a great deal about how others judge them, sometimes internalizing those judgments as their own self-image. When doing something, they first consider how others will see it; their drive usually comes from seeking recognition, money, and other external rewards.
Most current reinforcement learning research on agents focuses on extrinsic motivation: external reinforcement is generally regarded as a necessary condition for eliciting it. Under reinforcement, the agent forms an expectation of the next reinforcement and thus makes obtaining external reinforcement the goal of its behavior.
Model
Agents
Existing exploration methods (e.g. $\epsilon$-greedy) are useful only for local exploration, but they fail to provide the impetus for the agent to explore different areas of the state space.
To address this problem, the paper introduces an important concept: goals.
Goals provide intrinsic motivation for the agent. The agent focuses on setting and achieving sequences of goals in order to maximize cumulative extrinsic reward.
The temporal abstraction of options is used to define a policy $\pi_{g}$ for each goal $g$.
In essence, the paper has two objectives:
- learning the option policies
- learning the optimal sequence of goals to follow
Temporal Abstraction
The overall scheme is illustrated in the paper's architecture figure (not reproduced here).
The role of the critic:
The internal critic is responsible for evaluating whether a goal has been reached and for providing an appropriate reward $r_{t}(g)$ to the controller.
The intrinsic reward functions are dynamic and temporally dependent on the sequential history of goals.
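As a simplified illustration, the critic can be sketched as producing a binary intrinsic reward, 1 when the chosen goal is reached and 0 otherwise; the `goal_test` predicate below is a placeholder for whatever goal-detection logic the environment admits:

```python
# Sketch of an internal critic with a binary intrinsic reward; `goal_test` is a placeholder.
class InternalCritic:
    def __init__(self, goal_test):
        # goal_test: callable (state, goal) -> bool
        self.goal_test = goal_test

    def goal_reached(self, state, goal):
        return self.goal_test(state, goal)

    def intrinsic_reward(self, state, goal):
        # Reward the controller only when the current goal has been achieved.
        return 1.0 if self.goal_test(state, goal) else 0.0
```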
Deep Reinforcement Learning with Temporal Abstraction
The paper uses the deep Q-learning framework to learn policies for both the controller and the meta-controller:
- the controller estimates a Q-value function $Q_{1}(s, a; g)$ over primitive actions, conditioned on the current goal
- the meta-controller estimates a Q-value function $Q_{2}(s, g)$ over goals
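Roughly, following the paper's formulation (my transcription, so notation may differ in minor details): the controller maximizes the intrinsic reward $r$ provided by the critic, while the meta-controller maximizes the extrinsic reward $f$ received from the environment, with $N$ the number of steps the controller runs before the current goal terminates.

$$Q_{1}^{*}(s, a; g) = \mathbb{E}\left[\, r_{t} + \gamma \max_{a_{t+1}} Q_{1}^{*}(s_{t+1}, a_{t+1}; g) \;\middle|\; s_{t}=s,\ a_{t}=a,\ g_{t}=g \,\right]$$

$$Q_{2}^{*}(s, g) = \mathbb{E}\left[\, \sum_{t'=t}^{t+N} f_{t'} + \gamma \max_{g'} Q_{2}^{*}(s_{t+N}, g') \;\middle|\; s_{t}=s,\ g_{t}=g \,\right]$$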
It is important to note that the transitions $(s_{t}, g_{t}, f_{t}, s_{t+N})$ generated by $Q_{2}$ run at a slower time-scale than the transitions $(s_{t}, a_{t}, g_{t}, r_{t}, s_{t+1})$ generated by $Q_{1}$.
Learning Algorithm
Parameters of h-DQN are learned using stochastic gradient descent at different time-scales.
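A minimal sketch of how this can be realized, extending the decision loop above with one replay buffer per level; the buffer sizes, the `sgd_step` helpers, and the exact update schedule are assumptions rather than the paper's code:

```python
from collections import deque

# One replay buffer per level of the hierarchy; sizes are illustrative.
D1 = deque(maxlen=100_000)  # controller transitions:      (s, a, g, r, s_next, terminal)
D2 = deque(maxlen=50_000)   # meta-controller transitions: (s, g, F, s_next, terminal)

def train_episode(env, meta_controller, controller, critic):
    state = env.reset()
    done = False
    while not done:
        goal = meta_controller.select_goal(state)          # epsilon_2-greedy over goals
        goal_state, F = state, 0.0                         # F accumulates extrinsic reward
        goal_reached = False
        while not (done or goal_reached):
            action = controller.select_action(state, goal) # epsilon_1-greedy over actions
            next_state, f, done = env.step(action)
            r = critic.intrinsic_reward(next_state, goal)
            goal_reached = critic.goal_reached(next_state, goal)
            D1.append((state, action, goal, r, next_state, done or goal_reached))
            controller.sgd_step(D1)                        # fast time-scale: every env step
            F += f
            state = next_state
        D2.append((goal_state, goal, F, state, done))
        meta_controller.sgd_step(D2)                       # slow time-scale: once per goal
```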
Experiments
ATARI game with delayed rewards
Model Architecture
The internal critic is defined in the space of <entity1, relation, entity2>, where relation is a function over configurations of the entities.
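For instance, goals in this space might look like the tuples below; the object names and the bounding-box based "reaches" check are illustrative assumptions, not the paper's exact specification:

```python
# Illustrative goal tuples in the <entity1, relation, entity2> space.
GOALS = [
    ("agent", "reaches", "key"),
    ("agent", "reaches", "right-door"),
    ("agent", "reaches", "middle-ladder"),
]

def reaches(state, entity1, entity2):
    """Hypothetical 'reaches' relation: true when the entities' bounding boxes overlap.
    `state` is assumed to map entity names to (x, y, w, h) boxes."""
    x1, y1, w1, h1 = state[entity1]
    x2, y2, w2, h2 = state[entity2]
    return x1 < x2 + w2 and x2 < x1 + w1 and y1 < y2 + h2 and y2 < y1 + h1
```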
Training Procedure
- First Phase
Set the exploration parameter $\epsilon_{2}$ of the meta-controller to 1 and train the controller on actions. This effectively pre-trains the controller so that it learns to solve a subset of the goals.
- Second Phase
Jointly train the controller and the meta-controller.
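A simplified sketch of such a two-phase schedule (the linear anneal, the phase lengths, and the 0.1 floor are assumptions, not the paper's exact values):

```python
# Two-phase exploration schedule; constants are illustrative.
def exploration_params(step, phase_one_steps, anneal_steps, eps_floor=0.1):
    """Return (eps1, eps2) for the controller and meta-controller at training step `step`."""
    def anneal(t):
        return max(eps_floor, 1.0 - (1.0 - eps_floor) * t / anneal_steps)

    if step < phase_one_steps:
        # Phase 1: the meta-controller is fully random (eps2 = 1), so the controller
        # is trained on a broad mix of goals while its own exploration is annealed.
        return anneal(step), 1.0
    # Phase 2: joint training; the meta-controller's exploration is now annealed as well,
    # while the controller keeps a small amount of exploration.
    return eps_floor, anneal(step - phase_one_steps)
```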