Source: https://www.geeksforgeeks.org/expected-sarsa-in-reinforcement-learning/
Prerequisite: SARSA
SARSA and Q-Learning are reinforcement learning techniques that use temporal-difference (TD) updates to improve the agent's behaviour. Expected SARSA is an alternative technique for improving the agent's policy. It is very similar to SARSA and Q-Learning, differing only in the action-value function it follows.
We know that SARSA is an on-policy technique and Q-Learning is an off-policy technique, but Expected SARSA can be used either on-policy or off-policy. This is what makes Expected SARSA more flexible than the other two algorithms.
Let's compare the action-value update rules of the three algorithms and see what is different in Expected SARSA.
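Written in standard temporal-difference notation, the three update rules are:

SARSA:
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]

Q-Learning:
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]

Expected SARSA:
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \right]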
We can see that Expected SARSA takes a weighted sum over all possible next actions, with weights given by the probability of taking each action under the policy. If the target policy is greedy with respect to the action-value estimates, this update reduces to Q-Learning. Otherwise, Expected SARSA is on-policy and computes the expected return over all actions, rather than sampling a single next action as SARSA does.
Keeping the theory and formulas in mind, let's compare the three algorithms with an experiment. We will use the Cliff Walking environment provided by the gym library.
Code: Python code to create the class Agent, which is inherited by the other agents to avoid duplicating the 'choose_action' method.
# Agent.py

import numpy as np


class Agent:
    """
    The Base class that is implemented by
    other classes to avoid the duplicate 'choose_action'
    method
    """
    def choose_action(self, state):
        # Epsilon-greedy action selection: explore with probability
        # epsilon, otherwise pick the greedy action for this state.
        action = 0
        if np.random.uniform(0, 1) < self.epsilon:
            action = self.action_space.sample()
        else:
            action = np.argmax(self.Q[state, :])
        return action
Code: Python code to create the SARSA agent.
# SarsaAgent.py

import numpy as np
from Agent import Agent


class SarsaAgent(Agent):
    """
    The Agent that uses the SARSA update to improve its behaviour
    """
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: To call the random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions
        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space

    def update(self, prev_state, next_state, reward, prev_action, next_action):
        """
        Update the action value function using the SARSA update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * Q(S_, A_) - Q(S, A))
        Args:
            prev_state: The previous state
            next_state: The next state
            reward: The reward for taking the respective action
            prev_action: The previous action
            next_action: The next action
        Returns:
            None
        """
        predict = self.Q[prev_state, prev_action]
        target = reward + self.gamma * self.Q[next_state, next_action]
        self.Q[prev_state, prev_action] += self.alpha * (target - predict)
Code: Python code to create the Q-Learning agent.
# QLearningAgent.py

import numpy as np
from Agent import Agent


class QLearningAgent(Agent):
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: To call the random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions
        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space

    def update(self, state, state2, reward, action, action2):
        """
        Update the action value function using the Q-Learning update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * max(Q(S_, :)) - Q(S, A))
        Args:
            state: The previous state
            state2: The next state
            reward: The reward for taking the respective action
            action: The previous action
            action2: The next action (kept for a uniform interface;
                not used by the Q-Learning update)
        Returns:
            None
        """
        predict = self.Q[state, action]
        target = reward + self.gamma * np.max(self.Q[state2, :])
        self.Q[state, action] += self.alpha * (target - predict)
Code: Python code to create the Expected SARSA agent. In this experiment, we use the formula below to compute the policy probabilities.
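The behaviour policy here is epsilon-greedy, so the probability of each action in the next state is:

\pi(a \mid S_{t+1}) =
\begin{cases}
\dfrac{\epsilon}{|A|} + \dfrac{1 - \epsilon}{|G(S_{t+1})|} & \text{if } a \in G(S_{t+1}) \\[2mm]
\dfrac{\epsilon}{|A|} & \text{otherwise}
\end{cases}

where |A| is the number of actions and G(S_{t+1}) is the set of actions that tie for the maximum Q-value in the next state; the code below counts those ties so that the greedy probability mass is shared equally among them.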
# ExpectedSarsaAgent.py

import numpy as np
from Agent import Agent


class ExpectedSarsaAgent(Agent):
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: To call the random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions
        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space

    def update(self, prev_state, next_state, reward, prev_action, next_action):
        """
        Update the action value function using the Expected SARSA update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * sum_a pi(a|S_) * Q(S_, a) - Q(S, A))
        Args:
            prev_state: The previous state
            next_state: The next state
            reward: The reward for taking the respective action
            prev_action: The previous action
            next_action: The next action (kept for a uniform interface;
                not used by the Expected SARSA update)
        Returns:
            None
        """
        predict = self.Q[prev_state, prev_action]

        # Compute the expected Q-value of the next state under the
        # epsilon-greedy behaviour policy.
        expected_q = 0
        q_max = np.max(self.Q[next_state, :])

        # Count how many actions tie for the maximum value.
        greedy_actions = 0
        for i in range(self.num_actions):
            if self.Q[next_state][i] == q_max:
                greedy_actions += 1

        # Probability of each non-greedy action and of each greedy action.
        non_greedy_action_probability = self.epsilon / self.num_actions
        greedy_action_probability = ((1 - self.epsilon) / greedy_actions) + non_greedy_action_probability

        # Weight every action value by its selection probability.
        for i in range(self.num_actions):
            if self.Q[next_state][i] == q_max:
                expected_q += self.Q[next_state][i] * greedy_action_probability
            else:
                expected_q += self.Q[next_state][i] * non_greedy_action_probability

        target = reward + self.gamma * expected_q
        self.Q[prev_state, prev_action] += self.alpha * (target - predict)
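The expectation in the update above can also be computed in a vectorized way. Below is a minimal sketch under the same Q table and epsilon-greedy policy; expected_q_value is a hypothetical helper, not part of the agents above.

import numpy as np

def expected_q_value(Q, next_state, epsilon, num_actions):
    # Hypothetical helper: vectorized version of the expectation
    # computed in ExpectedSarsaAgent.update.
    q_next = Q[next_state, :]
    greedy = q_next == np.max(q_next)              # mask of (possibly tied) greedy actions
    probs = np.full(num_actions, epsilon / num_actions)
    probs[greedy] += (1 - epsilon) / greedy.sum()  # share the greedy mass over ties
    return np.dot(probs, q_next)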
Code: Python code to create the environment and test all three algorithms.
# main.py

import gym
import numpy as np
from ExpectedSarsaAgent import ExpectedSarsaAgent
from QLearningAgent import QLearningAgent
from SarsaAgent import SarsaAgent
from matplotlib import pyplot as plt

# Using the gym library to create the environment
env = gym.make('CliffWalking-v0')

# Defining all the required parameters
epsilon = 0.1
total_episodes = 500
max_steps = 100
alpha = 0.5
gamma = 1

"""
The two parameters below are used to track
the reward obtained by each algorithm
"""
episodeReward = 0
totalReward = {
    'SarsaAgent': [],
    'QLearningAgent': [],
    'ExpectedSarsaAgent': []
}

# Defining all the three agents
expectedSarsaAgent = ExpectedSarsaAgent(
    epsilon, alpha, gamma, env.observation_space.n,
    env.action_space.n, env.action_space)
qLearningAgent = QLearningAgent(
    epsilon, alpha, gamma, env.observation_space.n,
    env.action_space.n, env.action_space)
sarsaAgent = SarsaAgent(
    epsilon, alpha, gamma, env.observation_space.n,
    env.action_space.n, env.action_space)

# Now we run all the episodes and calculate the reward obtained by
# each agent at the end of the episode
agents = [expectedSarsaAgent, qLearningAgent, sarsaAgent]

for agent in agents:
    for _ in range(total_episodes):
        # Initialize the necessary parameters before
        # the start of the episode
        t = 0
        state1 = env.reset()
        action1 = agent.choose_action(state1)
        episodeReward = 0
        while t < max_steps:
            # Getting the next state, reward, and other parameters
            state2, reward, done, info = env.step(action1)

            # Choosing the next action
            action2 = agent.choose_action(state2)

            # Learning the Q-value
            agent.update(state1, state2, reward, action1, action2)

            state1 = state2
            action1 = action2

            # Updating the respective values
            t += 1
            episodeReward += reward

            # If at the end of learning process
            if done:
                break
        # Append the sum of reward at the end of the episode
        totalReward[type(agent).__name__].append(episodeReward)

env.close()

# Calculate the mean of sum of returns for each episode
meanReturn = {
    'SARSA-Agent': np.mean(totalReward['SarsaAgent']),
    'Q-Learning-Agent': np.mean(totalReward['QLearningAgent']),
    'Expected-SARSA-Agent': np.mean(totalReward['ExpectedSarsaAgent'])
}

# Print the results
print(f"SARSA Average Sum of Return: {meanReturn['SARSA-Agent']}")
print(f"Q-Learning Average Sum of Return: {meanReturn['Q-Learning-Agent']}")
print(f"Expected Sarsa Average Sum of Return: {meanReturn['Expected-SARSA-Agent']}")
Output:
Conclusion:
We have seen that Expected SARSA performs quite well on certain problems. It considers all possible outcomes before selecting a particular action. The fact that Expected SARSA can be used either off-policy or on-policy is what makes this algorithm so flexible.