Learning Notes: Morvan - Reinforcement Learning, Part 2: Q-learning


Q-learning

  • 2.1 A small example
  • 2.2 Q-learning algorithm update
  • 2.3 Q-learning decision-making process

Auxiliary Material

  1. A Painless Q-Learning Tutorial

  2. Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning

  3. 6.5 Q-Learning: Off-Policy TD Control (Sutton and Barto's Reinforcement Learning ebook)

Note

  1. tabular

    Flat, table-like: every Q(s, a) value is stored explicitly in a lookup table.

  2. Q-learning by Morvan

    Q-learning is a method of recording action values (Q values): every action taken in a given state has a value Q(s, a), that is, the value of action a in state s is Q(s, a).

    In the explorer game from the tutorial, s is simply the position of 'o', and in every position the explorer can take two actions, left/right; these are all of the explorer's available actions a. (A minimal sketch of such a Q-table is given after these notes.)

  3. Q-learning by Wikipedia

    Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.

    When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations.

  4. Pseudocode (the tabular algorithm is sketched in code after these notes)

  5. The transition rule of Q-learning is a very simple formula (source):

    Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

  6. epsilon greedy

    EPSILON is the value that controls how greedy the agent is. EPSILON can be raised as exploration goes on (the agent becomes greedier and greedier). (See the epsilon-greedy sketch after these notes.)

  7. Why is Q-learning considered an off-policy control method? (Exercise 6.9 of Sutton and Barto's book)

    If the algorithm estimates the value function of the policy generating the data, the method is called on-policy. Otherwise it is called off-policy.

    If the samples used in the TD update are not generated according to your behavior policy (the policy that the agent is following), then it is called off-policy learning -- you can also say learning from off-policy data. (source)

    Q-learning is an off-policy algorithm, because the max over next-state actions means the Q-table update does not have to be based on the experience currently being gathered: it can learn from experience collected long ago, or even from someone else's experience. (The targets are illustrated after these notes.)
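
To make notes 2, 4 and 5 concrete, here is a minimal sketch in Python (numpy/pandas, the libraries Morvan's tutorials use) of a tabular Q-table for the explorer game and the update that fills it in. The names N_STATES, ACTIONS, build_q_table and q_update, and the constants GAMMA and ALPHA, are illustrative assumptions rather than code from the tutorial; note 5's simplified rule is the special case of the full update with learning rate ALPHA = 1.

```python
import numpy as np
import pandas as pd

N_STATES = 6                  # positions the explorer 'o' can occupy (assumed size)
ACTIONS = ['left', 'right']   # the only actions available in each state (note 2)
GAMMA = 0.9                   # discount factor, "Gamma" in note 5
ALPHA = 0.1                   # learning rate; note 5's simplified rule corresponds to ALPHA = 1

def build_q_table(n_states, actions):
    # One row per state s, one column per action a; each cell stores Q(s, a).
    return pd.DataFrame(np.zeros((n_states, len(actions))), columns=actions)

def q_update(q_table, s, a, r, s_next, terminal=False):
    # Tabular Q-learning update (Sutton and Barto, Sec. 6.5):
    #   Q(s, a) <- Q(s, a) + ALPHA * (r + GAMMA * max_a' Q(s', a') - Q(s, a))
    # If s' is terminal there is no next-state value to bootstrap from.
    q_target = r if terminal else r + GAMMA * q_table.iloc[s_next].max()
    q_table.loc[s, a] += ALPHA * (q_target - q_table.loc[s, a])

q_table = build_q_table(N_STATES, ACTIONS)
q_update(q_table, s=4, a='right', r=1.0, s_next=5, terminal=True)  # e.g. a rewarding final move
print(q_table)
```

Each call to q_update nudges one cell of the table toward the one-step target; looping this over many steps and episodes is what the pseudocode referred to in note 4 does.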
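For note 6, a sketch of epsilon-greedy action selection over the same kind of pandas Q-table. EPSILON = 0.9 and the function name choose_action are assumptions; treating an all-zero row as a signal to explore is a common convention early in training, when the table carries no information yet.

```python
import numpy as np
import pandas as pd

ACTIONS = ['left', 'right']
EPSILON = 0.9   # probability of acting greedily; note 6 says it can be raised over time

def choose_action(state, q_table):
    # With probability EPSILON pick the action with the highest Q(state, a) (exploit);
    # otherwise, or while the row is still all zeros, pick a random action (explore).
    state_actions = q_table.iloc[state, :]
    if np.random.uniform() > EPSILON or (state_actions == 0).all():
        return np.random.choice(ACTIONS)
    return state_actions.idxmax()

# Tiny demo: 2 states x 2 actions.
q_table = pd.DataFrame([[0.0, 0.5], [0.1, 0.0]], columns=ACTIONS)
print(choose_action(0, q_table))   # usually 'right', occasionally a random exploratory action
```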
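Note 7 becomes clearer when the two TD targets are written side by side. The Q-learning target below uses the max over next-state actions and never looks at the action the behavior policy actually takes next, which is why the (s, a, r, s') transitions it learns from can be old or even come from another agent; an on-policy method (SARSA in Sutton and Barto's terminology) instead bootstraps from the action actually chosen. This is an illustrative snippet, not code from the tutorial.

```python
GAMMA = 0.9

def q_learning_target(r, q_next_row):
    # Off-policy target: bootstrap from the greedy value max_a' Q(s', a'),
    # regardless of which action the behavior policy will actually take in s'.
    return r + GAMMA * max(q_next_row.values())

def on_policy_target(r, q_next_row, a_next):
    # On-policy (SARSA-style) target: bootstrap from the action a_next
    # that the behavior policy actually chose in s'.
    return r + GAMMA * q_next_row[a_next]

q_next_row = {'left': 0.0, 'right': 0.7}   # Q(s', .) for some next state s'
print(q_learning_target(0.0, q_next_row))                 # 0.63, uses the max
print(on_policy_target(0.0, q_next_row, a_next='left'))   # 0.0, uses the action actually taken
```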

 

Reposted from: https://www.cnblogs.com/casperwin/p/6305351.html

