Multi-armed Bandits[1]
The simplest model with the simplest method.
Overview
You are faced repeatedly with a choice among $k$ different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period.
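To make the setup concrete, here is a minimal sketch of a stationary $k$-armed bandit testbed, with Gaussian true values and rewards in the style of the testbed in [1]. The class and variable names (`Bandit`, `q_star`, `pull`) are my own, not from the book.

```python
import numpy as np

class Bandit:
    """A stationary k-armed bandit testbed in the style of [1]."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        # True action values q*(a): drawn once, then fixed (stationary).
        self.q_star = self.rng.normal(loc=0.0, scale=1.0, size=k)
        self.k = k

    def pull(self, action):
        # The reward is noisy but centred on the chosen action's true value.
        return self.rng.normal(loc=self.q_star[action], scale=1.0)
```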
Policy
- Estimate each action's expected reward (its action value).
- Initially, choose actions randomly.
- Once action $a$ performs better than the others, choose it most of the time, but keep choosing actions other than $a$ with probability $\epsilon$ (the $\epsilon$-greedy policy), as sketched in the code after this list.
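A minimal sketch of that $\epsilon$-greedy policy, reusing the `Bandit` class from the previous sketch; the function name `epsilon_greedy` and the sample-average value estimate are my own choices, not a fixed implementation.

```python
import numpy as np

def epsilon_greedy(bandit, epsilon=0.1, steps=1000, seed=None):
    """Run one epsilon-greedy agent on `bandit` and return its rewards."""
    rng = np.random.default_rng(seed)
    q_est = np.zeros(bandit.k)    # estimated action values Q(a)
    counts = np.zeros(bandit.k)   # how many times each action was chosen
    rewards = np.zeros(steps)

    for t in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(bandit.k))   # explore
        else:
            action = int(np.argmax(q_est))         # exploit the current best
        reward = bandit.pull(action)
        counts[action] += 1
        # Incremental sample-average update: Q <- Q + (R - Q) / N
        q_est[action] += (reward - q_est[action]) / counts[action]
        rewards[t] = reward
    return rewards
```

Averaging the returned rewards over many random bandits for several values of $\epsilon$ gives the kind of comparison plotted in the Result section below.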
Result
- Action’s Expected Reward Distribution
- Actual Performance with different $\epsilon$
Background (Nature)
- The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.[1]
- So, it is not hard to understand that the whole learning process is simply keeping trying.
Key Words
- Explore
- Exploit
By the way, while writing this passage a question came to mind: why doesn't the RL agent keep exploring? My own explanation is that the action with the higher estimated reward is more promising to exploit; more importantly, with $\epsilon > 0$ the agent does keep trying the other actions.
Research Hotspot (Safe RL)
The conclusions here are mainly drawn from the 2015 survey[2]. (The survey was published in an A-ranked venue.)
Abstract
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric.
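As my own illustration of the first approach (in the spirit of the variance-related criteria in [4], not a formula taken from the survey), a risk-sensitive optimality criterion can penalize the variance of the return $R$:

$$\max_{\pi}\; \mathbb{E}_{\pi}[R] \;-\; \beta\,\mathrm{Var}_{\pi}[R], \qquad \beta > 0,$$

where $\beta$ trades expected performance against risk; $\beta = 0$ recovers the classic criterion.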
External Knowledge/Guidance
The papers below have been read and summarized by myself.
- Expert demonstrations for controlling helicopters.[3]
- Devising a framework for controlling the reward variance (economics-related).[4]
- Student/teacher learning by comparing the confidence of the two agents.[5]
- Evolutionary reinforcement learning method.[6]
- Dynamic wave expansion neural network. (All Chinese)[7]
- Hardcore mathematical derivation…[8]
My Opinion
Restrictions
- Experiments depend on hardware
- Hardcore mathematical derivation
- Cross-domain knowledge (economics)
Naive Thought
When people learn to drive cars, they never hit the wall anyway.
Reward
One of the basic elements of RL that is left up to the human designer.
- Safety related
- Convenient to improve
- Novel
Proposition: maybe a well-designed reward function can make learning converge more quickly (one hedged way to encode this is sketched below).
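One concrete way to phrase this proposition, which is my own illustration rather than anything from the survey, is potential-based reward shaping: add $F(s, s') = \gamma\,\Phi(s') - \Phi(s)$ to the environment reward, where the potential $\Phi$ encodes prior knowledge (e.g. distance to the goal, or a penalty near unsafe states). The function below is a hypothetical sketch of that idea.

```python
def shaped_reward(reward, phi_s, phi_s_next, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s).

    phi(s) encodes prior knowledge, e.g. negative distance to the goal
    or a large negative potential near unsafe states. Shaping of this
    form leaves the optimal policy unchanged but can speed up learning.
    """
    return reward + gamma * phi_s_next - phi_s
```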
Problem
- Theoretically wrong
- Theoretically right but with no feasible experiments
- Theoretically right and supported by several experiments
- Theoretically right but impossible to prove correct
Plan
- Narrow down the range of papers and continue reading.
- Continue running experiments based on the open-source framework.
- Get involved in other research areas. (LOL)
Reference
- Sutton R S, Barto A G. Reinforcement Learning: An Introduction[M]. Second edition. Cambridge, MA: MIT Press, 2018.
- García J, Fernández F. A comprehensive survey on safe reinforcement learning[J]. Journal of Machine Learning Research, 2015, 16(1): 1437-1480.
- Tang J, Singh A, Goehausen N, et al. Parameterized maneuver learning for autonomous helicopter flight[C]//2010 IEEE International Conference on Robotics and Automation. IEEE, 2010: 1142-1148.
- Di Castro D, Tamar A, Mannor S. Policy gradients with variance related risk criteria[J]. arXiv preprint arXiv:1206.6404, 2012.
- Torrey L, Taylor M E. Help an agent out: Student/teacher learning in sequential decision tasks[C]//Proceedings of the Adaptive and Learning Agents workshop (at AAMAS-12). 2012.
- de Lope J. Learning autonomous helicopter flight with evolutionary reinforcement learning[C]//International Conference on Computer Aided Systems Theory. Springer, Berlin, Heidelberg, 2009: 75-82.
- Song Y, Li Y, Li C, et al. An efficient initialization approach of Q-learning for mobile robots[J]. International Journal of Control, Automation and Systems, 2012, 10(1): 166-172.
- Tamar A, Xu H, Mannor S. Scaling up robust MDPs by reinforcement learning[J]. arXiv preprint arXiv:1306.6189, 2013.