Asynchronous Methods for Deep Reinforcement Learning
Reinforcement Learning Background
Environment \(E\), time step \(t\), state \(s_t\), action \(a_t\), possible actions \(A\), policy \(\pi\), scalar reward \(r_t\).
Return \(R_t = \sum_{k=0}^\infty \gamma^k r_{t+k}, \gamma \in (0, 1]\).
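As a quick worked example, a minimal Python sketch (not from the paper; the rewards and \(\gamma\) are made up) of computing this discounted return over a finite rollout:

```python
# Minimal sketch: discounted return R_t = sum_k gamma^k * r_{t+k}
# over a finite (truncated) list of rewards.
def discounted_return(rewards, gamma=0.99):
    R = 0.0
    for r in reversed(rewards):   # work backwards so gamma compounds correctly
        R = r + gamma * R
    return R

print(discounted_return([0.0, 0.0, 1.0], gamma=0.99))  # 0.99**2 = 0.9801
```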
Action value \(Q^\pi(s, a) = \mathbb{E}[R_t | s_t = s, a_t = a]\) (expected return for selecting action \(a\) in state \(s\) and then following policy \(\pi\)).
Optimal action value function \(Q^*(s, a) = \max_\pi Q^\pi(s, a)\).
State value \(V^\pi(s) = \mathbb{E}[R_t | s_t = s] = \sum_{a \in A} \pi(a | s) \, Q^\pi(s, a)\).
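For concreteness, a tiny sketch of this identity with made-up numbers (the policy and Q-values below are purely illustrative):

```python
import numpy as np

pi_s = np.array([0.2, 0.5, 0.3])   # pi(a|s) over three actions (illustrative)
q_s  = np.array([1.0, 0.4, -0.2])  # Q^pi(s, a) for the same actions
v_s  = pi_s @ q_s                  # V^pi(s) = sum_a pi(a|s) Q^pi(s, a)
print(v_s)                         # 0.34
```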
value-based method (Q-learning)
\(Q(s, a ; \theta) \approx Q^*(s, a)\).
The \(i\)th loss function:
$$ L_i(\theta_i) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a' ; \theta_{i-1}) - Q(s, a ; \theta_i) \right)^2 \right] $$
where \(s'\) is the state encountered after state \(s\).
One-step Q-learning: the update moves \(Q(s, a)\) toward the one-step return \(r + \gamma \max_{a'} Q(s', a' ; \theta_{i-1})\).
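A minimal sketch of this per-sample loss for a single transition \((s, a, r, s')\); the tabular arrays `q` and `q_old` stand in for the networks with parameters \(\theta_i\) and \(\theta_{i-1}\), and all names here are illustrative rather than the paper's code:

```python
import numpy as np

def one_step_q_loss(q, q_old, s, a, r, s_next, gamma=0.99, terminal=False):
    # Target uses the older parameters theta_{i-1} (here: q_old).
    target = r if terminal else r + gamma * np.max(q_old[s_next])
    td_error = target - q[s, a]   # r + gamma*max_a' Q(s',a'; theta_{i-1}) - Q(s,a; theta_i)
    return td_error ** 2          # squared TD error, the per-sample loss

q = np.zeros((5, 2))              # hypothetical 5 states, 2 actions
print(one_step_q_loss(q, q.copy(), s=0, a=1, r=1.0, s_next=3))
```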
policy-based methods
\(\pi(a | s ; \theta)\).
Update the parameters \(\theta\) by performing (typically approximate) gradient ascent on \(\mathbb{E}[R_t]\).
REINFORCE:
- Updates the policy parameters \(\theta\) in the direction \(\nabla_\theta \log \pi(a_t | s_t ; \theta) R_t\) which is an unbiased estimate of \(\nabla_\theta \mathbb{E} [R_t]\).
- The variance of this estimate can be reduced, while keeping it unbiased, by subtracting a learned function of the state \(b_t(s_t)\), known as a baseline (Williams, 1992), from the return. Resulting gradient: \(\nabla_\theta \log \pi(a_t | s_t ; \theta) (R_t - b_t(s_t))\) (sketched below).
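A small sketch of this update for a linear-softmax policy (the features, learning rate, and shapes are made up; the point is only the \(\nabla_\theta \log \pi \cdot (R_t - b_t)\) structure):

```python
import numpy as np

def softmax_policy(theta, phi_s):
    logits = theta @ phi_s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_grad(theta, phi_s, a, R_t, baseline=0.0):
    pi = softmax_policy(theta, phi_s)
    one_hot = np.zeros_like(pi)
    one_hot[a] = 1.0
    # For a linear-softmax policy, grad_theta log pi(a|s) = (one_hot(a) - pi) outer phi(s).
    grad_log_pi = np.outer(one_hot - pi, phi_s)
    return grad_log_pi * (R_t - baseline)   # subtracting the baseline keeps it unbiased

theta = np.zeros((3, 4))                    # 3 actions, 4 features (hypothetical)
theta += 0.01 * reinforce_grad(theta, np.ones(4), a=1, R_t=2.0, baseline=0.5)  # ascent on E[R_t]
```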
Actor-critic:
- A learned estimate of the value function is commonly used as the baseline \(b_t(s_t) \approx V^\pi(s_t)\) leading to a much lower variance estimate of the policy gradient.
- The quantity \(R_t - b_t\) used to scale the policy gradient can be seen as an estimate of the advantage of action \(a_t\) in state \(s_t\), i.e. \(A(a_t, s_t) = Q(a_t, s_t) - V(s_t)\) (sketched below).
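In code form, the scaling step looks like the following sketch (the value estimate is a placeholder for \(V(s_t; \theta_v)\), and `grad_log_pi` could come from the REINFORCE sketch above):

```python
import numpy as np

def advantage_scaled_grad(grad_log_pi, R_t, value_est):
    advantage = R_t - value_est        # A(a_t, s_t) ~= Q(a_t, s_t) - V(s_t)
    return grad_log_pi * advantage     # lower-variance policy-gradient estimate

g = advantage_scaled_grad(np.ones((3, 4)), R_t=1.5, value_est=1.2)
```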
Asynchronous RL Framework
First, asynchronous actor-learners: multiple CPU threads on a single machine rather than separate machines and a parameter server.
Second, observe that multiple actor-learners running in parallel are likely to be exploring different parts of the environment; using a different exploration policy in each actor-learner maximizes this diversity (see the threading sketch after the benefits below).
Benefits:
- Stabilizing learning: By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates.
- Reduction in training time that is roughly linear in the number of parallel actor-learners.
- Able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way.
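A minimal threading sketch of the overall scheme, assuming lock-free (Hogwild!-style) updates to a shared parameter vector; the environment and gradient computation below are placeholders, not the paper's implementation:

```python
import threading
import numpy as np

def make_env(seed):
    # Stand-in for a real environment instance owned by one thread.
    return np.random.default_rng(seed)

def compute_gradient(theta, env, epsilon):
    # Stand-in for an online RL gradient (Q-learning, Sarsa, or actor-critic).
    return epsilon * env.standard_normal(theta.shape)

shared_theta = np.zeros(1000)            # parameters shared by all actor-learners

def actor_learner(worker_id, epsilon, num_steps=1000, theta=shared_theta):
    env = make_env(worker_id)            # each thread explores its own environment copy
    for _ in range(num_steps):
        grad = compute_gradient(theta, env, epsilon)
        theta -= 0.01 * grad             # in-place, lock-free update of the shared parameters

# Different exploration rates per thread to diversify the collected experience.
epsilons = [0.5, 0.3, 0.1, 0.01]         # illustrative values
threads = [threading.Thread(target=actor_learner, args=(i, eps))
           for i, eps in enumerate(epsilons)]
for t in threads: t.start()
for t in threads: t.join()
```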
Asynchronous one-step Q-learning
Asynchronous one-step Sarsa
Same as asynchronous one-step Q-learning except that it uses a different target value for \(Q(s, a)\): \(r + \gamma Q(s', a' ; \theta^-)\), where \(a'\) is the action taken in state \(s'\).
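The two one-step targets side by side, as a short sketch (`q_target` stands in for the target-network values \(Q(\cdot, \cdot ; \theta^-)\); indices are illustrative):

```python
import numpy as np

def q_learning_target(q_target, r, s_next, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in s'.
    return r + gamma * np.max(q_target[s_next])

def sarsa_target(q_target, r, s_next, a_next, gamma=0.99):
    # On-policy: bootstrap from the action a' actually taken in s'.
    return r + gamma * q_target[s_next, a_next]
```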
Asynchronous n-step Q-learning
Asynchronous advantage actor-critic (A3C)
A3C maintains a policy \(\pi(a_t|s_t; \theta)\) and an estimate of the value function \(V(s_t; \theta_v)\).
Like the variant of n-step Q-learning, this variant of actor-critic also operates in the forward view and uses the same mix of n-step returns to update both the policy and the value function.
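A compact sketch of the forward-view n-step returns and the targets they produce for the policy and value updates; the rollout length, rewards, and bootstrap value are made up, and this is not the paper's implementation:

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    # bootstrap_value is V(s_{t_max}; theta_v), or 0 if the episode ended.
    R = bootstrap_value
    returns = np.zeros(len(rewards))
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R       # earlier states receive longer n-step returns
        returns[i] = R
    return returns

rewards = [0.0, 0.0, 1.0, 0.0, 0.5]                 # hypothetical rollout, t_max = 5
values  = np.array([0.4, 0.5, 0.6, 0.3, 0.2])       # V(s_i; theta_v) along the rollout
R = n_step_returns(rewards, bootstrap_value=0.1)
advantages    = R - values   # scale grad_theta log pi(a_i|s_i; theta) by these
value_targets = R            # regress V(s_i; theta_v) toward these
```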