Deep Reinforcement Learning is an exciting new field that draws on many disciplines: computer science, optimal control, statistics, machine learning, and so on. Its applications are numerous.

Policy Gradient

Definition A Markov Decision Process contains

  • \( p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R} \), the state transition probability
  • \( \mu : \mathcal{S} \rightarrow \mathbb{R} \), the probability distribution over the initial state, \( s_0 \)
  • \( \pi : \mathcal{S} \rightarrow \Delta(\mathcal{A}) \), the stochastic policy
  • \( \theta \in \mathbb{R}^n \), a vector of parameters that parameterizes the stochastic policy \( \pi \)
  • \( \eta(\pi) = \mathbb{E} \left[ R_0 + \gamma R_1 + \gamma^2 R_2 + \cdots \right] \), the expected discounted return
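To make the definition concrete, here is a minimal sketch of these objects for a hypothetical two-state, two-action MDP (the transition probabilities, rewards, and horizon below are invented for illustration and are not from the text):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
# p[s, a, s'] = transition probability, r[s, a] = reward,
# mu = distribution over the initial state s_0.
n_states, n_actions = 2, 2
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [1.0, 0.0]])
mu = np.array([1.0, 0.0])
gamma = 0.9

def sample_return(policy, horizon=50, rng=np.random.default_rng(0)):
    """Sample one trajectory and return its (truncated) discounted return
    G = R_0 + gamma * R_1 + gamma^2 * R_2 + ..."""
    s = rng.choice(n_states, p=mu)                 # s_0 ~ mu
    G = 0.0
    for t in range(horizon):
        a = rng.choice(n_actions, p=policy[s])     # a_t ~ pi(. | s_t)
        G += gamma**t * r[s, a]
        s = rng.choice(n_states, p=p[s, a])        # s_{t+1} ~ p(. | s_t, a_t)
    return G

uniform = np.full((n_states, n_actions), 0.5)      # uniform stochastic policy
print(round(sample_return(uniform), 3))
```

Here the stochastic policy is represented as a table `policy[s, a]` rather than a parameterized function of \( \theta \); a parameterized policy would replace this table with, e.g., a softmax over learned scores.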

A policy gradient algorithm then calculates \( \nabla_{\theta} \eta(\theta) \) and proceeds as a standard gradient ascent algorithm. Since we don't have access to the underlying probability distributions, we approximate the gradient using Monte Carlo estimation.
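As a sketch of this ascent loop, the snippet below uses one common Monte Carlo gradient estimate, the score-function (REINFORCE) form \( \hat{g} = R \, \nabla_\theta \log \pi_\theta(a) \), which these notes have not yet derived. The single-step bandit setting, the softmax policy, and the reward values are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.5, 0.9])   # hypothetical mean reward per action

def softmax(theta):
    """Softmax policy pi_theta over actions, parameterized by theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(3)                        # policy parameters
alpha = 0.1                                # step size
for step in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)                            # a ~ pi_theta
    R = true_reward[a] + 0.1 * rng.standard_normal()   # noisy reward sample
    # Score-function estimate of the gradient:
    # grad_theta log pi_theta(a) = e_a - pi  for a softmax policy.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * R * grad_log_pi       # gradient ascent step

print(np.round(softmax(theta), 2))
```

After training, the policy typically concentrates on the highest-reward action; the key point is that the update direction is itself a random variable whose expectation is the true gradient.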

Theorem Monte Carlo estimation. Let \( X : \Omega \rightarrow \mathbb{R}^n \) be a random variable with probability distribution \( q \), and let \( f : \mathbb{R}^n \rightarrow \mathbb{R} \). Then, for i.i.d. samples \( x_1, \dots, x_N \sim q \),

\[ \mathbb{E}_{X \sim q} \left[ f(X) \right] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \]

where the sample average converges to the expectation as \( N \rightarrow \infty \) by the law of large numbers.

Armed with this theorem, we can use a sample average to estimate the expectation.
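As a quick illustration of the sample-average estimator, the sketch below estimates \( \mathbb{E}[f(X)] \) for \( X \sim \mathcal{N}(0, 1) \) and \( f(x) = x^2 \), whose true value is \( \mathrm{Var}(X) = 1 \) (this toy distribution and \( f \) are chosen for illustration, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(X)] for X ~ N(0, 1) and f(x) = x^2; the true value is 1.
f = lambda x: x ** 2
samples = rng.standard_normal(100_000)   # x_i drawn i.i.d. from q
estimate = f(samples).mean()             # (1/N) * sum_i f(x_i)
print(round(estimate, 3))
```

The same pattern underlies the policy gradient estimate: replace \( x_i \) with sampled trajectories and \( f \) with the quantity whose expectation is the gradient.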