Readers should have prior knowledge of Reinforcement Learning basics. A good tutorial can be found here.


Policy Optimization and Q-Learning are the two main model-free RL approaches. While the former is more principled and stable, the latter exploits sampled trajectories more efficiently. Soft Actor-Critic (SAC) interpolates between the two approaches.

Policy Gradient

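Following a stochastic parameterized policy $\pi_\theta$, we can sample a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$. The aim is to maximize the expected return
$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right].$$
Because this is a maximization, we update $\theta$ via gradient ascent with step size $\eta$:
$$\theta_{k+1} = \theta_k + \eta \, \nabla_\theta J(\pi_\theta)\big|_{\theta_k}.$$
The gradient can be expanded into:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)\right].$$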
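The policy gradient averages $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$ over sampled trajectories. The following is a minimal sketch of that estimator on a toy problem, assuming a one-step environment (a 3-armed bandit) and a softmax policy. The reward values and the names `softmax` and `policy_gradient_estimate` are illustrative and not part of the original derivation.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_estimate(theta, rewards, num_samples, rng):
    """Monte Carlo estimate of grad_theta E[R] for a softmax policy over arms."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        ret = rewards[a]  # one-step trajectory, so R(tau) is the arm's reward
        # For a softmax policy: d/dtheta_i log pi(a) = 1[i == a] - pi(i)
        for i in range(len(theta)):
            indicator = 1.0 if i == a else 0.0
            grad[i] += (indicator - probs[i]) * ret / num_samples
    return grad

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
rewards = [1.0, 0.0, 0.0]  # hypothetical deterministic reward per arm
step_size = 0.5
for _ in range(200):
    g = policy_gradient_estimate(theta, rewards, num_samples=64, rng=rng)
    theta = [t + step_size * gi for t, gi in zip(theta, g)]  # gradient ascent
final_probs = softmax(theta)
```

After training, `final_probs` should place most of its mass on the rewarding arm, since ascent increases the log-probability of actions with higher return.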

Entropy-Regularized Reinforcement Learning

A learned Q-function tends to dramatically overestimate Q-values, which then breaks the policy because the policy exploits the errors in the Q-function. To address this issue, we need to regularize the Q-values by some measure.
The entropy of a random variable $x$ with distribution $P$ is defined as:
$$H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right].$$
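For a discrete distribution, the entropy is the expected negative log-probability, $-\sum_x P(x) \log P(x)$. A minimal sketch follows, with hypothetical example distributions, showing that a uniform distribution has higher entropy than a peaked one:

```python
import math

def entropy(probs):
    """H(P) = E_{x~P}[-log P(x)] for a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 outcomes
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic
h_uniform = entropy(uniform)          # equals log(4)
h_peaked = entropy(peaked)
```

A policy that spreads probability over many actions therefore earns a larger entropy bonus than one that has collapsed onto a single action.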
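At each time step we give the agent a bonus reward proportional to the entropy of the policy. The Bellman equation thus becomes:
$$Q^\pi(s,a) = \mathbb{E}_{s' \sim P,\, a' \sim \pi}\left[R(s,a,s') + \gamma\left(Q^\pi(s',a') + \alpha H\left(\pi(\cdot \mid s')\right)\right)\right],$$
where $\alpha > 0$ is the trade-off coefficient (or temperature). A higher temperature encourages early exploration and prevents the policy from prematurely converging to a bad local optimum.
Since $H(\pi(\cdot \mid s')) = \mathbb{E}_{a' \sim \pi}\left[-\log \pi(a' \mid s')\right]$, we can approximate the expectation with an action sampled from the policy:
$$Q^\pi(s,a) \approx r + \gamma\left(Q^\pi(s',\tilde{a}') - \alpha \log \pi(\tilde{a}' \mid s')\right), \quad \tilde{a}' \sim \pi(\cdot \mid s').$$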
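The sample-based backup replaces the entropy term with $-\alpha \log \pi(\tilde{a}' \mid s')$ for a sampled next action. The sketch below checks numerically that averaging this one-sample estimate recovers the exact entropy-regularized backup. It assumes a discrete 3-action policy at the next state; all numbers and the name `soft_backup_sample` are hypothetical.

```python
import math
import random

def soft_backup_sample(r, gamma, alpha, q_next, log_prob_next):
    """One-sample estimate: r + gamma * (Q(s', a') - alpha * log pi(a'|s'))."""
    return r + gamma * (q_next - alpha * log_prob_next)

# Hypothetical next-state policy and Q-values.
pi_next = [0.5, 0.3, 0.2]
q_next_values = [1.0, 2.0, 0.5]
r, gamma, alpha = 0.1, 0.99, 0.2

# Exact backup: r + gamma * (E[Q] + alpha * H(pi)).
expected_q = sum(p * q for p, q in zip(pi_next, q_next_values))
policy_entropy = -sum(p * math.log(p) for p in pi_next)
exact = r + gamma * (expected_q + alpha * policy_entropy)

# Monte Carlo average of the one-sample estimate.
rng = random.Random(0)
num_samples = 20000
total = 0.0
for _ in range(num_samples):
    a = rng.choices(range(3), weights=pi_next)[0]
    total += soft_backup_sample(r, gamma, alpha, q_next_values[a], math.log(pi_next[a]))
estimate = total / num_samples
```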

Q-Learning Side

Mean Squared Bellman Error

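The Bellman equation describing the optimal action-value function is given by:
$$Q^*(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma \max_{a'} Q^*(s',a')\right].$$
With sampled transitions $(s,a,r,s',d)$ stored in a replay buffer $\mathcal{D}$, we learn an approximator to $Q^*(s,a)$ with a neural network $Q_\phi(s,a)$.
The mean squared Bellman error (MSBE) is computed as:
$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_\phi(s,a) - \left(r + \gamma (1-d) \max_{a'} Q_\phi(s',a')\right)\right)^2\right],$$
where $d = 1$ if $s'$ is a terminal state and $d = 0$ otherwise.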
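The MSBE compares each stored Q-value with a one-step bootstrapped target, and the done flag masks the bootstrap term at terminal states. A minimal sketch follows, assuming a tabular Q (a list of rows, one per state) in place of a neural network. The transitions and the name `msbe` are hypothetical.

```python
def msbe(q, batch, gamma):
    """Mean squared Bellman error of a tabular Q over a batch of transitions."""
    total = 0.0
    for s, a, r, s_next, d in batch:
        # Bootstrapped target; (1 - d) removes the future term at terminal states.
        target = r + gamma * (1.0 - d) * max(q[s_next])
        total += (q[s][a] - target) ** 2
    return total / len(batch)

# Hypothetical 2-state, 2-action problem; transitions are (s, a, r, s', d).
q = [[0.0, 1.0], [2.0, 0.0]]
batch = [
    (0, 1, 1.0, 1, 0.0),  # non-terminal: target = 1 + 0.9 * max(q[1]) = 2.8
    (1, 0, 2.0, 0, 1.0),  # terminal: target = 2.0, bootstrap term is masked
]
loss = msbe(q, batch, gamma=0.9)
```

Here the first transition contributes a squared error of $1.8^2$ and the second contributes zero.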

Target Networks

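The optimization target is given by:
$$y(r,s',d) = r + \gamma (1-d) \max_{a'} Q_\phi(s',a').$$
Because the target depends on the same parameters $\phi$ that we are training, minimizing the MSBE is unstable. To stabilize training, we replace $Q_\phi$ in the target with a target network $Q_{\phi_{\text{targ}}}$, which is cached and updated once per main network update by Polyak averaging:
$$\phi_{\text{targ}} \leftarrow \rho \, \phi_{\text{targ}} + (1-\rho) \, \phi,$$
where $\rho \in (0,1)$ is typically close to 1.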
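Polyak averaging moves the target parameters a small step toward the main parameters after every update. A minimal sketch, assuming parameters are flat lists of floats and an averaging coefficient of 0.995 (an illustrative value, not taken from the original text):

```python
def polyak_update(target_params, params, rho):
    """Return rho * target + (1 - rho) * params, elementwise."""
    return [rho * t + (1.0 - rho) * p for t, p in zip(target_params, params)]

phi = [1.0, -2.0, 0.5]        # hypothetical main-network parameters, held fixed here
phi_targ = [0.0, 0.0, 0.0]    # target parameters start elsewhere
for _ in range(1000):
    phi_targ = polyak_update(phi_targ, phi, rho=0.995)
```

With the main parameters held fixed, the target converges to them geometrically, so the target network is a slowly moving copy of the main network.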

Clipped double-Q

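To further suppress overestimated Q-values, SAC learns two Q-functions instead of one, regressing both sets of parameters $\phi_1, \phi_2$ onto a shared target computed with the smaller of the two Q-values:
$$y(r,s',d) = r + \gamma (1-d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s',a'),$$
$$L(\phi_i, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_{\phi_i}(s,a) - y(r,s',d)\right)^2\right], \quad i = 1,2,$$
where $a'$ is the action chosen at $s'$.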
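Clipped double-Q takes the smaller of the two target Q-values when forming the shared regression target. A minimal sketch for a single transition, with hypothetical scalar Q-values and illustrative function names:

```python
def clipped_double_q_target(r, d, gamma, q1_next, q2_next):
    """Shared target y = r + gamma * (1 - d) * min(Q1', Q2')."""
    return r + gamma * (1.0 - d) * min(q1_next, q2_next)

def q_losses(q1, q2, y):
    """Squared errors of both critics against the same target y."""
    return (q1 - y) ** 2, (q2 - y) ** 2

# Non-terminal transition: the pessimistic estimate 3.0 is used, not 5.0.
y = clipped_double_q_target(r=1.0, d=0.0, gamma=0.99, q1_next=5.0, q2_next=3.0)
loss1, loss2 = q_losses(q1=4.0, q2=4.5, y=y)

# Terminal transition: the bootstrap term vanishes and the target is the reward.
y_terminal = clipped_double_q_target(r=1.0, d=1.0, gamma=0.99, q1_next=5.0, q2_next=3.0)
```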

Policy Learning Side

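Since calculating $\max_a Q_\phi(s,a)$ is expensive in continuous action spaces, we can approximate it with $Q_\phi(s, \mu_\theta(s))$, where $\mu_\theta$ is the target policy. The objective then becomes learning a policy that maximizes
$$\mathbb{E}_{s \sim \mathcal{D}}\left[Q_\phi(s, \mu_\theta(s))\right].$$
Here we adopt a squashed state-dependent Gaussian policy:
$$\tilde{a}_\theta(s,\xi) = \tanh\left(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\right), \quad \xi \sim \mathcal{N}(0, I).$$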
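A squashed Gaussian policy draws a Gaussian sample and passes it through $\tanh$, so the log-probability needs a change-of-variables correction. The sketch below handles a one-dimensional action, where `mu` and `sigma` stand in for the outputs of the policy network at a given state. The function names and the numeric values are assumptions for illustration.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_squashed_gaussian(mu, sigma, rng):
    """Sample a = tanh(mu + sigma * xi) with xi ~ N(0, 1); return (a, log pi(a))."""
    xi = rng.gauss(0.0, 1.0)
    u = mu + sigma * xi            # pre-squash Gaussian sample
    a = math.tanh(u)               # squashed into (-1, 1)
    log_prob_u = -0.5 * xi * xi - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # Change of variables: log pi(a) = log N(u) - log(1 - tanh(u)^2),
    # using the identity log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return a, log_prob_u - log_det

rng = random.Random(0)
action, log_prob = sample_squashed_gaussian(mu=0.3, sigma=0.8, rng=rng)
```

For a multi-dimensional action with independent components, the per-dimension log-probabilities would be summed.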
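In the context of entropy-regularized reinforcement learning, we modify the target to:
$$y(r,s',d) = r + \gamma (1-d) \left(\min_{j=1,2} Q_{\phi_{\text{targ},j}}(s',\tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s')\right), \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s').$$
Writing the action as a deterministic function $\tilde{a}_\theta(s,\xi)$ of the state and independent noise $\xi \sim \mathcal{N}(0, I)$ is a reparameterization that removes the dependence of the expectation's distribution on the policy parameters:
$$\mathbb{E}_{a \sim \pi_\theta}\left[Q^{\pi_\theta}(s,a) - \alpha \log \pi_\theta(a \mid s)\right] = \mathbb{E}_{\xi \sim \mathcal{N}}\left[Q^{\pi_\theta}(s, \tilde{a}_\theta(s,\xi)) - \alpha \log \pi_\theta(\tilde{a}_\theta(s,\xi) \mid s)\right].$$
We perform gradient ascent on:
$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}}\left[\min_{j=1,2} Q_{\phi_j}(s, \tilde{a}_\theta(s,\xi)) - \alpha \log \pi_\theta(\tilde{a}_\theta(s,\xi) \mid s)\right].$$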
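With reparameterized actions, the policy objective becomes an ordinary function of the policy parameters once the noise samples are fixed, so it can be differentiated directly. The sketch below illustrates this for a single state and a one-dimensional action. The critics `q1` and `q2`, all numeric values, and the finite-difference gradient are assumptions made to keep the example dependency-free; in practice automatic differentiation would be used.

```python
import math
import random

def squashed_sample(mu, sigma, xi):
    """Reparameterized action a = tanh(mu + sigma * xi) and its log-probability."""
    a = math.tanh(mu + sigma * xi)
    log_prob = (-0.5 * xi * xi - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
                - math.log(1.0 - a * a + 1e-6))  # small constant avoids log(0)
    return a, log_prob

def policy_objective(mu, sigma, noise, q1, q2, alpha):
    """Average of min(Q1, Q2)(a) - alpha * log pi(a) over fixed noise samples."""
    total = 0.0
    for xi in noise:
        a, log_prob = squashed_sample(mu, sigma, xi)
        total += min(q1(a), q2(a)) - alpha * log_prob
    return total / len(noise)

# Hypothetical critics at one fixed state; both prefer actions near 0.5.
def q1(a):
    return -(a - 0.5) ** 2

def q2(a):
    return -(a - 0.5) ** 2 - 0.1

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(256)]  # sampled once, independent of mu
mu, sigma, alpha, step_size, eps = 0.0, 0.3, 0.05, 0.2, 1e-4
start = policy_objective(mu, sigma, noise, q1, q2, alpha)
for _ in range(200):
    # Central finite difference with respect to mu, using the same noise samples.
    grad = (policy_objective(mu + eps, sigma, noise, q1, q2, alpha)
            - policy_objective(mu - eps, sigma, noise, q1, q2, alpha)) / (2.0 * eps)
    mu += step_size * grad  # gradient ascent
end = policy_objective(mu, sigma, noise, q1, q2, alpha)
```

The mean moves toward the region the critics prefer, pulled back slightly toward zero by the entropy term.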


Since we combine offline and online updates to improve performance, we need to tackle the offline reinforcement learning problem on generated samples. According to prior work on offline RL [6][7], out-of-distribution (OOD) actions and function approximation errors cause problems for Q-function estimation. Therefore, we adopt the conservative Q-learning method proposed in [8] to address this issue.

Conservative Off-Policy Evaluation

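We aim to estimate the value $V^\pi(s)$ of a target policy $\pi$ given access to a dataset $\mathcal{D}$ generated by a pretrained SAC behavioral policy $\pi_\beta(a \mid s)$. Because we are interested in preventing overestimation of the policy value, we learn a conservative, lower-bound Q-function by additionally minimizing Q-values alongside a standard Bellman error objective. Our choice of penalty is to minimize the expected Q-value under a particular distribution of state-action pairs $\mu(s,a)$. We can define an iterative optimization for training the Q-function:
$$\hat{Q}^{k+1} \leftarrow \arg\min_Q \; \alpha \, \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a \mid s)}\left[Q(s,a)\right] + \frac{1}{2} \, \mathbb{E}_{s,a,s' \sim \mathcal{D}}\left[\left(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)\right)^2\right],$$
where $\hat{\mathcal{B}}^\pi$ is the Bellman operator and $\alpha$ is the trade-off factor (not to be confused with the SAC temperature). The resulting fixed point is $\hat{Q}^\pi := \lim_{k \to \infty} \hat{Q}^k$, and it can be shown to lower-bound $Q^\pi$ for all state-action pairs $(s,a)$. We can further tighten this bound if we are only interested in estimating $V^\pi(s)$. In this case, we can improve the iterative process to:
$$\hat{Q}^{k+1} \leftarrow \arg\min_Q \; \alpha \left(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a \mid s)}\left[Q(s,a)\right] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a \mid s)}\left[Q(s,a)\right]\right) + \frac{1}{2} \, \mathbb{E}_{s,a,s' \sim \mathcal{D}}\left[\left(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)\right)^2\right].$$
With the added Q-maximizing term, $\hat{Q}^\pi$ may no longer be a point-wise lower bound for $Q^\pi$, but we still have $\mathbb{E}_{\pi(a \mid s)}\left[\hat{Q}^\pi(s,a)\right] \le V^\pi(s)$ when $\mu(a \mid s) = \pi(a \mid s)$. For a detailed theoretical analysis, we refer to [8].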
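In the tabular case, the first update (the Q-minimizing penalty plus the Bellman error) has a closed form: setting the derivative with respect to each $Q(s,a)$ to zero gives $\hat{Q}^{k+1}(s,a) = \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a) - \alpha \, \mu(a \mid s) / \pi_\beta(a \mid s)$. The sketch below evaluates this for one state, assuming an exact Bellman backup and a behavior policy with non-zero probability on every action. All numbers and the name `conservative_q` are hypothetical.

```python
def conservative_q(bellman_target, mu, pi_beta, alpha):
    """Tabular minimizer of alpha * E_mu[Q] + 0.5 * E_{pi_beta}[(Q - target)^2]."""
    return [t - alpha * m / b for t, m, b in zip(bellman_target, mu, pi_beta)]

# One state, three actions.
target = [1.0, 0.5, 0.2]     # Bellman backup values for each action
pi_beta = [0.6, 0.3, 0.1]    # behavior policy (how often each action appears in the data)
mu = [0.2, 0.3, 0.5]         # distribution whose Q-values are pushed down
q_hat = conservative_q(target, mu, pi_beta, alpha=0.1)
penalties = [t - q for t, q in zip(target, q_hat)]
```

Every entry of `q_hat` sits below its Bellman target, and the action that is rare in the data but favored by `mu` receives the largest penalty.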

Conservative Q-Learning for Offline RL

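We now adopt a general approach for offline policy learning, which we refer to as conservative Q-learning (CQL). This algorithm was first presented in [8]. We denote by $\text{CQL}(\mathcal{R})$ a CQL algorithm with a particular choice of regularizer $\mathcal{R}(\mu)$. We can formulate the optimization problem in a min-max fashion:
$$\min_Q \max_\mu \; \alpha \left(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a \mid s)}\left[Q(s,a)\right] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a \mid s)}\left[Q(s,a)\right]\right) + \frac{1}{2} \, \mathbb{E}_{s,a,s' \sim \mathcal{D}}\left[\left(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\right)^2\right] + \mathcal{R}(\mu).$$
Since we are using CQL-SAC, we choose the regularizer to be the entropy $\mathcal{H}$, making it $\text{CQL}(\mathcal{H})$. In this case, the optimization problem reduces to:
$$\min_Q \; \alpha \, \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_a \exp\left(Q(s,a)\right) - \mathbb{E}_{a \sim \hat{\pi}_\beta(a \mid s)}\left[Q(s,a)\right]\right] + \frac{1}{2} \, \mathbb{E}_{s,a,s' \sim \mathcal{D}}\left[\left(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\right)^2\right].$$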
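With the entropy regularizer, the conservative penalty becomes a log-sum-exp of the Q-values over actions minus the Q-value of the action seen in the data. The sketch below computes this objective for discrete actions and a tabular Q. It assumes the expectation under the behavior policy is approximated by the dataset action and that Bellman targets are precomputed; the names `logsumexp` and `cql_h_loss` and all numbers are illustrative.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_h_loss(q_rows, data_actions, targets, alpha):
    """alpha * (logsumexp_a Q(s, a) - Q(s, a_data)) + 0.5 * squared Bellman error."""
    penalty = 0.0
    bellman = 0.0
    for q_s, a, y in zip(q_rows, data_actions, targets):
        penalty += logsumexp(q_s) - q_s[a]  # push all actions down, the data action up
        bellman += (q_s[a] - y) ** 2
    n = len(q_rows)
    return alpha * penalty / n + 0.5 * bellman / n

# Hypothetical batch of two states with two actions each.
q_rows = [[1.0, 2.0], [0.0, 0.0]]   # Q(s, .) for each sampled state
data_actions = [1, 0]               # actions recorded in the dataset
targets = [2.0, 0.5]                # precomputed Bellman targets
loss = cql_h_loss(q_rows, data_actions, targets, alpha=1.0)
```

The penalty is never negative because the log-sum-exp is at least as large as any single Q-value.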
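More specifically, we let the regularizer be $\mathcal{R}(\mu) = -D_{\text{KL}}(\mu, \rho)$, where $\rho(a \mid s)$ is a prior distribution. Solving the inner maximization then gives $\mu(a \mid s) \propto \rho(a \mid s) \exp\left(Q(s,a)\right)$. We take the prior to be the uniform distribution $\rho = \text{Unif}(a)$, for which the regularizer equals the entropy of $\mu$ up to a constant. In this way, we retrieve the optimization target above. For detailed derivations and theoretical analysis, we refer to [8].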
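With a uniform prior, the inner maximization over the action distribution is solved by a softmax over the Q-values, and its optimal value is the log-sum-exp of the Q-values. The following sketch checks this numerically at a single state with hypothetical Q-values:

```python
import math

def softmax(q):
    """Distribution proportional to exp(Q), computed stably."""
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    z = sum(exps)
    return [e / z for e in exps]

def inner_objective(mu, q):
    """E_mu[Q] + H(mu): the inner problem with a uniform prior, up to a constant."""
    expected_q = sum(m * v for m, v in zip(mu, q))
    mu_entropy = -sum(m * math.log(m) for m in mu if m > 0.0)
    return expected_q + mu_entropy

q = [1.0, 2.0, 0.5]                      # hypothetical Q-values at one state
mu_star = softmax(q)                     # proportional to exp(Q(s, a))
best = inner_objective(mu_star, q)
lse = math.log(sum(math.exp(v) for v in q))
other = inner_objective([1.0 / 3.0] * 3, q)  # any other distribution scores lower
```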


The architectures of SAC and its CQL-modified version are illustrated as follows:
[Figure: architecture of SAC and its CQL-modified version]
The overall pipeline is visualized below:
[Figure: overall pipeline]




For an implementation of the CQL-SAC algorithm, please refer to our GitHub repo.


[1] Achiam, Joshua. "Spinning Up in Deep Reinforcement Learning." (2018).
[2] Haarnoja, Tuomas, et al. "Soft actor-critic algorithms and applications." arXiv preprint arXiv:1812.05905 (2018).
[3] Kumar, Aviral, et al. "Conservative q-learning for offline reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 1179-1191.
[4] Zhang, Shangtong, and Richard S. Sutton. "A deeper look at experience replay." arXiv preprint arXiv:1712.01275 (2017).
[5] Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing function approximation error in actor-critic methods." International Conference on Machine Learning. PMLR, 2018.
[6] Levine, Sergey, Aviral Kumar, George Tucker, and Justin Fu. "Offline reinforcement learning: Tutorial, review, and perspectives on open problems." arXiv preprint arXiv:2005.01643 (2020).
[7] Kumar, Aviral, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. "Stabilizing off-policy q-learning via bootstrapping error reduction." Advances in Neural Information Processing Systems (2019): 11761-11771.
[8] Kumar, Aviral, Aurick Zhou, George Tucker, and Sergey Levine. "Conservative q-learning for offline reinforcement learning." Advances in Neural Information Processing Systems (2020).