problem : Is there a policy update method that is theoretically guaranteed to improve performance monotonically, regardless of the policy class?
idea : Extend the lower bound on policy improvement proved for conservative policy iteration to general stochastic policies, and maximize this lower bound as a surrogate objective.
input/output : {s, a, r, …} -> policy
architecture : conv + linear
baseline : deep Q-learning
result : Decent performance, but not much better than deep Q-learning.
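The surrogate in the idea line can be made concrete with a small tabular sketch. This is a minimal NumPy illustration of a penalized lower bound of the conservative-policy-iteration form, eta(pi_new) - eta(pi_old) >= L(pi_new) - C * max_s KL(pi_old || pi_new); the function names, the toy numbers, and the uniform weighting over states are my assumptions, not details from the note (the real bound weights states by the discounted visitation frequencies of pi_old).

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions along the last axis."""
    return np.sum(p * np.log(p / q), axis=-1)

def surrogate_lower_bound(pi_new, pi_old, adv, gamma=0.99):
    """Penalized surrogate lower bound (sketch, uniform state weighting):

        L(pi_new) - C * max_s KL(pi_old(.|s) || pi_new(.|s))

    where L(pi_new) = E_s[ sum_a pi_new(a|s) * A_old(s, a) ] and
    C = 2 * eps * gamma / (1 - gamma)**2 with eps = max |A_old|."""
    L = np.mean(np.sum(pi_new * adv, axis=1))
    eps = np.max(np.abs(adv))
    C = 2.0 * eps * gamma / (1.0 - gamma) ** 2
    return L - C * np.max(kl(pi_old, pi_new))

# Hypothetical tabular example: 2 states, 2 actions.
pi_old = np.array([[0.5, 0.5],
                   [0.25, 0.75]])
q_values = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
# Advantages of pi_old: A(s, a) = Q(s, a) - V(s).
adv = q_values - np.sum(pi_old * q_values, axis=1, keepdims=True)

# Sanity check from the theory: at pi_new = pi_old both L and the KL
# penalty are zero, so the bound is 0 and maximizing it can never
# predict a performance decrease.
print(surrogate_lower_bound(pi_old, pi_old, adv))  # -> 0.0
```

Maximizing this bound over pi_new (instead of taking an exact max as conservative policy iteration does) is what makes the update applicable to a general policy network; the KL penalty is what keeps the update conservative.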