Both reviews note that the paper is well written and
that thorough experiments have been conducted.
However, reviewer 1 comments on some of the
claims and statements in the paper.
The authors are asked to carefully address the arguments of
reviewer 1 and add a discussion to the paper if necessary.
In addition, the authors should carefully follow the
suggestions of reviewer 2.
----------------------------------------
Comments on Video Attachment:
Reviewer 2 of ICRA 2019 submission 364
Comments to the author
======================
The paper introduces a framework for learning policies
robust to unexpected situations (states). The premise is
that an RL agent, by virtue of optimality, will focus
on states of high value. When put in a slightly different
environment at test time, this agent might reach states of
low value that were not sufficiently explored during
training and for which no good recovery policy has been
learned. In order to learn a more robust policy, the
authors propose to augment the training environment with an
adversarial agent that drives the learner towards states of
low value during training time.
The paper is well written and contains an appreciable
summary of related approaches. The proposed approach is
evaluated on a driving simulator and compared to several
variants including one baseline from the literature.
In addition to adding an adversarial agent that minimizes
the reward, the authors propose to use risk-averse policies
that select actions with high value and low variance.
Although the use of variance is common in risk-averse RL,
it usually refers to the variance of a random variable X,
where X is the sum of discounted rewards upon taking action
a in state s, and where the expectation of X is the Q-value.
Here, instead, it is the variance of several Q-function
estimates. This is not related to robustness as meant in,
e.g., reference [8] in the paper, but is rather a way to
combat the positive bias of the Q-function estimate, as in
Double Q-Learning. Indeed, in the limit, and assuming that
all Q-function estimates converge, the variance term as
used in Eq. (1) will be zero, and the optimal policy of the
agent will converge to the usual definition of optimality
in RL, not a robust variant thereof.
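The distinction drawn here can be illustrated with a small numerical sketch (the numbers are hypothetical, not from the paper): the variance of the return X is intrinsic to the environment and policy and persists even for a converged agent, whereas the variance across an ensemble of Q-estimates shrinks to zero as the estimates agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Risk-averse RL usually penalizes the variance of the return
# X = sum of discounted rewards, an intrinsic property of the
# environment and policy; it does not vanish with more training.
returns = rng.normal(loc=10.0, scale=3.0, size=10_000)  # sampled returns X
return_variance = returns.var()  # stays near scale**2 = 9

# The paper instead uses the variance across an ensemble of Q-estimates.
# Once all heads converge to the same value, this variance goes to zero
# regardless of how variable the return itself is.
q_true = returns.mean()
ensemble_early = q_true + rng.normal(scale=2.0, size=10)   # disagreeing heads
ensemble_late = q_true + rng.normal(scale=0.01, size=10)   # converged heads

print(return_variance, ensemble_early.var(), ensemble_late.var())
```

Under this reading, the variance penalty in Eq. (1) acts as an estimation-uncertainty term rather than a return-risk term, which is the point the review makes.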
It is also not clear why these two concepts (risk aversion
through the use of variance, and an adversarial agent) are
studied together. The authors put sufficient effort into
evaluating the different parts of the proposed framework,
but I believe a more thorough interpretation of the results
is necessary to establish, in a more convincing manner, the
relation between the added components and robustness.
As for the adversarial part, Fig. 6 is a convincing
argument for why the adversary might bring the learner to
states that are more challenging to recover from than
random perturbations, but it is a bit misleading since, in
the end, the adversary is only allowed to take control for
a single time-step.
Reviewer 3 of ICRA 2019 submission 364
Comments to the author
======================
This paper introduces risk-averse robust adversarial
reinforcement learning (RARARL), using a risk-averse
protagonist and a risk-seeking adversary, and tests it on
a self-driving vehicle controller. The authors' claims are
tested and justified; however, the paper lacks clarity in
some sections. In particular, catastrophic events and their
probability are modeled using the variance of the
value-function estimate, calculated using a set of
Q-networks. The standard Bellman equation has been adjusted
for the purposes of the presented work, but it is not clear
why it has been chosen in this way. The paper, for the most
part, clearly distinguishes its novelty from models already
established in the literature.
"While [37] proposed a method to estimate that variance, it
assumes that the number of states is limited." Does this
paper assume no limit on the number of states? i.e. we have
non-countable states space? Even if there is a limited
state space the limit can be very big that any tabular RL
method is impractical to apply hence a general function
approximation (such as neural networks) is needed. This
should be made clear in the paper
It would be more difficult, and more realistic, to assume
simultaneous actions performed by the adversary and the
protagonist.
The following two sentences need to be rephrased as they
create confusion: "The work of [27] proposed to introduce
Bayesian optimization to actively select environment
variables that may induce catastrophes so that models
trained can be robust to these environment dynamics.
However, they did not systematically explore dynamics
variations and therefore the model may be vulnerable to
changing dynamics even if it is robust to a handful of rare
events." This means that the Bayesian model is actually
robust to environment dynamics including catastrophic
events that are stemming out of specific environment
variables, the author should consider rephrasing to say
just that [27] did not systematically explore dynamics
variations.
"However, Rajeswaran et al. [32] requires training on a
target domain and experienced performance degradation when
the target domain has a different model parameter
distribution from the source." Is that not simply
off-policy learning?
Is each Q-network initialized to a different set of random
weights? If so, then this model is effectively eliminating
the effect of randomized weight initialization, in addition
to training the different Q-networks on different
trajectory samples.
The phrase "and use this head throughout one episode."
Should be "and use this head throughout one episode to
choose the action that will be applied by the agent."
"At test time, the mean value "Q(s;a) is used for selecting
actions." Does each Q network produces a set of values for
each action and then at test time the model is choosing the
max or is it e-greedy selection?
"At training" paragraph in page 3 lacks information and
clarity. For example: "When updating Q functions, our
algorithm (like DQN [20]) samples a batch of data of size B
from the replay buffer"
maybe should be "When updating Q functions, our algorithm
(like DQN [20]) samples a set of k batches of size B from
the replay buffer, they are then passed to the k Q-networks
to train each with a different dataset"? Or is it that you
are training the Q-networks that has been selected by the
mask for one batch update of size B.? This has been
clarified later in page 4 but still better to clarify it
here or point out for a coming explanation.
The choice of a Poisson distribution can be justified from
a Poisson-process perspective: a large number of data
points arrive from the online stream, and we try to select
some of them.
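The masking scheme under discussion can be sketched like this (my own illustrative reading, in the spirit of Bootstrapped DQN's online bootstrap, not code from the paper): each transition entering the replay buffer is assigned one Poisson(1) mask entry per head, approximating sampling-with-replacement; a head trains on a transition only when its entry is nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10           # number of Q-heads (illustrative)
buffer_size = 5  # transitions shown, kept tiny for readability

# One Poisson(1) draw per (transition, head). Head j trains on
# transition i iff masks[i, j] > 0; since P(Poisson(1) > 0) = 1 - 1/e,
# each head sees roughly 63% of transitions, weighted by its count.
masks = rng.poisson(lam=1.0, size=(buffer_size, k))
print(masks)
```

This is the Poisson-process view the review gestures at: with many points streaming in, each head keeps a random, differently-weighted subsample.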
Equations 4 (and 5) seem to assume that the protagonist is
at the end of its current set of m steps, since only the
latest reward is taken into account, plus a set of n
rewards coming from the adversary. Why not sum m rewards
from the protagonist with the n rewards from the adversary
in the equation?
The evaluation is elaborate and provides a good basis for
comparison. In particular, the range of models compared
(dqn, bsdqn, bsdqnrand, etc.) is commendable, and the
advantage of the proposed model is made clear through the
experiments.
For the purpose of handling catastrophic events, could the
model be replaced by extra exploration that is annealed
over time, encouraging the protagonist to explore
regardless of the consequences dictated by the
value-function estimate? For example, a softmax policy with
a high temperature at the start, cooled down in later
episodes? The comparison with vanilla DQN may not be fair,
because a different incentive for exploration may have
resulted in a better response to a catastrophe.
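The annealed-softmax baseline suggested here can be sketched as follows (names and schedule are illustrative, not from the paper): a Boltzmann policy over Q-values with a high initial temperature that is exponentially cooled, so early episodes explore near-uniformly and later ones act near-greedily.

```python
import numpy as np

def softmax_policy(q_values, temperature):
    # Boltzmann action distribution over Q-values; subtract the max
    # before exponentiating for numerical stability.
    z = (q_values - q_values.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

def temperature(episode, t0=10.0, decay=0.99):
    # Simple exponential cooling schedule (hypothetical constants).
    return t0 * decay**episode

q = np.array([1.0, 2.0, 1.5])
hot = softmax_policy(q, temperature(0))      # near-uniform: explores
cold = softmax_policy(q, temperature=0.05)   # near-greedy: exploits
print(hot, cold)
```

A vanilla-DQN baseline equipped with such a schedule would isolate how much of the robustness gain comes from the adversary rather than from extra exploration alone.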
The video seems to suggest that DQN is better than the
proposed model when tested against the adversary.
Minor:
Page 1: stronger adversaries; stronger than what?
Please use either "risk averse" or "risk-averse" consistently.
either agent => either agents