
TL;DR
- task : stochastic DNN => image classification, reinforcement learning, adversarial examples
- IDEA : What if we create a stochastic layer where only the variance is learned, not the mean?
- architecture : LeNet-5-Caffe
- objective : objective for each task
- **baseline :** VGG-like architecture, deterministic policy
- data : CIFAR-10, CIFAR-100
- result : ?
- contribution : ?
- limitation or something I didn’t understand : I’ll have to reread this later when I have more time.
Details
DNN in stochastic setting
- Methods include stochastic layers, stochastic optimization techniques, etc.
- Used to reduce overfitting, estimate uncertainty, and enable more efficient exploration in reinforcement learning
- Learning a stochastic model can be interpreted as a kind of Bayesian model
- One way to do this is to replace the deterministic weight $w_{ij}$ with a sample $\hat w_{ij} \sim q(\hat w_{ij}|\phi_{ij})$. Then, during training, we learn a distribution over each weight rather than a single point estimate.
- However, at test time we need to average predictions over this distribution of weights, which is where things like “mean propagation” and the “weight scaling rule” come into play.
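To make the weight-sampling idea above concrete, here is a minimal sketch (not from the paper; `sample_weight` and the parameter values are my own illustration) of drawing a single weight from a Gaussian $q(w|\mu, \sigma)$ via the reparameterization trick:

```python
import random

def sample_weight(mu, sigma):
    """Reparameterization: draw w ~ N(mu, sigma^2) as mu + sigma * eps, eps ~ N(0, 1)."""
    eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps

# Each deterministic weight w_ij is replaced by a distribution q(w_ij | mu_ij, sigma_ij);
# training then updates (mu_ij, sigma_ij) instead of a single point estimate of w_ij.
random.seed(0)
samples = [sample_weight(mu=0.5, sigma=0.1) for _ in range(10_000)]
print(round(sum(samples) / len(samples), 2))  # sample mean, close to mu = 0.5
```

Writing the sample as `mu + sigma * eps` is what makes gradients flow to `mu` and `sigma` in an autodiff framework, which is why this form is used during training.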
Stochastic Neural Network
DNNs are ultimately about predicting target T given object x and weights W.
Consider a stochastic neural network whose weights W are sampled from the parametric distribution $q(W|\phi)$.
During training, $\phi$ is fit to the training data $(X, T)$ and a regularization term $R(\phi)$ is added, which can be written as
$$\max_\phi \sum_{n=1}^N \mathbb{E}_{q(W|\phi)} \log p(t_n|x_n, W) - R(\phi)$$
To train this model, techniques such as binary dropout, variational dropout, and DropConnect are used. Computing the exact predictive distribution $\mathbb{E}_{q(W|\phi)}p(t|x,W)$ is usually intractable, so we approximate it by drawing K weight samples $\hat W_k$ and averaging the predictions, which is called “test-time averaging”.
To make the calculation more efficient, we can instead do a single forward pass using the expected weights $\mathbb{E}_q W$ in place of the samples $\hat W_k$, which is called the “weight scaling rule”.
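The difference between the two approximations can be seen on a toy one-weight “network” (my own illustration; `predict`, the sigmoid choice, and all values are assumptions, not the paper's setup):

```python
import math
import random

def predict(x, w):
    """A toy one-weight network: sigmoid(w * x)."""
    return 1.0 / (1.0 + math.exp(-w * x))

mu, sigma, x = 1.0, 0.5, 2.0
random.seed(0)

# Test-time averaging: average predictions over K sampled weights W_k ~ N(mu, sigma^2).
K = 5000
avg = sum(predict(x, random.gauss(mu, sigma)) for _ in range(K)) / K

# Weight scaling rule: one pass with the expected weight E_q[W] = mu.
scaled = predict(x, mu)

print(round(avg, 3), round(scaled, 3))  # close but not identical: E[f(W)] != f(E[W])
```

Because the network is nonlinear, the two numbers differ; the weight scaling rule trades a bit of accuracy for a K-fold reduction in compute.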
In this paper, we want to consider layers with $\mathbb{E}_q W = 0$. Since $p(t|x, \mathbb{E}_q W) = p(t|x, W{=}0)$, applying the weight scaling rule would reduce the prediction to a random guess (which is why they don’t use the weight scaling rule, right?). These layers are called “variance layers”, and networks built from them “variance networks”, because the mean value does not store any information; only the variance does.
Variance Layer
The weights are sampled as $\hat w_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2)$, so the activation does not depend on any mean parameter $\mu_{ij}$, only on the variance $\sigma_{ij}^2$.
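A minimal sketch of one variance-layer unit (my own illustration, assuming Gaussian zero-mean weights and local reparameterization; `variance_layer` and the input values are hypothetical):

```python
import math
import random

def variance_layer(x, log_sigma):
    """One variance-layer unit: weights w_j ~ N(0, sigma_j^2), so the
    pre-activation b = sum_j x_j * w_j is itself Gaussian with mean 0
    and variance sum_j x_j^2 * sigma_j^2 (local reparameterization)."""
    var = sum(xj * xj * math.exp(2 * ls) for xj, ls in zip(x, log_sigma))
    return math.sqrt(var) * random.gauss(0.0, 1.0)

random.seed(0)
x = [1.0, -2.0, 0.5]
log_sigma = [0.0, -1.0, 0.3]  # hypothetical learned log-std parameters
print(variance_layer(x, log_sigma))
```

The output is zero-mean by construction, so all the information the layer passes forward is carried in the magnitude of its activations, i.e. in the learned variances.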

Result
Good results in classification / reinforcement learning / robustness to adversarial examples
