
TL;DR
- task : stochastic DNN => image classification, reinforcement learning, adversarial examples
- IDEA : What if we create a stochastic layer where only the variance is learned, not the mean?
- architecture : LeNet-5-Caffe
- objective : objective for each task
- **baseline :** VGG-like architecture, deterministic policy
- data : CIFAR-10, CIFAR-100
- result : ?
- contribution : ?
- limitation or something I didn’t understand : I’ll have to reread this later when I have more time.
Details
DNN in stochastic setting
- Methods include stochastic layers, stochastic optimization techniques, etc.
- Used to reduce overfitting, estimate uncertainty, and enable more efficient exploration in reinforcement learning
- Learning a stochastic model can be interpreted as a kind of Bayesian model
- One way to do this is to replace the deterministic weight $w_{ij}$ with a sample $\hat w_{ij} \sim q(\hat w_{ij}|\phi_{ij})$. Then, during training, we learn a distribution over each weight rather than a single point estimate.
- However, at test time we need to average predictions over this distribution of weights, which is where things like “mean propagation” and the “weight scaling rule” come into play.
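To make the weight-sampling idea above concrete, here is a minimal sketch (not from the paper; `sample_weight` and the parameter values are my own illustration) of drawing a single weight from a Gaussian $q(w|\mu, \sigma)$ via the reparameterization trick:

```python
import random

def sample_weight(mu, sigma):
    """Reparameterization: draw w ~ N(mu, sigma^2) as mu + sigma * eps, eps ~ N(0, 1)."""
    eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps

# Each deterministic weight w_ij is replaced by a distribution q(w_ij | mu_ij, sigma_ij);
# training then updates (mu_ij, sigma_ij) instead of a single point estimate of w_ij.
random.seed(0)
samples = [sample_weight(mu=0.5, sigma=0.1) for _ in range(10_000)]
print(round(sum(samples) / len(samples), 2))  # sample mean, close to mu = 0.5
```

Writing the sample as `mu + sigma * eps` is what makes gradients flow to `mu` and `sigma` in an autodiff framework, which is why this form is used during training.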
Stochastic Neural Network
DNNs are ultimately about predicting target T given object x and weights W.
Consider a stochastic neural network whose weights W are sampled from the parametric distribution $q(W|\phi)$.
During training, $\phi$ is fit to the training data $(X, T)$ and a regularization term $R(\phi)$ is added, which can be written as
$$\max_\phi \sum_{n=1}^N \mathbb{E}_{q(W|\phi)} \log p(t_n|x_n, W) - R(\phi)$$
To train this model, techniques such as binary dropout, variational dropout, and DropConnect are used. Computing the exact predictive distribution $\mathbb{E}_{q(W|\phi)}p(t|x,W)$ is usually intractable, so we approximate it by drawing K weight samples $\hat W_k$ and averaging the predictions, which is called “test-time averaging”.
To make the calculation more efficient, we can instead do a single forward pass using the expected weights $\mathbb{E}_q W$ in place of the samples $\hat W_k$, which is called the “weight scaling rule”.
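The difference between the two approximations can be seen on a toy one-weight “network” (my own illustration; `predict`, the sigmoid choice, and all values are assumptions, not the paper's setup):

```python
import math
import random

def predict(x, w):
    """A toy one-weight network: sigmoid(w * x)."""
    return 1.0 / (1.0 + math.exp(-w * x))

mu, sigma, x = 1.0, 0.5, 2.0
random.seed(0)

# Test-time averaging: average predictions over K sampled weights W_k ~ N(mu, sigma^2).
K = 5000
avg = sum(predict(x, random.gauss(mu, sigma)) for _ in range(K)) / K

# Weight scaling rule: one pass with the expected weight E_q[W] = mu.
scaled = predict(x, mu)

print(round(avg, 3), round(scaled, 3))  # close but not identical: E[f(W)] != f(E[W])
```

Because the network is nonlinear, the two numbers differ; the weight scaling rule trades a bit of accuracy for a K-fold reduction in compute.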
In this paper, we want to consider layers with $\mathbb{E}_q W = 0$. Since $p(t|x, \mathbb{E}_q W) = p(t|x, W{=}0)$, applying the weight scaling rule would reduce the prediction to a random guess (which is why they don’t use the weight scaling rule, right?). These layers are called “variance layers”, and networks built from them “variance networks”, because the mean value does not store any information; only the variance does.
Variance Layer
The weights are sampled as $\hat w_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2)$, so the activation does not depend on any mean parameter $\mu_{ij}$, only on the variance $\sigma_{ij}^2$.
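A minimal sketch of one variance-layer unit (my own illustration, assuming Gaussian zero-mean weights and local reparameterization; `variance_layer` and the input values are hypothetical):

```python
import math
import random

def variance_layer(x, log_sigma):
    """One variance-layer unit: weights w_j ~ N(0, sigma_j^2), so the
    pre-activation b = sum_j x_j * w_j is itself Gaussian with mean 0
    and variance sum_j x_j^2 * sigma_j^2 (local reparameterization)."""
    var = sum(xj * xj * math.exp(2 * ls) for xj, ls in zip(x, log_sigma))
    return math.sqrt(var) * random.gauss(0.0, 1.0)

random.seed(0)
x = [1.0, -2.0, 0.5]
log_sigma = [0.0, -1.0, 0.3]  # hypothetical learned log-std parameters
print(variance_layer(x, log_sigma))
```

The output is zero-mean by construction, so all the information the layer passes forward is carried in the magnitude of its activations, i.e. in the learned variances.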

Result
Good results in classification / reinforcement learning / robustness to adversarial examples
