
TL;DR
- problem: few-shot classification works well within a single domain (e.g., predicting unseen labels within ImageNet), but poorly across domains (a model trained on ImageNet degrades when the few-shot test uses CUB data).
- solution: add a feature-wise transformation layer (an affine transformation) to the feature encoder, with hyperparameters learned via a learning-to-learn approach.
- result: good generalization performance when this feature-wise transformation is applied to MatchingNet, RelationNet, and Graph Neural Networks.
details
- The difference between domain adaptation and domain generalization: generalization must handle an unseen domain without using any data from it during the learning phase.
- We turn the domain generalization problem into a problem of classifying novel classes in a few-shot setting.
3.1. Preliminaries
few-shot terms
- N_w : number of categories (ways)
- N_s : number of labeled examples per category (shots)
The figure below shows an example of a 3-way 3-shot task.

A metric-based algorithm consists of a feature encoder E and a metric function M.
At each iteration, N_w categories are drawn and a task T is created. Let X be an input image and Y its corresponding label. Task T consists of a support set S = {(X_s, Y_s)} and a query set Q = {(X_q, Y_q)}.
The feature encoder E extracts features from the support and query images and feeds them into the metric function M to predict the category of the query image based on the label of the support image.
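The prediction step above can be sketched with a nearest-centroid metric — a simple stand-in for the metric function M (the actual papers use MatchingNet-, RelationNet-, or GNN-style metrics); the function name and NumPy implementation are illustrative:

```python
import numpy as np

def predict_query(support_feats, support_labels, query_feats, n_way):
    # Build one centroid per category from the support features (N_s * N_w, D)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in range(n_way)])
    # Squared Euclidean distance from each query feature (N_q, D) to each centroid
    dists = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    # Each query image is assigned the category of the nearest centroid
    return np.argmin(dists, axis=1)
```

In a real model, `support_feats` and `query_feats` would be the outputs of the shared feature encoder E for the support and query images of one task T.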

The learning objective is the classification loss over the query set.

The main difference between the various metric-based algorithms is the design of the metric function M: MatchingNet uses an LSTM, RelationNet a CNN relation module, GNN a graph convolutional network, etc.
The model is trained on seen domains and evaluated on unseen domains.
3.2. feature-wise transformation layer

Our goal is better generalization to unseen domains. Since the metric function M is prone to overfitting the feature distributions of the seen domains, this is what we need to prevent.
Intuitively, applying an affine transformation to the intermediate features of the encoder E lets us simulate more diverse feature distributions.
The hyperparameters here are the standard deviations of the Gaussian distributions from which the affine transformation parameters (scale and bias) are sampled.

After each batch-norm layer, apply the feature-wise transformation layer below.
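A minimal NumPy sketch of this layer, assuming the channel-wise form from the paper — scale gamma sampled from N(1, softplus(theta_gamma)) and bias beta from N(0, softplus(theta_beta)), applied per channel to a post-batch-norm activation map; function names are illustrative:

```python
import numpy as np

def softplus(x):
    # Keeps the sampled standard deviations positive
    return np.log1p(np.exp(x))

def feature_wise_transform(z, theta_gamma, theta_beta, rng):
    # z: post-batch-norm activation map of shape (C, H, W)
    # theta_gamma, theta_beta: per-channel hyperparameters of shape (C,)
    # Sample one scale (around 1) and one bias (around 0) per channel
    gamma = rng.normal(loc=1.0, scale=softplus(theta_gamma))
    beta = rng.normal(loc=0.0, scale=softplus(theta_beta))
    # Channel-wise affine transformation of the activations
    return gamma[:, None, None] * z + beta[:, None, None]
```

Note that very negative hyperparameters make softplus(theta) approach 0, so the layer degenerates to the identity; larger values inject more distribution noise.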

3.3. Learning the feature-wise transformation layers (FT layers)
- We could choose the hyperparameters above empirically, but we want to make them learnable, so we design a learning-to-learn algorithm. The main idea is to optimize the FT layers so that a model trained with them on the seen domain generalizes better to the unseen domain.

At each training iteration t, sample a pseudo-seen domain (ps) and a pseudo-unseen domain (pu) from the seen domains. Apply the feature encoder and metric function with the FT layers inserted, and update the model parameters using the loss on the pseudo-seen task only.

To measure generalization, we 1) remove the FT layer of the model and 2) calculate the classification loss of the updated model for the pseudo-unseen task. In other words,

Finally, since the loss above reflects the effectiveness of the FT layers, we update their hyperparameters as follows:
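The two-stage update can be sketched as a toy scalar bilevel problem — an illustrative stand-in, not the paper's actual model. The inner step updates a model parameter w on a pseudo-seen loss that depends on hyperparameter theta; the outer step updates theta from the pseudo-unseen loss of the *updated* model. The quadratic losses are invented for the demo, and the outer gradient is taken by finite differences to keep the sketch dependency-free:

```python
import numpy as np

def inner_update(w, theta, lr=0.1):
    # Toy pseudo-seen loss (model trained WITH the FT hyperparameter):
    # L_ps = (w * theta - 1)^2, so grad_w = 2 * theta * (w * theta - 1)
    grad_w = 2.0 * theta * (w * theta - 1.0)
    return w - lr * grad_w

def pseudo_unseen_loss(w, theta):
    # Evaluate the UPDATED model on the pseudo-unseen task.
    # Toy stand-in: theta acts only through the inner update,
    # mirroring "remove the FT layer" at evaluation. L_pu = (w' - 2)^2
    w_updated = inner_update(w, theta)
    return (w_updated - 2.0) ** 2

w, theta = 0.5, 1.0
initial = pseudo_unseen_loss(w, theta)
eps = 1e-5
for _ in range(100):
    # Step 1: update the model parameters on the pseudo-seen task
    w = inner_update(w, theta)
    # Step 2: update the FT hyperparameter from the pseudo-unseen loss
    grad_theta = (pseudo_unseen_loss(w, theta + eps)
                  - pseudo_unseen_loss(w, theta - eps)) / (2 * eps)
    theta -= 0.05 * grad_theta
final = pseudo_unseen_loss(w, theta)
```

In the paper the outer gradient flows through the inner update analytically (a second-order term, as in MAML-style training); finite differences here only stand in for that backprop-through-update step.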

In other words, the metric-based model and the feature-wise transformation layer (FT) are trained together in the training phase.
Experimental Results
FT: the FT layer is applied with fixed hyperparameters chosen by hand (rule of thumb).

LFT: the FT layer is applied with hyperparameters learned via the learning-to-learn procedure.

t-SNE results by domain: the domains are well mixed, indicating good cross-domain generalization.
