
Details
Multi-task Learning
Why does it work?
- 1) prevents overfitting to a single task, 2) aggregates data across tasks, 3) learns an "inductive bias", and 4) learns good shared features.
Common MTL Model Structure
hard parameter sharing vs soft parameter sharing
- hard parameter sharing

Share the hidden layers across all tasks and attach a separate, task-specific output layer for each task.
- soft parameter sharing

Stack a separate network for each task and impose an L2-norm penalty so that the parameters of the networks don't drift too far apart.
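A minimal NumPy sketch of both schemes (toy dimensions; names like `W_shared` and the two tasks are illustrative, not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hard parameter sharing: one shared trunk, a separate head per task.
W_shared = rng.normal(size=(16, 32))   # trunk shared by all tasks
W_head_a = rng.normal(size=(32, 3))    # task A: 3-way classification
W_head_b = rng.normal(size=(32, 1))    # task B: regression

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)  # shared ReLU features
    return h @ W_head_a, h @ W_head_b  # task-specific outputs

x = rng.normal(size=(4, 16))
out_a, out_b = forward(x)
print(out_a.shape, out_b.shape)  # (4, 3) (4, 1)

# Soft parameter sharing: each task keeps its OWN trunk, and an L2
# penalty added to the loss keeps the trunks close instead of identical.
W_trunk_a = W_shared + 0.01 * rng.normal(size=W_shared.shape)
W_trunk_b = W_shared + 0.01 * rng.normal(size=W_shared.shape)
l2_penalty = np.sum((W_trunk_a - W_trunk_b) ** 2)
```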
Recent work on MTL for deep learning
Deep Relationship Networks

Impose a matrix prior on the fully connected layers to let the model learn the relationships between tasks.

Cross-stitch network

Keep a separate network for each task, and at each layer mix the activations of the networks with a linear combination whose weights $\alpha$ are trainable.
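A cross-stitch unit is easy to sketch in NumPy: a trainable 2x2 matrix of alphas mixes the activations of the two task networks at a given layer (the near-identity initialization here is illustrative):

```python
import numpy as np

# Cross-stitch unit: mix the activations of two task networks with a
# trainable 2x2 alpha matrix, applied per layer.
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])  # near identity: tasks stay mostly separate

def cross_stitch(x_a, x_b, alpha):
    # Each output activation is a linear combination of both inputs.
    x_a_new = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    x_b_new = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return x_a_new, x_b_new

x_a = np.ones((2, 4))    # activations from task A's network
x_b = np.zeros((2, 4))   # activations from task B's network
y_a, y_b = cross_stitch(x_a, x_b, alpha)
print(y_a[0, 0], y_b[0, 0])  # 0.9 0.1
```

In training, the alphas are learned jointly with the network weights, so the model itself decides how much to share between tasks at each layer.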
- Weighting losses with uncertainty

Estimate the uncertainty of each task and use it to set the relative weight of each term in the multi-task loss.
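The uncertainty-weighting loss (Kendall et al., 2018) gives each task $i$ a learned log-variance $s_i$ and combines the losses as $\sum_i e^{-s_i} L_i + s_i$, so high-uncertainty tasks are automatically down-weighted. A small sketch:

```python
import numpy as np

# Uncertainty weighting: each task i gets a learned log-variance s_i;
# the combined loss is sum_i exp(-s_i) * L_i + s_i. The +s_i term keeps
# the model from driving all weights to zero.
def weighted_mtl_loss(task_losses, log_vars):
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * task_losses + log_vars))

losses = [2.0, 0.5]
print(weighted_mtl_loss(losses, [0.0, 0.0]))  # 2.5 (equal weighting)
print(weighted_mtl_loss(losses, [1.0, 0.0]))  # first task down-weighted
```

In practice the `log_vars` are trainable parameters optimized together with the network weights.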
Auxiliary tasks
- Related task: closely related tasks make the most natural auxiliary tasks.
- Adversarial: learn by optimizing the opposite of what you want, e.g., in domain adaptation, predict the domain of the input as an auxiliary task and reverse its gradient (Ganin, 2015).
- Hint: use a slightly easier task as a hint. For example, for sentence sentiment, add an easier auxiliary task that just classifies positive vs. negative -> connectivity experiment, remind me!
- Representation learning: since MTL is ultimately about learning good representations, learning a good representation can itself be the auxiliary task, e.g., language modeling or an autoencoder.
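The gradient-reversal trick behind the adversarial auxiliary task above can be sketched without any framework: identity on the forward pass, flipped gradient on the backward pass (the class name and `lam` coefficient are illustrative):

```python
import numpy as np

# Gradient reversal layer (Ganin, 2015): acts as the identity on the
# forward pass, but multiplies the gradient by -lambda on the backward
# pass, so the feature extractor learns to CONFUSE the domain classifier
# while the classifier itself still tries to predict the domain.
class GradientReversal:
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # identity: features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # flip (and scale) the gradient

grl = GradientReversal(lam=0.5)
x = np.array([1.0, 2.0])
assert np.allclose(grl.forward(x), x)
print(grl.backward(np.array([1.0, -1.0])))  # [-0.5  0.5]
```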
Lesson learned
I feel like BERT is really disruptive lol