
TL;DR
- I read this because.. : https://github.com/long8v/PTIR/issues/82 I often see it mentioned in other articles like this one. Two-word headings
- task : Tasks whose input or output is an unordered set. 1) estimate the distribution parameter of a parameter 2) list and sum numbers 3) point cloud classification 4) find words close to a concept / cluster of a set of words 5) find all tags related to an image
- problem : What are the characteristics that a deep network that solves permutation invariant tasks should have?
- architecture : $f(x)=\sigma(\lambda I \mathbf{x} + \gamma \text{maxpool}(\mathbf{x})1)$
- Result : Similar or better performance than models characterized by one arch each
- Contribution : Theoretical characterization of set input output, verifying its performance in various applications.
- limitation / things I cannot understand :
Details
Permutation Invariance and Equivarnce
Problem Definition
The function f must be permutation invariant regardless of the order of the set.

- $\pi$ : permutation
Structure
A function f(X) that takes a set $X$ is pemutation invariant when it decomposes into the following form

For any function $f_\theta : \mathbb{R}^M \rightarrow \mathbb{R}^M$,
- $\sigma$ : nonlinearity function
- $\theta \in \mathbb{R}^{M\times M}$
If $f_\theta(\mathbf{x})=\sigma(\theta\mathbf{x})$, the diagonal elements of $\theta$ are equal and the non-diagonal elements are tied, then the permutation is equivalent.

I looked at the formula, and it looks like it’s just the same value except for diagonal, and the same value between diagnoal. lambda * torch.eyes(5) + gamma * torch.ones(5,5)
If you put $\mathbf{x}$ up to $f(x)=\lambda Ix \mathbf{(11^T)x})$ The summation of inputs Ix and x plus nonlinearity is permutation invariant (since the summation is independent of the permutation)
Deep Sets
We can replace the above properties with a universal approximator, i.e., we can approximate $\phi$ and $\rho$ by polynomials That is, 1) each instance $x_m$ is replaced by some representation $\phi(x_m)$ and 2) the representations are processed according to the $\rho$ network and then added. Given some meta-information $z$, the above networks are represented by a conditional mapping $\phi(x_m|z)$.
Equivariant model

If we replace this with another operation, we can get something like this

This is because max-pool has a commutative law similar to sum. In practice, we found that Max performed better than sum.
Applications and Empirical Results
Show Normally Distributed Random Numbers and Estimate Parametric Statistics

Show a list of numbers and ask for their summation text / mnist image

Shows up to 10 when learning and 100 when testing Deep sets generalize well, unlike RNNs
- point cloud classification

The points measured by LiDAR are in no particular order.
text set expansion Given a cheetah, a tiger, a task to draw a puma with a similar concept. unsupervised

image tagging Tagging all of the text that corresponds to a specific image For training, we gave them a few tags and asked them to predict the rest of the tags, and for testing, we gave them only images and asked them to predict the tags. One network to encode each element (image and tag), and one network to sum the elements to get the score of the set. -> Does it then score all the combinations of the set and pick the best? I don’t know

anomaly detection I have a CelebA with images and tags corresponding to those images, and I want to organize the images by tag and pull one image from a different group. Take a sequence of images and try to predict which one is the wrong image in the last softmax layer. With Deep sets, I got 70% of the tests right, while the baseline with FCN performed at random guess level.

A follow-up study? http://proceedings.mlr.press/v97/lee19d/lee19d.pdf With attention operations instead of pooling!