
problem : Is it a problem in itself that the model memorized phrases from the training data? If the phrases are common, that is arguably fine. The real problem is when the model memorizes rare data.
solution : Train models on different subsets of the data, and define harmful memorization as the case where model performance on a sample changes significantly depending on whether that sample was present in the training subset.
Counterfactual memorization measures the expected change in a model’s prediction when a particular example is excluded from the training set.
We train many models independently, each on a random subset of the training data. For a given training example x, we measure the difference between the next-token prediction accuracy on x of the models whose subset contained x (the IN models) and that of the models whose subset did not (the OUT models).
More formally, a training example x is counterfactually memorized when the model predicts x accurately if and only if it was trained on x.
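The IN/OUT estimation above can be sketched as follows. This is a minimal sketch, assuming the per-model accuracies are already computed; the subset rate, model count, and the random accuracy matrix standing in for real evaluations are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

n_examples, n_models = 1000, 40
# in_mask[i, j] = True if example i was in model j's random 25% training subset
in_mask = rng.random((n_examples, n_models)) < 0.25
# acc[i, j] = per-token next-token accuracy of model j evaluated on example i
# (random placeholder values standing in for real model evaluations)
acc = rng.random((n_examples, n_models))

def counterfactual_memorization(acc: np.ndarray, in_mask: np.ndarray) -> np.ndarray:
    """mem(x) = mean accuracy of IN models on x - mean accuracy of OUT models on x."""
    in_acc = np.where(in_mask, acc, np.nan)    # keep only IN-model accuracies
    out_acc = np.where(~in_mask, acc, np.nan)  # keep only OUT-model accuracies
    return np.nanmean(in_acc, axis=1) - np.nanmean(out_acc, axis=1)

mem = counterfactual_memorization(acc, in_mask)
```

With real accuracies, a large positive `mem[i]` means the models only predict example i well when it was in their training subset, i.e. it is counterfactually memorized.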
- We define counterfactual memorization in neural LMs which gives us a principled perspective to distinguish “rare” (episodic) memorization from “common” (semantic) memorization in neural LMs (Section 2).
- We estimate counterfactual memorization on several standard text datasets, and confirm that rare memorized examples exist in all of them. We study common patterns across memorized text in all datasets and the memorization profiles of individual internet domains. (Section 3).
- We extend the definition of counterfactual memorization to counterfactual influence, and study the impact of memorized examples on the test-time prediction of the validation set examples and generated examples (Section 4).
To measure generation-time memorization, we check whether a generated sequence appears in the training data, or compare the LM's perplexity on it when the training data is hard to obtain. The difference from the counterfactual memorization we propose: a text with many near-duplicates in the training data will not register as counterfactually memorized, because many copies remain even after a subset is removed from the training set.
For training samples with low counterfactual memorization, we can see many repeated phrases (the text not highlighted in yellow in the last block).
In summary, generation-time memorization measures how likely a trained model would directly copy from the training examples, while counterfactual memorization aims to discover rare information that is memorized.
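The "directly copy from the training examples" check can be sketched as a naive verbatim substring match. This is a hypothetical helper, not the paper's actual matching procedure (which would need efficient indexing and near-duplicate handling at scale); `min_len` is an assumed threshold.

```python
def generated_in_training(generated: str, training_docs: list[str], min_len: int = 50) -> bool:
    """Crude generation-time memorization check: does any length-min_len
    substring of the generated text appear verbatim in a training document?"""
    if len(generated) < min_len:
        return False
    for start in range(len(generated) - min_len + 1):
        chunk = generated[start:start + min_len]
        if any(chunk in doc for doc in training_docs):
            return True
    return False
```

A real implementation would use a suffix array or n-gram index over the corpus instead of this quadratic scan.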
The experiments used T5-style models on the C4, RealNews, and Wiki datasets. Each model was trained on a 25% subset of the training data.
The results of the experiment are shown above. The models tended to memorize unusual texts (all caps, tables, bullet lists, multilingual text) more than plain prose.
Similarly to memorization, we can measure how much a sample influenced the model's predictions on other examples; this is called counterfactual influence.
To measure this, we use an analogue of the memorization formula: infl(x → x') = E_{S ∋ x}[M(f_S, x')] − E_{S ∌ x}[M(f_S, x')], where M(f_S, x') is the accuracy of a model trained on subset S when evaluated on x'. The difference from the formula above is that, to measure the impact of x on another example x', we evaluate on x' for the subsets with x and the subsets without x. In other words, the memorization above is the special case measuring the influence of x on itself: mem(x) = infl(x → x).
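For a single pair (x, x'), the influence estimate reduces to a difference of two group means over the trained models. A minimal sketch, assuming per-model accuracies on the target x' are already computed (the function name and inputs are illustrative):

```python
import numpy as np

def influence(acc_on_target: np.ndarray, x_in_subset: np.ndarray) -> float:
    """infl(x -> x'): mean accuracy on target x' of models whose training
    subset contained x, minus the mean for models whose subset did not."""
    return float(acc_on_target[x_in_subset].mean() - acc_on_target[~x_in_subset].mean())
```

For example, if the two models trained with x score 0.9 and 0.8 on x', and the two trained without it score 0.4 and 0.3, the estimated influence is 0.85 − 0.35 = 0.5.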
In general, high memorization was associated with high influence, but not in all cases: influence was noticeably lower for examples with memorization above 0.4. The reason is that a high proportion of the high-memorization examples were garbage text (meaningless repeated filler), which the model had to memorize verbatim to fit, but which carried no interesting information that could transfer to other examples.
The larger the influence, the more often the training example's text also appeared nearly verbatim in the validation set; i.e., the validation set contained near-identical copies, so the influence of the training example on them was naturally large.

In practice, the generation results for samples with high influence and high memorization looked like this:
