TL;DR: We introduce In-Run Data Shapley, a scalable data attribution algorithm for machine learning that calculates Data Shapley values during a single training run.
Abstract. Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for all models produced by the learning algorithm, meaning they cannot perform targeted attribution toward the specific model obtained from a single run. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible, for the first time, to perform data attribution for the foundation model pretraining stage. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
The Shapley value, originating from cooperative game theory, is a method to fairly distribute the total gains (or costs) among players based on their individual contributions. This concept is extensively used in machine learning, economics, and business analytics.
Given a set of players \( N = \{1, 2, ..., n\} \) and a value function \( v \) that assigns a real number to every subset of \( N \) (representing the total value created by that subset), the Shapley value for player \( i \) is defined as:
$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))$$
This formula sums over all subsets \( S \subseteq N \setminus \{i\} \) that do not include player \( i \), weighting each marginal contribution \( v(S \cup \{i\}) - v(S) \) by the fraction of player orderings in which exactly the members of \( S \) precede \( i \).
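To make the definition concrete, here is a minimal, self-contained Python sketch that computes exact Shapley values for a toy three-player game by enumerating every subset exactly as in the formula above (the game and function names are illustrative, not from the paper):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values; v maps a frozenset of players to a real number."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        # Enumerate every subset S of N \ {i}, as in the formula above.
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += weight * (v(S | {i}) - v(S))
        values[i] = phi
    return values

# Toy game: a coalition is worth 1 if it contains player 1, and 0 otherwise.
players = [1, 2, 3]
v = lambda S: 1.0 if 1 in S else 0.0
print(shapley_values(players, v))  # {1: 1.0, 2: 0.0, 3: 0.0}
```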
The Shapley value is the only solution concept that satisfies the following four axioms:

1. Efficiency: the individual values sum to the value of the grand coalition, \( \sum_{i \in N} \phi_i(v) = v(N) \).
2. Symmetry: two players with identical marginal contributions to every coalition receive the same value.
3. Null player: a player whose marginal contribution to every coalition is zero receives zero value.
4. Linearity: for any two games \( v \) and \( w \), \( \phi_i(v + w) = \phi_i(v) + \phi_i(w) \).
Data Shapley applies the Shapley value to assess the contribution of each data point to the final model performance. In this context, the utility function \( v \) maps an input dataset to the model's performance, such as accuracy or loss. Despite its solid theoretical foundation, computing Data Shapley values is computationally intensive because the number of possible subsets grows exponentially with the dataset size: a dataset with \( n \) points has \( 2^n \) subsets, making exact computation impractical for large datasets.
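To see why re-training-based approaches scale so poorly, consider the standard Monte Carlo permutation estimator for Data Shapley. The sketch below is hedged: `train_and_score` is a hypothetical helper that trains a model from scratch on a subset of data indices and returns its validation performance.

```python
import random

def permutation_shapley(n_points, train_and_score, n_permutations=100):
    """Monte Carlo estimate of Data Shapley via random permutations."""
    values = [0.0] * n_points
    for _ in range(n_permutations):
        perm = random.sample(range(n_points), n_points)  # random ordering
        prev_score = train_and_score([])  # utility of the empty set
        subset = []
        for i in perm:
            subset.append(i)
            score = train_and_score(subset)  # a full training run per step
            values[i] += (score - prev_score) / n_permutations
            prev_score = score
    return values
```

Each permutation requires \( n \) training runs from scratch, which is exactly the cost In-Run Data Shapley avoids by attributing contributions within a single training run.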
In this experiment, we evaluate how effectively different data attribution techniques identify the training corpus from which a paraphrased text was derived. We start by selecting a corpus from the training set and creating several paraphrased versions using GPT-4, with varying levels of paraphrasing. These paraphrased versions form our validation set. We then calculate the average value rank of the original training corpus for each of its paraphrased versions.
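For concreteness, the rank metric can be sketched as follows, where `values` is a hypothetical array of attribution scores of all training corpora for one paraphrased validation corpus (rank 1 = most valuable):

```python
import numpy as np

def value_rank(values, original_idx):
    """Rank of the original training corpus when corpora are sorted by value, descending."""
    order = np.argsort(-np.asarray(values))
    return int(np.where(order == original_idx)[0][0]) + 1
```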
As shown in the table, even for a validation corpus that is a complete rewrite but on a similar topic, the original training corpus still ranks very high according to second-order In-Run Data Shapley. These results have important implications for the ongoing discussion of copyright in generative AI. The high value of the original corpus for a non-verbatim yet semantically similar validation corpus implies that contribution does not require memorization; that is, training data contributes to generative AI even when the output does not closely resemble the input. Our experimental results suggest that training data owners should hold a certain royalty share for generated content even if it does not look similar to the copyrighted material, which aligns with ongoing discussions in AI copyright.
In-Run Data Shapley tracks cumulative data values across training steps, allowing us to evaluate the contributions of data points at different stages of training. We applied In-Run Data Shapley with a math-related validation corpus to study how the contributions of different training domains evolve over the course of training.
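As a rough illustration of the underlying mechanics (a toy sketch, not the authors' implementation), the snippet below accumulates first-order In-Run Data Shapley values for a linear regression trained with full-batch gradient descent: under a first-order Taylor approximation, each point's per-step contribution is the learning rate times the inner product between its per-sample gradient and the validation-loss gradient. The model, data, and scaling here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                      # noiseless training labels
X_val = rng.normal(size=(4, d))
y_val = X_val @ w_true              # validation set from the same model

w = np.zeros(d)
lr = 0.1
values = np.zeros(n)                # cumulative In-Run Data Shapley values
for step in range(200):
    # Per-sample gradients of the squared loss: g_i = 2 * (x_i @ w - y_i) * x_i
    residuals = X @ w - y
    per_sample_grads = 2 * residuals[:, None] * X            # shape (n, d)
    # Gradient of the mean validation loss
    g_val = 2 * X_val.T @ (X_val @ w - y_val) / len(y_val)   # shape (d,)
    # First-order contribution of each point at this step: lr * <g_i / n, g_val>
    values += lr * (per_sample_grads @ g_val) / n
    # Standard full-batch gradient descent update on the mean training loss
    w -= lr * per_sample_grads.mean(axis=0)

print(values.round(3))  # higher value = larger reduction of validation loss
```

Because these inner products are accumulated alongside the usual training computation, the overhead over a plain training run stays small; the second-order variant additionally accounts, roughly speaking, for interactions between data points within a step.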
The above animation illustrates the percentage of total value attributed to each domain, excluding domains whose total value is below zero (this quantity is sketched in code after the list). Several observations stand out:

(1) Rapid initial changes: the value composition shifts rapidly at the beginning of training and stabilizes over time.

(2) Stable value proportions: in later stages, the value shares settle, with ArXiv's share reflecting the relative abundance of math content in that domain.

(3) General corpora contributions: the Pile-CC domain, which contains general web crawls, initially shows positive contributions but quickly drops to negative and converges to zero. This indicates that general corpora are crucial in the early stages for learning basic language patterns and common knowledge, but their relevance diminishes as training progresses and the model focuses on specialized topics.
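The plotted quantity can be expressed compactly; in the sketch below, `domain_values` is a hypothetical mapping from domain name to its cumulative In-Run Data Shapley value at a given training step:

```python
def value_shares(domain_values):
    """Each domain's share of total value, excluding negatively valued domains."""
    positive = {d: v for d, v in domain_values.items() if v > 0}
    total = sum(positive.values())
    return {d: v / total for d, v in positive.items()} if total > 0 else {}
```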
Even carefully curated pre-training corpora still contain data points that can adversely affect the training process. Identifying and removing these data points can accelerate model convergence and enhance overall performance, thereby saving computational resources. In this experiment, we demonstrate the effectiveness of In-Run Data Shapley in assessing the data quality of a subset of the Pile dataset. We selected 160k corpora (approximately 160 million tokens), trained a GPT-2 model on this subset, and computed data attribution values using the Pile's validation set. After filtering out all negatively valued corpora and retraining the GPT-2 model on the cleaned subset, we observed a significant improvement in convergence: for both first- and second-order In-Run Data Shapley, retraining required around 25% fewer iterations to reach a test loss of 3.75. Surprisingly, our analysis revealed that around 16% of the training corpora had negative second-order In-Run Data Shapley values. This experiment implies that there is still significant room for data curation even in well-curated public datasets such as the Pile.
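The curation step itself is a simple filter over the attribution scores; a minimal sketch, assuming hypothetical `corpora` and per-corpus `values` arrays and an existing `train` routine:

```python
def curate(corpora, values):
    """Keep only corpora with non-negative In-Run Data Shapley values."""
    kept = [c for c, v in zip(corpora, values) if v >= 0]
    print(f"removed {len(corpora) - len(kept)} of {len(corpora)} corpora")
    return kept

# cleaned = curate(corpora, values)
# model = train(cleaned)  # reached test loss 3.75 in ~25% fewer iterations above
```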
@article{wang2024data,
title={Data Shapley in One Training Run},
author={Wang, Jiachen T and Mittal, Prateek and Song, Dawn and Jia, Ruoxi},
journal={arXiv preprint arXiv:2406.11011},
year={2024}
}