TL;DR: We introduce In-Run Data Shapley, a scalable data attribution algorithm for machine learning that calculates Data Shapley values during a single training run.
Abstract. Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for all models produced by the learning algorithm, meaning they cannot perform targeted attribution toward the specific model obtained from a single run. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible, for the first time, to perform data attribution for the foundation model pretraining stage. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
The Shapley value, originating from cooperative game theory, is a method to fairly distribute the total gains (or costs) among players based on their individual contributions. This concept is extensively used in machine learning, economics, and business analytics.
Given a set of players \( N = \{1, 2, ..., n\} \) and a value function \( v \) that assigns a real number to every subset of \( N \) (representing the total value created by that subset), the Shapley value for player \( i \) is defined as:
$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))$$
This formula sums over all subsets \( S \subseteq N \setminus \{i\} \) that do not include player \( i \), weighting each marginal contribution \( v(S \cup \{i\}) - v(S) \) by the fraction of player orderings in which exactly the members of \( S \) precede \( i \).
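To make the definition concrete, here is a minimal, self-contained Python sketch that computes exact Shapley values for a toy three-player game by enumerating every subset exactly as in the formula above (the game and function names are illustrative, not from the paper):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values; v maps a frozenset of players to a real number."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        # Enumerate every subset S of N \ {i}, as in the formula above.
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += weight * (v(S | {i}) - v(S))
        values[i] = phi
    return values

# Toy game: a coalition is worth 1 if it contains player 1, and 0 otherwise.
players = [1, 2, 3]
v = lambda S: 1.0 if 1 in S else 0.0
print(shapley_values(players, v))  # {1: 1.0, 2: 0.0, 3: 0.0}
```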
The Shapley value is the only solution concept that satisfies the following four axioms:

1. Efficiency: the individual values sum to the value of the grand coalition, \( \sum_{i \in N} \phi_i(v) = v(N) \).
2. Symmetry: two players with identical marginal contributions to every coalition receive the same value.
3. Null player: a player whose marginal contribution to every coalition is zero receives zero value.
4. Linearity: for any two games \( v \) and \( w \), \( \phi_i(v + w) = \phi_i(v) + \phi_i(w) \).
Data Shapley applies the Shapley value to assess the contribution of each data point to the final model performance. In this context, the utility function \( v \) maps an input dataset to the model's performance, such as accuracy or loss. Despite its solid theoretical foundation, computing Data Shapley values is computationally intensive because the number of possible subsets grows exponentially with the dataset size: a dataset with \( n \) points has \( 2^n \) subsets, making exact computation impractical for large datasets.
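To see why re-training-based approaches scale so poorly, consider the standard Monte Carlo permutation estimator for Data Shapley. The sketch below is hedged: `train_and_score` is a hypothetical helper that trains a model from scratch on a subset of data indices and returns its validation performance.

```python
import random

def permutation_shapley(n_points, train_and_score, n_permutations=100):
    """Monte Carlo estimate of Data Shapley via random permutations."""
    values = [0.0] * n_points
    for _ in range(n_permutations):
        perm = random.sample(range(n_points), n_points)  # random ordering
        prev_score = train_and_score([])  # utility of the empty set
        subset = []
        for i in perm:
            subset.append(i)
            score = train_and_score(subset)  # a full training run per step
            values[i] += (score - prev_score) / n_permutations
            prev_score = score
    return values
```

Each permutation requires \( n \) training runs from scratch, which is exactly the cost In-Run Data Shapley avoids by attributing contributions within a single training run.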
In this experiment, we evaluate how effectively different data attribution techniques identify the training corpus from which a paraphrased text was derived. We start by selecting a corpus from the training set and creating several paraphrased versions using GPT-4, with varying levels of paraphrasing. These paraphrased versions form our validation set. We then calculate the average value rank of the original training corpus for each of its paraphrased versions.
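For concreteness, the rank metric can be sketched as follows, where `values` is a hypothetical array of attribution scores of all training corpora for one paraphrased validation corpus (rank 1 = most valuable):

```python
import numpy as np

def value_rank(values, original_idx):
    """Rank of the original training corpus when corpora are sorted by value, descending."""
    order = np.argsort(-np.asarray(values))
    return int(np.where(order == original_idx)[0][0]) + 1
```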
As shown in the table, even for a validation corpus that is a complete rewrite but on a similar topic, the original training corpus still ranks very high according to second-order In-Run Data Shapley. These results have important implications for the ongoing discussion of copyright in generative AI. The high value of the original corpus for a non-verbatim yet semantically similar validation corpus implies that contribution does not require memorization; that is, training data contributes to generative AI even when the output does not closely resemble the input. Our experimental results suggest that training data owners should hold a certain royalty share for generated content even if it does not look similar to the copyrighted material, which aligns with ongoing discussions in AI copyright.
In-Run Data Shapley tracks cumulative data values across training steps, allowing us to evaluate the contributions of data points at different stages of training. We applied In-Run Data Shapley with a math-related validation corpus to study how the contributions of different training domains evolve over the course of training.
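As a rough illustration of the underlying mechanics (a toy sketch, not the authors' implementation), the snippet below accumulates first-order In-Run Data Shapley values for a linear regression trained with full-batch gradient descent: under a first-order Taylor approximation, each point's per-step contribution is the learning rate times the inner product between its per-sample gradient and the validation-loss gradient. The model, data, and scaling here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                      # noiseless training labels
X_val = rng.normal(size=(4, d))
y_val = X_val @ w_true              # validation set from the same model

w = np.zeros(d)
lr = 0.1
values = np.zeros(n)                # cumulative In-Run Data Shapley values
for step in range(200):
    # Per-sample gradients of the squared loss: g_i = 2 * (x_i @ w - y_i) * x_i
    residuals = X @ w - y
    per_sample_grads = 2 * residuals[:, None] * X            # shape (n, d)
    # Gradient of the mean validation loss
    g_val = 2 * X_val.T @ (X_val @ w - y_val) / len(y_val)   # shape (d,)
    # First-order contribution of each point at this step: lr * <g_i / n, g_val>
    values += lr * (per_sample_grads @ g_val) / n
    # Standard full-batch gradient descent update on the mean training loss
    w -= lr * per_sample_grads.mean(axis=0)

print(values.round(3))  # higher value = larger reduction of validation loss
```

Because these inner products are accumulated alongside the usual training computation, the overhead over a plain training run stays small; the second-order variant additionally accounts, roughly speaking, for interactions between data points within a step.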
The above animation illustrates the percentage of total value attributed to each domain, excluding domains whose total value is below zero (this quantity is sketched in code after the list). Several observations stand out:

(1) Rapid initial changes: the value composition shifts rapidly at the beginning of training and stabilizes over time.

(2) Stable value proportions: in later stages, the value shares settle, with ArXiv's share reflecting the relative abundance of math content in that domain.

(3) General corpora contributions: the Pile-CC domain, which contains general web crawls, initially shows positive contributions but quickly drops to negative and converges to zero. This indicates that general corpora are crucial in the early stages for learning basic language patterns and common knowledge, but their relevance diminishes as training progresses and the model focuses on specialized topics.
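The plotted quantity can be expressed compactly; in the sketch below, `domain_values` is a hypothetical mapping from domain name to its cumulative In-Run Data Shapley value at a given training step:

```python
def value_shares(domain_values):
    """Each domain's share of total value, excluding negatively valued domains."""
    positive = {d: v for d, v in domain_values.items() if v > 0}
    total = sum(positive.values())
    return {d: v / total for d, v in positive.items()} if total > 0 else {}
```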
Even carefully curated pre-training corpora still contain data points that can adversely affect the training process. Identifying and removing these data points can accelerate model convergence and enhance overall performance, thereby saving computational resources. In this experiment, we demonstrate the effectiveness of In-Run Data Shapley in assessing the data quality of a subset of the Pile dataset. We selected 160k corpora (approximately 160 million tokens), trained a GPT-2 model on this subset, and computed data attribution values using the Pile's validation set. After filtering out all negatively valued corpora and retraining the GPT-2 model on the cleaned subset, we observed a significant improvement in convergence: for both first- and second-order In-Run Data Shapley, retraining required around 25% fewer iterations to reach a test loss of 3.75. Surprisingly, our analysis revealed that around 16% of the training corpora had negative second-order In-Run Data Shapley values. This experiment implies that there is still significant room for data curation even in well-curated public datasets such as the Pile.
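The curation step itself is a simple filter over the attribution scores; a minimal sketch, assuming hypothetical `corpora` and per-corpus `values` arrays and an existing `train` routine:

```python
def curate(corpora, values):
    """Keep only corpora with non-negative In-Run Data Shapley values."""
    kept = [c for c, v in zip(corpora, values) if v >= 0]
    print(f"removed {len(corpora) - len(kept)} of {len(corpora)} corpora")
    return kept

# cleaned = curate(corpora, values)
# model = train(cleaned)  # reached test loss 3.75 in ~25% fewer iterations above
```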
@article{wang2024data,
title={Data Shapley in One Training Run},
author={Wang, Jiachen T and Mittal, Prateek and Song, Dawn and Jia, Ruoxi},
journal={arXiv preprint arXiv:2406.11011},
year={2024}
}