2nd Workshop on Navigating and Addressing Data Problems for Foundation Models
(DATA-FM @ ICLR 2025)


Foundation models (FMs) have become central to modern machine learning, with data playing a crucial role in their development and sparking increased attention to data-related challenges such as curation and attribution. Adapting traditional data-centric methods to FMs is challenging due to the scale of both data and model architectures, necessitating interdisciplinary collaboration and community efforts. Building on the success of the first Data Problems in Foundation Models workshop at ICLR 2024, the second DATA-FM workshop will address persistent and emerging data-related challenges in FM deployment. While longstanding issues in data collection, curation, and synthesis remain relevant, new challenges have arisen as FMs are integrated into a growing number of applications and become increasingly multi-modal. Concurrently, the societal impact of AI has intensified, highlighting concerns such as data copyright. These evolving challenges emphasize the need for continued, focused discussions on data-related issues in FM development. Our goals include fostering a comprehensive understanding of these challenges across the entire FM pipeline and creating a platform for interdisciplinary researchers to connect, collaborate, and drive progress. We hope this workshop will serve as a catalyst for innovative solutions to critical data challenges, shaping the future of FMs and their wide-ranging applications.

In case of any issues or questions, feel free to email the organizers at tianhaowang@princeton.edu.



Important Dates


Submission Deadline: Feb 7th, 2025, 11:59pm Anywhere on Earth (AoE)
Decision Notification: March 5th, 2025
Camera-ready Deadline: TBA, 2025
Workshop Date: TBA, 2025 @ Singapore EXPO, Singapore



Call for Papers


We invite submissions to the 2nd DATA-FM workshop, focusing on data-centric techniques and foundation model (FM) development. We encourage submissions across a wide range of topics, including but not limited to:

Data Collection and Curation for Foundation Models

  • Practical strategies for curating data (e.g., filtering, mixing, repairing) tailored to FM training stages.
  • Extending data curation techniques to Retrieval-Augmented Generation (RAG), multimodal settings, and LLM agents.
  • Theoretical frameworks for guiding data selection and scaling laws for foundation models.

Data Attribution, Interpretability, and Data Marketplaces

  • Efficient techniques for attributing model outputs to specific training data.
  • Evaluating and comparing data attribution methods.
  • Economic models for data pricing and the design of data marketplaces that ensure fair compensation.

Law and Technical Solutions for Data Copyright Protection

  • Mitigation strategies and mathematical frameworks for addressing copyright issues in FM training data.
  • Connections between copyright, privacy, and fairness, including adaptations of techniques like machine unlearning.

Synthetic Data and Model Collapse

  • High-quality synthetic data generation and its impact on FM performance, robustness, and safety.
  • Understanding and mitigating model collapse through theoretical and empirical investigations.

Data and Society (Safety, Privacy, Fairness, and Other Social Impacts)

  • Improving AI safety, privacy, and fairness through data-centric approaches.
  • Addressing the side effects of data curation on fairness and ethics in FMs.

Benchmarks and Evaluations

  • Designing evaluation metrics for data-centric techniques and creating reliable dataset benchmarks for FMs.
  • Identifying and addressing pitfalls in existing dataset benchmarks, such as test data contamination.

Submission Guidelines

We welcome submissions to two paper tracks for the workshop:

  1. Regular/Position Papers Track: Submissions may be up to 10 pages, excluding references and appendices. All papers must be formatted using the DATA-FM template (see below).
  2. Tiny Papers Track: This track encourages submissions that describe early-stage research, including modest theoretical results, novel observations from preliminary experiments, or new perspectives on existing problems. Submissions must be 3-5 pages and use the DATA-FM template (see below).

Paper templates and style files (adapted from the ICML template) can be found in this Overleaf template. Submissions must follow the template and style, be properly anonymized (for double-blind review), and stay within the page limit for the chosen track (excluding references and appendices). Accepted papers will be shared on OpenReview, but the workshop will remain non-archival.

Submissions should be uploaded via the ICLR 2025 DATA-FM Workshop Submission portal on OpenReview.


Author-Reviewer Policy

The workshop program committee plays an important role in identifying and giving feedback on up-and-coming work that would most benefit from discussion and visibility at the workshop. To sustain our review and program selection processes, we expect at least one author of each submitted paper to volunteer to participate as a reviewer for the DATA-FM 2025 workshop.


Regarding Tiny Papers Track

This year, ICLR is discontinuing the separate "Tiny Papers" track and is instead requiring each workshop to accept short paper submissions (3–5 pages in ICLR format, with the exact page length determined by each workshop), with an eye towards inclusion; see ICLR 2025 Tiny Papers Track for more details. Authors of these papers will be earmarked for potential funding from ICLR, but they must submit a separate application for Financial Assistance, which evaluates their eligibility. The application for Financial Assistance to attend ICLR 2025 will become available on the ICLR 2025 website at the beginning of February and close on March 2nd.



Schedule (Tentative)


All times listed below are in Singapore Time SGT (GMT+8).

MORNING SESSION
9:00 - 9:05 AM Opening Remarks
9:05 - 9:35 AM Invited Talk: Ari Morcos
9:35 - 10:05 AM Invited Talk: Baharan Mirzasoleiman
10:05 - 11:20 AM Poster Session I
11:20 - 11:30 AM Coffee Break
11:30 - 12:00 PM Invited Talk: Peter Henderson
12:00 - 12:10 PM Spotlight Presentation 1
12:10 - 12:20 PM Spotlight Presentation 2
12:20 - 12:30 PM Spotlight Presentation (Tiny Papers)
12:30 - 1:30 PM Lunch Break
AFTERNOON SESSION
1:30 - 2:00 PM Invited Talk: Bryan Low
2:00 - 2:30 PM Invited Talk: Danqi Chen
2:30 - 2:40 PM Spotlight Presentation 3
2:40 - 2:50 PM Spotlight Presentation 4
2:50 - 4:05 PM Poster Session II
4:05 - 4:35 PM Invited Talk: Kyle Lo or Luca Soldaini
4:35 - 5:05 PM Panel Discussion
5:05 - 5:10 PM Closing Remarks



Speakers


Danqi Chen
Princeton University
Peter Henderson
Princeton University
Kyle Lo
Allen Institute for AI (Ai2)
Bryan Low
National University of Singapore (NUS)
Baharan Mirzasoleiman
University of California, Los Angeles (UCLA)
Ari Morcos
DatologyAI
Luca Soldaini
Allen Institute for AI (Ai2)




Organizers


Ruoxi Jia
Virginia Tech
Pang Wei Koh
University of Washington
Dawn Song
University of California, Berkeley
Feiyang Kang
Virginia Tech
Hoang Anh Just
Virginia Tech
Jiachen (Tianhao) Wang
Princeton University





