2nd Workshop on Navigating and Addressing Data Problems for Foundation Models
(DATA-FM @ ICLR 2025)

Foundation models (FMs) have become central to modern machine learning, with data playing a crucial role in their development and sparking increased attention to data-related challenges such as curation and attribution. Adapting traditional data-centric methods to FMs is challenging due to the scale of both data and model architectures, necessitating interdisciplinary collaboration and community efforts. Building on the success of the first Data Problems in Foundation Models workshop at ICLR 2024, the second DATA-FM workshop will address persistent and emerging data-related challenges in FM deployment. While longstanding issues in data collection, curation, and synthesis remain relevant, new challenges have arisen as FMs are integrated into a growing number of applications and become increasingly multi-modal. Concurrently, the societal impact of AI has intensified, highlighting concerns such as data copyright. These evolving challenges emphasize the need for continued, focused discussions on data-related issues in FM development. Our goals include fostering a comprehensive understanding of these challenges across the entire FM pipeline and creating a platform for interdisciplinary researchers to connect, collaborate, and drive progress. We hope this workshop will serve as a catalyst for innovative solutions to critical data challenges, shaping the future of FMs and their wide-ranging applications.

In case of any issues or questions, feel free to email the organizers at tianhaowang@princeton.edu.

Important Dates

Submission Deadline	February 7th, 2025, 11:59pm Anywhere on Earth (AoE)
Decision Notification	March 5th, 2025
Camera-ready Deadline	TBA, 2025
Workshop Date	TBA, 2025 @ Singapore EXPO, Singapore

Call for Papers

We invite submissions to the 2nd DATA-FM workshop, focusing on data-centric techniques and foundation model (FM) development. We encourage submissions across a wide range of topics, including but not limited to:

Data Collection and Curation for Foundation Models

Practical strategies for curating data (e.g., filtering, mixing, repairing) tailored to FM training stages.
Extending data curation techniques to Retrieval-Augmented Generation (RAG), multimodal settings, and LLM agents.
Theoretical frameworks for guiding data selection and scaling laws for foundation models.

Data Attribution, Interpretability, and Data Marketplaces

Efficient techniques for attributing model outputs to specific training data.
Evaluating and comparing data attribution methods.
Economic models for data pricing and the design of data marketplaces that ensure fair compensation.

Law and Technical Solutions for Data Copyright Protection

Mitigation strategies and mathematical frameworks for addressing copyright issues in FM training data.
Connections between copyright, privacy, and fairness, including adaptations of techniques like machine unlearning.

Synthetic Data and Model Collapse

High-quality synthetic data generation and its impact on FM performance, robustness, and safety.
Understanding and mitigating model collapse through theoretical and empirical investigations.

Data and Society (Safety, Privacy, Fairness, and Other Social Impacts)

Improving AI safety, privacy, and fairness through data-centric approaches.
Addressing the side effects of data curation on fairness and ethics in FMs.

Benchmarks and Evaluations

Designing evaluation metrics for data-centric techniques and creating reliable dataset benchmarks for FMs.
Identifying and addressing pitfalls in existing dataset benchmarks, such as test data contamination.

Submission Guidelines

We welcome submissions to two paper tracks for the workshop:

Regular/Position Papers Track: Submissions may be up to 10 pages, excluding references and appendices. All papers must be formatted using the DATA-FM template (see below).
Tiny Papers Track: This track encourages submissions that describe early-stage research, including modest theoretical results, novel observations from preliminary experiments, or new perspectives on existing problems. Submissions must be 3-5 pages and use the DATA-FM template (see below).

Paper templates and style files (adapted from the ICML template) can be found in this Overleaf template. Submissions must follow the template and style, be properly anonymized for double-blind review, and not exceed the page limit for the chosen track (excluding references and appendices). Accepted papers will be shared on OpenReview, but the workshop will remain non-archival.

Submissions should be uploaded via the ICLR 2025 DATA-FM Workshop Submission portal on OpenReview.

Author-Reviewer Policy

The workshop program committee plays an important role in identifying and giving feedback on up-and-coming work that would most benefit from discussion and visibility at the workshop. To sustain our review and program selection processes, we expect at least one author of each submitted paper to volunteer to participate as a reviewer for the DATA-FM 2025 workshop.

Regarding the Tiny Papers Track

This year, ICLR is discontinuing the separate “Tiny Papers” track and instead requiring each workshop to accept short paper submissions (3–5 pages in ICLR format, with the exact page length determined by each workshop), with an eye towards inclusion; see ICLR 2025 Tiny Papers Track for more details. Authors of these papers will be earmarked for potential funding from ICLR, but need to submit a separate application for Financial Assistance that evaluates their eligibility. This application for Financial Assistance to attend ICLR 2025 will become available on the ICLR 2025 website at the beginning of February and close on March 2nd.

Schedule (Tentative)

All times listed below are in Singapore Time (SGT, GMT+8).

MORNING SESSION 🌅🕘
9:00 - 9:05 AM	Opening Remarks 📖
9:05 - 9:35 AM	Invited Talk: Ari Marcos 🤝🗣️
9:35 - 10:05 AM	Invited Talk: Baharan Mirzasoleiman 🤝🗣️
10:05 - 11:20 AM	Poster Session I 🪧
11:20 - 11:30 AM	Coffee Break ☕
11:30 AM - 12:00 PM	Invited Talk: Peter Henderson 🤝🗣️
12:00 - 12:10 PM	Spotlight Presentation 1 📊
12:10 - 12:20 PM	Spotlight Presentation 2 📊
12:20 - 12:30 PM	Spotlight Presentation (Tiny Papers) 📊
12:30 - 1:30 PM	Lunch Break 🍲
AFTERNOON SESSION 🌇🕐
1:30 - 2:00 PM	Invited Talk: Bryan Low 🤝🗣️
2:00 - 2:30 PM	Invited Talk: Danqi Chen 🤝🗣️
2:30 - 2:40 PM	Spotlight Presentation 3 📊
2:40 - 2:50 PM	Spotlight Presentation 4 📊
2:50 - 4:05 PM	Poster Session II 🪧
4:05 - 4:35 PM	Invited Talk: Kyle Lo or Luca Soldaini 🤝🗣️
4:35 - 5:05 PM	Panel Discussion 👥💬
5:05 - 5:10 PM	Closing Remarks 📗