2nd Workshop on Navigating and Addressing Data Problems for Foundation Models
(DATA-FM @ ICLR 2025)
Foundation models (FMs) have become central to modern machine learning, with data playing a crucial role in their development and sparking increased attention to data-related challenges such as curation and attribution. Adapting traditional data-centric methods to FMs is challenging due to the scale of both data and model architectures, necessitating interdisciplinary collaboration and community efforts. Building on the success of the first Data Problems in Foundation Models workshop at ICLR 2024, the second DATA-FM workshop will address persistent and emerging data-related challenges in FM deployment. While longstanding issues in data collection, curation, and synthesis remain relevant, new challenges have arisen as FMs are integrated into a growing number of applications and become increasingly multi-modal. Concurrently, the societal impact of AI has intensified, highlighting concerns such as data copyright. These evolving challenges emphasize the need for continued, focused discussions on data-related issues in FM development. Our goals include fostering a comprehensive understanding of these challenges across the entire FM pipeline and creating a platform for interdisciplinary researchers to connect, collaborate, and drive progress. We hope this workshop will serve as a catalyst for innovative solutions to critical data challenges, shaping the future of FMs and their wide-ranging applications.
In case of any issues or questions, feel free to email the organizers at tianhaowang@princeton.edu.
Important Dates
Submission Deadline | Feb 7th, 2025, 11:59pm Anywhere on Earth (AoE) |
Decision Notification | March 5th, 2025 |
Camera-ready Deadline | TBA, 2025 |
Workshop Date | TBA, 2025 @ Singapore EXPO, Singapore |
Call for Papers
We invite submissions to the 2nd DATA-FM workshop, focusing on data-centric techniques and foundation model (FM) development. We encourage submissions across a wide range of topics, including but not limited to:
Data Collection and Curation for Foundation Models
- Practical strategies for curating data (e.g., filtering, mixing, repairing) tailored to FM training stages.
- Extending data curation techniques to Retrieval-Augmented Generation (RAG), multimodal settings, and LLM agents.
- Theoretical frameworks for guiding data selection and scaling laws for foundation models.
Data Attribution, Interpretability, and Data Marketplaces
- Efficient techniques for attributing model outputs to specific training data.
- Evaluating and comparing data attribution methods.
- Economic models for data pricing and the design of data marketplaces that ensure fair compensation.
Law and Technical Solutions for Data Copyright Protection
- Mitigation strategies and mathematical frameworks for addressing copyright issues in FM training data.
- Connections between copyright, privacy, and fairness, including adaptations of techniques like machine unlearning.
Synthetic Data and Model Collapse
- High-quality synthetic data generation and its impact on FM performance, robustness, and safety.
- Understanding and mitigating model collapse through theoretical and empirical investigations.
Data and Society (Safety, Privacy, Fairness, and Other Social Impacts)
- Improving AI safety, privacy, and fairness through data-centric approaches.
- Addressing the side effects of data curation on fairness and ethics in FMs.
Benchmarks and Evaluations
- Designing evaluation metrics for data-centric techniques and creating reliable dataset benchmarks for FMs.
- Identifying and addressing pitfalls in existing dataset benchmarks, such as test data contamination.
Submission Guidelines
We welcome submissions to two paper tracks for the workshop:
- Regular/Position Papers Track: Submissions may be up to 10 pages, excluding references and appendices. All papers must be formatted using the DATA-FM template (see below).
- Tiny Papers Track: This track encourages submissions describing early-stage research, including modest theoretical results, novel observations from preliminary experiments, or new perspectives on existing problems. Submissions must be 3-5 pages and use the DATA-FM template (see below).
Paper templates and style files (adapted from the ICML template) can be found in this Overleaf template. Submissions must follow the template and style, be properly anonymized for double-blind review, and not exceed the page limit for the chosen track (excluding references and appendices). Accepted papers will be shared on OpenReview, but the workshop will remain non-archival.
Submissions should be uploaded via the ICLR 2025 DATA-FM Workshop Submission portal on OpenReview.
Author-Reviewer Policy
The workshop program committee plays an important role in identifying and giving feedback on up-and-coming work that would most benefit from discussion and visibility at the workshop. To sustain our review and program selection processes, we expect at least one author of each submitted paper to volunteer to participate as a reviewer for the DATA-FM 2025 workshop.
Regarding Tiny Papers Track
This year, ICLR is discontinuing the separate "Tiny Papers" track and is instead requiring each workshop to accept short paper submissions (3-5 pages in ICLR format, with the exact page length determined by each workshop), with an eye towards inclusion; see ICLR 2025 Tiny Papers Track for more details. Authors of these papers will be earmarked for potential funding from ICLR, but must submit a separate application for Financial Assistance, which evaluates their eligibility. The application for Financial Assistance to attend ICLR 2025 will become available on the ICLR 2025 website at the beginning of February and will close on March 2nd.
Schedule (Tentative)
All times listed below are in Singapore Time (SGT, GMT+8).
MORNING SESSION | |
9:00 - 9:05 AM | Opening Remarks |
9:05 - 9:35 AM | Invited Talk: Ari Morcos |
9:35 - 10:05 AM | Invited Talk: Baharan Mirzasoleiman |
10:05 - 11:20 AM | Poster Session I |
11:20 - 11:30 AM | Coffee Break |
11:30 - 12:00 PM | Invited Talk: Peter Henderson |
12:00 - 12:10 PM | Spotlight Presentation 1 |
12:10 - 12:20 PM | Spotlight Presentation 2 |
12:20 - 12:30 PM | Spotlight Presentation (Tiny Papers) |
12:30 - 1:30 PM | Lunch Break |
AFTERNOON SESSION | |
1:30 - 2:00 PM | Invited Talk: Bryan Low |
2:00 - 2:30 PM | Invited Talk: Danqi Chen |
2:30 - 2:40 PM | Spotlight Presentation 3 |
2:40 - 2:50 PM | Spotlight Presentation 4 |
2:50 - 4:05 PM | Poster Session II |
4:05 - 4:35 PM | Invited Talk: Kyle Lo or Luca Soldaini |
4:35 - 5:05 PM | Panel Discussion |
5:05 - 5:10 PM | Closing Remarks |
Speakers
Princeton University
Princeton University
Allen Institute for AI (Ai2)
National University of Singapore (NUS)
University of California, Los Angeles (UCLA)
Datology AI
Allen Institute for AI (Ai2)
Organizers
Virginia Tech
University of Washington
University of California, Berkeley
Sony AI
Virginia Tech
Virginia Tech
Princeton University