Data in Generative Models @ ICML 2025
The Bad, the Ugly, and the Greats - Building reliable and trustworthy Generative AI must start with high-quality and responsibly managed training data.
⭐ West Ballroom A, Vancouver Convention Centre, 8:55 AM - 5 PM PDT, July 19th, 2025 ⭐
Generative models have become extremely powerful and are now integral to many aspects of daily life, from the creative arts to customer service. Given their increasing interaction with people, ensuring their trustworthiness is crucial. This workshop centers on the idea that the safety and reliability of generative models are deeply connected to the nature and treatment of their training data. We aim to explore the hypothesis that building reliable and trustworthy artificial intelligence (AI) systems based on generative models must start with high-quality and responsibly managed data.
The workshop will focus on several key areas where training data impacts the trustworthiness of generative models. Among others, we will address 1) privacy concerns, highlighting how the improper inclusion and handling of sensitive information in training data can lead to significant privacy violations; 2) safety risks, such as backdoors and data poisoning, which threaten the reliability of generated outputs; and 3) biases in training data, which can cause models to perpetuate or even amplify societal biases, resulting in unfair outcomes.
Through expert talks, panel discussions, and interactive sessions, participants will delve into these issues and explore strategies for developing safe, trustworthy, and reliable generative models. This workshop aims to foster collaboration and drive forward research to ensure that generative models, as they become more embedded in our lives, do so in a trustworthy and beneficial manner.
⭐ Awards ⭐
Best Paper - Preference Leakage: A Contamination Problem in LLM-as-a-judge
Best Poster - Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models
Schedule
| Event Start Time (PDT/GMT-07:00, Vancouver, Canada) | Session | Speaker(s) |
|---|---|---|
| 08:55 am | Welcome Remarks | Organizers |
| 09:00 am | Invited Talk 1: Dynamic & Stateful Evals of Safety on the Frontier: What can Academics do? | Eric Wong |
| 09:30 am | Oral 1: Preference Leakage: A Contamination Problem in LLM-as-a-judge | Dawei Li |
| 09:45 am | Oral 2: Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Lei Hsiung |
| 10:00 am | Coffee Break | |
| 10:30 am | Invited Talk 2: There’s No Free Lunch in Safety in Fine-tuning Large Language Models | Pin-Yu Chen |
| 11:00 am | Oral 3: Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLMs | Sicheng Lai |
| 11:15 am | Invited Talk 3: Building Trustworthy LLMs: How Data Quality Shapes Performance and Where It Falls Short? | Nouha Dziri |
| 11:45 am | Lunch Break | |
| 01:00 pm | Invited Talk 4: Data-centric LM research on an academic budget | Tatsunori Hashimoto |
| 01:30 pm | Oral 4: Training Diffusion Models with Noisy Data via SFBD Flow | Haoye Lu |
| 01:45 pm | Invited Talk 5: How (not) to hack AI? | Ivan Evtimov |
| 02:15 pm | Oral 5: Unlocking Post-hoc Dataset Inference with Synthetic Data | Pratyush Maini |
| 02:30 pm | Coffee Break | |
| 03:00 pm | Poster Session | Paper Authors |
| 03:45 pm | Invited Talk 6: On Specification Data | Serena Booth |
| 04:15 pm | Panel Discussion | Pin-Yu Chen, Eric Wong, Ivan Evtimov, Serena Booth (Moderator: Aniruddha Saha) |
| 04:45 pm | Paper Awards | Organizers |
| 04:55 pm | Closing Remarks | Organizers |
Speakers
- Serena Booth, Assistant Professor, Brown University
- Pin-Yu Chen, Principal Research Scientist, IBM Research AI
- Nouha Dziri, Research Scientist, Allen Institute for AI (AI2)
- Ivan Evtimov, Research Scientist, Meta AI Red Team
- Tatsunori Hashimoto, Assistant Professor, Stanford University
- Eric Wong, Assistant Professor, University of Pennsylvania
Call for Papers
We cordially invite submissions and participation in our “Data in Generative Models (The Bad, the Ugly, and the Greats)” workshop that will be held on July 19th, 2025 at the Forty-Second International Conference on Machine Learning (ICML) 2025 in Vancouver, Canada.
Motivation and Topics
Building upon the success of our previous workshop, BUGS@NeurIPS 2023, we will continue to explore the impact of data on AI beyond backdoor attacks. A key enhancement of the 2025 edition is a substantial expansion of the 2023 research themes, now focusing on the interaction between datasets and generative models, including large language models, diffusion models, and vision-language models. Examples of research areas include:
- Data-Centric Approach to the Safety of Generative Models (general theme): Most research in safe machine learning concentrates primarily on evaluating model properties (Papernot et al.). From a data-centric perspective, however, we focus on the safety of generative models such as LLMs and diffusion models. This strategy will create new methodologies and reveal vulnerabilities and potential threats that have not previously been sufficiently recognized by the research community.
- Data Memorization in Generative Models: Current approaches to preventing data memorization in generative models propose limited solutions, such as removing, collecting, or augmenting specific data samples (Wen et al.). This area of research aims to achieve a comprehensive understanding of data memorization, extending the analysis beyond training set distributions and model outputs. This requires examining the characteristics of individual data samples that make them susceptible to memorization, focusing on their representations across multiple internal model layers rather than solely on model outputs, as done in previous studies.
- Data Contamination in Generative Models: Including test data from downstream tasks and benchmarks in the training data of generative models poses a significant challenge to accurately measuring the models’ true effectiveness. Developing new, effective methods for detecting data contamination within generative models is essential. These approaches should focus on identifying potential contamination at the individual instance level and then use this information to evaluate broader contamination at the dataset level (a minimal illustrative sketch of this instance-to-dataset pattern follows this list).
- Data Verification in Generative Models: This area focuses on the inputs and outputs of generative models. Current methods for identifying training samples and verifying model outputs are typically assessed using academic benchmark datasets and small-scale models, as highlighted in recent work (Dubiński et al., Duan et al.). We aim to bridge this gap by focusing on data verification for large-scale generative models trained on extensive datasets, ensuring privacy protection at scale in real-world scenarios.
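As one minimal illustration of the instance-to-dataset contamination pattern mentioned above, the Python sketch below flags benchmark examples by word n-gram overlap with a training corpus and aggregates the flags into a dataset-level contamination rate. The function names, n-gram length, and flagging threshold are illustrative assumptions, not a method proposed or endorsed by the workshop.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def instance_overlap(test_example: str, training_corpus: list[str], n: int = 8) -> float:
    """Instance level: fraction of the example's n-grams also found in the training corpus."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)


def dataset_contamination_rate(test_set: list[str], training_corpus: list[str],
                               n: int = 8, threshold: float = 0.5) -> float:
    """Dataset level: share of test examples whose instance-level overlap exceeds the threshold."""
    flagged = [ex for ex in test_set
               if instance_overlap(ex, training_corpus, n) >= threshold]
    return len(flagged) / max(len(test_set), 1)


if __name__ == "__main__":
    # Toy example: one benchmark item appears verbatim in the corpus, one does not.
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    benchmark = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "a completely unrelated benchmark question about protein folding",
    ]
    print(dataset_contamination_rate(benchmark, corpus, n=5))  # prints 0.5
```

In practice, exact n-gram matching is only a weak proxy; submissions to this theme may consider paraphrased or multimodal contamination that such string-level checks miss.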
We welcome submissions related to all aspects of Data in Generative AI, including but not limited to:
- Data-Centric Approach to the Safety of Generative Models
- Data Memorization in Generative Models
- Data Contamination in Generative Models
- Data Verification in Generative Models
- Data Poisoning and Backdoors in Generative Models
- Generative Data for Trustworthy AI Research (e.g., synthetic datasets for security studies, synthetic augmentation for robust models, etc.)
The workshop will employ a double-blind review process. Each submission will be evaluated based on the following criteria:
- Soundness of the methodology
- Relevance to the workshop
- Societal impacts
We only consider submissions that have not been published in any peer-reviewed venue, including the ICML 2025 main conference. We allow dual submissions to other workshops or conferences. The workshop is non-archival and will not have any official proceedings. All accepted papers will be allocated either a poster presentation or a talk slot.
Important Dates
- Submission deadline: ~~May 20th, 2025~~ May 25th, 2025, 11:59 PM Anywhere on Earth (AoE)
- Author notification: June 9th, 2025
- Camera-ready deadline: June 30th, 2025 11:59 PM Anywhere on Earth (AoE)
- Workshop date: July 19th, 2025 (Full-day Event)
Submission Instructions
Papers should be submitted to OpenReview: https://openreview.net/group?id=ICML.cc/2025/Workshop/DIG-BUG
We welcome both long (up to 6 pages) and short/position (up to 4 pages) papers, excluding references, acknowledgments, and appendices. Please use the submission template provided here. Submissions must be anonymized in accordance with the ICML double-blind reviewing guidelines and must adhere to the ICML Code of Conduct and Code of Ethics. Accepted papers will be hosted on the workshop website but are considered non-archival and can be submitted to other workshops, conferences, or journals if their submission policies allow.
Workshop Swag
Our workshop T-shirt was designed by Oindrila Ghosh, Aniruddha Saha, ChatGPT 4o, Han Nguyen and Khoa D Doan.
If you want to take a look at the co-design process with ChatGPT, here are the chats (1, 2, 3, 4, 5).
Workshop Sponsors
⭐ Please reach out if you would like to sponsor our workshop. ⭐
Organizers
- Franziska Boenisch, CISPA Helmholtz Center for Information Security
- Khoa D Doan, CECS-VinUniversity & VinUni-Illinois Smart Health Center
- Adam Dziedzic, CISPA Helmholtz Center for Information Security
- Bo Li, University of Illinois at Urbana-Champaign
- Viet Anh Nguyen, Chinese University of Hong Kong
- Aniruddha Saha, Independent Researcher
- Zhenting Wang, Rutgers University
- Heather Zheng, University of Chicago