Data in Generative Models @ ICML 2025
The Bad, the Ugly, and the Greats - Building reliable and trustworthy Generative AI must start with high-quality and responsibly managed training data.
Generative models have become extremely powerful and are now integral to many aspects of daily life, from the creative arts to customer service. Given their increasing interaction with people, ensuring their trustworthiness is crucial. This workshop centers on the idea that the safety and reliability of generative models are deeply connected to the nature and treatment of their training data. We aim to explore the hypothesis that building reliable and trustworthy artificial intelligence (AI) systems based on generative models must start with high-quality and responsibly managed data.
The workshop will focus on several key areas where training data impacts the trustworthiness of generative models. Among others, we will address 1) privacy concerns, highlighting how the improper inclusion and handling of sensitive information in training data can lead to significant privacy violations; 2) safety risks, such as backdoors and data poisoning, which threaten the robustness of generated outputs; and 3) biases in generative models’ training data, which can cause models to perpetuate or even amplify societal biases, resulting in unfair outcomes.
Through expert talks, panel discussions, and interactive sessions, participants will delve into these issues and explore strategies for developing safe, trustworthy, and reliable generative models. This workshop aims to foster collaboration and drive forward research to ensure that generative models, as they become more embedded in our lives, do so in a trustworthy and beneficial manner.
Schedule
⭐ Coming Soon ⭐
Speakers (Tentative)
- Pin-Yu Chen, Principal Research Scientist, IBM Research AI
- Zico Kolter, Professor, Carnegie Mellon University
- Sanmi Koyejo, Assistant Professor, Stanford University
- Dawn Song, Professor, University of California, Berkeley
- Eric Wallace, Member of Technical Staff, OpenAI
- Eric Wong, Assistant Professor, University of Pennsylvania
Panelists (Tentative)
- Tatsunori Hashimoto, Assistant Professor, Stanford University
- Adina Williams, Research Scientist, Facebook AI Research
- Rex (Zhitao) Ying, Assistant Professor, Yale University
Call for Papers
We cordially invite submissions and participation in our “Data in Generative Models (The Bad, the Ugly, and the Greats)” workshop, which will be held on July 18th or July 19th, 2025 at the Forty-Second International Conference on Machine Learning (ICML 2025) in Vancouver, Canada.
Motivation and Topics
Building upon the success of our previous workshop, BUGS@NeurIPS 2023, we will continue to explore the impact of data on AI beyond backdoor attacks. The 2025 edition substantially expands the 2023 research themes, focusing on the interaction between datasets and generative models, including large language models, diffusion models, and vision-language models. Examples of research areas include:
- Data-Centric Approach to the Safety of Generative Models (general theme): Most research in safe machine learning concentrates primarily on evaluating model properties (Papernot et al.). From a data-centric perspective, we instead focus on the safety of generative models, such as LLMs and diffusion models. This strategy will create new methodologies and reveal vulnerabilities and potential threats that have not previously been sufficiently recognized by the research community.
- Data Memorization in Generative Models: Current approaches to preventing data memorization in generative models offer limited solutions, such as removing, collecting, or augmenting specific data samples (Wen et al.). This area of research aims at a comprehensive understanding of data memorization, extending the analysis beyond training set distributions and model outputs. This requires examining the characteristics of individual data samples that make them susceptible to memorization, focusing on their representations across multiple internal model layers rather than solely on model outputs, as in previous studies.
- Data Contamination in Generative Models: The inclusion of test data from downstream tasks and benchmarks in the training data of generative models poses a significant challenge to accurately measuring the models’ true effectiveness. Developing new, effective methods for detecting data contamination within generative models is essential. These approaches should first identify potential contamination at the individual instance level and then use this information to evaluate broader contamination at the dataset level.
- Data Verification in Generative Models: This area focuses on the inputs and outputs of generative models. Current methods for identifying training samples and verifying model outputs are typically assessed on academic benchmark datasets and small-scale models, as highlighted in recent work (Dubiński et al., Duan et al.). We aim to bridge this gap by focusing on data verification for large-scale generative models trained on extensive datasets, ensuring privacy protection at scale in real-world scenarios.
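The instance-level-to-dataset-level contamination detection described above can be illustrated with a minimal, self-contained sketch. All names here are hypothetical and the model is stubbed by a `score` callable returning a per-example loss; this is a toy illustration under the assumption that memorized or contaminated instances receive unusually low loss, not a method endorsed by the workshop.

```python
# Minimal sketch: flag individual benchmark instances whose loss is far below
# that of held-out (known-unseen) reference text, then aggregate the flags into
# a dataset-level contamination estimate. `score` is a stand-in for a model's
# per-example loss; all names are illustrative, not a real API.
from statistics import mean, stdev
from typing import Callable, Iterable

def contamination_report(
    score: Callable[[str], float],   # lower loss = more likely seen in training
    candidates: Iterable[str],       # benchmark instances to audit
    reference: Iterable[str],        # held-out text known to be unseen
    z_threshold: float = -2.0,       # flag scores far below the reference mean
) -> dict:
    ref_scores = [score(x) for x in reference]
    mu, sigma = mean(ref_scores), stdev(ref_scores)
    flags = []
    for x in candidates:
        z = (score(x) - mu) / sigma  # standardize against unseen data
        flags.append(z < z_threshold)  # instance-level decision
    return {
        "flagged": flags,
        # dataset-level estimate: fraction of flagged instances
        "contamination_rate": sum(flags) / len(flags),
    }
```

In practice, `score` would wrap a forward pass of the model under audit (e.g., average token negative log-likelihood), and the reference set would be drawn from text provably created after the model's training cutoff.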
We welcome submissions related to all aspects of Data in Generative AI, including but not limited to:
- Data-Centric Approach to the Safety of Generative Models
- Data Memorization in Generative Models
- Data Contamination in Generative Models
- Data Verification in Generative Models
- Data Poisoning and Backdoors in Generative Models
- Generative Data for Trustworthy AI Research (e.g., synthetic datasets for security studies, synthetic augmentation for robust models, etc.)
The workshop will employ a double-blind review process. Each submission will be evaluated based on the following criteria:
- Soundness of the methodology
- Relevance to the workshop
- Societal impacts
We only consider submissions that have not been published in any peer-reviewed venue, including the ICML 2025 main conference. We allow dual submissions with other workshops or conferences. The workshop is non-archival and will not have official proceedings. All accepted papers will be allocated either a poster presentation or a talk slot.
Important Dates (Tentative)
- Submission deadline: May 20th, 2025, 11:59 PM Anywhere on Earth (AoE)
- Author notification: June 9th, 2025
- Camera-ready deadline: June 30th, 2025, 11:59 PM Anywhere on Earth (AoE)
- Workshop date: TBD (Full-day Event)
Submission Instructions
⭐ Coming Soon ⭐
Workshop Sponsors
⭐ Please reach out if you would like to sponsor our workshop. ⭐
Organizers
- Franziska Boenisch, CISPA Helmholtz Center for Information Security
- Khoa D Doan, CECS-VinUniversity & VinUni-Illinois Smart Health Center
- Adam Dziedzic, CISPA Helmholtz Center for Information Security
- Bo Li, University of Illinois at Urbana-Champaign
- Viet Anh Nguyen, Chinese University of Hong Kong
- Aniruddha Saha, Independent Researcher
- Zhenting Wang, Rutgers University
- Heather Zheng, University of Chicago