Data in Generative Models @ ICML 2025
The Bad, the Ugly, and the Greats - Building reliable and trustworthy Generative AI must start with high-quality and responsibly managed training data.
⭐ West Ballroom A, Vancouver Convention Centre, 8:55 AM - 5 PM PDT, July 19th, 2025 ⭐
Generative models have become extremely powerful and are now integral to many aspects of daily life, from the creative arts to customer service. Given their increasing interaction with people, ensuring their trustworthiness is crucial. This workshop centers on the idea that the safety and reliability of generative models are deeply connected to the nature and treatment of their training data. We aim to explore the hypothesis that building reliable and trustworthy artificial intelligence (AI) systems based on generative models must start with high-quality and responsibly managed data.
The workshop will focus on several key areas where training data impacts the trustworthiness of generative models. Among others, we will address 1) privacy concerns, highlighting how the improper inclusion and handling of sensitive information in training data can lead to significant privacy violations; 2) safety risks, such as backdoors and data poisoning, which threaten the reliability of generated outputs; and 3) biases in training data, which can cause models to perpetuate or even amplify societal biases, resulting in unfair outcomes.
Through expert talks, panel discussions, and interactive sessions, participants will delve into these issues and explore strategies for developing safe, trustworthy, and reliable generative models. This workshop aims to foster collaboration and drive forward research to ensure that generative models, as they become more embedded in our lives, do so in a trustworthy and beneficial manner.
⭐ Awards ⭐
Best Paper - Preference Leakage: A Contamination Problem in LLM-as-a-judge
Best Poster - Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models
Schedule
| Event Start Time (PDT/GMT-07:00, Vancouver, Canada) | Session | Speaker(s) |
|---|---|---|
| 08:55 am | Welcome Remarks | Organizers |
| 09:00 am | Invited Talk 1: Dynamic & Stateful Evals of Safety on the Frontier: What can Academics do? | Eric Wong |
| 09:30 am | Oral 1: Preference Leakage: A Contamination Problem in LLM-as-a-judge | Dawei Li |
| 09:45 am | Oral 2: Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Lei Hsiung |
| 10:00 am | Coffee Break | |
| 10:30 am | Invited Talk 2: There’s No Free Lunch in Safety in Fine-tuning Large Language Models | Pin-Yu Chen |
| 11:00 am | Oral 3: Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLMs | Sicheng Lai |
| 11:15 am | Invited Talk 3: Building Trustworthy LLMs: How Data Quality Shapes Performance and Where It Falls Short? | Nouha Dziri |
| 11:45 am | Lunch Break | |
| 01:00 pm | Invited Talk 4: Data-centric LM research on an academic budget | Tatsunori Hashimoto |
| 01:30 pm | Oral 4: Training Diffusion Models with Noisy Data via SFBD Flow | Haoye Lu |
| 01:45 pm | Invited Talk 5: How (not) to hack AI? | Ivan Evtimov |
| 02:15 pm | Oral 5: Unlocking Post-hoc Dataset Inference with Synthetic Data | Pratyush Maini |
| 02:30 pm | Coffee Break | |
| 03:00 pm | Poster Session | Paper Authors |
| 03:45 pm | Invited Talk 6: On Specification Data | Serena Booth |
| 04:15 pm | Panel Discussion | Pin-Yu Chen, Eric Wong, Ivan Evtimov, Serena Booth (Moderator: Aniruddha Saha) |
| 04:45 pm | Paper Awards | Organizers |
| 04:55 pm | Closing Remarks | Organizers |
Speakers
- Serena Booth, Assistant Professor, Brown University
- Pin-Yu Chen, Principal Research Scientist, IBM Research AI
- Nouha Dziri, Research Scientist, Allen Institute for AI (AI2)
- Ivan Evtimov, Research Scientist, Meta AI Red Team
- Tatsunori Hashimoto, Assistant Professor, Stanford University
- Eric Wong, Assistant Professor, University of Pennsylvania
Call for Papers
We cordially invite submissions and participation in our “Data in Generative Models (The Bad, the Ugly, and the Greats)” workshop that will be held on July 19th, 2025 at the Forty-Second International Conference on Machine Learning (ICML) 2025 in Vancouver, Canada.
Motivation and Topics
Building upon the success of our previous workshop, BUGS@NeurIPS 2023, we will continue to explore the impact of data on AI beyond backdoor attacks. A key enhancement of the 2025 edition is a substantial expansion of the 2023 research themes, now focusing on the interaction between datasets and generative models, including large language models, diffusion models, and vision-language models. Examples of research areas include:
- Data-Centric Approach to the Safety of Generative Models (general theme): Most research in safe machine learning concentrates primarily on evaluating model properties (Papernot et al.). From a data-centric perspective, however, we focus on the safety of generative models such as LLMs and diffusion models. This strategy will create new methodologies and reveal vulnerabilities and potential threats that have not previously been sufficiently recognized by the research community.
- Data Memorization in Generative Models: Current approaches to preventing data memorization in generative models propose limited solutions, such as removing, collecting, or augmenting specific data samples (Wen et al.). This area of research aims to achieve a comprehensive understanding of data memorization, extending the analysis beyond training set distributions and model outputs. This requires examining the characteristics of individual data samples that make them susceptible to memorization, focusing on their representations across multiple internal model layers rather than solely on model outputs, as done in previous studies.
- Data Contamination in Generative Models: Including test data from downstream tasks and benchmarks in the training data of generative models poses a significant challenge to accurately measuring the models’ true effectiveness. Developing new, effective methods for detecting data contamination within generative models is essential. These approaches should focus on identifying potential contamination at the individual instance level and then use this information to evaluate broader contamination at the dataset level (a minimal illustrative sketch of this instance-to-dataset pattern follows this list).
- Data Verification in Generative Models: This area focuses on the inputs and outputs of generative models. Current methods for identifying training samples and verifying model outputs are typically assessed using academic benchmark datasets and small-scale models, as highlighted in recent work (Dubiński et al., Duan et al.). We aim to bridge this gap by focusing on data verification for large-scale generative models trained on extensive datasets, ensuring privacy protection at scale in real-world scenarios.
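As one minimal illustration of the instance-to-dataset contamination pattern mentioned above, the Python sketch below flags benchmark examples by word n-gram overlap with a training corpus and aggregates the flags into a dataset-level contamination rate. The function names, n-gram length, and flagging threshold are illustrative assumptions, not a method proposed or endorsed by the workshop.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def instance_overlap(test_example: str, training_corpus: list[str], n: int = 8) -> float:
    """Instance level: fraction of the example's n-grams also found in the training corpus."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)


def dataset_contamination_rate(test_set: list[str], training_corpus: list[str],
                               n: int = 8, threshold: float = 0.5) -> float:
    """Dataset level: share of test examples whose instance-level overlap exceeds the threshold."""
    flagged = [ex for ex in test_set
               if instance_overlap(ex, training_corpus, n) >= threshold]
    return len(flagged) / max(len(test_set), 1)


if __name__ == "__main__":
    # Toy example: one benchmark item appears verbatim in the corpus, one does not.
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    benchmark = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "a completely unrelated benchmark question about protein folding",
    ]
    print(dataset_contamination_rate(benchmark, corpus, n=5))  # prints 0.5
```

In practice, exact n-gram matching is only a weak proxy; submissions to this theme may consider paraphrased or multimodal contamination that such string-level checks miss.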
We welcome submissions related to all aspects of Data in Generative AI, including but not limited to:
- Data-Centric Approach to the Safety of Generative Models
- Data Memorization in Generative Models
- Data Contamination in Generative Models
- Data Verification in Generative Models
- Data Poisoning and Backdoors in Generative Models
- Generative Data for Trustworthy AI Research (e.g., synthetic datasets for security studies, synthetic augmentation for robust models, etc.)
The workshop will employ a double-blind review process. Each submission will be evaluated based on the following criteria:
- Soundness of the methodology
- Relevance to the workshop
- Societal impacts
We only consider submissions that have not been published in any peer-reviewed venue, including the ICML 2025 main conference. We allow dual submissions to other workshops or conferences. The workshop is non-archival and will not have any official proceedings. All accepted papers will be allocated either a poster presentation or a talk slot.
Important Dates
- Submission deadline: ~~May 20th, 2025~~ May 25th, 2025, 11:59 PM Anywhere on Earth (AoE)
- Author notification: June 9th, 2025
- Camera-ready deadline: June 30th, 2025 11:59 PM Anywhere on Earth (AoE)
- Workshop date: July 19th, 2025 (Full-day Event)
Submission Instructions
Papers should be submitted to OpenReview: https://openreview.net/group?id=ICML.cc/2025/Workshop/DIG-BUG
We welcome both long (up to 6 pages) and short/position (up to 4 pages) papers, excluding references, acknowledgments, and appendices. Please use the submission template provided here. Submissions must be anonymized in accordance with the ICML double-blind reviewing guidelines and must adhere to the ICML Code of Conduct and Code of Ethics. Accepted papers will be hosted on the workshop website but are considered non-archival and can be submitted to other workshops, conferences, or journals if their submission policies allow.
Workshop Swag
Our workshop T-shirt was designed by Oindrila Ghosh, Aniruddha Saha, ChatGPT 4o, Han Nguyen and Khoa D Doan.
If you want to take a look at the co-design process with ChatGPT, here are the chats (1, 2, 3, 4, 5).
Workshop Sponsors
⭐ Please reach out if you would like to sponsor our workshop. ⭐
Organizers
- Franziska Boenisch, CISPA Helmholtz Center for Information Security
- Khoa D Doan, CECS-VinUniversity & VinUni-Illinois Smart Health Center
- Adam Dziedzic, CISPA Helmholtz Center for Information Security
- Bo Li, University of Illinois at Urbana-Champaign
- Viet Anh Nguyen, Chinese University of Hong Kong
- Aniruddha Saha, Independent Researcher
- Zhenting Wang, Rutgers University
- Heather Zheng, University of Chicago