News

Presentation Schedule

Written on 27.10.25 by Ziqing Yang


Dear all,


After receiving your responses, we have arranged the presentation schedule.

Starting from November 4th, every Tuesday from 2 pm to 3 pm, two presenters will introduce their preferred papers.


04.11.2025
  • Rishika Kumari, PLeak: Prompt Leaking Attacks against Large Language Model Applications
  • Prachi Sajwan, "I Don't Know If We're Doing Good. I Don't Know If We're Doing Bad": Investigating How Practitioners Scope, Motivate, and Conduct Privacy Work When Developing AI Products

11.11.2025
  • Manu Vyshnavam Viswakarmav, Unveiling Privacy Risks in LLM Agent Memory
  • Ansu Varghese, "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

18.11.2025
  • Tianze Chang, Universal and Transferable Adversarial Attacks on Aligned Language Models
  • Farzaneh Soltanzadeh, JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

25.11.2025
  • Syed Usfar Wasim, Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models
  • Shreya Atul Kolhapure, Formalizing and Benchmarking Prompt Injection Attacks and Defenses

02.12.2025
  • Tarik Kemal Gundogdu, From Meme to Threat: On the Hateful Meme Understanding and Induced Hateful Content Generation in Open-Source Vision Language Models
  • Mengfei Liang, Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions

09.12.2025
  • Elena Bondarevskaya, On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
  • Xinyu Zhang, HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

16.12.2025
  • Daniyal Azfar, Safety Alignment Should Be Made More Than Just a Few Tokens Deep
  • Shaun Paul, Societal Alignment Frameworks Can Improve LLM Alignment


Best,
Ziqing

Paper List Available

Written on 23.10.25 by Ziqing Yang


Dear all,

The paper list is online. Please select three papers (ranked by preference) and send them to Ziqing Yang (ziqing.yang@cispa.de) by 24.10.2025.

Note that papers will be assigned on a first-come, first-served basis.

The assignment will be announced at 11 am on 27.10.2025.

Best,

Ziqing

 


Paper List

  1. Membership Inference Attacks Against In-Context Learning
  2. PLeak: Prompt Leaking Attacks against Large Language Model Applications
  3. Towards Label-Only Membership Inference Attack against Pre-trained Large Language Models
  4. "I Don't Know If We're Doing Good. I Don't Know If We're Doing Bad": Investigating How Practitioners Scope, Motivate, and Conduct Privacy Work When Developing AI Products
  5. Unveiling Privacy Risks in LLM Agent Memory
  6. Privacy Backdoors: Stealing Data with Corrupted Pretrained Models
  7. Black-box Membership Inference Attacks against Fine-tuned Diffusion Models
  8. JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
  9. Universal and Transferable Adversarial Attacks on Aligned Language Models
  10. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
  11. Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
  12. Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models
  13. Formalizing and Benchmarking Prompt Injection Attacks and Defenses
  14. Instruction Backdoor Attacks Against Customized LLMs
  15. Prompt Stealing Attacks Against Text-to-Image Generation Models
  16. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
  17. Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
  18. From Meme to Threat: On the Hateful Meme Understanding and Induced Hateful Content Generation in Open-Source Vision Language Models
  19. HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
  20. Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
  21. UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
  22. Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities
  23. Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications
  24. On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
  25. Safety Alignment Should Be Made More Than Just a Few Tokens Deep
  26. Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
  27. Learning Safety Constraints for Large Language Models
  28. Antidote: Post-fine-tuning Safety Alignment for LLMs against Harmful Fine-tuning
  29. Societal Alignment Frameworks Can Improve LLM Alignment
  30. One-Shot Safety Alignment for Large Language Models via Optimal Dualization
  31. Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
  32. SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

AI Safety

As AI systems become increasingly powerful and integrated into critical aspects of society, ensuring they behave safely and reliably has never been more important. AI Safety is the interdisciplinary field focused on minimizing risks associated with AI, from algorithmic bias and system failures to the long-term challenges posed by advanced autonomous agents.

In this seminar, we will explore the key technical, ethical, and societal issues related to AI safety. Topics include value alignment, robustness, and the governance of powerful AI systems. By the end of the seminar, students will gain a foundational understanding of how to assess and mitigate risks, design safer AI systems, and contribute to responsible AI development.

 

Logistics:

Time: Tuesday, 2 pm - 4 pm

Location: TBD

TAs:

  • Ziqing Yang (ziqing.yang@cispa.de)
  • Yihan Ma
  • Bo Shao

List of Papers
