News
Paper List Available
Written on 23.10.2025 16:12 by Ziqing Yang
Dear all,
The paper list is online. Please select three papers (ranked by preference) and send them to Ziqing Yang (ziqing.yang@cispa.de) by 24.10.2025.
Note that papers will be assigned on a first-come, first-served basis.
The assignment will be announced at 11 am on 27.10.2025.
Best,
Ziqing
Paper List
- Membership Inference Attacks Against In-Context Learning
- PLeak: Prompt Leaking Attacks against Large Language Model Applications
- Towards Label-Only Membership Inference Attack Against Pre-trained Large Language Models
- "I Don't Know If We're Doing Good. I Don't Know If We're Doing Bad": Investigating How Practitioners Scope, Motivate, and Conduct Privacy Work When Developing AI Products
- Unveiling Privacy Risks in LLM Agent Memory
- Privacy Backdoors: Stealing Data with Corrupted Pretrained Models
- Black-box Membership Inference Attacks against Fine-tuned Diffusion Models
- JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models
- Formalizing and Benchmarking Prompt Injection Attacks and Defenses
- Instruction Backdoor Attacks Against Customized LLMs
- Prompt Stealing Attacks Against Text-to-Image Generation Models
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
- Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
- From Meme to Threat: On the Hateful Meme Understanding and Induced Hateful Content Generation in Open-Source Vision Language Models
- HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
- Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
- UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
- Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities
- Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications
- On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep
- Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
- Learning Safety Constraints for Large Language Models
- Antidote: Post-fine-tuning Safety Alignment for LLMs against Harmful Fine-tuning
- Societal Alignment Frameworks Can Improve LLM Alignment
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization
- Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
- SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
