News
Paper List Available
Written on 23.10.2025 16:12 by Ziqing Yang
Dear all,
The paper list is online. Please select three papers (ranked by preference) and send them to Ziqing Yang (ziqing.yang@cispa.de) by 24.10.2025.
Note that papers will be assigned on a first-come, first-served basis.
The assignment will be announced at 11 am on 27.10.2025.
Best,
Ziqing
Paper List
- Membership Inference Attacks Against In-Context Learning
- PLeak: Prompt Leaking Attacks against Large Language Model Applications
- Towards Label-Only Membership Inference Attack Against Pre-trained Large Language Models
- "I Don't Know If We're Doing Good. I Don't Know If We're Doing Bad": Investigating How Practitioners Scope, Motivate, and Conduct Privacy Work When Developing AI Products
- Unveiling Privacy Risks in LLM Agent Memory
- Privacy Backdoors: Stealing Data with Corrupted Pretrained Models
- Black-box Membership Inference Attacks against Fine-tuned Diffusion Models
- JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models
- Formalizing and Benchmarking Prompt Injection Attacks and Defenses
- Instruction Backdoor Attacks Against Customized LLMs
- Prompt Stealing Attacks Against Text-to-Image Generation Models
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
- Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
- From Meme to Threat: On the Hateful Meme Understanding and Induced Hateful Content Generation in Open-Source Vision Language Models
- HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
- Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
- UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
- Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities
- Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications
- On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep
- Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
- Learning Safety Constraints for Large Language Models
- Antidote: Post-fine-tuning Safety Alignment for LLMs against Harmful Fine-tuning
- Societal Alignment Frameworks Can Improve LLM Alignment
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization
- Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
- SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
