Data anonymization and how to break it
Anonymization is the main legal paradigm for sharing data while limiting privacy harms. Yet, robust anonymization of individual-level data is very difficult to achieve in practice.
In this seminar, you will first learn what does not count as anonymous data, through (in)famous examples of anonymization failures. Then, we will turn our attention to modern data sharing systems, including query-based systems, synthetic data, and differential privacy. Finally, we will cover automated approaches for auditing the privacy of these systems, a task that poses interesting challenges.
Lecturer: Dr. Ana-Maria Cretu (CISPA)
Time: Every Thursday from 2:15 PM to 4 PM, starting on April 16.
Location: C0 Room 0.02, located in the CISPA Stuhlsatzenhaus 5 building, except for 07.05.26 and 02.07.26, for which the location is TBD.
Requirements:
- Strong interest in data privacy.
- Basic knowledge of probabilities and statistics.
- (Optional) Having taken a course on machine learning and/or optimization.
Grading: This seminar has three grading components:
- Presentation (50%): You will prepare and deliver a 30-minute presentation (followed by 10 minutes of questions and discussion) of the paper assigned to you. You can get feedback on your slides before the presentation.
- Participation in discussion (20%): Contribution to the discussion during the seminar meeting, including but not limited to 3 questions prepared in advance for one selected paper (see component 3 below).
- Paper review (30%): You will write a review of a paper other than the one you presented. The review can be up to 4 A4 pages (template TBD), not counting references. The use of LLMs in any capacity (ideation, correction, etc.) is strictly forbidden. The review should address the following questions:
- What is the research problem addressed by the paper?
- Why is this research problem worth studying (societal motivation)?
- What has been done before (positioning relative to prior work)?
- How does the paper address the research problem (methods, datasets)?
- What are the limitations of the paper?
- What novel research questions does this work inspire?
For this paper, in addition to the review, you will have to prepare 3 questions to ask the presenter of the paper.
Topics:
Week 1 (16.04.2026, C0 Room 0.02): Introduction & k-anonymity
- We will cover the course organization, a brief intro to anonymization, and an example paper presentation on k-anonymity.
- Paper 1: k-anonymity: a model for protecting privacy. Presenter: Ana-Maria Cretu.
Week 2 (23.04.2026, C0 Room 0.02): l-diversity and t-closeness
- Paper 2: ℓ-Diversity: Privacy Beyond k-Anonymity. Presenter: TBD.
- Paper 3: t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. Presenter: TBD.
Week 3 (30.04.2026, C0 Room 0.02): Re-identification attacks
- Paper 4: Robust De-anonymization of Large Sparse Datasets ("Netflix paper"). Presenter: TBD.
- Paper 5: Unique in the Crowd: The privacy bounds of human mobility. Presenter: TBD.
- (Optional reading): De-anonymizing Social Networks; Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography.
Week 4 (07.05.2026, Location TBD): Advanced re-identification - Part I
- Paper 6: Estimating the success of re-identifications in incomplete datasets using generative models. Presenter: TBD.
- Paper 7: Attacks on Deidentification’s Defenses. Presenter: TBD.
- (Optional reading): A scaling law to model the effectiveness of identification techniques
Week 5 (14.05.2026, C0 Room 0.02): Advanced re-identification - Part II
- Paper 8: Interaction data are identifiable even across long periods of time. Presenter: TBD.
- Paper 9: The risk of re-identification remains high even in country-scale location datasets. Presenter: TBD.
Week 6 (21.05.2026): No seminar
Week 7 (28.05.2026, C0 Room 0.02): Aggregation
- Paper 10: A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Presenter: TBD.
- Paper 11: Knock Knock, Who's There? Membership Inference on Aggregate Location Data. Presenter: TBD.
Week 8 (04.06.2026, C0 Room 0.02): Query-based systems
- Paper 12: When the Signal is in the Noise: Exploiting Diffix's Sticky Noise. Presenter: TBD.
- Paper 13: QuerySnout: Automating the Discovery of Attribute Inference Attacks against Query-Based Systems. Presenter: TBD.
Week 9 (11.06.2026, C0 Room 0.02): Differential privacy - part I
- Paper 14: Revealing Information while Preserving Privacy. Presenter: TBD.
- Paper 15: Calibrating Noise to Sensitivity in Private Data Analysis. Presenter: TBD.
Week 10 (18.06.2026, C0 Room 0.02): Differential privacy - part II
- Paper 16: Census TopDown: Differentially Private Data, Incremental Schemas, and Consistency with Public Knowledge. Presenter: TBD.
- Paper 17: DP-Sniper: Black-Box Discovery of Differential Privacy Violations using Classifiers. Presenter: TBD.
Week 11 (25.06.2026, C0 Room 0.02): Synthetic data - part I
- Paper 18: Synthetic Data – Anonymisation Groundhog Day. Presenter: TBD.
- Paper 19: Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing. Presenter: TBD.
Week 12 (02.07.2026, Location TBD): Synthetic data - part II
- Paper 20: The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against "Truly Anonymous" Synthetic Datasets. Presenter: TBD.
- Paper 21: The DCR Delusion: Measuring the Privacy Risk of Synthetic Data. Presenter: TBD.
Week 13 (09.07.2026, C0 Room 0.02): Anonymization and machine learning
- Paper 22: Algorithms that remember: model inversion attacks and data protection law
- Paper 23: Extracting Training Data from Large Language Models
Week 14 (15.07.2026, C0 Room 0.02): Pen-and-paper exam
