Data anonymization and how to break it

Anonymization is the main legal paradigm for sharing data while limiting privacy harms. Yet, robust anonymization of individual-level data is very difficult to achieve in practice.

In this seminar, you will first learn what anonymous data is not, through (in)famous examples of anonymization failures. Then, we will turn our attention to modern data sharing systems, including query-based systems, synthetic data, and differential privacy. Finally, we will cover automated approaches for auditing the privacy of these systems, a task that poses interesting challenges.

Lecturer: Dr. Ana-Maria Cretu (CISPA)

Time: Every Thursday from 2:15 PM to 4 PM, starting on April 16.

Location: C0 Room 0.02, located in the CISPA Stuhlsatzenhaus 5 building, except on 07.05.26 and 02.07.26, when the seminar takes place in UdS room SR 2 in E 2.5 (see the schedule below).

Requirements:

  • Strong interest in data privacy.
  • Basic knowledge of probabilities and statistics.
  • (Optional) Having taken a course on machine learning and/or optimization.

Grading: This seminar has three grading components:

  1. Presentation (50%): You will prepare and deliver a 20-25 minute presentation of the paper assigned to you, followed by 10 minutes of questions and discussion. You can get feedback on your slides before the presentation.
  2. Participation in discussion (20%): Contribution to the discussion during the seminar meeting, including but not limited to 3 questions prepared in advance for one selected paper (see component 3 below).
  3. Paper review (30%): You will write a review of a paper other than the one you presented. The review can be up to 4 A4 pages (template TBD), not counting references. The use of LLMs in any capacity (ideation, correction, etc.) is strictly forbidden. The review should address the following questions:
    • What research problem does the paper study?
    • Why is this research problem worth studying (societal motivation)?
    • What has been done before (positioning relative to prior work)?
    • How does the paper address the research problem (methods, datasets)?
    • What are the limitations of the paper?
    • What novel research questions does this work inspire?

        For this paper, in addition to the review, you will have to prepare 3 questions to ask the presenter of the paper.

Topics:

Crossed out are presentations skipped or already completed.

Week 1 (16.04.2026, C0 Room 0.02): Introduction & k-anonymity

- We will cover the course organization, a brief introduction to anonymization, and an example paper presentation on k-anonymity.

- Paper 1: k-anonymity: a model for protecting privacy. Presenter: Ana-Maria Cretu.

Optional reading: 

- A scientific review of anonymization: Anonymization: The imperfect science of using data while preserving privacy

- EU guidelines for anonymization: Article 29 Data Protection Working Party's Opinion 05/2014 on Anonymisation Techniques

- How to k-anonymize data: Incognito: Efficient Full-Domain K-Anonymity

Week 2 (23.04.2026, C0 Room 0.02): K-anonymity revisited

Paper 2: ℓ-Diversity: Privacy Beyond k-Anonymity (presentation skipped)

Paper 3: t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. Presenter: TBD.

Week 3 (30.04.2026, C0 Room 0.02): Re-identification attacks

Paper 4: Robust De-anonymization of Large Sparse Datasets ("Netflix paper"). Presenter: Eric Ansbach.

Paper 5: Unique in the Crowd: The privacy bounds of human mobility. Presenter: Syed Kumail Raza Zaidi.

Optional reading on the de-anonymization of social network data:

De-anonymizing Social Networks

Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography

Week 4 (07.05.2026, UdS room SR 2 in E 2.5): Advanced re-identification - Part I

- Paper 8: Interaction data are identifiable even across long periods of time. Presenter: Amelie Klein. 

- Paper 9: The risk of re-identification remains high even in country-scale location datasets. Presenter: Luis Henrique Bastos Tamura.

Optional reading:

- In-depth analysis of how re-identification risk scales with population size: A scaling law to model the effectiveness of identification techniques

Week 5 (14.05.2026): No seminar (public holiday)

Week 6 (21.05.2026): No seminar

Week 7 (28.05.2026, C0 Room 0.02): Advanced re-identification - Part II

Paper 6: Estimating the success of re-identifications in incomplete datasets using generative models. Presenter: Kerem Kılıç.

Paper 7: Attacks on Deidentification’s Defenses (presentation skipped)

Week 8 (04.06.2026, C0 Room 0.02): Aggregation

Paper 10: A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Presenter: Assigned.

Paper 11: Knock Knock, Who's There? Membership Inference on Aggregate Location Data. Presenter: Shaloni Modi.

Week 9 (11.06.2026, C0 Room 0.02): Query-based systems

Paper 12: When the Signal is in the Noise: Exploiting Diffix's Sticky Noise. Presenter: Assigned.

Paper 13: QuerySnout: Automating the Discovery of Attribute Inference Attacks against Query-Based Systems. Presenter: TBD.

Week 10 (18.06.2026, C0 Room 0.02): Differential privacy - Part I

Paper 14: Revealing Information while Preserving Privacy. Presenter: TBD.

Paper 15: Calibrating Noise to Sensitivity in Private Data Analysis. Presenter: TBD.

Week 11 (25.06.2026, C0 Room 0.02): Differential privacy - Part II

Paper 16: Census TopDown: Differentially Private Data, Incremental Schemas, and Consistency with Public Knowledge. Presenter: TBD.

Paper 17: DP-Sniper: Black-Box Discovery of Differential Privacy Violations using Classifiers. Presenter: TBD.

Week 12 (02.07.2026, UdS room SR 2 in E 2.5): Advanced reconstruction

Paper 18: Generate-then-Verify: Reconstructing Data from Limited Published Statistics. Presenter: TBD.

Paper 19: Confidence-Ranked Reconstruction of Census Microdata from Published Statistics. Presenter: TBD.

Week 13 (09.07.2026, C0 Room 0.02): Synthetic data - Part I

Paper 20: Synthetic Data – Anonymisation Groundhog Day. Presenter: TBD.

Paper 21: Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing. Presenter: TBD.

Week 14 (16.07.2026, C0 Room 0.02): Synthetic data - Part II

Paper 22: PrivBayes: Private Data Release via Bayesian Networks. Presenter: TBD.

Paper 23: The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against "Truly Anonymous" Synthetic Datasets. Presenter: TBD.

Optional reading:

- In-depth analysis of privacy metrics: The DCR Delusion: Measuring the Privacy Risk of Synthetic Data.

Week 15 (23.07.2026, C0 Room 0.02): Anonymization and machine learning

Paper 24: Algorithms that remember: model inversion attacks and data protection law

Paper 25: Extracting Training Data from Large Language Models
