Two de-identification methods, k-anonymization and adding a “fuzzy factor,” significantly reduced the risk of re-identification of patients in a dataset of more than 5 million patient records from a large cervical cancer screening program in Norway. This is the conclusion of a new study published in Cancer Epidemiology, Biomarkers & Prevention, a journal of the American Association for Cancer Research. The lead author is Giske Ursin, MD, PhD, director of the Cancer Registry of Norway, Institute of Population-based Research.
“Researchers typically get access to de-identified data, that is, data without any personal identifying information, such as names, addresses, and Social Security numbers. However, this may not be sufficient to protect the privacy of individuals participating in a research study,” said Ursin.
Patient datasets often contain sensitive data, such as information about a person’s health and disease diagnosis, that an individual may not want to share publicly, and data custodians are responsible for safeguarding such information, Ursin noted. “People who have the permission to access such datasets have to abide by the laws and ethical guidelines, but there is always this concern that the data might fall into the wrong hands and be misused,” she added. “As a data custodian, that’s my worst nightmare.”
To test the strength of their de-identification techniques, Ursin and colleagues used screening data containing 5,693,582 records from 911,510 women in the Norwegian Cervical Cancer Screening Program. The data included patients’ dates of birth, cervical screening dates and results, the names of the labs that ran the tests, subsequent cancer diagnoses, if any, and dates of death, if deceased.
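For readers unfamiliar with the two methods named above, the following is a minimal sketch on made-up records, not the study's actual procedure. It assumes that k-anonymization here means coarsening a field (date of birth to year of birth) until every record shares its value with at least k-1 others, and that a "fuzzy factor" means shifting all of one person's dates by the same random offset. The record fields, the helper names, and the 30-day limit are all hypothetical.

```python
import random
from collections import Counter
from datetime import date, timedelta


def smallest_group(records, keys):
    """Size of the smallest group of records sharing the same values for `keys`.

    A dataset is k-anonymous with respect to `keys` when this value is >= k.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


def generalize_birth(record):
    """Coarsen the date of birth to the year only (one possible generalization)."""
    out = dict(record)
    out["birth"] = record["birth"].year
    return out


def fuzz_dates(record, max_days, rng):
    """Shift every date in one person's record by the same random offset.

    Using a single offset per person keeps the intervals between events intact.
    """
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return {k: (v + offset if isinstance(v, date) else v) for k, v in record.items()}


# Hypothetical toy records; none of these values come from the study.
records = [
    {"birth": date(1970, 3, 2), "screen": date(2010, 5, 1), "result": "normal"},
    {"birth": date(1970, 8, 19), "screen": date(2011, 2, 14), "result": "normal"},
    {"birth": date(1982, 1, 5), "screen": date(2012, 9, 30), "result": "abnormal"},
    {"birth": date(1982, 11, 23), "screen": date(2013, 4, 8), "result": "normal"},
]

# With exact birth dates every record is unique; with birth years, groups of two form.
k_before = smallest_group(records, ["birth"])
generalized = [generalize_birth(r) for r in records]
k_after = smallest_group(generalized, ["birth"])

# Fuzz one person's dates; the age at screening is unchanged.
fuzzed = fuzz_dates(records[0], max_days=30, rng=random.Random(0))
```

In this toy example, exact birth dates single out every woman, while birth years place each woman in a group of two. The fuzzed record has different calendar dates but the same interval between birth and screening, which is the kind of information an analysis of screening data would typically still need.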
The researchers used a tool called ARX to evaluate the risk of re-identification by approaching the dataset using a “prosecutor scenario,” in which the tool assumes the attacker knows that some data about an individual are in the dataset. An attack is considered successful if a large…