In 2020, the United States Census Bureau applied differential privacy, a technique that anonymizes data by adding statistical noise, to the decennial census for the first time. In response, the state of Alabama filed a federal lawsuit in March 2021 alleging that the 2020 census data had been deliberately altered to be inaccurate and therefore could not be used to draw congressional districts[1]. Although this lawsuit was ultimately dismissed in July 2021, several legal and statistical questions remain about the use of differentially private counts for the census and the potential ramifications of this method for congressional representation.

Census privacy and the law.

The United States Census Bureau is prohibited by law (Title 13 of the U.S. Code) from releasing data in which the personal information of an individual can be identified. This is critical to ensure that census data remains confidential and is used only for aggregate statistical purposes. Despite this long-standing mandate, census data has historically been misused by the federal government. For example, during World War II, the Census Bureau shared data on Japanese Americans with other federal agencies, which aided internment, and provided the addresses of individual Japanese Americans to the Secret Service. Strong privacy protections are necessary to safeguard census data against such misuse.

De-identifying data is hard.

Removing identifying information from a dataset is statistically challenging, and simply deleting identifiers such as names, addresses, and phone numbers may not be sufficient to preserve privacy. One well-known example is the Netflix Prize competition, for which Netflix released a database of movie ratings with each user's information replaced by an anonymous identifier. Researchers, however, were able to link this database with public reviews on IMDb that carried identifying information, re-identifying some Netflix users and revealing their viewing histories.
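The linking step can be illustrated with a toy sketch. The records, names, and the `link_records` helper below are all hypothetical, and the exact-match rule on (movie, date) pairs is a simplification chosen for clarity; it is not the matching procedure the researchers actually used.

```python
from collections import Counter, defaultdict

# Hypothetical "anonymized" ratings: names replaced by opaque IDs.
anonymized_ratings = [
    {"user_id": "u1", "movie": "Movie A", "date": "2005-03-01", "rating": 5},
    {"user_id": "u1", "movie": "Movie B", "date": "2005-03-04", "rating": 2},
    {"user_id": "u1", "movie": "Movie C", "date": "2005-04-10", "rating": 1},
    {"user_id": "u2", "movie": "Movie A", "date": "2005-03-01", "rating": 4},
    {"user_id": "u2", "movie": "Movie D", "date": "2005-05-02", "rating": 3},
]

# Hypothetical public reviews that carry real names.
public_reviews = [
    {"name": "Alice Example", "movie": "Movie A", "date": "2005-03-01"},
    {"name": "Alice Example", "movie": "Movie B", "date": "2005-03-04"},
    {"name": "Bob Example", "movie": "Movie D", "date": "2005-05-02"},
]


def link_records(anonymized, public, min_matches=2):
    """Map anonymous IDs to names that share at least min_matches (movie, date) pairs."""
    # Index the public data by the quasi-identifier (movie, date).
    names_by_key = defaultdict(set)
    for row in public:
        names_by_key[(row["movie"], row["date"])].add(row["name"])

    # Count, for each anonymous ID, how often each public name co-occurs.
    overlap = defaultdict(Counter)
    for row in anonymized:
        for name in names_by_key.get((row["movie"], row["date"]), ()):
            overlap[row["user_id"]][name] += 1

    # Keep only the best candidate per ID, and only if the overlap is large enough.
    matches = {}
    for user_id, counts in overlap.items():
        name, hits = counts.most_common(1)[0]
        if hits >= min_matches:
            matches[user_id] = name
    return matches


linked = link_records(anonymized_ratings, public_reviews)
```

In this toy data, the anonymous user `u1` shares two (movie, date) pairs with one named reviewer, so the rating of "Movie C", which never appeared publicly, becomes attributable to a named person. No names were ever present in the "anonymized" table.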

This form of data re-identification is known as a linkage attack. Re-identification attacks represent a significant threat to data privacy. In fact, when researchers at the US Census Bureau simulated such an attack on the published tables of the 2010 US Census, they were able to reconstruct individual-level records for the roughly 300 million people counted and, by linking those records with commercial data, to correctly re-identify tens of millions of them. This finding motivated the Census Bureau to adopt differential privacy, a stronger form of privacy protection, for the 2020 US Census.

What is differential privacy?

Differential privacy is a mathematical framework, introduced in 2006, that safeguards the privacy of individuals in a database by adding randomness. For example, consider a teacher who wishes to determine what percentage of their students cheated on an exam. Due to the sensitive nature of this question, polling the students directly would be unlikely to yield honest answers. Instead, the teacher asks each student to flip a coin in private. If the coin lands on heads, the student answers truthfully whether they cheated. If the coin lands on tails, the student answers that they cheated, regardless of whether they actually did. A “yes” from any individual student therefore proves nothing, since it may only reflect a coin that landed on tails. However, because the teacher knows that about half of the class answers “yes” due to the coin alone, the teacher can subtract that share and still estimate the percentage of the class that cheated.
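The statistical step in the coin-flip example can be made concrete. If a fraction p of the class cheated, the expected fraction of “yes” answers is 1/2 + p/2 (every tails, plus the cheaters among the heads), so p can be estimated as 2 × (fraction of “yes”) − 1. The sketch below simulates this under assumed values: the class size, the true cheating rate, and the function names are illustrative choices, not part of the original example.

```python
import random


def randomized_answer(cheated: bool, rng: random.Random) -> bool:
    """Heads: answer truthfully. Tails: answer 'yes' regardless of the truth."""
    heads = rng.random() < 0.5
    return cheated if heads else True


def estimate_cheating_rate(answers) -> float:
    """Invert P(yes) = 1/2 + p/2 to estimate p, clamped to the range [0, 1]."""
    yes_fraction = sum(answers) / len(answers)
    return min(1.0, max(0.0, 2.0 * yes_fraction - 1.0))


# Simulation with assumed values: a large "class" and a 20% true cheating rate.
rng = random.Random(0)
true_rate = 0.2
students = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomized_answer(cheated, rng) for cheated in students]
estimate = estimate_cheating_rate(answers)
```

With a large number of respondents the estimate lands close to the true rate, even though the teacher never learns which “yes” answers were honest. With a class of ordinary size the estimate would be much noisier, which is the same privacy-versus-accuracy tension that arises for small populations in any noisy count.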

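The differential privacy mechanisms used by the Census Bureau in 2020 rest on the same principle: calibrated random noise is added to the published statistics. Importantly, these mechanisms have a privacy budget ε that controls how much noise is added. Decreasing ε makes the data more private but decreases the accuracy of the aggregate statistics computed from the data, while increasing ε does the opposite.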
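To illustrate how the privacy budget trades privacy against accuracy, the sketch below applies the textbook Laplace mechanism to a single count. This is an assumption made for illustration and is not the Census Bureau's production algorithm. Under the usual convention, the noise scale is sensitivity/ε, so a smaller ε means more noise (more privacy, less accuracy). The function names, the count of 100, and the ε values are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    # expovariate takes a rate, so rate = 1 / scale gives draws with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: float, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count: float, epsilon: float,
                   trials: int = 20_000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    total = sum(abs(noisy_count(true_count, epsilon, rng) - true_count)
                for _ in range(trials))
    return total / trials


# Hypothetical block with 100 residents, released under three budgets.
errors = {eps: mean_abs_error(100, eps) for eps in (0.1, 1.0, 10.0)}
```

The expected absolute error of Laplace noise equals its scale, so the average error here is roughly 1/ε: about ten people at ε = 0.1 but about a tenth of a person at ε = 10. For a large state total such errors are negligible, while for a block of a few dozen residents they can be a sizable fraction of the count.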

Can differentially private data be used for districting?

In high-stakes applications, such as drawing congressional districts from census data, the use of differential privacy remains controversial. These applications require accurate data, so the choice of the privacy budget ε, which controls the accuracy of the data, is critical. In fact, the state of Alabama’s lawsuit was dismissed in part because it was filed before the Census Bureau had settled on the final value of ε, rendering the plaintiff’s claims unripe. More recently, research has suggested that the noise added for differential privacy can degrade the accuracy of census data and could potentially harm voting rights. The debate over the use of differential privacy in the census remains largely unresolved and is likely to remain a productive source of legal and statistical arguments for years to come.

[1] Alabama v. Department of Commerce No. 21-211 (2021).