What is differential privacy and how does it apply to our daily life?

carolc

6 years ago

Differential privacy is a method in which random errors, or “noise”, are introduced into the answers drawn from a database to protect people’s anonymity. The goal is to learn as much as possible about a group while learning as little as possible about any individual in it. Differential privacy lets you gain knowledge from large data sets, with a mathematical guarantee that strictly limits what anyone can learn about a single individual in the set. Thanks to differential privacy you can get to know your users without violating their privacy. It does this, essentially, by adding more statistical noise to the answer the more specific the question we ask the database. It is loosely analogous to Heisenberg’s uncertainty principle in physics, applied to data science out of social necessity.

This approach, which originated in part in the publications of Cynthia Dwork, a Microsoft researcher, is being implemented by tech giants such as Google, which was betting on it before it even went by that name.

What are its practical applications?

Differential privacy can be applied to everything from recommendation systems to location-based services and social media. Apple uses differential privacy to collect anonymous information about the use of devices such as iPhones, iPads, and Macs.

Differential privacy would also allow a company like Amazon to learn its customers’ purchase preferences while hiding sensitive information about their shopping history. Facebook may use it to collect behavioral data for targeted advertising without violating a country’s privacy regulations.

With this approach, many companies in the telephony industry send the information from your devices to their servers only after it has gone through a transformation process using various techniques (cryptographic functions, noise insertion…), with the aim of making it mathematically infeasible to associate the data with your identity.
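As a minimal sketch of noise insertion on the device itself, the classic randomized response technique can stand in for whatever a real vendor does. This is not the mechanism used by any company named here, and the function names and the simulated 30% rate are illustrative assumptions. Each device flips a coin before reporting, so any single report is deniable, while the server can still estimate the overall rate.

```python
import random

def randomized_response(truth, rng):
    """Report a yes/no value with plausible deniability (runs on the device)."""
    if rng.random() < 0.5:
        return truth               # half the time: answer truthfully
    return rng.random() < 0.5      # otherwise: answer with a fresh coin flip

def estimate_true_rate(reports):
    """Undo the known bias on the server: P(yes) = 0.5 * rate + 0.25."""
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)

# Hypothetical population: 30% of 10,000 users have the sensitive attribute.
rng = random.Random()
truths = [i < 3000 for i in range(10000)]
reports = [randomized_response(t, rng) for t in truths]
print("estimated rate:", estimate_true_rate(reports))
```

A true "yes" is reported as "yes" with probability 0.75 and a true "no" with probability 0.25, so no individual report reveals the underlying answer with certainty, yet the estimate over many users lands close to the real rate.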

How differential privacy works

To protect each subject’s sensitive data, when a query is sent to a system that incorporates differential privacy, the system modifies the query result by adding noise: a random value drawn from a probability distribution calibrated to how much a single individual’s data can change the result.

Thus, if we ask a dataset that incorporates this concept something like “how many customers who called us yesterday have an account balance of more than $100,000?”, it will not give us back the exact, actual figure, but a number close to it, obtained by adding a random value (positive or negative).
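The counting query above can be sketched with the Laplace mechanism, one standard way to add such noise (the post does not name a specific mechanism, so this choice is an assumption). The function names, the epsilon value and the sample balances are all hypothetical. A count changes by at most 1 when one person is added or removed, so the noise scale is 1 / epsilon.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count; smaller epsilon means more noise, more privacy."""
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1  # one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical account balances of yesterday's callers.
balances = [250_000, 12_000, 101_500, 99_999, 480_000, 3_200]
answer = private_count(balances, lambda b: b > 100_000, epsilon=0.5)
print("noisy count of balances over $100,000:", answer)
```

Each call returns a different value near the true count. Repeating the same query many times and averaging would wear the protection away, which is why real systems also track a total privacy budget.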

The introduction of non-deterministic noise (randomness) is essential. Specifically, any non-trivial privacy guarantee that holds regardless of other sources of auxiliary information, including other databases, studies, rumors, news, and statistics, requires randomness.

How do organizations use data privacy?

Organizations typically use data for two main purposes: improving decision-making (through business intelligence) and promoting automation (through machine learning and AI). There is a new set of methods and tools that preserve privacy when building systems based on business intelligence and machine learning.

Researchers and entrepreneurs are actively mobilizing to create privacy preservation methods and tools. Machine learning specialists have long agreed that simple data anonymization techniques can compromise user privacy; here are some recent techniques for preserving privacy in machine learning:

Federated learning

Introduced by Google, federated learning allows you to train a centralized machine learning model without exchanging the underlying data, and therefore lends itself very well to services on mobile devices.
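The idea can be sketched with a simplified federated averaging loop. This is an illustration under stated assumptions, not Google's implementation: the one-parameter model y = w * x, the function names, the learning rate and the client data are all hypothetical. Each client trains on its own data and sends back only its updated weight; the server never sees the raw records.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

def train(clients, rounds=50):
    """Alternate local training and server-side averaging."""
    w = 0.0
    for _ in range(rounds):
        updates = [local_update(w, data) for data in clients]
        w = federated_average(updates, [len(d) for d in clients])
    return w

# Hypothetical per-device datasets of (x, y) pairs following y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]
print("learned weight:", train(clients))
```

Only model weights cross the network here. In practice the updates themselves can still leak information, which is why federated learning is often combined with differential privacy or secure aggregation.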

Homomorphic encryption

It is an emerging field whose objective is to develop tools that allow complex models to be computed directly on encrypted data. Preliminary research has focused on computer vision and speech technologies.
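To make "computing on encrypted data" concrete, here is a toy sketch of the Paillier cryptosystem, which supports addition on ciphertexts. The post does not name a scheme, so Paillier is an illustrative choice; the tiny primes and function names are assumptions and this code is far too weak for real use. It needs Python 3.8 or later for the modular inverse via pow.

```python
import math
import secrets

def keygen(p=1009, q=1013):
    """Toy key generation with tiny hard-coded primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n, with g = n + 1."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print("decrypted sum:", decrypt(n, private_key, c_sum))
```

The party that performs the addition never holds the private key, so it works on the numbers without being able to read them. Schemes aimed at machine learning models go further and also support multiplication, at a much higher computational cost.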

Decentralization

This is an area driven mainly by new companies that want to use blockchains, distributed ledgers and incentive structures based on cryptocurrencies.

What precautions should be taken against misuse?

Each website has its own policy, which must also comply with the laws of the country in which it operates. Collecting information about each person is therefore a rather sensitive matter. If the necessary measures are not taken, a breach can cause large monetary damages, which mount with every hour and day that passes.

The data an organization collects typically ends up in a general database, which organizes it for later querying. In some cases, some of the data is partially deleted without leaving a trace in the source database. This is where the underlying idea of differential privacy comes in: what is not in plain sight cannot simply be stolen. Recovering it is not simple; it would take mathematical processing to locate the data and manipulate it again.

Virtual assistants

Virtual assistants are a threat to our data, and advances in technology are heading firmly in their direction. We want our phones to tell us where we are going before we know it ourselves, to notify us of our appointments, and to suggest restaurants according to our preferences and our information. This comes at a price: they must use our data. For a smartphone to tell us, when we get up in the morning, how long it will take to get to work given current traffic, it must first know where we work and which route we usually take to get there. There are two ways for it to find out: either we tell it ourselves, or it collects our data and works it out on its own.

These processes require a large amount of random data to better protect the real information; in other words, to provide effective camouflage. If the database holds data from only a few users, you will most likely need to flood it with alternate data.

In conclusion, although differential privacy is a recent field of data science, it has a rigorously formalized mathematical literature offering theorems, properties and mechanisms that guarantee proper protection of each user’s personal information and of the control each user can exercise over their data. At the same time, while anonymising each individual’s information, it allows useful conclusions to be drawn about a large set of data.