Data protection in algorithms
Technological development is enabling the automation of all processes, as Henry Ford did in 1914; The difference is that now instead of cars we have decisions about privacy. Since GDPR came into force on 25th May 2018, lots of questions have arisen regarding how the Regulation may block any data-based project.
In this article, we aim to clarify some of the main GDPR concepts that may apply to the processing of large amounts of data and algorithm decision-making. It has been inspired by the report the Norwegian Data Protection Authority -Datatilsynet- published in January this year: “Artificial Intelligence and Privacy”.
Artificial intelligence and the elements it comprises like algorithms and machine/deep learning are affected by GDPR for three main reasons: the huge volume of data involved, the need of a training dataset and the feature of automated decision-making without human intervention. These three ideas reflect four GDPR principles: fairness of processing, purpose limitation, data minimisation, and transparency. We are briefly explaining all of them in the following paragraphs – the first paragraph of each concept contains the issue and the second one describes how to address it according to GDPR.
One should take into account that without a lawful basis for automated decision making (contract/consent), such processing cannot take place.
Fairness processing: A discriminatory result after automated data processing can derive from both the way the training data has been classified (supervised learning) and the characteristics of the set of Data itself (unsupervised learning). For the first case, the algorithm will produce a result that corresponds with the labels used in training, so if the training was biased, so will do the output. In the second scenario, where the training data set comprises two categories of data with different weights and the algorithm is risk-averse, the algorithm will tend to favour the group with a higher weight.
GDPR compliance at this point would require implementing regular tests in order to control the distortion of the dataset and reduce to the maximum the risk of error.
Purpose limitation: In cases where previously-retrieved personal data is to be re-used, the controller must consider whether the new purpose is compatible with the original one. If this is not the case, a new consent is required or the basis for processing must be changed. This principle applies either to the re-use of data internally and the selling of data to other companies. The only exceptions to the principle relate to scientific or historical research, or for statistical or archival purposes directly for the public interest. GDPR states that scientific research should be interpreted broadly and include technological development and demonstration, basic research, as well as applied and privately financed research. These elements would indicate that – in some cases – the development of artificial intelligence may be considered to constitute scientific research. However, when a model develops on a continuous basis, it is difficult to differentiate between development and use, and hence where research stops and usage begins. Accordingly, it is therefore difficult to reach a conclusion regarding the extent to which the development and use of these models constitute scientific research or not.
Using personal data with the aim of training algorithms should be done with a data set originally collected for such purpose, either with the consent of the parties concerned or, to anonymisation.
Data minimisation: The need to collect and maintain only the data that are strictly necessary and without duplication requires a pre-planning and detailed study before the development of the algorithm, in such a way that its purpose and usefulness are well explained and defined.
This may be achieved by making it difficult to identify the individuals by the basic data contained. The degree of identification is restricted by both the amount and the nature of the information used, as some details reveal more about a person than others. While the deletion of information is not feasible in this type of application due to the continuous learning, the default privacy and by design must govern any process of machine learning, so that it applies encryption or use of anonymized data whenever possible. The use of pseudonymisation or encryption techniques protect the data subject’s identity and help limit the extent of intervention.
Transparency, information and right to explanation: Every data processing should be subject to the previous provision of information to the data subjects, in addition to a number of additional guarantees for automated decision-making and profiling, such as the right to obtain human intervention on the part of the person responsible, to express his point of view, to challenge the decision and to receive an explanation of the decision taken after the evaluation.
GDPR does not specify whether the explanation is to refer to the general logic on which the algorithm is constructed or the specific logical path that has been followed to reach a specific decision, but the accountability principle requires the subject should be given a satisfactory explanation, which may include a list of data variables, the ETL (extract, transform and load) process or the model features.
A data protection impact assessment carried by the DPO is required before any processing involving algorithms, artificial intelligence or profiling in order to evaluate and address the risk to the rights and freedoms of data subjects.