The Spanish Data Protection Authority (AEPD) has published new guidance on hash functions as a personal data pseudonymisation technique.
The GDPR refers to pseudonymisation of personal data as one of the appropriate technical and organisational measures that data controllers may take to ensure a level of security appropriate to the risk. However, it does not specify how data can be pseudonymised. In this context, a hash function may be a suitable technique and, lucky us, the AEPD has prepared some guidance to clarify how it works. Do you want to learn more about hash functions as a personal data pseudonymisation technique? Keep reading!
What does ‘pseudonymisation’ mean?
According to the GDPR, ‘pseudonymisation’ means “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”.
What is a hash function?
Have you ever tried to write a tweet longer than 280 characters and been stopped by Twitter's character limit? Hash functions have come to save you!
A digest or hash function is a process that transforms any arbitrary dataset into a fixed-length series of characters, regardless of the size of the input data. For example, the full text of Romeo and Juliet may become a series of just one hundred numbers after being run through a hash function. You may be wondering: how? Broadly speaking, hash functions divide the input message into blocks, calculate the hash for each block and combine the results into a single value.
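To make the fixed-length property concrete, here is a minimal sketch in Python. It assumes SHA-256 from the standard library as the hash function (the guidance discussed here is not tied to this algorithm), and the helper name sha256_hex is made up for illustration.

```python
import hashlib


def sha256_hex(message: str) -> str:
    """Return the SHA-256 digest of a text message as a hexadecimal string."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


# A very short input and a very long one (the same line repeated many times).
short_digest = sha256_hex("Romeo")
long_digest = sha256_hex("Two households, both alike in dignity, " * 10_000)

# Both digests have the same length, whatever the size of the input.
print(len(short_digest), len(long_digest))
```

Whatever the input size, the digest is always 64 hexadecimal characters (256 bits), and the same input always produces the same digest.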
Hash and re-identification
How likely is it that the output of a hash can be reverted to the initial input? Let’s imagine a processing activity which intends to associate a hash value with each National Insurance Number in the UK. The main element that would allow or hinder re-identification is the “order” in the message space.
The message space is the set of all possible datasets that may be created and from which a hash may be generated (in our case, UK National Insurance Numbers). The stricter this “order” is (for example, if only National Insurance Numbers of women aged 30 to 45 were admitted), the smaller the set of numbers (the processing message space) will be. This guarantees the effectiveness of the hash as a unique identifier (no collisions), but it also increases the likelihood of identifying the original message from the hash.
The degree of disorder in a dataset is called entropy. The smaller the message space and the lower the entropy, the lower the risk of collision in hash processing, but the more likely re-identification becomes. And vice versa: the higher the entropy, the higher the possibility of a collision, but the lower the risk of re-identification. This is why measuring the amount of information is one of the key factors to consider whenever a message is protected via hash functions or any other pseudonymisation or encryption technique.
How does this apply to day-to-day business? It basically means that the more variables “order” the message space (e.g. individuals’ age, gender, socioeconomic status, nationality, etc.), the higher the risk of re-identification (i.e. the higher the risk of singling out an individual).
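The link between a small message space and re-identification can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the message space is a hypothetical set of 5-digit identifiers (not real National Insurance Numbers), every identifier is treated as equally likely, and SHA-256 stands in for whichever hash the processing actually uses.

```python
import hashlib
import math


def sha256_hex(message: str) -> str:
    """Return the SHA-256 digest of a text message as a hexadecimal string."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


# Hypothetical message space: every 5-digit identifier from "00000" to "99999".
message_space = [f"{n:05d}" for n in range(100_000)]

# Entropy in bits, assuming every identifier is equally likely.
entropy_bits = math.log2(len(message_space))

# Anyone who knows the message space can hash every candidate in advance...
lookup = {sha256_hex(m): m for m in message_space}

# ...and then revert a "pseudonymised" value with a simple dictionary lookup.
pseudonym = sha256_hex("04217")
recovered = lookup[pseudonym]

print(f"entropy: {entropy_bits:.1f} bits, recovered identifier: {recovered}")
```

With only 100,000 candidates (roughly 16.6 bits of entropy), the whole space is hashed almost instantly, so the hash alone gives little protection. Each additional bit of entropy doubles the number of candidates an attacker has to try.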
The risk of re-identification is even higher when additional information is linked to the hash.
Strategies to hinder re-identification
One strategy to hinder re-identification of the hash value is to use an encryption algorithm with a key that is stored confidentially by the data controller or by another party taking part in the processing, so that the message is encrypted before the hash is computed.
The effectiveness of the encryption will depend on the environment (distributed environments may increase the risk), the vulnerability to attacks and the volume of encrypted information (the more information, the easier it will be to carry out cryptanalysis), among other factors.
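As a rough illustration of bringing a secret key into the process, the sketch below uses a keyed hash (HMAC with SHA-256) from Python's standard library. Note that this is a swapped-in, related construction and not the encrypt-then-hash scheme described above; the function name keyed_pseudonym and the sample identifier are assumptions made for the example.

```python
import hashlib
import hmac
import secrets

# Secret key, assumed to be stored confidentially and separately from the data.
secret_key = secrets.token_bytes(32)


def keyed_pseudonym(message: str, key: bytes) -> str:
    """Return an HMAC-SHA256 value of the message under the given key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


pseudonym = keyed_pseudonym("04217", secret_key)

# The same message under a different key gives an unrelated value, so the
# hashes cannot be precomputed without knowing the key.
other_pseudonym = keyed_pseudonym("04217", secrets.token_bytes(32))

print(pseudonym != other_pseudonym)
```

The protection now rests on the key: whoever holds it can recompute every pseudonym, which is why it has to be kept apart from the pseudonymised data.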
As an alternative to encryption, random fields may be added to the original message, so the format of the original message is expanded to an “extended message”, which increases its entropy.
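The “extended message” idea might look like the following sketch. It assumes a 16-byte random field prepended to the original message and SHA-256 as the hash; the helper name hash_extended_message and the sample identifier are hypothetical.

```python
import hashlib
import secrets


def hash_extended_message(message: str, random_field: bytes) -> str:
    """Hash the original message extended with a random field."""
    extended_message = random_field + message.encode("utf-8")
    return hashlib.sha256(extended_message).hexdigest()


# 16 random bytes add 128 bits to the message before hashing.
random_field = secrets.token_bytes(16)

pseudonym = hash_extended_message("04217", random_field)
plain_hash = hashlib.sha256(b"04217").hexdigest()

# The extended hash differs from the plain one, so a table precomputed over
# the original message space no longer matches it.
print(pseudonym != plain_hash)
```

If the same pseudonym must be recomputed later, the random field has to be kept, separately and securely. If it is discarded, the link between the message and its hash can no longer be rebuilt from the message alone.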
However, the computation of the hash itself (e.g. the selection of a specific algorithm and its implementation), aspects related to the message space (e.g. entropy), linked information, physical security, human factors, etc. involve a series of weaknesses and introduce different risk elements that make the hash function a pseudonymisation technique rather than an anonymisation one.
According to the AEPD, using hash techniques to pseudonymise or anonymise personal data must be justified by a re-identification risk analysis associated with the specific hash technique used in the processing. “In order to consider the hash technique an anonymisation technique, this risk analysis must also assess:
• The organisational measures that guarantee the removal of any information that allows for reidentification.
• The reasonable guarantee of the system robustness beyond the expected useful life of personal data.”