Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy

This paper describes how technologies such as data pseudonymisation and differential privacy enable access to sensitive data and unlock data opportunities and value while ensuring compliance with data privacy legislation and regulations.
Alan McSweeney
January 2022
alan@alanmcsweeney.com
Contents

Introduction
Personal Information
Third-Party Data Sharing And Data Access Framework
Data Privacy Technologies
Context Of Data Privatisation – Anonymisation, Pseudonymisation And Differential Privacy
    Data Sharing Use Cases
Pseudonymisation
    Why Pseudonymise Rather Than Anonymise?
    GDPR Origin Of Pseudonymisation
    Growing Importance Of Pseudonymisation
    Approaches To Pseudonymisation
        Pseudonymisation By Replacing ID Fields With Linking Identifier (Token)
        Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields
        ID Field Hashing Pseudonymisation
        Hashing And Identifier Codes
        Hashing And Reversibility
        ID Field Hashing Pseudonymisation With Data Salting And Peppering
        Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering
        Content Hashing Pseudonymisation
    Pseudonymisation And Data Lakes/Data Warehouses
    Pseudonymisation Implementation
Data Breaches and Attacks
    Pseudonymisation and Data Breaches
    Differencing Attack
    Differencing Attack, Reconstruction Attack And Mosaic Effect
Differential Privacy
Data Privatisation and Differential Privacy Solution Architecture Overview
    Differential Privacy Platform Solution Service Management Processes
    Differential Privacy Platform Deployment Options
        On-Premises Deployment
        Cloud Deployment
    Differential Privacy and Data Attacks
Data Privatisation and Differential Privacy Solution Planning
Data Privatisation and Differential Privacy Solution Operation and Use
Data Privatisation and Differential Privacy Next Steps
    Early Business Engagement and Differential Privacy Opportunity Validation
    Differential Privacy Detailed Design
    Differential Privacy Readiness Assessment
    Differential Privacy Architecture Sprint
List of Figures

Figure 1 – Data Privacy Subject Areas
Figure 2 – Data Privacy and Data Utility Balancing Act
Figure 3 – Data Sharing and Data Access Framework
Figure 4 – Data Sharing and Access Topologies
Figure 5 – Data Privatisation Spectrum
Figure 6 – Data Privacy Technologies
Figure 7 – Context of Data Privatisation
Figure 8 – Overview of Pseudonymisation
Figure 9 – Pseudonymisation for Data Sharing with External Business Partners
Figure 10 – Overview of Approaches to Pseudonymisation
Figure 11 – Pseudonymisation By Replacing ID Fields With Linking Identifier
Figure 12 – Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields
Figure 13 – ID Field Hashing Pseudonymisation
Figure 14 – ID Field Hashing Pseudonymisation With Data Salting And Peppering
Figure 15 – Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering
Figure 16 – Content Hashing Pseudonymisation
Figure 17 – Pseudonymisation and Data Lakes/Data Warehouses
Figure 18 – Pseudonymisation and Data Breaches
Figure 19 – Differential Privacy and Differencing Attacks
Figure 20 – Differencing Attack, Reconstruction Attack And Mosaic Effect
Figure 21 – Differential Privacy Operation
Figure 22 – Data Privatisation and Differential Privacy Balancing Act
Figure 23 – Operational Data Privatisation and Differential Privacy Solution Architecture
Figure 24 – Sample High-Level On-Premises Deployment
Figure 25 – Sample High-Level Cloud Deployment
Figure 26 – Data Privatisation and Differential Privacy Solution Journey
Figure 27 – Approaches to Data Privatisation and Differential Privacy Solution Scoping and Definition
Figure 28 – Early Business Engagement and Differential Privacy Opportunity Validation Process
Figure 29 – Differential Privacy Detailed Design Views
Figure 30 – Areas Covered in Differential Privacy Readiness Assessment
Introduction
This paper examines the related concepts of data privatisation, data anonymisation, data pseudonymisation, and
differential privacy.
Data has value. To realise this value, it may need to be made more widely available, both within and outside your
organisation, for various types of access, such as sharing data with outsourcing and service partners or making data
available to research partners. This data sharing must be performed in the context of maintaining personal data privacy.
This paper examines the technology options for providing different types of access to data while preserving privacy and
ensuring compliance with the many (and growing) data privacy regulatory and legislative requirements.
You need to take a risk management approach to data sharing and third-party data access. Appropriate technology,
appropriately implemented and operated, is a means of managing and reducing the risk of re-identification by making the
time, skills, resources and money necessary to achieve it unrealistic. A demonstrable technology-based approach to
data privacy supported by a data sharing business framework reduces an organisation's liability in the event of data
breaches.
For example, under the EU GDPR (General Data Protection Regulation)1, where a data breach occurs, the controller is
exempted from its notification obligations where it can show that the breach is ‘unlikely to result in a risk to the rights and
freedoms of natural persons’2, such as when pseudonymised data leaks and the re-identification risk is remote.
Organisations need a well-defined and implemented process that enables them to make their data available as widely as
possible without exposing them to the risks associated with non-compliance with the wide range of differing data privacy
regulations.
Managing data privacy in the context of data access and sharing arrangements encompasses the areas of:
• Data Governance
• Privacy Management
• Security Management
• Risk Management
1 See http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679
2 See GDPR Recital 85 and Articles 33 and 34.
Figure 1 – Data Privacy Subject Areas
Managing data privacy in the context of data access and sharing arrangements is a balancing act between data
privacy and data utility. Perfect data privacy can be achieved by not sharing or making accessible any data, irrespective of
whether it contains personal identifiable information. The result of this is that data is unused.
Perfect data utility can be achieved by sharing and making accessible all data. The result of this is that there is no data
privacy.
One aspect of data privacy management is taking a risk-based approach to this balancing act.
Figure 2 – Data Privacy and Data Utility Balancing Act
This paper describes some practical, realistic and achievable approaches to implementing data privatisation using
pseudonymisation and differential privacy approaches as a means of addressing your data sharing and access
requirements and opportunities.
This paper covers the following topics:
• Personal Information – this discusses what is meant by personal information.
• Third-Party Data Sharing And Data Access Framework – data sharing is a business issue enabled through
technologies. But it is primarily a business concern and any arrangements should be grounded in a business
framework.
• Data Privacy Technologies and Context Of Data Privatisation – this discusses the data privatisation approaches of
anonymisation, pseudonymisation and differential privacy. It covers the GDPR origin of pseudonymisation, the
growing importance of pseudonymisation, the various approaches to pseudonymisation (including hashing) and
pseudonymisation in data lakes/data warehouses.
• Data Breaches and Attacks – this provides background information on data breaches and attacks and how data
privatisation approaches provide protection against them.
• Why Data Privatisation and Differential Privacy – this provides the context for the need for a robust, secure
operational data privatisation and differential privacy technology framework.
• Data Privatisation and Differential Privacy Solution Architecture Overview – how a differential privacy
solution sits within your existing information technology solution and data landscape, what its components are and
what the solution deployment options are.
• Data Privatisation and Differential Privacy Solution Planning – what an exercise to plan for the
implementation and operation of a successful data privatisation and differential privacy solution consists of.
• Data Privatisation and Differential Privacy Solution Operation and Use – how the data privatisation and
differential privacy solution is operated and used.
• Differential Privacy Next Steps – this describes a set of possible next steps and types of engagement to allow you
to move along the data privatisation and differential privacy journey successfully.
Personal Information
Personal information is any information relating to an identified or identifiable natural person. It can be direct
(information that directly identifies a single individual) or indirect, in the form of quasi-identifiers (information that can be
used to identify an individual by being linked with other information).
Quasi-identifiers include information such as date of birth, date of death and post code. These do not specifically
link to an individual but such links can be determined.
Personal information can be structured or unstructured such as free-form text or it can take other forms such as images
(photographs, medical images) or other data types such as genomic data.
Personal information can be stored in multiple different ways from database tables and columns to data formats such as
documents and spreadsheets to image files. Personal information may also exist in the form of metadata attached to
data files.
The technologies underpinning data privatisation will need to handle all these data types and formats.
When considering data privatisation in the context of data access and sharing, the full set of personal information and the
range of data formats should be considered. The approach to handling quasi-identifiers may differ from that taken for
direct identifiers: rather than completely removing them, they could be made more general, such as storing only month
and year for date of birth, or a date range could be specified.
Third-Party Data Sharing And Data Access Framework
Managing data privacy in the context of data access and data sharing is not just a technology concern. The selection,
implementation and operation of technologies needed to ensure data privacy exist within a wider data sharing and access
framework. Organisations that intend to share and provide access to data should define such a framework. This will
provide an explicit approach rather than leaving such arrangements implicit and poorly defined. It will reduce the time
and effort required to implement data access and sharing. It will ensure a consistent and coherent approach. The
following diagram describes a possible structure for such a framework.
Data access and sharing covers both internal access, such as business units other than the originating business unit
accessing data, and external access, where third parties are given access to data for business and research purposes.
Figure 3 – Data Sharing and Data Access Framework
This framework has the following dimensions:
1. Business and Strategy Dimension – this relates to the overall organisation posture relating to internal and external
data access and sharing and needs to cover topics such as:
• Overall Objectives, Purposes and Goals – this sets the context and overall direction of and the principles that
will underpin data sharing and data access arrangements. The objectives, purposes and goals of these
arrangements will be defined.
• Data Sharing Strategy – this will define the organisation’s strategy for internal and external data sharing and
access – why it is being done, who will be allowed access to data, the types of data to which access will be
granted, the types of access allowed and the technology approaches that will be used.
• Risk Management, Governance and Decision Making – this will cover how data sharing and access
arrangements will be governed and managed, how decisions will be made on these arrangements and how data
sharing and access risks will be managed.
• Charges and Payments – this will define the charges and payments structure, if applicable, that will apply to
data access and sharing arrangements.
• Monitoring and Reporting – this will document how the operation and use of data access and sharing
arrangements will be monitored, audited and reported on.
2. Legal Dimension – this encompasses the legal aspects of data sharing and needs to cover topics such as:
• Data Privacy Legislation and Regulation Compliance – this will cover the activities of researching and
monitoring the data privacy legislative and regulatory landscape and any changes and developments that may
impact data access and sharing.
• Contract Development and Compliance – this will encompass the development, negotiation and
implementation of contractual arrangements governing specific data access and sharing arrangements.
3. Technology Dimension – this covers technology and security standards and needs to cover topics such as:
• Data Sharing and Data Access Technology Selection – this covers the arrangements and responsibilities for
selecting the tools and technologies that will be used to implement data access and sharing.
• Technology Standards Monitoring and Compliance – this will define the responsibilities for and scope of
monitoring technology standards and developments, the organisation’s adoption of and compliance with those
standards and managing change as the standards change.
• Security Standards Monitoring and Compliance – this will describe both how data access and sharing security
standards should be monitored, how security is implemented for data sharing and access arrangements and
managing change as the standards change.
4. Development and Implementation Dimension – this relates to the implementation of data sharing technology tools
and platforms and of specific data access and sharing arrangements and needs to cover topics such as:
• Technology Platform and Toolset Selection and Implementation – this includes the selection and
implementation of specific data access and sharing technologies covering security and access control, the range
of data types and the data access facilities being offered.
• Functionality Model Development and Implementation – this relates to defining and implementing the data
access and sharing functionality, features being offered and the tools and technologies that will support them.
• Data Sharing and Access Implementations – this encompasses the specification and implementation of specific
data access and sharing arrangements.
• Data Sharing and Access Maintenance and Support – this covers the maintenance and support arrangements
both of the overall data access and sharing tools, platforms and technologies as well as the specific
arrangements.
5. Service Management Dimension – this defines the operational processes that should be defined and implemented in
order to operate data sharing and needs to cover topics such as:
• Service Management Processes – this defines the operational and service management processes that need to
be implemented and operated.
• Operational and Service Level Agreement Management – this covers the topic of defining and then managing
and monitoring compliance with operational and service level agreements for data access and sharing
arrangements.
• Maintain Inventory of Data Sharing Arrangements – this covers the maintenance of a list of current and
previous data sharing and access arrangements.
• Service Monitoring and Reporting – this defines how the data sharing arrangements will be monitored and
reported on.
• Issue Handling and Escalation – this covers how any issues relating to the operation and use of data sharing will
be recorded, handled and escalated.
There are different data sharing and access arrangements.
Figure 4 – Data Sharing and Access Topologies
Data can be made available more widely within the organisation for purposes for which it was not originally collected.
Data can be made publicly available. Once this has been done, it will not be possible to control who uses it or the uses to
which it is put, or to recall it.
Data can be shared subject to some form of legal or contractual arrangement.
Data can be shared through some form of controlled and secure facility.
In the last two arrangements, some form of trust exists between the sharing entity and the data recipient. This sharing
may be supported by penalties (after disclosure) or by technology (disclosure prevention) or both.
Data can be pushed to the target or the data can be made available to the target through a pull or download facility.
The data sharing and access framework should cover all these possibilities.
Within the context of data access and sharing, data privatisation can be viewed as a spectrum from completely
identifiable data to data that is not linked to individuals.
Figure 5 – Data Privatisation Spectrum
The data privacy risk is reduced as you move further to the right. Data utility may also be reduced as you move to the
right.
The data sharing and access framework should combine both the data sharing and access topology and data privatisation
spectrum to get a more complete view of data access arrangements.
Data Privacy Technologies
Data privatisation is the removal of personal identifiable information (PII) from data. At a very high-level, data
privatisation can be achieved in one or both of two ways:
1. Data Summarisation – sets of individual data records are compressed into summary statistics with all personal
information removed
2. Data Tokenisation – the personal data within a dataset that allows an individual to be identified is replaced by a
token (possibly generated from the personal data such as by hashing), either permanently (anonymisation) or
reversibly (pseudonymisation)
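As a rough illustration of these two approaches, the following sketch (Python, with hypothetical field names and values) produces a summary with all personal fields dropped, and a tokenised copy whose token-to-identity mapping would be held in a separately secured key store:

import secrets
from statistics import mean

records = [
    {"name": "Ann Smith", "age": 34, "balance": 1200.0},
    {"name": "Joe Bloggs", "age": 51, "balance": 880.0},
]

# 1. Data Summarisation: keep only aggregate statistics, no personal fields
summary = {"count": len(records), "mean_age": mean(r["age"] for r in records)}

# 2. Data Tokenisation: replace the identifying field with a random token;
#    the token-to-identity mapping is stored securely elsewhere
key_store = {}
tokenised = []
for r in records:
    token = secrets.token_hex(16)
    key_store[token] = r["name"]
    tokenised.append({"id": token, "age": r["age"], "balance": r["balance"]})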
Figure 6 – Data Privacy Technologies
There are different routes to making data accessible and shareable within and outside the organisation without
compromising compliance with data protection legislation and regulations and removing the risk associated with
allowing access to personal data.
• Differential Privacy – source data is summarised and individual personal references are removed. The one-to-one
correspondence between original and transformed data is removed.
• Anonymisation – identifying data is destroyed and cannot be recovered, so individuals cannot be identified. There is
still a one-to-one correspondence between original and transformed data.
• Pseudonymisation – identifying data is encrypted and recovery data/token is stored securely elsewhere. There is still
a one-to-one correspondence between original and transformed data.
These technologies and approaches are not mutually exclusive – each is appropriate to different data sharing and data
access use cases.
Context Of Data Privatisation – Anonymisation, Pseudonymisation And
Differential Privacy
The wider context of data privatisation and specific approaches for enabling it such as anonymisation, pseudonymisation
and differential privacy can be represented by the four interrelated areas of:
• Value in Data Volumes and Data Assets – you have expended substantial resources in gathering, processing and
generating data. This data has value that you want to realise by making it more widely available. The need to comply
with the increasing body of data protection and privacy laws inhibits your ability to achieve this.
• Data Privacy Laws and Regulations – you need to ensure that making your data available to a wider range of
individuals and organisations does not breach the ever-increasing set of data protection and privacy legislation and
regulations. All too frequently the cost of and concerns around ensuring this compliance prevents this wider data
access.
• Technologies – the various data privatisation and privacy technologies are mature, well-proven, industrialised and
independently certified. They can be used to provide controlled, secure access to your data while guaranteeing
compliance with data protection and privacy legislation. Using these technologies will embed such compliance by
design into your data sharing and access facilities. This will allow you to realise value from your data successfully.
• Data Processes and Business Data Trends – the volumes of data available to organisations are increasing. The range
of analysis tools and technologies available are increasing. Data storage is moving to cloud-platforms that can handle
data volumes and provide analysis tools more easily than costly and complex on-premises solutions that are available
only to larger organisations. Organisations are outsourcing more business processes to third parties. These
outsourcing arrangements require the sharing of data.
Figure 7 – Context of Data Privatisation
To achieve the value inherent in your data you need to be able to make it appropriately available to others. You need a
process that enables you to make your data available as widely as possible without exposing you to risks associated with
non-compliance with the wide range of differing data privacy regulations. You need one data access framework and
associated set of technologies that work for all data access and sharing while guaranteeing legislative and regulatory
compliance.
Data Privatisation Topology – Data Privacy Laws and Regulations: The landscape of data protection and privacy legislation and regulations is extensive, complex and growing. Organisations that share data externally need to be able to guarantee compliance with all relevant and applicable legislation.

Data Privatisation Topology – Value in Data Volumes and Data Assets: Organisations have more and more data of increasing complexity that they want and need to share in order to generate value.
Data Privatisation Topology – Technologies: There is a range of well-proven technologies available for ensuring data privacy.

Data Privatisation Topology – Data Processes and Business Data Trends: Organisations want to outsource their business processes and share their data with partners to gain access to specialist analytics and research skills and tools.
Data Sharing Use Cases
There are many data sharing use cases and scenarios that involve sharing potentially personal identifiable information,
such as:
• Share data with other business functions within your organisation
• Use third-party data processing and storage platform and facilities
• Use third-party data access and sharing as a service platform and facilities
• Use third-party data analytics platform and facilities
• Engage third-party data research organisations to provide specialist services
• Share data with external researchers
• Outsource business processes and enable data sharing with third parties
• Share data with industry business partners to gain industry insights
• Share data to detect and avoid fraud
• Share customer data with service providers at the request of the customer
• Enable customer switching
• Participate in Open Data initiatives
Pseudonymisation
Pseudonymisation is an approach to deidentification where personally identifiable information (PII) values are replaced
by tokens or artificial identifiers or pseudonyms.
Pseudonymisation is one technique to assist compliance with EU General Data Protection Regulation (GDPR)
requirements for secure storage of personal information.
Pseudonymisation is intended to be reversible: the pseudonymised data can be restored to its original state.
Personal data fields can be individually pseudonymised so there is a one-to-one correspondence between original source
data fields and transformed data fields or the personal data fields can be removed and replaced with a token.
Figure 8 – Overview of Pseudonymisation
Why Pseudonymise Rather Than Anonymise?
Personal identifiable data is pseudonymised when there is a need to re-identify the data, for example, after it has been
worked on by a third-party either within or outside the organisation and the results of the processing need to be matched
to the original data. The following diagram illustrates such a scenario.
Figure 9 – Pseudonymisation for Data Sharing with External Business Partners
The numbered steps are:
1. Original Data – this is the original collected or processed data containing personal identifiable information.
2. Pseudonymised Data – the personal identifiable information within the data is pseudonymised.
3. Pseudonymisation Key – there is a separate pseudonymisation key that allows pseudonymised data to be re-identified
when needed. This needs to be kept separate from the pseudonymised data.
4. Pseudonymised Data Transmitted to Data Processor – the pseudonymised data is then sent to the external data
processor for their use.
5. Processed Data with Additional Processed Data – the data is enriched with the results of additional processing.
6. Pseudonymised Data with Additional Processed Data Returned – the enriched data is returned to the organisation.
7. Original Data Merged with Additional Processed Data – the enriched data is re-identified using the previously
created pseudonymisation key.
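The flow above can be sketched in a few lines of Python; the field names, the token scheme and the enrichment step are hypothetical, and a real implementation would hold the key in a separately secured store:

import secrets

original = [{"name": "Ann Smith", "balance": 1200.0}]

# Steps 1-3: pseudonymise and create the separate pseudonymisation key
pseudonymisation_key = {}               # must be held apart from the shared data
shared = []
for row in original:
    token = secrets.token_hex(16)
    pseudonymisation_key[token] = row["name"]
    shared.append({"id": token, "balance": row["balance"]})

# Steps 4-6: the external processor enriches the data and returns it (simulated)
returned = [dict(row, risk_score=0.7) for row in shared]

# Step 7: re-identify the returned data using the pseudonymisation key
merged = [dict(row, name=pseudonymisation_key[row["id"]]) for row in returned]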
Pseudonymisation can also be used as part of the archiving process for data containing personal identifiable information
after its main processing has been completed and the data is being retained for historical purposes.
GDPR Origin Of Pseudonymisation
The use of pseudonymisation as a form of encryption of personal identifiable information gained importance and
legitimacy from the GDPR. Pseudonymisation is referred to many times in the GDPR.
The term pseudonymisation is defined in Article 4(5) of the GDPR:
‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no
longer be attributed to a specific data subject without the use of additional information, provided that such
additional information is kept separately and is subject to technical and organisational measures to ensure that
the personal data are not attributed to an identified or identifiable natural person;
Pseudonymisation is also referred to in Recitals 26 and 28 of the GDPR:
Recital 26
The principles of data protection should apply to any information concerning an identified or identifiable
natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural
person by the use of additional information should be considered to be information on an identifiable natural
person. To determine whether a natural person is identifiable, account should be taken of all the means
reasonably likely to be used, such as singling out, either by the controller or by another person to identify the
natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the
natural person, account should be taken of all objective factors, such as the costs of and the amount of time
required for identification, taking into consideration the available technology at the time of the processing and
technological developments. The principles of data protection should therefore not apply to anonymous
information, namely information which does not relate to an identified or identifiable natural person or to
personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This
Regulation does not therefore concern the processing of such anonymous information, including for statistical
or research purposes.
Recital 28
The application of pseudonymisation to personal data can reduce the risks to the data subjects concerned and
help controllers and processors to meet their data-protection obligations. The explicit introduction of
‘pseudonymisation’ in this Regulation is not intended to preclude any other measures of data protection.
Article 32(1)(a), dealing with security, refers to ‘the pseudonymisation and encryption of personal data’, using
pseudonymisation to mean changing personal data so that the resulting data cannot be attributed to a specific person
without the use of additional information.
Article 89, covering safeguards and derogations relating to processing for archiving purposes in the public interest,
scientific or historical research purposes or statistical purposes, refers to pseudonymisation as follows:
1. Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical
purposes, shall be subject to appropriate safeguards, in accordance with this Regulation, for the rights and
freedoms of the data subject. Those safeguards shall ensure that technical and organisational measures are in
place in particular in order to ensure respect for the principle of data minimisation. Those measures may
include pseudonymisation provided that those purposes can be fulfilled in that manner. Where those
purposes can be fulfilled by further processing which does not permit or no longer permits the identification of
data subjects, those purposes shall be fulfilled in that manner.
Article 6(4), covering lawfulness of processing, refers to pseudonymisation as a means of possibly contributing to the
compatibility of further use of data:
Where the processing for a purpose other than that for which the personal data have been collected is not
based on the data subject's consent or on a Union or Member State law which constitutes a necessary and
proportionate measure in a democratic society to safeguard the objectives referred to in Article 23(1), the
controller shall, in order to ascertain whether processing for another purpose is compatible with the purpose for
which the personal data are initially collected, take into account, inter alia:
(a) any link between the purposes for which the personal data have been collected and the purposes of the
intended further processing;
(b) the context in which the personal data have been collected, in particular regarding the relationship between
data subjects and the controller;
(c) the nature of the personal data, in particular whether special categories of personal data are processed,
pursuant to Article 9, or whether personal data related to criminal convictions and offences are processed,
pursuant to Article 10;
(d) the possible consequences of the intended further processing for data subjects;
(e) the existence of appropriate safeguards, which may include encryption or pseudonymisation.
Article 25 refers to pseudonymisation as a means to contribute to data protection by design and by default in data
applications
1. Taking into account the state of the art, the cost of implementation and the nature, scope, context and
purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural
persons posed by the processing, the controller shall, both at the time of the determination of the means for
processing and at the time of the processing itself, implement appropriate technical and organisational
measures, such as pseudonymisation, which are designed to implement data-protection principles, such as
data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in
order to meet the requirements of this Regulation and protect the rights of data subjects.
Encryption is a form of pseudonymisation. The original data cannot be read. The process cannot be reversed without the
correct decryption key. GDPR requires that this additional information be kept separate from the pseudonymised data.
Pseudonymisation reduces risks associated with data loss or unauthorised data access. Pseudonymised data is still
regarded as personal data and so remains covered by the GDPR. It is viewed as part of the Data Protection By Design
and By Default principle.
Pseudonymisation is not mandatory. Implementing pseudonymisation with old legacy IT systems and processes may be
complex and expensive and, to that extent, pseudonymisation might be considered an example of unnecessary
complexity within the GDPR.
In relation to processing that does not require identification, it is appropriate to refer to Article 11. Article 11(1) provides
that if the purposes for which a controller processes personal data do not, or no longer, require the identification of a data
subject by the controller, the controller shall not be obliged to maintain, acquire or process additional information in
order to identify the data subject for the sole purpose of complying with the GDPR. Where, in such cases, the controller is
able to demonstrate that it is not in a position to identify the data subject, the controller shall inform the data subject
accordingly, if possible and in such cases, Articles 15 to 20 shall not apply except where the data subject, for the purpose
of exercising his or her rights under those articles, provides additional information enabling his or her identification.
The GDPR has effectively made pseudonymisation the recommended approach to protecting personal identifiable
information.
Growing Importance Of Pseudonymisation
The Schrems II judgement3 has further increased the importance and relevance of data pseudonymisation, particularly in
relation to data transfers outside the EU. The judgement found that the US FISA (Foreign Intelligence Surveillance Act)
does not respect the minimum safeguards resulting from the principle of proportionality and cannot be regarded as
limited to what is strictly necessary. While the changes apply to transfers outside the EU, especially to the US, they can be
adopted pervasively for all data transfers to ensure consistency.
The European Data Protection Board (EDPB) adopted version 2 of its recommendations on supplementary measures4 to
enhance data transfer arrangements and ensure compliance with EU personal data protection requirements.
In this context, data pseudonymisation must ensure that:
• Data is protected at the record and data set level as well as the field level so that the protection travels with the data
wherever it is sent
• Direct, indirect, and quasi-identifiers of personal information are protected
• The approach must attempt to protect against mosaic effect re-identification attacks by adding high levels of
uncertainty to pseudonymisation techniques.
Approaches To Pseudonymisation
There are several potential approaches to pseudonymisation that can be implemented, as shown in the following
diagram:
Figure 10 – Overview of Approaches to Pseudonymisation
These approaches include:
• Replace IDAT Fields With Linking Identifier
• Hash IDAT Fields
3 https://curia.europa.eu/juris/document/document.jsf?text=&docid=228677&pageIndex=0&doclang=en
4 https://edpb.europa.eu/system/files/2021-06/edpb_recommendations_202001vo.2.0_supplementarymeasurestransferstools_en.pdf
• Hash IDAT Fields With Additional Salting/Peppering
• Generate Hash From All Contents
These approaches are explained in more detail in the next sections. In the following, IDAT means identifying data and
refers to personal identifiable information, and ADAT means analytic data.
Pseudonymisation By Replacing ID Fields With Linking Identifier (Token)
This approach involves replacing identifying data fields with a random value. These random values are then stored in a
separate, secure, access-restricted data set that links each random value to the original record.
Figure 11 – Pseudonymisation By Replacing ID Fields With Linking Identifier
Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields
Where there are multiple identifying data fields, these can each be replaced with random values, or the multiple
identifying data fields can be removed and replaced with a single identifier.
Figure 12 – Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields
Replacing multiple source fields with a single token field reduces the granularity with which the original source data can
be retrieved. The entire set of source fields must be retrieved from the depseudonymisation key and then the individual
field required can be retrieved.
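A minimal sketch of this multi-field variant (with hypothetical field names): all identifying fields collapse into one token and the depseudonymisation key holds the complete original set.

import secrets

record = {"name": "Ann Smith", "dob": "1987-04-12",
          "customer_no": "ABC-123-456-7", "balance": 1200.0}
id_fields = ("name", "dob", "customer_no")

token = secrets.token_hex(16)
# The depseudonymisation key stores the whole set of original ID values
depseudonymisation_key = {token: {f: record[f] for f in id_fields}}
pseudonymised = {"id": token, "balance": record["balance"]}

# Retrieving any single ID field requires looking up the entire set first
original_dob = depseudonymisation_key[token]["dob"]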
ID Field Hashing Pseudonymisation
The hashing approach to pseudonymisation involves replacing identifying data with a hash code of the data. So, for
example, the SHA3-512 hash of IDAT1 in hexadecimal is:
576c23e0ec773508ae7a03d1b286d75f3a7cfe524625b658a1961d3fa7b0ebb4cc01b3b530c63
4c9525631614ad3ebcb3afb69d33e5d8608a1587c2f43c16535
The SHA3-512 algorithm returns a 512-bit value. The hexadecimal value above is represented as the following binary
string:
01010111011011000010001111100000111011000111011100110101000010001010111001111
01000000011110100011011001010000110110101110101111100111010011111001111111001
01001001000110001001011011011001010111101000011001011000011101001111111010011
11011000011101011101101001100110000000001101100111011010100110000110001100011
01001100100101010010010101100011000101100001010010101101001111101011110010110
01110101111101101101001110100110011111001011101100001100000100010100001010110
00011111000010111101000011110000010110010100110101
Storing a SHA3-512 hash code requires 64 bytes. In the case of some identifying data fields, this may be longer than the
field itself. So pseudonymisation will increase storage requirements, by replacing shorter fields with longer ones and by
requiring the storage of separate depseudonymisation keys – see the Pseudonymisation Implementation section.
The input identifying data cannot be recalculated from the hash directly. However, hash values can be easily and quickly
calculated (a “brute force” attack) and compared to pseudonymised values to recover the original identifying data.
Figure 13 – ID Field Hashing Pseudonymisation
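A minimal sketch of field hashing using Python's hashlib; whether the digest matches the value quoted above depends on the exact bytes the author hashed (assumed here to be the literal string IDAT1):

import hashlib

idat = "IDAT1"  # assumed literal example value
digest = hashlib.sha3_512(idat.encode("utf-8")).hexdigest()
print(digest)   # 128 hexadecimal characters = 512 bits = 64 bytes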
Hashing And Identifier Codes
If any of the IDAT fields contains a recognisable identifier code then brute force hash attacks are very feasible, even with
modest computing resources. In general, identifying data tends to be more structured than other data – names,
addresses, codes and so on.
For example, consider an identifier code with a format such as:
AAA-NNN-NNN-C
where:
A is an upper-case alphabetic character
N is a number from 0-9
C is a check character
There are 17,576,000,000 possible combinations of this sample identifier code. This may appear to be a large number. But
a single high-specification PC could calculate all the SHA3-512 hash values for these combinations in a few hours.
So, unless the input to the hash generation is augmented with additional more random information, brute force attacks
are feasible.
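The following sketch shows why: it enumerates the whole AAA-NNN-NNN space and compares each hash against a leaked pseudonym (the check character is assumed derivable from the rest of the code and is represented by a placeholder here):

import hashlib
import itertools
import string

def brute_force(target_hash: str):
    # 26^3 letter prefixes x 10^6 digit combinations = 17,576,000,000 codes
    for letters in itertools.product(string.ascii_uppercase, repeat=3):
        prefix = "".join(letters)
        for n in range(1_000_000):
            code = f"{prefix}-{n // 1000:03d}-{n % 1000:03d}-X"  # 'X': placeholder check character
            if hashlib.sha3_512(code.encode()).hexdigest() == target_hash:
                return code
    return None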
The following illustrates how a small (single character) change (in this case, changing a character from lower to upper
case) in the sample input value generates very different hash codes.
Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "ne plus ultra" to the progress of ...
SHA3-512 Hash: e0ef7bd38b6b4bc6a27e7260d2162b2ea58cf5afa5098072d0f735f9d73b67f9b9f699b8b098ec41d44e117135e88b3cfb670876a2f34efd5734e7ce80b64450

Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "Ne plus ultra" to the progress of ...
SHA3-512 Hash: e0ab9f0efb8f4cc2b89b73439f7b1365e687b17b7e0bdc0ede00751a5a883ad8ee0877b9b6a3032ad23521a7bc25a0b199e5c57cdb2cb5d7500c997e133c41a1

Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "ne Plus ultra" to the progress of ...
SHA3-512 Hash: 61361212da56a824559b81409cf02ba5f8c3bf41d4c8038faa885a183e1bdac1705eefad72594af1fc3901aa55295c3166eb6635ca866f1e5cdf56c7ff0fb56a

Input: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "ne plus Ultra" to the progress of ...
SHA3-512 Hash: 833d8b7cc47843cf74fd42cbbf782e87543c677ecbdc1f7fe4d7ad9166557fac4c17d467fa81302a195e60a0a6f3f89c34e03a5c94eefcb3f19cabcfd87a37ad
Hashing And Reversibility
The hash of a value is always the same – there is no randomness in hashing. However, as shown above, hashes of very
similar input values are very different. A very small input change leads to a very large difference in the generated hash: for
SHA3-512, a 0.5% change in the input value leads to an 85%-95% difference in the hash output.
So, given two hash values, it cannot be easily determined how similar the input values are or what the structure of the
input values might be. This non-correlation property (the avalanche effect) means that the hash output behaves, in effect,
erratically with respect to the input.
Hashing as a form of pseudonymisation is potentially vulnerable to brute force attacks because large numbers of hashes
can be generated very easily and quickly. If you have some knowledge of the input value, you can generate large numbers
of permutations and their hashes and compare these with the known hash to identify the original value. But ultimately
you have to have the exact input value to generate the same hash: being very close is of no benefit.
Therefore, combining the original data with even a small amount of randomised data renders brute force attacks on hash
values more complex.
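A short demonstration of this non-correlation property; the expected character-level difference between two unrelated hex digests is 15/16, roughly 94%, consistent with the range quoted above:

import hashlib

a = hashlib.sha3_512(b'... the "ne plus ultra" to the progress of ...').hexdigest()
b = hashlib.sha3_512(b'... the "Ne plus ultra" to the progress of ...').hexdigest()

# Fraction of hex characters that differ between the two digests
diff = sum(x != y for x, y in zip(a, b)) / len(a)
print(f"{diff:.0%} of hash characters differ")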
ID Field Hashing Pseudonymisation With Data Salting And Peppering
Salt is an additional, per-field item of random data added to each identifying data field before hashing. Pepper is a fixed
item of data added to record- or field-level data before hashing. With this approach the hashed identifying data is:
HASH(CONCATENATE(IDATi + SALTi + PEPPER))
For example, SHA3-512(CONCATENATE(IDAT1 + SALT1 + PEPPER)) =
3fa075114200b2327092f18067059ba81a5b191b33d5a10a2042673adcb119fac4dc5d3f63c60d44e132f4db5996d416fd70216d4e055f1e5ccc0258ff15e1e1
This approach eliminates almost all the risk from brute force hash generation attacks, unless the approach used to
generate the Salt and Pepper values can be determined.
Figure 14 – ID Field Hashing Pseudonymisation With Data Salting And Peppering
While the Pepper value seems to add little to the randomisation of the hash, it makes determining the pseudo random
number generator harder and thus makes the hash more secure.
One possible approach to generating the Salt is to use a cryptographically secure pseudo random number generator5
(CSPRNG) to generate salt values. Other, less secure PRNGs are vulnerable to attacks.
This ensures that the random salt values are very difficult to determine, which in turn makes brute force attacks virtually
impossible. The following shows some examples of random numbers added to identifying data to generate hash codes:
HASH(CONCATENATE(IDAT1+1144360296176+2356573852518))
HASH(CONCATENATE(IDAT2+4700182946372+2356573852518))
HASH(CONCATENATE(IDAT3+1112492458021+2356573852518))
HASH(CONCATENATE(IDAT4+2755842713752+2356573852518))
HASH(CONCATENATE(IDAT5+6908485085952+2356573852518))
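A sketch of salted and peppered field hashing, using Python's secrets module (an interface to the operating system's CSPRNG) in place of the generators cited in the footnote; the salt width and pepper value mirror the numeric examples above:

import hashlib
import secrets

PEPPER = "2356573852518"  # fixed value shared across all records; keep secret

def pseudonymise(idat: str) -> tuple[str, str]:
    salt = str(secrets.randbelow(10**13))  # fresh per-field salt from a CSPRNG
    digest = hashlib.sha3_512((idat + salt + PEPPER).encode()).hexdigest()
    # The salt must be retained (with the depseudonymisation key) to allow
    # later regeneration and comparison of the hash
    return digest, salt

hash1, salt1 = pseudonymise("IDAT1")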
Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering
Using this approach to augment the identifying data hash, an attacker seeking the identifying data and the additional
random data used to generate a hash code needs to know three pieces of information:
1. The structure of the identifying data in order to generate all possible permutations
2. The pseudo random number generator used to generate the Salt values
3. The specific Pepper code used, if this has been added.
Figure 15 – Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering
5 Examples include:
Fortuna (cryptographically secure) - https://www.schneier.com/academic/fortuna/
PCG - https://www.pcg-random.org/ (a fast, statistically strong general-purpose PRNG, though not designed as a CSPRNG)
Content Hashing Pseudonymisation
Content hashing involves generating the hash token from the entire record contents rather than just individual
identifying fields. For example, the hash is generated from:
SHA3-512(IDAT1,ADAT1,SALT1,PEPPER) =
df767164078cb0779d06c1de02de74c62192461e82bbb0d01d60c3c3664c9c69111d5d2f07415
333e85cc04acfc1f7a204eadd8deead25a63c5a5ad343a5b3f2
This results in a very high degree of variability in the source data for the hashes. It increases the difficulty of identifying
the source data that generated the hash code.
Figure 16 – Content Hashing Pseudonymisation
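A minimal sketch of content hashing over a whole record, with hypothetical field values; the salt and pepper are concatenated with both the identifying and analytic fields before hashing:

import hashlib

record = {"IDAT1": "Ann Smith", "ADAT1": "1200.00"}
salt, pepper = "1144360296176", "2356573852518"

# Hash the entire record contents (ID and analytic fields) plus salt and pepper
material = ",".join([record["IDAT1"], record["ADAT1"], salt, pepper])
token = hashlib.sha3_512(material.encode()).hexdigest()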
Pseudonymisation And Data Lakes/Data Warehouses
Data should be pseudonymised before the data lake and/or data warehouse is populated as part of a Data Privacy By
Design And By Default approach.
At a high-level the stages involved in this are:
1. As part of the standard ETL/ELT process, the source data is pseudonymised and the depseudonymisation key is
created.
2. The pseudonymised data is passed to the data lake. The data may remain in the data lake or it may be used to
populate the data warehouse.
3. The pseudonymised data created by the ETL/ELT process may be used to update the data warehouse directly,
bypassing the data lake stage.
4. The pseudonymised data in the data lake is used to update the data warehouse.
Figure 17 – Pseudonymisation and Data Lakes/Data Warehouses
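Stage 1 of this flow might look like the following sketch, where pseudonymisation happens inside the ETL/ELT step before any row reaches the lake; the function and field names are illustrative only:

import secrets

def etl_pseudonymise(source_rows, id_fields, key_store):
    """Replace ID fields with a linking token before loading to the data lake."""
    for row in source_rows:
        token = secrets.token_hex(16)
        # Depseudonymisation key: token -> original identifying values
        key_store[token] = {f: row.pop(f) for f in id_fields}
        row["id"] = token
        yield row  # pseudonymised row flows on to the lake / warehouse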
The data in the data warehouse can be made available for more general use within the organisation without any concerns
about personal data being made available. This ensures compliance with GDPR Article 6(4) (see the GDPR Origin Of
Pseudonymisation section). In this case, pseudonymisation is used as part of the archiving process for data containing
personal identifiable information after its main processing has been completed and the data is being retained for
historical and analytical purposes.
Pseudonymisation Implementation
As noted in the ID Field Hashing Pseudonymisation section, storing a SHA3-512 hash code requires 64 bytes. In the case
of some identifying data fields, this may be longer than the field itself. So pseudonymisation will increase storage
requirements, by replacing shorter fields with longer ones and by requiring the storage of separate depseudonymisation
keys.
For example, a table in an Oracle database with 10 million records, five IDAT fields each with an average length of 20
bytes, five ADAT fields each with an average length of 8 bytes and one index column of 8 bytes will require about 1.22 GB
of storage.
With pseudonymisation of individual IDAT fields, these will be replaced with 64 bytes each. The table size will increase to
about 2.48 GB.
There will also be a depseudonymisation key table that will hold both the original five IDAT fields each with an average
length of 20 bytes and the five pseudonymisation fields of 64 bytes each as well as one index column of 8 bytes. This will
occupy 2.95 GB of storage.
So, in this example, pseudonymisation increases storage requirements from 1.22 GB to 5.43 GB, an increase of 4.21 GB.
As mentioned in the section on replacing multiple ID fields with a linking identifier, replacing multiple source IDAT fields
with a single pseudonymisation hash reduces the granularity with which the original source data can be retrieved: the
entire set of source fields must be retrieved from the depseudonymisation key before the individual field required can be
extracted. This reduces the storage overhead.
A separate depseudonymisation key table is not strictly required. The original source data, with its personal identifiable information, can serve as the depseudonymisation key. The pseudonymised data then needs to store a link to the corresponding row in the original source data, and the hash code contained in the pseudonymised data can be compared with a hash code regenerated from the source data. However, if the hash generation process was augmented with salting and peppering, the correct salt must be retained and reused to regenerate the hash.
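A small sketch of that verification step, assuming the per-row salt was retained alongside the link to the source row (names and structure are illustrative):

```python
import hashlib
import hmac

PEPPER = b"etl-secret-pepper"   # hypothetical; must match the value used at hashing time

def matches_source(source_values, salt, stored_token):
    # Regenerate the hash from the source row using the retained salt
    recomputed = hashlib.sha3_512(
        "|".join(str(v) for v in source_values).encode("utf-8") + salt + PEPPER
    ).hexdigest()
    # Constant-time comparison avoids leaking near-matches through timing
    return hmac.compare_digest(recomputed, stored_token)
```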
Data Breaches and Attacks
The objectives of data privatisation technologies are:
• To prevent data breaches and attacks
• To minimise or eliminate the impact of a data breach or attack
Data privatisation technologies are just one of a number of layers of data protection an organisation should implement across its systems and data.
Data access and data sharing arrangements introduce an additional level of data privatisation complexity in that the person or organisation being given access to the data may be the attacker. Alternatively, the data protection arrangements implemented and operated by the recipient may not provide the same level of protection as those of the source organisation.
So, the source organisation should assume that data sharing and access arrangements are implicitly compromised and
act accordingly.
There are many security frameworks that can be used to define this wider organisational security posture, such as:
• Center for Internet Security (CIS) Critical Security Controls – https://www.cisecurity.org/controls/
• Control Objectives for Information Technologies (COBIT) – https://www.isaca.org/resources/cobit
• NIST: Cybersecurity Framework, 800-53, 800-171 – https://csrc.nist.gov/Projects/risk-management/sp800-53-controls/downloads
• US FedRAMP (Federal Risk and Authorization Management Program – https://tailored.fedramp.gov/) Security Controls Baseline – https://tailored.fedramp.gov/static/APPENDIX%20A%20-%20FedRAMP%20Tailored%20Security%20Controls%20Baseline.xlsx
• Cybersecurity Maturity Model Certification (CMMC) – https://www.acq.osd.mil/cmmc/documentation.html
• Cloud Security Alliance (CSA) Cloud Controls Matrix (CCM) – https://cloudsecurityalliance.org/research/cloud-controls-matrix/
The analysis of these security standards and frameworks is outside the scope of this paper.
Pseudonymisation and Data Breaches
Pseudonymisation protects against data breaches by making data unusable should it be exposed.
Figure 18 – Pseudonymisation and Data Breaches
The ways in which pseudonymised data can be exposed and the impact of these breaches include:
1. The data may be exposed, accidentally or deliberately, by the entity with which the data is shared. If the data is correctly pseudonymised and the pseudonymisation algorithm is protected, then the impact of such a breach would be low.
2. The sharing organisation may cause the pseudonymised data to be exposed. For example, the data sharing mechanism used to share or provide access to the data may be compromised. The impact of such a breach would be low.
3. The depseudonymisation key may be compromised. The risk of personal data re-identification will be high if this happens.
4. The pseudonymisation algorithm may be compromised. The risk of personal data re-identification will be high if this happens.
Differencing Attack
Differencing attacks work by running multiple partially overlapping queries against summarised data until the results can be combined to identify an individual. Differencing attacks apply especially to differential privacy data access platforms. For example, the following set of queries can be run against the data:
• How many people in the group are aged greater than N?
• How many people in the group aged greater than N have attribute A?
• How many people in the group aged greater than N have attribute B?
• How many people with ages in the range N-9 to N-5 are male?
• How many people with ages in the range N-4 to N are male?
After a number of queries, you may be able to identify that individuals, or small numbers of individuals, in a given age range and of a given sex have a defined attribute. Apparently anonymous summary results can be combined to reveal potentially sensitive insights and compromise confidentiality.
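The following toy sketch, using invented data, shows how two overlapping exact counts of this kind can isolate a single individual's sensitive attribute:

```python
# All records here are invented for illustration
records = [
    {"age": 34, "sex": "M", "condition": False},
    {"age": 58, "sex": "M", "condition": True},    # the only male aged 55-59
    {"age": 58, "sex": "F", "condition": False},
]

def count(predicate):
    return sum(1 for r in records if predicate(r))

males_55_59 = count(lambda r: r["sex"] == "M" and 55 <= r["age"] <= 59)
with_condition = count(lambda r: r["sex"] == "M" and 55 <= r["age"] <= 59
                       and r["condition"])
# Both counts equal 1: the single male aged 55-59 is now known to have the
# condition, even though only "anonymous" aggregate counts were released
print(males_55_59, with_condition)
```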
Differential privacy can be designed to reduce or eliminate the threat of differencing attacks by attaching a cost to each
query. A budget is assigned to the dataset. The amount spent by queries against the dataset is tracked. When the budget
is expended, no more queries can be run until the budget is increased.
A differential privacy platform should also be able to track the queries performed by the consumers given access, in order to detect potential patterns of abuse.
Figure 19 – Differential Privacy and Differencing Attacks
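A minimal sketch of this budgeting mechanism, combining per-query cost tracking with Laplace noise on count queries, is shown below. The epsilon values, the budget and the class structure are illustrative assumptions, not a specific product's design:

```python
import random

class BudgetedDataset:
    # Illustrative only: a real platform would persist budgets and use a
    # cryptographically sound noise source
    def __init__(self, records, total_budget):
        self.records = records
        self.remaining = total_budget            # total epsilon allowed

    def noisy_count(self, predicate, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("Privacy budget exhausted for this dataset")
        self.remaining -= epsilon                # charge the query its cost
        true_count = sum(1 for r in self.records if predicate(r))
        # Laplace mechanism: a count has sensitivity 1, so Laplace(1/epsilon)
        # noise gives epsilon-differential privacy for this query
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

ds = BudgetedDataset([{"age": 34}, {"age": 58}, {"age": 61}], total_budget=1.0)
print(ds.noisy_count(lambda r: r["age"] > 50, epsilon=0.1))
```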
Differencing Attack, Reconstruction Attack And Mosaic Effect
In addition to a differencing attack, there are various types of data attack that can be performed on the data as made available, without the need to obtain any additional data access:
• A reconstruction attack uses the information from a differencing attack to identify how the original dataset was processed to create the summary.
• A mosaic effect attack involves combining the data with other (public) data sources to identify individuals. For example, apparently anonymised medical data containing dates of death can be combined with public death notice records to identify individuals.
This results in a data attack topology that should be monitored to ensure data privatisation is maintained.
Figure 20 – Differencing Attack, Reconstruction Attack And Mosaic Effect
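As a toy illustration of the mosaic effect, the following sketch links "anonymised" records to a public dataset on shared quasi-identifiers. All data here is invented:

```python
# All data here is invented for illustration
anonymised_medical = [
    {"date_of_death": "2021-03-04", "postcode": "D04", "diagnosis": "X"},
]
public_death_notices = [
    {"name": "J. Smith", "date_of_death": "2021-03-04", "postcode": "D04"},
]
# Join the two datasets on the shared quasi-identifiers
linked = [
    {**medical, "name": notice["name"]}
    for medical in anonymised_medical
    for notice in public_death_notices
    if (medical["date_of_death"], medical["postcode"])
       == (notice["date_of_death"], notice["postcode"])
]
print(linked)   # the diagnosis is now tied to a named individual
```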
Differential Privacy
Differential privacy allows for the (public) sharing of information about a group or aggregate by describing the patterns within the group or aggregate while suppressing information about the individuals in it. Source data is aggregated and summarised and individual personal references are removed. The one-to-one correspondence between original and transformed data is removed.
A viewer of the information cannot (or should not be able to) tell if a specific individual's information was or was not used
in the group or aggregate. This involves inserting noise into the results returned from a query of the data by a differential
privacy middleware tool. The greater the noise introduced, the less usable the data will be but the re-identification risk
will be reduced.
It is a well-proven, widely used robust technique6. It aims to eliminate the possibility of re-identification of individuals
from the dataset being analysed. Individual-specific information is always hidden.
Differential privacy technologies are more complex than anonymisation and pseudonymisation as an approach to data privatisation. They require more technical skills and possibly the selection and implementation of a software platform.
The remainder of this paper covers the topic of differential privacy in more detail.
An effective data privatisation and differential privacy operational solution consists at its core of a computational layer that introduces deliberate randomisation into the summarised results returned from a data query. This means that running multiple queries across the dataset cannot be used to reconstruct the underlying individual records. It thus enables Privacy Preserving Data Mining (PPDM). The objective is to prevent access to or identification of specific, individual personal records or sensitive information while preserving the aggregated or structural properties of the data.
6 See The Algorithmic Foundations of Differential Privacy https://www.cis.upenn.edu/~aaroth/privacybook.html.
Figure 21 – Differential Privacy Operation
Differential privacy assigns a privacy budget to each dataset. The differential privacy engine introduces a fuzziness into
the results of queries. Each query has a privacy cost. The total privacy expenditure across all queries by all users is tracked.
When the budget has been spent, no further data queries can be performed until more privacy budget is allocated.
Effective and usable data privatisation and differential privacy means finding the right balance between data privacy
and data utility. At one extreme, the solution would be to completely delete or prevent any access to data. While this
preserves absolute data privacy, it also eliminates the utility and usefulness of the data.
Figure 22 – Data Privatisation and Differential Privacy Balancing Act
This results in a balancing act between three factors:
1. Level of Detail Contained in Results Presented
2. Amount and Complexity of Data Processing Allowed
3. Level of Data Privacy
Relaxing or constraining one factor affects the other two. In order to determine the right equilibrium across these factors for your organisation and your data, you need to explicitly formalise your approach to data privacy and data utility in a policy. This policy should be accessible to and understandable by those in charge of managing data. The policy should also be formally defined so that its applicability and its subsequent implementation, operation and use can be verified. Differential privacy technology can then be used to operationalise this policy, including monitoring its operation and use.
Technology is a key enabler of data privatisation and differential privacy. It embeds Privacy By Design in your data access solution rather than leaving data privacy concerns to be addressed as an afterthought.
Data Privatisation and Differential Privacy Solution Architecture Overview
This section describes the idealised architecture and design of an operational data privatisation and differential privacy
solution. This essentially illustrates a reference architecture that you can use to determine what solution components are
needed and what must be installed, implemented, and configured to create a usable and secure solution within your
organisation. It can be used as a structured framework to define business and technical requirements. It can also be used
to evaluate suitable products.
Figure 23 – Operational Data Privatisation and Differential Privacy Solution Architecture
The numbered components of this are:
1. Core Data Privatisation/Differential Privacy Operational Platform – this is the core differential privacy platform.
This can be installed on-premises or on a cloud platform such as AWS, Google Cloud and Azure. It takes and
summarises data from designated data sources and provides different levels of and types of computational access to
authorised users via a data API. It also provides a range of management and administration functions.
2. Data Sources – these represent data held in a variety of databases such as Oracle, SQL Server and other data storage
systems such as HDFS, Cassandra, PostgreSQL and Teradata as well as external data stores such as AWS S3 and
Azure. The differential privacy platform needs read-only access to these data sources.
3. Data Access Connector – these are connectors that enable read-only access to data held in the data sources.
4. Data Ingestion and Summarisation – this takes data from the data sources, processes it and outputs it in a format suitable for access. It includes features to manage data ingestion workflows, scheduling and error identification and handling.
5. Data Analysis Data Store – the core differential privacy platform creates pre-summarised versions of the raw data
from the data sources. The platform never provides access to individual source data records. The data is encrypted
while at rest in the data store.
6. Metadata Store – the platform creates and stores metadata about each data source. This is used to optimise data
privacy of the result sets generated in response to data queries.
7. Batch Task Manager – in addition to running online data queries, asynchronous batch tasks can be run for longer
data tasks.
8. Access and Usage Log – this logs data accesses.
9. User Access API – the platform provides an API for common data analytics tools such as Python and R to generate and retrieve privatised randomised sets of data summaries as well as providing data querying and analytics capabilities (a hypothetical usage sketch follows this list). Data results returned from queries are encrypted while in transit.
10. Data Visualisation Interface – this provides a data access and visualisation interface.
11. User Directory – the platform will use your existing user directories such as Active Directory or Azure Active Directory for user authentication and authorisation.
12. Authorised Internal Users – authorised internal users can access different datasets and perform different query types
depending on their assigned access rights.
13. Authorised External Users – authorised external users can access different datasets and perform different query
types depending on their assigned access rights.
14. Analytics and Reporting – this will allow you to analyse and report on user accesses to data managed by the platform.
15. Monitoring, Logging and Auditing – this will log both system events and user activities. This information can be used
both for platform management and planning as well as identifying potential patterns of data use and possible abuse.
16. Data Access Creation, Validation and Deployment – this will allow new data sources to be onboarded and allow
existing data sources to be managed and updated.
17. Management and Administration – this will provide facilities to manage the overall platform such as adding and
removing users and user groups and applying data privacy settings to different datasets.
18. Security and Access Control – this allows the management of different types of user access to different datasets.
19. Billing System Interface – you may want to charge for data access, either at a flat rate or by access or a mix of both. This represents an optional link to a financial management system to enable this.
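As a purely hypothetical sketch of how an analyst might call the User Access API (component 9): the endpoint, payload fields and response shape below are assumptions for illustration, not any real product's interface.

```python
import requests

BASE_URL = "https://dp-platform.example.com/api/v1"   # hypothetical endpoint

def run_query(token, dataset, query, epsilon):
    # Submit a privatised query; the platform adds noise to the result and
    # charges the query's epsilon cost against the dataset's privacy budget
    response = requests.post(
        f"{BASE_URL}/query",
        headers={"Authorization": f"Bearer {token}"},
        json={"dataset": dataset, "query": query, "epsilon": epsilon},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()   # e.g. {"result": ..., "budget_remaining": ...}
```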
Differential Privacy Platform Solution Service Management Processes
Just like any other information technology solution, service management processes should be implemented for an
operational differential privacy solution. Because a differential privacy solution exposes personal data, albeit in a
summarised, randomised and anonymised manner, these service management processes are important. They should be
part of any implementation project. This will maximise confidence in differential privacy technology in your organisation
and reduce project risk. In turn, this will maximise the success of the platform and ensure that return on investment is
optimised.
The following table lists what we regard as the most important service management processes in the context of a differential privacy solution.
Your organisation will already have invested in information technology service management processes. These should be
extended to the differential privacy platform.
Access Management – This process is concerned with operationalising security management policies relating to enabling authorised users to access the differential privacy platform and managing their access lifecycle.
Availability Management – This process relates to ensuring the differential privacy platform meets its agreed availability targets and obligations by planning, defining, measuring, analysing and improving availability.
Capacity Management – This is concerned with planning, defining, measuring, analysing and delivering the required facilities to ensure that the differential privacy platform has sufficient capacity to meet its service level commitments in the short-, medium- and long-term.
Compliance Management – This process is focused on ensuring that the design, operation and use of the differential privacy platform complies with legal and regulatory requirements and obligations.
Knowledge Management – This is about ensuring that knowledge about the implementation, operation and use of the differential privacy platform is collated, stored and shared, maximising reuse and eliminating the need for knowledge rediscovery.
Operations Management – This process is concerned with implementing and operating the housekeeping activities and tasks relating to the differential privacy solution, including monitoring and controlling the platform and backup and recovery.
Risk Management – This relates to the identification, evaluation and management of risks including threats to and vulnerabilities of the differential privacy solution.
Security Management – This is concerned with ensuring the confidentiality of the data assets contained in the differential privacy solution. Your organisation will already have invested in security management. This needs to be extended to the differential privacy solution.
Service Continuity Management – This is focused on ensuring that the continuity of operation of and access to the differential privacy solution is maintained in the event of problems.
Service Level Management – This relates to the definition of and the subsequent monitoring of service level targets and service level agreements relating to the access to and use of the differential privacy solution.
Differential Privacy Platform Deployment Options
This section outlines two solution deployment options: on-premises and in the cloud.
On-Premises Deployment
The following diagram illustrates the key components of an on-premises implementation of a differential privacy solution.
Figure 24 – Sample High-Level On-Premises Deployment
If users outside the organisation are to be given access to the data platform, then either an existing external access facility will be used to provide secure access or a new facility will have to be implemented.
Cloud Deployment
The following diagram illustrates the key components of a cloud implementation of a differential privacy solution.
Figure 25 – Sample High-Level Cloud Deployment
For a cloud deployment, the key differences relate to how on-premises data is processed and transferred to the cloud
platform and how data access users outside the organisation authenticate using an approach such as Azure Active
Directory.
Differential Privacy and Data Attacks
Data Privatisation and Differential Privacy Solution Planning
There are many different paths along the journey to the implementation of an operational data privatisation and
differential privacy solution. The section Data Privatisation and Differential Privacy Next Steps on page 43 lists some of
the possible stages along this journey. This section lists a possible set of activities and tasks that you can use to create a workplan for implementing a workable solution.
The goal is to create an operational, supportable, maintainable and usable solution that provides access to your data without compromising data privacy and security.
The implementation of a data privatisation and differential privacy solution is not very different from any other
information technology solution that your organisation wants to implement.
The following high-level set of steps can be iterated several times as you move from an initial pilot implementation to a
complete production solution over time.
• Create a prioritised inventory of potential data sources to which you would like to provide secure privatised
computational access
• Profile the data: understand the structure and contents of data, evaluate data quality and data conformance with
standards, identify terms and metadata used to describe data and identify data relationships and dependencies, data
sensitivity, Privacy Exposure Limit (PEL) and privacy requirements of each dataset
• Define the data extract processes
• Identify the target set of users for access to one or more of the datasets and define the type of access
• Define and agree user access processes and security requirements
• Define the subsets of data to be made available for querying
• Perform capacity planning and analysis in terms of raw data volumes, expected number and type of data access
transactions, data refresh frequency, caching of results for performance, creation of materialised views and other
factors that give rise to resource requirements
• Define and agree platform audit logging and reporting, user activity monitoring, event, exception and alert handling
processes
• Define data access charging and billing
• Define the platform operational administration, maintenance and support processes
• Create a cost model for the solution including license costs, infrastructure, support and maintenance and any
proposed revenue streams
• Decide on the deployment approach
• Define the organisational structures and service management processes needed to support the new solution
• Decide on the data integration approach, especially if the solution is to be deployed on a cloud platform
• Define the different types of training needed: administrator, support, data administrator, data query user
• Create, review, validate and approve a differential privacy solution architecture design that incorporates the
information gathered in the previous steps
• Conduct a security review of the differential privacy solution
• Acquire trial versions of platform licenses
• Acquire deployment infrastructure, either on-premises or cloud
• Configure the differential privacy platform and its data sources
• Validate the platform
• Allow user access to the platform in a phased and controlled manner
Data Privatisation and Differential Privacy Solution Operation and Use
The following table lists some key differential privacy platform use cases and what they entail.
These can be embedded into operational service management processes that are listed in the section Differential
Privacy Platform Solution Service Management Processes on page 36.
User Enrolment – The user must be defined in the organisation’s user directory. The process for enrolling users outside the organisation depends on the platform deployment model – on-premises or cloud. If the user is outside the organisation, then you may choose to use a cloud-based directory such as Azure Active Directory as a SAML identity provider. The user can be assigned to one or more groups, if needed. The user (or the groups to which the user belongs) will have different access rights to different datasets. The access rights include details on the subsets of data sources that can be queried, and the number and type of data queries the user can run before being prevented from running additional requests.

Platform Usage Reporting and Analysis – The usage of the platform can be analysed in several ways:
1. The overall platform performance, rate of usage, number of users, number and type of data query transactions, both online and batch, can be analysed and reported on. This will ensure that the platform is able to handle the current and expected future volume of data and its use.
2. The amount of data privacy exposed by user queries can be analysed to ensure that the privacy of data being made available is maintained.
3. Any charges for access to your data can be determined and bills generated.

Addition of Data Source – The data source should be profiled to understand its structure and content. A link must be defined between the data source and the differential privacy platform summarised data subset. The data refresh frequency must be defined. The Privacy Exposure Limit (PEL) of the dataset must be defined. This is the maximum amount of privacy exposed by all data queries run on the dataset. As queries are run, this is incremented. Once the limit has been reached, no further access is possible.

Platform Security Auditing – Platform auditing can be performed at three levels:
1. The overall differential privacy platform can be audited to ensure that it guarantees that no personal information can be disclosed.
2. The privacy settings of individual datasets can be audited to ensure that they are appropriate for the sensitivity of their information.
3. The use of the platform can be audited through the analysis of audit records collected to determine unusual patterns of queries by users.
Data Privatisation and Differential Privacy Next Steps
The previous section Data Privatisation and Differential Privacy Solution Planning on page 40 contains a generic set of steps involved in the planning for differential privacy technology.
The journey to creating an industrialised and productionised differential privacy solution can involve a number of points
at which a decision to proceed to the next stage in the journey can be made.
Figure 26 – Data Privatisation and Differential Privacy Solution Journey
In order to allow your organisation to move along this journey, we have identified a number of practical engagement exercises that are designed to answer specific questions you might have, to progress your differential privacy journey and to provide you with specific deliverables. These engagements are:
1. Early Business Engagement and Differential Privacy Opportunity Validation
2. Differential Privacy Design Process
3. Differential Privacy Readiness Assessment
4. Differential Privacy Architecture Sprint
Implementing differential privacy technology is a means to an end rather than an end in itself. It is a way of resolving or addressing a data access problem or opportunity. These engagements are designed with this in mind.
While these engagement types are described individually here, they can be combined to create a custom exercise to suit
your specific needs.
The following diagram illustrates at a high-level the scope of each of these engagements in terms of the duration and
where they fit into your journey to the successful implementation of differential privacy in your organisation.
Figure 27 – Approaches to Data Privatisation and Differential Privacy Solution Scoping and Definition
The following table summarises the characteristics of each of these engagements.
Early Business Engagement and Differential Privacy Opportunity Validation
• What question you want answered: I want a consulting exercise to define new business structures and associated solutions to address the potential data access provision opportunity
• Level of detail included in deliverable: Medium to High
• Likely engagement duration: Medium
• What you get: A validated differential privacy opportunity across the areas of strategic fit, options evaluation and identification, procurement and implementation, expected whole-life revenue and costs and a realistic and staged plan for achievement

Differential Privacy Detailed Design
• What question you want answered: I want a full detailed design created from an initial, not necessarily well-defined, idea that I can pass to solution delivery
• Level of detail included in deliverable: High
• Likely engagement duration: Medium
• What you get: A detailed end-to-end design for a differential privacy solution encompassing all solution components

Differential Privacy Readiness Assessment
• What question you want answered: I want generalised solution options identified to address the potential data access provision opportunity
• Level of detail included in deliverable: Low to Medium
• Likely engagement duration: Medium
• What you get: An understanding of the scope, requirements, objectives, approach and options for a differential privacy platform and a high-level understanding of the likely resources, timescale and cost required before starting the solution implementation

Differential Privacy Architecture Sprint
• What question you want answered: I have a good idea of the potential data access solution I want and I am looking for a quick view of the solution options and their indicative costs, resources and timescales to implement
• Level of detail included in deliverable: Low to Medium
• Likely engagement duration: Short
• What you get: A high-level design for an end-to-end differential privacy solution, focusing on technology aspects, that identifies if the solution is feasible, worthwhile and justifiable
The following sections contain more detail on each of these engagement types.
Early Business Engagement and Differential Privacy Opportunity Validation
The engagement is concerned with analysing and defining the structure and operations of a business function within your
organisation that will operate a differential privacy platform to provide controlled access to your data. It describes a
target business model that includes identifying the differential privacy platform and its constituent components.
The objective is to create a realistic, achievable, implementable and operable target differential privacy platform business
justification to achieve the desired business targets.
This is not an exact engagement with an easily defined and understood extent and duration. It has an essentially investigative and exploratory aspect, which means it needs a degree of latitude. This is not an excuse for excessive analysis without reaching a conclusion. The goal is to produce results and answers within a reasonable time to allow decisions to be made based on evidence.
Figure 28 – Early Business Engagement and Differential Privacy Opportunity Validation Process
The deliverables from this exercise will contain information in five key areas: strategic fit, options evaluation and
identification, procurement and implementation, expected whole-life revenue and costs and a realistic and staged plan
for achievement.
Strategic Fit
• Business need and its contribution to the organisation’s data strategy
• Key benefits to be realised
• Critical success factors and how they will be measured

Options Evaluation and Identification
• Cost/benefit analysis of realistic options for meeting the business need
• Statement of possible soft benefits that cannot be quantified in financial terms
• Preferred option identified with any trade-offs

Procurement and Implementation
• Proposed sourcing option with reasons
• Key features of proposed commercial arrangements
• Procurement approach/strategy with supporting details

Whole-Life Revenue and Costs
• Statement of available funding and details of projected whole-life revenue from and cost of the project (acquisition and operation), including all relevant costs
• Expected financial benefits

Realistic and Staged Plan for Achievement
• Plan for achieving the desired outcome with key milestones and dependencies
• Contingency plans
• Risks identified and mitigation plan
• External supplier plans
• Resources, skills and experience required
Differential Privacy Detailed Design
This is a very comprehensive engagement that produces a detailed end-to-end design for a differential privacy solution
for your organisation. This approach to solution design is based on using six views as a structure to gather information
and to create the design. These six views are divided into two groups:
• Core Solution Architecture Views – concerned with the kernel of the solution:
• Business
• Functional
• Data
• Extended Solution Architecture Views – concerned with solution implementation and operation:
• Technical
• Implementation
• Management and Operation
Figure 29 – Differential Privacy Detailed Design Views
The core dimensions/views define what the differential privacy solution must do, how it must operate and the results it
will generate. The extended dimensions/views define how the solution must or should be implemented, managed and
operated. They describe factors that affect, drive and support decisions made during the solution design process. Many
of these factors will have been defined as requirements of the solution and so their delivery will be included in the
solution design.
Together these core and extended views describe the end-to-end solution design comprehensively.
Differential Privacy Readiness Assessment
The Differential Privacy Readiness Assessment is intended to allow the exploration of an as yet undefined solution that
addresses a data access opportunity using differential privacy technology.
The work is done from business, information technology and data perspectives. The objective is to understand the scope, requirements, objectives, approach and options for a differential privacy platform and to gain a high-level understanding of the likely resources, timescale and cost required before starting the solution implementation.
It looks to identify the changes needed within the organisation in order to successfully adopt differential privacy
technology and use it to make your data more widely available.
Figure 30 – Areas Covered in Differential Privacy Readiness Assessment
These domains of change can be categorised as follows:
• Business-Oriented Change Areas
− Facilities – existing and new facilities of the organisation, their types and functions
− Business Processes – current and future business process definitions, requirements, characteristics,
performance
− Organisation and Structure – organisation resources and arrangement, business unit, function and team
structures and composition, relationships, reporting and management, roles and skills
• Technology-Oriented Change Areas
− Technology and Infrastructure – current and future technical infrastructure including security, constraints,
standards, technology trends, characteristics, performance requirements
− Applications and Systems – current and future applications and systems including the core differential
privacy platform and any extended components, their characteristics, constraints, assumptions,
requirements, design principles, interface standards, connectivity to business processes
− Information and Data – the data to which privatised access is to be provided, data and information
architecture, data integration, data access and management, data security and privacy
The analysis also includes an extended change domain that covers the organisation’s operating environment and business landscape and the organisation’s data access and data availability strategy.
This categorisation provides a structure for this engagement. It aims to define the changes needed across these domains to use differential privacy technology to enable data access.
Differential Privacy Architecture Sprint
This engagement is designed to produce a high-level design for an end-to-end differential privacy technology solution.
The focus is on the breadth of the technology solution rather than on depth and detail. This engagement recognises that
the journey from initial business concept to operational solution is rarely simple. Not all business concepts progress to
solution delivery projects and not all solution delivery projects advance to a completed operational solution. There is
always an inevitable and necessary attrition during the process. There are many reasons why this should and could
happen. Business and organisation needs and the operational environment both change. Budgets and resources may be allocated and prioritised elsewhere.
In this light, there is a need for a differential privacy solution design sprint that generates results quickly. There is a need to identify feasible, worthwhile and justifiable concepts that merit proceeding to implementation and to eliminate those that are not cost-effective.
The areas analysed in the differential privacy solution design sprint are:
• Systems/Applications – these are existing systems and applications that will participate in the operation of the
differential privacy solution and which may need to be changed and new systems and applications that will have to
be delivered as part of the solution
• System Interfaces – these are links between systems for the transfer and exchange of data
• Actors – these are individuals, groups or business functions who will be involved in the operation and use of the
differential privacy solution
• Actor-System Interactions – interactions between Actors and Systems/Applications
• Actor-Actor Interactions – interactions between Actors
• Functions – these are activities that are performed by actors using facilities and functionality provided by systems
• Processes – business processes required to operate the differential privacy solution and the business processes
enabled by the solution, including new business processes and changes to existing business processes
• Journey – standard journey through processes/functions and exceptions/deviations from this “happy path”
• Logical Data View – data elements required
• Data Exchanges – movement of data between Systems/Applications
This set of information combines to provide a comprehensive view of the potential differential privacy solution at an early
stage.
For more information, please contact:
Alan McSweeney
alan@alanmcsweeney.com
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 

Kürzlich hochgeladen (20)

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 

Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy

  • 1. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy This paper describes how technologies such as data pseudonymisation and differential privacy technology enables access to sensitive data and unlocks data opportunities and value while ensuring compliance with data privacy legislation and regulations Alan McSweeney January 2022 alan@alanmcsweeney.com
  • 2. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 2 Contents Introduction.......................................................................................................................................................................4 Personal Information .........................................................................................................................................................6 Third-Party Data Sharing And Data Access Framework .....................................................................................................7 Data Privacy Technologies...............................................................................................................................................10 Context Of Data Privatisation – Anonymisation, Pseudonymisation And Differential Privacy .......................................... 11 Data Sharing Use Cases...............................................................................................................................................14 Pseudonymisation ........................................................................................................................................................... 15 Why Pseudonymise Rather Than Anonymise? .............................................................................................................16 GDPR Origin Of Pseudonymisation .............................................................................................................................16 Growing Importance Of Pseudonymisation .................................................................................................................19 Approaches To Pseudonymisation...............................................................................................................................19 Pseudonymisation By Replacing ID Fields With Linking Identifier (Token) ...............................................................20 Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields..............................................21 ID Field Hashing Pseudonymisation.........................................................................................................................21 Hashing And Identifier Codes ..................................................................................................................................22 Hashing And Reversibility........................................................................................................................................23 ID Field Hashing Pseudonymisation With Data Salting And Peppering ....................................................................24 Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering ............................................25 Content Hashing Pseudonymisation........................................................................................................................26 Pseudonymisation And Data Lakes/Data Warehouses................................................................................................. 
Pseudonymisation Implementation.............................................................................................................................28
Data Breaches and Attacks ..............................................................................................................................................28
Pseudonymisation and Data Breaches.........................................................................................................................29
Differencing Attack .....................................................................................................................................................30
Differencing Attack, Reconstruction Attack And Mosaic Effect.................................................................................... 31
Differential Privacy ..........................................................................................................................................................32
Data Privatisation and Differential Privacy Solution Architecture Overview.....................................................................34
Differential Privacy Platform Solution Service Management Processes .......................................................................36
Differential Privacy Platform Deployment Options...................................................................................................... 37
On-Premises Deployment .......................................................................................................................................38
Cloud Deployment ..................................................................................................................................................39
Differential Privacy and Data Attacks ..........................................................................................................................40
Data Privatisation and Differential Privacy Solution Planning ..........................................................................................40
Data Privatisation and Differential Privacy Solution Operation and Use...........................................................................41
Data Privatisation and Differential Privacy Next Steps.....................................................................................................43
Early Business Engagement and Differential Privacy Opportunity Validation...............................................................45
Differential Privacy Detailed Design ............................................................................................................................46
Differential Privacy Readiness Assessment.................................................................................................................. 47
Differential Privacy Architecture Sprint .......................................................................................................................49
List of Figures

Figure 1 – Data Privacy Subject Areas ................................................................................................................................5
Figure 2 – Data Privacy and Data Utility Balancing Act.......................................................................................................5
Figure 3 – Data Sharing and Data Access Framework.........................................................................................................7
Figure 4 – Data Sharing and Access Topologies .................................................................................................................9
Figure 5 – Data Privatisation Spectrum ............................................................................................................................10
Figure 6 – Data Privacy Technologies............................................................................................................................... 11
Figure 7 – Context of Data Privatisation ...........................................................................................................................12
Figure 8 – Overview of Pseudonymisation ....................................................................................................................... 15
Figure 9 – Pseudonymisation for Data Sharing with External Business Partners...............................................................16
Figure 10 – Overview of Approaches to Pseudonymisation ..............................................................................................19
Figure 11 – Pseudonymisation By Replacing ID Fields With Linking Identifier...................................................................20
Figure 12 – Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields ....................................21
Figure 13 – ID Field Hashing Pseudonymisation ...............................................................................................................22
Figure 14 – ID Field Hashing Pseudonymisation With Data Salting And Peppering...........................................................24
Figure 15 – Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering ...................................25
Figure 16 – Content Hashing Pseudonymisation ..............................................................................................................26
Figure 17 – Pseudonymisation and Data Lakes/Data Warehouses .................................................................................... 27
Figure 18 – Pseudonymisation and Data Breaches ...........................................................................................................29
Figure 19 – Differential Privacy and Differencing Attacks................................................................................................. 31
Figure 20 – Differencing Attack, Reconstruction Attack And Mosaic Effect......................................................................32
Figure 21 – Differential Privacy Operation........................................................................................................................33
Figure 22 – Data Privatisation and Differential Privacy Balancing Act ..............................................................................34
Figure 23 – Operational Data Privatisation and Differential Privacy Solution Architecture ............................................... 35
Figure 24 – Sample High-Level On-Premises Deployment ...............................................................................................38
Figure 25 – Sample High-Level Cloud Deployment ..........................................................................................................39
Figure 26 – Data Privatisation and Differential Privacy Solution Journey..........................................................................43
Figure 27 – Approaches to Data Privatisation and Differential Privacy Solution Scoping and Definition ...........................44
Figure 28 – Early Business Engagement and Differential Privacy Opportunity Validation Process ....................................46
Figure 29 – Differential Privacy Detailed Design Views .................................................................................................... 47
Figure 30 – Areas Covered in Differential Privacy Readiness Assessment .........................................................................48
Introduction

This paper examines the related concepts of data privatisation, data anonymisation, data pseudonymisation and differential privacy.

Data has value. To realise this value, it may need to be made more widely available, both within and outside your organisation, for various types of access such as sharing data with outsourcing and service partners or making data available to research partners. This data sharing must be performed in the context of maintaining personal data privacy. This paper examines the technology options for providing different types of access to data while preserving privacy and ensuring compliance with the many (and growing) data privacy regulatory and legislative requirements.

You need to take a risk management approach to data sharing and third-party data access. Appropriate technology, appropriately implemented and operated, is a means of managing and reducing the risk of re-identification by making the time, skills, resources and money necessary to achieve it unrealistic. A demonstrable technology-based approach to data privacy, supported by a data sharing business framework, reduces an organisation's liability in the event of data breaches. For example, under the EU GDPR (General Data Protection Regulation)1, where a data breach occurs, the controller is exempted from its notification obligations where it can show that the breach is 'unlikely to result in a risk to the rights and freedoms of natural persons'2, such as when pseudonymised data leaks and the re-identification risk is remote.

Organisations need a well-defined and implemented process that enables them to make their data available as widely as possible without exposing them to the risks associated with non-compliance with the wide range of differing data privacy regulations.

Managing data privacy in the context of data access and sharing arrangements encompasses the areas of:

• Data Governance
• Privacy Management
• Security Management
• Risk Management

1 See http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679
2 See GDPR recitals 80 and 85 and articles 27 and 33.
Figure 1 – Data Privacy Subject Areas

Managing data privacy in the context of data access and sharing arrangements is a balancing act between data privacy and data utility. Perfect data privacy can be achieved by not sharing or making accessible any data, irrespective of whether it contains personal identifiable information; the result is that the data is unused. Perfect data utility can be achieved by sharing and making accessible all data; the result is that there is no data privacy. One aspect of data privacy management is taking a risk-based approach to this balancing act.

Figure 2 – Data Privacy and Data Utility Balancing Act

This paper describes some practical, realistic and achievable approaches to implementing data privatisation using pseudonymisation and differential privacy as a means of addressing your data sharing and access requirements and opportunities.
This paper covers the following topics:

• Personal Information – this discusses what is meant by personal information.
• Third-Party Data Sharing And Data Access Framework – data sharing is a business issue enabled through technologies. But it is primarily a business concern and any arrangements should be grounded in a business framework.
• Data Privacy Technologies and Context Of Data Privatisation – this discusses the data privatisation approaches of anonymisation, pseudonymisation and differential privacy. It covers the GDPR origin of pseudonymisation, the growing importance of pseudonymisation, the various approaches to pseudonymisation and hashing, and pseudonymisation with data lakes/data warehouses.
• Data Breaches and Attacks – this provides background information on data breaches and attacks and how data privatisation approaches provide protection against them.
• Why Data Privatisation and Differential Privacy – this provides a context for the need for a robust, secure operational data privatisation and differential privacy technology framework.
• Data Privatisation and Differential Privacy Solution Architecture Overview – how a differential privacy solution sits within your existing information technology solution and data landscape, what its components are and what the solution deployment options are.
• Data Privatisation and Differential Privacy Solution Planning – what an exercise to plan for the implementation and operation of a successful data privatisation and differential privacy solution consists of.
• Data Privatisation and Differential Privacy Solution Operation and Use – how the data privatisation and differential privacy solution is operated and used.
• Differential Privacy Next Steps – this describes a set of possible next steps and types of engagement to allow you to move along the data privatisation and differential privacy journey successfully.

Personal Information

Personal information is any information relating to an identified or identifiable natural person. This can be direct – information that directly identifies a single individual – or indirect, through quasi-identifiers – information that can be used to identify an individual by being linked with other information. Quasi-identifiers include information such as date of birth, date of death and post code. These do not specifically link to an individual, but such links can be determined.

Personal information can be structured or unstructured, such as free-form text, or it can take other forms such as images (photographs, medical images) or other data types such as genomic data. Personal information can be stored in multiple different ways, from database tables and columns to data formats such as documents and spreadsheets to image files. Personal information may also exist in the form of metadata attached to data files. The technologies underpinning data privatisation will need to handle all these data types and formats.

When considering data privatisation in the context of data access and sharing, the full set of personal information and the range of data formats should be considered. The approach to handling quasi-identifiers may differ from that taken for direct identifiers: rather than completely removing them, they could be made more general, such as month and year only for date of birth, or a date range could be specified, as sketched below.
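As an illustration, the following minimal Python sketch generalises two common quasi-identifiers. The record layout, field names and postcode value are invented for the example and do not come from any particular system.

from datetime import date

def generalise_date_of_birth(dob):
    # Replace an exact date of birth with month and year only.
    return dob.strftime("%Y-%m")

def generalise_postcode(postcode):
    # Keep only the leading routing part of the postcode, coarsening location.
    return postcode.split(" ")[0]

record = {"name": "Jane Doe", "date_of_birth": date(1980, 5, 17), "postcode": "D02 XY45"}

generalised = dict(record)
del generalised["name"]  # direct identifier: removed entirely
generalised["date_of_birth"] = generalise_date_of_birth(record["date_of_birth"])  # '1980-05'
generalised["postcode"] = generalise_postcode(record["postcode"])  # 'D02'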
Third-Party Data Sharing And Data Access Framework

Managing data privacy in the context of data access and data sharing is not just a technology concern. The selection, implementation and operation of the technologies needed to ensure data privacy exist within a wider data sharing and access framework. Organisations that intend to share and provide access to data should define such a framework. This provides an explicit approach rather than leaving such arrangements implicit and poorly defined. It reduces the time and effort required to implement data access and sharing, and it ensures a consistent and coherent approach.

The following diagram describes a possible structure for such a framework. Data access and sharing covers both internal access – business units other than the originating business unit accessing data – and external access – third parties being given access to data for business and research purposes.

Figure 3 – Data Sharing and Data Access Framework

This framework has the following dimensions:

1. Business and Strategy Dimension – this relates to the overall organisation posture on internal and external data access and sharing and needs to cover topics such as:

• Overall Objectives, Purposes and Goals – this sets the context and overall direction of, and the principles that will underpin, data sharing and data access arrangements. The objectives, purposes and goals of these arrangements will be defined.
• Data Sharing Strategy – this will define the organisation's strategy for internal and external data sharing and access: why it is being done, who will be allowed access to data, the types of data to which access will be granted, the types of access allowed and the technology approaches that will be used.
• Risk Management, Governance and Decision Making – this will cover how data sharing and access arrangements will be governed and managed, how decisions will be made on these arrangements and how data sharing and access risks will be managed.
• Charges and Payments – this will define the charges and payments structure, if applicable, that will apply to data access and sharing arrangements.
• Monitoring and Reporting – this will document how the operation and use of data access and sharing arrangements will be monitored, audited and reported on.

2. Legal Dimension – this encompasses the legal aspects of data sharing and needs to cover topics such as:

• Data Privacy Legislation and Regulation Compliance – this will cover the activities of researching and monitoring the data privacy legislative and regulatory landscape and any changes and developments that may impact data access and sharing.
• Contract Development and Compliance – this will encompass the development, negotiation and implementation of contractual arrangements governing specific data access and sharing arrangements.

3. Technology Dimension – this covers technology and security standards and needs to cover topics such as:

• Data Sharing and Data Access Technology Selection – this covers the arrangements and responsibilities for selecting the tools and technologies that will be used to implement data access and sharing.
• Technology Standards Monitoring and Compliance – this will define the responsibilities for and scope of monitoring technology standards and developments, the organisation's adoption of and compliance with those standards and managing change as the standards change.
• Security Standards Monitoring and Compliance – this will describe how data access and sharing security standards should be monitored, how security is implemented for data sharing and access arrangements and how change is managed as the standards change.

4. Development and Implementation Dimension – this relates to the implementation of data sharing technology tools and platforms and of specific data access and sharing arrangements and needs to cover topics such as:

• Technology Platform and Toolset Selection and Implementation – this includes the selection and implementation of specific data access and sharing technologies covering security and access control, the range of data types and the data access facilities being offered.
• Functionality Model Development and Implementation – this relates to defining and implementing the data access and sharing functionality and features being offered and the tools and technologies that will support them.
• Data Sharing and Access Implementations – this encompasses the specification and implementation of specific data access and sharing arrangements.
• Data Sharing and Access Maintenance and Support – this covers the maintenance and support arrangements both for the overall data access and sharing tools, platforms and technologies and for the specific arrangements.
5. Service Management Dimension – this defines the operational processes that should be defined and implemented in order to operate data sharing and needs to cover topics such as:

• Service Management Processes – this defines the operational and service management processes that need to be implemented and operated.
• Operational and Service Level Agreement Management – this covers defining and then managing and monitoring compliance with operational and service level agreements for data access and sharing arrangements.
• Maintain Inventory of Data Sharing Arrangements – this covers the maintenance of a list of current and previous data sharing and access arrangements.
• Service Monitoring and Reporting – this defines how the data sharing arrangements will be monitored and reported on.
• Issue Handling and Escalation – this covers how any issues relating to the operation and use of data sharing will be recorded, handled and escalated.

There are different data sharing and access arrangements.

Figure 4 – Data Sharing and Access Topologies

Data can be made available more widely within the organisation for purposes for which it was not originally collected. Data can be made publicly available; once this has been done, it will not be possible to control who uses it, the uses to which it is put or to recall it.
Data can be shared subject to some form of legal or contractual arrangement. Data can be shared through some form of controlled and secure facility. In the last two arrangements, some form of trust exists between the sharing entity and the data recipient. This sharing may be supported by penalties (after disclosure), by technology (disclosure prevention) or both. Data can be pushed to the target, or the data can be made available to the target through a pull or download facility. The data sharing and access framework should cover all these possibilities.

Within the context of data access and sharing, data privatisation can be viewed as a spectrum from completely identifiable data to data that is not linked to individuals.

Figure 5 – Data Privatisation Spectrum

The data privacy risk is reduced as you move further to the right. Data utility may also be reduced as you move to the right. The data sharing and access framework should combine both the data sharing and access topology and the data privatisation spectrum to get a more complete view of data access arrangements.

Data Privacy Technologies

Data privatisation is the removal of personal identifiable information (PII) from data. At a very high level, data privatisation can be achieved in one or both of two ways (a brief sketch of each follows the list):

1. Data Summarisation – sets of individual data records are compressed into summary statistics with all personal information removed
2. Data Tokenisation – the personal data within a dataset that allows an individual to be identified is replaced by a token (possibly generated from the personal data, such as by hashing), either permanently (anonymisation) or reversibly (pseudonymisation)
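The following short Python sketch contrasts the two. The dataset and field names are invented for illustration; a real implementation would operate on database tables or files rather than in-memory records.

import secrets

records = [
    {"name": "Anne Murphy", "region": "South", "amount": 1200},
    {"name": "Brian Walsh", "region": "South", "amount": 800},
    {"name": "Ciara Byrne", "region": "West", "amount": 950},
]

# Data summarisation: collapse individual records into summary statistics.
# No individual-level data survives in the output.
groups = {}
for r in records:
    groups.setdefault(r["region"], []).append(r["amount"])
summary = {region: sum(v) / len(v) for region, v in groups.items()}
# {'South': 1000.0, 'West': 950.0}

# Data tokenisation: replace the identifying field with a token. Keeping the
# token-to-name mapping makes this reversible (pseudonymisation); destroying
# the mapping makes it permanent (anonymisation).
key_store = {}  # must be held securely, separate from the tokenised data
tokenised = []
for r in records:
    token = secrets.token_hex(8)
    key_store[token] = r["name"]
    tokenised.append({"token": token, "region": r["region"], "amount": r["amount"]})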
Figure 6 – Data Privacy Technologies

There are different routes to making data accessible and shareable within and outside the organisation without compromising compliance with data protection legislation and regulations, while removing the risk associated with allowing access to personal data:

• Differential Privacy – source data is summarised and individual personal references are removed. The one-to-one correspondence between original and transformed data has been removed.
• Anonymisation – identifying data is destroyed and cannot be recovered, so the individual cannot be identified. There is still a one-to-one correspondence between original and transformed data.
• Pseudonymisation – identifying data is encrypted and the recovery data/token is stored securely elsewhere. There is still a one-to-one correspondence between original and transformed data.

These technologies and approaches are not mutually exclusive – each is appropriate to different data sharing and data access use cases.
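Differential privacy is described in detail later in the paper. As a flavour of the summarisation route, the sketch below answers a counting query with calibrated random noise added, using the standard Laplace mechanism. The privacy budget value is arbitrary, and Python's random module is used only for illustration – it is not suitable for a production privacy mechanism.

import math
import random

def laplace_noise(scale):
    # Inverse-transform sampling from a Laplace(0, scale) distribution.
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(records, predicate, epsilon):
    # A counting query has sensitivity 1 (one person changes the count by at
    # most 1), so Laplace noise with scale 1/epsilon gives epsilon-DP.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

people = [{"age": a} for a in (34, 61, 45, 29, 52)]
print(noisy_count(people, lambda r: r["age"] > 40, epsilon=0.5))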
Context Of Data Privatisation – Anonymisation, Pseudonymisation And Differential Privacy

The wider context of data privatisation and the specific approaches for enabling it – anonymisation, pseudonymisation and differential privacy – can be represented by four interrelated areas:

• Value in Data Volumes and Data Assets – you have expended substantial resources in gathering, processing and generating data. This data has value that you want to realise by making it more widely available. The need to comply with the increasing body of data protection and privacy laws inhibits your ability to achieve this.
• Data Privacy Laws and Regulations – you need to ensure that making your data available to a wider range of individuals and organisations does not breach the ever-increasing set of data protection and privacy legislation and regulations. All too frequently the cost of and concerns around ensuring this compliance prevent this wider data access.
• Technologies – the various data privatisation and privacy technologies are mature, well-proven, industrialised and independently certified. They can be used to provide controlled, secure access to your data while guaranteeing compliance with data protection and privacy legislation. Using these technologies will embed such compliance by design into your data sharing and access facilities. This will allow you to realise value from your data successfully.
• Data Processes and Business Data Trends – the volumes of data available to organisations are increasing, as is the range of analysis tools and technologies. Data storage is moving to cloud platforms that can handle large data volumes and provide analysis tools more easily than the costly and complex on-premises solutions that are available only to larger organisations. Organisations are outsourcing more business processes to third parties, and these outsourcing arrangements require the sharing of data.

Figure 7 – Context of Data Privatisation

To achieve the value inherent in your data you need to be able to make it appropriately available to others. You need a process that enables you to make your data available as widely as possible without exposing you to the risks associated with non-compliance with the wide range of differing data privacy regulations. You need one data access framework and an associated set of technologies that work for all data access and sharing while guaranteeing legislative and regulatory compliance.
Data Privatisation Topology – Data Privacy Laws and Regulations

The landscape of data protection and privacy legislation and regulations is extensive, complex and growing – this is just a partial and incomplete view. Organisations that share data externally need to be able to guarantee compliance with all relevant and applicable legislation.

Data Privatisation Topology – Value in Data Volumes and Data Assets

Organisations have more and more data of increasing complexity that they want and need to share in order to generate value.
Data Privatisation Topology – Technologies

A range of well-proven technologies is available for ensuring data privacy.

Data Privatisation Topology – Data Processes and Business Data Trends

Organisations want to outsource their business processes and share their data with partners to gain access to specialist analytics and research skills and tools.

Data Sharing Use Cases

There are many data sharing use cases and scenarios that involve the sharing of potentially identifiable personal information, such as:

• Share data with other business functions within your organisation
• Use third-party data processing and storage platforms and facilities
• Use third-party data access and sharing as-a-service platforms and facilities
• Use third-party data analytics platforms and facilities
• Engage third-party data research organisations to provide specialist services
• Share data with external researchers
• Outsource business processes and enable data sharing with third parties
• Share data with industry business partners to gain industry insights
• Share data to detect and avoid fraud
• Share customer data with service providers at the request of the customer
• Enable customer switching
• Participate in Open Data initiatives

Pseudonymisation

Pseudonymisation is an approach to deidentification where personally identifiable information (PII) values are replaced by tokens, artificial identifiers or pseudonyms. Pseudonymisation is one technique to assist compliance with EU General Data Protection Regulation (GDPR) requirements for the secure storage of personal information. Pseudonymisation is intended to be reversible: the pseudonymised data can be restored to its original state. Personal data fields can be individually pseudonymised, so that there is a one-to-one correspondence between original source data fields and transformed data fields, or the personal data fields can be removed and replaced with a token.

Figure 8 – Overview of Pseudonymisation
Why Pseudonymise Rather Than Anonymise?

Personal identifiable data is pseudonymised when there is a need to re-identify the data, for example after it has been worked on by a third party either within or outside the organisation and the results of the processing need to be matched to the original data. The following diagram illustrates such a scenario.

Figure 9 – Pseudonymisation for Data Sharing with External Business Partners

The numbered steps are:

1. Original Data – this is the original collected or processed data containing personal identifiable information.
2. Pseudonymised Data – the personal identifiable information within the data is pseudonymised.
3. Pseudonymisation Key – there is a separate pseudonymisation key that allows pseudonymised data to be re-identified when needed. This needs to be kept separate from the pseudonymised data.
4. Pseudonymised Data Transmitted to Data Processor – the pseudonymised data is then sent to the external data processor for their use.
5. Processed Data with Additional Processed Data – the data is enriched with the results of additional processing.
6. Pseudonymised Data with Additional Processed Data Returned – the enriched data is returned to the organisation.
7. Original Data Merged with Additional Processed Data – the enriched data is re-identified using the previously created pseudonymisation key.

Pseudonymisation can also be used as part of the archiving process for data containing personal identifiable information after its main processing has been completed and the data is being retained for historical purposes.
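The round trip in Figure 9 can be sketched in a few lines of Python. The pseudonym scheme, record layout and 'enrichment' step are all invented stand-ins for the numbered steps above.

import secrets

original = [
    {"customer_id": "C-1001", "name": "Anne Murphy", "balance": 250},
    {"customer_id": "C-1002", "name": "Brian Walsh", "balance": 410},
]

# Steps 2-3: pseudonymise; the key is held separately from the shared data.
key = {}
shared = []
for rec in original:
    pseudonym = secrets.token_hex(8)
    key[pseudonym] = rec["customer_id"]
    shared.append({"pseudonym": pseudonym, "balance": rec["balance"]})

# Steps 4-6: the external data processor enriches the pseudonymised records.
for rec in shared:
    rec["risk_score"] = "high" if rec["balance"] > 300 else "low"

# Step 7: re-identify and merge using the separately held pseudonymisation key.
by_id = {rec["customer_id"]: rec for rec in original}
for rec in shared:
    by_id[key[rec["pseudonym"]]]["risk_score"] = rec["risk_score"]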
GDPR Origin Of Pseudonymisation

The use of pseudonymisation as a form of encryption of personal identifiable information gained importance and legitimacy from the GDPR, where it is referred to many times. The term pseudonymisation is defined in Article 4(5) of the GDPR:

'pseudonymisation' means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;

Pseudonymisation is also referred to in Recitals 26 and 28 of the GDPR:

Recital 26: The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

Recital 28: The application of pseudonymisation to personal data can reduce the risks to the data subjects concerned and help controllers and processors to meet their data-protection obligations. The explicit introduction of 'pseudonymisation' in this Regulation is not intended to preclude any other measures of data protection.

Article 32(1)(a), dealing with security, refers to the pseudonymisation and encryption of personal data, using pseudonymisation to mean changing personal data so that the resulting data cannot be attributed to a specific person without the use of additional information.

Article 89, covering safeguards and derogations relating to processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, refers to pseudonymisation as follows:

1. Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards, in accordance with this Regulation, for the rights and freedoms of the data subject. Those safeguards shall ensure that technical and organisational measures are in place in particular in order to ensure respect for the principle of data minimisation. Those measures may include pseudonymisation provided that those purposes can be fulfilled in that manner. Where those purposes can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, those purposes shall be fulfilled in that manner.
Article 6(4), covering lawfulness of processing, refers to pseudonymisation as a means of possibly contributing to the compatibility of further use of data:

Where the processing for a purpose other than that for which the personal data have been collected is not based on the data subject's consent or on a Union or Member State law which constitutes a necessary and proportionate measure in a democratic society to safeguard the objectives referred to in Article 23(1), the controller shall, in order to ascertain whether processing for another purpose is compatible with the purpose for which the personal data are initially collected, take into account, inter alia:

(a) any link between the purposes for which the personal data have been collected and the purposes of the intended further processing;
(b) the context in which the personal data have been collected, in particular regarding the relationship between data subjects and the controller;
(c) the nature of the personal data, in particular whether special categories of personal data are processed, pursuant to Article 9, or whether personal data related to criminal convictions and offences are processed, pursuant to Article 10;
(d) the possible consequences of the intended further processing for data subjects;
(e) the existence of appropriate safeguards, which may include encryption or pseudonymisation.

Article 25 refers to pseudonymisation as a means to contribute to data protection by design and by default in data applications:

1. Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects.

Encryption is a form of pseudonymisation: the original data cannot be read and the process cannot be reversed without the correct decryption key. The GDPR requires that this additional information be kept separate from the pseudonymised data. Pseudonymisation reduces the risks associated with data loss or unauthorised data access. Pseudonymised data is still regarded as personal data and so remains covered by the GDPR. It is viewed as part of the Data Protection By Design and By Default principle. Pseudonymisation is not mandatory: implementing pseudonymisation with old legacy IT systems and processes may be complex and expensive and, to that extent, pseudonymisation might be considered an example of unnecessary complexity within the GDPR.

In relation to processing that does not require identification, it is appropriate to refer to Article 11. Article 11(1) provides that if the purposes for which a controller processes personal data do not, or no longer, require the identification of a data subject by the controller, the controller shall not be obliged to maintain, acquire or process additional information in order to identify the data subject for the sole purpose of complying with the GDPR. Where, in such cases, the controller is able to demonstrate that it is not in a position to identify the data subject, the controller shall inform the data subject accordingly, if possible, and in such cases Articles 15 to 20 shall not apply except where the data subject, for the purpose of exercising his or her rights under those articles, provides additional information enabling his or her identification.

The GDPR has effectively made pseudonymisation the recommended approach to protecting personal identifiable information.
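As noted above, encryption is itself a form of pseudonymisation, with the decryption key playing the role of the separately held 'additional information'. A minimal sketch using the third-party Python cryptography package (an assumption – any authenticated symmetric cipher would serve):

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # the 'additional information': store separately
cipher = Fernet(key)

name = b"Anne Murphy"
pseudonym = cipher.encrypt(name)       # the stored, unreadable value
recovered = cipher.decrypt(pseudonym)  # reversible only with the key
assert recovered == name

Note that Fernet tokens are randomised, so encrypting the same value twice yields different tokens; if pseudonyms must also act as consistent join keys, a deterministic scheme would be needed instead.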
Growing Importance Of Pseudonymisation

The Schrems II judgement3 has further increased the importance and relevance of data pseudonymisation, particularly in relation to data transfers outside the EU. The judgement found that the US FISA (Foreign Intelligence Surveillance Act) does not respect the minimum safeguards resulting from the principle of proportionality and cannot be regarded as limited to what is strictly necessary. While the changes apply to transfers outside the EU, especially to the US, they can be adopted pervasively for all data transfers to ensure consistency.

The European Data Protection Board (EDPB) adopted version 2 of its recommendations on supplementary measures4 to enhance data transfer arrangements and ensure compliance with EU personal data protection requirements. In this context, data pseudonymisation must ensure that:

• Data is protected at the record and data set level as well as at the field level, so that the protection travels with the data wherever it is sent
• Direct, indirect and quasi-identifiers of personal information are protected
• The approach attempts to protect against mosaic effect re-identification attacks by adding high levels of uncertainty to pseudonymisation techniques

Approaches To Pseudonymisation

There are several potential approaches to pseudonymisation that can be implemented, as shown in the following diagram:

Figure 10 – Overview of Approaches to Pseudonymisation

These approaches include:

• Replace IDAT Fields With Linking Identifier
• Hash IDAT Fields
• Hash IDAT Fields With Additional Salting/Peppering
• Generate Hash From All Contents

3 https://curia.europa.eu/juris/document/document.jsf?text=&docid=228677&pageIndex=0&doclang=en
4 https://edpb.europa.eu/system/files/2021-06/edpb_recommendations_202001vo.2.0_supplementarymeasurestransferstools_en.pdf
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 20 • Hash IDAT Fields With Additional Salting/Peppering • Generate Hash From All Contents These approaches are explained in more detail in the next sections. In the following, IDAT means identifying data and refers to personally identifiable information, and ADAT means analytic information. Pseudonymisation By Replacing ID Fields With Linking Identifier (Token) This approach involves replacing identifying data fields with a random value. These random values are then stored in a separate, secure, non-accessible dataset that links each random value to the original record, as sketched in the example below. Figure 11 – Pseudonymisation By Replacing ID Fields With Linking Identifier
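To make the mechanics concrete, the following is a minimal Python sketch of the token-based approach. The record layout, the field names (IDAT1, ADAT1) and the token length are illustrative assumptions, not a prescribed implementation; the essential points are that the token is random rather than derived from the data, and that the linking table is stored separately and securely.

```python
import secrets

# Illustrative records: IDAT1 is an identifying field, ADAT1 an analytic field
records = [
    {"IDAT1": "Mary Murphy", "ADAT1": 42},
    {"IDAT1": "John Smith", "ADAT1": 17},
]

pseudonymised = []           # shareable dataset: tokens plus analytic fields only
depseudonymisation_key = {}  # token-to-identity map; store separately and securely

for record in records:
    token = secrets.token_hex(16)  # random linking identifier from a CSPRNG
    depseudonymisation_key[token] = record["IDAT1"]
    pseudonymised.append({"TOKEN": token, "ADAT1": record["ADAT1"]})
```

Because the token carries no information derived from the identifying data, re-identification is impossible without access to the separately held key table.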
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 21 Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields Where there are multiple identifying data fields, each can be replaced with a random value, or the multiple identifying data fields can be removed and replaced with a single identifier. Figure 12 – Pseudonymisation By Replacing ID Fields With Linking Identifier – Multiple ID Fields Replacing multiple source fields with a single token field reduces the granularity with which the original source data can be retrieved: the entire set of source fields must be retrieved from the depseudonymisation key before the individual field required can be extracted. ID Field Hashing Pseudonymisation The hashing approach to pseudonymisation involves replacing identifying data with a hash code of the data. So, for example, the SHA3-512 hash of IDAT1 in hexadecimal is: 576c23e0ec773508ae7a03d1b286d75f3a7cfe524625b658a1961d3fa7b0ebb4cc01b3b530c634c9525631614ad3ebcb3afb69d33e5d8608a1587c2f43c16535 The SHA3-512 algorithm returns a 512-bit value. The hexadecimal value above is represented as the following binary string: 01010111011011000010001111100000111011000111011100110101000010001010111001111010000000111101000110110010100001101101011101011111001110100111110011111110010100100100011000100101101101100101011110100001100101100001110100111111101001111011000011101011101101001100110000000001101100111011010100110000110001100011010011001001010100100101011000110001011000010100101011010011111010111100101100111010111110110110100111010011001111100101110110000110000010001010000101011000011111000010111101000011110000010110010100110101
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 22 Storing a SHA3-512 hash code requires 64 bytes. In the case of some identifying data fields, this may be longer than the field itself. So pseudonymisation will increase storage requirements, both by replacing shorter fields with longer ones and by requiring the storage of separate depseudonymisation keys – see page 28. The input identifying data cannot be recalculated from the hash directly. However, hash values can be calculated easily and quickly (a “brute force” attack) and compared to pseudonymised values to recover the original identifying data. Figure 13 – ID Field Hashing Pseudonymisation Hashing And Identifier Codes If any of the IDAT fields contains a recognisable identifier code then brute force hash attacks are very feasible, even with modest computing resources. In general, identifying data tends to be more structured than other data – names, addresses, codes and so on. For example, consider an identifier code with a format such as:
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 23 AAA-NNN-NNN-C where: A is an upper-case alphabetic character N is a number from 0-9 C is a check character There are 17,576,000,000 possible combinations of this sample identifier code. This may appear to be a large number, but a single high-specification PC could calculate the SHA3-512 hash values for all of these combinations in a few hours. So, unless the input to the hash generation is augmented with additional, more random information, brute force attacks are feasible; a sketch of such an attack follows this section. The following illustrates how a small (single character) change in the sample input value – in this case, changing a character from lower to upper case – generates a very different hash code. The input text in each case is: ... no man has the right to fix the boundary of a nation. No man has the right to say to his country, "Thus far shalt thou go and no further", and we have never attempted to fix the "ne plus ultra" to the progress of ... with only the capitalisation of the phrase "ne plus ultra" varying: • "ne plus ultra" – SHA3-512 hash: e0ef7bd38b6b4bc6a27e7260d2162b2ea58cf5afa5098072d0f735f9d73b67f9b9f699b8b098ec41d44e117135e88b3cfb670876a2f34efd5734e7ce80b64450 • "Ne plus ultra" – SHA3-512 hash: e0ab9f0efb8f4cc2b89b73439f7b1365e687b17b7e0bdc0ede00751a5a883ad8ee0877b9b6a3032ad23521a7bc25a0b199e5c57cdb2cb5d7500c997e133c41a1 • "ne Plus ultra" – SHA3-512 hash: 61361212da56a824559b81409cf02ba5f8c3bf41d4c8038faa885a183e1bdac1705eefad72594af1fc3901aa55295c3166eb6635ca866f1e5cdf56c7ff0fb56a • "ne plus Ultra" – SHA3-512 hash: 833d8b7cc47843cf74fd42cbbf782e87543c677ecbdc1f7fe4d7ad9166557fac4c17d467fa81302a195e60a0a6f3f89c34e03a5c94eefcb3f19cabcfd87a37ad Hashing And Reversibility The hash of a given value is always the same – there is no randomness in hashing. However, as shown above, hashes of very similar input values are very different: a very small input change leads to a very large difference in the generated hash. For SHA3-512, a 0.5% change in the input value leads to an 85%-95% difference in the hash output. So, given two hash values, it cannot easily be determined how similar the input values are or what the structure of the input values might be. This non-correlation property means that hash outputs reveal nothing about the similarity or structure of their inputs. Hashing as a form of pseudonymisation is potentially vulnerable to brute force attacks, as large numbers of hashes can be generated very easily and quickly. If you have some knowledge of the input value, you can generate large numbers of permutations and their hashes and compare them with the known hash to identify the original value. But ultimately you have to have the exact input value to generate the same hash: being very close is of no benefit. Therefore, combining the original data with even a small amount of randomised data renders brute force attacks on hash values much more complex.
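The feasibility of such a brute force attack can be sketched in a few lines of Python. The target identifier value and the reduced search space below are hypothetical and chosen purely for illustration; a real attack would simply enumerate every position of the AAA-NNN-NNN-C format.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def sha3_512_hex(value: str) -> str:
    return hashlib.sha3_512(value.encode("utf-8")).hexdigest()

# Hash of an unknown identifier with the AAA-NNN-NNN-C structure
target = sha3_512_hex("ABC-123-456-X")  # hypothetical target value

# Enumerate candidate codes and compare hashes; only three positions are
# varied here for brevity - extending the loops to all positions covers
# the full 17,576,000,000-code space
for a, n, c in product(ascii_uppercase, digits, ascii_uppercase):
    candidate = f"AB{a}-12{n}-456-{c}"
    if sha3_512_hex(candidate) == target:
        print("Recovered identifier:", candidate)
        break
```

The loop body is nothing more than a hash and a comparison, which is why structured identifier spaces offer so little protection on their own.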
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 24 ID Field Hashing Pseudonymisation With Data Salting And Peppering Salt is a distinct additional data item added to each identifying data field before hashing. Pepper is a fixed item of data added to record-level or field-level data before hashing. With this approach the hashed identifying data is: HASH(CONCATENATE(IDATi + SALTi + PEPPER)) For example: SHA3-512(CONCATENATE(IDAT1 + SALT1 + PEPPER)) = 3fa075114200b2327092f18067059ba81a5b191b33d5a10a2042673adcb119fac4dc5d3f63c60d44e132f4db5996d416fd70216d4e055f1e5ccc0258ff15e1e1 This approach eliminates almost all the risk from brute force hash generation attacks, unless the approach to generating the Salt and Pepper can be determined. Figure 14 – ID Field Hashing Pseudonymisation With Data Salting And Peppering While the Pepper value seems to add little to the randomisation of the hash, it makes determining the pseudo random number generator harder and thus makes the hash more secure.
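A minimal sketch of salted and peppered hashing in Python follows; it uses the standard secrets module, which draws from a cryptographically secure source. The pepper value and the salt length are illustrative assumptions: in practice the pepper would be held outside the database, for example in a secrets vault, and each salt would be stored alongside the depseudonymisation key.

```python
import hashlib
import secrets

PEPPER = b"organisation-wide-secret"  # hypothetical fixed pepper, held separately

def pseudonymise_field(idat: str) -> tuple[str, bytes]:
    salt = secrets.token_bytes(16)  # fresh per-field salt from a CSPRNG
    digest = hashlib.sha3_512(idat.encode("utf-8") + salt + PEPPER).hexdigest()
    return digest, salt  # the salt must be retained to reproduce the hash

token, salt = pseudonymise_field("IDAT1")
```

With a different random salt per field, an attacker can no longer precompute hashes for the identifier space, since every field value now has its own effectively unpredictable hash input.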
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 25 One possible approach to generating the Salt is to use a cryptographically secure pseudo random number generator5 (CSPRNG) to generate the salt values; other, less secure PRNGs are vulnerable to attacks. This ensures that the random salt values are very difficult to determine, which in turn makes brute force attacks virtually impossible. The following shows some examples of random numbers added to identifying data to generate hash codes: HASH(CONCATENATE(IDAT1+1144360296176+2356573852518)) HASH(CONCATENATE(IDAT2+4700182946372+2356573852518)) HASH(CONCATENATE(IDAT3+1112492458021+2356573852518)) HASH(CONCATENATE(IDAT4+2755842713752+2356573852518)) HASH(CONCATENATE(IDAT5+6908485085952+2356573852518)) Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering Using this approach to augment the identifying data hash, an attacker needs to know three pieces of information in order to find the identifying data and the additional random data used to generate a hash code: 1. The structure of the identifying data, in order to generate all possible permutations 2. The pseudo random number generator used to generate the Salt values 3. The specific Pepper code used, if one has been added. Figure 15 – Data Attacks – ID Field Hashing Pseudonymisation With Data Salting And Peppering 5 Examples of cryptographically secure pseudo random number generators are: Fortuna - https://www.schneier.com/academic/fortuna/ PCG - https://www.pcg-random.org/
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 26 Content Hashing Pseudonymisation Content hashing involves generating the hash token from the entire record contents rather than just the individual identifying fields. For example, the hash is generated from: SHA3-512(IDAT1,ADAT1,SALT1,PEPPER) = df767164078cb0779d06c1de02de74c62192461e82bbb0d01d60c3c3664c9c69111d5d2f07415333e85cc04acfc1f7a204eadd8deead25a63c5a5ad343a5b3f2 This results in a very high degree of variability in the source data for the hashes. It increases the difficulty of identifying the source data that generated the hash code. Figure 16 – Content Hashing Pseudonymisation
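The following sketch extends the earlier salted and peppered example to whole-record hashing. The field-joining convention used here is an assumption for illustration; any deterministic serialisation of the record would serve equally well.

```python
import hashlib
import secrets

PEPPER = b"organisation-wide-secret"  # hypothetical fixed pepper, as before

def content_hash(record: dict, salt: bytes) -> str:
    # Serialise every field - identifying and analytic alike - so the hash
    # input varies with the entire record contents
    payload = "|".join(f"{k}={v}" for k, v in sorted(record.items()))
    return hashlib.sha3_512(payload.encode("utf-8") + salt + PEPPER).hexdigest()

record = {"IDAT1": "Mary Murphy", "ADAT1": 42}
token = content_hash(record, secrets.token_bytes(16))
```

Because the analytic fields now contribute to the hash input as well, even an attacker who knows the structure of the identifying data faces a far larger and less predictable search space.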
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 27 Pseudonymisation And Data Lakes/Data Warehouses Data should be pseudonymised before the data lake and/or data warehouse is populated, as part of a Data Privacy By Design And By Default approach. At a high level, the stages involved are: 1. As part of the standard ETL/ELT process, the source data is pseudonymised and the depseudonymisation key is created. 2. The pseudonymised data is passed to the data lake. The data may remain in the data lake or it may be used to populate the data warehouse. 3. The pseudonymised data created by the ETL/ELT process may be used to update the data warehouse directly, bypassing the data lake stage. 4. The pseudonymised data in the data lake is used to update the data warehouse. Figure 17 – Pseudonymisation and Data Lakes/Data Warehouses The data in the data warehouse can then be made available for more general use within the organisation without any concerns about personal data being exposed. This ensures compliance with GDPR Article 6 (see page 17). In this case, pseudonymisation is used as part of the archiving process for data containing personally identifiable information after its main processing has been completed and the data is being retained for historical and analytical purposes.
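As an illustration, the sketch below shows where pseudonymisation might sit in a simple ETL step, using pandas purely as an example transformation layer. The column names and the token scheme are assumptions; the point is that the depseudonymisation key is produced as a by-product of the load and routed to separate, secured storage rather than into the data lake.

```python
import secrets
import pandas as pd

def pseudonymise_for_load(source: pd.DataFrame, idat_cols: list[str]):
    """Split a source extract into a pseudonymised frame for the data lake
    and a depseudonymisation key table for separate, secured storage."""
    tokens = [secrets.token_hex(16) for _ in range(len(source))]
    key_table = source[idat_cols].copy()   # originals retained in the key table
    key_table["TOKEN"] = tokens
    lake_frame = source.drop(columns=idat_cols)
    lake_frame["TOKEN"] = tokens
    return lake_frame, key_table

extract = pd.DataFrame({"IDAT1": ["Mary Murphy"], "ADAT1": [42]})
lake_frame, key_table = pseudonymise_for_load(extract, ["IDAT1"])
```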
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 28 Pseudonymisation Implementation As mentioned on page 22, storing a SHA3-512 hash code requires 64 bytes. In the case of some identifying data fields, this may be longer than the field itself. So pseudonymisation will increase storage requirements, both by replacing shorter fields with longer ones and by requiring the storage of separate depseudonymisation keys. For example, a table in an Oracle database with 10 million records, five IDAT fields each with an average length of 20 bytes, five ADAT fields each with an average length of 8 bytes and one index column of 8 bytes will require about 1.22 GB of storage. With pseudonymisation of individual IDAT fields, each of these will be replaced with a 64-byte hash value, and the table size will increase to about 2.48 GB. There will also be a depseudonymisation key table that will hold both the original five IDAT fields, each with an average length of 20 bytes, and the five pseudonymisation fields of 64 bytes each, as well as one index column of 8 bytes. This will occupy 2.95 GB of storage. So, in this example, pseudonymisation increases storage requirements from 1.22 GB to 5.43 GB, an increase of 4.21 GB. As mentioned on page 21, replacing multiple source IDAT fields with a single pseudonymisation hash reduces the granularity with which the original source data can be retrieved: the entire set of source fields must be retrieved from the depseudonymisation key before the individual field required can be extracted. This reduces the storage overhead. A separate depseudonymisation key table is not always required: the original source data, with its personally identifiable information, can itself be used as the depseudonymisation key. The pseudonymised data will then need to store a link to the row in the original source data. Alternatively, the hash code contained in the pseudonymised data could be compared with a hash code generated from the source data. However, in this case, if the hash generation process was augmented with salting and peppering, the correct salt would have to be available to regenerate the hash. Data Breaches and Attacks The objectives of data privatisation technologies are: • To prevent data breaches and attacks • To minimise or eliminate the impact of a data breach or attack Data privatisation technologies are just one of a number of layers of data protection an organisation should implement for its systems and data. Data access and data sharing arrangements introduce an additional level of data privatisation complexity in that the person or organisation being given access to the data may be the attacker, or may not operate data protection arrangements to the same standard as the source organisation. So, the source organisation should assume that data sharing and access arrangements are implicitly compromised and act accordingly. There are many security frameworks that can be used to define this wider organisation security framework, such as: • Center for Internet Security (CIS) Critical Security Controls – https://www.cisecurity.org/controls/
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 29 • Control Objectives for Information Technologies (COBIT) – https://www.isaca.org/resources/cobit • NIST: Cybersecurity Framework, 800-53, 800-171 – https://csrc.nist.gov/Projects/risk-management/sp800-53-controls/downloads • US FedRAMP (Federal Risk and Authorization Management Program – https://tailored.fedramp.gov/) Security Controls Baseline – https://tailored.fedramp.gov/static/APPENDIX%20A%20-%20FedRAMP%20Tailored%20Security%20Controls%20Baseline.xlsx • Cybersecurity Maturity Model Certification (CMMC) – https://www.acq.osd.mil/cmmc/documentation.html • Cloud Security Alliance (CSA) Cloud Controls Matrix (CCM) – https://cloudsecurityalliance.org/research/cloud-controls-matrix/ The analysis of these security standards and frameworks is outside the scope of this paper. Pseudonymisation and Data Breaches Pseudonymisation protects against data breaches by making data unusable should it be exposed. Figure 18 – Pseudonymisation and Data Breaches The ways in which pseudonymised data can be exposed, and the impact of these breaches, include: 1. The data may be exposed, accidentally or deliberately, by the entity with which the data is shared. If the data is correctly pseudonymised and the pseudonymisation algorithm is protected, then the impact of such a breach would be low.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 30 2. The sharing organisation may cause the pseudonymised data to be exposed. For example, the data sharing mechanism used to share or provide access to the data may be compromised. The impact of such a breach would be low. 3. The depseudonymisation key may be compromised. The risk of personal data re-identification will be high if this happens. 4. The pseudonymisation algorithm may be compromised. The risk of personal data re-identification will be high if this happens. Differencing Attack Differencing attacks work by running multiple partially overlapping queries against summarised data until the results can be combined to identify an individual. Differencing attacks apply especially to differential privacy data access platforms. For example, the following set of queries can be run against the data: • How many people in the group are aged greater than N? • How many people in the group aged greater than N have attribute A? • How many people in the group aged greater than N have attribute B? • How many people with ages in the range N-9 to N-5 are male? • How many people with ages in the range N-4 to N are male? After a number of queries, you may be able to determine that individuals, or small numbers of individuals, in a given age range and of a given sex have a defined attribute. Apparently anonymous summary results can be combined to reveal potentially sensitive insights and compromise confidentiality. Differential privacy can be designed to reduce or eliminate the threat of differencing attacks by attaching a cost to each query, as sketched below: a budget is assigned to the dataset and the amount spent by queries against the dataset is tracked. When the budget is expended, no more queries can be run until the budget is increased. A differential privacy platform should also be able to track the queries performed by the consumers given access, to determine potential patterns of abuse.
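The budget mechanism can be sketched as follows. The epsilon values and the simple additive accounting are illustrative assumptions rather than the accounting model of any particular platform.

```python
class PrivacyBudget:
    """Minimal sketch of per-dataset privacy budget accounting."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, query_epsilon: float) -> None:
        # Refuse the query outright if it would exceed the dataset budget
        if self.spent + query_epsilon > self.total_epsilon:
            raise PermissionError("Privacy budget exhausted for this dataset")
        self.spent += query_epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first query allowed
budget.charge(0.4)  # second query allowed
budget.charge(0.4)  # raises PermissionError: would exceed the budget
```

However many partially overlapping queries an attacker formulates, the cumulative cost caps how much can be learned from any one dataset.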
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 31 Figure 19 – Differential Privacy and Differencing Attacks Differencing Attack, Reconstruction Attack And Mosaic Effect In addition to a differencing attack, there are various other types of attack that can be performed on data as released, without the need for any further access to the source data: • A reconstruction attack uses the information from a differencing attack to identify how the original dataset was processed to create the summary. • A mosaic effect attack involves combining the released data with other (public) data sources to identify individuals. For example, apparently anonymised medical data containing dates of death can be combined with public death notice records to identify individuals. This results in a data attack topology that should be monitored to ensure data privatisation is maintained.
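The mosaic effect can be demonstrated with a trivial join. The datasets below are fabricated purely for illustration; the attack works whenever a quasi-identifier such as date of death appears in both the released data and a public source.

```python
import pandas as pd

# "Anonymised" medical extract: names removed, date of death retained
medical = pd.DataFrame({
    "date_of_death": ["2021-03-14", "2021-06-02"],
    "diagnosis": ["Condition A", "Condition B"],
})

# Public death notices: names and dates openly published
notices = pd.DataFrame({
    "date_of_death": ["2021-03-14", "2021-06-02"],
    "name": ["Mary Murphy", "John Smith"],
})

# Joining on the shared date re-identifies individuals whenever a date
# matches only one notice
reidentified = medical.merge(notices, on="date_of_death")
print(reidentified)
```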
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 32 Figure 20 – Differencing Attack, Reconstruction Attack And Mosaic Effect Differential Privacy Differential privacy allows for the (public) sharing of information about a group or aggregate by describing the patterns of groups within the aggregate while suppressing information about the individuals in it. Source data is aggregated and summarised, and individual personal references are removed. The one-to-one correspondence between original and transformed data is removed. A viewer of the information cannot (or should not be able to) tell whether a specific individual's information was or was not used in the group or aggregate. This involves inserting noise into the results returned from a query of the data by a differential privacy middleware tool. The greater the noise introduced, the less usable the data will be, but the lower the re-identification risk. It is a well-proven, widely used and robust technique6. It aims to eliminate the possibility of re-identification of individuals from the dataset being analysed. Individual-specific information is always hidden. Differential privacy technologies are more complex than anonymisation and pseudonymisation as an approach to data privatisation: they require more technical skills and possibly the selection and implementation of a software platform. The remainder of this paper covers the topic of differential privacy in more detail. An effective data privatisation and differential privacy operational solution consists at its core of a computational layer that introduces deliberate randomisation into the summarised results returned from a data query. This means that the 6 See The Algorithmic Foundations of Differential Privacy https://www.cis.upenn.edu/~aaroth/privacybook.html.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 33 action of running multiple queries across the dataset cannot be used to reconstruct the underlying individual records. It thus enables Privacy Preserving Data Mining (PPDM). The objective is to prevent access to, or identification of, specific individual personal records or sensitive information while preserving the aggregated or structural properties of the data. Figure 21 – Differential Privacy Operation Differential privacy assigns a privacy budget to each dataset. The differential privacy engine introduces a fuzziness into the results of queries, as sketched below. Each query has a privacy cost, and the total privacy expenditure across all queries by all users is tracked. When the budget has been spent, no further data queries can be performed until more privacy budget is allocated. Effective and usable data privatisation and differential privacy means finding the right balance between data privacy and data utility. At one extreme, the solution would be to completely delete, or prevent any access to, the data. While this preserves absolute data privacy, it also eliminates the utility and usefulness of the data.
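The standard way of introducing this fuzziness is the Laplace mechanism, sketched below. The counts and epsilon values are illustrative; the mechanism itself – adding noise scaled to the query's sensitivity divided by epsilon – is the foundational technique of the differential privacy literature cited above.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    # Adding or removing one individual changes a count by at most 1, so
    # Laplace noise with scale sensitivity/epsilon masks any single
    # person's presence in the dataset
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Smaller epsilon means more noise: stronger privacy, lower utility
print(noisy_count(1042, epsilon=0.1))
print(noisy_count(1042, epsilon=1.0))
```

The epsilon parameter is exactly the quantity that the privacy budget tracks: each query's epsilon is deducted from the dataset's allocation.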
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 34 Figure 22 – Data Privatisation and Differential Privacy Balancing Act This results in a balancing act between three factors: 1. Level of Detail Contained in Results Presented 2. Amount and Complexity of Data Processing Allowed 3. Level of Data Privacy Relaxing or constraining one factor affects the other two. In order to determine the right equilibrium across these factors for your organisation and your data, you need to explicitly formalise your approach to data privacy and data utility in a policy. This policy should be accessible to, and understandable by, those in charge of managing data. The policy should also be formally defined so that its applicability and its subsequent implementation, operation and use can be verified. Differential privacy technology can then be used to operationalise this policy, including monitoring its operation and use. Technology is a key enabler of data privatisation and differential privacy: it ensures and embeds Privacy By Design in your data access solution rather than data privacy concerns being addressed as an afterthought. Data Privatisation and Differential Privacy Solution Architecture Overview This section describes the idealised architecture and design of an operational data privatisation and differential privacy solution. This essentially illustrates a reference architecture that you can use to determine what solution components are needed and what must be installed, implemented and configured to create a usable and secure solution within your organisation. It can be used as a structured framework to define business and technical requirements. It can also be used to evaluate suitable products.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 35 Figure 23 – Operational Data Privatisation and Differential Privacy Solution Architecture The numbered components of this are: 1. Core Data Privatisation/Differential Privacy Operational Platform – this is the core differential privacy platform. It can be installed on-premises or on a cloud platform such as AWS, Google Cloud or Azure. It takes and summarises data from designated data sources and provides different levels and types of computational access to authorised users via a data API. It also provides a range of management and administration functions. 2. Data Sources – these represent data held in a variety of databases such as Oracle and SQL Server, other data storage systems such as HDFS, Cassandra, PostgreSQL and Teradata, and external data stores such as AWS S3 and Azure. The differential privacy platform needs read-only access to these data sources. 3. Data Access Connector – these are connectors that enable read-only access to data held in the data sources. 4. Data Ingestion and Summarisation – this takes data from the data sources, processes it and outputs it in a format suitable for access. It includes features to manage data ingestion workflows, scheduling and error identification and handling. 5. Data Analysis Data Store – the core differential privacy platform creates pre-summarised versions of the raw data from the data sources. The platform never provides access to individual source data records. The data is encrypted while at rest in the data store. 6. Metadata Store – the platform creates and stores metadata about each data source. This is used to optimise the data privacy of the result sets generated in response to data queries.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 36 7. Batch Task Manager – in addition to running online data queries, asynchronous batch tasks can be run for longer data tasks. 8. Access and Usage Log – this logs data accesses. 9. User Access API – the platform provides an API for common data analytics tools such as Python and R to generate and retrieve privatised, randomised sets of data summaries, as well as providing data querying and analytics capabilities. Data results returned from queries are encrypted while in transit. 10. Data Visualisation Interface – this provides a data access and visualisation interface. 11. User Directory – the platform will use your existing user directories, such as Active Directory or Azure Active Directory, for user authentication and authorisation. 12. Authorised Internal Users – authorised internal users can access different datasets and perform different query types depending on their assigned access rights. 13. Authorised External Users – authorised external users can access different datasets and perform different query types depending on their assigned access rights. 14. Analytics and Reporting – this will allow you to analyse and report on user accesses to data managed by the platform. 15. Monitoring, Logging and Auditing – this will log both system events and user activities. This information can be used for platform management and planning as well as for identifying potential patterns of data use and possible abuse. 16. Data Access Creation, Validation and Deployment – this will allow new data sources to be onboarded and existing data sources to be managed and updated. 17. Management and Administration – this will provide facilities to manage the overall platform, such as adding and removing users and user groups and applying data privacy settings to different datasets. 18. Security and Access Control – this allows the management of different types of user access to different datasets. 19. Billing System Interface – you may want to charge for data access, either at a flat rate or by access or a mix of both. This represents an optional link to a financial management system to enable this. Differential Privacy Platform Solution Service Management Processes Just like any other information technology solution, service management processes should be implemented for an operational differential privacy solution. Because a differential privacy solution exposes personal data, albeit in a summarised, randomised and anonymised manner, these service management processes are important. They should be part of any implementation project. This will maximise confidence in differential privacy technology in your organisation and reduce project risk. In turn, this will maximise the success of the platform and ensure that the return on investment is optimised. The following lists what we regard as the most important service management processes in the context of a differential privacy solution. Your organisation will already have invested in information technology service management processes; these should be extended to the differential privacy platform.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 37 • Access Management – This process is concerned with operationalising security management policies relating to enabling authorised users to access the differential privacy platform and managing their access lifecycle. • Availability Management – This process relates to ensuring the differential privacy platform meets its agreed availability targets and obligations by planning, defining, measuring, analysing and improving availability. • Capacity Management – This is concerned with planning, defining, measuring, analysing and delivering the required facilities to ensure that the differential privacy platform has sufficient capacity to meet its service level commitments in the short, medium and long term. • Compliance Management – This process is focused on ensuring that the design, operation and use of the differential privacy platform complies with legal and regulatory requirements and obligations. • Knowledge Management – This is about ensuring that knowledge about the implementation, operation and use of the differential privacy platform is collated, stored and shared, maximising reuse and eliminating the need for knowledge rediscovery. • Operations Management – This process is concerned with implementing and operating the housekeeping activities and tasks relating to the differential privacy solution, including monitoring and controlling the platform and backup and recovery. • Risk Management – This relates to the identification, evaluation and management of risks, including threats to and vulnerabilities of the differential privacy solution. • Security Management – This is concerned with ensuring the confidentiality of the data assets contained in the differential privacy solution. Your organisation will already have invested in security management; this needs to be extended to the differential privacy solution. • Service Continuity Management – This is focused on ensuring that the continuity of operation of, and access to, the differential privacy solution is maintained in the event of problems. • Service Level Management – This relates to the definition and subsequent monitoring of service level targets and service level agreements relating to the access to and use of the differential privacy solution. Differential Privacy Platform Deployment Options This section outlines two solution deployment options: on-premises and in the cloud.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 38 On-Premises Deployment The following diagram illustrates the key components of an on-premises implementation of a differential privacy solution. Figure 24 – Sample High-Level On-Premises Deployment If users outside the organisation are to be given access to the data platform, then either an existing external access facility will be used to provide secure access or a new facility will have to be implemented.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 39 Cloud Deployment The following diagram illustrates the key components of a cloud implementation of a differential privacy solution. Figure 25 – Sample High-Level Cloud Deployment For a cloud deployment, the key differences relate to how on-premises data is processed and transferred to the cloud platform and how data access users outside the organisation authenticate using an approach such as Azure Active Directory.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 40 Differential Privacy and Data Attacks Data Privatisation and Differential Privacy Solution Planning There are many different paths along the journey to the implementation of an operational data privatisation and differential privacy solution. The section Data Privatisation and Differential Privacy Next Steps on page 43 lists some of the possible stages along this journey. This section lists a possible set of activities and tasks that you can use to create a workplan for implementing a workable solution. The goal is to create an operational, supportable, maintainable and usable solution that provides access to your data without compromising data privacy and security. The implementation of a data privatisation and differential privacy solution is not very different from that of any other information technology solution your organisation wants to implement. The following high-level set of steps can be iterated several times as you move from an initial pilot implementation to a complete production solution over time. • Create a prioritised inventory of potential data sources to which you would like to provide secure privatised computational access • Profile the data: understand the structure and contents of the data, evaluate data quality and data conformance with standards, identify the terms and metadata used to describe the data, and identify data relationships and dependencies, data sensitivity, the Privacy Exposure Limit (PEL) and the privacy requirements of each dataset • Define the data extract processes • Identify the target set of users for access to one or more of the datasets and define the type of access • Define and agree user access processes and security requirements
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 41 • Define the subsets of data to be made available for querying • Perform capacity planning and analysis in terms of raw data volumes, expected number and type of data access transactions, data refresh frequency, caching of results for performance, creation of materialised views and other factors that give rise to resource requirements • Define and agree platform audit logging and reporting, user activity monitoring, event, exception and alert handling processes • Define data access charging and billing • Define the platform operational administration, maintenance and support processes • Create a cost model for the solution including license costs, infrastructure, support and maintenance and any proposed revenue streams • Decide on the deployment approach • Define the organisational structures and service management processes needed to support the new solution • Decide on the data integration approach, especially if the solution is to be deployed on a cloud platform • Define the different types of training needed: administrator, support, data administrator, data query user • Create, review, validate and approve a differential privacy solution architecture design that incorporates the information gathered in the previous steps • Conduct a security review of the differential privacy solution • Acquire trial versions of platform licenses • Acquire deployment infrastructure, either on-premises or cloud • Configure the differential privacy platform and its data sources • Validate the platform • Allow user access to the platform in a phased and controlled manner Data Privatisation and Differential Privacy Solution Operation and Use The following lists the key differential privacy platform use cases and what they entail. These can be embedded into the operational service management processes listed in the section Differential Privacy Platform Solution Service Management Processes on page 36.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 42 • User Enrolment – The user must be defined in the organisation’s user directory. The process for enrolling users outside the organisation depends on the platform deployment model – on-premises or cloud. If the user is outside the organisation, then you may choose to use a cloud-based directory such as Azure Active Directory as a SAML identity provider. The user can be assigned to one or more groups, if needed. The user (or the groups to which the user belongs) will have different access rights to different datasets. The access rights include details of the subsets of the data sources that can be queried, and the number and type of data queries the user can run before being prevented from running additional requests. • Platform Usage Reporting and Analysis – The usage of the platform can be analysed in several ways: 1. The overall platform performance, rate of usage, number of users, and number and type of data query transactions, both online and batch, can be analysed and reported on. This will ensure that the platform is able to handle the current and expected future volume of data and its use. 2. The amount of data privacy exposed by user queries can be analysed to ensure that the privacy of the data being made available is maintained. 3. Any charges for access to your data can be determined and bills generated. • Addition of Data Source – The data source should be profiled to understand its structure and content. A link must be defined between the data source and the differential privacy platform’s summarised data subset. The data refresh frequency must be defined. The Privacy Exposure Limit (PEL) of the dataset must be defined. This is the maximum amount of privacy exposed by all data queries run on the dataset. As queries are run, this is incremented. Once the limit has been reached, no further access is possible. • Platform Security Auditing – Platform auditing can be performed at three levels: 1. The overall differential privacy platform can be audited to ensure that it guarantees that no personal information can be disclosed. 2. The privacy settings of individual datasets can be audited to ensure that they are appropriate for the sensitivity of their information. 3. The use of the platform can be audited through the analysis of the audit records collected, to determine unusual patterns of queries by users.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 43 Data Privatisation and Differential Privacy Next Steps The previous section Data Privatisation and Differential Privacy Solution Planning on page 40 contains a generic set of steps involved in planning for differential privacy technology. The journey to creating an industrialised and productionised differential privacy solution can involve a number of points at which a decision to proceed to the next stage can be made. Figure 26 – Data Privatisation and Differential Privacy Solution Journey To allow your organisation to move along this journey, we have identified a number of practical engagement exercises that are designed to answer the specific questions you might have in order to progress your differential privacy journey, and to provide you with specific deliverables. These engagements are: 1. Early Business Engagement and Differential Privacy Opportunity Validation 2. Differential Privacy Design Process 3. Differential Privacy Readiness Assessment 4. Differential Privacy Architecture Sprint Implementing differential privacy technology is a means to an end rather than an end in itself: it is a way of resolving or addressing a data access problem or opportunity. These engagements are designed with this in mind. While these engagement types are described individually here, they can be combined to create a custom exercise to suit your specific needs. The following diagram illustrates at a high level the scope of each of these engagements in terms of their duration and where they fit into your journey to the successful implementation of differential privacy in your organisation.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 44 Figure 27 – Approaches to Data Privatisation and Differential Privacy Solution Scoping and Definition The following summarises the characteristics of each of these engagements: • Question: I want a consulting exercise to define new business structures and associated solutions to address the potential data access provision opportunity. Engagement Type: Early Business Engagement and Differential Privacy Opportunity Validation. Level of Detail in Deliverable: Medium to High. Likely Duration: Medium. What You Get: A validated differential privacy opportunity across the areas of strategic fit; options evaluation and identification; procurement and implementation; expected whole-life revenue and costs; and a realistic and staged plan for achievement. • Question: I want a full detailed design created from an initial, not necessarily well-defined, idea that I can pass to solution delivery. Engagement Type: Differential Privacy Detailed Design. Level of Detail in Deliverable: High. Likely Duration: Medium. What You Get: A detailed end-to-end design for a differential privacy solution encompassing all solution components. Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 45 • Question: I want generalised solution options identified for the potential data access provision opportunity. Engagement Type: Differential Privacy Readiness Assessment. Level of Detail in Deliverable: Low to Medium. Likely Duration: Medium. What You Get: An understanding of the scope, requirements, objectives, approach and options for a differential privacy platform, and a high-level understanding of the likely resources, timescale and cost required before starting the solution implementation. • Question: I have a good idea of the potential data access solution I want and I am looking for a quick view of the solution options and their indicative costs, resources and timescales to implement. Engagement Type: Differential Privacy Architecture Sprint. Level of Detail in Deliverable: Low to Medium. Likely Duration: Short. What You Get: A high-level design for an end-to-end differential privacy solution, focusing on technology aspects, that identifies whether the solution is feasible, worthwhile and justifiable. The following sections contain more detail on each of these engagement types. Early Business Engagement and Differential Privacy Opportunity Validation The engagement is concerned with analysing and defining the structure and operations of a business function within your organisation that will operate a differential privacy platform to provide controlled access to your data. It describes a target business model that includes identifying the differential privacy platform and its constituent components. The objective is to create a realistic, achievable, implementable and operable target differential privacy platform business justification to achieve the desired business targets. This is not an exact engagement with an easily defined and understood extent and duration. It has an essential investigative and exploratory aspect that means it has to be allowed a necessary latitude. This is not an excuse for excessive analysis without reaching a conclusion. The goal is to produce results and answers within a reasonable time to allow decisions to be made based on evidence.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 46 Figure 28 – Early Business Engagement and Differential Privacy Opportunity Validation Process The deliverables from this exercise will contain information in five key areas: strategic fit, options evaluation and identification, procurement and implementation, expected whole-life revenue and costs and a realistic and staged plan for achievement. • Strategic Fit – Business need and its contribution to the organisation’s data strategy; key benefits to be realised; critical success factors and how they will be measured. • Options Evaluation and Identification – Cost/benefit analysis of realistic options for meeting the business need; statement of possible soft benefits that cannot be quantified in financial terms; identification of the preferred option and any trade-offs. • Procurement and Implementation – Proposed sourcing option with reasons; key features of proposed commercial arrangements; procurement approach/strategy with supporting details. • Whole-Life Revenue and Costs – Statement of available funding and details of projected whole-life revenue from and cost of the project (acquisition and operation), including all relevant costs; expected financial benefits. • Realistic and Staged Plan for Achievement – Plan for achieving the desired outcome with key milestones and dependencies; contingency plans; risks identified and mitigation plan; external supplier plans; resources, skills and experience required. Differential Privacy Detailed Design This is a very comprehensive engagement that produces a detailed end-to-end design for a differential privacy solution for your organisation. This approach to solution design is based on using six views as a structure to gather information and to create the design. These six views are divided into two groups: • Core Solution Architecture Views – concerned with the kernel of the solution: • Business • Functional • Data
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 47 • Extended Solution Architecture Views – concerned with solution implementation and operation: • Technical • Implementation • Management and Operation Figure 29 – Differential Privacy Detailed Design Views The core dimensions/views define what the differential privacy solution must do, how it must operate and the results it will generate. The extended dimensions/views define how the solution must or should be implemented, managed and operated. They describe factors that affect, drive and support decisions made during the solution design process. Many of these factors will have been defined as requirements of the solution and so their delivery will be included in the solution design. Together these core and extended views describe the end-to-end solution design comprehensively. Differential Privacy Readiness Assessment The Differential Privacy Readiness Assessment is intended to allow the exploration of an as yet undefined solution that addresses a data access opportunity using differential privacy technology. The work is done from business, information technology and data perspectives. The objective is to understand the scope, requirements, objectives, approach and options for a differential privacy platform, and to get a high-level understanding of the likely resources, timescale and cost required before starting the solution implementation. It looks to identify the changes needed within the organisation in order to successfully adopt differential privacy technology and use it to make your data more widely available.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 48 Figure 30 – Areas Covered in Differential Privacy Readiness Assessment These domains of change can be categorised as follows: • Business-Oriented Change Areas − Facilities – existing and new facilities of the organisation, their types and functions − Business Processes – current and future business process definitions, requirements, characteristics and performance − Organisation and Structure – organisation resources and arrangement, business unit, function and team structures and composition, relationships, reporting and management, roles and skills • Technology-Oriented Change Areas − Technology and Infrastructure – current and future technical infrastructure including security, constraints, standards, technology trends, characteristics and performance requirements − Applications and Systems – current and future applications and systems including the core differential privacy platform and any extended components, their characteristics, constraints, assumptions, requirements, design principles, interface standards and connectivity to business processes − Information and Data – the data to which privatised access is to be provided, data and information architecture, data integration, data access and management, data security and privacy The analysis also includes an extended change domain that covers the organisation’s operating environment and business landscape and the organisation’s data access and data availability strategy. This categorisation provides a structure for the engagement. It aims to define the changes across these domains that are needed to use differential privacy technology to enable data access.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 49 Differential Privacy Architecture Sprint This engagement is designed to produce a high-level design for an end-to-end differential privacy technology solution. The focus is on the breadth of the technology solution rather than on depth and detail. This engagement recognises that the journey from initial business concept to operational solution is rarely simple. Not all business concepts progress to solution delivery projects and not all solution delivery projects advance to a completed operational solution. There is always an inevitable and necessary attrition during the process. There are many reasons why this should and could happen: business and organisation needs and the operational environment both change, and budgets and resources are prioritised elsewhere. In this light, there is a need for a differential privacy solution design sprint that generates results quickly: to identify the feasible, worthwhile, justifiable concepts that merit proceeding to implementation and to eliminate those that are not cost-effective. The areas analysed in the differential privacy solution design sprint are: • Systems/Applications – these are the existing systems and applications that will participate in the operation of the differential privacy solution and which may need to be changed, and the new systems and applications that will have to be delivered as part of the solution • System Interfaces – these are links between systems for the transfer and exchange of data • Actors – these are individuals, groups or business functions who will be involved in the operation and use of the differential privacy solution • Actor-System Interactions – interactions between Actors and Systems/Applications • Actor-Actor Interactions – interactions between Actors • Functions – these are activities that are performed by actors using facilities and functionality provided by systems • Processes – the business processes required to operate the differential privacy solution and the business processes enabled by the solution, including new business processes and changes to existing business processes • Journey – the standard journey through processes/functions and exceptions/deviations from this “happy path” • Logical Data View – the data elements required • Data Exchanges – movement of data between Systems/Applications This set of information combines to provide a comprehensive view of the potential differential privacy solution at an early stage.
Data Privatisation, Data Anonymisation, Data Pseudonymisation and Differential Privacy Page 50 For more information, please contact: Alan McSweeney alan@alanmcsweeney.com