Privacy-preserving AI/ML

Oscar Okwero
Dec 1, 2023

Introduction

AI has brought tremendous changes to how we consume technology by making machines exhibit human-like capabilities: autonomous vehicles, UAVs, conversational AI and even machine-assisted clinical surgeries. These advances have been made possible by the enormous amounts of data generated from our digital activities, such as social media interactions, IoT sensor data and other big data sources like cloud computing systems. As the computing landscape has expanded across mobile devices, cloud applications supporting e-commerce and e-government, and IoT devices, an increased rate of cyber-attacks has inevitably followed, with compromise of data privacy being a major risk.

In AI, data privacy is a central issue in training and testing models and in developing inferences from historical training data. To infer a potential strain of a disease from historical medical data, for example, an ML model needs to be trained on a large data set that closely resembles production data so that it can accurately identify the strain in unseen data. This means the model would ‘ideally’ be trained on live data containing very sensitive personal and medical records, possibly including personally identifiable information, a practice that would be reckless and irresponsible from a data security governance perspective. Furthermore, even if the sensitive personal information is masked before training, it has been shown that models can be reverse-engineered to retrieve a representative set of the training data, which essentially compromises the privacy of the subjects whose data was used to train the model.

Privacy risks of AI

Among the long-standing risks of AI is that training data contains human biases, since it tries to accurately represent human behaviour, which often contains biases. An artificial intelligence model trained on this data will therefore replicate, reinforce or amplify those biases. This can have serious consequences, for example in employment, where certain demographics would be disadvantaged by AI bias. ML/AI algorithms have also been proven vulnerable to cyber-attacks that are capable of compromising the privacy of the training data and can be further weaponized to carry out more sophisticated privacy-compromising attacks, commonly referred to as “adversarial machine learning”.

According to ISACA, some of the challenges faced in data engineering for AI models include: data persistence, defined as data existing longer than the human subjects that created it, driven by low data storage costs; data repurposing, where data is used beyond its originally imagined purpose; and data spillovers, where data is collected on people who are not the target of the data collection. These characteristics of data engineering have been shown to be susceptible to compromise that reveals features of the training data sets, leading to potential privacy compromise where sensitive training data is used.

Tab attacks: These can occur when an attacker has access to the top-1 predictions of a language model, allowing model features to leak through pre-set controls such as tab clicks.

Poisoning attacks: These occur when attackers manipulate an ML/AI model so that it exposes sensitive training data or makes predictions that potentially expose sensitive information on new data sets.

Evasion attacks: These occur when a model is forced to act in an unintended way, which may include wrongly classifying an output or exposing sensitive features of a model, such as personally identifiable information (PII) or protected health information (PHI).

Privacy-preserving AI/ML

In an effort to secure ML/AI models against such adversarial machine learning, which aims to compromise the privacy and security of the training and/or prediction data, Microsoft Research proposed a tiered engineering approach that protects models by implementing privacy-preserving techniques at each stage of the life cycle. This involves securing every stage of model development as described below:

Training data privacy: The guarantee that a malicious actor will not be able to reverse-engineer the training data.

Input privacy: The guarantee that a user’s input data cannot be observed by other parties across the whole life cycle of model development.

Output privacy: The guarantee that the output of a model cannot be reversed by a threat actor to deduce the predictive methodology of the model.

Model privacy: The guarantee that the model cannot be taken over by a threat actor to compromise its design and therefore its predictive features.

The following design considerations have been tested as potential solutions for privacy-preserving technologies:

1. Synthetic data: Using synthetic data (SD) or anonymised/de-identified data helps avoid the use of sensitive data to train models, thereby preserving privacy. However, this can compromise the quality of the models, since such data may not fully represent the characteristics of real production data. A minimal sketch is shown below.
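
For illustration only (this sketch is mine, not from the original article), a model can be trained on synthetically generated records that mimic the shape of production data. The example below uses scikit-learn’s make_classification; every size and parameter is an arbitrary placeholder.

```python
# Sketch: training on synthetic data instead of sensitive production records.
# Feature and sample counts are arbitrary placeholders, not from the article.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a synthetic data set that mimics the *shape* of production data
# (e.g. 20 clinical features, binary outcome) without containing real patients.
X, y = make_classification(n_samples=5_000, n_features=20,
                           n_informative=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("Accuracy on held-out synthetic data:", model.score(X_test, y_test))
```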

2. Federated learning: Instead of training models on large centralised data repositories, federated learning performs the training on end points/edge devices; only the locally trained model updates leave the device and are combined to achieve the collective objective. This makes it harder for an attacker to reverse individual contributions from the collective model. A simplified sketch follows.
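
As a rough illustration (my own sketch, not from the article), federated averaging trains a model locally on each client’s private data and shares only the resulting weights, which a server then averages. The data, learning rate and round count below are illustrative placeholders.

```python
# Sketch of federated averaging (FedAvg) with a linear model and numpy.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client trains locally on its private data; only weights leave the device."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])

# Each client holds its own private data set; the server never sees it.
clients = []
for _ in range(4):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

global_w = np.zeros(3)
for _ in range(20):
    # Clients train locally, then the server averages the returned weights.
    local_weights = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_weights, axis=0)

print("Recovered weights:", global_w.round(2))
```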

3. Differential privacy: A technique that adds randomized “noise” to the training data or to the computation, so that the original individual inputs cannot be reverse-engineered from the output. The noise inherently interferes with the quality of the predictions: the more noise added, the stronger the privacy guarantee but the lower the accuracy and reliability of the results. A small example follows.
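
A minimal sketch of the idea, assuming the classic Laplace mechanism (the query, the records and the epsilon values below are illustrative only):

```python
# Sketch: differentially private count query using the Laplace mechanism.
# Smaller epsilon means more noise and therefore stronger privacy.
import numpy as np

def dp_count(values, threshold, epsilon):
    """Return a noisy count of values above threshold; the sensitivity of a count is 1."""
    true_count = sum(v > threshold for v in values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 45, 29, 61, 52, 38, 70, 27]          # hypothetical sensitive records
print("epsilon=1.0 :", dp_count(ages, 40, epsilon=1.0))
print("epsilon=0.1 :", dp_count(ages, 40, epsilon=0.1))  # noisier, more private
```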

4. Homomorphic encryption: Homomorphic encryption allows certain computations to be performed directly on encrypted data, so data can remain encrypted at rest, in motion and even in use during training or inference. This ensures that an attacker who intercepts the encrypted data cannot use its characteristics to infer the characteristics of the model or its inputs. A small sketch follows.
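
As a minimal illustration (my own sketch, assuming the python-paillier package `phe`, which implements a partially homomorphic scheme; the values are placeholders):

```python
# Sketch of (partially) homomorphic encryption with python-paillier.
# Assumes `pip install phe`; all values are illustrative placeholders.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# A client encrypts sensitive features before sending them to an untrusted server.
salary, bonus = 52_000, 4_500
enc_salary = public_key.encrypt(salary)
enc_bonus = public_key.encrypt(bonus)

# The server computes on ciphertexts only: addition and scaling by plain constants.
enc_total = enc_salary + enc_bonus
enc_projection = enc_total * 1.03          # e.g. a 3% adjustment

# Only the holder of the private key can decrypt the result.
print(private_key.decrypt(enc_projection))
```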

5. Secure multiparty computation (MPC): This involves two or more parties who do not trust each other. Each party transforms its own input into shares that look like “nonsense” on their own; the shares are fed into a function whose output only makes sense when the correct set of inputs is combined, so an attacker must compromise all the separate inputs to reverse the collective output. A toy sketch follows.
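
As a toy illustration (my own sketch using simple additive secret sharing, not a production MPC protocol), three parties can compute the sum of their private values without any single party or aggregator seeing another party’s input:

```python
# Toy additive secret sharing: each party splits its private value into random
# shares; no single share reveals anything about the underlying input.
import random

MODULUS = 2**61 - 1  # arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split `value` into n random shares that sum to it modulo MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

private_inputs = [23, 58, 11]          # each party's secret value
n = len(private_inputs)

# Every party shares its input with the others; party i keeps column i.
all_shares = [share(v, n) for v in private_inputs]

# Each party locally sums the shares it received; the partial sums are then combined.
partial_sums = [sum(all_shares[p][i] for p in range(n)) % MODULUS for i in range(n)]
print("Joint sum:", sum(partial_sums) % MODULUS)   # 92, without revealing any input
```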

Other controls

Apart from the technical privacy controls illustrated above, it is important that ML/AI developers implement proper data security governance when designing, training and testing their models, so that no stage of the ML development life cycle introduces a privacy risk. ML developers should keep in mind the following good practices:

1. Data minimization

ML/AI developers should only collect the data essential to meet the objectives of the model under development. For instance, the development of a digital identity management model for a government institution does not require the collection of DNA data. A small sketch of this principle follows.
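
One simple way to enforce this (my own sketch; all column names are hypothetical examples, not a real schema) is to whitelist the fields the model actually needs before any data is ingested:

```python
# Sketch: data minimization by whitelisting only the fields the model needs.
import pandas as pd

raw = pd.DataFrame({
    "citizen_id": ["A001", "A002"],
    "full_name": ["Jane Doe", "John Roe"],
    "biometric_hash": ["placeholder-1", "placeholder-2"],
    "dna_profile": ["not needed", "not needed"],        # excluded from collection
    "medical_history": ["not needed", "not needed"],    # excluded from collection
})

REQUIRED_FIELDS = ["citizen_id", "full_name", "biometric_hash"]

training_data = raw[REQUIRED_FIELDS]   # everything else is never stored or used
print(training_data.columns.tolist())
```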

2. Secure data storage and transmission

The first rule of any data governance program is to always encrypt data at rest and in motion using secure algorithms and to protect the encryption keys. Ensure that all data elements are encrypted and that the keys are stored securely to prevent compromise. A brief sketch follows.
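
As a minimal sketch (assuming the `cryptography` package; in practice keys would live in a key management service, never inline in code), symmetric encryption of a record before it is written to disk or sent over the wire might look like this:

```python
# Sketch: symmetric encryption of a record at rest using the cryptography package.
# The key is generated inline only for illustration; store real keys in a KMS/HSM.
from cryptography.fernet import Fernet

key = Fernet.generate_key()
fernet = Fernet(key)

record = b'{"patient_id": "12345", "diagnosis": "..."}'
ciphertext = fernet.encrypt(record)   # what actually gets stored or transmitted

# Only holders of the key can recover the plaintext.
plaintext = fernet.decrypt(ciphertext)
assert plaintext == record
```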

3. Anonymization and pseudonymization

In instances where it is important to feed representative sensitive training data to a model, use techniques like tokenization and hashing to obscure the sensitive information. For example, a medical research institution can use hashed patient IDs instead of actual names in a clinical trial database. Hash functions such as HMAC provide irreversible pseudonymization, which helps produce representative models without using actual sensitive information for training. A short example follows.
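
A small sketch of HMAC-based pseudonymization (the secret key and patient identifiers below are illustrative; the key must be managed separately from the data set):

```python
# Sketch: pseudonymize patient identifiers with HMAC-SHA256.
# The HMAC key is a placeholder; keep the real key outside the data set.
import hashlib
import hmac

PSEUDONYM_KEY = b"replace-with-a-key-from-your-secrets-manager"

def pseudonymize(patient_id: str) -> str:
    """Deterministic, keyed, non-reversible token for a patient identifier."""
    return hmac.new(PSEUDONYM_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

print(pseudonymize("Jane Doe, 1984-03-12"))
print(pseudonymize("John Roe, 1979-11-02"))
```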

4. Secure Deployment

To ensure that the process of developing an AI model is free from compromise, developers should implement secure development practices such as DevOps/DevSecOps and code version control, ensuring that each phase is securely managed and approved. AI model versions and dependencies should be isolated using container technologies such as Docker and orchestration platforms such as Kubernetes, and if the model is hosted in cloud workloads, the relevant cloud security technologies should be used to secure those workloads.

5. Identity & access control

Role-based access control (RBAC), combined with multi-factor authentication (MFA), should be implemented to ensure that threat actors cannot inject adversarial machine learning into any stage of the ML/AI model life cycle.

6. Supply chain cyber risk management

Any ML/AI life cycle that involves acquiring, preparing/cleaning and processing large amounts of data from various sources carries the risk that some part of that pipeline can be compromised by threat actors to inject a nefarious training data set, giving the model characteristics that can later be exploited to leak sensitive information. The pipeline should therefore be secured at all stages of the life cycle, and any API interfaces should be secured as they represent an additional attack surface.

7. Employee Training and Awareness

The human element remains the most vulnerable part of any model, as people can always be compromised to reveal key characteristics of the training data. Train all knowledge workers and engineers involved in the development of a model to protect the sensitive training data elements used in developing ML/AI models.
