Data Privacy in a World of Outsourced Artificial Intelligence

April 27, 2017 | Evan Wright

Artificial intelligence (AI) and deep learning can lead to powerful business insights. Many executives are ready to harness the power of this technology, but one main challenge holds them back: talent. Hiring technical talent for cybersecurity is hard enough in itself; hiring technical talent for AI is a much bigger challenge.

The UK’s National Health Service (NHS) recently faced this problem. Computer vision techniques have demonstrated tremendous results in identifying specific types of illness from scans of a patient’s body. Artificial intelligence has a strong track record of effectively predicting medical conditions such as cancer and heart attacks, along with many other image-based diagnoses.

Medical information is particularly sensitive for medical organizations like the NHS, and it is also among the most lucrative types of personally identifiable information (PII) for cybercriminals. Many freely available AI and machine learning libraries exist, such as Theano, Torch, CNTK, and TensorFlow. Despite the availability of these tools, many organizations like the NHS do not have sufficient access to experts able to run them. Without collaboration with outside experts, many illnesses may go unidentified and people could die. So the NHS* decided to partner with DeepMind, a company acquired by Alphabet/Google. The University of Cambridge and The Economist wrote an article detailing many aspects of the contract.

As a result, DeepMind gets access to 1.6 million medical records and a neat application of its technology, in addition to undisclosed funding. This data includes blood tests, medical diagnostics and historical patient records, as well as even more sensitive data such as HIV diagnoses and prior drug use. The volume matters: in the sub-discipline of machine learning called deep learning, the algorithms are particularly dependent on having a large data corpus.

When an organization is faced with the choice of outsourcing sensitive information to experts, what are its options? Any organization outsourcing information should redact all personally identifiable information, such as names and personal identifiers. Each identifier can instead be replaced by a pseudonym, a unique mapping such as the output of a hash function, where the link between the pseudonym and the PII is held only by the trusted entity (the NHS in this case). Furthermore, semi-sensitive information that would have value to the ML model should be abstracted. For example, geographical location may be a powerful indicator of an illness, but the raw data could be used to reverse-engineer the PII of a given patient. In this case, binning the information so that a little fidelity is lost is an effective trade-off between preserving the AI’s predictive power and protecting patient confidentiality. Grouping specific addresses into zip codes or counties, for instance, may be a good trade-off.
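
The redaction steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the approach used by the NHS or DeepMind: the record fields, the `SECRET_KEY` value, and the helper names `pseudonymize`, `bin_age` and `redact` are all hypothetical. It also assumes a keyed hash (HMAC) rather than a plain hash, since a plain hash of a short identifier can be reversed by simply hashing every possible identifier.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data owner (the trusted entity).
# Whoever holds this key can recompute the pseudonym for a known identifier.
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def bin_age(age: int, width: int = 10) -> str:
    """Coarsen an exact age into a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def redact(record: dict) -> dict:
    """Return a copy for sharing: direct identifiers dropped, details coarsened."""
    return {
        "pseudonym": pseudonymize(record["patient_id"]),
        "zip_code": record["zip_code"],  # street address is dropped, zip code kept
        "age_band": bin_age(record["age"]),
        "diagnosis": record["diagnosis"],
    }


# Hypothetical example record, invented purely for illustration.
example = {
    "name": "Jane Doe",
    "patient_id": "ID-0001",
    "address": "1 High Street",
    "zip_code": "15213",
    "age": 47,
    "diagnosis": "asthma",
}
shared = redact(example)
print(shared["age_band"])  # prints 40-49
```

The name, street address and raw identifier never leave the data owner, while the outsourced party still receives a stable key for joining a patient's records.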

The trade-off between security and predictive power will likely be a challenging problem for data owners. AI is able to combine many weak signals and often reach surprising conclusions. In one study, CMU researchers found that social security numbers were surprisingly predictable, and that algorithms could often predict much of an SSN from information such as date and place of birth. So being able to guarantee that AI can’t reconstruct your PII is an unsolved problem, and likely very dependent on the data. However, best-effort strategies like those outlined above can help mitigate most concerns.
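
One rough way for a data owner to see how much the risk depends on the data is to count how many records share each combination of the remaining attributes. The sketch below is an assumption-laden illustration, not a guarantee of privacy: the function name `smallest_group_size` and the field names are hypothetical, and a combination shared by only one record simply flags a patient who may be easy to single out.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    values for the given attributes. A result of 1 means at least one record
    is unique on those attributes."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    return min(Counter(keys).values())


# Hypothetical redacted records, invented purely for illustration.
records = [
    {"zip_code": "15213", "age_band": "40-49", "diagnosis": "A"},
    {"zip_code": "15213", "age_band": "40-49", "diagnosis": "B"},
    {"zip_code": "15213", "age_band": "30-39", "diagnosis": "C"},
]

print(smallest_group_size(records, ["zip_code"]))              # prints 3
print(smallest_group_size(records, ["zip_code", "age_band"]))  # prints 1
```

Each added attribute can shrink the groups, which is one reason coarser bins tend to protect patients better at some cost in fidelity.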

In the future this issue may change significantly. Recent developments in federated learning may allow models to be trained while the data stays on premises. A related technology, homomorphic encryption, has been in the works for far longer. In homomorphic encryption, computations occur on encrypted data without ever having to decrypt it, which would significantly reduce the security concern. We are still years away from technology solving this problem directly. In the interim, the promised benefits of AI are too great for most organizations to wait.
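
To make the idea of computing on encrypted data concrete, here is a toy sketch of the textbook Paillier cryptosystem, which is only additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. It is an illustration under loud assumptions, not a usable implementation. The primes are tiny, the function names are hypothetical, and real systems rely on vetted libraries, far larger keys, and (for general computation) fully homomorphic schemes.

```python
import math
import random

# Toy primes for illustration only; far too small for real security.
P, Q = 1009, 1013
N = P * Q
N_SQ = N * N
G = N + 1
# Private key: lambda = lcm(P - 1, Q - 1), and mu = lambda^-1 mod N.
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAMBDA, -1, N)  # modular inverse, requires Python 3.8+


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < N under the public key (N, G)."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c: int) -> int:
    """Decrypt a ciphertext using the private key (LAMBDA, MU)."""
    x = pow(c, LAMBDA, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Combine two ciphertexts so the result decrypts to the sum (mod N)."""
    return (c1 * c2) % N_SQ


# The party doing the addition never sees 120 or 35 in the clear.
total = add_encrypted(encrypt(120), encrypt(35))
print(decrypt(total))  # prints 155
```

Only the key holder can decrypt the total, so an outside party could in principle aggregate values it is never able to read.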

At Anomali, we deal with sensitive information regularly, as we help many organizations around the world winnow down data from across the enterprise and focus on the applicable security threats. We address privacy issues with on-premises deployments such as Anomali Enterprise, or with very tight access controls and data isolation, as in our Trusted Circles feature for sharing threat intelligence in our Threat Intelligence Platform, ThreatStream.

*The agreement was signed by the Royal Free NHS Trust, a small subordinate component of the much larger NHS. The Royal Free Trust comprises three hospitals in London.

About the Author

Evan Wright

Evan Wright is a principal data scientist at Anomali, where he focuses on applications of machine learning to threat intelligence. Before Anomali, he was a network security analyst at the CERT Coordination Center and a network administrator in North Carolina. Evan has supported customers in areas such as IPv6 security, ultra-large scale network monitoring, malicious network traffic detection, intelligence fusion, and other cybersecurity applications of machine learning. He has advised seventeen security operations centers in government and private industry. Evan holds an MS from Carnegie Mellon University, a BS from East Carolina University, a CCNP and six other IT certifications. Twitter: @evanwright
