Assessing Google's data collaboration tool 'Private Join and Compute'
Much has been written on the need for businesses to make better use of their customer data while protecting consumer privacy. The tech giants in particular have come under fire for not doing enough, given the volume of data they collect. Google, among others, has recently begun promoting data privacy across its various offerings. The most recent result is the release of its open source ‘Private Join and Compute’ tool, which promises to help businesses “collaborate with data in a privacy safe way”.
Privacy-safe data collaboration is a topic we have been advocating for quite some time, so we’re interested to see different approaches being developed. While we are excited that platforms like Google are exploring privacy-safe data collaboration, we have found some areas that deserve closer scrutiny from those following the space.
The first area of interest relates to the functionality that ‘Private Join and Compute’ delivers. Google itself acknowledges how much a practical deployment has to cover:
“Practical deployment of secure computation techniques involves solving technical, business, organizational, and human challenges which are often intertwined in the context of real applications. Thus, a viable solution cannot target just one subset of these areas of concern, but needs to address all of them.”
‘Private Join and Compute’ is a multi-party computation tool, meaning it enables two or more parties to jointly analyze sensitive data without revealing their underlying records to each other. Google’s focus is on a very specific use case that it dubs “The Intersection-Sum problem”. At a very basic level, this allows two parties to calculate the sum of a numerical value (for example, cash spend) across the records that appear in both of their data sets.
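To make the Intersection-Sum problem concrete, here is a minimal Python sketch of what the protocol is meant to output, computed in the clear with no encryption at all. The function name and the sample identifiers and spend values are hypothetical and chosen only for illustration. The point of the real protocol is to reach the same two numbers without either party seeing the other's records.

```python
def intersection_sum(ids_a, spend_b):
    """Return (size of intersection, sum of B's values for IDs that A also holds).

    ids_a:   iterable of identifiers held by party A (hypothetical sample data)
    spend_b: dict mapping party B's identifiers to a numeric value
    """
    common = set(ids_a) & set(spend_b)
    return len(common), sum(spend_b[identifier] for identifier in common)


# Party A knows who saw an ad; party B knows who spent money in store.
party_a_ids = ["alice", "bob", "carol"]
party_b_spend = {"bob": 20, "carol": 35, "dave": 50}

size, total = intersection_sum(party_a_ids, party_b_spend)
print(size, total)
```

Only the intersection size and the aggregate sum are returned; nothing about which identifiers matched is part of the intended output.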
‘Private Join and Compute’ uses two encryption techniques: “commutative encryption”, which allows data to be encrypted with multiple keys regardless of the order in which the keys are applied, and “homomorphic encryption”, which enables calculations to be conducted on encrypted data. However, because of these encryption techniques, it isn’t possible to gain any insight beyond a simple sum, count and average. In the context of the advertising industry, and more widely in the public policy use cases identified by Google, this is a significant limitation. One example is the inability to create data segments based on demographic or interest categories, as these are not numeric fields.
While this is a functional limitation, a bigger concern our team identified is the lack of substantial privacy and security controls.
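The two properties described above can be illustrated with toy Python code. This is a sketch under stated assumptions, not Google's implementation: the commutative part uses plain modular exponentiation, the homomorphic part uses a textbook Paillier-style scheme (the article does not say which scheme Google uses), and all parameters are far too small to be secure.

```python
import hashlib
import math
import secrets

# --- Commutative encryption (toy): E_k(x) = x^k mod P ---
P = 2**127 - 1  # a Mersenne prime; illustrative only, not a secure parameter


def hash_to_group(identifier):
    """Map an identifier to a non-zero integer modulo P."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P or 1


def commutative_encrypt(value, key):
    return pow(value, key, P)


key_a = secrets.randbelow(P - 3) + 2  # party A's secret exponent
key_b = secrets.randbelow(P - 3) + 2  # party B's secret exponent
h = hash_to_group("bob@example.com")  # hypothetical identifier

a_then_b = commutative_encrypt(commutative_encrypt(h, key_a), key_b)
b_then_a = commutative_encrypt(commutative_encrypt(h, key_b), key_a)
order_independent = a_then_b == b_then_a  # same result in either key order

# --- Additively homomorphic encryption (toy Paillier-style scheme) ---
p, q = 2**31 - 1, 2**61 - 1  # two known primes; far too small for real use
n = p * q
n_sq = n * n
phi = (p - 1) * (q - 1)
mu = pow(phi, -1, n)  # modular inverse (Python 3.8+)


def paillier_encrypt(m):
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) * pow(r, n, n_sq) % n_sq


def paillier_decrypt(c):
    return (pow(c, phi, n_sq) - 1) // n * mu % n


# Multiplying ciphertexts adds the hidden plaintexts.
encrypted_total = paillier_encrypt(20) * paillier_encrypt(35) % n_sq
decrypted_total = paillier_decrypt(encrypted_total)
```

Because the doubly encrypted identifier is the same whichever party encrypts first, two parties can compare encrypted identifiers to find matches. Because ciphertexts can only be combined additively in this sketch, the result that comes out is a sum, which is consistent with the limitation described above.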
Data privacy and security
Any time consumer personal data is being utilized for research and analysis, the security of the data and the privacy of individuals must be protected. As an example, one of the use cases Google suggested for this protocol is health care. While this is a cause that would benefit greatly from the ability to run analysis across a number of data sources, health data is incredibly sensitive, and so the lack of comprehensive privacy and security controls within Google’s protocol is of concern.
From a security point of view, a fundamental part of Google’s functionality is the transfer of data between parties. This movement of data creates significant security risks for both parties because, once the data has been shared, it is impossible to enforce the protocol or to determine any additional usage of the data. Google summarises this:
"Our protocol has security against honest-but-curious adversaries. This means that as long as both participants follow the protocol honestly, neither will learn more than the size of the intersection and the intersection-sum. However, if a participant deviates from the protocol, it is possible they could learn more than the prescribed information"
In this statement, Google also identifies a privacy issue: through analysis of the intersection, it is possible that additional insights could be gleaned. This becomes a particular concern when it relates to the ability to identify an individual, something Google also states is possible as its statement continues:
“If a participant deviates from the protocol, it is possible they could learn more than the prescribed information. For example, they could learn the specific identifiers in the intersection.”
The ability to identify an individual within the intersection is a fundamental privacy concern when running analysis across sensitive data. One way identification could be achieved is by repeatedly re-running the analysis with slightly varied inputs until a single individual either appears in or disappears from the intersection. Using the example of health data, it would therefore be possible to identify an individual who suffered from a certain medical condition. This can be addressed with the introduction of differential privacy, and the application of redaction thresholds, rate limits and noise to protect individuals.
A protocol is an important first step, but it requires a comprehensive system to be built around it. It is possible for Google to do this themselves, or for other developers to do so using the open source code. However, this would not resolve the fundamental limitations of the protocol identified above.
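The differencing risk, and two of the mitigations named above, can be sketched in a few lines of Python. Everything here is hypothetical: the identifiers, the threshold of 10, the epsilon value and the function names are illustrative assumptions and do not describe any vendor's implementation. Rate limiting is not shown.

```python
import random


def exact_intersection_count(ids_a, ids_b):
    return len(set(ids_a) & set(ids_b))


def protected_count(true_count, min_count=10, epsilon=1.0, rng=random):
    """Redact small results, otherwise add Laplace noise with scale 1/epsilon."""
    if true_count < min_count:
        return None  # redaction threshold: result is too small to release
    # The difference of two exponential draws with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Party B holds people with a medical condition; party A probes with two queries
# that differ by exactly one person.
patients = {f"patient-{i}" for i in range(50)}
query = {f"patient-{i}" for i in range(40)}
target = "patient-7"

with_target = exact_intersection_count(query, patients)
without_target = exact_intersection_count(query - {target}, patients)
leak = with_target - without_target  # a difference of 1 reveals the target is in B's data

# With noise, the two released answers no longer differ by a clean 0 or 1.
noisy_with = protected_count(with_target)
noisy_without = protected_count(without_target)
```

With exact answers, two queries are enough to learn whether one person is in the other party's data. The threshold blocks very small intersections outright, and the noise blurs the difference between neighbouring queries, which is why a cap on the number of queries is usually paired with it.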
InfoSum’s comprehensive approach
Our approach goes beyond the Google protocol and provides a scalable enterprise-ready solution that delivers commercial value and opportunity across most sectors today.
InfoSum’s pioneering work in the area of federated analytics takes a fundamentally different approach from Google’s: we don’t transfer data between parties. Users of our Platform upload their data into their own private server, known as a Bunker. At this stage, the data goes through a normalization process that pseudonymizes it and maps all data categories to our global schema. It is this mapping that enables analysis to be conducted across any number of datasets and to go beyond the “sum” functionality available in ‘Private Join and Compute’, delivering richer, actionable insights. Any direct identifiers are converted into a key and the originals are irreversibly deleted from the Bunker.
Rather than combining data to conduct analysis, our technology keeps Bunkers completely isolated. A mathematical model moves between each Bunker to generate aggregate statistical results based on the user-defined query. To further protect the privacy of consumers, we have built various differential privacy concepts into our platform that ensure no single individual could ever be re-identified, removing the need to rely on your collaborator following the protocol “honestly”.
We’ll be excited to see how businesses look to implement Google’s ‘Private Join and Compute’ functionality into their solutions. Privacy-safe data collaboration provides great opportunities across many industries, ranging from advertising and retail to health and government. However, we urge any businesses looking to utilize Google’s open source code to consider the limitations outlined above.
If you would like to discuss how InfoSum can provide you with a privacy-safe approach to connecting multiple data sources for insight and activation, contact us here.
On deploying secure computing commercially - https://eprint.iacr.org/2019/723.pdf
Helping organizations do more without collecting more data - https://security.googleblog.com/2019/06/helping-organizations-do-more-without-collecting-more-data.html
Security and Privacy Caveats - https://github.com/google/private-join-and-compute/blob/master/README.md#security-model