Demystifying Non-Movement of Data

Nick Halstead
Tuesday, January 18, 2022

When conceiving InfoSum, my goal was to empower companies of any size to unlock the full potential of their data by enabling safe and secure collaboration with other companies.

Before InfoSum, I had witnessed the largest platforms building businesses on top of businesses through the value of their customer data. It was evident to me that for other businesses to offer those same data-driven experiences, they would need access to more data. At the time, most data technology stacks were fraught with technical barriers, high costs, and massive risk of data leakage and exposure. 

More recently, the promise of data clean room technology has filled a growing need to safely connect, analyze, and activate data sets from multiple organizations without exposure or privacy risk. But many instances of this “clean” room technology still require data to move or be shared. 

But what if companies could collaborate without actually sharing their data? What if the world’s data could be connected, without it ever moving?

This question led to the development of InfoSum’s data collaboration platform, which uses next-generation clean room technology powered by non-movement of data. 

InfoSum’s non-movement of data approach has five underlying technologies:

  1. Bunkers
  2. Permissions Control
  3. Private-Set Intersection
  4. Decentralized Edge Processing
  5. Differential Privacy Techniques

By bringing these technologies together, unlimited data sources can be connected to create a first-of-its-kind data network to unlock rich and actionable consumer intelligence, without a single piece of data being shared. This network of data clean rooms places consumer privacy at the center to ensure trust, transparency, and responsibility for all data-driven use cases. 

Let’s take a look at each of those five technologies. 

Bunkers

Challenge: Enable collaboration regardless of data structure or taxonomy

Bunkers are possibly the technology we’re most synonymous with. Bunkers are standalone, private cloud instances. Each Bunker is unique to a single company, and only the data owner ever has access to the Bunker. 

Companies upload a version of their first-party data into their Bunker. During upload, it goes through a normalization process that maps the data across our global schema. This process not only standardizes data representations but also ensures that no personal data (PII) and nothing that could be used to identify an individual remains in the Bunker. 
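The normalization step can be pictured with a short sketch. It assumes a hypothetical column mapping (`COLUMN_MAP`) and uses a keyed hash to stand in for the removal of direct identifiers. None of these names come from the InfoSum platform, and the real process is not described here, so treat this as an illustration of the idea rather than the actual mechanism.

```python
import hashlib
import hmac

# Hypothetical mapping from one company's column names to a shared schema.
COLUMN_MAP = {"Email Address": "email", "AgeBand": "age_band", "Region": "region"}
# Columns treated as direct identifiers in this sketch (an assumption).
IDENTIFIER_COLUMNS = {"email"}


def normalize_record(raw: dict, secret_key: bytes) -> dict:
    """Map a raw record onto the shared schema and replace identifiers with keyed hashes."""
    out = {}
    for source_name, value in raw.items():
        target = COLUMN_MAP.get(source_name)
        if target is None:
            continue  # drop columns the shared schema does not know about
        cleaned = str(value).strip().lower()  # standardize the representation
        if target in IDENTIFIER_COLUMNS:
            # Keep only a keyed digest, never the raw identifier.
            digest = hmac.new(secret_key, cleaned.encode("utf-8"), hashlib.sha256)
            out[target + "_key"] = digest.hexdigest()
        else:
            out[target] = cleaned
    return out


record = normalize_record(
    {"Email Address": " Alice@Example.com ", "AgeBand": "25-34", "Phone": "555-0100"},
    secret_key=b"demo-key",
)
```

In this sketch the unmapped `Phone` column is dropped, the age band is kept as an attribute, and the email survives only as a digest that can be matched but not read.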

Bunkers are completely agnostic, meaning they work within a company’s existing data infrastructure irrespective of the data warehouse, CRM, DMP, or CDP in use. Not only does this accelerate the ability to create data-driven experiences, it also reduces the cost of implementation.

Permissions Control

Challenge: Enable companies to shape and control every collaboration

To provide companies the flexibility to control every aspect of their collaboration, we have developed a system of permission controls. These permissions allow a company to grant another company the ability to run analysis against their Bunker. The Bunker owner has full control over the level of analysis that can be conducted (down to the individual attributes), and to what degree their Bunker can be used in conjunction with other Bunkers.

Importantly, these permissions never grant access to the underlying data itself and can be retracted at any time. When permission is removed, the other party instantly loses the ability to analyze the data in the Bunker, and because no data has been transferred, the data owner has not lost any control.
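A minimal sketch of attribute-level permissions, assuming a hypothetical in-memory `Bunker` class. The class and method names are illustrative, not part of the InfoSum platform. It shows only the two properties described above: grants are scoped to individual attributes, and revoking a grant takes effect immediately.

```python
from dataclasses import dataclass, field


@dataclass
class Bunker:
    owner: str
    # grantee name -> set of attribute names that grantee may analyze
    grants: dict = field(default_factory=dict)

    def grant(self, grantee: str, attributes: set) -> None:
        """Allow `grantee` to run analysis over the listed attributes only."""
        self.grants[grantee] = set(attributes)

    def revoke(self, grantee: str) -> None:
        """Withdraw all permissions from `grantee`; nothing needs to be recalled."""
        self.grants.pop(grantee, None)

    def can_query(self, requester: str, attributes: set) -> bool:
        """True only if every requested attribute has been granted."""
        if requester == self.owner:
            return True
        allowed = self.grants.get(requester, set())
        return bool(attributes) and set(attributes) <= allowed


bunker = Bunker(owner="brand")
bunker.grant("publisher", {"age_band", "region"})
allowed_before = bunker.can_query("publisher", {"region"})
denied_attribute = bunker.can_query("publisher", {"region", "income"})
bunker.revoke("publisher")
allowed_after = bunker.can_query("publisher", {"region"})
```

Because the check happens at query time against the owner’s current grants, revocation requires no cooperation from the other party.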

These permission controls enable companies to define the rules of every collaboration and build agile data-driven relationships.

Private-Set Intersection

Challenge: Deliver rich analysis, without transferring knowledge

All analysis within our platform is driven by a secure multi-party computation technique known as Private-Set Intersection (PSI). The purpose of PSI is to allow multiple datasets to be compared to determine their intersection. Put simply, it enables companies to test whether two or more encrypted data sets share any common data points, for example, whether Company 1 has any customers in common with Company 2. Importantly, this computation technique never exposes any information outside of the intersection to either party.

Here is how it works at its most basic level, with two companies.

In this scenario, both Company 1 and Company 2 learn only that they have a 50% intersection. They learn nothing about the other 50%.
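To make the idea concrete, here is a toy sketch of one well-known PSI construction, based on commutative blinding in the Diffie-Hellman style. It is not a description of InfoSum’s protocol. The prime is far too small for real use, both companies are simulated in one process, and all names are illustrative. Each company blinds its hashed identifiers with its own secret exponent, the other company blinds them again, and only the doubly blinded values are compared.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, chosen only for readability


def hash_to_group(item: str) -> int:
    """Hash an identifier to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret() -> int:
    """Pick a secret exponent coprime to P - 1 so that blinding is one-to-one."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(items, secret: int) -> set:
    """First blinding: hash each item, then raise it to the owner's secret."""
    return {pow(hash_to_group(item), secret, P) for item in items}


def reblind(blinded, secret: int) -> set:
    """Second blinding, applied by the other party to values it cannot read."""
    return {pow(value, secret, P) for value in blinded}


def intersection_size(items_1, items_2) -> int:
    k1, k2 = new_secret(), new_secret()      # each secret stays with its company
    double_1 = reblind(blind(items_1, k1), k2)  # H(x)^(k1*k2)
    double_2 = reblind(blind(items_2, k2), k1)  # H(y)^(k2*k1)
    return len(double_1 & double_2)          # equal only when x == y


company_1 = {"alice@example.com", "bob@example.com", "carol@example.com", "dave@example.com"}
company_2 = {"carol@example.com", "dave@example.com", "erin@example.com", "frank@example.com"}
overlap = intersection_size(company_1, company_2)
share = overlap / len(company_1)
```

With these made-up inputs, two of the four customers overlap, so each side learns a 50% intersection and nothing about the identifiers that did not match. A production system would use a vetted cryptographic library and properly sized parameters.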

Decentralized Edge Processing

Challenge: Eliminate the need for centralized data processing

Solutions touting decentralization often fall short when it comes to the processing of data. Many of these solutions still require processing to take place in a centralized location, therefore breaking the principle of non-movement of data.

To resolve this, we utilize Edge Processing. This means that all data processing takes place where the data itself is located, meaning within the Bunker. When a query is conducted against two or more Bunkers, the first step is to generate a mathematical model of the individuals in the first Bunker that match the query criteria. This mathematical model is anonymous and contains no personal data (PII). It is then the mathematical model that moves from one Bunker to another, testing itself to determine if there are any common customers, creating the PSI.

In the below diagram, a platform user is querying the intersection between their Bunker (Bunker A) and a Bunker they have permission to query (Bunker B).

Diagram: How querying works
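The query flow can be sketched as follows. The article does not say what the mathematical model is, so this sketch uses a Bloom filter purely as a stand-in, and every name in it is an assumption. A Bloom filter is a compact bit array that supports membership tests without containing the original values. It can report occasional false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions per item from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_model(matching_keys) -> BloomFilter:
    """Runs inside Bunker A: summarize the keys that match the query criteria."""
    model = BloomFilter()
    for key in matching_keys:
        model.add(key)
    return model


def count_matches(model: BloomFilter, local_keys) -> int:
    """Runs inside Bunker B: test local keys against the model that was sent over."""
    return sum(1 for key in local_keys if model.might_contain(key))


bunker_a_keys = ["k1", "k2", "k3"]        # hypothetical pseudonymous keys in Bunker A
bunker_b_keys = ["k2", "k3", "k8", "k9"]  # hypothetical keys in Bunker B
model = build_model(bunker_a_keys)        # only this object leaves Bunker A
matches = count_matches(model, bunker_b_keys)
```

Only the filter travels between the two Bunkers, and the count is computed where Bunker B’s data lives, which mirrors the processing order described above.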

Differential Privacy Techniques

Challenge: Ensure that no single individual can ever be reidentified

It was vitally important to me that we not only prioritize the privacy of consumers but that we enhance it with our technology. As any data scientist will know, it is possible that by interrogating data closely and applying finer and finer querying criteria, you can potentially identify a single individual within a data set. This simply wasn’t acceptable at InfoSum. 

This is why we have pioneered the use of various differential privacy techniques. The three key techniques we utilize are:

  1. Redaction Thresholds: Where a true query result is below a certain threshold, we do not return a result. This ensures that a small result can’t be used to identify a set of individuals. 
  2. Rounding: We round all results down to the nearest 100 by default. This ensures that subtle changes in the query criteria that fluctuate the results slightly can’t inadvertently expose a single individual. 
  3. Noise: We insert a small amount of purposeful noise, in the form of slight alterations to the aggregate counts. The noise helps mask any identifiable characteristics of individuals. It is small enough to not affect the accuracy of the results but large enough to ensure the protection of personal information.

Here is an example of how differential privacy techniques are applied:

True Result | > Redaction Threshold? | Noise         | Rounding      | Result Delivered
165,325     | Yes                    | +1% = 166,982 | -82 = 166,900 | 166,900
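The three steps in the example above can be sketched as a small pipeline. The threshold value, the use of uniform noise, and the function names are assumptions made for illustration. Only the round-down-to-100 default and the order of the steps (threshold check, noise, rounding) come from the example.

```python
import random
from typing import Optional


def round_down(value: float, step: int = 100) -> int:
    """Round down to the nearest multiple of `step`."""
    return int(value // step) * step


def protect_count(
    true_count: int,
    threshold: int = 100,          # assumed value; the article does not give one
    noise_fraction: float = 0.01,  # up to +/-1% noise, an assumed distribution
    step: int = 100,
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Apply redaction, then noise, then rounding to an aggregate count."""
    if true_count < threshold:
        return None  # redacted: no result is returned at all
    if rng is None:
        rng = random.Random()
    noisy = true_count * (1 + rng.uniform(-noise_fraction, noise_fraction))
    return round_down(noisy, step)


# The rounding step from the example: 166,982 becomes 166,900.
rounded = round_down(166_982)
# A small audience is redacted rather than reported.
redacted = protect_count(40)
# A large audience comes back as a noisy multiple of 100.
delivered = protect_count(165_325, rng=random.Random(0))
```

Applying noise before rounding means the delivered figure is always a multiple of the rounding step, so a change of a few individuals in the true count is not visible in the result.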

The use of differential privacy techniques does not significantly impact the results. Instead, it enables greater insight and analysis by removing barriers to entry for datasets that were previously restricted due to privacy or data governance concerns. Additionally, while the results surfaced in the InfoSum platform are at an aggregate level, the underlying datasets are maintained at the highest accuracy and richness using deterministic identity resolution.

With InfoSum, these privacy-preserving thresholds are on by default, and while platform users can control the level of obfuscation or redaction, they can never turn them off. Companies using the InfoSum platform are guaranteed protection and security by a privacy-by-design infrastructure.

Using non-movement of data to power the future of data collaboration

The result is an end-to-end data collaboration platform that removes the need for any data sharing between companies. 

Using non-movement of data powered by these five underlying technologies, InfoSum provides end-to-end data collaboration without risk or exposure. InfoSum makes it possible to create, in minutes, custom data clean rooms built to the exact requirements of each party. By removing the heavy implementation and integration burden, these data clean rooms empower companies to protect and enrich their first-party data, plan and execute data-driven experiences at scale, and optimize the performance of their strategies with granular measurement insights.

Imagine an ecosystem of brands, media owners, platforms, and data providers able to share valuable intelligence about their customers and their business without worrying about data leakage, compliance, or misuse. An entirely decentralized network of data clean rooms that require absolutely no sharing or movement of data. All data-driven use cases from identity resolution to personalized customer experiences are unlocked from a single access point to a network of data-rich companies, all while prioritizing consumer privacy, and ensuring each company retains 100% control of its data at all times. 

These networks are the manifestation of my original vision for InfoSum: the world’s data connected and powering incredible customer experiences, without sharing a single bit of data.
