Balancing Privacy and Machine Learning


If users don’t know the importance of their data and how to exercise their rights, we’re doomed

Darren Redfern, CTO, skritswap

Machine learning is a science that depends on algorithms. An algorithm gives the computer the exact steps needed to use data to solve a problem. The computer collects prior solutions and uses them to predict future ones, suggesting a result based on previous experience.

In most cases, these systems learn from the actions of many users in order to serve the larger user community. What if those actions contain information that is private or confidential? Will the future outputs of the algorithms expose sensitive information? What can we do about this?

The risk level depends on circumstances, including:

  • What type of data is being used?
  • Where did that data come from?
  • How detailed is the data in the application and its output?

In some machine-learning applications, like image recognition systems, the privacy risks are low to nonexistent. If a specific person contributes training data by labelling the content of an image, that data cannot be traced back to them by examining the algorithm's output.

However, if the data collected is textual (written or spoken), the privacy issues are more complex. Take, for example, a system that analyzes people's emails to develop automated responses. It would need to ensure that sensitive or personally identifiable information does not pass through into its outputs. At the same time, that very kind of personalization might be desirable in suggestions offered back to the user the system learned from.

So, how do we balance the benefit to the larger community with the privacy of the individuals in that community? The answer lies in combining approaches:

  • Explicit Opt-Outs

Tell users that their data is being used to create ML models for a larger audience, and show them how to keep specific data out of the results. In our email example above, users could mark all or part of a message as private so that it won't be used in models offered to other users.
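
A minimal sketch of what honoring such opt-outs could look like at training time appears below. The Email structure, its private_spans field, and the character offsets are hypothetical stand-ins for however a real system records user markings.

    # A sketch of honoring explicit opt-outs before training. The Email
    # structure and its fields are hypothetical; a real system would read
    # these flags from its own data model.

    from dataclasses import dataclass, field


    @dataclass
    class Email:
        body: str
        opted_out: bool = False  # user excluded the whole message
        private_spans: list = field(default_factory=list)  # (start, end) offsets marked private


    def redact_for_training(email):
        """Return training-safe text, or None if the user opted out entirely."""
        if email.opted_out:
            return None
        text = email.body
        # Remove user-marked spans from the end so earlier offsets stay valid.
        for start, end in sorted(email.private_spans, reverse=True):
            text = text[:start] + text[end:]
        return text


    emails = [
        Email("Lunch at noon? My SIN is 123-456-789.", private_spans=[(15, 38)]),
        Email("Quarterly numbers attached.", opted_out=True),
    ]
    corpus = [t for t in (redact_for_training(e) for e in emails) if t]
    print(corpus)  # ['Lunch at noon? '] -- only non-private content remains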

  • Data Filtering

Don't leave it entirely to users to protect their sensitive information. ML systems can be trained to identify data that might be considered too private for general use. This applies ML both to the business purpose and to making the system itself more responsible.
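
The sketch below illustrates the idea with simple pattern matching; in practice the filter would be a trained model (or a dedicated PII-detection library such as Presidio), and the patterns shown are assumptions for illustration.

    # A sketch of automated data filtering. A production system would use a
    # trained classifier in place of the simple regular expressions that
    # stand in for a model here.

    import re

    # Hypothetical patterns for two common kinds of identifying data.
    PII_PATTERNS = [
        re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),        # email addresses
        re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # phone-like numbers
    ]


    def looks_sensitive(text):
        """Flag text the filter considers too private for general training use."""
        return any(p.search(text) for p in PII_PATTERNS)


    candidates = [
        "Thanks, see you Tuesday!",
        "Call me at 416-555-0142 before the demo.",
    ]
    print([t for t in candidates if not looks_sensitive(t)])
    # ['Thanks, see you Tuesday!']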

  • Granularity Control

Those who build these algorithms must pay attention to granularity, the many small and distinct parts of how they process and output information. Doing so can minimize the risks. An algorithm might, for example, analyze the structure and meaning of complete sentences while emitting its suggestions at a different level (phrase or word); whole sentences can then be built up from these smaller units.
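
Here is a minimal sketch of that idea, with an assumed four-word cap on suggestion length and plain whitespace tokenization standing in for real linguistic analysis.

    # A sketch of granularity control: analysis happens over whole sentences,
    # but suggestions are emitted only as short phrases, so no single output
    # reproduces a complete (possibly identifying) sentence. The four-word
    # cap and whitespace tokenization are assumptions for illustration.

    MAX_SUGGESTION_WORDS = 4  # hypothetical cap on output granularity


    def phrase_suggestions(sentence):
        """Break an analyzed sentence into phrase-sized candidate suggestions."""
        words = sentence.split()
        return [
            " ".join(words[i:i + MAX_SUGGESTION_WORDS])
            for i in range(0, len(words), MAX_SUGGESTION_WORDS)
        ]


    learned = "I will send the signed contract to you tomorrow morning"
    print(phrase_suggestions(learned))
    # ['I will send the', 'signed contract to you', 'tomorrow morning']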

The issue of data privacy in machine learning runs deeper than the examples examined here. Each company using ML needs to design its own approach, one that maximizes automation while minimizing the exposure of personal information. The confidence clients can have in the results is worth the effort.