This document discusses privacy-aware artificial intelligence and techniques such as split learning and federated learning. It notes the tension between utility and privacy in AI systems and proposes approaches such as differential privacy, homomorphic encryption, and sharing wisdom rather than raw data to build private AI. Split learning and federated learning allow models to be trained from distributed data sources without aggregating private information. The goal is to capture precise data to learn and act while respecting privacy, using techniques that train models on decentralized and anonymized data.
22. Split Learning
Master AI from Wisdom
Smashed Wisdom
[Gupta 2017, MIT]
[Slide: chest X-rays from hospitals 1, 2, 3 … 100; a medical intern takes notes for every patient]
https://splitlearning.github.io/
Distributed deep learning and inference without sharing raw data
MIT Alliance for Distributed and Private Machine Learning
Abstract: Friction in data sharing is a major challenge for large-scale machine learning. Recently, techniques such as Federated Learning, Differential Privacy and Split Learning have emerged to address siloed and unstructured data, the privacy and regulation of data sharing, and incentive models for data-transparent ecosystems. Split learning is a new technique developed at the MIT Media Lab’s Camera Culture group that allows participating entities to train machine learning models without sharing any raw data.
New Program: MIT Alliance for Distributed and Private Machine Learning
The program will explore the main challenges in data friction that hamper the capture, analysis and deployment of AI technologies. The challenges include siloed and unstructured data, the privacy and regulation of data sharing, and incentive models for data-transparent ecosystems. The research program will study automated machine learning (AutoML), privacy-preserving machine learning (PrivateML), and intrinsic as well as extrinsic data valuation (Data Markets). By working with a stakeholder and innovator network, we aim to create a standard for data-transparent ecosystems that can simultaneously address the privacy and utility of data. Our broad focus will be on key technologies such as Differential Privacy, Federated Learning and Split Learning.
Program members will meet four times a year, publish case studies of AI on siloed data, develop a curated GitHub archive, and engage in privacy-aware data-sharing protocol discussions working toward a data exchange standard. We expect this integrated program to lead to many publications, the training of talent, and new technologies and standards.
MIT Media Lab consortium members can join the alliance as special interest group (SIG) members. Non-member companies, startups and non-profits can join via an undirected research gift. To learn more about the program, please contact vepakom(at)mit.edu
Burning Man, the vibrant, colorful art festival in Nevada. If you walk up, there is a booth that says, "Talk to God."
Now, let's not get religious but curious, right? Would you walk up and pick up the phone and see what God has to say? Well, two years ago, if you had gone and picked up the phone, the voice of God on the other side had a very strong, thick Indian accent. And that was me. So I spent days just a couple of hundred meters from this old-style telephone booth, listening to folks. And amazingly, you know, fellow Americans shared and poured their hearts out, talking about their dreams, but often about their challenges and frustrations, and they thought somehow there was someone who could see everything, could talk to everybody involved, and maybe could resolve things and provide them guidance.
This was a very moving experience for me as a human being, but also as a scientist. Why did they trust me with all their deepest secrets? And when I came back to MIT, my team and I thought: this is what we need for society to move forward. We need a trusted, impartial, honest broker who has an all-seeing, all-knowing view of what's going on, not through a kind of Big Brother or villain system, but maybe through a new-fangled AI that's benevolent. We look at, you know, the Eye of Providence on our dollar bill, and if you believe Nicolas Cage in National Treasure, it's all-seeing and all-knowing, and frankly it is not going well. We don't trust most of these systems.
Who here loves Google Maps, the reds and the greens for the traffic? I mean, isn't it amazing? We just casually give away the privacy of our GPS location and we get so much utility out of it. Imagine if you could do the same thing for everything else. If there were a treatment for diabetes or for another challenging disease, and millions of people worldwide would just share their health data and their treatments, we could probably solve some of these problems overnight. So we have this dichotomy: either we get privacy or we get utility. And the question for us is, can we create a privacy-preserving AI that achieves both?
Now, of course we are going to say this is nearly impossible, because we care about consent, we care about regulations, and of course about our trade secrets. It's just too private, too confidential to share all this information. And even if you can share it, the quality is going to be horrible. You know, Google is great at collecting data, but who's going to put together all this information, and what's the incentive for people to do this? But you'll agree with me that if we can solve this privacy-utility trade-off, it will transform not only our health and our transportation, but maybe our democracy. And some of us might say, I really have nothing to hide, so just as I give away my location data, I'm happy to give away everything else. But sometimes it's not just about you; it's about others around you as well.
The challenge for us when we hear a word like privacy is that it's very egocentric. We understand issues like identity fraud or job discrimination, but we don't really tie that to an Orwellian, all-seeing system out there. At the individual level, we understand that if you visit a mental health facility, maybe this Orwellian AI could lead to job discrimination. But it also matters at the organizational and even geopolitical level. Imagine if a large company could buy just the location data of all the employees of a smaller company and figure out all their plans; a lot of trade secrets would spill out. And even at the geopolitical level: Strava (who here uses Strava?), a beautiful app to keep track of your outdoor activities, decided to release a heat map that, in aggregate and in an anonymized way, shows the trails of every individual using Strava for running, walking and so on.
Well, it turns out it's mostly used by Westerners. So the forward bases of the U.S. Army in Syria and Niger were completely exposed, because only the U.S. soldiers there were using the Strava maps. And in Taiwan, the locations of very secret defense facilities were exposed by looking at those trails. And we're not even talking about the possibility of a rogue entity just buying the raw data from a Strava employee and figuring out, for people working at those military facilities, what their traces are as they go from one location to the next. So it's kind of scary. Privacy is not just about us; it's about everybody around us, and also about the national and geopolitical level.
The obvious reaction would be to look to regulation, but most of the world actually lives in an unregulated way. The app that you download from some strange country and some strange app developer is often not regulated where you are. And even where regulation exists, it's not going to create this benevolent, learned AI that can actually solve societal problems by aggregation. So there's some good news. For each of these problems, my team at MIT, and many researchers all over the world, are developing techniques that can preserve privacy and still make data available for societal good. For privacy, we have techniques such as split learning; for quality, there's automation of machine learning; and for incentives, there are new types of data markets. So what does this honest, trusted, impartial broker need? First of all, it has to be decentralized, much like the internet, not owned or controlled by any single entity.
It needs high-quality, precise data. It needs to learn from wisdom, because data turns into wisdom, and wisdom turns into what machine learning can use. And this broker also needs an ability to act, to predict for a given task. These are tall orders; it's not easy to solve each of these pieces. So let's look at what has happened. Originally, we thought humans and our activity generate data, that data can be used for statistics, and those statistics can be used by everyone else. So: humans to data. But the challenge in today's world, with the Orwellian AI, is that the data and the statistics have started identifying the individual, as in the examples I gave you about Strava, and your mental health and job discrimination.
So it feels like the key enemy of privacy is actually precision, the fact that the data is so pure, and that's the first step in taking back the challenge. Why should an app need your exact, precise GPS location when we can just add some noise to it, blur it a little bit, and create a random location within a quarter mile, so it doesn't reveal the fact that you went to this mental health facility? At the same time, if you want to see the traffic and roughly how many folks are on the street, it's still good enough in aggregate: when you average the locations, you see the reds and the greens on the map. This is a very powerful idea. In fact, the 2020 census is going to use this notion of taking the raw census data, adding some noise, and releasing it to the public and to researchers. It will be available on a block-by-block basis in terms of education, employment, race, and other social factors. And because of this technique, called differential privacy, invented by Cynthia Dwork and others, we can guarantee that the released data does not reveal anything about any individual, while still being enough to describe everyone in aggregate.
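To make the mechanism concrete, here is a minimal Python sketch of the Laplace mechanism that underlies differential privacy. The function name, the sample count, and the quarter-mile blur scale are illustrative assumptions (the actual 2020 census system is considerably more elaborate); the point is only that noise calibrated to one person's influence hides the individual while aggregates stay useful.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise scaled to sensitivity/epsilon.

    A count over individuals has sensitivity 1: adding or removing any
    one person changes the count by at most 1, so the noise masks them.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# A block-level, census-style count (hypothetical numbers).
noisy_count = laplace_mechanism(true_value=412, sensitivity=1, epsilon=0.5)

# Blurring a GPS fix: report a point that is off by roughly a quarter
# mile, hiding which building was visited while leaving aggregate
# traffic maps (the reds and greens) intact once many fixes are averaged.
rng = np.random.default_rng(0)
lat, lon = 42.3601, -71.0942                 # hypothetical location
quarter_mile_deg = 0.0036                    # ~0.25 mi in degrees latitude
blurred = (lat + rng.laplace(0, quarter_mile_deg),
           lon + rng.laplace(0, quarter_mile_deg))
print(round(noisy_count), blurred)
```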
So that's a very powerful way to get started. On the horizontal axis, we have different ways to think about privacy. As we saw with the Strava example, anonymization is no good; using overlapping data sets, it has been shown many times that anonymization is not at all sufficient. A slightly better technique is to add noise, as we saw for the GPS location. Probably the best way to achieve privacy is to keep all the data encrypted. Of course, there's a trade-off: if we only anonymize the data, we can learn a lot from it; if we add noise, unfortunately it's difficult to do anything more sophisticated than aggregate queries; and encryption, of course, limits what we can do. So to create this private AI, the key concept is not to add noise, but instead to think about how we can convert data into wisdom and share only the wisdom, never the raw data. That's where the privacy can come in. And two major techniques have emerged in just the last two and a half years: one called federated learning, from Google, and another called split learning, from our group at MIT.
Let's take an example of medical diagnosis and treatment plans. Say we want to create a master algorithm that detects pneumonia from chest X-rays and suggests the best treatment plan, looking at hundreds of thousands of patients. Imagine you have about a hundred hospitals. Each of them has chest X-rays and treatment plans, but because of regulation and privacy, they cannot share them with a central facility. So what can we do to solve this problem? Well, the thing about software versus AI is that software is wisdom from rules. It's more like cooking: there's a procedure, you use ingredients, and you get an outcome. It's a rule-based system. AI, on the other hand, is wisdom from examples, so AI is more like raising a child: I show a lot of examples to my daughter about how to eat and how to talk and how to walk, and she learns from that. And the fact that AI is wisdom from examples allows us to create a new type of privacy.
So if we want to create collective intelligence, what I could do is take about a hundred medical interns, send them to those hundred hospitals, and ask them to train in each of those facilities. You know, maybe one hospital has more smokers, one hospital has the elderly, another hospital has a particular demographic, another has different nationalities. After having trained for three months, each medical intern won't be good at interpreting the chest X-rays and treatment plans for every other demographic, but each will be good enough at what they do for that one particular group. I can call all these hundred interns back to my office, and suddenly I have collective intelligence: I can just take an average of all those interns. When a new chest X-ray shows up, I can show it to these hundred interns and say, "Hey, what do you think?" and they can give me a pretty good answer. And in this process, I have not asked for raw data from any of those hundred hospitals, yet I have collective intelligence from all these entities. This fascinating technique is called federated learning, invented by Brendan McMahan and others at Google, and it is already being used on your Android phone.
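Here is a toy sketch of that "average the interns" procedure, federated averaging, with a NumPy logistic-regression model standing in for each intern. All names and the synthetic hospital data are hypothetical; real federated learning deployments (as on Android) add secure aggregation, client sampling, and many other refinements.

```python
import numpy as np

rng = np.random.default_rng(42)

def local_training(global_w, X, y, lr=0.1, epochs=5):
    """One hospital's intern: refine the global model on local data.
    The raw (X, y) never leaves this function; only weights do."""
    w = global_w.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))          # sigmoid
        w -= lr * X.T @ (preds - y) / len(y)          # gradient step
    return w

def federated_round(global_w, hospitals):
    """Send the model out, train locally, then average the returned
    weights, weighted by each hospital's sample count."""
    updates = [local_training(global_w, X, y) for X, y in hospitals]
    sizes = np.array([len(y) for _, y in hospitals], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

# Synthetic stand-in for 100 hospitals with private (features, label) data.
true_w = rng.normal(size=10)
def make_hospital(n):
    X = rng.normal(size=(n, 10))
    y = (X @ true_w + rng.normal(scale=0.5, size=n) > 0).astype(float)
    return X, y

hospitals = [make_hospital(int(rng.integers(50, 200))) for _ in range(100)]
w = np.zeros(10)
for _ in range(20):                                   # 20 federated rounds
    w = federated_round(w, hospitals)
```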
Let's look at it another way. You all know the parable of the blind men and the elephant. Imagine the hundred interns go out, and instead of learning from all those X-rays, pneumonia or not, and coming back, the interns just take notes. They're going to make a lot of mistakes: they're going to say, hey, this looks like a fan, this looks like a spear, this looks like a rope. They convert the raw X-rays into a whole bunch of features, and they send those to the center. And the center says: I never got a full picture of the elephant, so I've preserved the privacy of the elephant, but now I know what this is. I know what pneumonia looks like.
And this method is called split learning, as I said, from our group, and it's a way to create a master AI from wisdom. Each of these interns can take notes for every patient and convert the raw data into wisdom; we call it smashed wisdom. And this smashed wisdom is sent off to create the master algorithm. Pretty straightforward ideas, right? Converting data into wisdom, and wisdom into AI, while preserving the privacy of individual patients. Of course, we want to do that with sophisticated machinery, sophisticated digital tools, and that's this new-fangled AI that's privacy-preserving. So we're living in an interesting world where we could get the best of both, just with some software.
So my group members, including Praneeth Vepakomma and Abhishek Singh, are working on this problem, and they're releasing a lot of code on split learning: how do we split the learning between what happens in the hospitals and what happens at the servers? Of course, once the master AI is trained and it's really good at detecting pneumonia or suggesting treatment plans, it can be shared worldwide. And if for some reason the server wants to keep it a trade secret, that's fine too. A new hospital in the Philippines can just send an X-ray, not in its raw form but as the smashed wisdom extracted from it, and get a response. Just to make sure I state this in machine learning language, the technical language: we can take a deep neural network, and the early layers of the network execute at the hospital while the later layers of the deep neural network execute at the server.
And we convert the raw data, in this case X-rays and notes, into an intermediate representation, which we call the smashed wisdom, and the smashed wisdom is sent to the server so it can finish the rest of the training. At inference time, that hospital in the Philippines can again convert an X-ray image into the smashed representation and get a response back from the master algorithm. So this is how these systems can work. So far we've talked about a solution for privacy, using techniques like federated learning and split learning. What about quality? If we really want to create a benevolent AI, it's unlikely that every entity will have talented machine learning engineers, data scientists, and folks who can put all this together and curate it. So a new set of techniques has emerged, called automated machine learning, that can not only do the janitorial work on all the data but can also figure out the best technique to apply to that particular information. And then finally, even if all this is working, what's the incentive for those hundred hospitals to participate? Maybe they should get paid. Maybe some of those hospitals have really unique datasets, really unique patients, and should be compensated for that. Or there could be other incentives, beyond money, for them to participate. Either way, we need something like a stock market for AI, for data that turns into wisdom and turns into AI. So these are some of the techniques that we think will mesh together to create the benevolent AI.
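Here is a minimal single-machine sketch of that split in PyTorch. The layer sizes and the 64x64 "X-ray" tensors are made-up placeholders, and in a real deployment the smashed representation and the cut-layer gradient would travel over the network between hospital and server rather than between two objects in one process.

```python
import torch
import torch.nn as nn

# Early layers run at the hospital; only their output (the smashed
# representation) ever leaves the premises, never the raw X-ray.
hospital_net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())

# Later layers run at the server, which never sees raw data.
server_net = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

opt_h = torch.optim.SGD(hospital_net.parameters(), lr=0.01)
opt_s = torch.optim.SGD(server_net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def training_step(xray_batch, labels):
    opt_h.zero_grad(); opt_s.zero_grad()
    smashed = hospital_net(xray_batch)            # hospital forward pass
    # "Transmit" the smashed representation; detaching cuts the graph so
    # the server cannot backpropagate into the hospital's layers.
    smashed_remote = smashed.detach().requires_grad_(True)
    loss = loss_fn(server_net(smashed_remote), labels)  # server side
    loss.backward()                               # server backprop
    # Server returns only the gradient at the cut layer; the hospital
    # uses it to finish backprop through its early layers.
    smashed.backward(smashed_remote.grad)
    opt_h.step(); opt_s.step()
    return loss.item()

# Toy batch standing in for 64x64 chest X-rays with pneumonia labels.
x = torch.randn(8, 1, 64, 64)
y = torch.randint(0, 2, (8,))
training_step(x, y)

# Inference: a new hospital sends only the smashed representation and
# gets the master model's prediction back.
with torch.no_grad():
    pred = server_net(hospital_net(x)).argmax(dim=1)
```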
So the next time you see a booth that piques your curiosity, talk to God. And don't just click on those buttons that say "I accept, I accept" on all those forms where you never read the privacy agreement you're signing, sending over your life. Because something as low-stakes as your location can actually be very, very important, maybe not to you, but to the people around you and even the places you live in. At the same time, we can't be too paranoid and say, "I don't want to share data for a benevolent AI," because now we have a mechanism to convert data into wisdom, and that wisdom will create a benevolent AI. So let's fight for our freedom. Let's get our data in the right place. Let's solve big societal problems using benevolent AI. Thank you.
Data is siloed and invisible because of privacy regulations
So what's the reason for this data friction? (Ask audience.)
decentralization
dark data sees the light of day
observational data vs RCTs
millions of clinical trials in parallel