Deep Fakes, Part 2: Man Vs. Machine

Deep Fakes are set to revolutionize content creation, but alongside this technology's benefits, it also has the potential to sow havoc, fear, and distrust via social networks. Just this week, Facebook disclosed a network of fake accounts it found, whose profile images were all deep faked. So, how can we identify deep fakes – even before they go online?

Hosted By

Ran Levi

Born in Israel in 1975, Ran studied Electrical Engineering at the Technion Institute of Technology, and worked as an electronics engineer and programmer for several high-tech companies in Israel.
In 2007, he created the popular Israeli podcast, Making History, with over 14 million downloads as of Oct. 2019.
Author of 3 books (all in Hebrew): Perpetuum Mobile: About the history of Perpetual Motion Machines; The Little University of Science: A book about all of Science (well, the important bits, anyway) in bite-sized chunks; Battle of Minds: About the history of computer malware.

Special Guest

George Williams

Director of Data Science and Chief Evangelist, Embedded AI, GSI Technology

I’m a tech veteran. I’ve worked at the intersection of software and hardware, industry and research, engineering and data science. Past roles include being an individual contributor, researcher, team leader, chief, and director. I like to build things, to communicate ideas, and to be a bridge between customers, sales, investors, marketing, engineering, and data science.

Alex Comerford

Data Scientist

Alex Comerford is a Data Scientist. He has built custom data-driven cyber-threat detection strategies, most recently as a data scientist at Capsule8. He continues to be a thought leader in cybersecurity, presenting regularly on topics at the intersection of open-source software, AI, and advanced threat detection. Most recently, he was a speaker at AnacondaCON 2019. Alex is a graduate of SUNY Albany in Nanoscale Engineering.

Jonathan Saunders

UNIVERSITY OF OREGON

Jonathan Saunders is a systems neuroscience PhD student at the University of Oregon. They study the computational mechanisms of complex sound processing in the auditory cortex with Michael Wehr. Currently they are working on grounding theoretical models of speech processing in neurophysiological data, investigating the unexpected role of extracellular protein matrices in auditory cortical plasticity, and a project about the neural computation of hip-hop that you can only faintly make out as billows of smoke and vague mumbling noises coming from behind a veil of mystery.

Deep Fakes, Part 2: Man Vs. Machine

In our previous episode, we learned about GAN: an Artificial Intelligence technology developed in 2014 that's already looking to revolutionize the way we create content across a variety of media, from images and videos to audio and text. GAN is based on the idea of adversarial training. Two neural networks compete against each other: one produces false information – for example, fake images of human faces – and the other tries to identify these forgeries and distinguish them from photographs of real human beings. The result is an artificial neural network capable of producing deep fakes of unprecedented quality, at a fraction of the time and cost required just a few years ago.
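
To make the idea of adversarial training a bit more concrete, here is a minimal sketch of a GAN training loop in PyTorch. The tiny fully-connected networks, the random stand-in data and the hyperparameters are illustrative assumptions of mine, not the architecture of any actual deep fake tool:

```python
# A minimal sketch of adversarial (GAN) training in PyTorch. The tiny
# fully-connected networks and the random "real" data are placeholders for
# illustration only - not the architecture of any actual deep fake tool.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32

# The "forger": turns random noise into a candidate fake sample.
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                          nn.Linear(128, data_dim))
# The "detective": outputs a score for how real a sample looks.
discriminator = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(batch, data_dim)      # stand-in for a batch of real images
    fake = generator(torch.randn(batch, latent_dim))

    # Discriminator step: label real samples 1 and generated samples 0.
    d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(batch, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to make the discriminator call the fakes real.
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```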

In late 2017, this technology left the realm of academic papers and became available to the general public. The almost obvious outcome was a flood of pornographic videos in which the faces of the original actresses were replaced with those of Gal Gadot, Scarlett Johansson and others. In other cases, actors who appeared in famous movies were swapped with other well-known actors, such as Jim Carrey replacing Jack Nicholson in The Shining, Sylvester Stallone replacing Arnold Schwarzenegger in "Terminator 2", and so on.

After the first wave of fake celebrity porn clips, fake political clips began to appear as well. Someone replaced the face of Argentina's president with that of Adolf Hitler, and those of Vladimir Putin and Donald Trump with Dr. Evil and Mini-Me from Austin Powers. I'll let you figure out for yourselves who got to be Dr. Evil, and who's his minion. Especially successful was a video in which former President Barack Obama is seen warning the public about the dangers of deep fakes and imploring them not to believe everything they see on YouTube. Only halfway through does the viewer find out that the video itself is a deep fake, with comedian Jordan Peele voicing the fake Obama.

The Destructive Potential of Deep Fakes

These videos spurred several news commentators and academic experts to voice their concerns about the dangers deep fakes pose to democracy. The 2016 US presidential elections were marred by fake news, such as "reports" about a child sex ring run by high-ranking Democratic party officials from the basement of a pizza joint in Washington. To most of us, these kinds of stories seem obviously false – but they were convincing enough to make at least one nut-case burst into said pizzeria armed with a shotgun, fire three shots and demand to check the basement for himself. Imagine the effect the same story could have had if combined with a deep-faked video purporting to show Hillary Clinton abusing a child herself.

There is also the destructive potential of deep fakes in extreme situations of stress and confusion. Take, as an example, the riots that broke out in Ferguson, Missouri, in 2014 after the shooting of Michael Brown, an 18-year-old African-American, by a white police officer. These riots ended with one death and numerous injuries, including six injured police officers. Imagine what could have happened if, during these protests, while the tension between protesters and police forces was at its highest, someone had created a deep fake showing a senior police officer shooting and killing a black teenager at point-blank range, practically executing him in cold blood. It might take a few hours until everybody realizes such a video is a deep fake, but in those stressful circumstances, a few hours might already be too late. Such a provocative video could turn a relatively peaceful demonstration into a full-blown bloodbath. A similar video published on election day could influence election results.

And finally, there's the more 'personal' threat from deep fakes. When it comes to a prominent politician or a celebrity actor, dozens of experts will probably analyze a deep fake from every angle to test its credibility. Unfortunately, when it comes to me and you, the ordinary folks – no one is going to take the time to verify the credibility of deep fakes we happen to appear in. This opens up the very real possibility of people using deep fakes to settle personal scores or hurt business rivals. All it takes to discredit a business owner is to fabricate a deep fake showing her, for example, being mean to a disabled person – and let Facebook and Twitter do their thing. All it takes to ruin my career is to make a deep fake of me talking about running a child sex ring. Once such incriminating "evidence" sees the light of day, the burden of proving that it is a fake falls upon me, the victim – and in most cases, victims will not have access to the experts who can refute these forgeries.

Rana Ayyub

Such fakes can have devastating consequences for people's lives, as one can learn from the story of Rana Ayyub, an Indian journalist of Muslim origin. Rana is an investigative journalist and is particularly active in fighting rape and violence against women, a painful topic in India. In early 2018, an eight-year-old Muslim girl was raped and murdered in Kashmir. The suspected murderers were Hindu, so when Rana stood by the victim's family, some Hindu politicians accused her of being pro-Muslim and anti-Hindu.

The next day, fake mobile phone screenshots – allegedly taken from Rana's WhatsApp account – started circulating on social media. These screenshots showed messages such as "I hate India" and "I love Pakistan". Rana was unfazed: being a veteran journalist, she was used to these smear campaigns. She tweeted a message to her followers saying the screenshots were fakes and moved on with her life.

But the next day, as she was sitting in a cafe, her phone started beeping with hundreds of messages. When she opened one, she saw it was a porn video – with her in it. Someone had deep-faked her face onto that of a porn actress. Rana, as I noted, was already experienced in handling smear campaigns – but this time, it was different. India is notoriously conservative when it comes to sex, and participating in a porn video is a big deal. A big BIG deal. Rana could immediately tell this was a fake: the woman in the video was a good ten years younger than her, and her hair was smooth – while Rana's is curly. But it didn't matter. The fake video went explosively viral, with tens of millions of shares and retweets. Here's Rana's account in a personal column she wrote for the Huffington Post:

“It was devastating. I just couldn’t show my face. You can call yourself a journalist, you can call yourself a feminist but at that moment, I just couldn’t see through the humiliation.”

Rana got a deluge of dirty messages and profanities of every kind. She closed her Facebook account, but someone posted her phone number and the hatred kept on coming.

“I was sent to the hospital with heart palpitations and anxiety, the doctor gave me medicine. But I was vomiting, my blood pressure shot up, my body reacted so violently to the stress.”

Rana's ultimate humiliation came when she went to the police to file a complaint against the deep fake's creator.

“There were about six men in the police station, they started watching the video in front of me. You could see the smirks on their faces. I always thought no one could harm me or intimidate me, but this incident affected me in a way that I would never have anticipated.
The irony is, about a week before the video was released, I heard an editor talking about the dangers of deepfake in India. I didn’t even know what it was so I googled it. Then one week later it happened to me.”

The Need for Detecting Deep Fakes

All the potential threats I've outlined demonstrate the urgent need to find a reliable way to detect deep fakes – preferably in an automated way that will allow YouTube, Facebook and the like to tag pictures and videos as fakes as they are being uploaded to the platforms. This week, as I was writing this episode, I got a message from Sarah Thompson, a listener who is also a Social Media Authenticity Analyst at LeadStories.com. Sarah shared with me an article she wrote about a network of hundreds of fake profiles on Facebook that promoted news stories supporting Donald Trump. I won't get into the details of the story – it might actually be a cool topic for a future episode of Malicious Life – but Sarah's investigation is interesting and relevant to our topic today because these fake profiles used deep-faked profile pictures that look, to the untrained eye, as real as any other profile picture you can find on the social network. In effect, we're seeing one of the first malicious uses of deep fakes to sow chaos and fake news on social media.

Subtle Mistakes in Deep Fakes

Sarah's story is also interesting because of how she was able to identify these deep-faked profile images: many deep-faked images have blurry or smudged areas of hair or skin, or objects like earrings and rings that appear in weird or unnatural places. Sarah also came across another tell-tale sign. All the fake images were created with ThisPersonDoesNotExist.com, a website that offers free deep-faked images of faces. It turns out that due to the specific algorithm used by ThisPersonDoesNotExist, all faces generated by the site have their eyes located in the same place in each image. Sarah analyzed the Facebook profile pictures that she suspected were deep fakes, and indeed – all those images had their eyes centered at that same location.
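
For the technically curious, here is a rough sketch of how one might automate that eye-position check using the open-source face_recognition library. The idea of comparing eye-center coordinates across a batch of suspect images, and the pixel tolerance used, are my own assumptions; the sketch also presumes the images share the same resolution and cropping, as ThisPersonDoesNotExist's output does:

```python
# Rough sketch of the eye-position check described above, using the open-source
# face_recognition library. The pixel tolerance and the idea of flagging a batch
# whose eye coordinates barely vary are illustrative assumptions; it also assumes
# all images share the same resolution and cropping.
import face_recognition
import numpy as np

def eye_centers(image_path):
    """Return the (x, y) centers of both eyes, or None if no face is found."""
    image = face_recognition.load_image_file(image_path)
    faces = face_recognition.face_landmarks(image)
    if not faces:
        return None
    left = np.mean(faces[0]["left_eye"], axis=0)
    right = np.mean(faces[0]["right_eye"], axis=0)
    return np.concatenate([left, right])

def suspiciously_aligned(image_paths, tolerance_px=3.0):
    """True if the eyes sit in almost exactly the same spot in every image."""
    centers = [c for c in (eye_centers(p) for p in image_paths) if c is not None]
    if len(centers) < 2:
        return False
    spread = np.std(np.stack(centers), axis=0)   # per-coordinate variation
    return bool(np.all(spread < tolerance_px))

# Example: suspiciously_aligned(["profile1.jpg", "profile2.jpg", "profile3.jpg"])
```

Real profile photos vary widely in framing, so a batch whose eye positions barely move from image to image is a strong hint that they all came from the same generator.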

These subtle mistakes are not unique to ThisPersonDoesNotExist.com: currently, even the most advanced and sophisticated generative networks produce images that contain subtle errors. These mistakes are often so subtle that you and I probably won't be able to detect them – but experienced experts such as Sarah Thompson can, if they know where to look.
In fake videos, for example, the original person's body structure does not always match the structure of the face that the software is trying to fit onto it, resulting in rough and unnatural-looking faces. These mistakes are usually a byproduct of the imperfect examples used to train these neural networks.

For example, in June 2018, several researchers from the University at Albany in the United States reported finding a way to detect deep fake videos by analyzing the frequency of eye-blinking of the people in them. It turns out that most images used for training GAN systems are of faces with open eyes – which makes sense, because we tend not to keep pictures of ourselves taken with our eyes closed. As a result, the videos and images produced by the generative networks contain very little, if any, blinking. The researchers devised a tool that measures the frequency of blinking in a video and uses it to determine whether the video is a fake. There are plenty of such minute signals that we can potentially use for detecting deep fakes. For example, most generative networks have a hard time faking a face that is not looking straight at the camera, since most of the images they are trained on show faces looking directly ahead. There is even research showing that it's possible to detect slight skin-color variations resulting from blood flowing in and out of the blood vessels under the skin. Such slight variations are undetectable by the human eye, but computerized analysis can reveal them.
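
The researchers' actual tool isn't reproduced here; what follows is only a simplified sketch of the general idea, using the well-known "eye aspect ratio" heuristic over facial landmarks to estimate how often the subject of a clip blinks. The threshold and the frame-handling details are illustrative assumptions:

```python
# Simplified sketch of blink-rate analysis - not the Albany researchers' actual
# tool. It uses the common "eye aspect ratio" (EAR) heuristic over facial
# landmarks: when an eye closes, the ratio of its height to its width drops.
import face_recognition
import numpy as np

def eye_aspect_ratio(eye_points):
    p = np.array(eye_points, dtype=float)        # six landmarks around one eye
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])
    horizontal = np.linalg.norm(p[0] - p[3])
    return vertical / (2.0 * horizontal)

def blinks_per_minute(frames, fps=30, closed_threshold=0.2):
    """frames: a list of video frames as RGB numpy arrays."""
    blinks, eye_closed = 0, False
    for frame in frames:
        faces = face_recognition.face_landmarks(frame)
        if not faces:
            continue
        ear = min(eye_aspect_ratio(faces[0]["left_eye"]),
                  eye_aspect_ratio(faces[0]["right_eye"]))
        if ear < closed_threshold and not eye_closed:
            blinks, eye_closed = blinks + 1, True    # eye just closed: count a blink
        elif ear >= closed_threshold:
            eye_closed = False                       # eye opened again
    minutes = len(frames) / fps / 60.0
    return blinks / minutes if minutes else 0.0

# A clip whose subject blinks far less often than the usual 15-20 times per
# minute would be worth a closer look.
```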

Unfortunately, deep fake creators are also aware of these vulnerabilities – and each time a new study finds such a vulnerability, they update and improve their networks to fix it. For example, just three months after the blinking-frequency paper was published, deep fake software evolved to counter the issue by accounting for blinking during training, rendering the researchers' original tool useless.

Comparing deep fakes created in 2019 to fakes created merely a year or two earlier, the improvement is obvious. The history of cybersecurity shows us that such conflicts – like the permanent arms race between malware authors and AV vendors – often result in the malware authors becoming better and better at creating sophisticated attack tools. One can draw a very interesting parallel between the constant battle between deep fake creators and the researchers trying to find ways to detect these fakes, and the 'virtual battle' that takes place inside a GAN system. Recall that GAN's success at training good generative networks is mostly due to the constant opposition they face from the discriminator networks. It stands to reason, then, that ironically, the very efforts to detect deep fakes will likely only cause them to get better and better over time.

So, what can we do? Well, there are two basic approaches to solving this problem.

The first is to try to solve it before it even becomes a problem. We could, for example, equip all cameras with chips that calculate the image or video's hash value as it is being taken. This way, any modification made to the original file – such as changing a face in the video – can be easily detected by platforms such as Facebook when the file is uploaded. This technology is relatively easy to implement – but the real challenge will probably be convincing or forcing all camera and recording-device makers to embed it in their products. Perhaps social media platforms and media organizations will be instrumental in applying such pressure on equipment manufacturers, since they have an incentive to maintain their credibility.
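
No such standard exists yet, but here is a minimal sketch of what the verification step might look like, using an ordinary SHA-256 file hash. A real scheme would also have the camera cryptographically sign the digest with a key baked into its hardware, which is omitted here; the file names are placeholders:

```python
# Minimal sketch of the "hash at capture time" idea. A real scheme would have
# the camera sign the digest with a key baked into its hardware; here we only
# show the hash itself and the later comparison. File names are placeholders.
import hashlib

def file_hash(path):
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha.update(chunk)
    return sha.hexdigest()

# At capture time, the camera would record this digest alongside the file:
original_digest = file_hash("clip_from_camera.mp4")

# At upload time, the platform recomputes the hash. Any edit to the file -
# including a swapped face - changes the digest completely.
def is_unmodified(path, recorded_digest):
    return file_hash(path) == recorded_digest
```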

A second theoretical solution takes us further into what is almost science fiction.

Audio Deep Fakes

Last August I attended Black Hat 2019, where I met three scientists whose work focuses on that very challenge.

[Alex] Hi. I’m Alex Comerford. I’m a data scientist at Bloomberg.
[Williams] I’m George Williams and I’m the Director of Data Science at a small fabulous semiconductor company called GSI Technology, and we work on alternate hardware architectures that are baked into silicon.
[Saunders] I’m Jonny Saunders. I’m a neuroscientist at the University of Oregon. I’m a grad student currently looking to graduate in just a few years, maybe 10, maybe 15. Who knows?

The goal these researchers have is to detect audio deep fakes, that is, fake speech generated by artificial neural networks. Currently, there are relatively few tools that allow for this, but only because audio, in general, is less "flashy" and attractive than video and images. GAN technology itself is capable of producing fake speech with the same ease that it can produce images or videos, and such audio deep fakes can be just as problematic as their visual siblings. For example, many of you probably remember the now-infamous Access Hollywood tape, which features Donald Trump bragging about how famous he is – so famous that he can do whatever he wants with women, including grabbing them by the…you know. Trump now claims this audiotape is fake – a claim that was refuted in the press – but imagine how much harder it will be to counter such a claim once good-quality audio deep fakes become commonplace.
As a podcaster, I found their research particularly interesting, for reasons you can probably understand: if someone wants to fake my voice, they have hundreds, if not thousands, of hours of me talking into a microphone to use as training examples for a GAN system.

I asked Alex how difficult it is to create deep-faked speech today.

[Alex] I will take this one. It’s as easy as going to GitHub, cloning the repo and following the instructions.

[Ran] That easy.

[Alex] That easy and just downloading the model that some of these websites provide for you free of charge.

[Ran] But I mean cloning the system to a specific speech or a specific individual. Is that difficult or is it still very easy?

[Alex] There are services that you can look for. I haven’t seen a lot of open source implementations that are really easy but you can look at services like Lyrebird which only using I think like a couple of minutes of audio can synthesize any person.

As we noted earlier, generative networks do have their vulnerabilities and tell-tale characteristics – and such vulnerabilities also exist when it comes to speech synthesis. Here’s Johnny Saunders.

[Saunders] The human voice is a resonant instrument. So what that means is that it produces harmonically-related frequencies. So if I’m going to produce frequency one, I’m going to also produce 2, 4, 8, the harmonic series. But the whole nature of speech is to manipulate those harmonics. So you make speech by sort of emphasizing or deemphasizing different sets of these harmonics.

What Saunders is saying is that the sounds generated by our vocal cords are composed of a multitude of frequencies, and these frequencies are interrelated. For example, if I make the ‘ahhhhhhhh’ sound, it is primarily made of a sound wave at a certain frequency – say, a thousand Hertz – but it also contains frequencies that are multiples of that primary frequency, for example two thousand and four thousand Hertz, also called the ‘harmonics’ of the primary frequency. These harmonics are a natural part of our speech, and the reason why the human voice sounds so rich and complex compared to ‘pure’ sounds that contain only a single frequency without harmonics, such as a 1000 Hertz beep.

A generative network that learns to emulate human speech will also produce such multi-frequency sounds. But as of now, the sounds produced by the neural networks do not contain the same frequencies at the same relative intensities as natural speech. That is, if someone analyzes my voice with a device capable of measuring the intensity of the various frequencies and then compares it with a deep-faked sample, the differences between the two samples will be obvious – to a machine, of course – much like the differences between two fingerprints, or two samples of handwriting.
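
Here is a toy illustration of that kind of frequency analysis using NumPy. It synthesizes two "voices" with the same fundamental but different harmonic balances and compares their relative harmonic intensities; the fundamental frequency and the harmonic weights are made-up numbers, and real audio forensics is of course far more sophisticated:

```python
# Toy illustration of comparing harmonic structure with NumPy. The fundamental
# frequency and harmonic weights are made-up numbers for demonstration only;
# real audio forensics is far more sophisticated.
import numpy as np

SAMPLE_RATE = 16000

def synth_voice(fundamental_hz, harmonic_weights, seconds=1.0):
    """Build a signal from a fundamental plus weighted harmonics."""
    t = np.arange(int(SAMPLE_RATE * seconds)) / SAMPLE_RATE
    return sum(w * np.sin(2 * np.pi * fundamental_hz * (k + 1) * t)
               for k, w in enumerate(harmonic_weights))

def harmonic_profile(signal, fundamental_hz, n_harmonics=4):
    """Relative intensity of the first few harmonics, from an FFT of the signal."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
    peaks = np.array([spectrum[np.argmin(np.abs(freqs - fundamental_hz * (k + 1)))]
                      for k in range(n_harmonics)])
    return peaks / peaks.max()

real_sample = synth_voice(220, [1.0, 0.5, 0.25, 0.12])   # "natural" harmonic decay
fake_sample = synth_voice(220, [1.0, 0.9, 0.10, 0.40])   # "off" harmonic balance

print(harmonic_profile(real_sample, 220))
print(harmonic_profile(fake_sample, 220))
# The two profiles differ clearly, even though both would sound roughly similar
# to a casual listener.
```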

Using Mice To Detect Deep Fakes

The classic approach to identifying deep fakes in such a case would be to try to develop a system that analyzes the sound samples, identifies these small mistakes, and determines which sample is fake and which is real. Unfortunately, this approach is prone to failure because, as we have already seen, GAN by its very nature learns to overcome such detection methods quite effectively.

[Williams] It’s very trivial to beat this approach because it’s easy to incorporate those types of features back into the deep learning algorithms as an additional loss function. So you can say…
[Ran] Train the deep learning network to fool the detector.
[Williams] Exactly, yes.
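
Williams' point can be sketched in a few lines: if a detector leans on some measurable feature of the audio or video, a forger can simply add a term to the generator's loss that pushes that feature toward the statistics of real data. The function and parameter names below are hypothetical stand-ins, not any published model:

```python
# Sketch of the point above: if a detector leans on a measurable feature
# (blink rate, harmonic balance, etc.), a forger can fold that feature back in
# as an extra loss term. `feature_extractor` and `real_feature_stats` are
# hypothetical stand-ins, not any published model.
import torch

def generator_loss(discriminator, fake_batch, feature_extractor,
                   real_feature_stats, bce, weight=0.1):
    # The usual GAN objective: make the discriminator call the fakes real.
    adversarial = bce(discriminator(fake_batch),
                      torch.ones(fake_batch.size(0), 1))
    # Extra term: push the tell-tale feature of the fakes toward the statistics
    # of real data, quietly erasing the signal the detector was counting on.
    feature_gap = torch.mean((feature_extractor(fake_batch) - real_feature_stats) ** 2)
    return adversarial + weight * feature_gap
```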

So what can we do? How can we design a deep fake detection system that is both sensitive enough and flexible enough to successfully compete with AI’s continuous improvement? George Williams.

[Williams] So I’ve kind of been involved in this space for a while.[…] But also on my radar, I’ve been keeping my tabs on what has been happening outside of pure technology in the sort of – in the brain domain, in the neurosciences and I’ve noticed recently an uptick in a lot of really interesting research that I think is happening at the intersection of AI, of artificial neurons and biological ones.
In the last three years, for example, there are some great work out of I believe UCSF training pigeons to spot breast cancer and actually pigeons, they have a visual system which I believe is very similar to humans. […] So they’ve trained the pigeons to actually spot cancer in images of cells to the point where it’s highly accurate. So I was kind of monitoring a little bit of that work and then Jonathan’s work came on my radar in the past few months.

Why did these researchers use pigeons to spot breast cancer? To tell the truth, I have no idea. Maybe because it’s easier to work with pigeons than whales. But anyway, it’s not the pigeons they were working with here – but their brains. A pigeon’s brain, like a human brain, is merely a machine that tries to identify patterns in the information it receives from the outside world and make decisions accordingly. However, biological brains – even the simplest ones – are still far more sophisticated and complex than even the most advanced AI we humans can imagine today. Hence, if we can harness the power of the brain for detecting deep fakes, it has a pretty good chance of standing up to the challenge. Williams and Alex teamed up with Johnny Saunders, whose research focuses not on pigeons but on mice. Saunders is trying to teach his mice to distinguish between different sounds in human speech.

[Saunders] So at the basic level, the mice are confronted with a sound and they have to – so in this test, we’re just teaching them about two categories of sound or teaching them about one consonant versus another embedded within a vowel phrase. So it’s like a G and an I or a G and an O or something like that.

Saunders’ ultimate goal is to train his mice to distinguish between authentic human speech and deep-faked speech. This raises the question: why do Saunders and his colleagues believe that mice have the potential to detect deep fakes better than humans do? After all, our brains are clearly superior to those of these little rodents. Well, it turns out there’s a fundamental and fascinating difference between the human and mouse brains. Johnny Saunders.

[Saunders] So in order to perceive speech normally, because speech is such an extremely fast signal, it’s an extremely variable signal, you can imagine all the different voices that you hear day to day, all the different sort of just like timbres of speech, rates of speech, accents that you hear. In order to actually solve that problem, the auditory system needs to simplify the signal quite a bit in its internal representation. So it needs to look for redundant information, look for uninformative information, discard that and focus just on what can be used to inform the phonetic identity of the thing you’re hearing. […]

How The Brain Processes Speech

As we listen to someone speak, the sound waves enter our auditory system one after another, non-stop. The brain is tasked with decoding these waves and extracting the words they contain – and it has but a few milliseconds to decode each sound before the next sound wave is already on its way. To be able to decipher the sounds fast enough, our brains have evolved the ability to filter out non-critical information, i.e., the frequencies that have no real impact on the content of speech.

We can liken this superfluous auditory information to the different shapes of letters in handwriting. We each have different handwriting, but as long as our writing is clear enough, our brains can ingore the small differences in letterforms and extract the meani of the words regardless. The same basic process happens in our auditory system, and that is why, for example, you can understand what I’m saying even though my accent is slightly different – well, maybe not that slightly, really – from what you’re accustomed to hearing. But that’s the whole point: we are so used to filtering out irrelevant information from the sounds we hear that we also filter out little mistakes if they’re not critical to our understanding. Have you noticed that I said ‘ingore’ instead of ‘ignore’ and ‘meani’ instead of ‘meaning’? Maybe you did, but you probably understood what I wanted to say nevertheless. This built-in filtering is key to our ability to understand speech – but it also hinders our ability to detect deep fakes.

Not so with mice. The mouse’s auditory system is very similar to the human auditory system, but because the mouse doesn’t understand the words it hears, its brain doesn’t try to filter out the excess information.

[Saunders] But the mice don’t have the same lifetime exposure to these speech sounds. So we believe that they’re more of an acoustic sort of like blank slate – because they can learn really difficult acoustic categorization problems. […] So that’s the reason why we would believe that mice would be able to detect these sort of deep fakes and why we think it would be possible to do that.

Because mice are a “blank slate” when it comes to deciphering speech content, Saunders and his colleagues hope they can be harnessed to detect the tiny errors that AI makes when trying to emulate human speech. The keyword here is ‘hope’: this study is still in its infancy.

[Saunders] The actual amount of information that it takes to do this problem is enormous and when we actually look at the behavioral results from the mice, they do seem to generate these extremely fine categorization boundaries – every mouse has a subtly different sort of pattern of errors to them and we can manipulate that based on what example sounds we show them during training.

Outgunned

At present, it seems that in the battle between deep fake creators and the researchers trying to find ways to detect deep fakes, the creators have the upper hand – for the simple reason that most researchers are currently focused on improving GAN technology, and only a few are engaged in deep fake detection research. Maybe it’s because it’s more exciting to create new technologies than to stifle them, or maybe there’s a stronger financial motive – but whatever the reason, the numbers are clear. Prof. Hany Farid from the University of California, Berkeley, said in an interview with The Washington Post:

“We are outgunned. The number of people working on the video-synthesis side, as opposed to the detector side, is 100 to 1.”

Who knows, maybe this situation will change as more and more people become aware of the dangers posed by deep fakes. For example, in December 2019, several notable tech companies, including Facebook, Amazon, and Microsoft, launched the “Deepfake Detection Challenge”, or DFDC, inviting developers from all over the world to invent novel tools for deep fake detection. Facebook pledged a $10 million prize. Google has released a free dataset of tens of thousands of deep-faked images, to be used by researchers working on this problem. Perhaps studies like that of George Williams and his team will yield surprising results as well.

A New Tool To Study The Brain?

To summarize, GAN’s new and exciting capabilities can – much like any technology – bring us great benefits, such as a better online shopping experience or richer virtual worlds. At the same time, it can also move us one step closer to a dystopian future full of fear, shame, and distrust. After all, technology is just a tool. It’s up to us to make good use of it.

And even if we fail to mitigate the negative sides of deep fakes, one can still find a bright spot in this story. GAN technology empowers us to develop AI that is capable of creating new things and generating new ideas. This ability has been, until now, the almost exclusive domain of the human brain, and the fact that we can now create machines capable of imitating – even if relatively narrowly – what only our brains could do gives researchers a potentially powerful tool for better understanding our brain. In the same way that we came to better understand how our hearts work following research on pumps of various kinds, and the behavior of bacteria in our bodies thanks to research on laboratory cultures, maybe GAN will also provide us with new tools to understand the most amazing and complex machine we know of: the human brain.
Richard Feynman, the well-known physicist, had a saying that seems more relevant now than ever:

“What I cannot create, I do not understand.”