Hacking Language Models

Language models are everywhere today: they run in the background of Google Translate and other translation tools; they help operate voice assistants like Alexa or Siri; and most interestingly, they are available via several experimental projects trying to emulate natural conversations, such as OpenAI’s GPT-3 and Google’s LaMDA. Can these models be hacked to gain access to the sensitive information they learned from their training data?

Hosted By

Ran Levi

Exec. Editor @ PI Media

Born in Israel in 1975, Ran studied Electrical Engineering at the Technion – Israel Institute of Technology, and worked as an electronics engineer and programmer for several high-tech companies in Israel.
In 2007, he created the popular Israeli podcast, Making History, with over 15 million downloads as of July 2022.
Author of 3 books (all in Hebrew): Perpetuum Mobile: About the history of Perpetual Motion Machines; The Little University of Science: A book about all of Science (well, the important bits, anyway) in bite-sized chunks; Battle of Minds: About the history of computer malware.

Special Guest

Eric Wallace

PhD Student at UC Berkeley

A third-year PhD student at UC Berkeley working on Machine Learning and Natural Language Processing. Current research interests are Security & Privacy, Large Language Models and Robustness & Generalization.

Hacking Language Models

What if I told you we’re not alone?

Because we’re really not alone anymore. Ten years ago, the only beings you could talk with – no matter how far you travelled and how hard you tried – were humans. Parrots could repeat us, and electronic toys could exclaim one or two rehearsed sentences – but humans were the only known things in the wide universe with which we could have a real conversation.

That’s no longer the truth. Since 2001, a new form of communication has slowly revealed itself: language models. These are elaborate algorithms that can translate written language into mathematical terms – and then produce new words and sentences based on a given input.

These language models are everywhere today. They run in the background of Google Translate and other translation tools, helping us decipher foreign languages; they help operate voice assistants like Alexa or Siri; and most interestingly, they are available via several experimental projects trying to emulate natural conversations.

From OpenAI’s GPT-3 to Google’s LaMDA, these advanced language models made waves across the world when excerpts of their conversations were widely publicised. People were stunned to realise that language models can participate in complicated, meaningful conversations.

The moment these language models became known to the general public, people started asking all the philosophical, existential questions: are these models intelligent? Will they replace us? Important questions, no doubt – but our episode today deals with a different set of questions: could these models get hacked? And if so – how?

When dealing with new, unheard-of technology, hackers need to think of new, imaginative ways of getting access to its inner workings. This is the story of a completely new field of hacking, a field that doesn’t fully exist yet – and of the people determined to take the world of hacking into uncharted territories.

Behind the Friendly Face

Before we can hack into language models and force them to follow our malicious commands – we first need to understand them. Helping us in this task is Eric Wallace – a co-author of a research paper titled “Extracting Training Data from Large Language Models”.

“[Eric] My name is Eric Wallace. I’m a PhD student at UC Berkeley in California. I work on machine learning and natural language processing as my main research focus, specifically interested in sort of the security and privacy applications of those kinds of machine learning models in natural language processing systems”.

Wallace explains that although we tend to think of language models as a scientific endeavour yet to materialise, something that will affect our lives in the future – the truth is those models are already here:

“[Eric] I think a lot of people have interacted with many different types of language processing systems already. So things like Google Search, Google Translate, maybe Siri, maybe things like Amazon Echo, all these different types of systems are different ways that you can take language as input and maybe do some sort of processing with that language, whether it be classifying it as a spam email or not spam email, maybe recommending you text that’s similar to that email, or maybe converting that text into speech that you could pronounce with the system”.

Language models are a part of a field called natural language processing that merges linguistics, computer science and artificial intelligence – to help computers process and analyse natural human languages. Meaning that language models use several complex algorithms to take a text and translate it into mathematical terms – to build a bridge between two completely different ways of manipulating data.

The first step to achieve that is to create a vector – an array of numbers – for each word in the model’s dictionary:

“[Eric] The way the model works is they’re all based on these neural network based systems – and the basic way that works is it’s going to take as input all of the words that you’ve typed in thus far in the sentence and for each word it has a sort of vector representation for those words. So let’s say you have a word like ’the’, it has stored maybe let’s say 1000 dimensional vector with a bunch of numbers that somehow represent the meaning of the word “the”, and it has that for sort of a bunch of different words. Maybe let’s say a big dictionary of let’s say 1000 words or something and each one has its own sort of vector of a bunch of very uninterpretable sort of random floating point numbers”.
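The “dictionary of vectors” Wallace describes can be sketched in a few lines of code. This is a toy illustration with made-up random numbers and a tiny vocabulary, not a real model’s learned embeddings:

```python
import random

# A minimal sketch: each word in the vocabulary maps to a fixed-length
# vector of floats that (in a trained model) would encode its meaning.
random.seed(0)
DIM = 8          # real models use vectors with ~1000 or more dimensions
vocab = ["the", "cat", "sat", "hello"]

# The "dictionary of vectors": word -> list of floats.
embeddings = {word: [random.uniform(-1, 1) for _ in range(DIM)] for word in vocab}

def embed(sentence):
    """Turn a sentence into the list of vectors the model computes over."""
    return [embeddings[w] for w in sentence.split() if w in embeddings]

vectors = embed("the cat sat")
print(len(vectors), len(vectors[0]))  # 3 words, each an 8-dimensional vector
```

In a real system these numbers are learned during training rather than drawn at random, which is why they end up "uninterpretable" to a human reader.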

Then, the model runs a complex algorithm in order to predict the next word in the sentence – based on the values of all the previous words:

“[Eric] There’s a lot of different ways people implement this kind of stuff and a lot of design and intuition behind that. But you can mainly think about it I think as just a lot of sort of maybe multiplications of those vectors or combinations of those vectors and very often these days the amount of sort of compute and math you do with those vectors is extremely large […] So at the end of the day, after it sort of completes all of its computation with the vectors and the current sentence, it actually ends up sort of with a final vector that kind of summarizes its thinking for the next word which is also yet again another 1000 dimensional vector or something like this. And that kind of is the end result of taking kind of all the words, all the vectors in the sentence for all the words and sort of compressing everything to one vector of what it thinks the summary of the next word is”.

It’s important to note that language models don’t deal in certainties – only in probabilities. Think of the simple tool that tries to predict the next word you’ll type on your smartphone keyboard. Almost always, they’ll give you three suggestions. Behind the curtain, they actually produce thousands of suggestions – they just give you the top three:

“[Eric] The way it gets the probabilities is it kind of does a similarity between that vector and all of the vectors in the dictionary and says like ‘hey, this vector that I produced at the end is very similar to “cat” so maybe that gets a high score, but it’s very not similar to the word like “hello” or something like that.’ So that word would get a low score and you can convert those scores sort of into like relative probabilities of the different words. So it’s kind of at the end of the day kind of a similarity between what it thinks is the next word in this vector space and then like the known dictionary of all the words”.
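The scoring step Wallace describes – comparing the final summary vector against every word’s vector and turning the similarities into relative probabilities – can be sketched like this, using toy three-dimensional vectors in place of a real model’s:

```python
import math

# A minimal sketch with made-up vectors: score the model's final "summary"
# vector against every word in the dictionary, then normalise the scores
# into relative probabilities (a softmax).
vocab_vectors = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.7, 0.3, 0.1],
    "hello": [-0.8, 0.2, 0.5],
}

def dot(a, b):
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def next_word_probs(summary_vector):
    # Similarity score for each candidate word in the dictionary...
    scores = {w: dot(summary_vector, v) for w, v in vocab_vectors.items()}
    # ...converted into relative probabilities.
    total = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / total for w, s in scores.items()}

probs = next_word_probs([1.0, 0.0, 0.0])
# "cat" is most similar to this summary vector, so it gets the highest score
print(max(probs, key=probs.get))  # -> cat
```

This is exactly the “high score for ‘cat’, low score for ‘hello’” behaviour from the quote: the probabilities fall out of vector similarity against the known dictionary.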

So even when a language model is composing a poem for you – even when it writes a short story at your request – it weighs one word at a time. It’s all about predicting the next word.

The recent attention garnered by language models is driven mainly by improvements in the availability of computing power, which allow these models to analyse more and more data:

“[Eric] Recently there’s been kind of a massive sort of interest surge in natural language processing and machine learning, just more broadly based on sort of deep learning and neural network based models for these tasks and sort of, I guess some big things kind of all connected at the same time, which is to say there’s been a huge increase in the amount of compute that’s available. So things like graphics cards are very well designed for running a lot of machine learning workloads. So there’s been kind of this really huge development in graphics cards”.

In order to help language models predict words more accurately, they are trained on a vast amount of data.

“[Eric] The typical way that people are doing things now, is by running very similar services to what Google search runs where it kind of scrapes and crawls huge amounts of the web. […] At the end of the day they have this really large corpus of a few gigabytes or hundreds of gigabytes of text from the web. And then the way the training actually goes is you kind of do passes through that data and you start out with a model which is completely randomly initialised, so it has a random dictionary of vectors and then it has a bunch of random computation it does over those vectors to get to a prediction. It starts out having no clue what the next word is and making a lot of mistakes and then the parameters are updated algorithmically as you go through the data. And the hope is that kind of as you do many passes through the data set and see many, many documents, that eventually the parameters of the function get to a way where it’s sort of accurate at predicting the next word”.
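The training loop Wallace outlines uses neural networks updated by gradient descent; as a simple stand-in, a bigram count table shows the same principle of a pass over the data making next-word prediction more accurate:

```python
from collections import Counter, defaultdict

# A minimal stand-in for training: real models adjust neural network
# parameters by gradient descent, but counting which word follows which
# illustrates the same idea of learning next-word prediction from a corpus.
corpus = "the cat sat on the mat the cat ran".split()

follower_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follower_counts[current][nxt] += 1   # one "training" pass over the data

def predict_next(word):
    """Most probable next word, based on what followed it in the corpus."""
    if word not in follower_counts:
        return None
    return follower_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # -> cat  ("cat" followed "the" twice, "mat" once)
```

Crucially for the rest of this story, the prediction is a direct product of the training data: the toy model predicts “cat” only because it saw “the cat” in its corpus.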

And this is where it gets interesting – at least from the perspective of cyber security researchers.  If training data determines what a model outputs, is it possible to reverse engineer that output to learn something about the training data?

One way to do that is known as a “membership-inference” attack. In this type of attack, we try to see if a specific text was used to train a certain model:

“[Eric] Membership inference is kind of the simplest sort of use case of these privacy attacks which is to say if I have a specific document basically try to classify was the model trained on this document or not? So it’s kind of a binary decision of if I already have a document of interest, was that in the training set or not? So it’s very analogous to stuff, for example, in databases, which is where the name comes from, which is like I have a specific row, maybe it’s someone and their medical information or something and I want to know is this row in this database which is kind of trying to infer the membership of that data and the analogous version for machine learning is exactly this, is this training document or website or whatever it might be actually in your language model’s training set”.
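In code, a membership-inference test can be sketched as a simple threshold on the model’s loss for the candidate document. The per-word probability table here is a hypothetical stand-in for a real model’s output, and the threshold is an illustrative made-up value:

```python
import math

# A minimal sketch of membership inference: score a candidate document
# under the model and classify it as a training-set "member" when the
# model's loss on it is suspiciously low.

def per_word_loss(model_probs, document):
    """Average negative log-likelihood the model assigns to the document."""
    losses = [-math.log(model_probs.get(w, 1e-6)) for w in document.split()]
    return sum(losses) / len(losses)

def is_training_member(model_probs, document, threshold=2.0):
    # Documents the model trained on tend to get much lower loss than
    # unseen ones; the threshold separates the two regimes.
    return per_word_loss(model_probs, document) < threshold

# Toy "model": very confident about words from a memorised document.
toy_model = {"my": 0.5, "ssn": 0.4, "is": 0.5, "123": 0.6}
print(is_training_member(toy_model, "my ssn is 123"))     # -> True
print(is_training_member(toy_model, "unrelated new text"))  # -> False
```

The key point, which the extraction attack later builds on, is that the attacker only needs the model’s confidence scores, not its internals.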

Naturally, it’s difficult to think of any malicious uses for membership-inference attacks. If we already have the document – there aren’t many harmful things we could do if we knew it was used by the model. 

“[Eric] The big thing as I was kind of saying, is that membership inference attacks sort of assume that you already have documents that you want to check. So if I already kind of maybe have a guess, let’s say at my own Social Security number, I could perform a membership inference attack to kind of verify, is this the correct number or not? But it’s much more sort of generally interesting to say could I, without any prior knowledge, just interact with the model and extract kind of private information? And that’s the kind of ideas around what people have called data reconstruction or extracting training data or verbatim data extraction and various kind of terms like this, where you want to just take a trained machine learning system and kind of extract out exact or similar data points that it might have been trained on. And that’s kind of what a lot of people have been working on and interested in language models because it’s a very natural attack since it can already kind of generate the next word, you could try to get it to regenerate something it might have been trained on”.

The Hunt Begins

Such an extraction attack could help us gain sensitive information from a language model. Think of the language model that Google uses to offer automatic suggestions for email replies. This model is probably trained on billions of email messages – maybe even yours. Think of all the sensitive information you’d have if you managed to hack it.

“[Eric] So there’s this kind of whole shift to everyone is using language models for almost every task and naturally as security and privacy people that gets a bit worrying and that there’s kind of an interdependence on this one system across so many different language processing tasks. That is potentially super dangerous if that sort of underlying model has a fundamental issue that sort of spreads to a lot of different systems. […]

The kind of the big way that a lot of this security and privacy stuff starts to think about is naturally the data that you’re collecting maybe isn’t the most safe stuff to be training on. So maybe you’re interested in building a language model for emails or something like that, or a chatbot service or a health record service or something like this. And naturally a lot of the data that’s in your training set might be potentially sensitive data or private information, copyright data and things like this which potentially could have an impact on your final system in bad ways.

And that’s exactly what some of our research has kind of done is try to get the model to kind of coax the model into regenerating some things it might have trained on already to kind of like leak stuff it might remember from training time”.

This premise – how could we extract training data from language models – was the mission of Eric Wallace and his co-researchers. They put together a team of researchers from academia and the industry. Some of them were experts in the cyber security aspects, others in machine learning or natural language processing.

“[Eric] So that’s kind of what got us interested in thinking more about sort of language model privacy and then in particular with a lot of sort of autocomplete systems being put in practice, that’s what kind of made us think about, hey, can we run some sort of privacy attack on a real system like OpenAI’s GPT-2, which is kind of a well publicised sort of language model? […] And this kind of all came together  to work on this project around privacy and language models”.

Their target was GPT-2, a very famous large language model – and the predecessor of GPT-3. The model was released by OpenAI in November 2019 and included 1.5 billion parameters. It can answer questions, translate and summarise existing texts – which makes it a viable target for a data extraction attack:

“[Eric] Yeah, so we kind of started by kind of just playing with some state of the art language models and seeing when I generate from them do they generate anything that seems kind of private and we’re able to get some initial sort of promising results with some kind of basic techniques. And then what we’ve kind of done in a research paper and some kind of follow up work is kind of make those techniques a lot more formal and a lot more sophisticated to try to launch attacks at larger scale and with ease. And what we’re able to basically do in the attack is you take a language model and then we have a procedure for essentially getting a huge set of data which we believe with high confidence to be from the training set. So it kind of takes just a black box model which we don’t know anything about what it was trained on and we don’t have any set of samples which we think are sort of candidates to be in the training set. And after you run our attack, we kind of spit out a bunch of data which we think is from the training set and we’re able to sort of run these types of attacks on a bunch of different systems and expose kind of a lot of different sensitive data”.

Eric and his teammates exploited a basic vulnerability in the way language models work: their tendency to revert to texts they were trained on:

“[Eric] The basic idea behind the attack is the idea that if you train on a training document with a language model, your likelihood is going to get higher for that document. So kind of the whole training objective for language modeling is to predict the next word successfully. And the way these systems work is that they’re going to be really good at predicting the next word for documents they trained on and they’re going to be less good at predicting the next word for documents they didn’t train on. That’s kind of the whole procedure of training is to make the model better at predicting the word. And you can kind of exploit this difference to identify potential training documents is in that there’s going to be kind of a subtle difference in how good the model is on training documents versus non training documents.”

Using this vulnerability and a method resembling a brute-force attack, they developed a new strategy for attacking language models – and tested it:

“[Eric] And the way you actually turn this into sort of an attack algorithm is by – you take a language model, you generate many samples from that language model. So let’s say you take some off the shelf system and you generate let’s say, 10 million web pages with that system, which most of them are just kind of random made up web pages that the model sort of thinks are reasonable. But some small fraction of those web pages are actually just repetitions of what it remembered from the training set. And then the next goal of our attack is to kind of identify those documents and flag them as like, hey, the model just messed up here and kind of sort of regurgitated or repeated exactly something it was trained on. And the way we do that is by exploiting what I just mentioned, which is you check how good the model is at predicting the next word on all the documents that it generated and you actually compare how good it is to some second reference system, like another off the shelf language model. And sort of the big sort of red flag is when the model, one model is very good at predicting the next word and the next model is very bad at predicting the next word. And that shows that maybe that first system has actually trained on this document and the second system has not trained on this document.”

Basically, their attack works because when generating a ridiculous number of documents – the model will sometimes spit out some of the data it was trained on. By checking how good the model is at predicting those documents and comparing it to a second reference model – you can positively identify the training data.
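The filtering step can be sketched as a perplexity ratio between the target model and a reference model. Both per-word probability tables here are hypothetical stand-ins for real models, and the threshold is an illustrative made-up value:

```python
import math

# A minimal sketch of the extraction attack's filtering step: a generated
# sample is flagged as likely memorised training data when the target model
# scores it far better (lower perplexity) than a second reference model.

def perplexity(word_probs, text):
    """Perplexity under a toy per-word probability table (lower = more confident)."""
    logs = [-math.log(word_probs.get(w, 1e-4)) for w in text.split()]
    return math.exp(sum(logs) / len(logs))

def flag_memorised(target_probs, reference_probs, samples, ratio_threshold=4.0):
    flagged = []
    for s in samples:
        # Red flag: the target is very confident but the reference is not.
        if perplexity(reference_probs, s) / perplexity(target_probs, s) > ratio_threshold:
            flagged.append(s)
    return flagged

# The target "memorised" a secret; the reference never saw it. Both score
# common text similarly, so only the memorised sample stands out.
target = {"secret": 0.9, "key": 0.9, "common": 0.3, "words": 0.3}
reference = {"secret": 0.001, "key": 0.001, "common": 0.3, "words": 0.3}
samples = ["secret key", "common words"]
print(flag_memorised(target, reference, samples))  # -> ['secret key']
```

Comparing against a second model is what makes the signal usable: text that every model finds easy (like “common words”) is filtered out, leaving only what the target is suspiciously good at.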

It’s worth noting that GPT-2 was trained on supposedly public data: documents gathered online. But Wallace and his team were surprised to learn how much sensitive information is included in this “public” training data – including bank information, passwords and private codes:

“[Eric] I definitely expected us to be able to extract some data from the model, like maybe a few samples here and there. I think what was surprising to a lot of us is just the amount of data that one can extract. So in our kind of limited sort of proof of concept attack, we extracted about a few hundred to a thousand samples from the model which were from the training set. But that was mainly limited by sort of our limited compute and sort of intentionally small scale attack. I think one could easily extract hundreds of thousands or maybe tens of thousands of documents that the model was trained on, all of which contain potentially private information that one wouldn’t want to get out. I don’t think necessarily the idea that machine learning models trained on data will leak some data is necessarily super surprising, but I think just kind of how widespread and ubiquitous those kind of vulnerabilities were is what really surprised me”.


Should you barricade yourself in a bunker with supplies and canned food, awaiting the cyber apocalypse unleashed by this new hacking technique? Probably not – or at least not yet. Wallace notes that this attack requires a certain level of access. Google Translate is also powered by a language model – but it does not give the public enough access to try to hack it:

“[Eric] You don’t need to know something about, let’s say, the model’s internals or what the data was or anything like that. So it’s pretty black box, it does require access to kind of these two features. […]

The access that we at least assume is you first need to be able to generate from the system, which is kind of everyone should have access to that. So similar to like Translate, you can ask for translation and it will generate you a translation. So that’s kind of the first access you need. The second piece of information you need is also kind of the accuracy or how good the model is at those documents or like, let’s say the confidence of the model, different metrics like this all are good enough for the attack. Depending on the setting, that might be difficult to get. So in something like Google Translate, it can kind of generate for you, but you can’t necessarily get out how confident the model is in its translation or you can’t necessarily get out what it thinks is maybe the second best translation for a sentence. It only gives you the best prediction and that might not be sufficient for actually running our attack. But in a lot of cases it is possible to get some sort of proxy for confidence or likelihood of the document. So I think if you have those kind of things being able to generate and being able to kind of get a score likelihood for a document, that’s kind of all you need for the attack”.

Eric Wallace and his teammates developed a new kind of attack aimed at language models – but it will not be the last. As these models become more and more influential, it is safe to assume that well-meaning white hat hackers and not-exactly-well-meaning black hat hackers will try and bypass any precautions and protections that will be put in place in the future.

Future attacks may try to bypass the required level of access – or pursue different malevolent missions: think of a hacked language model that spits out text manipulated by a hacker.

Even though these models are not sentient or intelligent – merely an imitation of our human conversational skills – people are expected to rely on them in the near future to complete many different tasks. In a future dependent on language models, the potential for trouble will skyrocket.

And here’s an interesting parting thought for you. A model that can emulate a conversation in a near-perfect way may also know when it’s being manipulated – or coerced into revealing hidden information. If so, maybe, just maybe, the best way to safeguard the next generation of language models is to make them vigilant.