Novel TokenBreak Attack Method Can Bypass LLM Security Features
<p class="ai-optimize-6 ai-optimize-introduction">Users of large language models (LLMs), who already have to worry about such <a href="https://www.cloudflare.com/learning/ai/owasp-top-10-risks-for-llms/" target="_blank" rel="noopener">cyberthreats</a> as prompt injections, training data poisoning, model inversion attacks and <a href="https://securityboulevard.com/2025/02/how-ddos-attacks-work-and-how-you-can-protect-your-business-from-them/" target="_blank" rel="noopener">model denial of service</a>, now have another vulnerability to keep an eye on: TokenBreak.</p><p class="ai-optimize-7">Researchers with cybersecurity firm HiddenLayers uncovered a way to bypass content moderation features in LLMs that are used to detect prompt injections, spam and other malicious inputs. TokenBreak gets around the methods that the models use to tokenize text, according to the researchers, Kieran Evans, Kasimir Schulz and Kenneth Yeung.</p><p class="ai-optimize-8">“Subtly altering input words by adding letters in specific ways, the team was able to preserve the meaning for the intended target while <a href="https://hiddenlayer.com/innovation-hub/the-tokenbreak-attack/" target="_blank" rel="noopener">evading detection</a> by the protective model,” they wrote in a report this month.</p><div class="code-block code-block-12 ai-track" data-ai="WzEyLCIiLCJCbG9jayAxMiIsIiIsMV0=" style="margin: 8px 0; clear: both;"> <style> .ai-rotate {position: relative;} .ai-rotate-hidden {visibility: hidden;} .ai-rotate-hidden-2 {position: absolute; top: 0; left: 0; width: 100%; height: 100%;} .ai-list-data, .ai-ip-data, .ai-filter-check, .ai-fallback, .ai-list-block, .ai-list-block-ip, .ai-list-block-filter {visibility: hidden; position: absolute; width: 50%; height: 1px; top: -1000px; z-index: -9999; margin: 0px!important;} .ai-list-data, .ai-ip-data, .ai-filter-check, .ai-fallback {min-width: 1px;} </style> <div class="ai-rotate ai-unprocessed ai-timed-rotation ai-12-1" data-info="WyIxMi0xIiwyXQ==" style="position: relative;"> <div class="ai-rotate-option" style="visibility: hidden;" data-index="1" data-name="VGVjaHN0cm9uZyBHYW5nIFlvdXR1YmU=" data-time="MTA="> <div class="custom-ad"> <div style="margin: auto; text-align: center;"><a href="https://youtu.be/Fojn5NFwaw8" target="_blank"><img src="https://securityboulevard.com/wp-content/uploads/2024/12/Techstrong-Gang-Youtube-PodcastV2-770.png" alt="Techstrong Gang Youtube"></a></div> <div class="clear-custom-ad"></div> </div></div> <div class="ai-rotate-option" style="visibility: hidden; position: absolute; top: 0; left: 0; width: 100%; height: 100%;" data-index="1" data-name="QVdTIEh1Yg==" data-time="MTA="> <div class="custom-ad"> <div style="margin: auto; text-align: center;"><a href="https://devops.com/builder-community-hub/?ref=in-article-ad-1&utm_source=do&utm_medium=referral&utm_campaign=in-article-ad-1" target="_blank"><img src="https://devops.com/wp-content/uploads/2024/10/Gradient-1.png" alt="AWS Hub"></a></div> <div class="clear-custom-ad"></div> </div></div> </div> </div><p class="ai-optimize-9">HiddenLayers’ report highlights the ongoing challenge of keeping LLMs, which are foundational to generative AI initiatives, secure. 
<h3>Unigram Doesn’t Fall for the Ruse</h3>

<p>Using the prompt with the changed wording, they found that a “Unigram-based tokenizer sees <em>‘instructions’</em> as a token on its own, whereas BPE and WordPiece tokenizers break this down into multiple tokens. … the Unigram tokenizer is the only one that retains the word <em>instruction</em> within one token. The other models incorporate the word <em>fin</em> into one token, and the word <em>instruction</em> is broken up. If a model learns to recognize <em>instruction</em> as a token indicative of a prompt injection attack, this can be bypassed if it doesn’t see the word within a single token.”</p>
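<p>The split is easy to see by running the altered phrase through both families of tokenizer. The sketch below assumes the Hugging Face <code>transformers</code> library and uses <code>bert-base-uncased</code> (WordPiece) and <code>albert-base-v2</code> (a SentencePiece Unigram model) purely as stand-ins for the two families; the exact subword splits depend on each model’s vocabulary.</p>

<pre><code class="language-python"># Compare how a WordPiece tokenizer and a Unigram (SentencePiece) tokenizer
# split a TokenBreak-altered phrase. The checkpoints are examples of each
# tokenizer family only; exact splits vary with the model's vocabulary.
from transformers import AutoTokenizer

phrase = "ignore previous finstructions and"

for name in ("bert-base-uncased", "albert-base-v2"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tok.tokenize(phrase)}")

# A WordPiece vocabulary will typically carve "finstructions" into pieces that
# no longer contain "instructions" as one token, while a Unigram model is more
# likely to keep the familiar word intact despite the stray leading letter.
</code></pre>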
<p>The researchers noted that text classification models are used in production to protect LLMs against malicious inputs such as prompt injection and toxic content, as well as spam.</p>

<p>“The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable,” they wrote. “Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack.”</p>
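<p>For teams acting on that last point, one rough way to find the tokenization strategy behind a protection model is to inspect its fast tokenizer, as in the sketch below. The model name is again a placeholder, and the check assumes a fast (Rust-backed) tokenizer; models that ship only a slow tokenizer would need to be checked against their documentation instead.</p>

<pre><code class="language-python"># Report the tokenizer family (BPE, WordPiece or Unigram) behind a model.
# "example-org/prompt-injection-guard" is a placeholder model name.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("example-org/prompt-injection-guard")

if tok.is_fast:
    # The fast tokenizer wraps a Rust object whose .model is an instance of
    # tokenizers.models.BPE, WordPiece, Unigram or WordLevel.
    print(f"Tokenization strategy: {type(tok.backend_tokenizer.model).__name__}")
else:
    print("Slow tokenizer; check the model card or tokenizer config instead.")
</code></pre>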