Before deploying an AI system, there are a few basic but critical questions that too often go unasked: Where is the model deployed? What kinds of inputs will it process? What will the output format be? What are the obvious business risks, and more importantly, how do we revisit business risks over time? If you’re not thinking about these things up front, then you are missing a significant portion of understanding how AI fits into your organization.
While many “out of the box” models have some form of protection trained into the model itself, these tend to be basic protections and are often focused on safety rather than security. “Model Cards” tend to offer some insights, however measurements are not standardized across the industry. In the absence of stronger security features in the models themselves, a wide range of products and tools have emerged to address the security of AI models and protect your most critical applications and data.
Before I delve deeper into the solutions, I want to address the terminology. The term “red teaming” is frequently used in AI and LLM circles, but not always with clarity or consistency. For some, it’s just another layer of internal QA or prompt testing, but that definition, in my view, is much too narrow. Red Teaming is a holistic cybersecurity assessment that includes probing technical and non-technical vulnerabilities within an organization. Red teaming is adversarial. For example, think of scenarios where you’re not just testing systems, but probing every human and technical weak point across the entire surface area. Approaches can include physical access, social engineering, and unexpected inputs in unexpected places. Here’s Microsoft’s definition below:
In order to red team your AI model, you need to have a deep understanding of the system you are protecting. Today’s models are complex multimodal, multilingual systems. One model might take in text, images, code, and speech with any single input having the potential to break something. Attackers know this and can easily take advantage. For example, a QR code might contain an obfuscated prompt injection or a roleplay conversation might lead to ethical bypasses. This isn’t just about keywords, but about understanding how intent hides beneath layers of tokens, characters, and context. The attack surface isn’t just large, it’s effectively infinite. Here are a couple more novel examples of these types of attacks:
Dubbed “Stop and Roll” by Knostic, here is an attack where interrupting the prompt resulted in bypassing security guardrails within a large LLM.

This is similar to a side-channel attack, attacking the underlying architecture of models. Another example is the “Red Queen Attack,” by Hippocratic AI, a multi-turn role-play attack:

Some of the tactics are subtle but have big consequences because large language models use input tokens differently: uppercase versus lowercase, unicode characters versus non-unicode characters, high-signal words and phrases, complex prompt instruction sets and more. If you are curious to learn about these, there are thousands of jailbreaks widely available on the internet. Also adding fuel to the fire, many core system prompts are considered secret in theory, but have already leaked in practice. You can find some of them on GitHub which may lead to further jailbreaking.
Safeguards, Guardrails and Testing
When evaluating solutions, you should consider the needs and scale of your AI security solution, understanding that each layer introduces additional complexity, latency, and resource demands.
Building versus buying is an age-old debate. Fortunately, the AI security space is maturing rapidly, and organizations have a lot of choices to implement from. After you have some time to evaluate your own criteria against Microsoft, OWASP and NIST frameworks, you should have a good idea of what your biggest risks are and key success criteria. After considering risk mitigation strategies, and assuming you want to keep AI turned on, there are some open-source deployment options like Promptfoo and Llama Guard, which provide useful scaffolding for evaluating model safety. Paid platforms like Lakera, Knostic, Robust Intelligence, Noma, and Aim are pushing the edge on real-time, content-aware security for AI, each offering slightly different tradeoffs in how they offer protection. Not only will all these tools evaluate inputs and outputs, but often they will go much deeper into understanding data context to make better-informed real-time decisions, and perform much better than base models.
One of the key insights I want to share is that regardless of your tooling of choice, you must be able to measure the inner workings of the system you put in place. LLMs are stochastic systems that are extremely difficult to replay and troubleshoot. Logging exact metrics such as temperature, top P, token length and others will immensely help debugging later on.
Ultimately, what really matters is mindset. Security isn’t just a feature, it’s a philosophy. Red teaming isn’t just a way to break things; it’s a way to understand what happens when things break. A secure AI deployment doesn’t mean “no risk.” It means you’ve mapped the landscape, you know what kind of behavior to expect (both good and bad), and you’ve built systems that evolve with that knowledge. That includes knowing your model, your data, your user interactions, and your guardrails. Red teaming gives you clarity. It forces you to think about the outcomes you want — and the ones you don’t. And it ensures your AI system can distinguish between them when it matters most.
There are plenty more areas to explore in model security, especially on the code side. Stay tuned as I go deeper into the compliance portion next time.
Learn More at The AI Risk Summit | Ritz-Carlton, Half Moon Bay
This column is Part 3 of multi-part series on securing generative AI:
Part 1: Back to the Future, Securing Generative AI
Part 2: Trolley problem, Safety Versus Security of Generative AI
Part 3: Build vs Buy, Red Teaming AI (This Column)
Part 4: Timeless Compliance (Stay Tuned)
