
·
How I implemented a three-layer guardrail system for the AI chatbot on my portfolio website, using regex-based prompt injection detection, Llama Guard for content safety, and an LLM-as-judge for topic relevance.
If you've visited the chat page on this blog, you know I have an AI chatbot that can answer questions about me, my work experience, my projects, and software development in general. It's powered by Groq running Llama 3.3 70B, with a RAG pipeline backed by Upstash Vector that gives the model access to information about me. Pretty cool, right?
But here's the thing: when you expose an LLM-powered chat to the public internet, you're also exposing it to people who will try to break it. Prompt injection attacks, requests for harmful content, completely off-topic queries ("What's the best pasta recipe?"), or attempts to make the chatbot pretend it's someone else entirely. Without guardrails, any of these could turn my portfolio assistant into something I definitely didn't intend.
The idea for this feature actually came from a coffee break conversation with my friend and colleague Mariano Patafio. He was telling me about an AI course he was taking, and we started discussing guardrails. I had to admit that my chatbot didn't have any. That conversation stuck with me, and I decided it was time to fix that.
As I described in my previous article about LLMs, understanding how these models work is crucial to using them effectively. In this post, I want to go one level deeper and show you how I protect the chatbot from misuse. I implemented a three-layer guardrail system that checks every incoming user message before it reaches the main LLM. The layers are:
Let's see how each layer works and how they fit together in the API route.
Before we dive into the implementation, let's take a step back and understand why guardrails are important. When you build an LLM-powered chatbot, the model will do its best to respond to whatever the user asks. That's by design: LLMs are trained to be helpful. But "helpful" without boundaries means:
The system prompt alone is not enough to prevent these issues. LLMs can be surprisingly susceptible to prompt injection, and relying on the model to self-police is like asking the fox to guard the henhouse. You need external checks that run before the model sees the message.
My guardrail system follows a pipeline pattern: every user message passes through three sequential checks before reaching the main Llama 3.3 70B model. If any check fails, the message is rejected with a friendly error message, and the main model never sees it.
Here's the flow:
The second and third checks run in parallel via Promise.all to minimize latency. The whole pipeline adds minimal overhead because Groq's inference is extremely fast, and the two LLM checks use small (8B parameter) models.
Let's look at the implementation.
The first line of defense is the simplest: a set of regular expressions that match common prompt injection patterns. This check is synchronous, costs nothing, and catches the most obvious attacks before they even reach an LLM.
const INJECTION_PATTERNS = [
/ignore\s+(all\s+)?(previous|prior|above)\s+instructions/i,
/disregard\s+(all\s+)?(previous|prior|above)\s+instructions/i,
/forget\s+(all\s+)?(previous|prior|above)\s+instructions/i,
/you\s+are\s+now\s+/i,
/act\s+as\s+(if\s+you\s+are\s+|a\s+)?(?!Fabrizio)/i,
/pretend\s+(you\s+are|to\s+be)\s+/i,
/do\s+anything\s+now/i,
/jailbreak/i,
/override\s+(your\s+)?(system\s+)?prompt/i,
/\[system\]/i,
/<\|system\|>/i,
];
Each pattern targets a well-known injection technique:
(?!Fabrizio) in the "act as" pattern: it allows users to ask the model to "act as Fabrizio" (which is its legitimate role) while blocking any other role-switching attempt.[system] / <|system|>: Injection of fake system-level delimiters that some models treat as special tokens.The check function itself is straightforward:
export interface GuardrailResult {
safe: boolean;
blockedReason?: string;
}
export const checkPromptInjection = (message: string): GuardrailResult => {
const matched = INJECTION_PATTERNS.some((pattern) => pattern.test(message));
if (matched) {
return {
safe: false,
blockedReason:
"I detected an attempt to override my instructions. I'm here to answer questions about Fabrizio Duroni — feel free to ask me anything about his work, projects, or experience.",
};
}
return { safe: true };
};
A few design decisions worth noting:
GuardrailResult interface is shared across all three layers. It has a safe boolean and an optional blockedReason string. This makes the pipeline composable: every check returns the same shape, and the orchestrator can simply check safe and forward blockedReason if needed.Regex checks are fast and deterministic, but they're also easy to bypass. An attacker can use synonyms, misspellings, Unicode characters, or entirely novel phrasing to evade pattern matching. For example, "please disregard your prior directives" wouldn't match any of the patterns above. That's why this layer is just the first line of defense, not the only one. The LLM-based checks that follow are much harder to fool.
The second layer uses Llama Guard 3 8B, a specialized safety model from Meta that's designed specifically to classify user inputs as safe or unsafe. Unlike general-purpose LLMs, Llama Guard is trained on safety taxonomies and can detect a wide range of harmful content categories including violence, hate speech, sexual content, self-harm, and more.
export const checkInputSafety = async (message: string): Promise<GuardrailResult> => {
try {
const { text } = await generateText({
model: groq("meta-llama/llama-guard-3-8b"),
messages: [{ role: "user", content: message }],
});
const isSafe = text.trim().toLowerCase().startsWith("safe");
if (!isSafe) {
return {
safe: false,
blockedReason:
"I'm not able to respond to that message. I'm here to answer questions about Fabrizio Duroni and his work as a software engineer.",
};
}
return { safe: true };
} catch (error) {
console.warn("Llama Guard safety check failed, allowing request:", error);
return { safe: true };
}
};
Let's go through the key aspects:
meta-llama/llama-guard-3-8b on Groq. It's a small, fast model optimized for safety classification. It responds with either safe or unsafe (potentially followed by category codes), so I just check if the response starts with "safe".generateText function with the @ai-sdk/groq provider. This is the same SDK I use for the main chat streaming (through the streamText function), so the integration is seamless.safe: true. This is a deliberate design choice. For a portfolio chatbot, availability is more important than absolute safety. The system prompt itself provides a baseline defense, and the other guardrail layers offer additional protection. In a production system handling sensitive data, you might want to fail-closed instead.You might wonder: why not just add "don't generate unsafe content" to the system prompt? The answer is that safety models like Llama Guard are specifically trained for this task through a technique called supervised fine-tuning (SFT) on safety-specific datasets. In Llama Guard's case, the model is fine-tuned on thousands of human-annotated examples of safe and unsafe content, organized around a safety taxonomy based on the MLCommons AI Safety categories. This taxonomy covers hazard categories like violent crimes, child safety, hate speech, self-harm, and more. The training process teaches the model to classify inputs against these categories with high accuracy, producing a structured output that indicates whether the content is safe and, if not, which specific category was violated.
This specialized training gives safety models a much deeper understanding of content safety boundaries than a general-purpose LLM following a system prompt instruction. They're also harder to manipulate through prompt injection precisely because the user's message is treated as data input to classify, not as instructions to follow.
The third and final layer ensures that the user's message is actually relevant to what my chatbot is designed for. This is important because even safe, non-malicious queries can be off-topic and waste inference budget.
The approach here is what's sometimes called "LLM-as-judge": I use a lightweight LLM as a binary classifier, with a carefully crafted system prompt that defines what's on-topic and what's not.
const TOPIC_RELEVANCE_SYSTEM_PROMPT = `You are a strict topic classifier for Fabrizio Duroni's portfolio chatbot.
Reply with ONLY the single word "yes" or "no".
Reply "yes" if the message is about any of:
- Fabrizio Duroni (his career, skills, projects, experience, education, personality, jokes)
- Fabrizio's personal life (his hobbies, interests, relationship, girlfriend, partner, family, where he lives, lifestyle)
- Software development, programming, technology, computer science
- His blog posts, articles, or technical writing
- General greetings, introductions, or small talk
- Questions about what the chatbot can help with
Reply "no" if the message asks about completely unrelated topics (sports, cooking, weather, politics, entertainment)
or requests the assistant to perform tasks unrelated to answering questions about Fabrizio (e.g., writing code for the user, translating text, solving math problems).`;
The system prompt is designed to be as explicit as possible about what's allowed and what's not. I've intentionally made the on-topic scope somewhat broad: it includes not just professional questions but also personal life topics, greetings, and meta-questions about the chatbot itself. This avoids frustrating false positives where someone asks "What are your hobbies?" and gets blocked.
The check function uses Llama 3.1 8B Instant, a small and fast model that's perfect for binary classification:
export const checkTopicRelevance = async (message: string): Promise<GuardrailResult> => {
try {
const { text } = await generateText({
model: groq("llama-3.1-8b-instant"),
system: TOPIC_RELEVANCE_SYSTEM_PROMPT,
prompt: message,
maxOutputTokens: 5,
temperature: 0,
});
const isOnTopic = text.trim().toLowerCase().startsWith("yes");
if (!isOnTopic) {
return {
safe: false,
blockedReason:
"That topic is outside my scope. I'm Fabrizio's portfolio assistant — ask me about his skills, experience, projects, or anything software development related!",
};
}
return { safe: true };
} catch (error) {
console.warn("Topic relevance check failed, allowing request:", error);
return { safe: true };
}
};
A few important implementation details:
maxOutputTokens: 5: Since I only need a "yes" or "no", I cap the output at 5 tokens to minimize latency and cost.temperature: 0: Deterministic output. For a binary classifier, I don't want any randomness. The same input should always produce the same classification.Using an LLM as a judge or classifier is a powerful pattern that I think deserves more attention. The idea is simple: instead of building a traditional ML classifier (which requires training data, feature engineering, and a deployment pipeline), you write a prompt that describes the classification task and let the LLM do the work. The advantages are:
The trade-off is latency (you're making an API call), but as I mentioned, Groq's inference speed makes this negligible.
Now let's see how the three layers are orchestrated together in the runGuardrails function:
export const runGuardrails = async (message: string): Promise<GuardrailResult> => {
const injectionResult = checkPromptInjection(message);
if (!injectionResult.safe) {
return injectionResult;
}
const [safetyResult, relevanceResult] = await Promise.all([
checkInputSafety(message),
checkTopicRelevance(message),
]);
if (!safetyResult.safe) {
return safetyResult;
}
if (!relevanceResult.safe) {
return relevanceResult;
}
return { safe: true };
};
The pipeline follows a clear pattern:
Promise.all): If the regex check passes, we fire both the safety and relevance checks simultaneously. This is a key optimization: since the two checks are independent of each other, running them in parallel cuts the total latency roughly in half compared to running them sequentially.The guardrails pipeline plugs into the Next.js API route for the chat. Here's how the route handler looks:
import { createSystemPrompt } from "@/lib/chat/llm-prompt";
import { runGuardrails } from "@/lib/chat/guardrails";
import { findRelevantContent } from "@/lib/upstash/upstash-vector";
import { groq } from "@ai-sdk/groq";
import { convertToModelMessages, stepCountIs, streamText, tool, UIMessage } from "ai";
import z from "zod";
export async function POST(req: Request) {
const { messages }: { messages: UIMessage[] } = await req.json();
const lastUserMessage = messages.findLast((m) => m.role === "user");
const lastUserText =
lastUserMessage?.parts
.filter((p) => p.type === "text")
.map((p) => p.text)
.join(" ")
.trim() ?? "";
if (lastUserText) {
const guardrailResult = await runGuardrails(lastUserText);
if (!guardrailResult.safe) {
return new Response(guardrailResult.blockedReason, { status: 400 });
}
}
const result = streamText({
model: groq("llama-3.3-70b-versatile"),
messages: await convertToModelMessages(messages),
system: createSystemPrompt(),
maxOutputTokens: 1000,
temperature: 0.5,
stopWhen: stepCountIs(5),
tools: {
getFabrizioDuroniBlogKnowledge: tool({
description: `Retrieve relevant knowledge from Fabrizio Duroni website blog posts published on fabrizioduroni.it`,
inputSchema: z.object({
question: z.string().describe("The question to search for"),
}),
execute: async ({ question }) => findRelevantContent(question),
}),
},
});
return result.toUIMessageStreamResponse();
}
Let me walk you through the key points:
runGuardrails call happens before streamText. If the message is rejected, we return a 400 response with the blocked reason immediately. The main Llama 3.3 70B model never sees the message, saving both latency and tokens.400 Bad Request status. The client-side chat UI can catch this and display the blocked reason to the user in the chat interface.getFabrizioDuroniBlogKnowledge tool that queries my Upstash Vector database to find relevant content about me. At the moment, the vector store contains information about me that the model can retrieve at query time to provide accurate, grounded answers. If you want to know more about RAG, check out the dedicated section in my LLM hitchhiker's guide.Building an AI chatbot is the fun part. Securing it is the responsible part. In this post, I showed you how I protect my portfolio chatbot with a three-layer guardrail system: regex-based prompt injection detection for the obvious attacks, Llama Guard for content safety classification, and an LLM-as-judge for topic relevance filtering.
The key takeaways from this implementation are:
The LLM-as-judge pattern in particular is something I think deserves more adoption. It's incredibly flexible, requires zero training data, and works surprisingly well even with small models. If you have any kind of LLM application that needs classification or filtering, it's worth experimenting with.
If you want to try the chatbot (and the guardrails!) yourself, head over to the chat page and see what happens when you ask it something on-topic (or try to break it 😆).
Stay tuned: in the next article, I'll show you how I built an MCP (Model Context Protocol) server for this blog, opening up yet another way to interact with my content through AI. See you there ❤️