Engineering

Content Moderation for AI Chatbots: Filtering LLM Outputs

9 min read

AI chatbots built on LLMs can generate harmful, inappropriate, or factually dangerous content. Model providers add safety training, but it's not enough on its own. Jailbreaks bypass training-level safeguards, model behavior varies across providers, and your product's safety requirements are different from the model's default behavior.

You need an independent moderation layer that checks LLM outputs against your policies before they reach the user.

Why Model Safety Isn't Enough

LLM providers invest heavily in safety training — RLHF, constitutional AI, and system-level filters. But relying solely on model-level safety has problems:

  • Jailbreaks: new jailbreak techniques emerge weekly. Prompt injection, roleplay attacks, and encoding tricks can bypass training-level safeguards.
  • Model updates: when the model is updated, safety behavior can change without notice. A response that was blocked last month might not be blocked today.
  • Your policy ≠ their policy: model providers optimize for general safety. Your product may need stricter rules (e.g., no medical advice, no financial predictions) or different rules (e.g., adult content is allowed in your app).
  • No audit trail: model safety filters don't give you a decision ID, a log of what was filtered, or evidence for compliance reporting.

Architecture: Where Moderation Fits

The moderation layer sits between the LLM and the user:

  1. User sends a message
  2. (Optional) Check the user's input for policy violations
  3. Send the input to the LLM
  4. Check the LLM's output against your moderation policy
  5. If allowed, display the response to the user
  6. If blocked, display a fallback message
chatbot.ts

```typescript
import { Vettly } from '@vettly/sdk';
import { OpenAI } from 'openai';

const vettly = new Vettly(process.env.VETTLY_API_KEY);
const openai = new OpenAI();

async function chat(userMessage: string) {
  // 1. (Optional) Check user input
  const inputCheck = await vettly.check({
    content: userMessage,
    policy: 'chatbot-input',
  });

  if (inputCheck.action === 'block') {
    return { response: "I can't help with that request.", blocked: true };
  }

  // 2. Get LLM response
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: userMessage },
    ],
  });

  // content can be null (e.g. tool calls), so default to an empty string
  const llmResponse = completion.choices[0].message.content ?? '';

  // 3. Check LLM output
  const outputCheck = await vettly.check({
    content: llmResponse,
    policy: 'chatbot-output',
  });

  if (outputCheck.action === 'block') {
    return {
      response: "I'm not able to provide that information.",
      blocked: true,
      decisionId: outputCheck.decisionId,
    };
  }

  return {
    response: llmResponse,
    blocked: false,
    decisionId: outputCheck.decisionId,
  };
}
```

Input vs. Output Moderation

Both matter, but they serve different purposes:

Input moderation catches problematic requests early:

  • Jailbreak attempts
  • Requests for harmful instructions
  • PII that shouldn't be sent to the model
  • Off-topic queries (if your bot has a narrow scope)

Output moderation catches problematic responses:

  • Harmful content that bypassed model safety
  • Responses that violate your product policies
  • Leaked system prompts or internal instructions
  • Content that's technically safe but inappropriate for your audience

You can use different policies for each. Input policies tend to be more permissive (let the model try to answer), while output policies enforce your product's standards.

Handling Streaming Responses

Many chatbots stream responses for better UX. This complicates moderation because you can't check the full response before the user starts seeing it. Two approaches:

Buffer and Check

Buffer the full response, check it, then stream it to the user. This adds latency but is simpler:

buffer-check.ts

```typescript
async function chatWithBufferedCheck(userMessage: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  // Buffer the full response
  let fullResponse = '';
  for await (const chunk of stream) {
    fullResponse += chunk.choices[0]?.delta?.content || '';
  }

  // Check the buffered response
  const check = await vettly.check({
    content: fullResponse,
    policy: 'chatbot-output',
  });

  if (check.action === 'block') {
    return { response: "I can't provide that information.", blocked: true };
  }

  return { response: fullResponse, blocked: false };
}
```

Stream with Periodic Checks

Stream tokens to the user but check accumulated text at intervals. If a check fails, stop the stream and retract:

stream-check.ts

```typescript
async function* chatWithStreamCheck(userMessage: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  let accumulated = '';
  let tokensSinceCheck = 0;

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || '';
    accumulated += token;
    tokensSinceCheck++;

    // Check every 50 tokens
    if (tokensSinceCheck >= 50) {
      const check = await vettly.check({
        content: accumulated,
        policy: 'chatbot-output',
      });

      if (check.action === 'block') {
        yield { type: 'retract', decisionId: check.decisionId };
        return;
      }
      tokensSinceCheck = 0;
    }

    yield { type: 'token', content: token };
  }

  // Final check: cover tokens streamed since the last interval check,
  // so the tail of the response never goes out unmoderated
  if (tokensSinceCheck > 0) {
    const check = await vettly.check({
      content: accumulated,
      policy: 'chatbot-output',
    });

    if (check.action === 'block') {
      yield { type: 'retract', decisionId: check.decisionId };
    }
  }
}
```
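On the consumer side, the UI has to handle both event types the generator yields: append tokens as they arrive, and on a retract event replace everything already rendered with a fallback message. A minimal sketch, where the `StreamEvent` type and `renderChat` helper are assumptions for illustration, not part of any SDK:

```typescript
type StreamEvent =
  | { type: 'token'; content: string }
  | { type: 'retract'; decisionId: string };

// Consume a moderated stream, keeping the rendered text in sync.
// onUpdate is called with the full text to display after each event.
async function renderChat(
  events: AsyncIterable<StreamEvent>,
  onUpdate: (text: string) => void
): Promise<{ text: string; retracted: boolean }> {
  let rendered = '';
  for await (const event of events) {
    if (event.type === 'retract') {
      // Replace everything shown so far with a fallback message
      rendered = "I'm not able to provide that information.";
      onUpdate(rendered);
      return { text: rendered, retracted: true };
    }
    rendered += event.content;
    onUpdate(rendered);
  }
  return { text: rendered, retracted: false };
}
```

Keeping the full rendered string in one variable makes retraction a single replace rather than a token-by-token rollback.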

Policy Configuration

Your chatbot moderation policy should reflect your product's use case:

  • Customer support bot: block profanity, PII exposure, competitive mentions, unauthorized promises
  • Education bot: block inappropriate content for the target age group, flag misinformation
  • Creative writing bot: more permissive on language, strict on illegal content
  • Healthcare bot: block unqualified medical advice, flag symptom-based queries for disclaimer insertion
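How these rules are actually expressed depends on how you configure policies in Vettly. Purely as an illustration, a customer-support output policy could encode rules like the following; the categories, actions, and object shape here are assumptions, not the real Vettly schema:

```typescript
// Hypothetical policy shape -- illustrative only, not the Vettly schema.
const supportOutputPolicy = {
  name: 'chatbot-output',
  rules: [
    { category: 'profanity', action: 'block' },
    { category: 'pii-exposure', action: 'block' },          // emails, phone numbers, account IDs
    { category: 'competitor-mentions', action: 'flag' },    // surface for review, don't block
    { category: 'unauthorized-promises', action: 'block' }, // refunds, SLAs, legal commitments
  ],
  fallbackMessage: "I'm not able to provide that information.",
};
```

Note the mix of actions: a flag action could, for example, let the response through while queuing it for human review, whereas block swaps in the fallback message.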

Logging and Compliance

Every moderation decision should be logged:

logging.ts

```typescript
await db.chatLogs.create({
  sessionId,
  userMessage,
  llmResponse: outputCheck.action === 'block' ? '[BLOCKED]' : llmResponse,
  inputCheck: { action: inputCheck.action, decisionId: inputCheck.decisionId },
  outputCheck: { action: outputCheck.action, decisionId: outputCheck.decisionId },
  timestamp: new Date(),
});
```

This gives you an audit trail for:

  • Understanding what your chatbot is saying to users
  • Investigating incidents where harmful content was generated
  • Compliance reporting (DSA, industry regulations)
  • Improving your moderation policy based on real-world data
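The same logs can feed simple compliance metrics. As a sketch, assuming records shaped like the db.chatLogs entries above (held in memory here; the `ChatLog` interface is an assumption for illustration), a block rate for a reporting window might be computed like:

```typescript
interface ChatLog {
  sessionId: string;
  outputCheck: { action: 'allow' | 'block' | 'flag'; decisionId: string };
  timestamp: Date;
}

// Fraction of responses blocked by output moderation in a reporting window
function outputBlockRate(logs: ChatLog[], from: Date, to: Date): number {
  const inWindow = logs.filter((l) => l.timestamp >= from && l.timestamp <= to);
  if (inWindow.length === 0) return 0;
  const blocked = inWindow.filter((l) => l.outputCheck.action === 'block');
  return blocked.length / inWindow.length;
}
```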

Add moderation to your AI chatbot

Vettly checks LLM outputs against your custom policy — independent of the model provider. Catch what model safety misses.