Engineering

Content Moderation for AI Chatbots: Filtering LLM Outputs

9 min read

AI chatbots built on LLMs can generate harmful, inappropriate, or factually dangerous content. Model providers add safety training, but it's not enough on its own. Jailbreaks bypass training-level safeguards, model behavior varies across providers, and your product's safety requirements are different from the model's default behavior.

You need an independent moderation layer that checks LLM outputs against your policies before they reach the user.

Why Model Safety Isn't Enough

LLM providers invest heavily in safety training — RLHF, constitutional AI, and system-level filters. But relying solely on model-level safety has problems:

  • Jailbreaks: new jailbreak techniques emerge weekly. Prompt injection, roleplay attacks, and encoding tricks can bypass training-level safeguards.
  • Model updates: when the model is updated, safety behavior can change without notice. A response that was blocked last month might not be blocked today.
  • Your policy ≠ their policy: model providers optimize for general safety. Your product may need stricter rules (e.g., no medical advice, no financial predictions) or different rules (e.g., adult content is allowed in your app).
  • No audit trail: model safety filters don't give you a decision ID, a log of what was filtered, or evidence for compliance reporting.

Architecture: Where Moderation Fits

The moderation layer sits between the LLM and the user:

  1. User sends a message
  2. (Optional) Check the user's input for policy violations
  3. Send the input to the LLM
  4. Check the LLM's output against your moderation policy
  5. If allowed, display the response to the user
  6. If blocked, display a fallback message
chatbot.ts

```typescript
import { Vettly } from '@vettly/sdk';
import { OpenAI } from 'openai';

const vettly = new Vettly(process.env.VETTLY_API_KEY);
const openai = new OpenAI();

async function chat(userMessage: string) {
  // 1. (Optional) Check user input
  const inputCheck = await vettly.check({
    content: userMessage,
    policy: 'chatbot-input',
  });

  if (inputCheck.action === 'block') {
    return { response: "I can't help with that request.", blocked: true };
  }

  // 2. Get LLM response
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: userMessage },
    ],
  });

  // content can be null (e.g. tool calls), so default to an empty string
  const llmResponse = completion.choices[0].message.content ?? '';

  // 3. Check LLM output
  const outputCheck = await vettly.check({
    content: llmResponse,
    policy: 'chatbot-output',
  });

  if (outputCheck.action === 'block') {
    return {
      response: "I'm not able to provide that information.",
      blocked: true,
      decisionId: outputCheck.decisionId,
    };
  }

  return {
    response: llmResponse,
    blocked: false,
    decisionId: outputCheck.decisionId,
  };
}
```

Input vs. Output Moderation

Both matter, but they serve different purposes:

Input moderation catches problematic requests early:

  • Jailbreak attempts
  • Requests for harmful instructions
  • PII that shouldn't be sent to the model
  • Off-topic queries (if your bot has a narrow scope)

Output moderation catches problematic responses:

  • Harmful content that bypassed model safety
  • Responses that violate your product policies
  • Leaked system prompts or internal instructions
  • Content that's technically safe but inappropriate for your audience

You can use different policies for each. Input policies tend to be more permissive (let the model try to answer), while output policies enforce your product's standards.

Handling Streaming Responses

Many chatbots stream responses for better UX. This complicates moderation because you can't check the full response before the user starts seeing it. Two approaches:

Buffer and Check

Buffer the full response, check it, then stream it to the user. This adds latency but is simpler:

buffer-check.ts

```typescript
async function chatWithBufferedCheck(userMessage: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  // Buffer the full response
  let fullResponse = '';
  for await (const chunk of stream) {
    fullResponse += chunk.choices[0]?.delta?.content || '';
  }

  // Check the buffered response
  const check = await vettly.check({
    content: fullResponse,
    policy: 'chatbot-output',
  });

  if (check.action === 'block') {
    return { response: "I can't provide that information.", blocked: true };
  }

  return { response: fullResponse, blocked: false };
}
```

Stream with Periodic Checks

Stream tokens to the user but check accumulated text at intervals. If a check fails, stop the stream and retract:

stream-check.ts

```typescript
async function* chatWithStreamCheck(userMessage: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  let accumulated = '';
  let tokensSinceCheck = 0;

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || '';
    accumulated += token;
    tokensSinceCheck++;

    // Check every 50 tokens
    if (tokensSinceCheck >= 50) {
      const check = await vettly.check({
        content: accumulated,
        policy: 'chatbot-output',
      });

      if (check.action === 'block') {
        yield { type: 'retract', decisionId: check.decisionId };
        return;
      }
      tokensSinceCheck = 0;
    }

    yield { type: 'token', content: token };
  }

  // Final check: cover tokens streamed since the last interval check,
  // so the tail of the response never goes out unmoderated
  if (tokensSinceCheck > 0) {
    const check = await vettly.check({
      content: accumulated,
      policy: 'chatbot-output',
    });

    if (check.action === 'block') {
      yield { type: 'retract', decisionId: check.decisionId };
    }
  }
}
```
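On the consumer side, the UI has to handle both event types the generator yields: append tokens as they arrive, and on a retract event replace everything already rendered with a fallback message. A minimal sketch, where the `StreamEvent` type and `renderChat` helper are assumptions for illustration, not part of any SDK:

```typescript
type StreamEvent =
  | { type: 'token'; content: string }
  | { type: 'retract'; decisionId: string };

// Consume a moderated stream, keeping the rendered text in sync.
// onUpdate is called with the full text to display after each event.
async function renderChat(
  events: AsyncIterable<StreamEvent>,
  onUpdate: (text: string) => void
): Promise<{ text: string; retracted: boolean }> {
  let rendered = '';
  for await (const event of events) {
    if (event.type === 'retract') {
      // Replace everything shown so far with a fallback message
      rendered = "I'm not able to provide that information.";
      onUpdate(rendered);
      return { text: rendered, retracted: true };
    }
    rendered += event.content;
    onUpdate(rendered);
  }
  return { text: rendered, retracted: false };
}
```

Keeping the full rendered string in one variable makes retraction a single replace rather than a token-by-token rollback.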

Policy Configuration

Your chatbot moderation policy should reflect your product's use case:

  • Customer support bot: block profanity, PII exposure, competitive mentions, unauthorized promises
  • Education bot: block inappropriate content for the target age group, flag misinformation
  • Creative writing bot: more permissive on language, strict on illegal content
  • Healthcare bot: block unqualified medical advice, flag symptom-based queries for disclaimer insertion
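How these rules are actually expressed depends on how you configure policies in Vettly. Purely as an illustration, a customer-support output policy could encode rules like the following; the categories, actions, and object shape here are assumptions, not the real Vettly schema:

```typescript
// Hypothetical policy shape -- illustrative only, not the Vettly schema.
const supportOutputPolicy = {
  name: 'chatbot-output',
  rules: [
    { category: 'profanity', action: 'block' },
    { category: 'pii-exposure', action: 'block' },          // emails, phone numbers, account IDs
    { category: 'competitor-mentions', action: 'flag' },    // surface for review, don't block
    { category: 'unauthorized-promises', action: 'block' }, // refunds, SLAs, legal commitments
  ],
  fallbackMessage: "I'm not able to provide that information.",
};
```

Note the mix of actions: a flag action could, for example, let the response through while queuing it for human review, whereas block swaps in the fallback message.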

Logging and Compliance

Every moderation decision should be logged:

logging.ts

```typescript
await db.chatLogs.create({
  sessionId,
  userMessage,
  llmResponse: outputCheck.action === 'block' ? '[BLOCKED]' : llmResponse,
  inputCheck: { action: inputCheck.action, decisionId: inputCheck.decisionId },
  outputCheck: { action: outputCheck.action, decisionId: outputCheck.decisionId },
  timestamp: new Date(),
});
```

This gives you an audit trail for:

  • Understanding what your chatbot is saying to users
  • Investigating incidents where harmful content was generated
  • Compliance reporting (DSA, industry regulations)
  • Improving your moderation policy based on real-world data
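The same logs can feed simple compliance metrics. As a sketch, assuming records shaped like the db.chatLogs entries above (held in memory here; the `ChatLog` interface is an assumption for illustration), a block rate for a reporting window might be computed like:

```typescript
interface ChatLog {
  sessionId: string;
  outputCheck: { action: 'allow' | 'block' | 'flag'; decisionId: string };
  timestamp: Date;
}

// Fraction of responses blocked by output moderation in a reporting window
function outputBlockRate(logs: ChatLog[], from: Date, to: Date): number {
  const inWindow = logs.filter((l) => l.timestamp >= from && l.timestamp <= to);
  if (inWindow.length === 0) return 0;
  const blocked = inWindow.filter((l) => l.outputCheck.action === 'block');
  return blocked.length / inWindow.length;
}
```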

Add moderation to your AI chatbot

Vettly checks LLM outputs against your custom policy — independent of the model provider. Catch what model safety misses.