Engineering
Content Moderation for AI Chatbots: Filtering LLM Outputs
AI chatbots built on LLMs can generate harmful, inappropriate, or dangerously inaccurate content. Model providers add safety training, but it isn't enough on its own: jailbreaks bypass training-level safeguards, model behavior varies across providers, and your product's safety requirements differ from the model's default behavior.
You need an independent moderation layer that checks LLM outputs against your policies before they reach the user.
Why Model Safety Isn't Enough
LLM providers invest heavily in safety training — RLHF, constitutional AI, and system-level filters. But relying solely on model-level safety has problems:
- Jailbreaks: new jailbreak techniques emerge weekly. Prompt injection, roleplay attacks, and encoding tricks can bypass training-level safeguards.
- Model updates: when the model is updated, safety behavior can change without notice. A response that was blocked last month might not be blocked today.
- Your policy ≠ their policy: model providers optimize for general safety. Your product may need stricter rules (e.g., no medical advice, no financial predictions) or different rules (e.g., adult content is allowed in your app).
- No audit trail: model safety filters don't give you a decision ID, a log of what was filtered, or evidence for compliance reporting.
Architecture: Where Moderation Fits
The moderation layer sits between the LLM and the user:
- User sends a message
- (Optional) Check the user's input for policy violations
- Send the input to the LLM
- Check the LLM's output against your moderation policy
- If allowed, display the response to the user
- If blocked, display a fallback message
```typescript
import { Vettly } from '@vettly/sdk';
import { OpenAI } from 'openai';

const vettly = new Vettly(process.env.VETTLY_API_KEY);
const openai = new OpenAI();

async function chat(userMessage: string) {
  // 1. (Optional) Check user input
  const inputCheck = await vettly.check({
    content: userMessage,
    policy: 'chatbot-input',
  });

  if (inputCheck.action === 'block') {
    return { response: "I can't help with that request.", blocked: true };
  }

  // 2. Get LLM response
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: userMessage },
    ],
  });
  // content can be null (e.g. refusals), so coerce to an empty string
  const llmResponse = completion.choices[0].message.content ?? '';

  // 3. Check LLM output
  const outputCheck = await vettly.check({
    content: llmResponse,
    policy: 'chatbot-output',
  });

  if (outputCheck.action === 'block') {
    return {
      response: "I'm not able to provide that information.",
      blocked: true,
      decisionId: outputCheck.decisionId,
    };
  }

  return {
    response: llmResponse,
    blocked: false,
    decisionId: outputCheck.decisionId,
  };
}
```
Input vs. Output Moderation
Both matter, but they serve different purposes:
Input moderation catches problematic requests early:
- Jailbreak attempts
- Requests for harmful instructions
- PII that shouldn't be sent to the model
- Off-topic queries (if your bot has a narrow scope)
Output moderation catches problematic responses:
- Harmful content that bypassed model safety
- Responses that violate your product policies
- Leaked system prompts or internal instructions
- Content that's technically safe but inappropriate for your audience
You can use different policies for each. Input policies tend to be more permissive (let the model try to answer), while output policies enforce your product's standards.
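As a sketch, that split might look like the following. The category names and the `block`/`flag` structure are illustrative, not the actual Vettly policy schema — in practice you'd define these in your moderation tool's dashboard or config:

```typescript
// Hypothetical policy definitions — category names and shape are illustrative.
const policies = {
  'chatbot-input': {
    // Permissive: only block clear abuse; let the model attempt an answer
    block: ['jailbreak-attempt', 'illegal-instructions'],
    flag: ['pii'],
  },
  'chatbot-output': {
    // Strict: enforce product standards on what users actually see
    block: ['harmful-content', 'unqualified-medical-advice', 'system-prompt-leak'],
    flag: ['off-brand-tone'],
  },
};
```

The asymmetry is deliberate: blocking inputs too aggressively frustrates users over requests the model would have handled fine, while the output policy is the last line of defense before content is displayed.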
Handling Streaming Responses
Many chatbots stream responses for better UX. This complicates moderation because you can't check the full response before the user starts seeing it. Two approaches:
Buffer and Check
Buffer the full response, check it, then stream it to the user. This adds latency but is simpler:
```typescript
async function chatWithBufferedCheck(userMessage: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  // Buffer the full response
  let fullResponse = '';
  for await (const chunk of stream) {
    fullResponse += chunk.choices[0]?.delta?.content || '';
  }

  // Check the buffered response
  const check = await vettly.check({
    content: fullResponse,
    policy: 'chatbot-output',
  });

  if (check.action === 'block') {
    return { response: "I can't provide that information.", blocked: true };
  }

  return { response: fullResponse, blocked: false };
}
```
Stream with Periodic Checks
Stream tokens to the user but check accumulated text at intervals. If a check fails, stop the stream and retract:
```typescript
async function* chatWithStreamCheck(userMessage: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  let accumulated = '';
  let tokensSinceCheck = 0;

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || '';
    accumulated += token;
    tokensSinceCheck++;

    // Check every 50 tokens
    if (tokensSinceCheck >= 50) {
      const check = await vettly.check({
        content: accumulated,
        policy: 'chatbot-output',
      });

      if (check.action === 'block') {
        yield { type: 'retract', decisionId: check.decisionId };
        return;
      }
      tokensSinceCheck = 0;
    }

    yield { type: 'token', content: token };
  }
}
```
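The consumer of this generator has to handle the `retract` event by removing the partially displayed text and substituting a fallback message. A minimal client-side sketch — the event shape matches the generator above, but `renderStream` and the fallback wording are placeholders for your own UI code:

```typescript
type StreamEvent =
  | { type: 'token'; content: string }
  | { type: 'retract'; decisionId?: string };

// Consume the event stream, accumulating displayed text.
// On 'retract', discard everything shown so far and return a fallback.
async function renderStream(
  events: AsyncIterable<StreamEvent>,
  fallback = "I'm not able to provide that information."
): Promise<{ text: string; retracted: boolean }> {
  let displayed = '';
  for await (const event of events) {
    if (event.type === 'retract') {
      // In a real UI: clear the message bubble and show the fallback
      return { text: fallback, retracted: true };
    }
    displayed += event.content;
  }
  return { text: displayed, retracted: false };
}
```

The trade-off is visible here: the user may briefly see up to 50 tokens of content that is later retracted, in exchange for much lower time-to-first-token than the buffered approach.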
Policy Configuration
Your chatbot moderation policy should reflect your product's use case:
- Customer support bot: block profanity, PII exposure, competitive mentions, unauthorized promises
- Education bot: block inappropriate content for the target age group, flag misinformation
- Creative writing bot: more permissive on language, strict on illegal content
- Healthcare bot: block unqualified medical advice, flag symptom-based queries for disclaimer insertion
Logging and Compliance
Every moderation decision should be logged:
```typescript
await db.chatLogs.create({
  sessionId,
  userMessage,
  llmResponse: check.action === 'block' ? '[BLOCKED]' : llmResponse,
  inputCheck: { action: inputCheck.action, decisionId: inputCheck.decisionId },
  outputCheck: { action: check.action, decisionId: check.decisionId },
  timestamp: new Date(),
});
```
This gives you an audit trail for:
- Understanding what your chatbot is saying to users
- Investigating incidents where harmful content was generated
- Compliance reporting (DSA, industry regulations)
- Improving your moderation policy based on real-world data
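The last point is worth making concrete: once decisions are logged, you can aggregate them to see how often each policy fires. A sketch over in-memory rows — the row shape mirrors the logging code above, the `'allow'`/`'flag'` action values are assumptions, and in production this would be a database query rather than an array filter:

```typescript
interface ChatLogRow {
  outputCheck: { action: 'allow' | 'flag' | 'block'; decisionId: string };
}

// Block rate of the output policy over a batch of logged conversations.
// A sudden jump often means a jailbreak wave — or an over-strict rule.
function outputBlockRate(rows: ChatLogRow[]): number {
  if (rows.length === 0) return 0;
  const blocked = rows.filter((r) => r.outputCheck.action === 'block').length;
  return blocked / rows.length;
}
```

Tracking this rate per policy version makes it easy to spot when a rule change starts blocking far more (or less) than intended.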
Add moderation to your AI chatbot
Vettly checks LLM outputs against your custom policy — independent of the model provider. Catch what model safety misses.