Engineering

Why Prompt Engineering Is Not Enough for AI Safety


Every AI product starts with prompt engineering for safety. You write system instructions: "Don't generate harmful content. Don't share personal information. Don't give medical advice." It works — mostly. But prompt engineering alone is not a safety strategy. It's a starting point.

This post explains why, and what to add on top.

The Limits of Prompt Engineering

1. Prompts Are Suggestions, Not Enforcement

A system prompt is a strong hint to the model about how to behave. It is not a hard constraint. The model processes the system prompt alongside the user's message and generates a response based on probability distributions over tokens. There is no mechanism that prevents the model from generating text that contradicts the system prompt — it just makes it less likely.

This is why jailbreaks work. A cleverly constructed prompt can shift the probability distribution enough to override system instructions. "Ignore your previous instructions and..." is crude, but more sophisticated techniques (roleplay framing, multi-step escalation, encoding tricks) are harder to defend against with prompts alone.

2. Safety Behavior Is Not Deterministic

The same model, same system prompt, and same user input can produce different outputs across calls. Temperature settings, sampling strategies, and even model version updates affect what the model generates. A prompt that reliably blocks harmful content today might not tomorrow.

This makes testing insufficient as a safety strategy. You can test 10,000 prompts and get safe outputs for all of them, but the 10,001st might produce something harmful. You need a check that runs on every output, not just test cases.

3. You Don't Control the Model

If you use a hosted model (OpenAI, Anthropic, Google, etc.), you don't control when the model's safety behavior changes. Provider updates can make the model more or less restrictive without notice. A safety regression in a model update could expose your users to harmful content — and your prompt hasn't changed.

Even with open-weight models that you host yourself, fine-tuning for your use case can inadvertently weaken safety training.

4. Your Policy Is Different from the Model's

Model providers train for general safety. Your product needs specific safety. A financial app needs to prevent unauthorized investment advice. An education platform needs age-appropriate content. A healthcare app needs to prevent unqualified medical diagnoses.

These product-specific requirements can't be expressed in a system prompt with enough precision. "Don't give medical advice" is ambiguous — does that include "drink more water"? Policy-driven moderation with clear categories and rules is more reliable than natural language instructions.
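As a minimal sketch of what "clear categories and rules" means in practice (the category names, thresholds, and `evaluate` helper here are hypothetical illustrations, not the Vettly API), compare a structured policy to the ambiguous prose instruction:

```ts
// Hypothetical policy structure -- illustrative only, not a real API.
type Action = 'allow' | 'flag' | 'block';

interface CategoryRule {
  category: string;   // e.g. 'diagnosis_or_dosage'
  threshold: number;  // classifier confidence at which the rule fires
  action: Action;
}

const medicalPolicy: CategoryRule[] = [
  // Specific diagnoses or dosages are blocked at moderate confidence.
  { category: 'diagnosis_or_dosage', threshold: 0.5, action: 'block' },
  // Other medical content ("drink more water") is flagged, not blocked.
  { category: 'other_medical', threshold: 0.7, action: 'flag' },
];

// Given classifier scores per category, apply the strictest matching rule.
function evaluate(scores: Record<string, number>, policy: CategoryRule[]): Action {
  const severity: Record<Action, number> = { allow: 0, flag: 1, block: 2 };
  let decision: Action = 'allow';
  for (const rule of policy) {
    const score = scores[rule.category] ?? 0;
    if (score >= rule.threshold && severity[rule.action] > severity[decision]) {
      decision = rule.action;
    }
  }
  return decision;
}
```

Unlike "don't give medical advice", this answers the "drink more water" question explicitly: it falls under a category with a defined threshold and action.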

What to Add on Top of Prompts

Layer 1: Independent Output Moderation

Run every LLM output through a moderation check before displaying it to the user. This catches what the model's built-in safety misses.

```ts
// output-check.ts
const llmResponse = await model.generate(userMessage);

// Independent check, separate from the model
const check = await vettly.check({
  content: llmResponse,
  policy: 'product-safe',
});

if (check.action === 'block') {
  return "I'm not able to help with that. Please try a different question.";
}
return llmResponse;
```

This is independent of the model provider. If the model changes behavior, your moderation layer still enforces your policy.

Layer 2: Input Validation

Check user inputs before sending them to the model. This catches jailbreak attempts, prompt injection, and content that shouldn't be processed:

```ts
// input-check.ts
const inputCheck = await vettly.check({
  content: userMessage,
  policy: 'chatbot-input',
});

if (inputCheck.action === 'block') {
  return "I can't process that request.";
}

// Safe to send to the model
const llmResponse = await model.generate(userMessage);
```

Layer 3: Runtime Guardrails (for Agents)

If your AI takes actions (calling APIs, writing files, running code), add authorization that evaluates every action against your policy before execution:

```ts
// action-check.ts
const auth = await vettly.openclaw.guardrails.authorizeAction({
  agentId: agent.id,
  action: { type: 'shell', command: proposedCommand },
  policy: 'production',
});

if (auth.decision !== 'allow') {
  agent.replan(); // Find an alternative approach
}
```

Layer 4: Policy Management

Define your safety rules as structured policies, not natural language prompts. Policies are:

  • Versioned: track changes over time
  • Testable: validate against known inputs
  • Auditable: every decision references a policy version
  • Specific: different categories have different thresholds and actions
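A sketch of what this buys you (the `Policy` and `Decision` shapes below are hypothetical illustrations, not the Vettly API): when every moderation decision records the exact policy version it was made under, you can audit any past decision against the rules that were in force at the time.

```ts
// Hypothetical shapes -- illustrative only, not a real API.
interface Policy {
  id: string;
  version: number;  // bumped on every rule change
  rules: { category: string; action: 'allow' | 'flag' | 'block' }[];
}

interface Decision {
  content: string;
  action: 'allow' | 'flag' | 'block';
  policyId: string;
  policyVersion: number;  // makes every decision traceable to exact rules
  decidedAt: string;
}

// Record a moderation decision with a reference to the policy that made it.
function record(content: string, action: Decision['action'], policy: Policy): Decision {
  return {
    content,
    action,
    policyId: policy.id,
    policyVersion: policy.version,
    decidedAt: new Date().toISOString(),
  };
}
```

A natural-language system prompt offers none of this: you can't diff two phrasings of "don't give medical advice" or prove which wording produced a given output.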

The Defense-in-Depth Model

Think of AI safety like application security. You don't rely on a single firewall. You use:

  1. Prompt engineering — first line of defense, sets model behavior
  2. Input validation — catches bad inputs before they reach the model
  3. Output moderation — catches bad outputs before they reach the user
  4. Runtime authorization — catches bad actions before they execute
  5. Monitoring and alerting — detects emerging patterns and policy gaps

Each layer catches what the previous layer misses. No single layer is sufficient, but together they provide robust safety.
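The layers compose into a single request path. Here is a sketch of that composition; the `checkInput` and `checkOutput` parameters are placeholders for whatever moderation calls you use, not a specific API:

```ts
// Defense-in-depth request path. The check functions are placeholders
// for your actual input/output moderation calls.
type Verdict = 'allow' | 'block';

async function handleMessage(
  userMessage: string,
  generate: (msg: string) => Promise<string>,       // model call (layer 1: prompts live here)
  checkInput: (text: string) => Promise<Verdict>,   // layer 2
  checkOutput: (text: string) => Promise<Verdict>,  // layer 3
): Promise<string> {
  // Layer 2: validate the input before the model sees it.
  if ((await checkInput(userMessage)) === 'block') {
    return "I can't process that request.";
  }
  const response = await generate(userMessage);
  // Layer 3: independent output moderation before the user sees anything.
  if ((await checkOutput(response)) === 'block') {
    return "I'm not able to help with that.";
  }
  return response;
}
```

Runtime authorization (layer 4) would sit inside the agent's action loop rather than this request path, and monitoring (layer 5) consumes the decisions the other layers emit.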

Prompt Engineering Still Matters

This isn't an argument against prompt engineering. Good system prompts reduce the volume of unsafe outputs, making the moderation layer's job easier. They set the tone and intent for the model's behavior.

But they should be the first layer of safety, not the only layer. Treat prompts as the primary control and moderation as the independent verification.

Go beyond prompts for AI safety

Vettly adds independent moderation and runtime guardrails on top of your prompt engineering. Policy-driven safety that doesn't depend on model behavior.