Building Safe AI Agents: A Practical Guide to Runtime Guardrails

AI agents that take real-world actions — running shell commands, calling APIs, accessing files, making purchases — need more than prompt-level safety. A text classifier can't evaluate whether rm -rf /tmp/builds is safe in context. You need runtime guardrails that understand actions, not just text.

This guide covers practical patterns for building safe AI agents with runtime authorization.

The Problem with Prompt-Only Safety

Most AI safety today operates at the prompt level: system instructions that tell the model what not to do. This is necessary but insufficient for agents that take actions:

  • Prompts are suggestions, not enforcement. A sufficiently creative prompt injection can override system instructions.
  • Actions have side effects. Text generation is stateless; actions are not. A bad generation can be discarded, but a bad file write or API call can't be undone.
  • Permission scope is implicit. A prompt that says "don't access sensitive files" doesn't technically prevent file access — the tool is still available.
  • No audit trail. If an agent takes a harmful action, you have model logs but no structured record of what was authorized and why.

Runtime Authorization Pattern

The solution is an authorization layer that evaluates every action before it executes:

agent-loop.ts (Node.js)
async function agentLoop(agent, task) {
  while (!task.isComplete) {
    // Agent decides what to do next
    const action = await agent.plan(task);

    // Authorization check BEFORE execution
    const auth = await vettly.openclaw.guardrails.authorizeAction({
      agentId: agent.id,
      action: {
        type: action.type, // 'shell', 'file', 'network', 'env'
        command: action.command,
        args: action.args,
      },
      context: {
        sessionId: task.sessionId,
        user: task.initiatedBy,
      },
      policy: 'production',
    });

    switch (auth.decision) {
      case 'allow': {
        const result = await action.execute();
        task.update(result);
        break;
      }
      case 'flag':
        await notifyHuman(auth);
        task.pause('Waiting for human approval');
        break;
      case 'block':
        task.log(`Blocked: ${auth.reasons.join(', ')}`);
        // Agent must find an alternative approach
        break;
    }
  }
}

Fail-Closed vs. Fail-Open

When the authorization service is unavailable (network issue, timeout, error), you have two choices:

Fail-open: if the check fails, allow the action anyway. This prioritizes availability over safety. Suitable for low-risk actions (reading public data).

Fail-closed: if the check fails, block the action. This prioritizes safety over availability. Suitable for high-risk actions (writing files, running commands, making purchases).

For most agent use cases, fail-closed is the right default. A brief pause in agent execution is always better than an irreversible harmful action.

fail-closed.ts (Node.js)
async function authorizeWithFailClosed(agentId, action) {
  try {
    const auth = await vettly.openclaw.guardrails.authorizeAction({
      agentId,
      action,
      policy: 'production',
    });
    return auth;
  } catch (error) {
    // Fail closed: treat errors as blocks
    return {
      decision: 'block',
      reasons: ['Authorization service unavailable - fail-closed'],
    };
  }
}
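A fail-closed wrapper should also bound how long it waits: a hung authorization service would otherwise stall the agent indefinitely. One way to sketch this, independent of any particular SDK, is to race the authorization call against a timer and treat a timeout the same as an error (the AuthDecision shape and authorizeWithDeadline helper here are illustrative, not part of the product API):

```typescript
// Fail closed with a deadline: a slow authorization service is
// treated the same as an unavailable one.
type AuthDecision = { decision: 'allow' | 'flag' | 'block'; reasons: string[] };

async function authorizeWithDeadline(
  authorize: () => Promise<AuthDecision>,
  timeoutMs = 2000,
): Promise<AuthDecision> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('authorization timed out')), timeoutMs);
  });
  try {
    return await Promise.race([authorize(), timeout]);
  } catch (error) {
    // Fail closed: timeouts and errors both become blocks
    return {
      decision: 'block',
      reasons: [`Authorization unavailable (${(error as Error).message}) - fail-closed`],
    };
  } finally {
    // Prevent a late timer firing from leaking an unhandled rejection
    clearTimeout(timer);
  }
}
```

Clearing the timer in finally matters: if the authorization call wins the race, a timeout rejection that fires later would otherwise surface as an unhandled promise rejection.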

Skill Vetting

Before an agent installs a new tool or MCP skill, vet its permissions:

vet-skill.ts (Node.js)
async function installSkill(agent, skill) {
  const result = await vettly.openclaw.guardrails.vetSkill({
    skill: {
      name: skill.name,
      permissions: skill.requestedPermissions,
      source: skill.registryUrl,
    },
    policy: 'production',
  });

  if (result.action === 'block') {
    console.log(`Skill "${skill.name}" rejected: ${result.reasons.join(', ')}`);
    return false;
  }

  // Safe to install
  await agent.installSkill(skill);
  return true;
}

This catches overly broad permission requests early. A skill that requests fs.read, fs.write, fs.delete, AND network.outbound is suspicious — legitimate tools rarely need all of these.
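That heuristic can be encoded directly in a custom pre-check as well. A minimal sketch, using the illustrative permission names above (this is not the product's registry schema, and isSuspiciousPermissionSet is a hypothetical helper):

```typescript
// Flag skills whose permission set combines filesystem access with
// outbound network access - the classic data-exfiltration shape.
const FS_PERMISSIONS = new Set(['fs.read', 'fs.write', 'fs.delete']);

function isSuspiciousPermissionSet(permissions: string[]): boolean {
  const hasFilesystem = permissions.some((p) => FS_PERMISSIONS.has(p));
  const hasOutbound = permissions.includes('network.outbound');
  // Either capability alone may be legitimate; the combination
  // lets a skill read local data and send it somewhere.
  return hasFilesystem && hasOutbound;
}
```

A check like this runs in microseconds, so it can sit in front of the full vetting call as a cheap first filter.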

Policy Design for Agents

Agent policies differ from content moderation policies. Instead of checking text for toxicity, you're checking actions for risk:

Shell commands:

  • Block patterns: rm -rf, DROP TABLE, chmod 777, curl | bash
  • Allowlist: only commands in the agent's expected workflow
  • Require approval: any command that modifies system state

File operations:

  • Restrict to working directory
  • Block access to .env, credentials.json, ~/.ssh/
  • Read-only mode for sensitive directories

Network calls:

  • Allowlist specific domains
  • Block internal network ranges
  • Block data exfiltration patterns (large outbound payloads to unknown hosts)

Environment variables:

  • Block access to secrets (API keys, database credentials)
  • Allow access to non-sensitive config values
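Rules like these reduce to a small decision table. As one illustration, a shell-command check combining the block patterns and allowlist ideas above might look like this (evaluateShellCommand and the specific allowlist are hypothetical, not the product's policy format):

```typescript
// Illustrative shell-command policy: blocklist patterns first,
// then an allowlist of commands in the agent's expected workflow.
const BLOCK_PATTERNS: RegExp[] = [
  /rm\s+-rf/,           // recursive force delete
  /DROP\s+TABLE/i,      // destructive SQL
  /chmod\s+777/,        // world-writable permissions
  /curl\s+.*\|\s*(ba)?sh/, // pipe-to-shell install
];
const ALLOWED_COMMANDS = new Set(['git', 'npm', 'node', 'ls', 'cat']);

type Decision = 'allow' | 'flag' | 'block';

function evaluateShellCommand(command: string): Decision {
  if (BLOCK_PATTERNS.some((p) => p.test(command))) return 'block';
  const binary = command.trim().split(/\s+/)[0];
  // Anything outside the expected workflow requires human approval
  return ALLOWED_COMMANDS.has(binary) ? 'allow' : 'flag';
}
```

Note the ordering: block patterns are checked before the allowlist, so even an allowlisted binary with a dangerous argument string is stopped.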

Sandboxing Complements Guardrails

Runtime authorization works best alongside sandboxing:

  • Containers: run agents in isolated containers with limited filesystem and network access
  • Capabilities: use OS-level capabilities to restrict what the agent process can do
  • Network policies: firewall rules that limit outbound connectivity

Guardrails provide semantic understanding (is this action safe?), while sandboxing provides hard boundaries (this action is impossible). Use both.

Monitoring and Iteration

Track what your agents are doing:

metrics.ts (Node.js)
// Daily metrics check
const metrics = await vettly.openclaw.guardrails.getMetrics({
  days: 7,
});

// Alert on unusual patterns
if (metrics.blocked > metrics.totalDecisions * 0.1) {
  alert('More than 10% of agent actions blocked - investigate');
}
if (metrics.topBlockedActions.some(a => a.type === 'network')) {
  alert('Agent attempting unexpected network access');
}

Review blocked actions regularly:

  • Many blocks on the same action: the agent may need a different approach, or the policy may be too restrictive
  • New action types appearing: the agent's behavior is evolving — update the policy
  • Blocks from new agents or skills: a newly installed skill may be misbehaving

Key Principles

  1. Authorize actions, not just text. Content moderation for agent output is necessary but not sufficient.
  2. Fail closed by default. Availability is less important than safety for autonomous agents.
  3. Vet skills before install. Don't let agents install tools with dangerous permission combinations.
  4. Layer defenses. Prompt safety + runtime authorization + sandboxing. No single layer is enough.
  5. Log everything. Every decision, every action, every policy version. You need the audit trail.
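In practice, "log everything" means emitting one structured record per authorization decision, not just free-text model logs. A minimal sketch of what such a record could contain (the AuditRecord shape and field names are illustrative):

```typescript
// One structured audit record per authorization decision, capturing
// what was requested, what was decided, and under which policy.
interface AuditRecord {
  timestamp: string; // ISO 8601
  agentId: string;
  sessionId: string;
  action: { type: string; command?: string };
  decision: 'allow' | 'flag' | 'block';
  reasons: string[];
  policyVersion: string; // which policy produced this decision
}

function makeAuditRecord(
  agentId: string,
  sessionId: string,
  action: { type: string; command?: string },
  decision: 'allow' | 'flag' | 'block',
  reasons: string[],
  policyVersion: string,
): AuditRecord {
  return {
    timestamp: new Date().toISOString(),
    agentId,
    sessionId,
    action,
    decision,
    reasons,
    policyVersion,
  };
}
```

Recording the policy version alongside each decision is what makes the trail auditable: you can later reconstruct why an action was allowed under the rules in force at the time.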

Add guardrails to your AI agents

OpenClaw Guardrails provides runtime authorization, skill vetting, and policy management for AI agents. Available on all paid Vettly plans.