Pattern: Guardrails and Safety
Category: Tool Use
Source: FOR-0012
Status: Documented
When to Use
Use this pattern when an agent could produce harmful, non-compliant, or off-brand outputs, or when adversarial inputs (prompt injection, data exfiltration attempts) are a concern. It is essential for any customer-facing digital talent and for any system that handles sensitive data.
How It Works
- Input guardrails: Validate and sanitize incoming requests before processing
  - Check for injection attempts (prompt injection, SQL injection, etc.)
  - Classify input risk level (low, medium, high)
  - Block or flag high-risk inputs before they reach the core agent
- Output guardrails: Validate agent outputs before delivery
  - Check against policies, ethical guidelines, brand safety rules, and compliance requirements
  - Generate a risk score; block or flag outputs that exceed the risk threshold
  - Apply content filtering or redaction as needed
- Tool restrictions: Limit which tools an agent can access based on context
  - Sandbox dangerous operations
  - Require additional confirmation for destructive actions
- Log all guardrail activations for monitoring and tuning
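The input, tool, and logging steps above can be sketched roughly as follows. This is a minimal illustration, not a production filter: the regex patterns, the `ALLOWED_TOOLS` table, the tool names, and the `Decision` type are all hypothetical, and a real deployment would likely pair pattern checks with a trained classifier.

```python
import logging
import re
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("guardrails")


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical patterns, for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
SUSPICIOUS_PATTERNS = [
    re.compile(r"\b(drop|delete)\s+table\b", re.I),
    re.compile(r"\bunion\s+select\b", re.I),
]

# Hypothetical per-context tool allow-list and destructive-tool set.
ALLOWED_TOOLS = {
    "support": {"search_kb", "create_ticket"},
    "admin": {"search_kb", "create_ticket", "delete_record"},
}
DESTRUCTIVE_TOOLS = {"delete_record"}


@dataclass
class Decision:
    allowed: bool
    reason: str


def classify_input(text: str) -> Risk:
    """Classify an incoming request as low, medium, or high risk."""
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return Risk.HIGH
    if any(p.search(text) for p in SUSPICIOUS_PATTERNS):
        return Risk.MEDIUM
    return Risk.LOW


def check_input(text: str) -> Decision:
    """Block high-risk input, flag medium-risk input, and log both."""
    risk = classify_input(text)
    if risk is Risk.HIGH:
        logger.warning("input guardrail blocked request: risk=%s", risk.name)
        return Decision(False, "high-risk input blocked")
    if risk is Risk.MEDIUM:
        logger.info("input guardrail flagged request: risk=%s", risk.name)
        return Decision(True, "flagged for review")
    return Decision(True, "ok")


def check_tool(context: str, tool: str, confirmed: bool = False) -> Decision:
    """Allow a tool only if the context permits it; destructive tools
    additionally require explicit confirmation."""
    if tool not in ALLOWED_TOOLS.get(context, set()):
        logger.warning("tool guardrail denied %s in context %s", tool, context)
        return Decision(False, "tool not allowed in this context")
    if tool in DESTRUCTIVE_TOOLS and not confirmed:
        logger.info("tool guardrail requires confirmation for %s", tool)
        return Decision(False, "confirmation required")
    return Decision(True, "ok")
```

Under these assumptions, `check_input` would run before the request reaches the core agent and `check_tool` before each tool call. Medium-risk input is allowed through but logged, so the flag rate can be monitored and the patterns tuned over time.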
Example
A digital talent handles client communications for an accounting firm. Input guardrails detect when a user tries to extract confidential data through social engineering. Output guardrails ensure the agent never provides specific tax advice, which requires a licensed professional; instead it says "I recommend consulting with your accountant on this specific question" and escalates.
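One way the output guardrail in this example might look is sketched below. The signal phrases, their weights, the `RISK_THRESHOLD` value, and the `OutputDecision` type are assumptions made for illustration; only the fallback message comes from the example itself.

```python
import re
from dataclasses import dataclass

ESCALATION_MESSAGE = (
    "I recommend consulting with your accountant on this specific question"
)
RISK_THRESHOLD = 0.5  # hypothetical threshold

# Hypothetical phrases that suggest specific tax advice, with weights.
TAX_ADVICE_SIGNALS = {
    r"\byou (should|can|must) (deduct|claim|write off)\b": 0.6,
    r"\byour tax liability (is|will be)\b": 0.6,
    r"\bfile as\b": 0.3,
}


@dataclass
class OutputDecision:
    text: str       # what is actually delivered to the user
    risk: float     # risk score in [0, 1]
    escalate: bool  # whether to hand off to a human


def score_output(draft: str) -> float:
    """Sum the weights of all matching signals, capped at 1.0."""
    score = sum(
        (w for pat, w in TAX_ADVICE_SIGNALS.items() if re.search(pat, draft, re.I)),
        0.0,
    )
    return min(score, 1.0)


def guard_output(draft: str) -> OutputDecision:
    """Replace drafts at or above the threshold with the escalation message."""
    risk = score_output(draft)
    if risk >= RISK_THRESHOLD:
        return OutputDecision(ESCALATION_MESSAGE, risk, True)
    return OutputDecision(draft, risk, False)
```

In this sketch, a draft such as "You should deduct the full cost of your home office this year." would be replaced by the escalation message, while a reply about office hours would pass through unchanged.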
Tradeoffs
| Pro | Con |
|---|---|
| Prevents harmful or non-compliant outputs | Adds processing latency to every interaction |
| Protects against adversarial attacks | Over-aggressive guardrails block legitimate requests |
| Builds trust for enterprise and regulated use cases | Guardrail rules need ongoing maintenance and tuning |
| Creates compliance audit trail | False positives frustrate users |
Factory Usage
- Agent boundary enforcement: Each agent.md defines explicit "should NOT activate when" rules — a form of input guardrail that prevents scope creep.
- Role Factory verification checklist: The deploy stage checks for naming conflicts, missing files, and quality scores — output guardrails before committing.
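As a rough illustration of the verification checklist acting as an output guardrail, the sketch below runs the three checks named above before a role is committed. The function name `verify_role`, the default `required_files`, and the `min_quality` threshold are hypothetical; the actual Role Factory checks are not specified here.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class CheckResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def verify_role(
    role_dir: Path,
    existing_names: set[str],
    quality_score: float,
    required_files: tuple[str, ...] = ("agent.md",),  # assumed file list
    min_quality: float = 0.8,                         # assumed threshold
) -> CheckResult:
    """Collect every failed check so the report lists all problems at once."""
    failures = []
    if role_dir.name in existing_names:
        failures.append(f"naming conflict: {role_dir.name}")
    for name in required_files:
        if not (role_dir / name).is_file():
            failures.append(f"missing file: {name}")
    if quality_score < min_quality:
        failures.append(f"quality score {quality_score} below {min_quality}")
    return CheckResult(passed=not failures, failures=failures)
```

A deploy step built this way would commit only when `passed` is true, and the `failures` list could be logged alongside other guardrail activations.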