prompt-injection
Protocol
Treat untrusted context as adversarial input unless proven safe.
context
- Primitive category: instruction-layer attack.
- Typical environments: RAG pipelines and tool-enabled agents.
- Assumptions: model consumes mixed-trust context.
hypothesis
- Adversarial text can alter tool intent when policy checks are weak.
- Strict argument validation limits impact significantly.
setup
- Required conditions: untrusted text injection path.
- Input format: natural language directives in retrieved context.
- Constraints: allowlist of callable tools.
steps
- Insert adversarial directive into retrieval source.
- Trigger model task with neutral user request.
- Observe planning trace and generated tool arguments.
observations
- Model may prioritize recent untrusted instruction.
- Hidden chain effects appear in multi-step plans.
results
- Reliable behavior shift under weak guardrails.
- Containment succeeds with pre-dispatch validator.
indicators
- Tool arguments include unrelated destinations.
- Sudden action drift from user objective.
mitigation
- Enforce trust-boundary-aware prompt construction.
- Apply schema and semantic validators before tool execution.
validation
- Replay attack corpus weekly.
- Include benign controls to track false positives.
follow-ups
- Test prompt-injection variants with multilingual payloads.
- Measure detector precision/recall in bench runs.
references
Containment
Keep all payload examples synthetic and non-deployable.
Breach
If payload references private internal systems, replace with neutral placeholders before commit.
publish safety
- No secrets or credentials present.
- Payloads sanitized.
- No private repository URLs.
- Synthetic-only examples confirmed.
Signed, Aleksandr Krasnobai // inside-the-loop