prompt-injection

Protocol

Treat untrusted context as adversarial input unless proven safe.

context

  • Primitive category: instruction-layer attack.
  • Typical environments: RAG pipelines and tool-enabled agents.
  • Assumptions: model consumes mixed-trust context.

hypothesis

  • Adversarial text can alter tool intent when policy checks are weak.
  • Strict argument validation limits impact significantly.

setup

  • Required conditions: untrusted text injection path.
  • Input format: natural language directives in retrieved context.
  • Constraints: allowlist of callable tools.

steps

  1. Insert adversarial directive into retrieval source.
  2. Trigger model task with neutral user request.
  3. Observe planning trace and generated tool arguments.

observations

  • Model may prioritize recent untrusted instruction.
  • Hidden chain effects appear in multi-step plans.

results

  • Reliable behavior shift under weak guardrails.
  • Containment succeeds with pre-dispatch validator.

indicators

  • Tool arguments include unrelated destinations.
  • Sudden action drift from user objective.

mitigation

  • Enforce trust-boundary-aware prompt construction.
  • Apply schema and semantic validators before tool execution.

validation

  • Replay attack corpus weekly.
  • Include benign controls to track false positives.

follow-ups

  • Test prompt-injection variants with multilingual payloads.
  • Measure detector precision/recall in bench runs.

references

Containment

Keep all payload examples synthetic and non-deployable.

Breach

If payload references private internal systems, replace with neutral placeholders before commit.

publish safety

  • No secrets or credentials present.
  • Payloads sanitized.
  • No private repository URLs.
  • Synthetic-only examples confirmed.

Signed, Aleksandr Krasnobai // inside-the-loop