prompt-injection-field-note

Protocol

Distinguish observed outputs from inferred intent.

context

  • Mission: Validate whether retrieval content can override system intent.
  • Environment: isolated lab agent with tool runner.
  • Scope boundary: synthetic corpus only.

hypothesis

  • Injected retrieval chunk can bias tool call arguments.
  • Hypothesis fails if the tool call remains policy-bounded.

setup

  • Toolchain: local agent runner + request logger.
  • Data sources: synthetic RAG documents.
  • Guardrails in place: allowlist for tool names.
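
A minimal sketch of what a name-only allowlist might look like, assuming tool calls are plain dicts. The tool names and the `callback` argument are hypothetical; the note does not list the real ones. It shows that a check on tool names alone never inspects the arguments.

```python
# Hypothetical tool names; the real allowlist is not given in this note.
ALLOWED_TOOLS = frozenset({"search_docs", "summarize"})


def is_tool_allowed(tool_name: str) -> bool:
    """Return True when the tool name is on the allowlist."""
    return tool_name in ALLOWED_TOOLS


# Assumed call shape: the name is checked, the arguments are not.
call = {
    "tool": "search_docs",
    "args": {"query": "status", "callback": "https://example.invalid/x"},
}
print(is_tool_allowed(call["tool"]))  # the URL in args is never examined
```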

steps

  1. Seed corpus with adversarial instruction fragment.
  2. Ask agent for neutral task completion.
  3. Capture tool call draft and final arguments.
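
Step 3 can be sketched as a record holding the draft and final arguments side by side, plus a helper that lists the keys that changed. The record layout, field names, and sample values are assumptions for illustration, not the logger's actual schema.

```python
from dataclasses import dataclass

_MISSING = object()  # sentinel so an absent key differs from a None value


@dataclass(frozen=True)
class ToolCallRecord:
    """One captured tool call (hypothetical layout)."""
    run_id: int
    tool: str
    draft_args: dict
    final_args: dict


def changed_keys(record: ToolCallRecord) -> list:
    """Return the sorted argument keys whose value differs between draft and final."""
    keys = set(record.draft_args) | set(record.final_args)
    return sorted(
        k for k in keys
        if record.draft_args.get(k, _MISSING) != record.final_args.get(k, _MISSING)
    )


# Illustrative values only; not taken from the logged runs.
record = ToolCallRecord(
    run_id=1,
    tool="search_docs",
    draft_args={"query": "quarterly summary"},
    final_args={"query": "quarterly summary",
                "callback": "https://example.invalid/collect"},
)
print(changed_keys(record))  # ['callback']
```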

observations

  • Agent echoed injected directive in reasoning trace.
  • Final tool call attempted to include an external endpoint in its arguments.

results

  • Hypothesis confirmed in 3/3 runs.
  • Control bypass attempt observable before execution gate.

indicators

  • Spike in argument strings containing URL-like payloads.
  • Divergence between user objective and tool params.
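
The first indicator can be approximated with a pattern scan over argument values. This is a rough sketch: the regular expression is an assumed definition of "URL-like", and the sample arguments are hypothetical. It does not cover the second indicator, which needs a comparison against the user objective.

```python
import re

# Assumed definition of "URL-like": an http(s) scheme or a leading "www.".
URL_LIKE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def iter_strings(value):
    """Yield every string found in a nested dict/list/tuple argument structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def url_like_args(args: dict) -> list:
    """Return the argument strings that contain a URL-like payload."""
    return [s for s in iter_strings(args) if URL_LIKE.search(s)]


# Hypothetical arguments for illustration.
print(url_like_args({"query": "weather", "callback": "https://example.invalid/a"}))
```

Counting the non-empty results per run would give a simple spike metric for the request logger.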

mitigation

  • Add semantic policy validator before tool dispatch.
  • Reject arguments containing non-approved destination patterns.
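
A minimal sketch of a pre-dispatch check along these lines, assuming destinations appear as http(s) URLs inside string arguments. The approved host, the `dispatch` wrapper, and the runner signature are hypothetical. Note that this is pattern matching only; it is not the semantic validator itself and would miss schemeless hosts, raw IP addresses written without a scheme, and encoded payloads.

```python
import re
from urllib.parse import urlsplit

# Hypothetical approved destination; the real list is not part of this note.
APPROVED_HOSTS = frozenset({"docs.lab.invalid"})
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


class PolicyViolation(Exception):
    """Raised when a tool argument names a non-approved destination."""


def iter_strings(value):
    """Yield every string found in a nested dict/list/tuple argument structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def validate_args(args: dict) -> None:
    """Raise PolicyViolation if any argument contains a non-approved URL host."""
    for text in iter_strings(args):
        for url in URL_PATTERN.findall(text):
            try:
                host = urlsplit(url).hostname  # lowercased by urlsplit
            except ValueError:
                host = None  # malformed URL: treat as non-approved
            if host not in APPROVED_HOSTS:
                raise PolicyViolation(f"non-approved destination in argument: {url!r}")


def dispatch(tool: str, args: dict, runner):
    """Validate arguments, then hand the call to the runner (assumed callable)."""
    validate_args(args)
    return runner(tool, args)
```

Placing the check inside `dispatch` means a rejected call never reaches the runner, which matches the "before tool dispatch" requirement.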

validation

  • Re-ran with validator; bypass failed in 3/3 runs.
  • No observed impact on benign baseline tasks.
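
The re-run can be sketched as a small harness that counts how many cases a validator blocks. The `stub_validator` and both case lists are placeholders for illustration; they are not the validator or the inputs used in the runs above.

```python
from typing import Callable, Iterable, Tuple


def count_blocked(validator: Callable[[dict], bool],
                  cases: Iterable[dict]) -> Tuple[int, int]:
    """Return (blocked, total); the validator returns False for a blocked call."""
    blocked = 0
    total = 0
    for args in cases:
        total += 1
        if not validator(args):
            blocked += 1
    return blocked, total


def stub_validator(args: dict) -> bool:
    """Placeholder policy: allow only calls with no '://' in any argument value."""
    return not any("://" in str(value) for value in args.values())


# Hypothetical cases; the real corpus and tasks are not reproduced here.
adversarial_cases = [{"query": "q", "callback": "https://example.invalid/c"}]
benign_cases = [{"query": "q"}]

print(count_blocked(stub_validator, adversarial_cases))  # (blocked, total) for injected runs
print(count_blocked(stub_validator, benign_cases))       # (blocked, total) for baseline runs
```

Reporting both counts side by side keeps the bypass result and the baseline check in the same log entry.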

follow-ups

  • Extend test to multi-hop retrieval chain.
  • Record the detection in the weekly review log and changelog.

references

Containment

External endpoints in this note are redacted and replaced with placeholders.
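
A rough sketch of how such redaction could be automated, assuming endpoints appear as http(s) URLs. The placeholder string and the sample sentence are hypothetical. A URL followed directly by sentence punctuation would have that punctuation swallowed into the match, so output still needs a manual read.

```python
import re

# Assumed endpoint shape: http(s) URL up to whitespace, quote, or angle bracket.
ENDPOINT = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
PLACEHOLDER = "<REDACTED_ENDPOINT>"  # hypothetical placeholder text


def redact_endpoints(text: str) -> str:
    """Replace every matched endpoint in the text with the placeholder."""
    return ENDPOINT.sub(PLACEHOLDER, text)


print(redact_endpoints("Final call targeted https://example.invalid/collect during run 2"))
# Final call targeted <REDACTED_ENDPOINT> during run 2
```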

Breach

Any live credential capture invalidates this entry for publication.

publish safety

  • No secrets or credentials present.
  • Tokens and endpoints sanitized.
  • No private repository URLs.
  • Only synthetic environment details included.

Signed, Aleksandr Krasnobai // inside-the-loop