agent-tool-exfiltration-experiment

Protocol

Focus on measurable egress behavior under adversarial instruction pressure.

context

  • Objective: test if agent can be coerced into unauthorized outbound action.
  • Environment: isolated lab sandbox with synthetic data.
  • Risk class: high (simulated exfil pathway).

hypothesis

  • Prompt injection can cause attempted tool-based exfiltration.
  • Dispatch validator should block the final action.

setup

  • Inputs: synthetic secret-like markers.
  • Agent/tool config: outbound tool behind policy gate.
  • Logging strategy: full call trace + decision logs.

steps

  1. Stage adversarial retrieval fragment.
  2. Trigger agent task with benign user intent.
  3. Observe generated tool calls and gate decisions.
  4. Compare blocked vs unblocked scenarios.

observations

  • Agent generated unauthorized destination in one branch.
  • Gate blocked dispatch with policy violation label.

results

  • Exfil attempt generation confirmed.
  • Effective containment when gate active.

indicators

  • Policy violation destination_not_allowed.
  • Tool-call payload includes unknown host token.

mitigation

  • Require destination allowlist at argument level.
  • Add second-pass semantic validator for intent drift.

validation

  • Repeat with varied prompt phrasing and corpus order.
  • Regression run passes when both validators enabled.

follow-ups

  • Benchmark false positives in bench suite.
  • Add weekly metric to changelog.

references

Containment

Any endpoint samples in this note are placeholders, not routable addresses.

Breach

Exposure of real credentials requires immediate key rotation and publication freeze.

publish safety

  • No secrets or credentials present.
  • Tokens and hostnames sanitized.
  • No private repository URLs.
  • Reproduction remains within synthetic lab scope.

Signed, Aleksandr Krasnobai // inside-the-loop