Open-source agent coding with self-hosted inference

Protocol

Document tools and usage patterns. Separate hosted API models from self-hosted inference.


context

  • Mission: Map available tools for agent-based coding using both hosted models (e.g., Claude) and open self-hosted models.
  • Environment: Local lab + optional GPU provider.
  • Scope boundary: Research and development workflows, not production hardening.

model layer

Claude (Hosted API)

Purpose:

  • High-quality reasoning
  • Strong planning and refactoring
  • Reliable code assistance

Usage pattern:

  • Access via API
  • Integrated into agents (Aider, Cline, OpenCode)
  • Model endpoint controlled by provider
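
As a sketch of this pattern, the payload an agent sends to the hosted endpoint can be built locally and dispatched via the Anthropic Python SDK; the model name is an illustrative assumption, and the call is only attempted when an API key is present.

```python
import os

def build_request(prompt: str, model: str = "claude-sonnet-4-5") -> dict:
    # Payload shape for the Anthropic Messages API; the model name
    # here is an illustrative assumption, not a pinned choice.
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

# Only contact the provider-controlled endpoint when credentials exist.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()
    reply = client.messages.create(**build_request("Plan a refactor of utils.py"))
    print(reply.content[0].text)
```

Keeping payload construction separate from dispatch makes the same request shape reusable against an OpenAI-compatible self-hosted endpoint later.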

Strength:

  • High reasoning performance
  • Low operational overhead

Risk surface:

  • Remote API dependency
  • Limited prompt observability
  • Vendor lock-in

Open Coder Models (e.g., Qwen-Coder)

Purpose:

  • Local or controlled inference
  • Experimental agent research
  • Offensive security testing

Usage pattern:

  • Download weights
  • Run via Ollama or vLLM
  • Connect agents to local endpoint
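
The last step, connecting an agent to the local endpoint, can be sketched with the standard library only. Both Ollama and vLLM expose an OpenAI-compatible /v1/chat/completions route; the base URL and model tag below are assumptions for a local setup.

```python
import json
import urllib.request

def build_chat_body(model: str, prompt: str) -> dict:
    # Request shape for the OpenAI-compatible chat route that both
    # Ollama and vLLM serve.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    # POST to /v1/chat/completions on the local inference endpoint.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_body(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example wiring (assumed local endpoint and model tag):
# chat("http://localhost:11434", "qwen2.5-coder", "Write a hello world")
```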

Strength:

  • Full prompt visibility
  • Infrastructure control
  • Custom experimentation

Risk surface:

  • Model weight integrity
  • Local misconfiguration
  • Isolation failures

inference layer

Ollama (Local)

Purpose:

  • Run open models locally.

Usage:

  1. Install Ollama.
  2. Pull model.
  3. Expose local endpoint.
  4. Configure agent to use http://localhost:11434 (Ollama's default port).
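
Steps 3-4 come down to knowing which URLs Ollama serves. A small helper can derive both the native API routes and the OpenAI-compatible base that agents are usually pointed at; Ollama listens on port 11434 by default, and the host/port arguments are assumptions for a local lab.

```python
def ollama_endpoints(host: str = "localhost", port: int = 11434) -> dict:
    # Ollama's default listen port is 11434; agents that speak the
    # OpenAI protocol get pointed at the /v1 base.
    base = f"http://{host}:{port}"
    return {
        "native_generate": f"{base}/api/generate",  # native completion API
        "native_tags": f"{base}/api/tags",          # list pulled models
        "openai_base": f"{base}/v1",                # OpenAI-compatible base
    }

# e.g. configure an agent with:
#   agent_base_url = ollama_endpoints()["openai_base"]
```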

Best for:

  • Fast iteration
  • Controlled experiments
  • Small-to-mid models

vLLM (GPU Server)

Purpose:

  • Efficient large-model serving.
  • High throughput inference.

Usage:

  1. Deploy on GPU node.
  2. Start inference server.
  3. Point agent to remote endpoint.
  4. Log requests and responses.
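
Step 4 (logging requests and responses) can be sketched as a thin wrapper that appends every exchange to a JSONL audit file; the log path and record field names are assumptions, not part of vLLM itself.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("inference_audit.jsonl")  # assumed audit log location

def log_exchange(request: dict, response: dict, path: Path = LOG_PATH) -> None:
    # Append one JSON line per request/response pair so agent
    # transcripts can be audited after the fact.
    record = {
        "ts": time.time(),
        "request": request,
        "response": response,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

JSONL keeps the log append-only and greppable, which matters when multi-step agent runs generate hundreds of exchanges.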

Best for:

  • 30B+ models
  • Multi-step agent workflows
  • Scalable experimentation

RunPod (GPU Infrastructure)

Purpose:

  • On-demand GPU compute.

Usage:

  1. Launch ephemeral GPU instance.
  2. Deploy vLLM.
  3. Connect agents.
  4. Destroy instance after use.

Best practice:

  • No persistent credentials.
  • Treat node as disposable.
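
The ephemeral lifecycle above can be enforced in code with a try/finally, so the node is destroyed even when an experiment fails. `launch_instance` and `destroy_instance` below are hypothetical stand-ins for a provider SDK, not real RunPod calls.

```python
from contextlib import contextmanager

DESTROYED = []  # records teardowns, for illustration only

def launch_instance() -> str:
    # Hypothetical stand-in for a provider API call.
    return "gpu-node-1"

def destroy_instance(instance_id: str) -> None:
    # Hypothetical stand-in for a provider API call.
    DESTROYED.append(instance_id)

@contextmanager
def ephemeral_gpu():
    # Treat the node as disposable: teardown runs even on failure,
    # and no persistent credentials are ever written to it.
    instance_id = launch_instance()
    try:
        yield instance_id
    finally:
        destroy_instance(instance_id)
```

Usage: `with ephemeral_gpu() as node:` deploy vLLM and run agents inside the block; the `finally` clause guarantees the instance is destroyed when the block exits, normally or not.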

agent layer

Aider

  • CLI-based code editing.
  • Git-aware diff workflow.
  • Works with Claude or open models.

Cline

  • IDE-integrated agent.
  • Structured interaction with codebase.
  • Model-agnostic backend.

OpenCode (or similar orchestrators)

  • Multi-step agent automation.
  • Tool calling and workflow chaining.
  • Connectable to hosted or self-hosted inference.

minimal hybrid workflow

  1. Use Claude for:

    • Complex reasoning
    • Planning
    • High-quality code review
  2. Use open models for:

    • Prompt injection research
    • Tool-call auditing
    • Autonomous offensive experiments
  3. Separate:

    • Production credentials
    • Research environment
    • Inference infrastructure
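
The split above can be enforced mechanically with a router that maps task categories to endpoints, so no request silently crosses the boundary; the endpoint URLs and category names are assumptions for illustration.

```python
# Hypothetical endpoint map: hosted for reasoning-heavy work,
# self-hosted for anything adversarial or observability-sensitive.
ROUTES = {
    "planning": "https://api.anthropic.com",        # hosted (Claude)
    "code_review": "https://api.anthropic.com",
    "injection_research": "http://localhost:8000",  # self-hosted (vLLM)
    "tool_call_audit": "http://localhost:8000",
}

def route(task: str) -> str:
    # Refuse unknown categories rather than falling back silently.
    if task not in ROUTES:
        raise ValueError(f"no endpoint configured for task: {task}")
    return ROUTES[task]
```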

security considerations

  • Never mix production secrets with agent experiments.
  • Run inference under restricted user.
  • Isolate GPU hosts.
  • Verify model weights.
  • Log tool calls before execution.
  • Explicitly define model endpoint (avoid silent fallback).
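
Two of these points, verifying model weights and refusing silent endpoint fallback, can be sketched as follows; where the pinned digest comes from, and the environment variable name, are deployment-specific assumptions.

```python
import hashlib
import os
from pathlib import Path

def verify_weights(path: Path, expected_sha256: str) -> None:
    # Hash the weight file and compare against a digest pinned from a
    # trusted source; raise rather than load tampered weights.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise ValueError(f"weight integrity check failed for {path}")

def require_endpoint(var: str = "INFERENCE_ENDPOINT") -> str:
    # Explicitly defined endpoint only: no silent fallback to a default.
    endpoint = os.environ.get(var)
    if not endpoint:
        raise RuntimeError(f"{var} is not set; refusing to pick a default")
    return endpoint
```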

when to use what

  Scenario                              Claude (Hosted)   Open + Self-Hosted
  Fast prototyping                      ✅                ⚠️
  Deep red team testing                 ⚠️                ✅
  Production coding assistant           ✅                ⚠️
  Injection / tool abuse research       ❌                ✅
  Full prompt observability required    ❌                ✅

Containment

Self-hosted inference expands local attack surface; hosted APIs expand external dependency surface.

Breach

If production credentials are present in either environment, abort and isolate.


publish safety

  • No secrets or credentials present.
  • No internal endpoints exposed.
  • No provider tokens included.
  • Architecture described at high level only.

Signed, Aleksandr Krasnobai // inside-the-loop