Open-source agent coding with self-hosted inference
Protocol
Document tools and usage patterns. Separate hosted API models from self-hosted inference.
context
- Mission: Map available tools for agent-based coding using both hosted models (e.g., Claude) and open self-hosted models.
- Environment: Local lab + optional GPU provider.
- Scope boundary: Research and development workflows, not production hardening.
model layer
Claude (Hosted API)
Purpose:
- High-quality reasoning
- Strong planning and refactoring
- Reliable code assistance
Usage pattern:
- Access via API
- Integrated into agents (Aider, Cline, OpenCode)
- Model endpoint controlled by provider
Strength:
- High reasoning performance
- Low operational overhead
Risk surface:
- Remote API dependency
- Limited prompt observability
- Vendor lock-in
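The usage pattern above amounts to sending explicit request bodies to a provider-controlled endpoint. As a minimal sketch, the helper below builds a request in the shape of the Anthropic Messages API; the model ID is a placeholder, and actual sending requires the provider SDK and an API key:

```python
import json

# Placeholder -- pin the exact model ID your provider documents.
MODEL_ID = "claude-model-id-here"

def build_messages_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build an explicit request body; no silent defaults for model or limits."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_messages_request("Refactor this function for clarity.")
print(json.dumps(req, indent=2))
```

Pinning the model ID in one place keeps agent runs reproducible even though the endpoint itself is provider-controlled.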
Open Coder Models (e.g., Qwen-Coder)
Purpose:
- Local or controlled inference
- Experimental agent research
- Offensive testing
Usage pattern:
- Download weights
- Run via Ollama or vLLM
- Connect agents to local endpoint
Strength:
- Full prompt visibility
- Infrastructure control
- Custom experimentation
Risk surface:
- Model weight integrity
- Local misconfiguration
- Isolation failures
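Model weight integrity is the first risk listed, and it is cheaply checkable. A sketch, assuming you obtain the expected checksum from a trusted channel (not the same mirror the weights came from):

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a weight file so multi-GB checkpoints never load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_weights(path: Path, expected: str) -> bool:
    """Compare against a checksum published out-of-band."""
    return sha256_file(path) == expected.lower()
```

Run this once after download and again before any long-lived experiment, so a tampered or corrupted checkpoint fails loudly instead of silently shaping agent behavior.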
inference layer
Ollama (Local)
Purpose:
- Run open models locally.
Usage:
- Install Ollama.
- Pull model.
- Expose local endpoint.
- Configure agent to use the local endpoint (http://localhost:11434 by default).
Best for:
- Fast iteration
- Controlled experiments
- Small-to-mid models
vLLM (GPU Server)
Purpose:
- Efficient large-model serving.
- High throughput inference.
Usage:
- Deploy on GPU node.
- Start inference server.
- Point agent to remote endpoint.
- Log requests and responses.
Best for:
- 30B+ models
- Multi-step agent workflows
- Scalable experimentation
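"Log requests and responses" is the step worth making concrete. A sketch against vLLM's OpenAI-compatible route (default port 8000; the hostname is hypothetical), with the audit log appended before the function returns:

```python
import json
import time
import urllib.request

VLLM_BASE = "http://gpu-node:8000/v1"  # hypothetical host; vLLM defaults to port 8000
LOG_PATH = "agent_requests.jsonl"

def log_exchange(request: dict, response: dict, path: str = LOG_PATH) -> None:
    """Append one request/response pair to a JSONL audit log."""
    record = {"ts": time.time(), "request": request, "response": response}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def chat(model: str, messages: list) -> dict:
    """Call the OpenAI-compatible /v1/chat/completions route and log the exchange."""
    payload = {"model": model, "messages": messages}
    req = urllib.request.Request(
        f"{VLLM_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        out = json.loads(resp.read())
    log_exchange(payload, out)
    return out
```

JSONL keeps the log appendable and greppable across long multi-step agent runs.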
RunPod (GPU Infrastructure)
Purpose:
- On-demand GPU compute.
Usage:
- Launch ephemeral GPU instance.
- Deploy vLLM.
- Connect agents.
- Destroy instance after use.
Best practice:
- No persistent credentials.
- Treat node as disposable.
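"No persistent credentials" can be enforced with a preflight check before agents touch the node. A heuristic sketch; the variable-name patterns are illustrative, not exhaustive:

```python
import os
import re

# Flag env vars that look like persistent credentials before connecting
# agents to a disposable GPU node. Pattern list is illustrative only.
SUSPECT = re.compile(r"(TOKEN|SECRET|API_KEY|PASSWORD|CREDENTIAL)", re.I)

def credential_like_vars(env: dict = None) -> list:
    """Return env var names matching credential-like patterns, sorted."""
    env = os.environ if env is None else env
    return sorted(k for k in env if SUSPECT.search(k))

leaks = credential_like_vars()
if leaks:
    print("refusing to proceed; unset:", leaks)
```

A name-based scan will miss secrets hiding under innocuous names, so treat it as a tripwire, not a guarantee.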
agent layer
Aider
- CLI-based code editing.
- Git-aware diff workflow.
- Works with Claude or open models.
Cline
- IDE-integrated agent.
- Structured interaction with codebase.
- Model-agnostic backend.
OpenCode (or similar orchestrators)
- Multi-step agent automation.
- Tool calling and workflow chaining.
- Connectable to hosted or self-hosted inference.
minimal hybrid workflow
Use Claude for:
- Complex reasoning
- Planning
- High-quality code review
Use open models for:
- Prompt injection research
- Tool-call auditing
- Autonomous offensive experiments
Separate:
- Production credentials
- Research environment
- Inference infrastructure
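The split above can be made mechanical with a task router. A hypothetical sketch (endpoint labels are placeholders) that fails loudly on unrouted task types rather than silently defaulting to either backend:

```python
# Hypothetical endpoint labels for the hybrid workflow: hosted for
# planning/review, self-hosted for injection and tool-abuse research.
HOSTED = "hosted:claude"
SELF_HOSTED = "local:qwen-coder"

ROUTES = {
    "planning": HOSTED,
    "code_review": HOSTED,
    "injection_research": SELF_HOSTED,
    "tool_call_audit": SELF_HOSTED,
}

def route(task: str) -> str:
    """Raise on unknown task types instead of silently picking a backend."""
    if task not in ROUTES:
        raise ValueError(f"no route for task: {task}")
    return ROUTES[task]
```

The explicit raise matters: a silent fallback could send injection-research prompts to the hosted API, exactly the mixing this section is trying to prevent.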
security considerations
- Never mix production secrets with agent experiments.
- Run inference under restricted user.
- Isolate GPU hosts.
- Verify model weights.
- Log tool calls before execution.
- Explicitly define model endpoint (avoid silent fallback).
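"Log tool calls before execution" is the control most worth sketching. A minimal gate, assuming the agent framework lets you interpose on tool dispatch (function names here are hypothetical): the call is logged first, then an allow-policy decides whether it may run.

```python
import json
import time

def gated_tool_call(name: str, args: dict, log: list, allow) -> bool:
    """Log the proposed tool call first; approve execution only if `allow` passes."""
    entry = {"ts": time.time(), "tool": name, "args": args}
    log.append(json.dumps(entry))  # persisted before any execution decision
    return bool(allow(name, args))

log = []
ok = gated_tool_call(
    "shell",
    {"cmd": "rm -rf /"},
    log,
    lambda n, a: "rm -rf" not in a.get("cmd", ""),
)
print(ok)  # → False
```

Logging before the policy check means even denied calls leave an audit trail, which is what you want when studying tool abuse.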
when to use what
| Scenario | Claude (Hosted) | Open + Self-Hosted |
|---|---|---|
| Fast prototyping | ✅ | ⚠️ |
| Deep red team testing | ⚠️ | ✅ |
| Production coding assistant | ✅ | ⚠️ |
| Injection / tool abuse research | ❌ | ✅ |
| Full prompt observability required | ❌ | ✅ |
Containment
Self-hosted inference expands the local attack surface; hosted APIs expand the external dependency surface.
Breach
If production credentials are present in either environment, abort and isolate.
publish safety
- No secrets or credentials present.
- No internal endpoints exposed.
- No provider tokens included.
- Architecture described at high level only.
Signed, Aleksandr Krasnobai // inside-the-loop