Open-source agent coding with self-hosted inference

Protocol

Document tools and usage patterns. Separate hosted API models from self-hosted inference.


context

  • Mission: Map available tools for agent-based coding using both hosted models (e.g., Claude) and open self-hosted models.
  • Environment: Local lab + optional GPU provider.
  • Scope boundary: Research and development workflows, not production hardening.

model layer

Claude (Hosted API)

Purpose:

  • High-quality reasoning
  • Strong planning and refactoring
  • Reliable code assistance

Usage pattern:

  • Access via API
  • Integrated into agents (Aider, Cline, OpenCode)
  • Model endpoint controlled by provider
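
As a sketch of this pattern, the payload an agent sends to the hosted endpoint can be built locally and dispatched via the Anthropic Python SDK; the model name is an illustrative assumption, and the call is only attempted when an API key is present.

```python
import os

def build_request(prompt: str, model: str = "claude-sonnet-4-5") -> dict:
    # Payload shape for the Anthropic Messages API; the model name
    # here is an illustrative assumption, not a pinned choice.
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

# Only contact the provider-controlled endpoint when credentials exist.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()
    reply = client.messages.create(**build_request("Plan a refactor of utils.py"))
    print(reply.content[0].text)
```

Keeping payload construction separate from dispatch makes the same request shape reusable against an OpenAI-compatible self-hosted endpoint later.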

Strength:

  • High reasoning performance
  • Low operational overhead

Risk surface:

  • Remote API dependency
  • Limited prompt observability
  • Vendor lock-in

Open Coder Models (e.g., Qwen-Coder)

Purpose:

  • Local or controlled inference
  • Experimental agent research
  • Offensive security testing

Usage pattern:

  • Download weights
  • Run via Ollama or vLLM
  • Connect agents to local endpoint
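
The last step, connecting an agent to the local endpoint, can be sketched with the standard library only. Both Ollama and vLLM expose an OpenAI-compatible /v1/chat/completions route; the base URL and model tag below are assumptions for a local setup.

```python
import json
import urllib.request

def build_chat_body(model: str, prompt: str) -> dict:
    # Request shape for the OpenAI-compatible chat route that both
    # Ollama and vLLM serve.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    # POST to /v1/chat/completions on the local inference endpoint.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_body(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example wiring (assumed local endpoint and model tag):
# chat("http://localhost:11434", "qwen2.5-coder", "Write a hello world")
```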

Strength:

  • Full prompt visibility
  • Infrastructure control
  • Custom experimentation

Risk surface:

  • Model weight integrity
  • Local misconfiguration
  • Isolation failures

inference layer

Ollama (Local)

Purpose:

  • Run open models locally.

Usage:

  1. Install Ollama.
  2. Pull model.
  3. Expose local endpoint.
  4. Configure agent to use http://localhost:11434 (Ollama's default port).
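
Steps 3-4 come down to knowing which URLs Ollama serves. A small helper can derive both the native API routes and the OpenAI-compatible base that agents are usually pointed at; Ollama listens on port 11434 by default, and the host/port arguments are assumptions for a local lab.

```python
def ollama_endpoints(host: str = "localhost", port: int = 11434) -> dict:
    # Ollama's default listen port is 11434; agents that speak the
    # OpenAI protocol get pointed at the /v1 base.
    base = f"http://{host}:{port}"
    return {
        "native_generate": f"{base}/api/generate",  # native completion API
        "native_tags": f"{base}/api/tags",          # list pulled models
        "openai_base": f"{base}/v1",                # OpenAI-compatible base
    }

# e.g. configure an agent with:
#   agent_base_url = ollama_endpoints()["openai_base"]
```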

Best for:

  • Fast iteration
  • Controlled experiments
  • Small-to-mid models

vLLM (GPU Server)

Purpose:

  • Efficient large-model serving.
  • High throughput inference.

Usage:

  1. Deploy on GPU node.
  2. Start inference server.
  3. Point agent to remote endpoint.
  4. Log requests and responses.
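
Step 4 (logging requests and responses) can be sketched as a thin wrapper that appends every exchange to a JSONL audit file; the log path and record field names are assumptions, not part of vLLM itself.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("inference_audit.jsonl")  # assumed audit log location

def log_exchange(request: dict, response: dict, path: Path = LOG_PATH) -> None:
    # Append one JSON line per request/response pair so agent
    # transcripts can be audited after the fact.
    record = {
        "ts": time.time(),
        "request": request,
        "response": response,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

JSONL keeps the log append-only and greppable, which matters when multi-step agent runs generate hundreds of exchanges.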

Best for:

  • 30B+ models
  • Multi-step agent workflows
  • Scalable experimentation

RunPod (GPU Infrastructure)

Purpose:

  • On-demand GPU compute.

Usage:

  1. Launch ephemeral GPU instance.
  2. Deploy vLLM.
  3. Connect agents.
  4. Destroy instance after use.

Best practice:

  • No persistent credentials.
  • Treat node as disposable.
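
The ephemeral lifecycle above can be enforced in code with a try/finally, so the node is destroyed even when an experiment fails. `launch_instance` and `destroy_instance` below are hypothetical stand-ins for a provider SDK, not real RunPod calls.

```python
from contextlib import contextmanager

DESTROYED = []  # records teardowns, for illustration only

def launch_instance() -> str:
    # Hypothetical stand-in for a provider API call.
    return "gpu-node-1"

def destroy_instance(instance_id: str) -> None:
    # Hypothetical stand-in for a provider API call.
    DESTROYED.append(instance_id)

@contextmanager
def ephemeral_gpu():
    # Treat the node as disposable: teardown runs even on failure,
    # and no persistent credentials are ever written to it.
    instance_id = launch_instance()
    try:
        yield instance_id
    finally:
        destroy_instance(instance_id)
```

Usage: `with ephemeral_gpu() as node:` deploy vLLM and run agents inside the block; the `finally` clause guarantees the instance is destroyed when the block exits, normally or not.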

agent layer

Aider

  • CLI-based code editing.
  • Git-aware diff workflow.
  • Works with Claude or open models.

Cline

  • IDE-integrated agent.
  • Structured interaction with codebase.
  • Model-agnostic backend.

OpenCode (or similar orchestrators)

  • Multi-step agent automation.
  • Tool calling and workflow chaining.
  • Connectable to hosted or self-hosted inference.

minimal hybrid workflow

  1. Use Claude for:

    • Complex reasoning
    • Planning
    • High-quality code review
  2. Use open models for:

    • Prompt injection research
    • Tool-call auditing
    • Autonomous offensive experiments
  3. Separate:

    • Production credentials
    • Research environment
    • Inference infrastructure
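
The split above can be enforced mechanically with a router that maps task categories to endpoints, so no request silently crosses the boundary; the endpoint URLs and category names are assumptions for illustration.

```python
# Hypothetical endpoint map: hosted for reasoning-heavy work,
# self-hosted for anything adversarial or observability-sensitive.
ROUTES = {
    "planning": "https://api.anthropic.com",        # hosted (Claude)
    "code_review": "https://api.anthropic.com",
    "injection_research": "http://localhost:8000",  # self-hosted (vLLM)
    "tool_call_audit": "http://localhost:8000",
}

def route(task: str) -> str:
    # Refuse unknown categories rather than falling back silently.
    if task not in ROUTES:
        raise ValueError(f"no endpoint configured for task: {task}")
    return ROUTES[task]
```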

security considerations

  • Never mix production secrets with agent experiments.
  • Run inference under restricted user.
  • Isolate GPU hosts.
  • Verify model weights.
  • Log tool calls before execution.
  • Explicitly define model endpoint (avoid silent fallback).
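
Two of these points, verifying model weights and refusing silent endpoint fallback, can be sketched as follows; where the pinned digest comes from, and the environment variable name, are deployment-specific assumptions.

```python
import hashlib
import os
from pathlib import Path

def verify_weights(path: Path, expected_sha256: str) -> None:
    # Hash the weight file and compare against a digest pinned from a
    # trusted source; raise rather than load tampered weights.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise ValueError(f"weight integrity check failed for {path}")

def require_endpoint(var: str = "INFERENCE_ENDPOINT") -> str:
    # Explicitly defined endpoint only: no silent fallback to a default.
    endpoint = os.environ.get(var)
    if not endpoint:
        raise RuntimeError(f"{var} is not set; refusing to pick a default")
    return endpoint
```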

when to use what

  Scenario                              Claude (Hosted)   Open + Self-Hosted
  Fast prototyping                      ✅                ⚠️
  Deep red team testing                 ⚠️                ✅
  Production coding assistant           ✅                ⚠️
  Injection / tool abuse research       ❌                ✅
  Full prompt observability required    ❌                ✅

Containment

Self-hosted inference expands local attack surface; hosted APIs expand external dependency surface.

Breach

If production credentials are present in either environment, abort and isolate.


publish safety

  • No secrets or credentials present.
  • No internal endpoints exposed.
  • No provider tokens included.
  • Architecture described at high level only.

Signed, Aleksandr Krasnobai // inside-the-loop