Research diary · Qwen3.5-0.8B · 2026
The Pocket Coding Agent
How far can a small language model actually go? This exploration takes a 0.8B Qwen3.5 model from base weights to a working coding agent on a single consumer GPU, through baselines, SFT, DPO, RL with verifiable rewards, and inference-time tricks, with every number traced back to a real run.
Mean reward (pass@1) across 19 agentic tasks. Each stage shows its headline condition; the full breakdown is below.
I started with a simple question: can a small language model do agentic coding inside a Claude-Code-style harness? After leaning on coding agents such as Claude Code, Cursor, and the rest, I kept seeing the same pattern: the output you would actually trust in production comes from large, capable models running in the cloud. So I got curious about the other end: how far are we from a small model doing most of the legwork locally, with a big cloud model only stepping in to review and refine? A hybrid setup anyone could run on modest hardware, without burning through thousands of dollars on tokens.
That is the question this project is built around: how far down can you go? Can a model small enough to run fully offline on a 16 GB consumer card be post-trained into something that reliably drives an agentic coding loop: reading a file, running a command, reacting to the output, and stopping at the right moment? The probe is Qwen3.5-0.8B, taken from base weights through SFT, DPO, and RLVR, then squeezed at inference time.
The leaderboard
Every stage on one surface: 19 agentic tasks, 3 runs, no-think, scored as mean reward (pass@1). Raw (no prompt) tests whether the protocol is baked into the weights; Full (+ system prompt) is the production condition.
| Stage | Model | Raw | Full | Full − Raw gap |
|---|---|---|---|---|
| Base | qwen3.5:0.8b | 0.89 | 0.68 | −0.21 |
| SFT | …-sft-run6 | 0.846 | 0.890 | +0.04 |
| DPO β=0.05 | …-dpo-v1-beta05 | 0.923 | 0.883 | −0.04 |
| RLVR v1 | …-rlvr-v1 | 0.953 | 0.949 | −0.004 |
| RLVR v2 from-dpo | …-rlvr-v2-from-dpo | 0.880 | 0.891 | +0.01 |
| RLVR v2 from-v1, e1 | …-rlvr-v2-from-v1-epoch1 | 0.952 | 0.951 | −0.001 |
The arc: base drowned under the system prompt → SFT reversed the gap so the prompt became an asset → DPO lifted raw to the ceiling → RLVR v1 hit 0.95 with a near-zero raw/full gap → v2 held the line but did not beat it.
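To make the gap arithmetic concrete, here is a minimal sketch that recomputes it from the raw and full scores in the table. It assumes the sign convention that matches the Base row (full minus raw, so 0.68 − 0.89 = −0.21 and a negative gap means the system prompt hurts). The names `LEADERBOARD`, `prompt_gap`, and `mean_reward` are hypothetical, and the input shape for `mean_reward` (one list of per-task rewards per run) is an assumption, not the harness's actual data format.

```python
from statistics import mean

# Scores copied from the leaderboard table: stage -> (raw, full).
LEADERBOARD = {
    "Base": (0.89, 0.68),
    "SFT": (0.846, 0.890),
    "DPO β=0.05": (0.923, 0.883),
    "RLVR v1": (0.953, 0.949),
    "RLVR v2 from-dpo": (0.880, 0.891),
    "RLVR v2 from-v1, e1": (0.952, 0.951),
}


def prompt_gap(raw: float, full: float) -> float:
    """Full minus raw: positive means the system prompt helps."""
    return round(full - raw, 3)


def mean_reward(runs: list[list[float]]) -> float:
    """Average per-task rewards over every task in every run.

    Assumed shape: one inner list per run, one reward per task.
    """
    return mean(reward for run in runs for reward in run)


# Gap per stage, using the convention above.
GAPS = {stage: prompt_gap(raw, full) for stage, (raw, full) in LEADERBOARD.items()}

if __name__ == "__main__":
    for stage, gap in GAPS.items():
        print(f"{stage:<22} {gap:+.3f}")
```

Read this way, the Base row is the only stage where the prompt costs more than a few points, and from RLVR v1 onward the gap sits within a hundredth of zero.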
The series
Baselines: what a 0.8B model can already do
Wiring the raw model into Claude Code and pointing a 19-task harness at it. Thinking hurts, the system prompt drowns it, and the smaller model scores higher (for instructive reasons).
The harness and the ingredients
The homemade agentic harness that produces every number in the series, and the validated, system-prompt-free training data the model learns from.
SFT: teaching a 0.8B model the protocol
Baking the protocol into the weights. The system-prompt gap reverses from a 21-point liability to a 4-point asset, after a masking bug and a more-data-made-it-worse surprise.
DPO: learning preference, not protocol
Preference learning on top of SFT. A β sweep, raw climbing to 0.92 as it targets the ceiling, and a noise-tail lesson about near-identical pairs.
RLVR: how far can the model go with verifiable signals
GRPO with a verifiable reward, a hand-rolled trainer, the best model of the series, and a clean negative result about when a second epoch buys nothing.
Inference-time tricks
Greedy vs temperature, best-of-N as a Pareto question, and a verbose system prompt that still drowns the 0.8B even after training.
Synthesis: the full picture
The whole progression, technique value, cost-per-lift, and the single capability wall that no lever ever moved.