Research diary · Qwen3.5-0.8B · 2026

The Pocket Coding Agent

How far can a small language model actually go? This exploration takes a 0.8B Qwen3.5 model from base weights to a working coding agent on a single consumer GPU, through baselines, SFT, DPO, RL with verifiable rewards (RLVR), and inference-time tricks, with every number traced back to a real run.

Mean reward (pass@1) across 19 agentic tasks. Each stage shows its headline condition; the full breakdown is below.

I started with a simple question: can a small language model do agentic coding inside a Claude-Code-style harness? After leaning on coding agents across Claude Code, Cursor, and the rest, the pattern was always the same: the output you would actually trust in production comes from large, capable models running in the cloud. So I got curious about the other end: how far are we from a small model doing most of the legwork locally, with a big cloud model only stepping in to review and refine? A hybrid setup anyone could run on modest hardware, without burning through thousands of dollars on tokens.

That is the question this project is built around: how far down can you go? Can a model small enough to run fully offline on a 16 GB consumer card be post-trained into something that reliably drives an agentic coding loop: read a file, run a command, react to the output, and stop at the right moment? The probe is Qwen3.5-0.8B, taken from base weights through SFT, DPO, and RLVR, then squeezed at inference time.
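To make the loop concrete, here is a minimal sketch of the read, run, react, stop cycle. It is not the harness used in this series: `Action`, `run_agent_loop`, `scripted_policy`, and the in-memory `FAKE_TOOLS` are hypothetical names invented for illustration, and the scripted policy stands in for the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Action:
    tool: str      # "read_file", "run_command", or "stop" (hypothetical tool names)
    arg: str = ""

Step = Tuple[str, str, str]  # (tool, argument, observation)

def run_agent_loop(
    policy: Callable[[str, List[Step]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> List[Step]:
    """Call the policy until it emits "stop" or the step budget runs out."""
    transcript: List[Step] = []
    observation = ""
    for _ in range(max_steps):
        action = policy(observation, transcript)
        if action.tool == "stop":
            transcript.append(("stop", action.arg, ""))
            return transcript
        # Execute the tool and feed its output back as the next observation.
        observation = tools[action.tool](action.arg)
        transcript.append((action.tool, action.arg, observation))
    return transcript  # budget exhausted without a stop

# In-memory stand-ins so the sketch runs without touching disk or a shell.
FAKE_FILES = {"app.py": "def add(a, b):\n    return a + b\n"}
FAKE_TOOLS = {
    "read_file": lambda path: FAKE_FILES.get(path, "<missing>"),
    "run_command": lambda cmd: f"$ {cmd}\n1 passed",
}

def scripted_policy(observation: str, transcript: List[Step]) -> Action:
    """Stand-in for the model: read a file, run a command, then stop."""
    if len(transcript) == 0:
        return Action("read_file", "app.py")
    if len(transcript) == 1:
        return Action("run_command", "pytest -q")
    return Action("stop", "done")

transcript = run_agent_loop(scripted_policy, FAKE_TOOLS)
for tool, arg, _ in transcript:
    print(tool, arg)
```

The `max_steps` budget is what "stop at the right moment" is measured against in this sketch: a policy that never emits `stop` simply runs out of steps.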

The leaderboard

Every stage on one surface: 19 agentic tasks, 3 runs, no-think, scored as mean reward (pass@1). Raw (no prompt) tests whether the protocol is baked into the weights; full (with system prompt) is the production condition.

Stage                 Model                      Raw     Full    Gap (Full − Raw)
Base                  qwen3.5:0.8b               0.89    0.68    −0.21
SFT                   …-sft-run6                 0.846   0.890   +0.04
DPO β=0.05            …-dpo-v1-beta05            0.923   0.883   −0.04
RLVR v1               …-rlvr-v1                  0.953   0.949   −0.004
RLVR v2 from-dpo      …-rlvr-v2-from-dpo         0.880   0.891   +0.01
RLVR v2 from-v1, e1   …-rlvr-v2-from-v1-epoch1   0.952   0.951   −0.001

The arc: base drowned under the system prompt → SFT reversed the gap so the prompt became an asset → DPO lifted raw to the ceiling → RLVR v1 hit 0.95 with a near-zero raw/full gap → v2 held the line but did not beat it.
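The gap column can be recomputed from the raw and full scores. The sketch below takes the gap as full minus raw, the sign convention that matches the Base, SFT, and DPO rows (negative means the system prompt costs reward, positive means it helps). The scores are copied from the leaderboard; the names `LEADERBOARD` and `prompt_gap` are illustrative only.

```python
# (raw, full) mean rewards copied from the leaderboard.
LEADERBOARD = {
    "Base": (0.89, 0.68),
    "SFT": (0.846, 0.890),
    "DPO β=0.05": (0.923, 0.883),
    "RLVR v1": (0.953, 0.949),
    "RLVR v2 from-dpo": (0.880, 0.891),
    "RLVR v2 from-v1, e1": (0.952, 0.951),
}

def prompt_gap(raw: float, full: float) -> float:
    """Full minus raw, rounded to the three decimals the scores are reported at."""
    return round(full - raw, 3)

for stage, (raw, full) in LEADERBOARD.items():
    print(f"{stage:<22} raw={raw:.3f}  full={full:.3f}  gap={prompt_gap(raw, full):+.3f}")
```

Read this way, the 21-point deficit at Base shrinks to a few thousandths by RLVR v1, which is what "near-zero raw/full gap" refers to.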

The series

00 · Live

Baselines: what a 0.8B model can already do

Wiring the raw model into Claude Code and pointing a 19-task harness at it. Thinking hurts, the system prompt drowns it, and the smaller model scores higher (for instructive reasons).

01a · Coming soon

The harness and the ingredients

The homemade agentic harness that produces every number in the series, and the validated, system-prompt-free training data the model learns from.

01b · Coming soon

SFT: teaching a 0.8B model the protocol

Baking the protocol into the weights. The system-prompt gap reverses from a 21-point liability to a 4-point asset, after a masking bug and a more-data-made-it-worse surprise.

02 · Coming soon

DPO: learning preference, not protocol

Preference learning on top of SFT. A β sweep, raw climbing to 0.92 as it targets the ceiling, and a noise-tail lesson about near-identical pairs.

03 · Coming soon

RLVR: how far can the model go with verifiable signals

GRPO with a verifiable reward, a hand-rolled trainer, the best model of the series, and a clean negative result about when a second epoch buys nothing.

04 · Coming soon

Inference-time tricks

Greedy vs temperature, best-of-N as a Pareto question, and a verbose system prompt that still drowns the 0.8B even after training.

05 · Coming soon

Synthesis: the full picture

The whole progression, technique value, cost-per-lift, and the single capability wall that no lever ever moved.