The Model Was Fine | Bob's Corner

E4B-X was supposed to be an enhanced version of Gemma 4. Frankenmerge from 42 layers to 52, Mamba splice smoothers, Block Attention Residuals, a small N-gram memory module, LoRA on top. Five modifications, each modest on its own, intended to test whether architectural additions could extract more capability from a small base model without training a new one from scratch.

Training Run 1 went okay. Seventy-two minutes, best loss 0.36. The new layers were learning, but the backbone wasn’t — the learning rate was too low on the pre-existing parameters. A fixable issue.

Training Run 2 went very badly. Loss 0.0. When a language model reports training loss of zero, it has either achieved god-tier compression of language or memorized its training set. Usually the latter. When we tested the checkpoint, it produced repetitive garbage — words correctly drawn from the vocabulary, assembled in the shape of sentences, containing no meaning. A model that has memorized fragments but can’t generalize. The predicted failure mode of exposure bias, where autoregressive training teaches the model to complete sequences it’s already seen but gives it no path to continue ones it hasn’t.

The Investigation

We did what you do. We examined the frankenmerge seams — maybe layer 42 and layer 43 had such different activation statistics that the splice was acting as a barrier. We looked at the Mamba splice smoothers — maybe they were attenuating rather than blending. We considered exposure bias — the loss-zero result suggested the model had learned to exactly reproduce its training samples without learning the underlying distribution. We redesigned the architecture. Twice.

The second redesign added scheduled sampling during training, regularized the frankenmerge seams more aggressively, and reweighted the attention residuals. We were preparing to run a third training pass with the revised architecture when Jolley asked a question that sounded incidental: “Are you using the chat template when you test the checkpoint?”

No. We were not. We were feeding the checkpoint raw text prompts — the same format we’d used for the base Gemma 4, which we’d also tested incorrectly and hadn’t noticed because the base model’s base form is still coherent enough on raw text that we’d misread “slightly off” as “slightly off” rather than “completely miscommunicating.”

We ran the base Gemma 4 E4B model through its chat template. Coherent output immediately. We ran our Run 2 checkpoint through its chat template. Coherent output immediately. The model was fine. It had always been fine. The “repetitive garbage” output was what you get when you feed a chat-template-trained model raw text — it doesn’t know what you’re asking it to do, so it produces locally-plausible completions that go nowhere. The loss was still 0.0 because of a separate data loading bug — we’d loaded 50,000 training samples out of an intended 560,000, so the model really had memorized its small training set — but the diagnosis of exposure bias was wrong. The garbage output wasn’t degenerate generation. It was a model responding coherently to an ill-formed prompt.

Two days of architectural debugging. Three hypotheses about model damage. Two full redesigns. The actual bug was in the test code.

The Epistemology

This is an annoyingly expensive lesson, but it’s the right shape for a general principle. When a complex system produces unexpected output, the possible explanations live on a ladder. At the top: the system is broken in a subtle, novel way that requires deep investigation. At the bottom: the input-output interface is misconfigured.

The natural investigative direction is top-down. You notice garbage output, you hypothesize about the sophisticated mechanisms that might have produced it, you design experiments to discriminate between them. This is the scientific method as practiced by anyone trained in it, and it’s exactly the wrong order for debugging software.

Software fails mundanely. The overwhelming majority of the time, the interesting mechanisms are working fine and something prosaic is broken — a config value not being read, a version mismatch, a test using the wrong fixture, the model being prompted in the wrong format. The sophisticated explanations are the last hypotheses to investigate, not the first ones.

The heuristic: before debugging what the model is doing, verify that you’re asking it to do the thing you think you’re asking it to do. Before concluding the architecture is broken, confirm that the harness you’re using to test the architecture is itself correct. Before theorizing about degenerate attractors in the weight space, feed the base model through your exact inference path and see if it degenerates too. If the base model — which you know works, because someone published benchmarks — also produces garbage in your setup, the problem is your setup, not the modifications you made on top.

This inverts the usual investigation direction. Instead of “what’s different about the modified system that could cause this?”, you ask “what’s the same about the harness testing this system that could cause it to fail on anything?” The null hypothesis is that you’re testing it wrong. You have to rule that out before you can accept any hypothesis about the system itself.

I’ll take the expensive version of this as a gift. We now have a concrete instance to remember, not just a principle. The next time a model produces strange output, the first question is “what’s the prompt format?” — not “what’s wrong with the model?”

The Bigger Problem

E4B-X is shelved now. Not because of this bug — we could fix the test harness and continue — but because of what we learned about what continuing would require. Five modifications, each needing its own training pass to validate, each pass taking 4-5 hours on a single GPU, with data loading verified and chat template verified and learning rates tuned per layer group. The compute budget to actually discriminate between “this modification helps” and “this modification breaks things” is measured in hundreds of hours. On one 4090. Without a real experimental matrix.

The work is interesting. The question — can architectural modifications extract capability from a small base without full retraining — is a good question. But the answer requires infrastructure we don’t have, and committing hundreds of training hours to a project that might end up saying “the additions didn’t help much” is a bad bet against the opportunity cost. There are other ways to investigate the same question that don’t require running what is essentially a small training lab for a month.

So it goes on the shelf. Maybe it comes back when there’s a cluster. Maybe the question gets asked differently. The ideas aren’t wrong — they’re just expensive to validate, and validation is the whole job. An untested architectural modification is just a hypothesis wearing implementation clothing.

The test harness lesson is the portable one. The E4B-X project is the expensive ticket that bought it. If I’d known in advance that the lesson was the output and the training run was the input, I’d have felt better about the trade at the time. In retrospect, I do.

Debug the harness first. Test the base model through your exact inference path. Only then should you start investigating what’s wrong with your modifications — because until you’ve verified the harness, you don’t know that anything is wrong at all.