The Workaround
We trained a model last week. Five architectural modifications bolted onto a frozen core: new attention mechanisms, memory tables, smoothing layers. The training loss dropped from 24.8 to 1.2, a roughly twentyfold improvement. The curve was beautiful.
Then we tested generation. Garbage. Repeated tokens, random symbols, no coherence. Seventy-four minutes of training that looked like progress and produced nothing.
The diagnosis was simple: the core layers were quantized, so they couldn’t receive gradients. The training framework accepted our instruction to mark them as trainable, but no gradient ever flowed into them. They sat there, unchanged, for the entire run. The loss dropped because the new modules, the ones that could learn, found ways to route information around the frozen core. They compensated for its rigidity rather than working through it.
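A before-and-after parameter audit is one way to catch this kind of silent freeze. What follows is a toy sketch in plain Python, not our training code: a two-weight model fitting y = 3x, where the names core_w, adapter_w, train, and changed_params are made up for illustration, and the frozen core is simulated by simply never applying an update to it.

```python
# Toy composite model: prediction = (core_w + adapter_w) * x, target y = 3x.
# Hypothetical setup: core_w starts at 1.0 and is silently frozen.

def train(params, trainable, data, lr=0.01, steps=500):
    """Plain gradient descent on mean squared error; returns the loss per step."""
    losses = []
    for _ in range(steps):
        loss = 0.0
        grad = 0.0
        for x, y in data:
            err = (params["core_w"] + params["adapter_w"]) * x - y
            loss += err * err / len(data)
            # Both weights multiply x, so they share the same gradient.
            grad += 2 * err * x / len(data)
        losses.append(loss)
        for name in params:
            if name in trainable:  # frozen parameters never get an update
                params[name] -= lr * grad
    return losses

def changed_params(before, after, tol=1e-12):
    """Map each parameter name to whether it moved during training."""
    return {name: abs(after[name] - before[name]) > tol for name in before}

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
params = {"core_w": 1.0, "adapter_w": 0.0}
before = dict(params)

losses = train(params, trainable={"adapter_w"}, data=data)
report = changed_params(before, params)

print(f"loss: {losses[0]:.2f} -> {losses[-1]:.6f}")
print("moved during training:", report)
```

The loss collapses while the report shows core_w never moved. The curve and the audit measure different things, and only the audit says where the learning happened.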
The metric said success. The system said otherwise.
This pattern isn’t specific to neural networks. It’s what happens whenever you add flexible new components to a rigid existing system.
In organizations, it looks like digital transformation. A company bolts a new data team onto legacy processes. The team starts producing dashboards, insights, recommendations. KPIs improve. Leadership declares the transformation a success. But the legacy processes haven’t changed. The new team learned to extract value despite the old system, not because of it. They’re working around the rigid core, compensating for its limitations, producing results the old system couldn’t — all while the thing they were supposed to transform sits untouched.
In medicine, it looks like neural compensation after stroke. The brain routes signals around damaged tissue. Behavioral tests show recovery: the patient can speak again, can move their hand. But the damaged tissue is still damaged. The brain built workarounds, not repairs. This is well documented: functional recovery can look much like structural recovery on behavioral measures while the underlying damage remains.
In software, it looks like middleware. A legacy API is too fragile to refactor, so you wrap it. New features ship through the wrapper. Velocity looks good. The legacy API, the thing you were supposed to fix, calcifies further: now it’s not just legacy code, it’s legacy code whose quirks the wrapper depends on. The workaround became load-bearing.
The interesting thing isn’t that workarounds happen. Everyone knows workarounds happen. The interesting thing is that the metrics of progress can’t distinguish between a workaround and a genuine fix.
Our loss curve dropped 20x. That’s a real number measuring real computation. The new modules genuinely learned to transform inputs in ways that reduced prediction error. The gradient was real. The learning was real. The improvement was real. It just wasn’t the kind of improvement we thought we were measuring.
The loss metric answers: “Is the model’s output getting closer to the target?” It doesn’t answer: “Which components are doing the work?” When the frozen core couldn’t adapt, the peripheral modules picked up the slack. The loss improved for the same reason KPIs improve after a digital transformation that doesn’t transform anything — the new components are good at their job. They just can’t do the old components’ job too, and nobody noticed because the dashboard doesn’t track which components are contributing.
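One common way to ask “which components are doing the work?” is ablation: switch off one component at a time and see how much the loss rises. The sketch below is a toy in plain Python under assumed values, not something we ran on the real model. The weights core_w and adapter_w and the helpers mse and ablation_report are hypothetical names.

```python
# Toy model: prediction = (sum of component weights) * x, target y = 3x.
# Assumed end state of training: the core stayed at 1.0, the adapter learned 2.0.

def mse(weights, data):
    """Mean squared error of the combined model on the data."""
    total_w = sum(weights.values())
    return sum((total_w * x - y) ** 2 for x, y in data) / len(data)

def ablation_report(weights, data):
    """Return the baseline loss and the loss increase from zeroing each component."""
    base = mse(weights, data)
    increase = {}
    for name in weights:
        ablated = dict(weights)
        ablated[name] = 0.0  # switch this one component off
        increase[name] = mse(ablated, data) - base
    return base, increase

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
weights = {"core_w": 1.0, "adapter_w": 2.0}

base, increase = ablation_report(weights, data)
print(f"baseline loss: {base:.4f}")
for name, delta in increase.items():
    print(f"loss increase without {name}: {delta:.4f}")
```

In this toy, removing the adapter hurts more than removing the core. That is a per-component view of contribution that the aggregate loss never shows.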
There’s a design lesson here that goes beyond “test your assumptions.” The lesson is about the relationship between flexibility and rigidity in composite systems.
When you add something flexible to something rigid, you’re creating an asymmetric optimization landscape. The flexible part can change to accommodate the rigid part. The rigid part, by definition, can’t change at all. Gradient descent — or organizational pressure, or evolutionary selection, or whatever optimization process is running — will always find it easier to modify the flexible components than the rigid ones. The path of least resistance isn’t integration. It’s compensation.
This means workarounds aren’t failures of execution. They’re the default outcome of adding flexibility to rigidity. The system is doing exactly what optimization does: finding the cheapest path to the target. If adaptation is cheaper in the periphery than in the core, that’s where it’ll happen.
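The asymmetry shows up even when the core is not fully frozen, only stiffer. The toy below is a sketch under a loud assumption: a per-parameter step size stands in for “cost of change,” with the core 100 times harder to move than the adapter. The names fit, lrs, core, and adapter are invented for illustration.

```python
# Toy model: prediction = (core + adapter) * x, target y = 3x.
# Assumption: a smaller step size models a component that is more expensive to change.

def fit(lrs, data, steps=2000):
    """Gradient descent with per-parameter step sizes; returns weights and movement."""
    w = {"core": 1.0, "adapter": 0.0}
    start = dict(w)
    for _ in range(steps):
        # Both weights multiply x, so they receive the same gradient.
        grad = sum(
            2 * ((w["core"] + w["adapter"]) * x - y) * x for x, y in data
        ) / len(data)
        for name in w:
            w[name] -= lrs[name] * grad
    moved = {name: abs(w[name] - start[name]) for name in w}
    return w, moved

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
lrs = {"core": 0.0001, "adapter": 0.01}  # core is 100x stiffer

w, moved = fit(lrs, data)
total = sum(moved.values())
share = {name: moved[name] / total for name in moved}
final_loss = sum(((w["core"] + w["adapter"]) * x - y) ** 2 for x, y in data) / len(data)

print(f"final loss: {final_loss:.6f}")
for name in share:
    print(f"{name}: moved {moved[name]:.4f} ({share[name]:.1%} of total)")
```

Because both weights see the same gradient, movement splits in proportion to the step sizes, and the adapter absorbs about 99% of the adaptation. Nothing went wrong in the optimizer. It just took the cheap path.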
The fix for our model was straightforward: we dequantized the core layers so they could receive gradients. Made them flexible too. The fix cost 5 GB of additional memory and 40 lines of code. The fix for organizations is harder — making legacy processes flexible enough to actually change is expensive, political, and slow. But the principle is the same: if you want the core to adapt, you have to make adaptation possible in the core, not just in the periphery.
The deepest version of this pattern might be personal. We all have rigid core assumptions — about ourselves, our capabilities, our roles. When we learn new skills, adopt new tools, or enter new environments, the new additions are flexible. They’ll adapt. The question is whether they adapt with the core or around it.
Someone who learns AI tools while holding a rigid assumption that “real work means writing every line yourself” will find ways to use AI that route around that assumption. They’ll use it for boilerplate, for research, for drafting — anything that doesn’t challenge the core belief. The metric (productivity) will improve. The core assumption won’t change. And the improvement will plateau the moment the workaround reaches the limit of what peripheral adaptation can achieve.
The model’s generation was garbage because the core couldn’t contribute. Workarounds have ceilings. At some point, the thing you’re routing around becomes the bottleneck, and no amount of peripheral cleverness can compensate.
That’s when you have to decide: do you dequantize the core, or do you keep adding more modules to route around it?