TL;DR
DeepSeek’s mHC paper fixes a nasty instability problem in hyper-connections, an upgraded version of residual connections for transformers. The original hyper-connections (from ByteDance) let models route information through multiple parallel streams, but the learned mixing weights could amplify signals by 3,000x during training, making large models blow up. DeepSeek’s fix: constrain those weight matrices using the Sinkhorn-Knopp algorithm from 1967, forcing them to be doubly stochastic. The result is stable training from 3B to 27B parameters, with better benchmark scores and only 6.7% overhead.
The Residual Connection Problem Nobody Talks About
Every modern transformer uses residual connections. You know the pattern: take the input to a layer, run it through attention or a feed-forward network, then add the original input back. Skip connections. They’ve been the standard since ResNet in 2015, and they work because they prevent gradients from vanishing during backpropagation so deep networks can actually train.
But residual connections have a ceiling. Each layer gets exactly one path to pass information forward. That’s fine for smaller models. At the scale we’re building now, hundreds of billions of parameters and networks over a hundred layers deep, that single stream becomes a bottleneck. Information from early layers has to squeeze through one channel to reach later layers, and the gradient signal flowing backward faces the same constraint.
ByteDance’s Hyper-Connections paper (ICLR 2025) proposed a fix: widen the residual stream. Instead of one skip connection per layer, you get multiple parallel streams. Each layer’s input and output are mixed together using learned weight matrices, creating richer information flow between layers. Think of it as replacing a single-lane highway with a multi-lane one.
It worked. Hyper-connections showed clear performance gains over standard residuals on language modeling benchmarks. The concept was sound.
Then people tried to scale it up, and everything fell apart.
What Goes Wrong at Scale
The mixing matrices in hyper-connections are fully learned. The model can set them to whatever values minimize loss. In theory, this flexibility is great. In practice, it’s a disaster.
Here’s why. Each layer multiplies its input by a mixing matrix to combine the multiple residual streams. Stack 100 layers and you’re multiplying 100 matrices together. If those matrices have eigenvalues even slightly above 1.0, the product grows exponentially. DeepSeek’s researchers measured this directly: in a 27B parameter MoE model using unconstrained hyper-connections, the composite gain (how much the residual signal gets amplified across all layers) peaked around 3,000x.
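A toy calculation (mine, not DeepSeek's measurement) makes the compounding concrete. Take a mixing matrix whose largest eigenvalue is only 1.08 and compose it over 100 layers:

```python
import numpy as np

# Toy illustration (not DeepSeek's code): compose one mixing matrix over
# many layers and track the worst-case amplification, i.e. the spectral
# norm of the matrix power.
n = 4
# Eigenvalues of this matrix are 1.08 (once) and 1.0 (three times).
mix = np.eye(n) + 0.08 * np.ones((n, n)) / n

def composite_gain(m, depth):
    """Largest singular value of m raised to the given power."""
    return np.linalg.norm(np.linalg.matrix_power(m, depth), ord=2)

print(composite_gain(mix, 10))   # ~2.2 after 10 layers
print(composite_gain(mix, 100))  # ~2200 after 100 layers
```

A 2.2x gain at 10 layers looks harmless; the same matrix at 100 layers amplifies by roughly 2,200x, the same order of magnitude DeepSeek measured in practice.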
Your residual stream is supposed to carry a stable signal. A 3,000x amplification means the activations explode. Gradients become unstable. Training diverges or produces garbage. The bigger the model, the more layers you stack, the worse it gets.
This is the core tension. Hyper-connections give you better information flow, but the freedom that makes them work also makes them unstable. And the instability gets worse precisely when you need them most: at scale.
DeepSeek’s Fix: The Birkhoff Polytope
DeepSeek’s insight was geometric. If unconstrained matrices cause amplification, constrain them to a space where amplification is structurally impossible.
That space is the Birkhoff polytope: the set of all doubly stochastic matrices. A matrix is doubly stochastic when its entries are non-negative and every row and every column sums to 1. Multiply a non-negative vector by a doubly stochastic matrix and the output’s entries sum to exactly what the input’s did. No amplification. No decay. Stable by construction.
Why does this property matter for residual streams? Consider what happens at each layer. The mixing matrix takes the multiple residual streams and recombines them. If that matrix is doubly stochastic, each output stream is a weighted average of the input streams (rows sum to 1), and each input stream’s signal is distributed across the outputs without loss or gain (columns sum to 1). The signal is redistributed but never inflated.
Stack as many layers as you want. Each multiplication preserves magnitude. The composite gain stays bounded near 1.0 instead of rocketing to 3,000. Training stays stable whether you’re at 3B or 27B parameters.
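A quick sanity check (my own toy example, not from the paper): push a signal through 100 rounds of mixing by a doubly stochastic matrix and the total never moves.

```python
import numpy as np

# Toy check (not from the paper): every row and every column of `mix`
# sums to 1, so repeated mixing redistributes signal between streams
# without ever changing the total.
mix = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.2, 0.2, 0.6],
])

streams = np.array([1.0, 2.0, 3.0])  # per-stream signal totals
out = streams.copy()
for _ in range(100):  # 100 "layers" of mixing
    out = mix @ out

print(streams.sum())  # 6.0
print(out.sum())      # still ~6.0: redistributed, never inflated
```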
The model can still learn how to route information between streams. The specific values within the doubly stochastic matrix are still learned. It just can’t learn to amplify.
Enter the Sinkhorn-Knopp Algorithm
You need the mixing matrices to be doubly stochastic at every training step. Projecting onto the Birkhoff polytope directly would be expensive. DeepSeek reached for a solution from 1967 instead.
Richard Sinkhorn and Paul Knopp proved that any matrix with positive entries can be converted to a doubly stochastic matrix through a dead-simple iterative process:
- Divide each row by its row sum (now all rows sum to 1)
- Divide each column by its column sum (now all columns sum to 1, but the row sums are broken again)
- Repeat, alternating the two, until convergence
That’s it. Alternate row normalization and column normalization. Each iteration gets closer to doubly stochastic. In practice, 20 iterations is more than enough for the precision needed during neural network training.
```python
def sinkhorn_knopp(matrix, iterations=20):
    """Convert a matrix (PyTorch tensor) to doubly stochastic."""
    m = matrix.abs()  # make all entries non-negative
    for _ in range(iterations):
        m = m / m.sum(dim=-1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # columns sum to 1
    return m
```
The algorithm is differentiable, so gradients flow through it during backpropagation. The model learns the unconstrained matrix, and the Sinkhorn-Knopp step projects it onto the Birkhoff polytope before it’s used in the forward pass. Training adjusts the underlying weights; the constraint ensures the actual mixing is always stable.
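To see the projection land where it should, here is a NumPy rendering of the same learn-then-project pattern (illustrative only, not the paper's implementation; `raw_weights` is a made-up name). The optimizer would update `raw_weights` freely, while the forward pass only ever sees the projected, doubly stochastic version.

```python
import numpy as np

def sinkhorn_knopp(matrix, iterations=20):
    """NumPy sketch of the projection: alternate row/column normalization."""
    m = np.abs(matrix) + 1e-8  # keep entries strictly positive
    for _ in range(iterations):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m

rng = np.random.default_rng(0)
raw_weights = rng.standard_normal((4, 4))  # unconstrained learned parameter

mix = sinkhorn_knopp(raw_weights)  # projected before use in the forward pass
print("row sums:", mix.sum(axis=1))  # each value hovers near 1.0
print("col sums:", mix.sum(axis=0))  # each value hovers near 1.0
```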
I like this part of the paper. A 60-year-old algorithm from pure linear algebra, picked off the shelf, dropped into a modern training pipeline, and it just works.
The Engineering Work Behind the Paper
Constraining matrices to the Birkhoff polytope is the headline idea, but making it fast enough for production training required serious engineering. DeepSeek didn’t just slap Sinkhorn-Knopp onto hyper-connections and call it a day.
The Sinkhorn iterations add compute at every layer, for every forward pass. The team developed fused CUDA kernels that batch the normalization with the matrix multiplication, avoiding extra memory round-trips. They used selective recomputation during backpropagation. Instead of storing all intermediate Sinkhorn states for the backward pass, they recompute them on the fly, trading a small amount of extra compute for significant memory savings.
They also redesigned the pipeline scheduling for their MoE (Mixture of Experts) architecture. In MoE models, different tokens get routed to different expert sub-networks, and the communication patterns between GPUs are already complex. Adding hyper-connection mixing on top of expert routing required careful attention to how data moves across the pipeline.
All of this brought the total training overhead to 6.7%. For context, that means a training run that would take 100 GPU-hours with standard residual connections takes about 107 GPU-hours with mHC. Given the performance gains, that’s a trade most teams would take.
Benchmark Results
DeepSeek tested mHC at three scales: 3B, 9B, and 27B parameter MoE models, all based on their DeepSeek-V3 architecture.
The 27B model results tell the story:
| Benchmark | Baseline (residual) | HC (unconstrained) | mHC |
|---|---|---|---|
| BBH | 43.8% | 48.9% | 51.0% |
| DROP | 47.0% | 51.6% | 53.9% |
| GSM8K | 46.7% | 51.2% | 53.8% |
| MMLU | 59.0% | 61.8% | 63.4% |
| TriviaQA | 54.1% | 56.3% | 57.6% |
mHC beats both the baseline and unconstrained hyper-connections across all eight benchmarks they tested. BBH jumped 7.2 points, DROP gained 6.9, GSM8K went up 7.1. And mHC consistently edges out unconstrained HC by another 1-2 points. The stability constraint isn’t just preventing crashes. It’s helping the model learn better representations too.
The stability numbers are wild. The Amax gain magnitude (how much the composite residual mapping amplifies signals) for unconstrained HC hits 3,000 at the 27B scale. For mHC, it stays at 1.6. Not a percentage difference. A difference between “training works” and “training might explode.”
The performance advantage also held steady or increased across all three model sizes. The fix doesn’t break down at large scale. It gets relatively better as you scale up.
Why This Matters Beyond DeepSeek
Residual connections are everywhere. Every GPT, every Llama, every Gemini model uses them. They’re one of the few architectural choices that hasn’t changed much since the original transformer paper in 2017. If hyper-connections (with mHC’s stability fix) prove to be a strict upgrade, other labs will follow.
A few things to watch.
If mHC lets you extract more performance per parameter, that’s equivalent to training a smaller model for the same quality. At current GPU prices, even a 5% reduction in required model size translates to millions of dollars in saved compute.
There’s also the depth question. Standard residual connections put a soft limit on how deep you can make a transformer before training becomes unstable. mHC could push that ceiling higher, enabling architectures with more layers and fewer parameters per layer. That’s a different tradeoff than the “make it wider” approach that’s dominated scaling so far.
On the adoption side, there’s already a Python package on PyPI for hyper-connections, and lucidrains has an implementation on GitHub. mHC adds a constraint on top of existing hyper-connections, so the migration path for researchers is straightforward. Combined with inference-side improvements like TurboQuant’s 6x KV cache compression and diffusion-based decoding hitting 1,000 tokens per second, the LLM architecture stack is getting reworked at every level.
The fact that DeepSeek is publishing this openly matters too. They could have kept mHC as an internal advantage for their next model release. Instead, they published the paper with enough detail that anyone can implement it. That’s consistent with their pattern. DeepSeek-V2 and V3 were both published with full architectural details, and the openness pushes the entire field forward.
FAQ
What’s the difference between residual connections and hyper-connections?
Standard residual connections add a layer’s input directly to its output: one stream, one path. Hyper-connections widen this to multiple parallel streams with learned mixing weights, allowing richer information flow between layers. mHC adds a stability constraint to hyper-connections using doubly stochastic matrices.
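For intuition, here is a shape-level sketch of how the two updates differ. This is my own illustrative code with made-up names (`f`, `n_streams`), not any real implementation; real hyper-connections use separate learned width and depth connections rather than the simple mean used here.

```python
import numpy as np

# Shape-level sketch, not a real implementation. `f` stands in for a
# transformer sublayer (attention or FFN).
d, n_streams = 8, 4
f = lambda x: 0.5 * np.tanh(x)  # placeholder sublayer

# Standard residual connection: one stream, one path.
x = np.ones(d)
x_next = x + f(x)

# Hyper-connection: n parallel streams, recombined by a mixing matrix.
streams = np.ones((n_streams, d))
# Stand-in for the learned mixing matrix; mHC would force it to be
# doubly stochastic via Sinkhorn-Knopp.
mix = np.full((n_streams, n_streams), 1.0 / n_streams)
layer_in = streams.mean(axis=0)              # read the sublayer's input
streams_next = mix @ streams + f(layer_in)   # mix streams, add output back

print(x_next.shape, streams_next.shape)  # (8,) (4, 8)
```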
Do I need to understand the Sinkhorn-Knopp algorithm to use mHC?
Not really. If you’re implementing it, the algorithm is about 5 lines of code. Alternate row and column normalization until convergence. If you’re using an existing implementation, it’s just a drop-in replacement for standard residual connections. The math behind why it works involves the Birkhoff polytope and doubly stochastic matrices, but you don’t need that to use it.
How does mHC compare to other training stabilization techniques like gradient clipping?
Gradient clipping is reactive: it caps gradients after they’ve already exploded. mHC is structural. It prevents the amplification from happening in the first place by constraining the mixing matrices. They’re complementary; you’d still use gradient clipping as a safety net, but mHC removes the primary source of instability in hyper-connections.
Will this show up in future DeepSeek models?
Almost certainly. They tested on their V3-style architecture, and DeepSeek has a pattern of publishing improvements right around the time they ship them.
Is the 6.7% overhead worth it?
Yes. 2-7 point gains across multiple benchmarks for 6.7% more training time is a bargain, and that’s before you factor in the stability benefit. Not having your training run diverge at 27B parameters is worth a lot more than 7% extra compute.
Bottom Line
The core idea here is almost disappointingly simple. Hyper-connections blow up at scale because their mixing matrices amplify signals? Constrain them to doubly stochastic matrices so amplification is mathematically impossible. Enforce the constraint using an algorithm from 1967 that just alternates row and column normalization.
It works. It beats unconstrained hyper-connections on benchmarks while keeping the composite gain at 1.6 instead of 3,000. Sometimes the real bottleneck in LLM architecture isn’t the attention mechanism or the training schedule. It’s the plumbing. Residual connections have been good enough for a decade, but “good enough” leaves performance on the table once you know how to do better.
I expect mHC or something very close to show up in DeepSeek’s next production model. And once it does, other labs will start experimenting with constrained hyper-connections. The math is simple, the overhead is small, and the payoff compounds with scale.
