Routing Absorption in Sparse Attention:
Why Random Gates Are Hard to Beat

Keston Aquino-Michaels
No Way Labs

February 2026

Abstract. Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model’s Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 ± 0.60 vs. 49.83 ± 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against oracle masks but catastrophic perplexity when deployed (601.6 vs. 48.6 on mask-agnostic Q/K/V); and (4) stochastic mask randomization during training fails to prevent co-adaptation (78.2 ppl deployed dense vs. 37.3 baseline). We connect routing absorption to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, but show that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE, where experts are self-contained modules. The implication is that end-to-end sparse attention methods employing per-query token-level gating face absorption pressure proportional to the parameter asymmetry between the gate and the model, and that post-hoc approaches, which decouple representation learning from sparsification, sidestep this entirely.

1 Introduction

Attention is concentrated. In a 31M-parameter transformer trained on WikiText-103, the top 64 out of 512 key positions per query capture 90.6% of the total attention mass (Section 4.1). In Qwen3-1.7B, the concentration is even sharper: oracle top-k masking at k=64 (87.5% sparsity) raises perplexity from 11.52 to only 11.57; the remaining 437 positions carry almost no signal. This suggests that a small learned gate should easily identify which entries to keep.

And indeed it can, but only after training is over. A lightweight bilinear gate (dgate=32, 1.3% of model parameters) trained post-hoc on a frozen dense checkpoint converges to near-oracle routing in 1,000 steps, closing >94% of the gap between random and oracle masks at all tested sparsity levels (Table 4). The same gate architecture, trained end-to-end alongside the model’s Q/K/V projections for 50,000 steps, learns almost nothing: its perplexity matches that of a frozen random gate to within 2.2%.

This paper asks: why does end-to-end sparse attention training fail when the structure clearly exists?

The answer is routing absorption. The model’s parameters (~31M), which outnumber the gate’s 393K by 80 to 1, continuously adapt to compensate for whatever mask is imposed. After 50,000 steps of co-training, the gate’s mask has been “absorbed” into the Q/K/V representations: removing the gate changes almost nothing, replacing it with random noise changes almost nothing, and the gate’s predictions carry little more information about attention structure than chance. This is the attention analog of a well-documented phenomenon in Mixture-of-Experts (MoE), where experts co-adapt to any router until random routing matches learned routing [1, 2, 3], but with a crucial structural difference: in MoE, experts are self-contained modules, while in attention, shared Q/K/V projections enable cross-layer compensation that makes absorption more severe.

The contribution of this paper is not a method but an analysis. We present four controlled experiments on a 31M-parameter model that isolate different aspects of the absorption mechanism (Figures 2 and 3), with preliminary scale evidence from Qwen3-1.7B consistent with the phenomenon persisting at 55× larger scale. We connect the results to the MoE literature and discuss implications for recent sparse attention methods that rely on learned routing.

2 Background and Setup

2.1 The Sparse Attention Routing Problem

Given a pretrained transformer with attention scores A = QKᵀ/√dh and attention weights P = softmax(A), the sparse attention routing problem is: learn a gate function G(x) that predicts which entries of P to keep, such that masking the rest preserves model quality. The gate produces per-head scores G(h) ∈ ℝn×n; at deployment, only the top-k entries per query are retained and the rest are masked to −∞ before softmax.
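The deployment-time masking step can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's actual code: causal masking is omitted and all names are ours.

```python
import torch

# Keep the top-k gate-scored keys per query; mask the rest to -inf
# before the softmax, so they receive exactly zero attention weight.
def sparse_attention_weights(A: torch.Tensor, G: torch.Tensor, k: int) -> torch.Tensor:
    # A: (n, n) attention logits; G: (n, n) gate scores
    idx = G.topk(k, dim=-1).indices                              # top-k keys per query
    keep = torch.zeros_like(G, dtype=torch.bool).scatter_(-1, idx, True)
    return A.masked_fill(~keep, float("-inf")).softmax(dim=-1)

n, k = 512, 64
P = sparse_attention_weights(torch.randn(n, n), torch.randn(n, n), k)
# Each query attends to exactly k key positions; each row still sums to 1.
```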

In the end-to-end setting, G is trained jointly with the model: the gate and the Q/K/V projections co-evolve. In the post-hoc setting [12], the model is frozen and only the gate is trained, typically by distillation against the model’s own attention distributions.

2.2 Experimental Setup

All experiments use a 6-layer, 256-dimensional, 4-head pre-norm transformer (~31M parameters) trained on WikiText-103 (512-token chunks, batch size 16, cosine LR schedule). The dense baseline achieves 37.32 perplexity.

The gate adds per-head projections Wgq(h), Wgk(h) ∈ ℝd×dgate producing gate scores G(h) = (xWgq(h))(xWgk(h))ᵀ/√dgate. With dgate=32, this adds 393K parameters (1.3% of the model).
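One head of this gate can be sketched as a small module (class and variable names are illustrative). Each head carries 2 × d × dgate = 16,384 parameters, which over 4 heads and 6 layers gives the 393K total quoted above.

```python
import torch

class BilinearGate(torch.nn.Module):
    """One head's gate: G = (x Wgq)(x Wgk)^T / sqrt(dgate). Sketch only."""
    def __init__(self, d: int, d_gate: int):
        super().__init__()
        self.wq = torch.nn.Linear(d, d_gate, bias=False)   # Wgq
        self.wk = torch.nn.Linear(d, d_gate, bias=False)   # Wgk
        self.scale = d_gate ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n, d) hidden states -> (n, n) gate scores
        return self.wq(x) @ self.wk(x).transpose(-1, -2) * self.scale

gate = BilinearGate(d=256, d_gate=32)    # 2 * 256 * 32 = 16,384 params per head
scores = gate(torch.randn(512, 256))     # (512, 512) gate scores
```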

We use this small model deliberately: it is large enough to exhibit routing absorption but small enough to run end-to-end sparse training experiments (50,000 steps) and controlled ablations at modest cost. The full experimental suite (31M pretraining, all ablations, and Qwen3 fine-tuning) cost under $150 on rented GPUs; every result in this paper can be independently reproduced on a single consumer GPU in under a day. In Section 5 we present evidence that the phenomenon intensifies at scale.

2.3 Connection to MoE Routing Absorption

Routing absorption was first identified in Mixture-of-Experts models. Roller et al. [1] showed that hash-based (random) routing matches learned routing in MoE language models. Chen et al. [2] and Fan et al. [3] confirmed that experts co-adapt to any router, making the routing function nearly irrelevant. Clark et al. [4] provided scaling laws showing that routing quality has diminishing returns as expert capacity grows.

The mechanism is intuitive: when the routed compute (experts) has far more parameters than the router, the compute adapts to the router rather than vice versa. In attention, the full model (~31M parameters: Q/K/V projections, feedforward layers, and embeddings) plays the role of experts, and the gate (~393K parameters) plays the role of the router, an 80:1 parameter asymmetry that makes absorption nearly inevitable.

3 Evidence for Routing Absorption

We present four independent experiments, each isolating a different aspect of the absorption mechanism. All use k=64 (87.5% sparsity) unless otherwise noted.

3.1 Experiment 1: Learned Gates Match Random Gates

We train the 31M model end-to-end with differentiable soft gating for 50,000 steps. The mask is applied as element-wise multiplication with sigmoid gate scores (no hard top-k, so gradients flow through the gate).
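Unlike hard top-k (Section 3.2), this soft scheme keeps the gate inside the differentiable graph, so the task loss reaches the gate parameters. A minimal sketch under our own naming:

```python
import torch

# Soft gating: dense attention weights are multiplied elementwise by
# sigmoid gate scores, so gradients flow back into the gate scores G.
A = torch.randn(8, 8)                       # attention logits
G = torch.randn(8, 8, requires_grad=True)   # gate scores

P = torch.softmax(A, dim=-1) * torch.sigmoid(G)   # differentiable soft mask
P.sum().backward()
# G.grad is dense and nonzero: the gate is fully trainable under this scheme.
```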

Table 1. End-to-end soft gating at k=64. Mean ± std over 3 seeds (42, 137, 256).

Condition          | Validation PPL | Seeds | Gate trainable?
Learned soft gate  | 48.73 ± 0.60   | 3     | Yes (50K steps)
Random soft gate   | 49.83 ± 0.04   | 3     | No (frozen)
Dense baseline     | 37.32          | N/A   | —

The gap is 1.10 ppl, or 2.2% (Table 1): despite 50,000 steps of gradient updates on 393K parameters, the learned gate converges to within 2.2% of a frozen random gate. The learned gate does extract a small signal (the gap is statistically nonzero), but it captures only 9% of the possible improvement from random (49.83) to dense (37.32); the other 91% of the routing benefit has been absorbed by Q/K/V co-adaptation. Both conditions sit far from the dense baseline (37.32 ppl), confirming that the model’s representational capacity has been spent adapting to the presence of a mask, not to its content.

This is not a failure of the gate architecture or the training dynamics: the same gate, trained post-hoc on a frozen checkpoint, converges to near-oracle routing in 1,000 steps (Section 4). The failure is specific to joint training, the regime in which Q/K/V can co-adapt to absorb the gate’s signal.

Convergence speed rules out insufficient training.

A natural objection is that 50,000 steps may simply be insufficient for the gate to learn. But when Q/K/V are frozen, the same gate architecture converges from 46.86 to 37.33 ppl (over 99% of its final improvement) in just 500 steps (Section 4.3). The gate can learn routing in hundreds of steps when the target is stable. That it shows only marginal signal (9% of the possible improvement) in 50,000 end-to-end steps is not a matter of training budget; it is evidence that the optimization landscape is nearly flat with respect to the gate parameters because Q/K/V continuously absorb whatever the gate does.

3.2 Experiment 2: Hard Top-k Gets Zero Gradient

An even simpler explanation applies to hard top-k gating: the gate receives no task gradient at all. The mask M = 1[rank(G) ≤ k] is piecewise-constant with ∂M/∂G = 0. In PyTorch, the implementation chain topk() → scatter_() → masked_fill() produces a hard binary mask with no gradient path back to G.
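This is easy to verify directly in PyTorch. In the sketch below (our names, not the paper's code), gradients reach the attention logits A, standing in for Q/K/V, but never the gate scores G, because topk().indices is a non-differentiable rank operation:

```python
import torch

# Hard top-k masking is piecewise-constant in G: dM/dG = 0 everywhere.
n, k = 16, 4
A = torch.randn(n, n, requires_grad=True)   # attention logits (stand-in for Q/K)
G = torch.randn(n, n, requires_grad=True)   # gate scores

idx = G.topk(k, dim=-1).indices             # integer indices: no gradient path
keep = torch.zeros_like(G, dtype=torch.bool).scatter_(-1, idx, True)
P = A.masked_fill(~keep, float("-inf")).softmax(dim=-1)

P.pow(2).sum().backward()
# A.grad is populated, but G.grad is None: autograd never touched G's values.
```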

We verify empirically: hard top-k gating gives 71.22 ppl with learned gates and 71.24 ppl with frozen random gates, identical to within noise, as expected when the gate receives zero gradient.

This is often dismissed as a mere implementation detail (“just make the mask differentiable”). But the juxtaposition with Experiment 1 is instructive: Experiment 2 shows the gate gets zero gradient through hard top-k; Experiment 1 shows that even full gradients through soft gating barely help (2.2% improvement over random). The gate captures almost nothing with no gradients and almost nothing with gradients. The bottleneck is not gradient flow; it is what happens when the gradients arrive: Q/K/V absorb the signal faster than the gate can impose it.

3.3 Experiment 3: The Distillation Contrast

The strongest evidence comes from distilling gates on two different checkpoints: one with mask-agnostic Q/K/V (dense-trained), one with co-adapted Q/K/V (soft-gated, from Experiment 1). In both cases, we freeze the model and train only the gate projections for 1,000 steps using BCE loss against the oracle top-k mask. Both gates converge to high F1 against their respective oracles. But deploying with hard top-k produces radically different results:

Table 2. The distillation contrast (k=64). Both gates predict oracle masks well (F1 > 0.8). Dense-trained Q/K/V tolerate hard masking; co-adapted Q/K/V do not. The 12× perplexity gap is a direct measurement of co-adaptation.

Source checkpoint        | Gate F1 | Deploy PPL | Interpretation
Dense (mask-agnostic)    | 0.842   | 48.6       | Tolerates any mask
Soft-gated (co-adapted)  | 0.804   | 601.6      | Catastrophic

This experiment (Table 2) directly measures co-adaptation. The soft-gated model’s Q/K/V have learned representations that depend on the specific form of the sigmoid mask—not on which entries are masked, but on the continuous, differentiable nature of the masking function itself. Replacing sigmoid gating with binary top-k gating, even when the gate predicts the “right” entries, changes the mask’s functional form and breaks the co-adapted representations.

The dense checkpoint, by contrast, has never seen any mask. Its Q/K/V representations are mask-agnostic: they work with whatever sparsity pattern is applied post-hoc, because they never specialized to any particular one. This is the property that makes post-hoc distillation work.

3.4 Experiment 4: Stochastic Masking Doesn’t Help

A natural hypothesis is that co-adaptation can be prevented by randomizing the mask during training, analogous to dropout [5] preventing feature co-adaptation. We test this by training with a fresh random mask per forward pass, sampled via Gumbel-softmax: in effect, an attention-pattern dropout.

Table 3. Stochastic mask training. Deploying the stochastically-trained model without any mask gives 78.19 ppl vs. 37.32 for the baseline.

Deployment condition         | PPL           | Notes
Dense (no mask)              | 78.19         | Should be ~37
Fixed random masks (5 seeds) | 104.43 ± 0.66 | —
Dense baseline               | 37.32         | —

The result is unambiguous (Table 3). Even deploying the stochastically-trained model with no mask at all gives 78.19 ppl, more than double the baseline. The stochastic masking has not made the model robust to masks; it has damaged the Q/K/V representations by forcing them to work under adversarial conditions during every training step. The model learns to tolerate mask noise by flattening its attention distributions, sacrificing the sharp, concentrated patterns that carry most of the signal.

This rules out the naïve “attention dropout” hypothesis for this masking scheme: unlike weight dropout (which regularizes by preventing feature co-adaptation), per-step random mask replacement destroys the attention structure. Other stochastic strategies (e.g., annealing mask noise or mixing dense and sparse forward passes) remain untested.
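Experiment 4's masking scheme in miniature: a fresh top-k mask is resampled on every forward pass. The sketch below uses plain uniform sampling rather than the Gumbel-softmax machinery, and all names are illustrative.

```python
import torch

def fresh_random_mask(n: int, k: int) -> torch.Tensor:
    # Keep k uniformly random key positions per query, resampled every call.
    idx = torch.rand(n, n).topk(k, dim=-1).indices
    return torch.zeros(n, n, dtype=torch.bool).scatter_(-1, idx, True)

n, k = 512, 64
for step in range(3):                       # every training step sees a new mask
    mask = fresh_random_mask(n, k)
    A = torch.randn(n, n)
    P = A.masked_fill(~mask, float("-inf")).softmax(dim=-1)
```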

4 Why Post-hoc Works: The Decoupling Argument

If routing absorption prevents gates from learning during end-to-end training, why does post-hoc distillation work so well?

4.1 The Structure Exists

The dense model’s attention distributions contain strong, learnable structure (Figure 1). Across all layers, the top 64 positions per query (out of 512) capture 90.6% of the total attention mass. Some layers are even more concentrated: layer 2 concentrates 100% of its attention mass in the top-64, and layer 3 captures 99.8%. The entropy ratio (actual entropy / maximum entropy) averages 0.497, well below the maximum-entropy value of 1.0 that would indicate uniform attention.

This rules out the hypothesis that attention is too diffuse for sparse routing to work. The structure is there; the question is whether the gate can learn it without the Q/K/V adapting away.
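The concentration statistic above can be computed directly from any attention matrix; a sketch with illustrative shapes (a random softmax stands in for real attention weights):

```python
import torch

# Fraction of total attention mass captured by the k highest-weight keys
# per query, averaged over heads and queries.
def topk_mass(P: torch.Tensor, k: int) -> float:
    # P: (heads, queries, keys) attention weights
    return (P.topk(k, dim=-1).values.sum() / P.sum()).item()

P = torch.softmax(torch.randn(4, 512, 512), dim=-1)
frac = topk_mass(P, 64)   # lies between k/n (uniform attention) and 1.0 (one-hot)
```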

Figure 1. Attention concentration by layer in the 31M model. Each bar shows the fraction of total attention mass captured by the 64 highest-weight positions out of 512. Layers 2 and 3 are nearly one-hot, concentrating 100% of mass in the top 64; even the least concentrated layer captures 66%. The dashed line shows the average across all layers (90.6%).

4.2 Frozen Q/K/V Enable Fast Convergence

When the model is frozen, the gate faces a static optimization landscape: the attention distributions don’t change, so the gate simply needs to learn to predict them. This is a supervised learning problem with a fixed target, and the gate solves it efficiently:

Table 4. Post-hoc KL distillation on the frozen 31M model. 1,000 steps of gate training match or exceed 50,000 steps of end-to-end training (at k=64), using 50× fewer gate-training steps.

Condition                | Steps  | Gate F1 | PPL   | vs. Oracle
End-to-end learned gate  | 50,000 | N/A     | 48.73 | +6%
Post-hoc KL distillation | 1,000  | 0.833   | 48.83 | +6%
Post-hoc KL (k=128)      | 1,000  | 0.888   | 40.24 | +2%
Post-hoc KL (k=256)      | 1,000  | 0.931   | 37.57 | +0.2%
Oracle top-k (k=64)      | —      | 1.000   | 46.00 | —
Dense baseline           | —      | —       | 37.32 | —

At k=64, post-hoc distillation in 1,000 steps (Table 4) matches 50,000 steps of end-to-end training, a 50× reduction in gate-training steps. At higher k, post-hoc distillation far exceeds what end-to-end training achieves at any budget.

The key insight is not that post-hoc distillation is a clever method; it is that decoupling is the key ingredient. Any approach that prevents Q/K/V from co-adapting to the gate (by freezing, by using a separately trained checkpoint, or by any other means) should work. Post-hoc distillation is simply the most practical way to achieve decoupling.
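The post-hoc setup can be sketched in miniature, with a toy frozen "teacher" attention matrix standing in for the real model. All names and sizes are illustrative, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

# Post-hoc distillation sketch: the model is frozen; only the bilinear gate
# projections are trained, here with a KL objective against frozen attention.
torch.manual_seed(0)
d, d_gate, n = 64, 32, 128
x = torch.randn(n, d)                                     # frozen hidden states
with torch.no_grad():
    P_teacher = torch.softmax(x @ x.T / d**0.5, dim=-1)   # frozen "attention"

Wgq = torch.nn.Parameter(torch.randn(d, d_gate) / d**0.5)
Wgk = torch.nn.Parameter(torch.randn(d, d_gate) / d**0.5)
opt = torch.optim.Adam([Wgq, Wgk], lr=1e-2)

losses = []
for step in range(200):
    G = (x @ Wgq) @ (x @ Wgk).T / d_gate**0.5             # bilinear gate scores
    loss = F.kl_div(torch.log_softmax(G, dim=-1), P_teacher, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
# The target is static, so the gate fits it quickly: losses decrease steadily.
```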

4.3 Gate-Only Training: Isolating the Gate

To confirm that the gate architecture itself is capable, we run a gate-only training experiment: freeze the dense checkpoint, train only the gate projections, and measure whether learned gates improve over random.

Table 5. Gate-only training with frozen Q/K/V (3 seeds). When co-adaptation is prevented by freezing, the gate drives perplexity from 46.8 to 37.29 (near-dense), a 20% improvement. Convergence is fast: over 99% of the improvement occurs in 500 steps (37.33 ppl).

Condition                 | Final PPL    | Δ PPL
Learned gates (trainable) | 37.29 ± 0.00 | −9.57
Random gates (frozen)     | 46.86 ± 0.10 | 0.00

The result is stark (Table 5 and Figure 2a): across 3 seeds, learned gates converge to 37.29 ± 0.00 ppl, a 20% improvement, while random gates remain at 46.86 ± 0.10 with zero improvement. The contrast with Experiment 1 is sharp: 500 steps to converge when Q/K/V are frozen vs. negligible signal in 50,000 steps when Q/K/V co-adapt.

Figure 2. Convergence dynamics under decoupled vs. co-adapted training. (a) Post-hoc gate-only training on the frozen 31M model: the learned gate converges from 46.8 to 37.3 ppl in 500 steps (>99% of total improvement), while the frozen random gate stays flat. (b) Single-layer co-adaptation at Qwen3 scale: learned and random gates converge to identical perplexity (8.80), with the no-gate control reaching 8.70; the gate adds overhead rather than signal.

5 The Mechanism: Parameter Asymmetry

Why does co-adaptation overwhelm the gate but not vice versa? The answer is parameter asymmetry. The model contains ~31M parameters that can freely adjust to compensate for any 393K-parameter gate. In the gradient landscape, the model has 80× more degrees of freedom to absorb the gate’s signal than the gate has to impose it.

This asymmetry mirrors MoE. In a typical MoE layer, the router has d × E parameters (where E is the number of experts, typically 8–64), while the experts collectively have on the order of E × dff × d parameters, a ratio of roughly dff:1. Since dff is typically 4–16× the hidden dimension d, the experts outweigh the router by orders of magnitude. Clark et al. [4] showed that the benefit of routing quality diminishes as expert capacity grows, consistent with absorption increasing with the parameter asymmetry.

Gate capacity ablation.

The parameter asymmetry argument predicts that increasing gate capacity should delay but not prevent absorption. We test this on Qwen3-1.7B in the post-hoc setting by sweeping dgate ∈ {32, 64, 128} with BCE distillation (Table 6). With KL distillation, dgate=32 already achieves 99.9% efficiency at k=64, leaving no room for increased capacity to help.

Table 6. Gate capacity ablation on Qwen3-1.7B (post-hoc distillation). KL distillation at dgate=32 outperforms all BCE configurations. Dense: 11.52 ppl; oracle (k=64): 11.57 ppl.

dgate     | k=64 PPL | k=128 PPL | k=256 PPL
32 (KL)   | 12.24    | 11.72     | 11.56
32 (BCE)  | 102.82   | 25.76     | 20.78
64 (BCE)  | 84.83    | 27.42     | 18.65
128 (BCE) | 88.57    | 30.50     | 19.18

5.1 Direct Test: Single-Layer Absorption at Qwen3 Scale

To test routing absorption at scale without full pretraining, we run a controlled experiment on Qwen3-1.7B: freeze the entire model except one layer’s attention projections (Q/K/V/O), add soft bilinear gates to that layer, and fine-tune for 5,000 steps.

Table 7. Single-layer absorption at Qwen3 scale (layer 14).

Condition                       | PPL
Dense baseline (no fine-tune)   | 10.85
No gate (Q/K/V/O only)          | 8.70
Learned gate + Q/K/V/O          | 8.80
Random gate + Q/K/V/O           | 8.80
Gate gap (learned − random)     | 0.00
Gate overhead (gated − no gate) | +0.10

The learned and random gates converge to identical perplexity (both 8.80), consistent with routing absorption at 1.7B scale. The no-gate control reaches 8.70, better than either gated condition. The gate is not merely useless; it is actively harmful, adding 0.10 ppl of overhead.

Absorption gradient: varying co-adaptation capacity.

We sweep the number of unfrozen layers n ∈ {0, 2, 4, 8} (out of 28), with soft bilinear gates on all layers, training for 5,000 steps at k=64.

Table 8. Absorption gradient at Qwen3 scale. As more layers unfreeze, the random gate improves (42→17 ppl) because Q/K/V co-adapt to compensate, while the learned gate remains flat (~10 ppl).

n unfrozen | Learned PPL | Random PPL | Gap    | Co-adapt. capacity
0          | 10.71       | 42.26      | −31.55 | None (post-hoc)
2          | 10.13       | 23.07      | −12.94 | 7% of layers
4          | 10.03       | 21.04      | −11.01 | 14% of layers
8          | 10.48       | 17.40      | −6.92  | 29% of layers

Figure 3. The absorption gradient at Qwen3-1.7B scale. As more layers unfreeze (increasing co-adaptation capacity), the random gate’s perplexity drops toward the learned gate’s level. The shaded area shows the gap shrinking from 31.5 (post-hoc, no co-adaptation) to 6.9 (29% of layers unfrozen).

At n=8 (29% of layers), the gap has shrunk to 6.9 ppl, closing 78% of the original 31.6 ppl gap. The closure rate is decelerating: the first 2 unfrozen layers (7% of the model) close 59% of the gap, while the next 6 layers close only 19% more. This is consistent with Experiment 1’s endpoint at 31M scale, where full end-to-end training drives the gap to 2.2%.

5.2 Why Scale Makes Absorption Worse

The absorption gradient explains why scale intensifies absorption in end-to-end training: larger models have more parameters that can compensate for any routing. The flip side is that the post-hoc route becomes easier at scale: attention concentration sharpens (Section 1), so the oracle-to-dense gap, and hence the cost of sparsification, nearly vanishes at Qwen3 scale (Table 9).

Table 9. Post-hoc distillation efficiency across scales.

k   | 31M Oracle/Dense | 31M Efficiency | Qwen3-1.7B Oracle/Dense | Qwen3-1.7B Efficiency
64  | 46.0/37.3 = 1.23 | 94%            | 11.57/11.52 = 1.004     | 99.9%
128 | 39.5/37.3 = 1.06 | 96%            | 11.53/11.52 = 1.001     | 99.9%
256 | 37.5/37.3 = 1.01 | 99%            | 11.52/11.52 = 1.000     | 99.8%

Figure 4. Perplexity vs. sparsity for oracle, KL-distilled gate, BCE-distilled gate, and random masking at two scales. (a) On the 31M model, KL and BCE gates perform similarly, both tracking oracle closely. (b) On Qwen3-1.7B (log scale), KL distillation tracks oracle at all sparsity levels while BCE diverges catastrophically at high sparsity (102.8 vs. 12.2 at 87.5% sparsity).

6 Implications

6.1 For Sparse Attention Methods

Several recent methods learn sparse attention patterns end-to-end [9, 10, 11]. Our results suggest caution for methods that employ per-query token-level masking: routing absorption may allow these methods to converge to solutions where the learned routing contributes less than it appears, because Q/K/V co-adapt to make any reasonable routing pattern work.

We note that routing absorption as characterized here is specific to token-level gated masking. Other sparse attention formulations may sidestep absorption through different structural choices. MoSA [11] uses expert-choice routing over tokens, which is architecturally distinct from per-query masking. NSA [10] operates at block granularity with a compression branch.

That said, ablations replacing learned routing with fixed or random routing, in the style of Roller et al. [1], would be informative for any method claiming learned routing provides a benefit. NSA is particularly interesting: our analysis predicts that the sliding window and compression components, which impose fixed structure, contribute more than the learned selection component.

6.2 For the MoE Literature

Routing absorption in attention provides a cleaner experimental setting than MoE for studying co-adaptation. Our distillation contrast experiment (Section 3.3) provides a particularly direct measurement of co-adaptation: the 12× perplexity gap between deploying the same gate on mask-agnostic vs. co-adapted Q/K/V quantifies the extent to which co-adaptation has made the representations mask-dependent.

Where the analogy breaks down.

In MoE, experts are discrete modules with independent parameters; compensation for poor routing is confined to within-expert adaptation. In attention, Q/K/V projections are shared across all positions, enabling a richer compensation pathway. Our absorption gradient experiment (Table 8) provides direct evidence: unfreezing 8 of 28 layers reduces the learned-vs-random gap from 31.6 to 6.9 ppl—the model compensates for random gating by adjusting Q/K/V in other layers, a cross-layer compensation pathway that has no analog in MoE.

6.3 The Decoupling Principle

The broadest implication is a general principle: when a small auxiliary network must learn a routing decision over a much larger compute substrate, decouple the routing from the substrate’s training. Post-hoc distillation is one instance; freezing Q/K/V during gate training (Section 4.3) and distilling against a separately trained dense checkpoint are others.

7 Limitations

Our controlled experiments use a 31M-parameter model for the full end-to-end training ablations. The single-layer Qwen3 experiment (Section 5.1) is consistent with absorption at 1.7B scale but in a constrained setting (one unfrozen layer); full end-to-end sparse pretraining at 1.7B scale remains out of scope.

The absorption gradient experiment (Table 8) sweeps up to n=8 unfrozen layers (29% of 28). At this point, 78% of the gap between post-hoc and co-adapted settings has already closed, with diminishing marginal returns (the first 2 layers close 59%; the next 6 close only 19% more). Extending to n=28 would not resolve the remaining ambiguity: even with all layers unfrozen, the experiment is a 5,000-step fine-tuning run, not full pretraining, so the gap cannot be expected to close completely. The definitive full-convergence result is provided by the 31M model (Section 3.1), where all parameters co-adapt for the entire 50,000-step pretraining run and the learned-vs-random gap falls to 2.2%. The Qwen3 gradient experiment serves a different role: demonstrating that the mechanism (Q/K/V absorbing the routing signal) is present at 55× larger scale, not replicating the endpoint.

The 50,000-step soft gating experiment (Section 3.1) tests one training duration. It is possible that with substantially longer training, the gate could eventually overcome absorption. However, the convergence speed contrast argues against this: the same gate converges in 500 steps when Q/K/V are frozen (Section 4.3) but shows only marginal improvement in 50,000 end-to-end steps, a 100× training budget that captures only 9% of the possible routing benefit.

We study only the bilinear gate form G = (xWgq)(xWgk)ᵀ. More complex routing mechanisms might resist absorption better, though the MoE literature suggests that router architecture matters less than the fundamental parameter asymmetry [4].

8 Conclusion

Learned sparse attention gates largely fail end-to-end because of routing absorption: Q/K/V projections co-adapt to absorb the gate’s signal, leaving learned gates barely better than random. Four controlled experiments isolate different aspects of this mechanism: learned-vs-random near-equivalence under soft gating, zero gradient under hard gating, catastrophic deployment on co-adapted Q/K/V, and failure of stochastic mask regularization.

The phenomenon parallels routing absorption in MoE and follows from the same cause: parameter asymmetry between the router and the routed computation. Post-hoc approaches work precisely because they decouple the gate from Q/K/V training, preventing co-adaptation entirely.

The practical takeaway is that sparse attention routing should be treated as a post-training compression step, not an end-to-end training objective, and that any method claiming to learn routing end-to-end should be ablated against random routing to check for absorption.

References

[1] S. Roller, S. Sukhbaatar, A. Szlam, and J. Weston. Hash layers for large sparse models. In NeurIPS, 2021.

[2] T. Chen, Z. Zhang, A. Jaiswal, S. Liu, and Z. Wang. Sparse MoE as the new dropout: Scaling dense and self-slimmable transformers. In ICLR, 2023.

[3] D. Fan, B. Messmer, and M. Jaggi. Towards an empirical understanding of MoE design choices. arXiv:2402.13089, 2024.

[4] A. Clark et al. Unified scaling laws for routed language models. In ICML, 2022.

[5] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

[6] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.

[7] S. Liu et al. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. In ICLR, 2022.

[8] A. Gadhikar, S. Mukherjee, and R. Burkholz. Why random pruning is all we need to start sparse. In ICML, 2023.

[9] C. Lou, Z. Jia, Z. Zheng, and K. Tu. Sparser is faster and less is more: Efficient sparse attention for long-range transformers. arXiv:2406.16747, 2024.

[10] J. Yuan et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In ACL, 2025.

[11] P. Piękos, R. Csordás, and J. Schmidhuber. Mixture of sparse attention: Content-based learnable sparse attention via expert-choice routing. arXiv:2505.00315, 2025.

[12] Y. Gao et al. SeerAttention: Learning intrinsic sparse attention in your LLMs. arXiv:2410.13276, 2024.

[13] Z. Wang, J. Zhu, and J. Chen. ReMoE: Fully differentiable mixture-of-experts with ReLU routing. In ICLR, 2025.

Code and data: github.com/no-way-labs/routing-absorption