Abstract
Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input’s intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2x storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
Highlights
- MASS adds a lightweight router in activation space that selects task subspaces per input by projecting intermediate features onto TSV spans; no task data or training is needed, which makes it well suited to models pulled from public repositories.
- Extends routing to classification-head selection when the task is unknown at inference, removing the need for an oracle head and challenging an unrealistic assumption common in the existing model merging literature: that task identity is known at inference time.
- Recovers ~98% of the fine-tuned endpoints’ average accuracy; consistently outperforms existing methods across all benchmarks.
- Stores only top-k singular components per task and orthogonalizes across tasks (as in TSV-M) to mitigate interference during adaptive merges.
- Training-free; introduces only two-pass inference and a roughly 2x parameter footprint vs. one pretrained model, irrespective of task count.
- Evaluated on CLIP ViT backbones across several task benchmarks and on Flan-T5; batched routing closes the gap to <1% in most settings.
Method Overview
- Start from layer-wise task deltas and SVD. For each task $t$ and layer $\ell$, compute the update $\Delta W_t^{\ell} = W_t^{\ell} - W_0^{\ell}$ and perform the SVD $\Delta W_t^{\ell} = U_t^{\ell} \Sigma_t^{\ell} (V_t^{\ell})^{\top}$. Keep only the top-$k$ singular components per task (compression); see the first sketch after this list.
- Fixed merge via TSV-M (pre-processing) or TA. Concatenate the truncated factors $\hat{U}_t^{\ell}, \hat{\Sigma}_t^{\ell}, \hat{V}_t^{\ell}$ across tasks, orthogonalize them to reduce interference (as in TSV-M), and build a single "fixed" merged model $W_{\text{merged}}^{\ell} = W_0^{\ell} + \sum_t \widehat{\Delta W}_t^{\ell}$. Before routing, filter redundant directions: flatten each $\widehat{\Delta W}_t$ and drop any task update whose cosine similarity to an already-accepted update exceeds a threshold, so no family of tasks dominates the router later.
- Data-free, projection-based router. On the first pass, run the input $x$ through the fixed merged model to obtain an intermediate activation $z^{\ell}(x)$ at the routing layer. For each task subspace (spanned by that task's retained singular vectors), compute the projection residual $r_t(x) = \lVert z^{\ell}(x) - P_t\, z^{\ell}(x) \rVert$, where $P_t$ is the orthogonal projector onto the subspace. Convert the residuals to weights with a softmax, discard tasks below a threshold, and keep the top-$k$ selected tasks (see the router sketch after this list).
- Adaptive low-rank merge on the fly. Combine only the selected subspaces: $W_{\text{adaptive}}^{\ell} = W_0^{\ell} + \sum_{t \in \mathcal{S}(x)} w_t\, \widehat{\Delta W}_t^{\ell}$, where $\mathcal{S}(x)$ is the set of routed tasks and $w_t$ are the router weights. Make a second pass with $W_{\text{adaptive}}$, evaluate the corresponding selected heads, and pick the max-logit head/class (see the merge sketch after this list). MASS does not assume an oracle head, in contrast to much of the existing model merging literature.
- Routing layer choice. Routing at mid/late layers works best on average (e.g., around layer 9 in ViT-B models; MLP blocks slightly favored), though the best layer is task-dependent and per-task variance changes across layers, leaving adaptive layer selection as an avenue for future work.
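A minimal PyTorch sketch of the compression and pre-processing steps (first two bullets). The rank `k`, scale `alpha`, and cosine threshold `tau` are illustrative assumptions, and the TSV-M orthogonalization across tasks is omitted for brevity; this is not the authors' reference code.

```python
import torch

def truncated_task_svd(w_finetuned: torch.Tensor, w_pretrained: torch.Tensor, k: int):
    """Top-k singular components of one task's update for a single layer."""
    delta = w_finetuned - w_pretrained                     # Delta W_t = W_t - W_0
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k], vh[:k, :]                      # compressed factors

def low_rank_delta(u: torch.Tensor, s: torch.Tensor, vh: torch.Tensor) -> torch.Tensor:
    """Rank-k reconstruction of a task update from its truncated factors."""
    return (u * s) @ vh

def fixed_merge(w_pretrained: torch.Tensor, factors: list, alpha: float = 1.0) -> torch.Tensor:
    """Fixed merged layer: pretrained weights plus the (scaled) sum of truncated updates."""
    merged = w_pretrained.clone()
    for u, s, vh in factors:
        merged = merged + alpha * low_rank_delta(u, s, vh)
    return merged

def filter_redundant_tasks(deltas: list[torch.Tensor], tau: float = 0.9) -> list[int]:
    """Greedily drop task updates whose flattened cosine similarity to an
    already-accepted update exceeds tau, so no task family dominates routing."""
    kept_ids, kept_dirs = [], []
    for t, d in enumerate(deltas):
        v = d.flatten()
        v = v / v.norm()
        if all(torch.dot(v, w).abs() < tau for w in kept_dirs):
            kept_ids.append(t)
            kept_dirs.append(v)
    return kept_ids
```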
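A sketch of the projection-based router. It assumes each task contributes an orthonormal basis for its retained subspace at the routing layer; the softmax temperature and top-k value are illustrative choices, not values from the paper.

```python
import torch

def route(z: torch.Tensor, bases: list[torch.Tensor],
          top_k: int = 2, temperature: float = 1.0) -> torch.Tensor:
    """z: intermediate activation (d,) from the first pass through the fixed merge.
    bases: per-task matrices B_t (d, k) with orthonormal columns spanning the
    retained task subspace. Returns sparse, renormalized per-task weights."""
    residuals = torch.stack([
        torch.norm(z - b @ (b.T @ z)) for b in bases       # ||z - P_t z||
    ])
    w = torch.softmax(-residuals / temperature, dim=0)     # small residual -> high weight
    idx = torch.topk(w, top_k).indices                     # keep only the top-k tasks
    sparse = torch.zeros_like(w)
    sparse[idx] = w[idx]
    return sparse / sparse.sum()
```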
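And a sketch of the on-the-fly merge and head selection for the second pass. The `heads` list of per-task classifier matrices is a hypothetical stand-in for the actual model plumbing.

```python
import torch

def adaptive_merge(w_pretrained: torch.Tensor,
                   task_deltas: list[torch.Tensor],
                   weights: torch.Tensor) -> torch.Tensor:
    """Second-pass weights for one layer: add only the routed low-rank updates."""
    merged = w_pretrained.clone()
    for delta, w in zip(task_deltas, weights):
        if w > 0:                                          # skip unselected tasks
            merged = merged + w * delta
    return merged

def predict(features: torch.Tensor, heads: list[torch.Tensor], selected: list[int]):
    """Evaluate only the routed classification heads and pick the max-logit class."""
    best = None
    for t in selected:
        logits = features @ heads[t].T                     # (num_classes_t,)
        score, cls = logits.max(dim=-1)
        if best is None or score > best[0]:
            best = (score, t, cls.item())
    return best  # (score, task id, class index)
```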
Experimental Insights
- SOTA across scales. MASS improves accuracy and normalized average accuracy on all 8/14/20-task vision suites and performs better on 6 of 8 NLP benchmarks.
- High per-task retention. Retains ≥96% of per-task accuracy (8-task), ≥93% (14-task), and ≥88% for most tasks in the 20-task setting, mitigating worst-case collapses common in fixed merges.
- Router comparisons. A learnable MLP router can eke out the best numbers but requires labeled data. MASS’s projection router beats nearest-neighbor routing while staying training-free and data-free, crucial for real-world model hubs with no datasets attached.
- Batch wins. If inputs in a batch share a task, choosing one adaptive merge per batch pushes normalized accuracy to ≥97% in 8/9 scenarios, essentially closing the gap to multi-task learning (see the sketch after this list).
- Overhead & storage. MASS needs only two forward passes and about 2× the parameters of a single pretrained model, independent of the task count, a practical sweet spot vs. storing all endpoints or ensembling them.
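A sketch of per-batch routing under the stated assumption that a batch is task-homogeneous: aggregate each task's projection residual over the batch, then pick a single adaptive merge for all inputs. The aggregation by averaging is my assumption for illustration, not necessarily the paper's exact rule.

```python
import torch

def batch_route(z_batch: torch.Tensor, bases: list[torch.Tensor], top_k: int = 1) -> torch.Tensor:
    """z_batch: (B, d) first-pass activations for a task-homogeneous batch.
    Average each task's projection residual over the batch, then route once."""
    residuals = torch.stack([
        torch.norm(z_batch - (z_batch @ b) @ b.T, dim=1).mean() for b in bases
    ])
    w = torch.softmax(-residuals, dim=0)
    idx = torch.topk(w, top_k).indices
    sparse = torch.zeros_like(w)
    sparse[idx] = w[idx]
    return sparse / sparse.sum()                           # one merge for the whole batch
```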
Thoughts
MASS reframes model merging as "route-then-compose" in a shared low-rank, orthogonalized basis given by TSV-M. The key design choices (redundant-direction filtering, mid-layer residual routing, and on-the-fly TSV recomposition) tackle the classic failure mode of one-size-fits-all fixed merges: task interference. Two limitations are natural next steps: (i) routing-layer adaptivity (since the best layer varies by task and architecture), and (ii) finer-grained subspace selection for OOD inputs or unseen blends of skills. Still, the central lesson is compelling: with structured low-rank updates in hand, simple geometry (projections and orthogonalization) is enough to recover near-oracle accuracy, without touching data or training a router.