Abstract
Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input’s intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2x storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
Highlights
- MASS adds a lightweight router in activation space that selects task subspaces per input by projecting intermediate features onto TSV spans; no task data or training is needed, which makes it well suited to models pulled from public repositories.
- Extends routing to classification-head selection when the task is unknown at inference, removing the need for an oracle head and challenging an unrealistic assumption common in the existing model merging literature: that task identity is known at inference time.
- Recovers ~98% of the fine-tuned endpoints’ average accuracy; consistently outperforms existing methods across all benchmarks.
- Stores only top-k singular components per task and orthogonalizes across tasks (as in TSV-M) to mitigate interference during adaptive merges.
- Training-free; introduces only two-pass inference and a roughly 2x parameter footprint vs. one pretrained model, irrespective of task count.
- Evaluated on CLIP ViT backbones across several task benchmarks and on Flan-T5; batched routing closes the gap to <1% in most settings.
Method Overview
- Start from layer-wise task deltas and SVD. For each task $t$ and layer $\ell$, compute the update $\Delta W_t^{\ell} = W_t^{\ell} - W_0^{\ell}$ and perform the SVD $\Delta W_t^{\ell} = U_t^{\ell} \Sigma_t^{\ell} (V_t^{\ell})^{\top}$. Keep only the top-$k$ singular components per task (compression); see the first sketch after this list.
- Fixed merge via TSV-M (pre-processing) or TA. Concatenate the truncated factors $\hat{U}_t^{\ell}, \hat{\Sigma}_t^{\ell}, \hat{V}_t^{\ell}$ across tasks, orthogonalize them to reduce interference (as in TSV-M), and build a single "fixed" merged model $W_{\text{merged}}^{\ell} = W_0^{\ell} + \sum_t \widehat{\Delta W}_t^{\ell}$. Before routing, filter redundant directions: flatten each $\widehat{\Delta W}_t$ and drop any task update whose cosine similarity to an already-accepted update exceeds a threshold, so no family of tasks dominates the router later.
- Data-free, projection-based router. On the first pass, run the input $x$ through the fixed merged model to obtain an intermediate activation $z^{\ell}(x)$ at the routing layer. For each task subspace (spanned by that task's retained singular vectors), compute the projection residual $r_t(x) = \lVert z^{\ell}(x) - P_t\, z^{\ell}(x) \rVert$, where $P_t$ is the orthogonal projector onto the subspace. Convert the residuals to weights with a softmax, discard tasks below a threshold, and keep the top-$k$ selected tasks (see the router sketch after this list).
- Adaptive low-rank merge on the fly. Combine only the selected subspaces: $W_{\text{adaptive}}^{\ell} = W_0^{\ell} + \sum_{t \in \mathcal{S}(x)} w_t\, \widehat{\Delta W}_t^{\ell}$, where $\mathcal{S}(x)$ is the set of routed tasks and $w_t$ are the router weights. Make a second pass with $W_{\text{adaptive}}$, evaluate the corresponding selected heads, and pick the max-logit head/class (see the merge sketch after this list). MASS does not assume an oracle head, in contrast to much of the existing model merging literature.
- Routing layer choice. Routing at mid/late layers works best on average (e.g., around layer 9 in ViT-B models; MLP blocks slightly favored), though the best layer is task-dependent and per-task variance changes across layers, leaving adaptive layer selection as an avenue for future work.
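A minimal PyTorch sketch of the compression and pre-processing steps (first two bullets). The rank `k`, scale `alpha`, and cosine threshold `tau` are illustrative assumptions, and the TSV-M orthogonalization across tasks is omitted for brevity; this is not the authors' reference code.

```python
import torch

def truncated_task_svd(w_finetuned: torch.Tensor, w_pretrained: torch.Tensor, k: int):
    """Top-k singular components of one task's update for a single layer."""
    delta = w_finetuned - w_pretrained                     # Delta W_t = W_t - W_0
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k], vh[:k, :]                      # compressed factors

def low_rank_delta(u: torch.Tensor, s: torch.Tensor, vh: torch.Tensor) -> torch.Tensor:
    """Rank-k reconstruction of a task update from its truncated factors."""
    return (u * s) @ vh

def fixed_merge(w_pretrained: torch.Tensor, factors: list, alpha: float = 1.0) -> torch.Tensor:
    """Fixed merged layer: pretrained weights plus the (scaled) sum of truncated updates."""
    merged = w_pretrained.clone()
    for u, s, vh in factors:
        merged = merged + alpha * low_rank_delta(u, s, vh)
    return merged

def filter_redundant_tasks(deltas: list[torch.Tensor], tau: float = 0.9) -> list[int]:
    """Greedily drop task updates whose flattened cosine similarity to an
    already-accepted update exceeds tau, so no task family dominates routing."""
    kept_ids, kept_dirs = [], []
    for t, d in enumerate(deltas):
        v = d.flatten()
        v = v / v.norm()
        if all(torch.dot(v, w).abs() < tau for w in kept_dirs):
            kept_ids.append(t)
            kept_dirs.append(v)
    return kept_ids
```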
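A sketch of the projection-based router. It assumes each task contributes an orthonormal basis for its retained subspace at the routing layer; the softmax temperature and top-k value are illustrative choices, not values from the paper.

```python
import torch

def route(z: torch.Tensor, bases: list[torch.Tensor],
          top_k: int = 2, temperature: float = 1.0) -> torch.Tensor:
    """z: intermediate activation (d,) from the first pass through the fixed merge.
    bases: per-task matrices B_t (d, k) with orthonormal columns spanning the
    retained task subspace. Returns sparse, renormalized per-task weights."""
    residuals = torch.stack([
        torch.norm(z - b @ (b.T @ z)) for b in bases       # ||z - P_t z||
    ])
    w = torch.softmax(-residuals / temperature, dim=0)     # small residual -> high weight
    idx = torch.topk(w, top_k).indices                     # keep only the top-k tasks
    sparse = torch.zeros_like(w)
    sparse[idx] = w[idx]
    return sparse / sparse.sum()
```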
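And a sketch of the on-the-fly merge and head selection for the second pass. The `heads` list of per-task classifier matrices is a hypothetical stand-in for the actual model plumbing.

```python
import torch

def adaptive_merge(w_pretrained: torch.Tensor,
                   task_deltas: list[torch.Tensor],
                   weights: torch.Tensor) -> torch.Tensor:
    """Second-pass weights for one layer: add only the routed low-rank updates."""
    merged = w_pretrained.clone()
    for delta, w in zip(task_deltas, weights):
        if w > 0:                                          # skip unselected tasks
            merged = merged + w * delta
    return merged

def predict(features: torch.Tensor, heads: list[torch.Tensor], selected: list[int]):
    """Evaluate only the routed classification heads and pick the max-logit class."""
    best = None
    for t in selected:
        logits = features @ heads[t].T                     # (num_classes_t,)
        score, cls = logits.max(dim=-1)
        if best is None or score > best[0]:
            best = (score, t, cls.item())
    return best  # (score, task id, class index)
```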
Experimental Insights
- SOTA across scales. MASS improves accuracy and normalized average accuracy on all 8/14/20-task vision suites and performs better on 6 of 8 NLP benchmarks.
- High per-task retention. Retains ≥96% of per-task accuracy (8-task), ≥93% (14-task), and ≥88% for most tasks in the 20-task setting, mitigating worst-case collapses common in fixed merges.
- Router comparisons. A learnable MLP router can eke out the best numbers but requires labeled data. MASS’s projection router beats nearest-neighbor routing while staying training-free and data-free, crucial for real-world model hubs with no datasets attached.
- Batch wins. If inputs in a batch share a task, choosing one adaptive merge per batch pushes normalized accuracy to ≥97% in 8/9 scenarios, essentially closing the gap to multi-task learning (see the sketch after this list).
- Overhead & storage. MASS needs only two forward passes and about 2× the parameters of a single pretrained model, independent of the task count, a practical sweet spot vs. storing all endpoints or ensembling them.
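A sketch of per-batch routing under the stated assumption that a batch is task-homogeneous: aggregate each task's projection residual over the batch, then pick a single adaptive merge for all inputs. The aggregation by averaging is my assumption for illustration, not necessarily the paper's exact rule.

```python
import torch

def batch_route(z_batch: torch.Tensor, bases: list[torch.Tensor], top_k: int = 1) -> torch.Tensor:
    """z_batch: (B, d) first-pass activations for a task-homogeneous batch.
    Average each task's projection residual over the batch, then route once."""
    residuals = torch.stack([
        torch.norm(z_batch - (z_batch @ b) @ b.T, dim=1).mean() for b in bases
    ])
    w = torch.softmax(-residuals, dim=0)
    idx = torch.topk(w, top_k).indices
    sparse = torch.zeros_like(w)
    sparse[idx] = w[idx]
    return sparse / sparse.sum()                           # one merge for the whole batch
```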
Thoughts
MASS reframes model merging as "route-then-compose" in a shared low-rank, orthogonalized basis given by TSV-M. The key design choices (redundant-direction filtering, mid-layer residual routing, and on-the-fly TSV recomposition) tackle the classic failure mode of one-size-fits-all fixed merges: task interference. Two limitations are natural next steps: (i) routing-layer adaptivity (since the best layer varies by task and architecture), and (ii) finer-grained subspace selection for OOD inputs or unseen blends of skills. Still, the central lesson is compelling: with structured low-rank updates in hand, simple geometry (projections and orthogonalization) is enough to recover near-oracle accuracy, without touching data or training a router.