
MASS: MoErging through Adaptive Subspace Selection

Sapienza University of Rome, January 2025 - April 2025

MASS is a novel MoErging (model merging with Mixture-of-Experts-style routing) technique that adaptively selects low-rank subspaces for combining fine-tuned models, improving performance across diverse tasks.

Paper · Code


Abstract

Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input’s intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2x storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.

Highlights

Method Overview

  1. Start from layer-wise task deltas and SVD. For each task $i$ and layer $\ell$, compute the update $\Delta_{i}^{(\ell)} = \theta_{ft,i}^{(\ell)} - \theta_{pre}^{(\ell)}$ and perform the SVD $\Delta_i = U_i \Sigma_i V_i^\top$. Keep the top-$k$ singular components per task (compression).
  2. Fixed merge via TSV-M (pre-processing) or TA. Concatenate the truncated $U_i, V_i$ across tasks, orthogonalize them to reduce interference (as in TSV-M), and build a single “fixed” merged model $\hat\Delta = U_\perp \Sigma V_\perp^\top, \quad \theta_{MT} = \theta_{pre} + \alpha \hat\Delta$ (see the first sketch after this list). Before routing, filter redundant directions: flatten each $\Delta_i$ and drop any task update whose cosine similarity to an already-accepted update exceeds a threshold $\epsilon$, so that no single family of tasks dominates the router later.
  3. Data-free, projection-based router. On the first pass, run $x$ through $\theta_{MT}$ to get an intermediate activation $z^\ell$. For each task subspace $V_i$, compute the residual $r_i = \lVert z^\ell - V_i V_i^\top z^\ell \rVert_2$. Convert $-r_i$ to weights with a softmax, threshold by $\eta$, and keep the top-$k$ selected tasks.
  4. Adaptive low-rank merge on the fly. Combine only the selected subspaces: $\Delta_{ada} = \sum_{i \in \Omega} U_i \Sigma_i V_i^\top, \quad \theta_{MASS} = \theta_{pre} + \alpha \Delta_{ada}$ (see the second sketch after this list). Make a second pass with $\theta_{MASS}$, evaluate the corresponding selected heads, and pick the max-logit head/class. (MASS does not assume an oracle head, in contrast to existing model-merging literature.)
  5. Routing layer choice. Routing at mid/late layers works best on average (e.g., around layer 9 in ViT-B models, with MLP blocks slightly favored), though the best layer is task-dependent and per-task variance changes across layers, leaving adaptive layer selection as an avenue for future work.
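
Below is a minimal PyTorch sketch of steps 1-2. The function names, the rank `k`, the scaling `alpha`, and the use of QR for orthogonalization are illustrative assumptions under this writeup's notation, not the authors' reference implementation (TSV-M's actual decorrelation procedure may differ):

```python
import torch

def compress_task_delta(theta_ft: torch.Tensor, theta_pre: torch.Tensor, k: int):
    """Step 1: low-rank compression of one task's update at one layer."""
    delta = theta_ft - theta_pre                        # Δ_i = θ_ft,i − θ_pre
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :k], S[:k], Vh[:k, :]                   # keep the top-k singular components

def fixed_merge(theta_pre: torch.Tensor, factors: list, alpha: float = 1.0):
    """Step 2: build the fixed merged model from all truncated task updates."""
    U_cat = torch.cat([U for U, _, _ in factors], dim=1)      # d_out x (k·T)
    V_cat = torch.cat([Vh.T for _, _, Vh in factors], dim=1)  # d_in  x (k·T)
    S_cat = torch.cat([S for _, S, _ in factors])             # (k·T,)
    U_perp, _ = torch.linalg.qr(U_cat)                        # orthogonalize to reduce
    V_perp, _ = torch.linalg.qr(V_cat)                        # cross-task interference
    delta_hat = U_perp @ torch.diag(S_cat) @ V_perp.T         # Δ̂ = U⊥ Σ V⊥ᵀ
    return theta_pre + alpha * delta_hat                      # θ_MT = θ_pre + α Δ̂
```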

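The routing and adaptive-merge steps (3-4) can be sketched in the same spirit. Here `z` is the intermediate activation from the first pass through the fixed merged model, `factors[i] = (U_i, S_i, Vh_i)` are the truncated SVD factors from step 1, `V_list[i]` is the corresponding subspace basis, and the threshold `eta` and top-k budget are assumed hyperparameters rather than the paper's exact settings:

```python
import torch

def route(z: torch.Tensor, V_list: list, eta: float = 0.05, top_k: int = 2):
    """Step 3: score each task subspace by how well it explains activation z."""
    residuals = torch.stack([
        torch.linalg.vector_norm(z - V @ (V.T @ z))     # r_i = ||z − V_i V_iᵀ z||₂
        for V in V_list                                 # V has shape d x k
    ])
    weights = torch.softmax(-residuals, dim=0)          # small residual → large weight
    keep = (weights >= eta).nonzero(as_tuple=True)[0]   # threshold by η ...
    return keep[weights[keep].argsort(descending=True)][:top_k]  # ... then keep top-k

def adaptive_merge(theta_pre: torch.Tensor, factors: list, selected, alpha: float = 1.0):
    """Step 4: recompose only the selected low-rank task updates."""
    delta_ada = sum(factors[i][0] @ torch.diag(factors[i][1]) @ factors[i][2]
                    for i in selected)                  # Σ_{i∈Ω} U_i Σ_i V_iᵀ
    return theta_pre + alpha * delta_ada                # θ_MASS = θ_pre + α Δ_ada
```

The second forward pass then uses the weights produced by `adaptive_merge`, and the prediction is taken as the max-logit class across the selected tasks' classification heads.
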
Experimental Insights

Thoughts

MASS reframes model merging as “route-then-compose” in a shared low-rank, orthogonalized basis given by TSV-M. The key design choices (redundant-direction filtering, mid-layer residual routing, and on-the-fly TSV recomposition) tackle the classic failure mode of one-size-fits-all fixed merges: task interference. Two limitations are natural next steps: (i) routing-layer adaptivity (since the best layer varies by task and architecture), and (ii) finer-grained subspace selection for OOD inputs or unseen blends of skills. Still, the central lesson is compelling: with structured low-rank updates in hand, simple geometry (projections and orthogonalization) is enough to recover near-oracle accuracy, without touching data or training a router.

Citation


@misc{crisostomi2025massmoergingadaptivesubspace,
  title={MASS: MoErging through Adaptive Subspace Selection},
  author={Donato Crisostomi and Alessandro Zirilli and Antonio Andrea Gargiulo and Maria Sofia Bucarelli and Simone Scardapane and Fabrizio Silvestri and Iacopo Masi and Emanuele Rodolà},
  year={2025},
  eprint={2504.05342},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.05342},
}
