
Task Singular Vectors: Reducing Task Interference in Model Merging

Sapienza University of Rome, February 2024 - November 2024
Poster presentation at CVPR, June 2025, Nashville, Tennessee, USA

Task Singular Vectors reduce task interference during model merging by keeping the per-task contributions separable, supported by both empirical and theoretical analysis.

Paper Code Poster PDF


Abstract

Task Arithmetic has emerged as a simple yet effective method to merge models without additional training. However, by treating entire networks as flat parameter vectors, it overlooks key structural information and is susceptible to task interference. In this paper, we study task vectors at the layer level, focusing on task layer matrices and their singular value decomposition. In particular, we concentrate on the resulting singular vectors, which we refer to as Task Singular Vectors (TSV). Recognizing that layer task matrices are often low-rank, we propose TSV-Compress (TSV-C), a simple procedure that compresses them to ~10% of their original size while retaining ~99% of accuracy. We further leverage this low-rank space to define a new measure of task interference based on the interaction of singular vectors from different tasks. Building on these findings, we introduce TSV-Merge (TSV-M), a novel model merging approach that combines compression with interference reduction, significantly outperforming existing methods.

Highlights

Method Overview

  1. Layer-wise SVD & TSVs: For each layer’s task matrix (fine-tuned minus base weights) $\Delta W_\ell = W_\ell^{(task)} - W_\ell^{(base)}$, compute the SVD $\Delta W_\ell = U_\ell \Sigma_\ell V_\ell^\top$. Treat the left/right singular vectors $U_\ell$ and $V_\ell$ as Task Singular Vectors (TSVs), a structured basis for each task at that layer.

  2. Low-rank truncation (TSV-C): Keep only the top-k singular components per task and layer to compress task updates ~10x while preserving ~99% of the original accuracy.

  3. Interference scoring (STI): Quantify Singular Task Interference by measuring the cross-task overlap of TSVs, via the deviations of the Gram matrices $U^\top U$ and $V^\top V$ from the identity; higher overlap ⇒ more interference.

  4. Interference reduction: Decorrelate TSVs across tasks by applying whitening or an orthogonal Procrustes projection (shown to be equivalent transformations) to the concatenated TSVs; this minimizes cross-task interactions before merging.

  5. Whitened low-rank merge (TSV-M): Form a block-diagonal matrix of the retained singular values, recombine it with the whitened TSVs of $U$ and $V$ to obtain a low-rank per-layer update, and add the update (scaled by a global factor $\alpha$) to the base model to produce the merged multi-task network. For ease of adoption, a single default value of $\alpha$, chosen through extensive empirical testing, proves robust across a variety of tasks. A minimal sketch of these five steps follows this list.
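Below is a minimal NumPy sketch of the pipeline in steps 1-5, applied to a single layer. The function names, the fixed per-task rank `k`, and the interference proxy are my own illustrative assumptions based on this write-up, not the paper's official implementation.

```python
import numpy as np

def task_svd(w_task, w_base, k):
    """Steps 1-2: SVD of the task matrix, truncated to the top-k components (TSV-C)."""
    delta = w_task - w_base                                   # ΔW = W_task - W_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def interference_proxy(us, vts):
    """Step 3: cross-task overlap of concatenated TSVs, as deviation from identity."""
    u_cat = np.concatenate(us, axis=1)                        # (n, T*k)
    v_cat = np.concatenate([vt.T for vt in vts], axis=1)
    gram_u = u_cat.T @ u_cat - np.eye(u_cat.shape[1])
    gram_v = v_cat.T @ v_cat - np.eye(v_cat.shape[1])
    return float(np.abs(gram_u * gram_v).sum())               # higher => more interference

def orthogonalize(m):
    """Step 4: nearest matrix with orthonormal columns (orthogonal Procrustes / whitening)."""
    p, _, qt = np.linalg.svd(m, full_matrices=False)
    return p @ qt

def tsv_merge(w_base, task_weights, k, alpha=1.0):
    """Step 5: whitened low-rank merge of all task updates into the base weights."""
    us, ss, vts = zip(*(task_svd(w, w_base, k) for w in task_weights))
    u_cat = orthogonalize(np.concatenate(us, axis=1))
    v_cat = orthogonalize(np.concatenate([vt.T for vt in vts], axis=1))
    sigma = np.diag(np.concatenate(ss))                       # block-diagonal singular values
    return w_base + alpha * (u_cat @ sigma @ v_cat.T)

# Toy usage on a single 64x64 layer with three synthetic "tasks".
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 64))
tasks = [w_base + rng.normal(scale=0.1, size=(64, 64)) for _ in range(3)]
us, ss, vts = zip(*(task_svd(w, w_base, k=8) for w in tasks))
print("interference proxy before whitening:", interference_proxy(us, vts))
merged = tsv_merge(w_base, tasks, k=8, alpha=0.5)
```

A small fixed $k$ per task keeps the example short; in practice the retained rank should be chosen so that the concatenated TSVs stay within each layer's rank budget.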

Experimental Insights

Thoughts

TSVs offer an interpretable and efficient mechanism to adapt and merge models, with natural extensions to continual and federated learning. One open question is the precise link between a layer matrix’s maximum rank and the amount of information it can store. Intuitively, TSV can be seen as a mathematical procedure that, given a set of matrices each with a different rank, retains for each one the most informative directions up to its available rank, then orthogonalizes them jointly to minimize interference. The maximum rank of the single layer matrix in which we want to store all the information from the entire set is a hard budget: the best we can do is fill it with the most informative directions and recombine them without adding interfering ones. This “use-the-rank-you-have” principle suggests that a simple task, like MNIST, may occupy fewer rank directions than a more complex one, if one is willing to give up equal treatment of every task.

A related matrix-rank perspective: for two task matrices A and B of the same layer, one can ask when rank(A + B) equals rank(A) + rank(B). Equality holds under a strong but illustrative condition: both the column spaces and the row spaces of A and B are mutually orthogonal, i.e., $A^\top B = 0$ and $A B^\top = 0$, which in particular gives $\mathcal{C}(A)\cap\mathcal{C}(B) = \{0\}$ and $\mathcal{R}(A)\cap\mathcal{R}(B) = \{0\}$, so the column spaces and the row spaces of A and B intersect only in the zero vector. This is precisely the scenario TSV-M aims to approximate by whitening/orthogonalizing TSVs before recombination, thereby pushing A and B toward additive, non-interfering contributions within each layer’s rank budget.
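As a quick numerical check of this condition (a toy example of my own, not from the paper), one can build two matrices whose column and row spaces are orthogonal by construction and verify that the ranks add:

```python
import numpy as np

rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.normal(size=(8, 8)))[0]    # an orthonormal basis of R^8

# A uses the first two basis directions, B the next three, so their
# column spaces and row spaces are mutually orthogonal by construction.
A = basis[:, :2] @ np.diag([3.0, 1.5]) @ basis[:, :2].T
B = basis[:, 2:5] @ np.diag([2.0, 1.0, 0.5]) @ basis[:, 2:5].T

print(np.allclose(A.T @ B, 0), np.allclose(A @ B.T, 0))    # True True
print(np.linalg.matrix_rank(A),                            # 2
      np.linalg.matrix_rank(B),                            # 3
      np.linalg.matrix_rank(A + B))                        # 5 = 2 + 3
```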

Analogy: spaces as cubes

A useful visualization that has helped me understand this concept is the figure above, where the space of a neural network layer is represented as an empty cube. Each task occupies a subspace within this cube, represented by smaller colored cubes. When two tasks are naively merged, they interfere: their subspaces overlap, leading to conflicts in the shared parameters. By orthogonalizing the TSV subspaces, we effectively rearrange these smaller cubes so that they fit into the larger cube without overlapping, allowing each task to use its own distinct subspace.

This visualization helps conceptualize how TSV-M reduces interference and maximizes the effective use of the available parameter space, but it also opens an interesting question about how these networks actually use knowledge. In my opinion, seen this way, we are not merging networks in the way we intended from the beginning, namely fusing knowledge to obtain a better understanding of the entire world. We are simply packing several networks together, each specialized in a small task, without really reaching a better and more general understanding of all the concepts combined. This is a major limitation in my view, and I do not have a clear idea of how or when we will be able to resolve it, or whether this research direction is telling us that a neural network can only store information in separate subspaces (circuits, modules, or pathways) without ever finding useful intersections between different tasks.

Citation


@INPROCEEDINGS{11092448,
  author={Gargiulo, Antonio Andrea and Crisostomi, Donato and Bucarelli, Maria Sofia and Scardapane, Simone and Silvestri, Fabrizio and Rodolà, Emanuele},
  booktitle={2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  title={Task Singular Vectors: Reducing Task Interference in Model Merging},
  year={2025},
  pages={18695-18705},
  doi={10.1109/CVPR52734.2025.01742}
}
