Compacter: Efficient Low-Rank Hypercomplex Adapter Layers


Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful, as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Compacter inserts task-specific weight matrices into a pretrained model's weights. Each weight matrix is efficiently computed as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By training only 0.07% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE, and outperforms fine-tuning and other parameter-efficient fine-tuning methods in low-resource settings.
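The weight construction described above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's reference implementation: the shapes and variable names (`n`, `d`, `k`, `A`, `s`, `t`) are assumptions chosen to show how a sum of Kronecker products between shared "slow" matrices and per-layer rank-one "fast" factors yields a full weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 4, 8, 8  # hypercomplex dimension and target weight shape (illustrative)

# "Slow" weights: n small matrices of shape (n, n), shared across layers.
A = rng.standard_normal((n, n, n))
# "Fast" rank-one factors, defined per layer: each B_i = s_i @ t_i is rank one.
s = rng.standard_normal((n, d // n, 1))  # column vectors
t = rng.standard_normal((n, 1, k // n))  # row vectors

# Assemble the (d, k) weight as a sum of Kronecker products A_i ⊗ B_i.
W = sum(np.kron(A[i], s[i] @ t[i]) for i in range(n))
assert W.shape == (d, k)
```

The parameter saving comes from the factorization: the fast factors cost only n * (d/n + k/n) = d + k values per layer instead of d * k for a dense matrix, while the n^3 slow parameters are shared across all layers.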