Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns

Abstract

This work presents one of the first comprehensive studies of sparse attention patterns in Transformer models. We first examine whether pre-training is essential for models with sparse attention patterns and point out that the more efficient fine-tuning-only approach also yields a sufficiently strong model. We then analyze two of the most widely used pattern families, local patterns and global patterns, and conclude that the less-studied global patterns offer unique capabilities that local patterns cannot attain. We also show that fixing a single pattern discards a certain amount of information, and that different tasks benefit from using different patterns across different layers of the model. Guided by these findings, we propose a novel Adaptive Axis Attention (AAA) method, which learns a distinct attention pattern for each Transformer layer, depending on the downstream task, during fine-tuning. AAA aims to distinguish important tokens from unimportant ones and lets the model focus on the former, depending on the task and the layer. It does not require pre-training to accommodate the sparse patterns, and it achieves competitive, and sometimes better, performance compared with fixed sparse attention patterns that require resource-intensive pre-training.
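
As a rough illustration of the idea described above, the sketch below shows how a local (sliding-window) mask and a global (axis-style) mask could be combined with a per-layer learnable gate during fine-tuning. The class and function names, the window and stride parameters, and the soft gating mechanism are assumptions made for illustration only, not the authors' implementation.

```python
import torch


def local_mask(seq_len: int, window: int) -> torch.Tensor:
    """Local pattern: each token attends only to neighbors within `window` positions."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window


def axis_mask(seq_len: int, stride: int) -> torch.Tensor:
    """Global (axis-like) pattern: every `stride`-th token attends to, and is attended by, all tokens."""
    is_global = (torch.arange(seq_len) % stride) == 0
    return is_global[:, None] | is_global[None, :]


class AdaptiveAxisMask(torch.nn.Module):
    """Hypothetical per-layer gate that mixes local and global patterns during fine-tuning."""

    def __init__(self, seq_len: int, window: int = 4, stride: int = 8):
        super().__init__()
        self.register_buffer("local", local_mask(seq_len, window).float())
        self.register_buffer("global_", axis_mask(seq_len, stride).float())
        # One learnable logit per pattern; a softmax decides how much each pattern contributes.
        self.gate = torch.nn.Parameter(torch.zeros(2))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, seq_len, seq_len) raw attention logits.
        w = torch.softmax(self.gate, dim=0)
        soft_mask = w[0] * self.local + w[1] * self.global_
        # Positions outside the selected pattern(s) are strongly down-weighted before the softmax.
        return scores + torch.log(soft_mask.clamp_min(1e-9))


if __name__ == "__main__":
    seq_len = 16
    layer_mask = AdaptiveAxisMask(seq_len)
    scores = torch.randn(2, 4, seq_len, seq_len)
    attn = torch.softmax(layer_mask(scores), dim=-1)
    print(attn.shape)  # torch.Size([2, 4, 16, 16])
```

In this sketch each layer would own its own `AdaptiveAxisMask`, so the gate logits can converge to different local/global mixes per layer for a given downstream task, which mirrors the per-layer, per-task adaptivity described in the abstract.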

Publication
In Findings of the Association for Computational Linguistics.