This work presents one of the first comprehensive studies of different sparse attention patterns in Transformer models. We first discuss why pre-training is essential for models with sparse attention patterns and point out that the efficient fine-tuning …
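To make the notion of a sparse attention pattern concrete, the sketch below restricts standard scaled dot-product attention to a sliding-window pattern, one common pattern among those such a study would compare. This is a generic, mask-based illustration, not the paper's implementation: the function names `sliding_window_mask` and `sparse_attention` are hypothetical, and a real efficient variant would use block-sparse kernels rather than materializing the full score matrix.

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to j iff |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to a given sparsity pattern.

    q, k, v: (batch, heads, seq_len, head_dim); mask: (seq_len, seq_len) bool.
    Disallowed positions are set to -inf before the softmax, so each query
    only distributes probability mass over its permitted keys.
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: 8 tokens, each attending only to a local window of radius 2.
q = k = v = torch.randn(1, 4, 8, 16)
out = sparse_attention(q, k, v, sliding_window_mask(8, window=2))
print(out.shape)  # torch.Size([1, 4, 8, 16])
```

Other patterns studied in this line of work (e.g., strided, global-token, or block-local attention) differ only in how the boolean mask is constructed; the attention computation itself is unchanged.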