Mezzini, M., Ferrato, A., Limongelli, C., Sansonetti, G. (2024). Enhancing Accuracy through Architectural Modifications in Vision Transformers. In Proceedings of MLDM.it 2024.
Enhancing Accuracy through Architectural Modifications in Vision Transformers
Mauro Mezzini; Alessio Ferrato; Carla Limongelli; Giuseppe Sansonetti
2024-01-01
Abstract
The Transformer architecture represents one of the most significant advances in Artificial Intelligence in the last decade. Its evolution started with language translation and established it as the main tool in Natural Language Processing, eventually giving rise to Large Language Models. One of the most critical components of the Transformer architecture is the multi-head attention mechanism. In the self-attention mechanism, the output sequence y = (y_0, ..., y_{N−1}) (with N denoting the number of tokens of the input sequence) of a self-attention module is obtained as a linear combination of the input sequence x = (x_0, ..., x_{N−1}), that is, y_j = ∑_{i=0}^{N−1} α(x_j, x_i) x_i. The most popular choice for the coefficients α is the one in which ∑_{i=0}^{N−1} α(x_j, x_i) = 1 and α(x_j, x_i) ≥ 0 for all i, j = 0, 1, ..., N−1. To ensure this, a score function a(x_j, x_i) is provided, and the softmax operation on a ensures that the coefficients α form a convex combination, that is, α(x_j, x_i) = softmax(a(x_j, x_i)). In the multi-head self-attention module there are h score matrices, where h is the number of heads, and the scores of different sequences are batched together to speed up the computation. We propose a methodology consisting of applying a convolutional filter to the batch of scores, followed by a rectified linear unit operation. We tested this modification using a Vision Transformer architecture on the CIFAR-10 dataset, obtaining encouraging results.
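The abstract does not give implementation details, so the following is only a minimal sketch of the idea it describes: a multi-head self-attention block in which a convolutional filter followed by a ReLU is applied to the batch of attention score maps. The kernel size, the depthwise (per-head) convolution, the placement of the filter before the softmax, and the class name ConvScoreAttention are illustrative assumptions, not the authors' design.

# Minimal sketch (not the authors' released code), assuming PyTorch: a conv filter
# plus ReLU applied to the (B, h, N, N) batch of attention scores before the softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvScoreAttention(nn.Module):
    def __init__(self, dim, num_heads, kernel_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical choice: one depthwise 2D convolution per head,
        # acting on each N x N score map.
        self.score_conv = nn.Conv2d(num_heads, num_heads, kernel_size,
                                    padding=kernel_size // 2, groups=num_heads)

    def forward(self, x):                       # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, h, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # a(x_j, x_i), shape (B, h, N, N)
        scores = F.relu(self.score_conv(scores))    # convolutional filter + ReLU on the score batch
        alpha = scores.softmax(dim=-1)              # α(x_j, x_i): convex combination weights
        y = (alpha @ v).transpose(1, 2).reshape(B, N, -1)   # y_j = Σ_i α(x_j, x_i) x_i (per head)
        return self.proj(y)

With an input x of shape (B, N, dim), a module like this could be dropped into a Vision Transformer encoder block in place of the standard multi-head self-attention layer; how the paper actually integrates the filter may differ from this sketch.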