Discussion of article "Neural networks made easy (Part 10): Multi-Head Attention"


New article Neural networks made easy (Part 10): Multi-Head Attention has been published:

We have previously considered the mechanism of self-attention in neural networks. In practice, modern neural network architectures use several parallel self-attention threads to find various dependencies between the elements of a sequence. Let us consider the implementation of such an approach and evaluate its impact on the overall network performance.

The Self-Attention algorithm uses three trained weight matrices (Wq, Wk and Wv). These matrices are used to derive three entities: Query, Key and Value. The first two define the pairwise relationships between the elements of the sequence, while the last one defines the context of the analyzed element.
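The step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's MQL5 implementation; the tensor sizes and random weights are placeholders chosen only to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative sizes, not taken from the article
seq_len, d_model, d_k = 4, 8, 8

X = rng.standard_normal((seq_len, d_model))   # input sequence, one row per element
Wq = rng.standard_normal((d_model, d_k))      # the three trained weight matrices
Wk = rng.standard_normal((d_model, d_k))
Wv = rng.standard_normal((d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # Query, Key, Value

# pairwise relationships between sequence elements: scaled dot-product scores
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax

out = weights @ V                              # context of each analyzed element
```

Each row of `weights` sums to 1, so `out` is a convex combination of the Value vectors, weighted by how strongly each Query matches each Key.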

It's no secret that situations are not always clear-cut. On the contrary, in most cases a situation can be interpreted from different points of view, and the conclusions can be completely opposite depending on the point of view selected. In such situations it is important to consider all possible variants and to make a decision only after careful analysis. The Multi-Head Attention mechanism has been proposed for solving such problems: each "head" has its own opinion, while the decision is made by a balanced vote.

The Multi-Head Attention architecture implies the parallel use of multiple self-attention threads with different weights, which imitates a versatile analysis of a situation. The outputs of the self-attention threads are concatenated into a single tensor. The final result of the algorithm is obtained by multiplying this tensor by the W0 matrix, whose parameters are selected during neural network training. The whole architecture replaces the Self-Attention block in the encoder and decoder of the transformer architecture.
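The concatenate-and-project step can be sketched as follows. Again, this is a schematic NumPy illustration under assumed sizes (two heads, model dimension 8), not the article's MQL5 code; the `attention` helper is the same scaled dot-product routine as above:

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed sizes for illustration: two heads, each of width d_model / n_heads
seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

def attention(X, Wq, Wk, Wv):
    """One self-attention head: scaled dot-product over Q, K, V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    s = Q @ K.T / np.sqrt(d_k)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

X = rng.standard_normal((seq_len, d_model))

# run n_heads parallel self-attention threads, each with its own weights
heads = [attention(X,
                   rng.standard_normal((d_model, d_k)),
                   rng.standard_normal((d_model, d_k)),
                   rng.standard_normal((d_model, d_k)))
         for _ in range(n_heads)]

# concatenate the head outputs into a single tensor ...
concat = np.concatenate(heads, axis=1)            # shape (seq_len, n_heads * d_k)

# ... and project with W0 (trained in practice, random here)
W0 = rng.standard_normal((n_heads * d_k, d_model))
out = concat @ W0                                  # final Multi-Head Attention output
```

Note that because each head works in a d_model / n_heads subspace, the concatenated tensor has the same width as the input, so the block can be dropped in wherever plain Self-Attention was used.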

Author: Dmitriy Gizlyk