Description of the Multi-Head Self-Attention architecture

The Self-Attention technology discussed earlier identifies dependencies between sequence objects in a certain context and then ranks them using the Softmax function. However, when solving practical problems, such an assessment cannot always be made unambiguously. Typically, the dependency coefficients between objects change considerably when the point of view or the context of the analyzed element changes. The final decision on element dependencies is always a compromise. Multi-Head Self-Attention is designed specifically to help discover dependencies between elements by examining the input data from several perspectives at once. An additional trainable weight matrix helps the model learn to find this compromise.

Perhaps the simplest solution to such a problem would be to extend our CNeuronAttention class by adding an array of Self-Attention blocks to it. This approach is possible but irrational. The number of objects would grow in proportion to the number of attention heads. Furthermore, executing the operations of each attention head sequentially would prevent computing attention for all heads in parallel. Additionally, the subsequent concatenation of the attention head results would incur its own resource and time overhead.

There is a solution, and it lies in the realm of matrix operations. A solid understanding of matrix algebra greatly aids in comprehending the mathematics of neural networks and provides a clear picture of how operations can be divided into parallel threads.

Let's go through the Self-Attention algorithm and think about transforming operations to implement Multi-Head Self-Attention.

  1. First, we calculate the Query, Key, and Value vectors. These vectors are calculated by multiplying each element of the original sequence by the corresponding matrices WQ, WK, and WV.

To organize Multi-Head Self-Attention, we need to repeat this operation for each attention head. Let's start with a simple example using three attention heads.
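
For reference, the per-head computation can be sketched in standard notation as follows, where I is the source data tensor and h is the attention head index:

$$Q^h = I \cdot W_Q^h, \qquad K^h = I \cdot W_K^h, \qquad V^h = I \cdot W_V^h, \qquad h = 1, 2, 3$$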

I think everything is clear here.

Now let's look at the dimensions of the tensors. Remember that the model architecture provides for the same number of sequence elements at all stages. Each element of the sequence is described by a certain vector of values. Since the Self-Attention mechanism is applied in the same way to each element of the sequence, we can analyze the operations using the description vector of a single element as an example. Moreover, the size of this vector is the same for the source data and Value tensors, but it may differ from the size of the description vector of one sequence element in the Query and Key tensors. Let's use nI for the size of the source data vector and nK for the size of the key vector. Then the tensors will have the following dimensions.

| Tensor | I | Q | WQ | K | WK | V | WV |
|--------|---|---|----|---|----|---|----|
| Size | nI | nK | nI * nK | nK | nI * nK | nI | nI * nI |

The specified tensor sizes are applicable to all attention heads. Let's try to combine the corresponding weight matrices into one large one.

Such concatenated weight matrices will have the size nI * 3nK for the query matrix WQC and the key matrix WKC. The matrix WVC will have the size nI * 3nI, where 3 is the number of attention heads.

Let's substitute the concatenated matrices into the formulas for determining vectors.
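
The substitution can be sketched as follows, using the notation above for three heads, with the vertical bars denoting concatenation of the per-head matrices:

$$Q^C = I \cdot W_Q^C = I \cdot \left[ W_Q^1 \mid W_Q^2 \mid W_Q^3 \right] = \left[ Q^1 \mid Q^2 \mid Q^3 \right]$$

The expressions for K^C and V^C are analogous.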

According to the rules of matrix multiplication, we obtain the following tensor sizes.

| Tensor | I | QC | WQC | KC | WKC | VC | WVC |
|--------|---|----|-----|----|-----|----|-----|
| Size | nI | 3nK | nI * 3nK | 3nK | nI * 3nK | 3nI | nI * 3nI |

Compare the tensor sizes in the two tables: they are very similar. The only difference is that the entity dimensions are multiplied by the number of attention heads. What practical value does this have for us? It is quite straightforward. Instead of creating multiple instances of objects for each attention head, we can create just one object for computing each entity. As when organizing the similar process in the Self-Attention mechanism, we can use our convolutional layers; we only need to increase the results window size in proportion to the number of attention heads.
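
As an illustration, here is a minimal numpy sketch (not the library code from this book) showing that one multiplication by a concatenated weight matrix reproduces the per-head Query computations; the sizes and the number of heads are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_i, n_k, heads = 5, 8, 4, 3                  # sequence length, vector sizes, number of heads

I  = rng.normal(size=(n, n_i))                   # source data tensor
Wq = [rng.normal(size=(n_i, n_k)) for _ in range(heads)]   # per-head W_Q matrices

Q_per_head = [I @ W for W in Wq]                 # three separate multiplications

Wq_c = np.concatenate(Wq, axis=1)                # concatenated matrix of size n_i x 3*n_k
Q_c  = I @ Wq_c                                  # a single multiplication, result n x 3*n_k

# Each n_k-wide slice of the concatenated result matches the corresponding per-head result
for h in range(heads):
    assert np.allclose(Q_c[:, h*n_k:(h+1)*n_k], Q_per_head[h])
```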

 

  2. Next, we define pairwise dependencies between the elements of the sequence. To do this, we multiply the Query vector of each element by the Key vectors of all elements of the sequence. As a result of this iteration, we obtain a Score matrix of size N*N, where N is the size of the sequence.

As a result of this operation, we expect to obtain one dependency coefficient between a pair of sequence elements for each attention head. However, multiplying two concatenated vectors returns only a single value, just as in single-head Self-Attention.

We can change the dimensionality of the vectors and convert them into two-dimensional matrices. This makes sense, as we can allocate the data of each attention head to a separate row. However, by substituting these matrices into the formula above, we get a square matrix with a side length equal to the number of attention heads, whereas we expected to obtain a vector with a size equal to the number of heads.

There is still a way out. Let's remember the matrix multiplication rule.

We will substitute here our two-dimensional matrices of multi-head attention. Don't forget that the second matrix is transposed before multiplication.
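
For a single pair of sequence elements, with the query data reshaped into a matrix Q (one row per head) and the key data into a matrix K, the product can be sketched for three heads as follows, where q^h and k^h are the row vectors of head h:

$$Q \cdot K^T =
\begin{pmatrix}
q^1 (k^1)^T & q^1 (k^2)^T & q^1 (k^3)^T \\
q^2 (k^1)^T & q^2 (k^2)^T & q^2 (k^3)^T \\
q^3 (k^1)^T & q^3 (k^2)^T & q^3 (k^3)^T
\end{pmatrix}$$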

As you can see, the vector we expected to obtain forms the diagonal of the result matrix, while all the other elements are simply a waste of resources for us. But we can break this procedure down into separate operations. For example, we will not transpose the key matrix and will instead use the Hadamard product (element-wise matrix multiplication).

After this, to obtain the expected result, all we need to do is add the elements of the matrix row by row.
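
A minimal numpy sketch of this idea (the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
heads, n_k = 3, 4

# Query and Key data of two sequence elements, reshaped to one row per attention head
Q = rng.normal(size=(heads, n_k))
K = rng.normal(size=(heads, n_k))

# Full matrix product: only the diagonal contains the per-head scores we need
diag_scores = np.diag(Q @ K.T)

# Hadamard product followed by row-wise summation yields the same vector directly
scores = (Q * K).sum(axis=1)

assert np.allclose(scores, diag_scores)
```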

In the end, we got the result in two operations instead of one. However, it's important to note two things:

  • The Self-Attention formula uses a transposed matrix. Transposition is also an operation on a matrix, even though it is not highlighted separately, and its execution also requires resources. By splitting the procedure into separate operations, we eliminated it.
  • The vector of coefficients is determined in two operations, regardless of the number of attention heads.
  3. The next step is to divide the resulting values by the square root of the dimension of the Key vector and normalize them with the Softmax function in the context of each Query. In this way, we obtain the coefficients of pairwise interdependence between sequence elements.

At this point, we will not complicate or simplify anything. Matrix division by a constant is always performed element-wise regardless of the matrix size, but we will need to normalize the data on a per-attention-head basis.
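
As a small numpy illustration of per-head normalization (the shapes are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
heads, n, n_k = 3, 5, 4

# Raw dependency scores: for each head, an N x N matrix of Query-Key products
scores = rng.normal(size=(heads, n, n))

scaled = scores / np.sqrt(n_k)                     # divide by the square root of the Key size

# Softmax over the Key axis, separately for each head and each Query
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

assert np.allclose(weights.sum(axis=-1), 1.0)      # each Query row sums to 1 within every head
```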

  4. We multiply each Value vector by the corresponding interdependence coefficient and obtain the adjusted value of the element. The goal of this iteration is to focus on relevant elements and reduce the influence of irrelevant values.

To solve this problem, we will use the techniques applied in step 2. First, let's reshape the vector of values into a two-dimensional matrix in which each row corresponds to an individual attention head.

After this, we can use element-wise multiplication of the dependency coefficient vector by the matrix of values.

  5. Then we sum all the adjusted Value vectors for each element. The result of this operation is the vector of output values of the Self-Attention layer.

In the last point, there is nothing more to add. We sum the values of the vector elements separately in the context of each Query and each attention head. We can easily parallelize this task by creating a separate thread for computing each individual vector.
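
A minimal numpy sketch of steps 4 and 5 for a single Query (the shapes are illustrative and continue the conventions above):

```python
import numpy as np

rng = np.random.default_rng(3)
heads, n, n_v = 3, 5, 8            # number of heads, sequence length, Value vector size

# Normalized dependency coefficients of one Query to all N elements, per head
weights = rng.random(size=(heads, n))
weights /= weights.sum(axis=1, keepdims=True)

# Value vectors of all N elements, reshaped to one row block per head
values = rng.normal(size=(heads, n, n_v))

# Step 4: scale each Value vector by its coefficient (element-wise, per head)
scaled = weights[:, :, None] * values

# Step 5: sum the adjusted Value vectors; one output vector per attention head
out = scaled.sum(axis=1)           # shape (heads, n_v)
```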

After completing all the steps of the Self-Attention mechanism in multi-head mode, we obtain a vector of results for each attention head. Consequently, the overall size of the results tensor will exceed the size of the original data tensor in proportion to the number of heads. To reduce the dimensionality, the Multi-Head Self-Attention algorithm multiplies the concatenated results tensor by an additional weight matrix W0. As you can imagine, this procedure is very similar to the operation of a fully connected neural layer without an activation function. We performed similar operations in step 1 to determine the Query, Key, and Value vectors. This means we can apply the same solution and use the previously created convolutional layers.
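
In formula form, this final step can be sketched in standard notation as follows, where head_h denotes the results tensor of attention head h and the vertical bars denote concatenation:

$$MHSA(I) = \left[ head_1 \mid head_2 \mid head_3 \right] \cdot W_0$$

Here W_0 has as many rows as the concatenated results tensor has columns and as many columns as the description vector of one element of the original data.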

Here, we can also note another point. When describing the operation of the Self-Attention block, we pointed out that the vectors describing one sequence element in the Value tensor and in the source data tensor have the same size. This requirement was based on the need to subsequently add the tensor of Self-Attention results to the initial data. In the case of multi-head attention, we always end up with a concatenated results tensor that is larger than the original data tensor. To align them, the results tensor is multiplied by the matrix W0. Therefore, to save resources, we can reduce the size of the description vector of a single sequence element in the Value tensor without risking errors in subsequent data processing.

The rest of the algorithm of the transformer encoder remains unchanged, and we can leverage the developments from the previous section.

Now that we have a complete understanding of the principles behind the algorithm, we can proceed to its implementation.