Discussion of article "Neural networks made easy (Part 8): Attention mechanisms"


New article Neural networks made easy (Part 8): Attention mechanisms has been published:

In previous articles, we have already tested various options for organizing neural networks. We also considered convolutional networks borrowed from image processing algorithms. In this article, I suggest looking at attention mechanisms, whose appearance gave impetus to the development of language models.

When analyzing a candlestick chart of a symbol, we identify trends and determine their trading ranges. That is, we select certain objects from the general picture and focus our attention on them, because we understand that these objects affect future price behavior. To implement such an approach, back in 2014 developers proposed the first algorithm that analyzes and highlights dependencies between the elements of the input and output sequences [8]. The proposed algorithm is called the "Generalized Attention Mechanism". It was initially proposed for machine translation models based on recurrent networks, as a solution to the problem of long-term memory when translating long sentences. This approach significantly improved the results of the previously considered recurrent neural networks based on LSTM blocks [4].

The classical machine translation model using recurrent networks consists of two blocks, Encoder and Decoder. The first block encodes the input sequence in the source language into a context vector, and the second block decodes the resulting context into a sequence of words in the target language. As the length of the input sequence increases, the influence of the first words on the final sentence context decreases, and the quality of translation decreases with it. The use of LSTM blocks slightly increased the capabilities of the model, but they still remained limited.
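The fading influence of early elements can be seen even in a toy example. The sketch below is not the article's model: it assumes a scalar recurrent cell with hypothetical fixed weights `w` and `u` (a plain recurrent cell, not an LSTM), and compares how much the final state changes when the first element of the sequence is altered versus the last one.

```python
import math

def encode(inputs, w=0.5, u=1.0):
    """Run a toy scalar recurrent cell and return its final hidden state.

    w and u are hypothetical fixed weights; a real encoder learns them
    and works with vectors rather than single numbers.
    """
    h = 0.0
    for x in inputs:
        # The new state mixes the previous state with the current input.
        h = math.tanh(w * h + u * x)
    return h  # this single value plays the role of the context vector

sequence = [0.1] * 20
base = encode(sequence)

# Change only the first element, then only the last one.
first_changed = encode([1.1] + sequence[1:])
last_changed = encode(sequence[:-1] + [1.1])

change_from_first = abs(first_changed - base)
change_from_last = abs(last_changed - base)
print(change_from_first, change_from_last)
```

With these assumed weights, the change caused by the first element is far smaller than the change caused by the last one, which is the effect described above for long sentences.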

The authors of the generalized attention mechanism proposed using an additional layer to accumulate the hidden states of all recurrent blocks of the input sequence. Then, during decoding, the mechanism evaluates the influence of each element of the input sequence on the current word of the output sequence and passes the most relevant part of the context to the decoder.
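One decoding step of this idea can be sketched as follows. This is a minimal illustration rather than the implementation from the article: it assumes that a plain dot product stands in for the relevance score (the original proposal may score relevance differently), and the names `attend`, `encoder_states` and `decoder_state` are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(encoder_states, decoder_state):
    """Return (weights, context) for one decoding step.

    encoder_states: accumulated hidden states, one vector per input element
    decoder_state:  current decoder hidden state (same vector size here)
    """
    # Score every input element against the current decoder state.
    # Assumption: a dot product is used as the relevance score.
    scores = [sum(h_k * d_k for h_k, d_k in zip(h, decoder_state))
              for h in encoder_states]
    # Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # The context is the weighted sum of the accumulated hidden states.
    size = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(size)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(encoder_states, decoder_state)
print(weights, context)
```

In this example the second input element has no overlap with the decoder state, so it receives the smallest weight and contributes least to the context passed to the decoder.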

Author: Dmitriy Gizlyk