Getting transformers to run on kernels is next-level work, in my opinion ) an impressive skill on the author's part.
Totally agree. The first article raised some vague doubts with all that 5S stuff ), but after the kernels were bolted on I just want to keep praising the author))))
Still, the test results are worse than in the previous article. I would like to hear the author's comments on this.
What is reflected in the article is less than 5% of what the author could, and probably wanted to, convey with it. If you see only these 5% and have not tried experimenting on other tasks, it is of little use.
At the very least, extensive, multifaceted tests are needed.
That is exactly what I am trying to provoke from the author with another 0.5% of information. At the end of his previous articles, the author compared his results with the earlier ones.
For me, the attempt to use attention mechanisms to predict trading signals is of particular interest. After all, if this approach is recognised as the most effective for text generation (and GPT really works wonders), we can expect it to be effective for other types of sequences as well, such as numerical series of quotes.
And for the multi-threaded implementation of Self-Attention, of course, thanks and respect to the author.
Some conceptual questions are interesting:
How does this Self-attention mechanism differ from a simple fully connected layer, where each neuron also has access to all the neurons of the previous layer? What is its key advantage? I cannot grasp it, although I have read quite a few lectures on the topic.
I am not the author of the article, but here is the answer I found to your question:
The key difference between the Self-attention mechanism and a simple fully connected layer is Self-attention's ability to dynamically highlight different parts of the input data. Unlike a fully connected layer, which treats all inputs the same way, Self-attention assigns different weights to different parts of the input, focusing on the parts that are more relevant to the task.
This dynamic weighting is the mechanism's main advantage: it makes the model more sensitive to relationships between elements of the input sequence, which improves performance on tasks requiring contextual understanding.
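Not the article's MQL5 code, just a minimal numpy sketch of that contrast (all names, sizes and random weights here are my own illustrative assumptions): a fully connected layer applies one fixed weight matrix to every input, while self-attention derives its mixing weights from the data itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 sequence elements (e.g. candles), 8 features each
X = rng.standard_normal((seq_len, d))  # input sequence

# Fully connected layer: one fixed weight matrix, the same for any input
W_fc = rng.standard_normal((d, d))
fc_out = X @ W_fc                      # every row is transformed identically

# Self-attention: the mixing weights are computed from the input itself
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)          # pairwise relevance of sequence elements
attn = softmax(scores, axis=-1)        # rows sum to 1: data-dependent weights
sa_out = attn @ V                      # each output is a weighted mix of all inputs

print(attn.round(2))                   # changes whenever X changes; W_fc does not
```

The point of the toy example: `attn` is recomputed for every new input sequence, whereas `W_fc` stays fixed after training, which is exactly the "dynamic weighting" mentioned above.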
To restate it in human language, is the meaning as follows: "the SA mechanism is a development of a fully connected neural network, and the key difference is that the elementary unit a fully connected network analyses is the output of a single neuron, while the elementary unit SA analyses is a certain context vector"? Am I right, or are there other key differences?
I have also seen this machine translation, but still it is somewhat incorrect.
The vector comes from recurrent networks, because a sequence of letters is fed in to translate text. But SA has an encoder that maps the original vector into a shorter vector carrying as much information about the original as possible. Then these vectors are decoded and combined with each other at each training iteration. That is, it is a kind of information compression (context selection): everything the algorithm considers most important is kept, and that main part is given more weight.
In fact, it is just an architecture; don't look for sacred meaning in it, because on time series it does not work much better than an ordinary NN or LSTM.
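To make the "compression into a shorter vector plus extra weight for the main thing" description above concrete, here is a small illustrative numpy sketch (dimensions, names and random weights are my own assumptions, not code from the article):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_in, d_enc = 10, 32, 8        # 10 inputs of 32 features, compressed to 8

X = rng.standard_normal((seq_len, d_in))

# Encoder: project each element into a shorter vector (information compression)
W_enc = rng.standard_normal((d_in, d_enc))
H = np.tanh(X @ W_enc)                  # (10, 8) compressed representations

# Score each compressed element and give "the main thing" more weight
w_score = rng.standard_normal(d_enc)
scores = H @ w_score
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax: weights sum to 1

context = weights @ H                   # one context vector, dominated by the
print(weights.round(2), context.shape)  # elements that received the largest weights
```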
New article Neural networks made easy (Part 8): Attention mechanisms has been published:
In previous articles, we have already tested various options for organizing neural networks, including convolutional networks borrowed from image processing algorithms. In this article, I suggest considering Attention Mechanisms, whose appearance gave impetus to the development of language models.
When analyzing a symbol's candlestick chart, we identify trends and tendencies and determine their trading ranges. In other words, we select certain objects from the overall picture and focus our attention on them. We understand that these objects affect future price behavior. To implement such an approach, back in 2014 developers proposed the first algorithm that analyzes and highlights dependencies between the elements of the input and output sequences [8]. The proposed algorithm is called the "Generalized Attention Mechanism". It was initially proposed for machine translation models based on recurrent networks, as a solution to the long-term memory problem in the translation of long sentences. This approach significantly improved the results of the previously considered recurrent neural networks based on LSTM blocks [4].
The classical machine translation model using recurrent networks consists of two blocks, an Encoder and a Decoder. The first block encodes the input sequence in the source language into a context vector, and the second block decodes the resulting context into a sequence of words in the target language. As the length of the input sequence grows, the influence of the first words on the final sentence context decreases, and the translation quality drops as a consequence. The use of LSTM blocks slightly extended the model's capabilities, but they still remained limited.
The authors of the general attention mechanism proposed using an additional layer to accumulate the hidden states of all recurrent blocks of the input sequence. Further, during sequence decoding, the mechanism should evaluate the influence of each element of the input sequence on the current word of the output sequence and suggest the most relevant part of the context to the decoder.
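The paragraph above describes the mechanism informally; as a rough sketch of the idea (my own numpy simplification under assumed dimensions, not the author's implementation), the decoder's current state is scored against every stored encoder hidden state, and the resulting weights select the most relevant part of the context:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
src_len, d_h = 7, 16                       # 7 source words, hidden size 16

enc_states = rng.standard_normal((src_len, d_h))  # accumulated encoder hidden states
dec_state = rng.standard_normal(d_h)              # decoder state for the current output word

# Additive (Bahdanau-style) scoring of each input element against the decoder state
W_a = rng.standard_normal((d_h, d_h))
U_a = rng.standard_normal((d_h, d_h))
v_a = rng.standard_normal(d_h)
scores = np.tanh(enc_states @ W_a + dec_state @ U_a) @ v_a  # one score per source word

alpha = softmax(scores)              # influence of each input element on the current word
context = alpha @ enc_states         # the most relevant part of the context for the decoder
print(alpha.round(2), context.shape)
```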
Author: Dmitriy Gizlyk