Discussion of article "Neural networks made easy (Part 8): Attention mechanisms"

DeadMorose777 2024.02.17 17:23 #11

Maxim Dmitrievsky #:

Vector is from recurrence networks, because to translate the text, a sequence of letters is fed in. But SA has an encoder that translates the original vector into a shorter vector that carries as much information about the original vector as possible. Then these vectors are decoded and superimposed on each other at each iteration of training. That is, it is a kind of information compression (context selection), i.e. all the most important things remain in the algorithm's opinion, and this main thing is given more weight.

In fact, it is just an architecture, don't look for sacred meaning there, because it does not work much better on time series than the usual NN or LSTM.

A vector is just a sequential set of numbers. This term is not tied to recurrent NNs, or even to machine learning in general. This term can be used absolutely in any mathematical problem in which the order of numbers is required: even in school arithmetic problems.
Looking for sacral meaning is the most important thing if you need to design something unusual. And the problem of market analysis is not in the models themselves, but in the fact that these (market) time series are too noisy and whatever model is used, it will pull out exactly as much information as it is embedded. And, alas, it is not enough. To increase the amount of information to be "pulled out", it is necessary to increase the initial amount of information. And it is precisely when the amount of information increases that the most important features of EO - scalability and adaptability - come to the fore.

With what to replace From theory to practice Charles Dow's theory

Maxim Dmitrievsky 2024.02.17 17:29 #12

DeadMorose777 #:
A vector is simply a sequential set of numbers. This term is not tied to recurrent HH, or even to machine learning in general. This term can be used in absolutely any mathematical problem in which the order of numbers is required: even in school arithmetic problems.
Searching for sacral meaning is the most important thing if you need to design something unusual. And the problem of market analysis is not in the models themselves, but in the fact that these (market) time series are too noisy and whatever model is used, it will pull out exactly as much information as it is embedded. And, alas, it is not enough. To increase the amount of information to be "pulled out", it is necessary to increase the initial amount of information. And it is precisely when the amount of information increases that the most important features of EO - scalability and adaptability - come to the fore.

This term is attached to recurrent networks that work with sequences. It just uses an additive in the form of an attention mechanism, instead of gates like in lstm. You can come up with roughly the same thing on your own if you smoke MO theory for a long time.

That the problem is not in the models - 100% agree. But still any algorithm of TC construction can be formalised in one way or another in the form of NS architecture :) it's a two-way street.

Random Flow Theory and Machine learning in trading: "The 'perfect' trading system

Dmitriy Gizlyk 2024.02.18 17:58 #13

DeadMorose777 attention mechanism without any high-level libs and a 3-line implementation.
Some conceptual questions are interesting:
How does this Self-attention system differ from a simple fully connected layer, because in it too the next neuron has access to all previous ones? What is its key advantage? I can't understand it, although I have read quite a lot of lectures on this topic .

There is a big "ideological" difference here. In brief, a full-link layer analyses the whole set of source data as a single whole. And even an insignificant change of one of the parameters is evaluated by the model as something radically new. Therefore, any operation with the source data (compression/stretching, rotation, adding noise) requires retraining of the model.

Attention mechanisms, as you have correctly noticed, work with vectors (blocks of data), which in this case it is more correct to call Embeddings - an encoded representation of a separate object in the analysed array of source data. In Self-Attention each such Embedding is transformed into 3 entities: Query, Key and Value. In essence, each of the entities is a projection of the object into some N-dimensional space. Note that a different matrix is trained for each entity, so the projections are made into different spaces. Query and Key are used to evaluate the influence of one entity on another in the context of the original data. Dot product Query of object A and Key of object B show the magnitude of dependency of object A on object B. And since Query and Key of one object are different vectors, the coefficient of influence of object A on B will be different from the coefficient of influence of object B on A. The dependency (influence) coefficients are used to form the Score matrix, which is normalised by the SoftMax function in terms of Query objects. The normalised matrix is multiplied by the Value entity matrix. The result of the operation is added to the original data. This can be evaluated as adding a sequence context to each individual entity. Here we should note that each entity gets an individual representation of the context.

The data is then normalised so that the representation of all objects in the sequence have a comparable appearance.

Typically, several consecutive Self-Attention layers are used. Therefore, the data contents at the input and output of the block will be very different in content, but similar in size.

Transformer was proposed for language models. And was the first model that learnt not only to translate the source text verbatim, but also to rearrange words in the context of the target target target language.

In addition, Transformer models are able to ignore out-of-context data (objects) due to context-aware data analysis.

Description of the Multi-Head Description of architecture and Discussion of article "Neural

DeadMorose777 2024.02.21 15:47 #14

Dmitriy Gizlyk #:

There is a big "ideological" difference here. In brief, the full-link layer analyses the entire set of input data as a whole. And even an insignificant change of one of the parameters is evaluated by the model as something radically new. Therefore, any operation with the source data (compression/stretching, rotation, adding noise) requires retraining of the model.

Attention mechanisms, as you have correctly noticed, work with vectors (blocks of data), which in this case it is more correct to call Embeddings - an encoded representation of a separate object in the analysed array of source data. In Self-Attention each such Embedding is transformed into 3 entities: Query, Key and Value. In essence, each of the entities is a projection of the object into some N-dimensional space. Note that a different matrix is trained for each entity, so the projections are made into different spaces. Query and Key are used to evaluate the influence of one entity on another in the context of the original data. Dot product Query of object A and Key of object B show the magnitude of dependency of object A on object B. And since Query and Key of one object are different vectors, the coefficient of influence of object A on B will be different from the coefficient of influence of object B on A. The dependency (influence) coefficients are used to form the Score matrix, which is normalised by the SoftMax function in terms of Query objects. The normalised matrix is multiplied by the Value entity matrix. The result of the operation is added to the original data. This can be evaluated as adding a sequence context to each individual entity. Here we should note that each object gets an individual representation of the context.

The data is then normalised so that the representation of all objects in the sequence has a comparable appearance.

Typically, several consecutive Self-Attention layers are used. Therefore, the data contents at the input and output of the block will be very different in content, but similar in size.

Transformer was proposed for language models. And was the first model that learnt not only to translate the source text verbatim, but also to rearrange words in the context of the target target target language.

In addition, Transformer models are able to ignore out-of-context data (objects) due to context-aware data analysis.

Thank you very much! Your articles have helped a lot to understand such a complex and complex topic.

The depth of your knowledge is just amazing really.

Discussion of article "Neural networks made easy (Part 8): Attention mechanisms" - page 2