Discussing the article: "Neural Networks Made Easy (Part 95): Reducing Memory Consumption in Transformer Models"

 

Check out the new article: Neural Networks Made Easy (Part 95): Reducing Memory Consumption in Transformer Models.

Transformer architecture-based models demonstrate high efficiency, but their use is complicated by high resource costs both during training and in operation. In this article, I propose to get acquainted with algorithms that reduce the memory consumption of such models.

The MLKV method is a logical continuation of the MQA and GQA algorithms. In those methods, the KV cache is shrunk by reducing the number of KV heads, each of which is shared by a group of attention heads within a single Self-Attention layer. The natural next step is to share the Key and Value entities between Self-Attention layers. This step is supported by recent research into the role of the FeedForward block in the Transformer algorithm: the block is assumed to act as a "Key-Value" memory that processes different levels of information. Most interesting for us, however, is the observation that groups of successive layers compute similar things. More precisely, the lower layers deal with superficial patterns, while the upper layers deal with more semantic details. It follows that attention can be delegated to groups of layers while the necessary computations are kept in the FeedForward block. Intuitively, KV heads can be shared between layers that have similar targets.
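To make the grouping concrete, here is a minimal sketch (in Python, not the article's MQL5 code) of how Query heads are mapped onto a smaller set of shared KV heads within one Self-Attention layer; the head counts are illustrative assumptions.

```python
# Sketch of MQA/GQA head grouping within a single Self-Attention layer.
# Head counts below are illustrative assumptions, not values from the article.
n_q_heads = 8    # Query attention heads in the layer
n_kv_heads = 2   # shared Key/Value heads (n_kv_heads == 1 corresponds to MQA)

group_size = n_q_heads // n_kv_heads
# Query head i reads the Key/Value tensors of KV head i // group_size
kv_head_for_query = [i // group_size for i in range(n_q_heads)]
print(kv_head_for_query)  # [0, 0, 0, 0, 1, 1, 1, 1]
```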

Developing these ideas, the authors of the MLKV method propose multi-level Key-Value sharing. MLKV shares KV heads not only among the Query attention heads of the same Self-Attention layer, but also among the attention heads of other layers. This reduces the total number of KV heads in the Transformer and thus allows for an even smaller KV cache.
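As a rough illustration of why this helps, the back-of-the-envelope sketch below estimates the KV cache size for different sharing schemes; the layer counts, head counts and sequence length are assumed for illustration only and are not figures from the article.

```python
# Rough KV cache size estimate: 2x covers the Key and Value tensors.
def kv_cache_bytes(n_layers_with_kv, n_kv_heads, head_dim,
                   seq_len, batch, bytes_per_value=4):
    return 2 * n_layers_with_kv * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

seq_len, batch, head_dim = 1024, 1, 64
mha  = kv_cache_bytes(12, 8, head_dim, seq_len, batch)  # every layer keeps 8 KV heads
gqa  = kv_cache_bytes(12, 2, head_dim, seq_len, batch)  # 2 KV heads shared within each layer
mlkv = kv_cache_bytes(3,  2, head_dim, seq_len, batch)  # KV computed in 3 layers, reused by the rest
print(mha, gqa, mlkv)  # MLKV shrinks the cache further by sharing KV heads across layers
```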

Author: Dmitriy Gizlyk

 
And how do you know that the network has learned something rather than generating random signals?
 
Maxim Dmitrievsky #:
And how do you know that the network has learned something rather than generating random signals?

The Actor's stochastic policy assumes some randomness in its actions. However, in the course of training the spread of these random values narrows considerably. The point is that when a stochastic policy is organised, two parameters are trained for each action: the mean value and the variance of the distribution of values. As the policy is trained, the mean tends to the optimum and the variance tends to 0.
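For a concrete picture, here is a minimal sketch (assumed, not the article's implementation) of sampling an action from such a policy, where the mean and variance are the two trained parameters mentioned above:

```python
# Sketch of a Gaussian stochastic policy: one action sampled from N(mean, variance).
import numpy as np

rng = np.random.default_rng(42)

def sample_action(mean, variance):
    return rng.normal(mean, np.sqrt(variance))

# Early in training: large variance, actions scatter widely around the mean
print([round(sample_action(0.5, 0.25), 3) for _ in range(5)])
# Late in training: mean near the optimum, variance close to 0, actions almost deterministic
print([round(sample_action(0.5, 1e-4), 3) for _ in range(5)])
```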

To understand how random the Agent's actions are, I make several test runs with the same policy. If the Agent generates random actions, the results of the passes will differ greatly. For a trained policy, the difference in results will be insignificant.
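In code form, such a check could look roughly like this (run_policy_pass is a hypothetical stand-in for a single tester pass returning, for example, the final balance):

```python
# Sketch: run the same policy several times and compare the spread of the results.
import statistics

def is_policy_deterministic_enough(run_policy_pass, n_passes=5, max_rel_spread=0.05):
    results = [run_policy_pass() for _ in range(n_passes)]
    spread = (max(results) - min(results)) / abs(statistics.mean(results))
    return spread <= max_rel_spread  # small spread -> trained policy, not random actions
```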

 
Dmitriy Gizlyk #:

The Actor's stochastic policy assumes some randomness in its actions. However, in the course of training the spread of these random values narrows considerably. The point is that when a stochastic policy is organised, two parameters are trained for each action: the mean value and the variance of the distribution of values. As the policy is trained, the mean tends to the optimum and the variance tends to 0.

To understand how random the Agent's actions are, I make several test runs with the same policy. If the Agent generates random actions, the results of the passes will differ greatly. For a trained policy, the difference in results will be insignificant.

Got it, thanks.