GPT architecture

In June 2018, OpenAI introduced GPT, a neural network model that immediately achieved the best results on a number of language benchmarks. In February 2019 they released GPT-2, and in May 2020 everyone learned about GPT-3. These models demonstrated that a neural network can generate text; experiments on generating music and images were also carried out. The main disadvantage of these models is their demand for computing resources: training the first GPT took a month on a machine with 8 GPUs. This disadvantage is partly offset by the ability to reuse pre-trained models for new problems, but the sheer size of the model still requires substantial resources just to run it.

Conceptually, GPT models are built on the basis of the transformer we have already looked at. The main idea is to pre-train the model in an unsupervised manner on a large volume of data and then fine-tune it on a relatively small amount of labeled data.
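To make the architectural idea concrete, below is a minimal sketch of one decoder-only transformer block of the kind GPT models stack many times: masked self-attention followed by a feed-forward sub-layer with residual connections. It is written in PyTorch for illustration only; the layer layout (pre-normalization) and the default dimensions are assumptions and do not reproduce the exact original GPT implementation.

```python
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    """One decoder-only transformer block: masked self-attention + feed-forward.
    Hyperparameters are illustrative, not the original GPT settings."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier tokens,
        # which is what lets the model be trained to predict the next token.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.ln2(x))
        return x

x = torch.randn(2, 16, 768)      # (batch, sequence length, embedding size)
print(GPTBlock()(x).shape)       # torch.Size([2, 16, 768])
```

A full GPT model adds token and position embeddings before a stack of such blocks and a projection onto the vocabulary after it.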

The reason for the two-step training is the size of the model. Modern deep learning models such as GPT have hundreds of millions of parameters or more, so training them requires a huge dataset. With supervised learning, creating a labeled training set of that size would take enormous effort. At the same time, the internet offers vast amounts of digitized text that is unlabeled, which makes it suitable for unsupervised learning. However, the results of purely unsupervised learning are statistically inferior to supervised learning on a target task. Therefore, after unsupervised pre-training, the model is fine-tuned on a relatively small labeled dataset.
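The two stages differ only in the objective. A rough sketch of both losses is shown below; `model` and `classifier_head` are hypothetical placeholders (for example, a stack of the blocks above plus an output projection), and `hidden_states` is an assumed helper that returns per-position representations.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    """Stage 1 (unsupervised): next-token prediction on raw, unlabeled text.
    `model` is assumed to return per-position vocabulary logits."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                                  # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

def finetuning_loss(model, classifier_head, token_ids, labels):
    """Stage 2 (supervised): reuse the pre-trained body and train a small task
    head on a comparatively small labeled dataset."""
    hidden = model.hidden_states(token_ids)                 # assumed helper
    logits = classifier_head(hidden[:, -1, :])              # last-token representation
    return F.cross_entropy(logits, labels)
```

The expensive first stage needs only raw text; the cheap second stage is where the scarce labels are spent.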

Unsupervised learning allows GPT to learn a language model, while fine-tuning on labeled data tailors it to specific tasks. In this way, a single pre-trained model can be replicated and configured to perform different language tasks, as illustrated below. The main limitation is the language of the source dataset used for unsupervised pre-training.
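As an illustration of reusing one set of pre-trained weights for different tasks, the sketch below uses the Hugging Face transformers library (an assumption; it is not part of the text above) to load the publicly available GPT-2 weights with two different task heads.

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# The same pre-trained GPT-2 weights, configured for two different tasks.
generator = AutoModelForCausalLM.from_pretrained("gpt2")        # text generation
classifier = AutoModelForSequenceClassification.from_pretrained(
    "gpt2", num_labels=2)                                       # e.g., sentiment
```

The classification head of the second model is randomly initialized and only becomes useful after fine-tuning on a labeled dataset; the transformer body underneath is the same pre-trained language model in both cases.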

As practical experience has shown, this approach yields good results across a wide range of language tasks. For example, the GPT-3 model is able to generate coherent texts on a given topic. It is worth noting, though, that this model contains 175 billion parameters and was pre-trained on about 570 GB of text.
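GPT-3 itself is available only through OpenAI's API, but the same kind of prompted generation can be tried at a much smaller scale with the freely downloadable GPT-2, again via the Hugging Face transformers library (an assumed dependency, used here purely for illustration).

```python
from transformers import pipeline

# Prompted text generation with GPT-2: a small-scale stand-in for GPT-3.
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The transformer architecture changed natural language processing because",
    max_new_tokens=40)
print(result[0]["generated_text"])
```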

Although GPT models were designed for natural language processing, they have also shown impressive results in music and image generation tasks.

In theory, GPT models can be applied to any sequence of digitized data; the question is whether sufficient data and computing resources are available for unsupervised pre-training.