Large Language Models: Comparing Gen2/Gen3 Models (GPT-3, GPT-J, mT5 and More)
Large Language Models are sophisticated neural networks that produce text. By generating one word at a time, given a context of other words, these models produce texts that can rival a human's output. The development of LLMs began in 2018, shortly after the transformer neural network architecture was introduced. Since then, transformer models have grown ever more complex, a trend that continues to this day with ever more elaborate model architectures, larger amounts of consumed text, and higher parameter counts.
Continuing the last article, which focused on Gen1 LLMs spanning 2018–02 to 2020–06, this article covers both Gen2 and Gen3 LLMs up to 2022–10. In contrast to the last article, which explained each model's architecture and benchmark performance in more technical detail, this article explores general tendencies of LLM training, fine-tuning and capabilities. The following models are covered:
OpenAI
- GPT-3
- Codex
EleutherAI
- GPT-J
- GPT-NeoX
Google Research
- mT5
- PaLM
- Flan-T5 / Flan-PaLM
This article originally appeared at my blog admantium.com.
GPT-3
The GPT-3 model was the first large-scale language model to demonstrate the capability of few-shot learning. The pre-trained model can be used as-is with instruction prompts to perform several NLP tasks. These prompts are formulated in natural language: they explain the context of a task, give some examples, and state the specific task to solve. The model receives one or a chain of these instruction prompts to solve a task. With this, classical NLP tasks like translation and question answering are supported, as well as use cases like unscrambling text or performing arithmetic.
Although different versions of GPT-3 were trained, the best performing model is the 175B parameter model, trained on a corpus of roughly 499B tokens. The training corpus includes this material: a filtered and deduplicated version of Common Crawl, the WebText dataset, two book corpora, and the English Wikipedia. It was trained on a not further specified cloud computing cluster with V100 GPUs. Different versions of GPT-3 were released and made accessible via the OpenAI API, using codenames like ada, babbage, curie and davinci, up to GPT-3.5 Turbo.
An astonishing capability is GPT-3's performance on question answering, and specifically open-domain question answering. This NLP task comes in two flavors. Classically, a system is given a context to search in, e.g. a paragraph of text, and a question; it then needs to highlight the relevant part. An open-domain task can ask any arbitrary question, and the system needs to use its stored knowledge, or access additional external sources, to provide an answer. When GPT-3 is presented with an open-domain question, it shows good performance with one-shot or few-shot examples.
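To illustrate how such a few-shot prompt can look, here is a minimal sketch that builds an open-domain question answering prompt and sends it to a completions-style endpoint. The model codename, the client library version and the example questions are assumptions for illustration; only the prompt pattern itself follows the description above.

```python
# Minimal sketch of a few-shot prompt for open-domain question answering.
# Assumes the legacy openai package (<1.0), an API key in OPENAI_API_KEY,
# and the "davinci" codename; these are illustrative assumptions.
import openai

few_shot_prompt = (
    "Answer the question with a short factual statement.\n\n"
    "Q: Which planet is known as the Red Planet?\n"
    "A: Mars\n\n"
    "Q: Who wrote the novel 'Moby-Dick'?\n"
    "A: Herman Melville\n\n"
    "Q: What is the capital of Australia?\n"
    "A:"
)

response = openai.Completion.create(
    model="davinci",        # assumed model codename
    prompt=few_shot_prompt,
    max_tokens=16,
    temperature=0.0,        # deterministic output for factual answers
)
print(response["choices"][0]["text"].strip())
```

The two worked examples in the prompt are what turns a zero-shot question into a few-shot task, which is exactly the setting where GPT-3 performs well.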
Codex
Codex is a generative AI model for creating programming language code. It can produce code in Python, Go, JavaScript, Perl, PHP, Ruby, Shell, Swift, and TypeScript. The research paper extensively shows its capabilities for generating Python code. By applying repeated sampling of instruction prompts that contain the model's previous answers, functionally correct and executable code can be generated. Applying this method led to 70% task completion in a programming task challenge.
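The repeated-sampling idea can be sketched as follows: generate several candidate completions for the same prompt and keep the first one that passes the task's functional correctness test. The `generate_candidates` function below is a hypothetical stand-in for a call to a code generation model; only the filtering logic is the point of the sketch.

```python
# Sketch: sample several candidate solutions and keep the one that passes the tests.
# generate_candidates() is a hypothetical stand-in for a code-generation model call.
from typing import Callable, Dict, List, Optional

def generate_candidates(prompt: str, n: int) -> List[str]:
    """Hypothetical stand-in for n samples drawn from a code model with the same prompt."""
    return [
        "def add(a, b):\n    return a - b\n",   # an incorrect sample
        "def add(a, b):\n    return a + b\n",   # a functionally correct sample
    ]

def passes_tests(candidate_source: str, test: Callable[[Dict], bool]) -> bool:
    """Execute the candidate in an isolated namespace and run the task's test."""
    namespace: Dict = {}
    try:
        exec(candidate_source, namespace)       # caution: sandbox untrusted code in practice
        return test(namespace)
    except Exception:
        return False

def solve(prompt: str, test: Callable[[Dict], bool], n_samples: int = 100) -> Optional[str]:
    """Keep checking samples until one candidate passes the correctness test."""
    for candidate in generate_candidates(prompt, n_samples):
        if passes_tests(candidate, test):
            return candidate                    # first functionally correct sample wins
    return None

prompt = "Write a Python function add(a, b) that returns the sum of a and b."
print(solve(prompt, lambda ns: ns["add"](2, 3) == 5))
```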
This model is used commercially in GitHub Copilot. When this assistant-like feature is activated in a programming IDE, it constantly scans the program context and provides code suggestions ranging from single lines to complete methods and even test cases.
Technically, Codex is a GPT-3 model that was additionally fine-tuned for code generation. The training set contains curated public GitHub repositories, e.g. 159 gigabytes of Python code.
GPT-J
GPT-J is an LLM case study with two goals: training an LLM on a data source containing unique material, and using the training framework Mesh Transformer JAX to achieve high training efficiency through parallelization. There is no research paper about GPT-J, but its GitHub pages provide the model, different checkpoints, and the complete source code for training.
The training material is named The Pile, an 800GB corpus consisting of 22 different sources, including scientific research papers from arXiv, legal documents from the FreeLaw Project, and eBooks from Project Gutenberg. As shown in its documentation, GPT-J's performance is on par with the GPT-3 6B model. Also, the model can be used for advanced theorem proving and natural language understanding.
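Because the checkpoints are public, the model can be tried locally. The following is a minimal sketch assuming the Hugging Face transformers and torch packages and the EleutherAI/gpt-j-6b checkpoint name (a mirror of the released weights); loading the full-precision weights needs a machine with a substantial amount of RAM.

```python
# Minimal sketch: load the public GPT-J checkpoint via Hugging Face transformers.
# Assumes the transformers and torch packages and the EleutherAI/gpt-j-6b checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")  # needs plenty of RAM

prompt = "The Pile is a dataset that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```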
The model was trained on the alpha version of Google Cloud TPU VMs, which has since evolved into a publicly available service.
GPT-NeoX
GPT-NeoX is the successor model to GPT-J and follows similar paradigms regarding the publication of the model and its technical realization. Its GitHub code repository again contains all Python scripts that were used to train this 20B parameter model, as well as the model itself.
This 20B model was trained on the same dataset as its predecessor, aptly named The Pile. Furthermore, the libraries Megatron and DeepSpeed were used to achieve better compute resource utilization, and eventually GPT-NeoX evolved into its own framework for training other LLMs. It was used, for example, as the foundation for Llemma, an open-source model specializing in theorem proving.
During performance comparisons with other models, one strong trend emerged: using instruction prompts with five examples significantly increases GPT-NeoX's performance relative to other models. Another noteworthy point is that the research paper's authors fully disclose and reflect on limitations regarding hyperparameter tuning and the missing deduplication of the training material.
mT5
Google's T5 model used a unique approach to structure the input data format: a declarative explanation or instruction followed by a context. This precursor to instruction prompts explains the rich variety of tasks for which the model is suitable.
Continuing this style of LLM training, the multilingual T5 model was trained on the multilingual C4 corpus. For this, any web page that passes a line length filter (containing at least 3 lines with more than 200 characters) was scraped, then filtered and deduplicated. Overall, this corpus contains 101 natural languages.
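The line length filter described above can be sketched as a simple predicate. The thresholds follow the description in this article; the real mC4 pipeline applies further filtering and deduplication on top of it.

```python
# Sketch of the described line length filter: keep a page only if it contains
# at least 3 lines with more than 200 characters each. Thresholds follow the
# description above; the real mC4 pipeline adds more filtering and deduplication.
def passes_line_length_filter(page_text: str, min_lines: int = 3, min_chars: int = 200) -> bool:
    long_lines = [line for line in page_text.splitlines() if len(line.strip()) > min_chars]
    return len(long_lines) >= min_lines

# Usage example
sample_page = "\n".join(["short line", "x" * 250, "y" * 300, "z" * 220])
print(passes_line_length_filter(sample_page))  # True: three lines exceed 200 characters
```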
The mT5 model follows the same architecture as the T5 model and was released in different parameter sizes, from the 300M small model up to the 3.7B xl and 13B xxl models. All models are available from the GitHub repository.
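A minimal sketch for trying one of the released checkpoints, assuming the Hugging Face mirror of the models (checkpoint name google/mt5-small) and the transformers, torch and sentencepiece packages rather than the original GitHub release:

```python
# Minimal sketch: load a small mT5 checkpoint via Hugging Face transformers.
# Assumes the transformers, torch and sentencepiece packages; google/mt5-small
# is the Hugging Face mirror of the released checkpoint.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Note: the released mT5 checkpoints are only pre-trained (no supervised tasks),
# so they need fine-tuning before they produce useful task output.
inputs = tokenizer("Translate to German: The weather is nice today.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```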
To extend the original T5 idea, the tasks were designed as zero-shot, translate-train (create target-language prompts by machine-translating from English) and in-language multitask (design instruction prompts for tasks in the target language) fine-tuning. Using the translate-train approach, applying mT5 to multilingual benchmarks results in new state-of-the-art performance in all benchmarks.
Finally, an interesting observation about this model is its “accidental translation” tendency when used for question answering tasks. Three error types are distinguished: normalization, in which characters are output in a different UTF-8 representation; grammatical adjustment, in which the output is formulated differently; and accidental translation, in which the generated text correctly answers the question, but in a different natural language. This behavior can be countered by mixing multilingual pre-training examples into the fine-tuning phase.
PaLM
LLM parameter size scaling has been shown to increase the performance of models. To investigate the technical limits of parameter size, and to systematically understand which LLM capabilities emerge, researchers created the 540B Pathways Language Model.
The training input data consists of 780B tokens, including web pages (proportionally filtered with a quality score), source code from GitHub, Wikipedia, social media content, books and news. The model was trained on 6144 TPU v4 chips using the JAX and T5X libraries. Furthermore, it utilizes the name-giving Pathways mechanism to parallelize training by executing two components on Google TPU pods: a) offloading training batches and performing forward and backward computations, and b) optimizer updates, including local and remote gradients.
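The separation of forward/backward computation from optimizer updates can be loosely illustrated with a toy data-parallel training step in JAX. This is a minimal sketch under made-up assumptions (a placeholder linear model and plain SGD), not the actual Pathways or PaLM training code.

```python
# Toy sketch: split a training step into a) forward/backward computation and
# b) an optimizer update, each data-parallel across local devices via jax.pmap.
# This only loosely mirrors the Pathways description; it is not PaLM's training code.
import functools
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]           # placeholder linear "model"
    return jnp.mean((pred - y) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def forward_backward(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # combine "local and remote gradients" by averaging across devices
    grads = jax.lax.pmean(grads, axis_name="devices")
    return jax.lax.pmean(loss, axis_name="devices"), grads

@functools.partial(jax.pmap, axis_name="devices")
def optimizer_update(params, grads):
    lr = 1e-2                                      # plain SGD as a stand-in optimizer
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

n_dev = jax.local_device_count()
params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
params = jax.device_put_replicated(params, jax.local_devices())
x = jnp.ones((n_dev, 8, 4))                        # one batch shard per device
y = jnp.ones((n_dev, 8, 1))

loss, grads = forward_backward(params, x, y)       # component a)
params = optimizer_update(params, grads)           # component b)
print(float(loss[0]))
```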
The PaLM model exceeds the performance of most other fine-tuned models, showing that parameter scaling is a key factor. On the BIG-bench benchmark, comprising tasks like logical reasoning and translation, five-shot learning yields the best results in 44 out of 58 tasks. Similarly, the model supports advanced tasks, like logical inference chains, pattern recognition, semantic parsing, reading comprehension, and even code generation. This leads to the conclusion that higher-order capabilities emerge from an LLM if it has been trained on a sufficient corpus and has a high parameter count.
Flan-T5 / Flan-PaLM
Increased interest and research results on instruction finetuning led Google to apply a case study of Flan (finetuning language models) to their other released models like T5 and PaLM. The fine-tuning dataset was specifically created for this task. It combines three core ideas: a) create a rich task mixture, b) apply chain-of-thought instructions on selected reasoning and sentence composition tasks, and c) use different templates and formats for task instructions. Overall, the complete fine-tuning dataset contains the impressive amount of 1836 different tasks, grouped into 146 task categories and combining 473 datasets.
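The idea of rendering the same task through different templates, with and without chain-of-thought, can be sketched as follows. The templates and the example question are made up for illustration and are not taken from the actual Flan collection.

```python
# Sketch: render one task example through several instruction templates,
# optionally asking for a chain of thought. Templates and the example question
# are illustrative; the real Flan collection uses many more variants per task.
DIRECT_TEMPLATES = [
    "Question: {question}\nAnswer:",
    "{question}\nGive a short answer.",
]
COT_TEMPLATES = [
    "Question: {question}\nLet's think step by step.",
    "{question}\nExplain your reasoning before giving the final answer.",
]

def render_examples(question: str, with_chain_of_thought: bool) -> list:
    templates = COT_TEMPLATES if with_chain_of_thought else DIRECT_TEMPLATES
    return [template.format(question=question) for template in templates]

for prompt in render_examples("A train travels 60 km in 45 minutes. What is its average speed?", True):
    print(prompt, end="\n---\n")
```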
The results speak for themselves. The Flan version of the 540B PaLM model outperforms its non-fine-tuned counterpart by 9.4% on average across all tasks, and the 11B Flan-T5 model is even 26.6% better.
Conclusion
This article presented 7 different Gen2 and Gen3 LLMs published between 2020–07 and 2022–10, from OpenAI, EleutherAI and Google Research. Each company pursues different goals with its models, from commercial marketing to full open-source publication with permissive licenses. However, all models evolved along the same aspects: a) parameter amount, b) training dataset diversity and token count, c) training libraries and support. Two facts can be recognized. First, LLMs scale with parameter size. What started with GPT-3's 175B parameters was cemented with the 540B PaLM model: more complexity leads to models with better capabilities. Second, instruction prompts increase few-shot learning performance. Google recognized this aspect already in its T5 model, where instruction prompts were used in the training material. Other models started to specifically fine-tune with instruction prompt examples, which evolved into dedicated instruction datasets. Google's Flan-T5 model showed significantly better performance. The next article continues to cover other Gen2 and Gen3 models.