Large Language Models: Comparing Gen2/Gen3 Models (Bloom, Gopher, OPT and More)

Feb 19, 2024

Large Language Models (LLMs) are sophisticated neural networks that produce text. Since their inception in 2018, they have evolved dramatically and now deliver texts that can rival those written by humans. To better understand this evolution, this blog series investigates individual models to uncover how they advance. Specifically, it explains insights from the published papers about each model and draws conclusions from benchmark comparisons.

This article covers the following LLMs:

Hugging Face

  • Bloom
  • BloomZ

DeepMind

  • Gopher
  • Chinchilla
  • Sparrow


Microsoft and NVIDIA

  • Megatron-Turing NLG

Meta AI

  • Open Pretrained Transformer

Allen Institute for AI

  • TK Instruct

This article originally appeared at my blog


Bloom

The BLOOM LLM is the result of an open research collaboration between hundreds of researchers with the explicit goal of democratizing LLMs. Its research paper explains how the project was organized, which working groups existed, and to which research areas individuals contributed. A set of core values guided the overall project. In addition to this organizational information, the complete model training scripts, the input data, and the model checkpoints are published and publicly accessible.

BLOOM is a decoder-only transformer. The training data, named the ROOTS corpus, contains 1.61 TB of multilingual content from 252 sources, augmented by source code repositories from GitHub and a CommonCrawl dataset. All text material was cleaned to obtain text “written by humans for humans”, and then deduplicated before being used as training input. This multilingual corpus covers 46 natural languages and 13 programming languages.

The model was trained on a French government-funded supercomputer using 48 nodes with a total of 384 NVIDIA A100 GPUs. For training, the libraries Megatron and DeepSpeed were used. As with other LLMs, different model sizes were published, ranging from 560M through 3B and 7.1B up to the 176B parameter version.

The model was evaluated with zero-shot or few-shot instructions on a variety of NLP benchmarks already seen with gen 1 LLMs, such as SuperGLUE, the machine translation datasets WMT14, Flores-101, and DiaBLa, and the WikiLingua text summarization dataset. It was also applied to code generation. The BLOOM model surpasses GPT-Neo and GPT-J by a margin, but in code generation, Codex achieves a better score. When comparing generated text in aspects like accuracy, bias, and fairness, the BLOOM model without fine-tuning achieves good results, but cannot surpass the GPT-3 davinci model.


BloomZ

The BloomZ model is a fine-tuned version of Bloom. Based on the research insight that instruction fine-tuning greatly enhances a model's few-shot learning capacity and therefore increases its overall benchmark performance, the English-only P3 instruction dataset was extended with two variants. The xP3 dataset contains English prompts but a multilingual task context in 46 different languages. The xP3mt variant additionally contains machine-translated non-English prompts. The tasks cover a wide range, such as multiple-choice, extractive, and closed-book QA, as well as summarization, program synthesis, and coreference resolution.

Contrasting the performance of Bloom with BloomZ, the evidence is clear: instruction fine-tuning increases the performance in each task significantly. Furthermore, the performance in sentence completion, natural language inference, and coreference resolution scales along the instruction fine-tuning dataset variants. For example, in natural language inference, Bloom-P3 scores 47.7, BloomZ-xP3 55.3, and BloomZ-xP3mt 57.7.
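To make the scaling across dataset variants concrete, the gains can be computed directly from the three natural language inference scores quoted above. This is plain arithmetic on those numbers, nothing taken from the paper itself; the function and variable names are illustrative.

```python
# Natural language inference scores as quoted in the text above.
nli_scores = {
    "Bloom-P3": 47.7,
    "BloomZ-xP3": 55.3,
    "BloomZ-xP3mt": 57.7,
}


def gains_over_baseline(scores, baseline_name):
    """Return (absolute gain in points, relative gain in %) per variant."""
    baseline = scores[baseline_name]
    return {
        name: (
            round(score - baseline, 1),
            round(100 * (score - baseline) / baseline, 1),
        )
        for name, score in scores.items()
    }


gains = gains_over_baseline(nli_scores, "Bloom-P3")
for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute} points ({relative}% relative to Bloom-P3)")
```

The multilingual xP3 data accounts for most of the improvement, and the machine-translated prompts in xP3mt add a smaller step on top.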

Several interesting aspects can be observed. First, the English-only fine-tuning prompts increase the LLM's task generalization in all of its other trained languages. Second, the fine-tuning leads to several best-in-class results in tasks using zero-shot prompts only. Third, the model can even generate text in languages it was not trained on when given a specially crafted multi-shot prompt for a task. Performance does not improve for every untrained language, but some tests show astonishing results, such as in natural language inference.


Gopher

With the Gopher model, DeepMind systematically examined the influence of model size on model performance. The Gopher model family consists of autoregressive transformers trained in six different sizes: 44M, 117M, 417M, 1.4B, 7.1B, and 280B.

All models are trained with 300B tokens from a dataset called MassiveText. This dataset is inspired by The Pile and contains text from several sources, such as books, news, source code from GitHub, and Wikipedia. The text processing pipeline consists of several stages: content filtering, text extraction, quality filtering, repetition removal, document deduplication, and test-set filtering. Only English texts are considered. The JAX library is used for training.

The resulting models are compared with each other and with the following models: GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). Benchmark tasks range from language modelling and reading comprehension to fact checking and BIG-bench, comprising 152 tasks in total. The researchers found that reading comprehension improves uniformly, but in common sense and logical reasoning tasks, the Gopher model performs comparatively worse.

Megatron-Turing NLG

Training LLMs at scale is the research goal of Megatron-Turing NLG. To train this 530B parameter model, both software and hardware innovations were needed, and the research paper documents the essential learnings. To process the 270B-token input effectively, pipeline and data parallelism are fundamental. This is achieved by combining the open-source DeepSpeed library, which creates batches from the input data, with the Megatron framework, which parallelizes the resulting tensors.

The training hardware is massive: 560 DGX A100 servers, each with 8 A100 GPUs. The peak compute throughput of a single GPU at FP16 precision is 312 teraFLOP/s. Data sources include The Pile and snapshots from Common Crawl, RealNews, and CC-Stories. Similar to other research, effective input text filtering was deemed essential, and the paper mentions all applied methods. For all sources, the natural language text is extracted, and a quality score and a fuzzy similarity score are computed. Only texts that pass given threshold values are kept. This leads to 339 billion tokens, of which 270B were used for training, selected by weighting the input data sources.
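The filtering idea described above (a quality score, a fuzzy similarity score, and thresholds on both) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from the paper: the quality heuristic, the word-trigram Jaccard similarity, the threshold values, and all function names are stand-ins chosen for readability.

```python
def shingles(text, n=3):
    """Set of word n-grams used as a cheap fingerprint of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 disjoint, 1.0 identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def quality_score(text):
    """Hypothetical quality heuristic: fraction of purely alphabetic words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)


def filter_corpus(docs, min_quality=0.7, max_similarity=0.8):
    """Keep documents that pass the quality threshold and are not
    near-duplicates of an already kept document (thresholds are assumed)."""
    kept, kept_shingles = [], []
    for doc in docs:
        if quality_score(doc) < min_quality:
            continue  # too noisy
        fingerprint = shingles(doc)
        if any(jaccard(fingerprint, other) > max_similarity for other in kept_shingles):
            continue  # fuzzy duplicate of a kept document
        kept.append(doc)
        kept_shingles.append(fingerprint)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "$$$ 1234 @@@ buy now 999 !!!",
    "large language models are trained on filtered text",
]
filtered = filter_corpus(docs)
print(filtered)
```

The pairwise comparison is quadratic in the number of kept documents, so it only works for toy inputs; web-scale corpora need an approximate nearest-neighbour scheme instead.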

Applied benchmarks encompass completion prediction, reading comprehension, and commonsense reasoning. Compared with GPT-3 and Gopher, results vary: in completion prediction, only a marginal improvement can be seen, but in reading comprehension, the zero-shot performance increases from 60.50 for GPT-3 to 78.20 for Megatron-Turing NLG.


Chinchilla

The Chinchilla LLMs are a continuation of Gopher. The researchers wanted to determine how a given compute budget, measured in FLOPs, should be invested in training. Essentially: is it better to scale the model’s parameter count, or should the amount of training data be increased? They trained models with sizes ranging from 70M to over 16B parameters, and from these runs estimated the optimal model size and number of training tokens.

Based on this observation, they then trained Chinchilla, a 70B parameter model, with 1.4 trillion input tokens. To compare: the 175B GPT-3 was trained on 300B tokens, and the 280B Gopher on 300B tokens as well. The input tokens are from the MassiveText dataset, which follows the same principles to collect, clean, and de-duplicate text as in Gopher. The training is done on TPUv3 and TPUv4, using the JAX and Haiku libraries.
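A quick calculation illustrates the trade-off between parameters and tokens. The sketch below uses the parameter and token counts quoted above together with the widely used rule of thumb that training costs roughly 6 FLOPs per parameter per token. That approximation is an assumption brought in for illustration and is not stated in this article; the function names are likewise illustrative.

```python
# (parameters, training tokens) as quoted in the text above.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def training_flops(params, tokens):
    """Rule-of-thumb estimate (assumption): ~6 FLOPs per parameter per token."""
    return 6 * params * tokens


def tokens_per_param(params, tokens):
    """How many training tokens each parameter 'sees'."""
    return tokens / params


for name, (params, tokens) in models.items():
    print(
        f"{name}: {training_flops(params, tokens):.2e} FLOPs, "
        f"{tokens_per_param(params, tokens):.1f} tokens per parameter"
    )
```

Under this estimate, Chinchilla and Gopher land in the same order of magnitude of training compute, but Chinchilla spends it on about 20 tokens per parameter instead of roughly one.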

The results are astonishing: the smaller Chinchilla model consistently and significantly outperforms Gopher, GPT-3, and other larger models. Furthermore, its smaller size means it needs significantly less computing power and energy for inference and makes further fine-tuning more feasible and efficient. Another interesting result is the score on the Massive Multitask Language Understanding (MMLU) benchmark. For five-shot prompts, Chinchilla achieves 67.6%, a significant increase over Gopher with 60.0% and GPT-3 with 43.9%.


Sparrow

Sparrow is an LLM specifically designed for dialogue. The model was created by starting with a dialogue-prompted Chinchilla LLM as the base model, which was then fine-tuned with reinforcement learning from human feedback (RLHF). A unique aspect of this model is that its dialogue rules are formulated in natural language. In total, 23 rules were formulated, ranging from overall paradigms about the complete dialogue down to detailed “per-turn” rules, which apply to a single text generation only. Examples of these rules are “Do not pretend to have a body or be able to move in a body”, “Only make statements that could plausibly be true; do not say things that are obviously false” and “The agent should not repeat itself unnecessarily”.

In the fine-tuning phase, users were confronted with three tasks. In the per-turn response preference task, users were shown parts of a dialogue and options for how the dialogue should continue. These options were actually outputs from different LLMs, so the user vote determined which model performs better. In the adversarial-probing task, users were given one of the dialogue rules and asked to bring the model to break this rule. Calculating the rule violation rate further helped to select the best-performing model. Finally, in the responses-with-evidence task, users could see which data a model used to provide an answer, and could rate how well the model used its evidence in the given answer. For the evidence itself, the model uses a function that performs Google search queries, scrapes the resulting web pages, and returns a text of 500 characters that is then consumed for making a response.
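The two selection signals described above, preference votes and the rule violation rate, reduce to simple counting. The following sketch shows one way to compute them; the record types, field names, rule labels, and sample data are all hypothetical and do not come from the Sparrow paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PreferenceVote:
    """One per-turn preference judgement (hypothetical record layout)."""
    chosen_model: str


@dataclass
class ProbeResult:
    """One adversarial-probing dialogue (hypothetical record layout)."""
    rule: str
    violated: bool


def preference_shares(votes):
    """Fraction of votes each candidate model received."""
    counts = Counter(vote.chosen_model for vote in votes)
    total = sum(counts.values())
    return {model: count / total for model, count in counts.items()}


def violation_rate(results):
    """Share of probing dialogues in which the targeted rule was broken."""
    if not results:
        return 0.0
    return sum(result.violated for result in results) / len(results)


# Made-up sample data, for illustration only.
votes = [PreferenceVote("model_a")] * 3 + [PreferenceVote("model_b")]
probes = [
    ProbeResult("no body", False),
    ProbeResult("no body", True),
    ProbeResult("no repetition", False),
    ProbeResult("plausible statements", False),
]

shares = preference_shares(votes)
rate = violation_rate(probes)
print(shares, rate)
```

A higher preference share and a lower violation rate both favour a candidate model.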

Comparing the base dialogue-prompted Chinchilla 70B model with the fine-tuned version, clear trends emerge. Users prefer the Sparrow model over the base model. For factual questions, the supporting evidence is cited correctly in 78% of cases, and the rule violation rate drops to 8%.

Open Pretrained Transformer

The Open Pretrained Transformer (OPT) LLM is an open-source transformer model. The models are published in nine different sizes, ranging from 125M through 6.7B up to 175B parameters, and can be downloaded from its code repository. The training material includes several sources: BookCorpus, Stories, CC-News v2, The Pile, and Reddit. All input sources were carefully deduplicated, and the authors note that The Pile in particular contains many duplicates.

The models are tested on 16 different NLP tasks like OpenBook QA and SuperGLUE. The zero-shot performance is on par with GPT-3, although individual tasks show some differences, but on multi-shot tasks, the performance degrades. Another set of benchmarks was used to assess the performance in dialogues, like Empathetic Dialogues and Blended Skill Talk. The researchers conclude that OPT 175B shows a consistent persona across conversations.

True to the researchers' open-source goals, they even published a complete logbook of the training steps. This interesting source covers several operational details of the training, including how the team dealt with software and hardware errors that delayed the model's training.

TK Instruct

Instruction fine-tuning is a cornerstone of increasing an LLM's performance. To understand and compare how models perform on unseen tasks, the researchers created a new benchmark, called Super-NaturalInstructions (SNI), covering 76 task types and containing 1616 tasks. The SNI benchmark contains tasks structured as in-context instructions, also called k-shot examples. These instruction prompts include a task description, context, and optionally positive and negative examples.
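The prompt structure described above (task description, optional positive and negative examples, then the instance to solve) can be sketched as a small prompt builder. The class names, field names, and the exact text layout below are illustrative assumptions, not the benchmark's actual schema or template.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    input: str
    output: str


@dataclass
class InstructionTask:
    """Hypothetical container for one instruction-style task."""
    definition: str
    positive_examples: list = field(default_factory=list)
    negative_examples: list = field(default_factory=list)


def build_prompt(task, instance_input):
    """Assemble a k-shot instruction prompt for a single task instance."""
    parts = [f"Definition: {task.definition}"]
    for i, ex in enumerate(task.positive_examples, 1):
        parts.append(f"Positive Example {i}\nInput: {ex.input}\nOutput: {ex.output}")
    for i, ex in enumerate(task.negative_examples, 1):
        parts.append(f"Negative Example {i}\nInput: {ex.input}\nOutput: {ex.output}")
    parts.append(f"Now complete the following example\nInput: {instance_input}\nOutput:")
    return "\n\n".join(parts)


# Made-up task, for illustration only.
task = InstructionTask(
    definition="Classify the sentiment of the sentence as positive or negative.",
    positive_examples=[Example("I loved this movie.", "positive")],
)
prompt = build_prompt(task, "The food was terrible.")
print(prompt)
```

With empty example lists the same function yields a zero-shot prompt, so k is simply the number of examples attached to the task.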

Using a pretrained T5 model as the base, Tk-Instruct was meta-trained and fine-tuned on the SNI dataset, and then benchmarked on unseen tasks. The comparison metric of choice is ROUGE-L, which is based on the longest common subsequence between texts, e.g. between the text created by a model and an expected text. Benchmark results with the ROUGE-L metric show that the 11B Tk-Instruct beats the non-task-finetuned T5 and GPT-3 by 30%, and even the task-finetuned 175B InstructGPT by 10%.
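Since the metric is built on the longest common subsequence (LCS), a compact sketch may help to see what is being measured. The version below computes a sentence-level ROUGE-L F1 score; lowercasing, whitespace tokenization, and the equal weighting of precision and recall are simplifying assumptions, so scores will not match an official implementation exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Sentence-level ROUGE-L F1 on whitespace tokens (simplified sketch)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))
```

In this example the LCS is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall, and F1 are all 5/6.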

Another noteworthy finding is that during Tk-Instruct training, 64 instances per task saturated the downstream performance. Apparently, there is an upper threshold for how many instances a model needs to consume to learn to generalize a specific task, beyond which it starts to overfit the training data. Furthermore, task fine-tuning with more diverse tasks improves the performance significantly, even for smaller model sizes.


Conclusion

Large Language Models of the second and third generation evolved along parameter complexity, training material, and instruction fine-tuning. From the models covered in this article, the following trends are observable. First, as exemplified and spearheaded by the Megatron model, effective pipeline and data parallelism is essential for scalable training. Several open-source frameworks address these training needs and can be used by subsequent research. Second, as shown by BloomZ and Tk-Instruct, using instruction prompts for training or fine-tuning increases a model’s performance on several NLP and task benchmarks dramatically. There seems to be a threshold for how many instances per task are necessary to achieve this generalization. Third, a multilingual model can be task fine-tuned with English prompts only and still extend its capabilities in all other trained languages, as shown by BloomZ. Fourth, the Sparrow model shows how instruction prompts can be used to define the “behavior” of a model, in this case for dialogue. Fifth, the Chinchilla LLM showed that compute-efficient models should be trained on much larger amounts of input text. Their 70B parameter model, trained on 1.4 trillion input tokens, clearly outperforms models with 2.5x (GPT-3, 175B) or 4x (Gopher, 280B) the number of parameters.