Large Language Models: Modern Gen4 LLM Overview (LLaMA, Pythia, PaLM2 and More)

12 min read · Feb 26, 2024

Large Language Models (LLMs) are sophisticated neural networks that produce text. By generating one word at a time, given a context of other words, these models produce texts that rival those written by humans. The development of LLMs began back in 2018 and continues to this day with ever more complex model architectures, larger amounts of training text, and higher parameter counts.

Continuing the last article, which focused on Gen1 LLMs with a timespan from 2018–02 to 2020–06, this article covers both Gen2 and Gen3 LLMs, stopping with OpenAI's GPT-3.5 Turbo, which became the standard model for the widely known ChatGPT products. This article covers the following models:

Meta AI

  • LLaMA
  • Galactica

Eleuther AI

  • Pythia


  • PaLM-E
  • PaLM 2


  • Falcon (Technology Innovation Institute)
  • GLM (Zhipu AI)

Closed-Source LLMs

  • Bard
  • Claude
  • Jurassic-2
  • Pi
  • GPT-4

This article originally appeared at my blog


The LLaMA research paper starts from the insights of the Chinchilla paper and adds the further constraint of an inference budget. This constraint considers the cost of actually running a model, and the goal of the paper is to create an inference-optimal model. The result is the LLaMA language model family.

To train the models, the following sources are used: Common Crawl, C4, GitHub, Wikipedia, books, arXiv, and StackExchange. In total, this dataset contains 1.4T tokens. From this dataset, 1T tokens are used to train the 6.7B and 13.0B models, and the full 1.4T tokens are used for the 32.5B and 65.2B models. Several technical improvements to the training method are mentioned that increase its efficiency. The training ran on 2048 A100 GPUs with 80GB of RAM each.
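To make the relation between model size and training data concrete, here is a small sketch that computes the training tokens per parameter from the figures quoted above. The dictionary and function names are illustrative and not taken from the paper; only the three smaller models are included.

```python
# Parameter counts and training tokens as quoted in the paragraph above.
# The names LLAMA_MODELS and tokens_per_parameter are made up for this sketch.
LLAMA_MODELS = {
    "6.7B": (6.7e9, 1.0e12),
    "13.0B": (13.0e9, 1.0e12),
    "32.5B": (32.5e9, 1.4e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return tokens / params


ratios = {
    name: tokens_per_parameter(params, tokens)
    for name, (params, tokens) in LLAMA_MODELS.items()
}

for name, ratio in ratios.items():
    print(f"LLaMA {name}: about {ratio:.0f} tokens per parameter")
```

The smaller the model, the more tokens it sees per parameter. That matches the inference-budget idea: a small model trained for longer is cheaper to run afterwards.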

The trained models are then compared on different benchmarks. The results speak for themselves: similar to the Chinchilla models, using a large amount of training data shows benefits. In the common-sense reasoning tasks, evaluated with zero-shot examples, the LLaMA models score the best results except in the BoolQ and WinoGrande tests. For reading comprehension, a PaLM 540B model fares better only in the RACE-middle benchmark. Finally, after instruction fine-tuning, the LLaMA 65B model scores 68.9 points on MMLU in the five-shot setting, ahead of other instruction-finetuned models of moderate size.

The LLaMA models also started a trend that grew rapidly. The 7B model is small enough to run on consumer hardware with a single GPU, enabling on-premise LLMs.


LLMs for generating common knowledge texts are plentiful. But what about models with scientific knowledge that can answer questions about physics equations, or show the structure of a molecule or a protein?

Galactica is an LLM for science. It was trained on 106B tokens from various sources, especially research papers (88B) and reference material (8B). Research paper sources include, among others, arXiv, PMC, and Semantic Scholar. The reference material includes general sources like Wikipedia or StackExchange, and special sources like PubChem Compound, Ribosome and NASA Exoplanet. Robust tokenization is especially important here. The paper mentions several novel approaches, such as marking step-by-step reasoning with a dedicated token, splitting mathematical operations into different ASCII groups, and annotating molecule and protein structures in the source text.

It’s not surprising that Galactica outperforms other LLMs in benchmarks about scientific knowledge. The 120B Galactica model performs best in the mathematical MMLU tests, scoring 41.3 against 35.7 for the 70B Chinchilla model. In scientific question answering, it wins in 50% of the categories, like abstract algebra and medical genetics, while the other categories are dominated by the Chinchilla model.

This LLM is akin to a multi-lingual model because it can output knowledge in different desired formats. For example, when asked to render the Schwarzschild radius (the radius of the event horizon of a black hole) in LaTeX, it outputs r_{s} = \frac{2GM}{c^2}. Or when asked to present the human genome, it textually and visually renders the amino acid sequence correctly.


The OPT-IML models are a continuation of the OPT model with the specific goal of better understanding how instruction fine-tuning increases a model's performance when scaled in both model size and benchmark size. To understand this, the researchers created a new instruction meta-learning benchmark, containing accepted benchmarks and enriched with new tasks.

The curated benchmark consists of 8 meta-datasets with 100 task categories. Included benchmarks are for example Super-NaturalInstructions, PromptSource and CrossFit. This benchmark is divided into a train set with 17.9M examples, and a 145K dev and 321K test set. During training, different variations of benchmark sets were used, such as varying the task-mixing, the benchmark proportions, and task or category scaling.

A 30B and a 175B OPT-IML model, both finetuned from OPT, were created and compared with OPT. The benchmark on standard NLP tasks showed a 6–7% score increase in zero-shot and few-shot examples, and on the FLAN benchmark, a 5% to 30% increase depending on the task category.


Note: Although its application scope is different from all other models mentioned in this article so far, I find it too fascinating not to include here.

PaLM-E is a unique multimodal model that supports visual language tasks, NLP tasks, and embodied reasoning tasks. When the model is hosted inside a robot vehicle with an arm, it can be given the task “Bring me the green bag from the desk”, and it will drive to the desk, record images, decide which bag to take, then take it and bring it back. Other fascinating examples: given a picture of a restaurant table and asked how to be helpful, the robot responds with a task list like “clean the table, put back the chairs”; given a picture of ingredients, it can be asked how to prepare a dish.

The PaLM-E model is a combination of the PaLM 540B model with the 22B Vision Transformer model. Technically, the robot's state is captured as a state estimation vector (pose, size, etc.), and its vision input is forwarded to the Vision Transformer to obtain a token embedding of the picture. To capture the relationships of objects inside an image, object-centric representations and the Object Scene Representation Transformer are used. At inference time, the sensor data and state are continuously injected into the LLM, and it can be queried with multimodal sentences like “What is included in the following img tag”, where the image is taken from its sensors and translated into vector space. The output reflects the given task: the model can merely describe an observation or provide a sequence of decisions.
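The idea of a sentence that mixes text and image input can be sketched in a few lines. This is only a toy illustration, assuming a placeholder token `<img>` and a made-up `embed_token` function; the real model uses learned embeddings and a Vision Transformer, neither of which is reproduced here.

```python
def embed_token(token: str, dim: int = 4) -> list[float]:
    """Toy stand-in for a learned text embedding (hypothetical, deterministic)."""
    seed = sum(ord(char) for char in token)
    return [((seed * (i + 1)) % 97) / 97 for i in range(dim)]


def build_multimodal_sequence(tokens, image_embeddings, placeholder="<img>"):
    """Replace each placeholder token with the vectors of the next image.

    image_embeddings is a list with one entry per image; each entry is a
    list of vectors, because one image usually maps to several embeddings.
    Raises StopIteration if there are more placeholders than images.
    """
    images = iter(image_embeddings)
    sequence = []
    for token in tokens:
        if token == placeholder:
            sequence.extend(next(images))
        else:
            sequence.append(embed_token(token))
    return sequence


# One image represented by three vectors (values invented for the example).
image = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
tokens = ["What", "is", "in", "<img>", "?"]
sequence = build_multimodal_sequence(tokens, [image])
print(len(sequence), "vectors are passed to the language model")
```

The language model only ever sees a sequence of vectors, so text tokens and image patches can share one input as long as they live in the same vector space.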

A mix of benchmarks was used to test PaLM-E, such as visual question answering (VQA), image captioning, and standard language model tasks. However, since there is no similar embodied LLM, the results cannot be compared directly. The project webpage contains videos of a robot using PaLM-E.

PaLM 2

As a continuation of the PaLM model, the researchers addressed three concerns: optimizing the scaling of input tokens to model size based on a FLOPs compute budget, improving the multi-lingual and multi-domain training dataset, and pre-training with different language modelling tasks.

The training dataset includes mainly web documents, joined by a mix of books, programming source code, and conversational data. The concrete content is not explained in the research paper. Three models were created: PaLM 2-S, PaLM 2-M and PaLM 2-L. Unfortunately, their parameter sizes are not disclosed.

The research paper compares PaLM 2 exclusively with PaLM in several benchmarks. In 1-shot question answering tasks, PaLM 2-L improves on PaLM from an average score of 70.4 to 76.9, and in toxicity classification from 71.45 to 75.96. Finally, an instruction fine-tuned version of PaLM 2-L was created and benchmarked against GPT-4: in the WinoGrande and DROP tasks, PaLM 2-L wins.

Although the technical report does not detail PaLM 2 internals, it nevertheless mentions other interesting concepts to better control language model behavior. For example, the researchers included special tokens that can control toxicity at inference time, and canary tokens in the pre-training data to measure memorization. It should also be noted that although the training data is multi-lingual, the model’s performance on English tasks improved too.


The Falcon LLM family consists of three models with 7B, 40B, and 180B parameters. The researchers carefully addressed three concerns with their model design. Performance scalability means consistently monitoring the model's capabilities with carefully chosen few-shot NLP benchmarks. Data scalability ensures that high-quality, deduplicated input data is available in the order of trillions of tokens, to maintain the parameter-to-input-token relation that was established by the Chinchilla model. Hardware scalability ensures that the model and pipeline architecture are compatible with any parameter size and only require increasing the number of GPUs for training.

Following these considerations, the researchers tested how smaller models perform with public corpus data and with a mix of public data and closed sources. To their surprise, they found that models trained on public data alone can outperform other models. Based on this, they created the RefinedWeb corpus with a total of 5T tokens. From this, a 600B excerpt can be downloaded. The design principles for creating this corpus are detailed in the RefinedWeb paper, but in short it's a pipeline starting with URL filtering and text extraction, followed by repetition removal and finally deduplication.
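The four pipeline stages can be sketched as a chain of small functions. This is a heavily simplified illustration and not the RefinedWeb implementation: the blocklist, the regex-based text extraction, the repetition threshold, and the exact-match hashing are all assumptions chosen to keep the example short.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical blocklist; a real pipeline uses much larger curated lists.
BLOCKED_DOMAINS = {"spam.example"}


def url_allowed(url: str) -> bool:
    """Stage 1: URL filtering."""
    return urlparse(url).hostname not in BLOCKED_DOMAINS


def extract_text(html: str) -> str:
    """Stage 2: crude text extraction by stripping tags and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def too_repetitive(text: str, max_dup_fraction: float = 0.3) -> bool:
    """Stage 3: flag documents in which many sentences are exact repeats."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return True
    duplicates = len(sentences) - len(set(sentences))
    return duplicates / len(sentences) > max_dup_fraction


def clean_corpus(pages):
    """Run all stages; stage 4 drops documents whose text was already seen."""
    seen = set()
    for url, html in pages:
        if not url_allowed(url):
            continue
        text = extract_text(html)
        if too_repetitive(text):
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


pages = [
    ("https://a.example/1", "<p>Falcon is a language model. It was trained on web data.</p>"),
    ("https://spam.example/x", "<p>Buy now.</p>"),
    ("https://b.example/2", "<p>Buy now. Buy now. Buy now. Buy now.</p>"),
    ("https://c.example/3", "<div>Falcon is a language model.  It was trained on web data.</div>"),
]
cleaned = list(clean_corpus(pages))
print(cleaned)
```

Of the four sample pages, only the first survives: the second is blocked by its URL, the third is too repetitive, and the fourth is a duplicate of the first once the markup is removed.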

The large 180B model was trained on 3.5 trillion tokens from the RefinedWeb corpus. Training was done on cloud infrastructure with up to 4096 A100 GPUs at peak times, using best practices for data parallelism and pipeline parallelism. The research paper clearly outlines all training and model architecture decisions.

Considering the overall performance on NLP tasks, the 180B Falcon model performs similarly to the 340B PaLM 2 model, with two deviations in the ANLI and RACE tasks. Compared with GPT-3.5 and GPT-4, in common-sense tasks as well as question answering, it delivers performance in between these models. And when compared with models like Chinchilla, Megatron-NLG, PaLM and Inflection, the 180B model scores best in all tasks.


GLM is a bi-lingual English/Chinese model. It features a novel transformer architecture, the eponymous General Language Model, in which the training objective is autoregressive blank filling. During training, two objectives need to be fulfilled: filling short blanks up to a known length in a given textual context, and filling a random number of blanks up to the end of a sentence, given the surrounding context. This method leads to a robust model that is capable of text generation and generalizes to many downstream tasks.
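A minimal sketch of how one blank-filling training example could be constructed is shown below. It is simplified: the spans are given by hand instead of being sampled, they are not shuffled, and the special token strings `[MASK]`, `[START]` and `[END]` are placeholders chosen for this example.

```python
MASK, START, END = "[MASK]", "[START]", "[END]"  # placeholder token names


def blank_infilling_example(tokens, spans):
    """Build one training example for autoregressive blank filling.

    tokens: list of token strings.
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive.

    Returns (part_a, part_b_inputs, part_b_targets):
      part_a          the text with every span replaced by a single MASK
      part_b_inputs   the removed spans, each preceded by START
      part_b_targets  the same spans shifted by one, each followed by END
    """
    part_a, part_b_inputs, part_b_targets = [], [], []
    cursor = 0
    for start, end in spans:
        part_a.extend(tokens[cursor:start])
        part_a.append(MASK)
        span = tokens[start:end]
        part_b_inputs.extend([START] + span)
        part_b_targets.extend(span + [END])
        cursor = end
    part_a.extend(tokens[cursor:])
    return part_a, part_b_inputs, part_b_targets


tokens = "the quick brown fox jumps over the lazy dog".split()
part_a, inputs, targets = blank_infilling_example(tokens, [(1, 3), (6, 8)])
print(part_a)
print(inputs)
print(targets)
```

The model reads `part_a` as context and then predicts `targets` one token at a time from `inputs`. Because each span ends with an explicit end token, the model also learns how long a blank should be.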

The 130B model was trained on a cluster of 96 NVIDIA DGX-A100 nodes (8 GPUs with 40 GB each). The input was 400B tokens, stemming from sources such as the 1.2T Pile, the 1.0T Chinese WudaoCorpora, and crawled English and Chinese web pages. The research paper considers and reports in detail on unexpected situations during training. For example, the researchers recognized that FP16 (16-bit floating point) precision can lead to a training collapse, which is recognizable when the gradient norm of the embedding layer spikes. They also frequently encountered spikes of the loss function because of using mixed precision (FP16 for forward and backward propagation, and FP32 for optimizer states and master weights).
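Spikes of this kind can be caught by monitoring the gradient norm during training. The following sketch flags steps where the norm jumps far above its recent median. It is a generic monitoring heuristic written for this article, not the method used in the paper, and the window size and factor are arbitrary assumptions.

```python
from collections import deque
from statistics import median


def find_spikes(grad_norms, window=5, factor=3.0):
    """Return the steps whose gradient norm exceeds factor * recent median.

    Nothing is flagged until `window` previous values are available.
    """
    history = deque(maxlen=window)
    spikes = []
    for step, norm in enumerate(grad_norms):
        if len(history) == window and norm > factor * median(history):
            spikes.append(step)
        history.append(norm)
    return spikes


# Invented gradient norms of an embedding layer, with one obvious outlier.
norms = [1.0, 1.1, 0.9, 1.0, 1.2, 1.1, 9.5, 1.0, 1.1]
spikes = find_spikes(norms)
print("Spikes at steps:", spikes)
```

Using the median instead of the mean keeps a single outlier from raising the threshold for the steps that follow it.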

Following the advice of the Chinchilla paper that model size should scale with the number of input tokens, the huge amount of training data pays off in benchmarks. In MMLU, GLM-130B is on par with the 175B GPT-3. In the BIG-bench Lite benchmark, GLM beats all other tested models for zero-shot examples.


The Pythia LLM family consists of 16 models ranging in size from 70M to 12B parameters. They were trained on the exact same input sources, and in the very same order, to understand how parameter scaling impacts model performance. All models are published and accessible.

The training material follows the established steps of quality-criteria filtering and deduplication. To ensure consistent training for all model sizes, training practices used by other LLMs were modified. For example, the batch size of input examples is usually small, especially for models smaller than 1B. However, the researchers found no convergence issues and chose a consistent batch size of 1024 samples. 300B tokens from The Pile dataset were used for training. The training itself was done with the GPT-NeoX open-source library on A100 GPUs with 40 GB RAM.
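The batch size and token count above determine how many optimizer steps a training run needs. The short calculation below assumes a sequence length of 2048 tokens per sample, which is not stated in this article and should be treated as an assumption.

```python
BATCH_SIZE_SAMPLES = 1024          # from the paragraph above
SEQUENCE_LENGTH = 2048             # assumption, not stated in this article
TRAINING_TOKENS = 300_000_000_000  # 300B tokens from The Pile

tokens_per_batch = BATCH_SIZE_SAMPLES * SEQUENCE_LENGTH
training_steps = TRAINING_TOKENS // tokens_per_batch

print(f"Tokens per batch: {tokens_per_batch:,}")
print(f"Full training steps: {training_steps:,}")
```

Because every model uses the same batch size, every model also takes the same number of steps, so checkpoints from models of different sizes can be compared at the same point in the data.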

An interesting observation is how bias evolves in LLMs with regard to training and model size. To see the effects, the researchers used specific model checkpoints as starting points, and then modified the input sources for the next training steps to contain less biased language. This intervention led the largest models to actually become anti-stereotypical in their language use. This is a novel approach to reducing bias, and could possibly also be used to reduce toxicity. Another research question addressed is whether the order of the training data influences a model’s memorization, where the clear answer is no. And finally, it was tested how term frequency in the input material impacts question answering tasks. Even when the frequency is high, models smaller than 1B fail to learn from the input and perform badly on QA tasks. Only larger models scale with higher term frequencies and can increase their performance profoundly.

Closed-Source LLMs

For the most recent commercial LLMs, no research papers are available, and the models themselves can only be used via an API. Therefore, they are only described briefly with their most important properties:

  • Bard: Google's chatbot was first based on LaMDA, then on PaLM, and since late 2023 on Gemini, an even newer LLM from Google. It's assumed to be an instruction and dialogue finetuned model. It can be accessed here:
  • Claude: An AI assistant from Anthropic. It is based on the Claude 2 LLM, but no details are published. Its open beta version can be used here:
  • Jurassic-2: Three model sizes are available: large, grande, jumbo. It is instruction fine-tuned and is capable of understanding and creating text in different languages. It can be accessed via dedicated APIs, such as for summarizing and paraphrasing, or text completion.
  • Pi: This personal assistant is based on the Inflection-1 LLM. The published technical memo does not detail the model's development; it only states that it was trained on thousands of H100 GPUs and on a large corpus of data from several domains. In the Massive Multitask Language Understanding (MMLU) benchmark, Inflection-1 scores 72.7 points on average, beating GPT-3.5, PaLM, Chinchilla and LLaMA, but it cannot beat the closed-source PaLM 2 and GPT-4. Use it here:
  • GPT-4: OpenAI's current model supports multiple languages and is multi-modal capable. A technical report does not detail its training process or input sources, but states that reinforcement learning with human feedback and objective criteria were used. It can be used here:


The evolution of Gen4 LLMs follows the best-practice research results: First, scaling parameter size and input tokens with regard to a compute budget. Second, using a carefully filtered and deduplicated dataset from several domains and in several languages. Third, using instruction fine-tuning and reinforcement learning for model alignment. Considering this, the Gen4 models have several aspects in common: the training process is well understood, and effective data parallelism and pipeline parallelism are in place. Several different benchmarks are used, from classical NLP to a language toxicity check with HELM and the extrapolation of LLM capabilities with BIG-bench. Looking at new trends, we see on the one hand very detailed research papers and available checkpoints from open-source models, and on the other hand a trend to provide only technical reports about LLMs that focus on benchmarks and omit details about training material and training process. An interesting trend is domain specificity, as shown by the Galactica model with its wide capability for scientific knowledge, and by the PaLM-E multi-modal model that allows a robot to perform tasks. Finally, as shown especially by the closed-source LLMs, it’s clear that LLMs are maturing, and several companies use them in their products or offer them as services to others.