Large Language Models: Library Overview for Training, Fine-Tuning, Inference and More

Mar 4, 2024


In essence, Large Language Models are neural networks with a transformer architecture. The evolution of LLMs is a history of scaling: input data sources and tokenization, training methods and pipelines, model architecture and number of parameters, and the hardware required for training and inference. For each of these concerns, dedicated libraries have emerged that support this continued evolution.

This article provides a concise snapshot of libraries for LLM training, fine-tuning, inference, inference optimization, vector databases, and utilities.

This article was written in December 2023. The availability of the libraries may have changed since then.

This article originally appeared on my blog.

LLM Training

Training LLMs requires high-level scheduling of input batches, gradient calculation, and gradient updates across multiple nodes and GPUs.

  • Alpa: Python project to train and run LLMs on on-premise GPU clusters.
  • Colossal-AI: Distributed training of LLMs that supports two specific forms of reinforcement learning: reward model creation and reinforcement learning from human feedback.
  • GPTNeoX: A library for efficiently scaling training across several GPUs.
  • Fairscale: A PyTorch extension for efficient data batching when training on a limited number of GPUs.
  • DeepSpeed: Distributed training and inference.
  • JAX: Combines two components into a coherent framework for executing machine learning pipelines in parallel on several nodes: the XLA compiler, which originated in TensorFlow and provides just-in-time compilation for GPU, TPU, or CPU, and Autograd for computing function derivatives.
  • Megatron-LM: A framework for multi-node and model-parallel training of transformer models.
  • T5X: An integrated framework for training, evaluation, and inference.
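
To make the scheduling problem concrete, here is a minimal sketch of data-parallel training in plain Python. It does not reflect how any of the libraries above are implemented: the workers are simulated in a loop, the model is a single weight w, and the helper names shard, local_gradient, and all_reduce_mean are made up for this illustration. It only shows the basic pattern of splitting a batch, computing gradients per worker, averaging them, and applying the same update everywhere.

```python
# Toy data-parallel training: each simulated "worker" computes a gradient
# on its shard of the batch, the gradients are averaged (an all-reduce in
# real systems), and every worker applies the same update.

def shard(batch, num_workers):
    """Split a batch into num_workers interleaved shards."""
    return [batch[i::num_workers] for i in range(num_workers)]

def local_gradient(w, shard_data):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard_data) / len(shard_data)

def all_reduce_mean(values):
    """Stand-in for the collective operation that averages gradients."""
    return sum(values) / len(values)

def train(batch, num_workers=4, lr=0.01, steps=200):
    w = 0.0
    shards = shard(batch, num_workers)
    for _ in range(steps):
        # On real hardware, these gradients are computed in parallel.
        grads = [local_gradient(w, s) for s in shards]
        # Every worker applies the same averaged gradient.
        w -= lr * all_reduce_mean(grads)
    return w

# Synthetic data generated from y = 3x, so training should recover w close to 3.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w = train(batch)
print(round(w, 3))
```

Because all shards have the same size here, averaging the per-worker gradients gives the same result as computing the gradient on the full batch. Real frameworks add the hard parts on top: communication between devices, sharding of model weights, and memory management.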

LLM Fine-Tuning

Pretrained LLMs, also called foundation models, need to be customized for specific domains and with specific training materials. This fine-tuning process produces models with desired capabilities and content.

  • llama_index: A framework for designing data ingestion pipelines to fine-tune LLMs with private data sources.
  • nanoGPT: Training and fine-tuning GPT-2 models.
  • promptsource: A tool for managing versioned NLP prompts, which can be used for example during fine-tuning.
  • trlx: Fine-tuning with reinforcement learning, using either a reward function or a reward-labeled dataset.
  • xturing: Complete and coherent fine-tuning of transformer models, including data ingestion, multi-GPU training, and model optimizations like INT4 and LoRA.
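
The LoRA optimization mentioned for xturing can be illustrated with a small sketch. It assumes the usual LoRA formulation, in which the pretrained weight W stays frozen and only a low-rank update B·A is trained, scaled by alpha / r. The matrices, sizes, and function names below are invented for this example and are not taken from xturing or any other library.

```python
# LoRA sketch: keep the pretrained weight W frozen and learn a low-rank
# update B @ A, so the effective weight is W + (alpha / r) * B @ A.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha):
    r = len(A)              # rank = number of rows of A
    delta = matmul(B, A)    # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d_out, d_in, r = 4, 4, 1
# Frozen "pretrained" weight; an identity matrix is used only for readability.
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
A = [[0.1, 0.2, 0.3, 0.4]]          # r x d_in, trainable
B = [[0.0] for _ in range(d_out)]   # d_out x r, trainable, initialized to zero

W_eff = lora_effective_weight(W, A, B, alpha=1.0)

full_params = d_out * d_in          # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)    # parameters updated by LoRA
print(full_params, lora_params)
```

With B initialized to zero, the effective weight equals the pretrained weight before any training step. The saving in trainable parameters is small for this 4x4 toy matrix but grows quickly with realistic layer sizes, since the LoRA parameter count scales with r * (d_in + d_out) instead of d_in * d_out.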

LLM Inference

Trained and potentially fine-tuned models need to be loaded into memory and fed with appropriately tokenized input that matches the structure used during training. The output of LLMs, which are neural networks, is a numerical representation that needs to be converted back to text. Inference libraries solve these tasks and allow users to work with text.

  • Embedchain: A Python library for designing retrieval-augmented generation prompts, using a simple API to reference external data sources and to load different LLMs as the application base.
  • ggml: A tensor library for LLM inference on CPU and GPU. Subprojects for running specific models exist, such as llama.cpp and whisper.cpp.
  • lit-gpt: A Python library for running several open source LLMs locally, such as LLaMA, Mistral, or StableLM.
  • lit-llama: Running LLaMA models locally.
  • llama2.c: Running fine-tuned LLaMA models.
  • LLMZoo: A framework for running open-source LLMs that also includes features for training, fine-tuning, and evaluating LLMs, as well as datasets.
  • Transformers: A versatile library for loading many pretrained models and incorporating them into a PyTorch or TensorFlow pipeline. The library exposes its models as objects with convenient methods that help to inspect and transform their properties.
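
The round trip described at the start of this section, from text to token ids, through the model, and back to text, can be sketched in a few lines. Everything below is a toy: the vocabulary, the score table, and the function names are hypothetical, and the "model" is a lookup table instead of a neural network. Real inference libraries do the same steps with learned tokenizers and billions of parameters.

```python
# Toy inference loop: text -> token ids -> model scores -> token ids -> text.
# The "model" is a hypothetical lookup table, not a neural network.

VOCAB = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Hypothetical next-token scores: row = last token id, column = candidate id.
NEXT_TOKEN_SCORES = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # after "<eos>"
    [0.0, 0.1, 0.9, 0.2, 0.1],  # after "the"
    [0.0, 0.1, 0.1, 0.8, 0.2],  # after "cat"
    [0.1, 0.2, 0.0, 0.0, 0.7],  # after "sat"
    [0.9, 0.1, 0.0, 0.0, 0.0],  # after "down"
]

def encode(text):
    """Tokenize by whitespace and map each word to its id."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]

def decode(ids):
    """Map token ids back to text."""
    return " ".join(VOCAB[i] for i in ids)

def toy_model(ids):
    """Return one score per vocabulary entry, based only on the last token."""
    return NEXT_TOKEN_SCORES[ids[-1]]

def generate(prompt, max_new_tokens=10):
    ids = encode(prompt)
    for _ in range(max_new_tokens):
        scores = toy_model(ids)
        # Greedy decoding: pick the highest-scoring next token.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if VOCAB[next_id] == "<eos>":
            break
        ids.append(next_id)
    return decode(ids)

print(generate("The cat"))
```

Greedy decoding is only one strategy. Inference libraries typically also offer sampling-based strategies, which choose among the top candidates instead of always taking the best one.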

LLM Inference Optimizations

Inference is a compute-heavy process. Optimizations applied to the trained model, such as reducing the floating-point precision, greatly reduce the required resources with only a minimal impact on model quality.

  • bitsandbytes: Provides 8-bit CUDA functions for training with PyTorch. It is typically used to parametrize how a Transformers model is loaded.
  • peft: Short for parameter-efficient fine-tuning. This library exposes the trainable parameters of an LLM and allows modifying them. It is used to define the adapter configuration, for example LoRA, that wraps the base model, often in combination with quantization.
  • TensorRT-LLM: An LLM inference optimizer that provides a high-level Python API for loading LLMs as well as creating standalone Python and C++ runtimes for executing them. It also integrates with the Triton Inference Server. It is the successor to FasterTransformer.
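
As a rough illustration of what reducing precision means, the following sketch applies absmax quantization to a handful of weights in plain Python. This is a simplified textbook scheme with invented function names and example values, not the algorithm used by bitsandbytes or TensorRT-LLM, which work on whole tensors with more refined methods.

```python
# Absmax quantization sketch: store weights as 8-bit integers plus one
# floating-point scale, and reconstruct approximate floats when needed.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using the largest absolute value."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Reconstruct approximate float weights."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now needs one byte instead of two or four, at the cost of a bounded rounding error. Whether that error is acceptable depends on the model and task, which is why quantized models are usually evaluated before deployment.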

LLM Inference Vector Databases

Vectors for single words or complete documents can be stored in a vector database. Keeping embeddings in an easily accessible store enables fast retrieval during training, but such stores are mostly used at inference time to provide textual context in prompts.

  • Chroma: Open-source project with Python bindings and API support for several large language models. The embedding format can be tailored to the use case: the default is sentence transformers, but it also works with OpenAI embeddings for GPT.
  • Milvus: Provides a Kubernetes-ready database with convenient Python bindings. Internally, the object store MinIO is used.
  • Pinecone: An enterprise solution that offers managed GCP and AWS platforms for embeddings.
  • Postgres pgvector: An open-source extension to Postgres that allows the creation of custom schemas to hold word vectors.
  • Redis Enterprise: An enterprise-only feature of Redis to store text, audio, and video as vectors and even perform similarity comparisons with built-in commands.
  • Weaviate: Open-source project that stores text documents and their vectors as JSON data and offers an easy-to-use API that processes GraphQL-like requests for similarity searches.
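
The core operation that all of these databases provide, finding the stored texts whose embeddings are closest to a query embedding, can be sketched without any database at all. The class ToyVectorStore, its methods, and the three-dimensional embeddings below are invented for illustration. Real embeddings come from an embedding model and have hundreds or thousands of dimensions, and real vector databases use index structures instead of a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """In-memory store that ranks texts by similarity to a query embedding."""

    def __init__(self):
        self._items = []

    def add(self, text, embedding):
        self._items.append((text, embedding))

    def query(self, embedding, top_k=1):
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(embedding, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]

store = ToyVectorStore()
store.add("Postgres is a relational database.", [0.9, 0.1, 0.0])
store.add("Transformers use attention layers.", [0.1, 0.9, 0.2])
store.add("GPUs accelerate matrix multiplication.", [0.0, 0.3, 0.9])

# Hypothetical embedding of a question about attention.
question_embedding = [0.2, 0.8, 0.1]
context = store.query(question_embedding, top_k=1)

# The retrieved text is placed into the prompt as context for the LLM.
prompt = f"Context: {context[0]}\nQuestion: How does attention work?"
print(prompt)
```

This is the retrieval step of retrieval-augmented generation: the LLM itself is unchanged, and only the prompt is enriched with the best-matching stored text.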

LLM Cloud Environment

All major cloud providers offer dedicated platforms that support at least training and inference, if not all LLM lifecycle phases, as a paid service.

  • Enterprise-AI: NVIDIA's cloud platform for developing and hosting AI models, supporting tight integration with their Triton SDK for developing inference server runners.
  • Fabric: Microsoft's cloud platform covering the complete lifecycle of LLM development and inference.
  • Vertex AI: Google's cloud platform for training and hosting LLMs and other generative AI models.

LLM Utilities

This section includes other interesting projects or products I encountered during research.

  • evals: A framework for evaluating LLMs.
  • LLMOps: Meta repository about Microsoft Research LLM projects, including prompts, acceleration, and alignment.
  • Meta Group: All models published by Meta, including Metaseq for pretrained Transformer models.
  • Nebuly: Platform for getting user analytics from LLMs.
  • XManager: A platform for running and packaging machine learning projects that can be executed locally or via Google Cloud Platform.
  • Lambda Stack: A multi-repository that bundles TensorFlow, Keras, PyTorch, and CUDA support into one package.


The landscape of LLMs is huge. This article listed roughly 30 libraries that can be used for LLM training, fine-tuning, inference, inference optimization, and vector databases. It also listed cloud provider platforms and other projects in the context of LLMs.