Question-Answer System Architectures using LLMs

Apr 29, 2024


Not long ago, question answering systems were built as complex information storage and retrieval systems. A first component processes text sources and extracts their literal meaning as well as specific information. Another component then extracts knowledge from these sources and represents the facts in a database or a graph data structure. Finally, the retriever parses a user query, determines the relevant parts of the processed text and the knowledge databases, and composes a natural language answer.

Fast forward to late 2022. Powerful LLMs like GPT-3.5 emerged. Trained on massive amounts of text, they have not only internalized knowledge but also possess linguistic skills, and they can answer natural language questions more accurately than any system before. These models understand and execute instructions with almost human-level competence.

In a previous article, I stated my goal to create a question-answering system, and it seems that LLMs might be the right tool. This article investigates architectures for a generic question answering system that are based on LLMs. It covers seven different approaches, such as QA finetuning, prompt engineering, and retrieval augmented generation. Each approach is explained concisely, and the approaches are then compared with each other to understand their limits and exceptions when designing a question answering system. The following seven approaches can be distinguished:

  • Question Answering Finetuning
  • Linguistic Finetuning
  • Domain Embedding
  • Instruction Finetuning
  • Prompt Engineering
  • Retrieval Augmented Generation
  • Agents

This article originally appeared at my blog

Question-Answering Finetuning

The first LLMs from 2018, which I named Gen1 LLMs in an earlier article, were early transformer models typically trained on next-token and next-sentence prediction tasks. These models have only limited linguistic skills like knowledge inference. Essentially, they learn the semantic similarity of spans of tokens as well as the ability to predict sentences from content that they were trained on. It’s this capability that can be fine-tuned for QA: with specifically crafted training material, the model learns to recognize and return the correct spans from a given context.
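To make the span idea more tangible, here is a minimal sketch of how an extractive QA model might pick its answer, assuming the finetuned model has already produced one start score and one end score per context token. The function name `best_span`, the tokens, and all scores are invented for illustration and do not come from a real model.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 5) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score."""
    best, best_score = (0, 0), float("-inf")
    for i, start in enumerate(start_scores):
        # Only consider spans that end at or after the start token
        # and contain at most max_len tokens.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Context tokens for the question "Where is the Eiffel Tower?"
tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris", "."]
# Hypothetical per-token scores, as a finetuned model might emit them.
start = [0.1, 0.2, 0.1, 0.0, 0.3, 2.5, 0.0]
end = [0.0, 0.1, 0.3, 0.0, 0.1, 2.8, 0.2]

i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])
print(answer)
```

The finetuning itself then amounts to teaching the model to assign high start and end scores to the tokens that delimit the correct answer.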

Linguistic Finetuning

Answering a question is a linguistic skill. Gen1 LLMs are generative models that complete a given input with stochastically produced text. Without any guidelines or tools, they might produce text until their output window size is exhausted. This text might contain the answer, only to be followed by irrelevant information. In linguistic finetuning, special datasets are used to train the model on linguistic skills such as logical inference, commonsense reasoning, and reading comprehension.

Domain Embedding

A pretrained LLM is a closed-book system: it can only access information that it was trained on. With domain embedding, the system gains access to additional material. An early prototype of this technique was shown in this OpenAI cookbook: for the target domain, text was embedded using an API, and then, when using the LLM, the embedded texts most similar to the question were retrieved with semantic similarity search and used to formulate an answer. Although this approach evolved into retrieval-augmented generation, it’s still a technique to adapt a Gen2 (2020) or Gen3 (2022) LLM into a question-answering system.
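The similarity search at the heart of this technique can be sketched in a few lines. This is not the cookbook’s code: instead of calling an embedding API, the hypothetical `embed` function below uses plain word counts as a stand-in vector, and the three domain texts are invented.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding API: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


# Invented domain texts, for illustration only.
docs = [
    "the rover landed on mars in the crater",
    "the telescope observes distant galaxies",
    "the probe entered orbit around jupiter",
]
query = "which probe is in orbit around jupiter"
print(top_k(query, docs))
```

In a real setup, `embed` would call an embedding model and the vectors would be stored in a vector database, but the ranking step stays the same.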

Instruction Finetuning

Instruction finetuning is a special type of finetuning that aims to enhance an LLM’s ability to generalize to new, unforeseen tasks. As exemplified in the Scaling Instruction-Finetuned Language Models paper, a rich task set was created and applied to the T5 model. The paper reports substantial gains on several benchmark metrics. In the context of a QA system, instruction-finetuned LLMs are expected to better understand a given question context, and their answers can be controlled with specific instructions, e.g. to give concise answers only.
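As a rough illustration, a single training record for instruction finetuning might be rendered as text like this. The `### Instruction` layout and the function `format_instruction_example` are assumptions chosen for readability, not the format used in the paper.

```python
def format_instruction_example(instruction: str, question: str, answer: str) -> str:
    """Render one (instruction, input, response) triple as a training text."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{question}\n\n"
        f"### Response:\n{answer}"
    )


record = format_instruction_example(
    instruction="Answer the question concisely.",
    question="What is the capital of France?",
    answer="Paris",
)
print(record)
```

Training on many such records, spread over many different tasks, is what lets the model follow a new instruction like “give concise answers only” at inference time.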

Prompt Engineering

The evolution of LLMs increased model size, input material, and input tasks, resulting in powerful Gen3 and recent Gen4 (2024) models. During the training process of these models, benchmark datasets are often consumed as well, which gives a model advanced linguistic characteristics and task generalization. The output of these models can be controlled with specifically crafted prompts. This leads to prompt engineering: the empirical application and optimization of prompts so that an LLM performs reasonably well on a specific task. For QA systems, several prompt engineering techniques appear to be usable: a) zero-shot generation for Gen4 LLMs, in which the model is just presented with the question, b) few-shot learning for Gen3 and smaller Gen4 LLMs, in which the prompt contains a relevant task explanation and examples, and c) chain-of-thought, a meta technique that instructs an LLM to spell out its reasoning step by step so that intermediate results can be checked, which can lead to better and more granular answers and fewer hallucinations.
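The three techniques differ only in how the prompt text is assembled, which a small sketch can show. The function names and the exact wording of the prompts, including the common “think step by step” phrasing for chain-of-thought, are my assumptions and not a fixed standard.

```python
from typing import List, Tuple


def zero_shot(question: str) -> str:
    """Present the question only."""
    return f"Question: {question}\nAnswer:"


def few_shot(question: str, task: str, examples: List[Tuple[str, str]]) -> str:
    """Prepend a task explanation and solved examples to the question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{task}\n\n{shots}\n\nQuestion: {question}\nAnswer:"


def chain_of_thought(question: str) -> str:
    """Ask the model to write out its reasoning before the final answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )


examples = [("What is 2 + 2?", "4"), ("What is 3 + 5?", "8")]
prompt = few_shot("What is 7 + 6?", "Answer each arithmetic question.", examples)
print(prompt)
```

Each returned string would then be sent to the LLM of choice; which variant works best is an empirical question for the given model and task.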

Retrieval Augmented Generation

This approach is more of an architecture that includes an LLM as only one component. Given a user question, RAG systems access connected data sources, identify relevant documents, and then include these documents (or parts, or their summary) as the context for an LLM prompt. Further enhancing the prompt with definitive instructions like “If you cannot answer with the given context, then just say ‘I don’t know’” increases the answer quality and reduces hallucinations. In a RAG system, LLMs can operate according to two principles: a) the push principle, in which a similarity search component actively uses the storage system to extract relevant content and generates an LLM prompt, and b) the pull principle, in which the LLM is given access to the storage system to extract relevant content autonomously.
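The push principle can be sketched as a small pipeline: retrieve first, then build the prompt, then call the model. Everything named here (`retrieve`, `build_prompt`, `answer`, the word-overlap scoring, and the sample documents) is a hypothetical stand-in; a real system would use a proper similarity search and an actual LLM client in place of the `llm` callable.

```python
from typing import Callable, List

FALLBACK = "If you cannot answer with the given context, then just say 'I don't know'."


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Rank documents by shared words with the question; drop non-matches."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(d.lower().split())), d) for d in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(question: str, context_docs: List[str]) -> str:
    """Combine the fallback instruction, retrieved context, and question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"{FALLBACK}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Push principle: the pipeline, not the LLM, decides what to retrieve."""
    return llm(build_prompt(question, retrieve(question, documents)))


# Invented documents, for illustration only.
docs = [
    "the warranty covers battery defects for two years",
    "shipping takes five business days",
]
question = "how long does the warranty cover the battery"
# The identity function stands in for the LLM so the prompt can be inspected.
print(answer(question, docs, llm=lambda prompt: prompt))
```

Under the pull principle, `retrieve` would instead be exposed to the LLM as a callable tool, and the model would decide when and with which query to invoke it.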


Agents

The final approach is a very recent addition to how LLMs can be used. The envisioned use case of agents is the self-controlled prioritization and execution of tasks to answer a complex query, for example a question about statistics like “What is the minimum and maximum timespan of NASA missions from 2000 to 2020?” To solve this task, data sources need to be accessed to determine the context, then relevant information needs to be extracted, and finally this information needs to be processed with knowledge inference to create a suitable answer. In agent frameworks, LLMs are augmented with memory and tools. The memory stores all past actions and intermediate results, providing the operational context while the agent works on the task. Tools are special-purpose programs that give access to external information (like a RAG system) or to special skills, for example statistical calculations.
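The interplay of memory and tools can be shown with a toy agent loop for the mission example. This is a sketch under strong assumptions: `plan_next_step` is a hard-coded stand-in for the LLM’s decision making, the tool names are made up, and the mission records are placeholder data, not real NASA missions.

```python
from typing import Any, Callable, Dict, List, Optional

# Placeholder records for illustration only, NOT real mission data.
MISSIONS = [
    {"name": "Mission A", "start": 2001, "end": 2004},
    {"name": "Mission B", "start": 2005, "end": 2015},
    {"name": "Mission C", "start": 1995, "end": 1999},
    {"name": "Mission D", "start": 2010, "end": 2011},
]

Memory = List[Dict[str, Any]]


def lookup_missions(memory: Memory) -> List[Dict[str, Any]]:
    """Tool 1: fetch missions that ran entirely between 2000 and 2020."""
    return [m for m in MISSIONS if m["start"] >= 2000 and m["end"] <= 2020]


def compute_timespans(memory: Memory) -> Dict[str, int]:
    """Tool 2: statistics over the result of the previous step in memory."""
    spans = [m["end"] - m["start"] for m in memory[-1]["result"]]
    return {"min_years": min(spans), "max_years": max(spans)}


TOOLS: Dict[str, Callable[[Memory], Any]] = {
    "lookup": lookup_missions,
    "stats": compute_timespans,
}


def plan_next_step(memory: Memory) -> Optional[str]:
    """Hard-coded stand-in for the LLM choosing the next tool."""
    used = [step["tool"] for step in memory]
    for tool in ("lookup", "stats"):
        if tool not in used:
            return tool
    return None  # nothing left to do


def run_agent(max_steps: int = 5) -> Memory:
    """Loop: plan, call a tool, store action and result in memory."""
    memory: Memory = []
    for _ in range(max_steps):
        tool = plan_next_step(memory)
        if tool is None:
            break
        memory.append({"tool": tool, "result": TOOLS[tool](memory)})
    return memory


memory = run_agent()
print(memory[-1]["result"])
```

In an actual agent framework, the planning step would be an LLM call that receives the memory as context and returns the name of the next tool together with its arguments.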

These approaches provide a landscape of options. The next section details which types of QA systems need to be distinguished, and then compares how the approaches tackle different types of tasks.

Question Answering Tasks

Question answering systems have two characteristic properties:

  • Open Domain vs Closed Domain: This characteristic determines how many knowledge domains the system is proficient in. In a closed domain system, only questions from one specialized domain need to be answered. In open domain systems, questions from multiple domains are expected.
  • Closed Book vs Open Book: The distinction whether the system is confined to the information it was trained on, or whether it can access additional sources and databases that provide a context for the question.
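Since the two properties are independent of each other, the possible system types are simply their combinations. A minimal sketch, with labels chosen by me:

```python
from itertools import product

domains = ["closed-domain", "open-domain"]
books = ["closed-book", "open-book"]

# Every pairing of one domain property with one book property.
system_types = [f"{domain}, {book}" for domain, book in product(domains, books)]

for system_type in system_types:
    print(system_type)
```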

Combining these characteristics leads to four different system types. Here is the analysis of which type can be realized with which approach.

This table indicates several observations:

  • As long as the LLM was trained on material from the question domain, it can be used as a QA system (barring the ability to produce coherent text, which might not be present in Gen1 LLMs).
  • When open domain questions are considered, only Gen3 and Gen4 models are applicable, because these LLMs were trained on a wide variety of material for different knowledge areas, and even include multilingual texts.
  • Open-book capabilities can be achieved by Gen3 and Gen4 models when they are used in retrieval-augmented generation and agent systems.

With this, let’s review which approaches are applicable to which LLM.

This table shows several trends:

  • Gen1 and Gen2 models both need to be extended with domain material
  • Gen1 and Gen2 models need explicit finetuning to obtain question-answering skills at all, and it’s unclear if instruction finetuning can be used
  • Gen3 and Gen4 models provide linguistic and question-answering skills out of the box
  • Some pretrained Gen3 and Gen4 models need to be instruction finetuned to further boost their question-answering skills


Conclusion

A traditional question answering system consists of dedicated components: text processor, knowledge extractor, and retriever. A user inputs a query, then the retriever extracts relevant knowledge and formulates the answer. The same functionality can be achieved with LLMs. This article explained seven approaches to using LLMs for question-answering systems: a) question-answering finetuning, b) linguistic finetuning, c) domain embedding, d) instruction finetuning, e) prompt engineering, f) retrieval augmented generation, and g) agents. Furthermore, the article investigated how Gen1 to Gen4 models support these approaches, with the conclusion that Gen3 and Gen4 models can be used as-is for closed-book systems and, in the context of retrieval augmented generation and agents, also for open-book systems.