Introduction#

Project Lifecycle

Every LLM project goes through at least some version of this lifecycle:

_images/project-lifecycle-1.png

(Diagram taken from DeepLearning.AI, provided under a Creative Commons license)

Focus of Workshop

_images/project-lifecycle-1-annotated.png

This workshop focuses on best practices for:

  • Selecting and executing open-source LLMs on Quest and on the Kellogg Linux Cluster (KLC)

  • Adapting models by using fine-tuning to improve performance and accuracy

  • Integrating with external resources at run-time to improve LLM knowledge and reduce hallucinations

Define the Use Case#

The use case determines the model you pick, your evaluation strategy, and the data you need.

_images/project-lifecycle-2.png

Your plan should specify:

  • What data will I be using to achieve my research goal?

  • How much data do I need?

  • How will I evaluate LLM output?

  • What counts as good enough?

Types of Use Cases

LLMs support different types of use cases, often with somewhat different underlying model architectures:

_images/LLM-use-cases.png

Select a Model#

Many models to choose from

_images/project-lifecycle-3-annotated.png

Why choose open source over closed source models like GPT-4?

  • Reproducibility

  • Data privacy

  • Flexibility to adapt a model

  • Ability to share a model

  • Cost at inference time

Models vs. Code

The model is a large file of weights; code loads the weights into the correct neural network topology and repeatedly executes a huge number of vector/matrix operations (image source)

[Code for training a GPT-2 class model is only slightly longer]

_images/llm-intro.png
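
As a minimal sketch of this point (assuming the Hugging Face transformers library; "gpt2" stands in for any open checkpoint), the code below builds the network topology, loads the weight file into it, and runs one forward pass:

```python
# Minimal sketch: the "model" is just a file of weights on disk; this code
# constructs the architecture and loads those weights into it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # weights -> network

# A single forward pass executes a huge number of vector/matrix operations
inputs = tokenizer("Hello, world", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (batch, sequence_length, vocab_size)
```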

Models vs. Code

_images/model-v-code.drawio.png

Model Hubs

One widely used model hub is from Hugging Face:

_images/model-hub.png
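
As an illustrative sketch (assuming the huggingface_hub package is installed), a checkpoint can be pulled down from the hub programmatically; "gpt2" below is just an example repo id:

```python
# Download a model's weight and config files from the Hugging Face hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="gpt2")  # example repo id
print(local_dir)  # directory containing the downloaded files
```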

Benchmarks and Leaderboards: Chatbot Arena

This is the Chatbot Arena leaderboard as of 2024-03-04:

_images/chatbot-leaderboard.png

Benchmarks and Leaderboards: Others

There are many other benchmarks:

_images/big-benchmarks-collection.png

Benchmarks and Leaderboards: HELM

The growing capabilities of very large LLMs have inspired new and challenging benchmarks, like HELM:

_images/helm-benchmark.png

Executing an Open LLM

Executing LLMs on a GPU is much faster than using a CPU. We will show you how to access GPUs for training and inference on Quest/KLC.

_images/gpu-v-cpu.jpg
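
A minimal inference sketch on a GPU (assuming a CUDA device is visible on the node and the transformers library is installed; the "gpt2" checkpoint is only a stand-in for a larger open model):

```python
# Move the model to the GPU (if available) and generate text there.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("The Kellogg Linux Cluster", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```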

Adapt the Model#

Fine-tuning

While we should always start by crafting good prompts to get the best performance we can, it can sometimes be advantageous to adapt the model itself. Fine-tuning is one way to achieve this goal.

_images/project-lifecycle-4-annotated.png

Fine-tuning

Fine-tuning can improve model performance and reduce the need for complex prompts (saving on context use). It is particularly important for smaller models, where it can boost performance to levels comparable to bigger models.

_images/full-fine-tuning.png
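
The sketch below illustrates one possible full fine-tuning loop with the Hugging Face Trainer; the dataset, checkpoint, and hyperparameters are illustrative assumptions, not the workshop's settings:

```python
# Illustrative full fine-tuning sketch: update all model weights on a small corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Small example corpus, tokenized into short sequences
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # The collator creates shifted labels for next-token (causal LM) training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                       # full fine-tuning: all weights updated
model.save_pretrained("ft-out/final")
```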

Evaluation Metrics

Evaluation metrics depend on the type of task. For information extraction tasks, metrics such as precision and recall are appropriate.

_images/precision-recall.png
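
For concreteness, precision and recall can be computed directly from true-positive, false-positive, and false-negative counts; the labels below are made-up illustration values:

```python
# Precision and recall from raw counts (toy example data).
y_true = [1, 0, 1, 1, 0, 1]   # gold labels for an extraction task
y_pred = [1, 1, 1, 0, 0, 1]   # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # of the items extracted, how many were correct
recall = tp / (tp + fn)      # of the correct items, how many were extracted
print(f"precision={precision:.2f} recall={recall:.2f}")
```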

Application Integration#

Deployment as an Application

LLMs are usually deployed as a component of a larger application. This larger application can make use of external resources, such as collections of documents or knowledge bases. Deployment must also take into account the available computational resources, such as GPUs and sufficient memory.

_images/project-lifecycle-5-annotated.png

Model Quantization

Models can consume very large amounts of memory. The largest model you can currently run on Quest has to fit into 4 Nvidia A100s with 80 GB of RAM each. This is a lot, but you have to contend for these nodes with the rest of Northwestern. One way to tackle this challenge is to quantize your model weights, lowering floating-point precision so the model consumes less memory:

_images/FP8-scheme.png
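
One common way to do this (an illustrative sketch, assuming transformers with bitsandbytes installed and a CUDA GPU; the model id is an example) is to load the weights in 8-bit form:

```python
# Load a model with 8-bit quantized weights to cut memory use.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",               # example checkpoint
    quantization_config=quant_config,  # store weights in 8 bits instead of 16/32
    device_map="auto",                 # place layers on the available GPU(s)
)
print(model.get_memory_footprint() / 1e9, "GB")
```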

Retrieval Augmented Generation (RAG)

No model can “know” anything about events that occurred after its training cutoff date. One way to overcome this obstacle is to integrate external resources at run time, for example via Retrieval Augmented Generation (RAG). RAG can result in better prompt completions and fewer “hallucinations”.

_images/RAG-intro.png

Fig. 1 source#
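
A minimal RAG sketch (the embedding model, documents, and prompt template below are all illustrative assumptions, not a recommended configuration): embed the documents and the question, retrieve the closest document, and prepend it to the prompt before generation:

```python
# Toy RAG pipeline: retrieve a relevant document, then generate with it in the prompt.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

documents = [
    "The Kellogg Linux Cluster (KLC) is part of Northwestern's Quest system.",
    "Retrieval Augmented Generation adds retrieved text to the prompt.",
]
question = "What is KLC?"

# 1. Retrieve: embed documents and question, pick the closest document
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(documents, convert_to_tensor=True)
q_emb = embedder.encode(question, convert_to_tensor=True)
best = int(util.cos_sim(q_emb, doc_emb).argmax())

# 2. Augment and generate: put the retrieved text into the prompt
generator = pipeline("text-generation", model="gpt2")
prompt = f"Context: {documents[best]}\nQuestion: {question}\nAnswer:"
print(generator(prompt, max_new_tokens=40)[0]["generated_text"])
```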