Fine-tuning, which frequently refers to instruction fine-tuning, uses supervised learning to improve the real-world, task-specific performance of pre-trained LLMs. It can target a single new task or multiple new tasks. The resulting fine-tuned model can have entirely new weights (full fine-tuning) or only a subset of its weights updated (partial fine-tuning). This form of supervised learning is usually performed with off-the-shelf labeled training data, such as Amazon product reviews with their associated star ratings, or with manually created summaries.
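
As an illustration, a single instruction fine-tuning example can be represented as a prompt-completion pair; the field names and review text below are made up and not tied to any particular dataset format:

```python
# One illustrative instruction fine-tuning record (hypothetical field names).
training_example = {
    "prompt": (
        "Classify the sentiment of this product review as 1-5 stars.\n\n"
        "Review: The cable stopped working after two days.\n\n"
        "Rating:"
    ),
    "completion": " 1",
}

# Supervised fine-tuning minimizes the loss between the model's predicted
# completion and the labeled completion across many such pairs.
```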

Fine-tuned LAnguage Net (FLAN) refers to a specific set of instruction datasets used to instruction fine-tune various models. FLAN-T5 is a version of the T5 model that has been instruction fine-tuned using FLAN.

Metrics

Metrics that measure whether fine-tuning is having the desired effect include ROUGE and BLEU. ROUGE is primarily used for summarization tasks, while BLEU is used for translation tasks. ROUGE measures how closely the fine-tuned model’s output resembles a human-produced example or reference. The comparison can be made over single words (unigrams), two-word sequences (bigrams) or n-word sequences (n-grams). The metric has many limitations, including the fact that it ignores word order entirely when using unigrams and only approximates it with bigrams or n-grams.

\[ROUGE{–}1\ Precision = \frac{unigram\ matches}{unigrams\ in\ output} \]

\[ROUGE{–}1\ Recall = \frac{unigram\ matches}{unigrams\ in\ reference} \]

\[ROUGE{–}1\ F1 = 2 * \frac{precision * recall}{precision + recall} \]

ROUGE-2 and ROUGE-n can be calculated in the same manner, using bigram and n-gram matches respectively. The ROUGE metric can also be calculated based on the Longest Common Subsequence (LCS) using the formulas below:

\[ROUGE{–}L\ Precision = \frac{len(Longest\ Common\ Subsequence)}{unigrams\ in\ output} \]

\[ROUGE{–}L\ Recall = \frac{len(Longest\ Common\ Subsequence)}{unigrams\ in\ reference} \]

\[ROUGE{–}L\ F1 = 2 * \frac{precision * recall}{precision + recall} \]
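
To make the formulas concrete, here is a toy ROUGE-1 computation in Python; real implementations also handle tokenization details, stemming and multiple references:

```python
from collections import Counter

def rouge_1(output: str, reference: str) -> dict:
    """Toy ROUGE-1: each matching unigram is counted at most as often as it
    appears in both the output and the reference."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    out_counts, ref_counts = Counter(out_tokens), Counter(ref_tokens)
    matches = sum(min(count, ref_counts[word]) for word, count in out_counts.items())

    precision = matches / len(out_tokens) if out_tokens else 0.0
    recall = matches / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("it is very cold outside", "it is cold outside"))
# precision = 4/5, recall = 4/4, F1 ≈ 0.89
```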

Libraries like fmeval simplify the task of calculating these metrics.

Other benchmarks such as GLUE (2018), SuperGLUE (2019), MMLU (2021), BIG-Bench and HELM provide a more comprehensive assessment of model performance. For example, HELM measures model performance on accuracy, calibration, robustness, fairness, bias, toxicity and efficiency. BIG-Bench comes in Lite, Base and Hard versions, since running the full benchmark is computationally expensive.

Full fine-tuning

Full fine-tuning is expensive: like pre-training, it requires enough compute and memory to hold the optimizer states, gradients, forward activations and trainable weights, plus temporary memory. As little as 500-1,000 examples may be enough to improve performance on a single task. Training can also be done for multiple tasks. There is a risk of catastrophic forgetting, where training the model on a new task degrades performance on tasks the model handled well prior to fine-tuning. Catastrophic forgetting may be acceptable depending on the application. Techniques that use regularization or partial fine-tuning are less likely to cause catastrophic forgetting. As in any supervised learning, the prompt-response pairs are split into training, validation and test data sets, and training aims to minimize loss. The resulting fine-tuned model requires the same amount of memory for inference as the original. Further, full fine-tuning creates a full copy of the original (reference) LLM for each task.
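
For illustration, a minimal full fine-tuning sketch with the Hugging Face Trainer might look like the following, assuming a tokenized prompt-response dataset named tokenized_datasets and a FLAN-T5 base model (both illustrative choices, not a complete recipe):

```python
# Minimal full fine-tuning sketch. Assumes `tokenized_datasets` already holds
# tokenized prompt-response pairs with "train" and "validation" splits.
from transformers import AutoModelForSeq2SeqLM, TrainingArguments, Trainer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

training_args = TrainingArguments(
    output_dir="./full-fine-tuned",  # checkpoints are written here
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],      # assumed to exist
    eval_dataset=tokenized_datasets["validation"],  # assumed to exist
)

trainer.train()     # updates every weight in the model
trainer.evaluate()  # loss on the validation split
```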

Partial fine-tuning or Parameter-Efficient Fine-Tuning (PEFT)

Only a small fraction of weights (as little as 15-20% of the original) are modified or added during PEFT, which means it can often be performed on a single GPU. The resulting PEFT weights are only MBs in size, as opposed to the GBs required to store the entire model. Multiple PEFT approaches exist, trading off parameter efficiency, training speed, inference cost, model performance and memory efficiency. The main categories of PEFT methods are:

  1. Selective - Fine-tune only a subset of the original LLM’s parameters. Performance is mixed.
  2. Reparameterization - Reparameterize model weights using a low-rank representation. (LoRA, described below, is a reparameterization technique.)
  3. Additive - Keep all original weights as-is, but add trainable layers or parameters to the model. This can be done through adapters, where new trainable layers are added to the model, typically inside the encoder or decoder components after the attention or feed-forward layers. Soft prompts instead add a set of trainable parameters to the prompt embeddings. Prompt tuning is one such technique, in which a set of trainable virtual tokens is prepended to the user’s prompt (see the sketch after this list); it is not to be confused with prompt engineering. A different soft prompt can be trained for each task. For large models (>10B parameters), prompt tuning can be as effective as full fine-tuning. Virtual tokens are not real ‘words’ in the vocabulary, so interpretability can be a consideration.
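
A minimal prompt-tuning sketch using the Hugging Face peft library follows; the model name and hyperparameters are illustrative, and the dataset and training loop are omitted:

```python
# Minimal prompt-tuning sketch (illustrative model and hyperparameters).
from transformers import AutoModelForSeq2SeqLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,  # initialize virtual tokens randomly
    num_virtual_tokens=20,                       # trainable tokens prepended to every prompt
)

peft_model = get_peft_model(base_model, prompt_config)
peft_model.print_trainable_parameters()  # only the virtual-token embeddings are trainable
```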

Low Rank Adaptation (LoRA) injects two rank decomposition matrices (aka low-rank matrices) alongside the frozen original weights. This is typically done just in the self-attention layers. Rank sizes in the range of 4 to 32 are typical; much larger rank decomposition matrices do not yield substantial improvements. Since the PEFT adapters are small, it is easy to create and manage a separate adapter for each task and swap it in on top of the original (reference) model. Generally, fine-tuning with LoRA results in ROUGE metrics that approach, but do not quite match, full fine-tuning. QLoRA extends this approach with quantization.
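
As a sketch, LoRA adapters can be attached with the peft library; the rank, scaling factor and target module names below are illustrative (the "q" and "v" names match T5-style attention projections, and other architectures use different module names):

```python
# Minimal LoRA sketch (illustrative model, rank, and target modules).
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    r=16,                       # rank of the decomposition matrices
    lora_alpha=32,              # scaling factor applied to the LoRA update
    target_modules=["q", "v"],  # T5 self-attention query/value projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all parameters
```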

References

Credits: Course notes from DeepLearning.AI’s Generative AI with LLMs course