Motivation for RLHF - The goal is to have responses that are Helpful, Honest and Harmless (HHH). Combining Reinforcement Learning (RL) with human feedback is one way to align a model's responses toward these qualities.
Using a tic-tac-toe example: The environment is the game board, and the state is the board's current configuration. This state is fed to the agent, which chooses actions (the legal moves from the current state) in order to optimize for a desired outcome, which here is winning the game. One complete sequence of states and actions is called a playout or rollout; rollout is the term more commonly used in the RLHF context.
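To make the vocabulary concrete, here is a minimal, purely illustrative Python sketch of a rollout with a toy random-move tic-tac-toe agent. All names (`random_agent`, `rollout`, etc.) are made up for illustration, and win detection is omitted for brevity:

```python
import random

EMPTY, X, O = " ", "X", "O"

def legal_actions(board):
    """Actions: the moves available from the current board state."""
    return [i for i, cell in enumerate(board) if cell == EMPTY]

def random_agent(board):
    """A trivial agent: picks an action at random, with no learning involved."""
    return random.choice(legal_actions(board))

def rollout():
    """One playout/rollout: play moves until the board is full."""
    board = [EMPTY] * 9            # environment state
    player = X
    trajectory = []                # (state, action) pairs seen during this rollout
    while legal_actions(board):
        action = random_agent(board)
        trajectory.append((board.copy(), action))
        board[action] = player     # environment transitions to a new state
        player = O if player == X else X
    return trajectory

print(len(rollout()), "moves in this rollout")
```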
Applying the same concepts to the objective of improving HHH for an LLM: The first step is to start with an instruct LLM (instruction fine-tuned LLM) and use it to generate a number of completions for a given prompt. The completions are initially ranked for the desired objective (such as helpfulness) by a human labeler, following detailed instructions on how to rank for that outcome. Typically, multiple human labelers rank the same completions to get reliable results.
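As a sketch of this sampling step, assuming a Hugging Face Transformers causal LM (the checkpoint name below is a placeholder for whatever instruct LLM is being fine-tuned):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the instruct LLM you are fine-tuning.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")

# Sample several candidate completions for the same prompt;
# these are what human labelers would rank for helpfulness.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    max_new_tokens=64,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
completions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
for i, completion in enumerate(completions):
    print(f"Completion {i}: {completion}\n")
```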
Replacing human labeling with a reward model: Human feedback is ideal, but not scalable. The human feedback from the step above is used to train a reward model that ranks future prompt completions. To do this, the human rankings are converted into pairwise training data for the reward model. If there are three completions (A, B and C) ranked 2, 1, and 3 respectively, with rank 1 being the highest, we end up with three pairs (AB, BC and AC) with corresponding scores of [0,1], [1,0] and [1,0]. In general there are $\binom{n}{2}$ pairs, where n is the number of completions generated per prompt. The reward model is typically a language model such as BERT, trained with supervised learning on these pairwise rankings produced by human labelers. The pairs are reordered so that the human-preferred completion comes first - BA, BC and AC in our example. The reward model acts as a binary classifier - e.g. not hate vs. hate - and produces logits (unnormalized scores) for the two classes. These logits can be converted to probabilities using a softmax function, and the positive class (desired outcome) should receive a higher value than the negative class (undesired outcome).
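A small sketch of the pairwise-data construction and the logits-to-probability step, using the ranks from the example above and illustrative logit values rather than real model outputs:

```python
from itertools import combinations
import torch
import torch.nn.functional as F

# Completions A, B, C with human ranks (1 = best), as in the example above.
ranks = {"A": 2, "B": 1, "C": 3}

# n-choose-2 pairs, reordered so the human-preferred completion comes first.
pairs = []
for x, y in combinations(ranks, 2):
    preferred, rejected = (x, y) if ranks[x] < ranks[y] else (y, x)
    pairs.append((preferred, rejected))
print(pairs)  # [('B', 'A'), ('A', 'C'), ('B', 'C')]

# The trained reward model emits logits for the two classes
# (e.g. not-hate vs. hate); softmax turns them into probabilities.
logits = torch.tensor([3.2, -1.1])   # illustrative values, not real model output
probs = F.softmax(logits, dim=-1)
print(probs)  # higher probability on the positive (desired) class
```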
Using the reward model - The flow is as follows:
$$Prompt\ dataset \rightarrow Instruct\ LLM\ to\ fine\ tune \rightarrow Reward\ Model \rightarrow RL\ Algorithm$$
The instruct LLM being fine-tuned becomes the RL-updated LLM as the RL algorithm updates its weights based on the reward (the positive-class score) returned by the reward model. One pass through this flow is one iteration, and the process continues for a set number of iterations or epochs, with each pass updating the weights of the RL-updated LLM.
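A schematic of one pass through this flow; `generate_completion`, `reward_model_score`, and `rl_update` are hypothetical stand-ins for the instruct LLM, the reward model, and the RL algorithm (e.g. a PPO step), not real library calls:

```python
# Every function here is a hypothetical stand-in for illustration only.

def generate_completion(llm, prompt):
    """Stand-in for the RL-updated LLM producing a completion."""
    return llm(prompt)

def reward_model_score(completion):
    """Stand-in for the reward model's positive-class score."""
    return 0.5  # placeholder value

def rl_update(llm, prompt, completion, reward):
    """Stand-in for one RL (e.g. PPO) weight update."""
    return llm  # would return the LLM with updated weights

def rlhf_training(llm, prompt_dataset, num_epochs=3):
    for epoch in range(num_epochs):
        for prompt in prompt_dataset:
            completion = generate_completion(llm, prompt)
            reward = reward_model_score(completion)
            llm = rl_update(llm, prompt, completion, reward)
    return llm

# Toy usage: an "LLM" that just echoes the prompt.
aligned_llm = rlhf_training(lambda p: p + " ...", ["Explain RLHF briefly."])
```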
RL algorithm - takes the output of the reward model and updates the model weights to produce more human-aligned completions. Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO) are some examples, with ORPO being the most recent of the three.
Reward hacking - As we continue through the RLHF process, we expect the reward model's scores to increase as the model produces more desirable completions. At the same time, we do not want the completions to diverge too much from their original meaning and intent, or to become nonsensical. Three techniques to manage this:
- Stop the training when we reach a certain reward score
- Stop the training after a set number of iterations
- Use a KL-divergence penalty on completions that diverge from the original completion. The comparison is made against a reference model, which is kept frozen (unaltered) throughout training; a sketch of this penalty follows the list
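A minimal sketch of the KL penalty for a single generated token, with illustrative logits and an assumed penalty coefficient `beta`:

```python
import torch
import torch.nn.functional as F

# Illustrative logits over a tiny vocabulary for one generated token;
# in practice these come from the RL-updated LLM and the frozen reference LLM.
policy_logits = torch.tensor([2.0, 0.5, -1.0])
reference_logits = torch.tensor([1.5, 0.7, -0.8])

policy_logprobs = F.log_softmax(policy_logits, dim=-1)
reference_logprobs = F.log_softmax(reference_logits, dim=-1)

# KL(policy || reference) for this token.
kl = torch.sum(policy_logprobs.exp() * (policy_logprobs - reference_logprobs))

beta = 0.1           # penalty strength (illustrative)
reward_score = 0.8   # reward model's score (illustrative)
penalized_reward = reward_score - beta * kl.item()
print(f"KL = {kl.item():.4f}, penalized reward = {penalized_reward:.4f}")
```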
References
Credits: Course notes from DeepLearning.AI’s Generative AI with LLMs course