Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making

2026-05-02T00:00:00+00:00

Reinforcement learning has become one of the central ideas behind modern LLM post-training. Yet it is often discussed in a confusing way: sometimes as a mathematical framework, sometimes as an alignment recipe such as RLHF, sometimes as a set of algorithms such as PPO, DPO, or GRPO.

Before going into those algorithms, it is useful to start from a simpler question:

Why do we need reinforcement learning at all, if supervised fine-tuning already teaches a model to imitate good answers?

This is the question I want to unpack in this first note.

1. Supervised learning teaches imitation

Supervised learning is the most familiar training paradigm. We collect input-output pairs and train a model to imitate the target answer.

For language models, this may look like:

Instruction → Desired response

or, in a local-service search system:

Query → Structured intent / entity labels

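The objective is straightforward: minimize the difference between the model output and the provided label. For language models, this is typically the token-level cross-entropy between the model's predictions and the label tokens. In supervised fine-tuning (SFT), the model learns to reproduce high-quality demonstrations.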

This is powerful, but it has a limitation: the model is still learning from a fixed dataset. It is not directly optimizing what happens after its output is used.
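The SFT objective described above can be sketched in a few lines. This is a minimal illustration, assuming the usual token-level negative log-likelihood; the function name `sft_loss` and the probabilities are made up for the example and do not come from any particular library.

```python
import math

def sft_loss(target_token_probs):
    """Average negative log-likelihood of the demonstration tokens.

    target_token_probs[t] is the probability the model assigns to the
    t-th token of the labeled response, given the prompt and the
    previous label tokens (teacher forcing).
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

# Toy numbers (assumed): a model that is confident in the labeled tokens
# gets a lower loss than one that is not.
confident = sft_loss([0.9, 0.8, 0.95])
unsure = sft_loss([0.3, 0.2, 0.4])
print(confident < unsure)  # True
```

Note that the loss depends only on the fixed label tokens. Nothing in it measures what happens after the model's own output is used.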

In other words, supervised learning answers:

Given this input, what output should I imitate?

It does not naturally answer:

If I generate this output, will it lead to a better downstream outcome?

That distinction becomes important when we care about long-horizon behavior, user preference, reasoning quality, tool use, safety, or search-result quality.

2. Reinforcement learning optimizes decisions, not just labels

Reinforcement learning changes the framing.

Instead of learning from static input-output pairs, an agent interacts with an environment. At each step, it observes a state, chooses an action, receives feedback, and moves to a new state.

In classical RL, this could be a robot walking through a maze. In an LLM, the analogy is surprisingly natural:

State: the prompt, the conversation history, retrieved context, and previously generated tokens.
Action: the next token, a tool call, a rewrite decision, a retrieval decision, or a final answer.
Reward: a signal measuring whether the output was useful, correct, safe, preferred, or beneficial to the downstream system.
Trajectory: the full sequence of decisions leading from the initial prompt to the final response.

This is why LLM generation can be viewed as sequential decision making. The model does not produce an answer in one atomic step; it builds the answer token by token, and each token changes the context for the next decision.
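The state, action, reward, and trajectory mapping above can be made concrete with a toy rollout loop. This is only a sketch: `toy_policy` and `toy_reward` are hypothetical stand-ins for a real LLM and a real reward signal, and the vocabulary is invented for the example.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]  # toy vocabulary (assumed)

def toy_policy(state, rng):
    """Hypothetical stand-in for an LLM: picks the next token at random."""
    return rng.choice(VOCAB)

def toy_reward(prompt, response):
    """Hypothetical reward: 1.0 if the response contains 'yes', else 0.0."""
    return 1.0 if "yes" in response else 0.0

def rollout(prompt, max_steps=8, seed=0):
    rng = random.Random(seed)
    state = list(prompt)            # state: prompt plus tokens generated so far
    trajectory = []                 # list of (state, action) pairs
    for _ in range(max_steps):
        action = toy_policy(state, rng)        # action: the next token
        trajectory.append((tuple(state), action))
        state.append(action)        # each action changes the next state
        if action == "<eos>":
            break
    response = state[len(prompt):]
    reward = toy_reward(prompt, response)      # reward arrives after the full response
    return trajectory, response, reward

trajectory, response, reward = rollout(["is", "this", "ok", "?"])
print(len(trajectory), response, reward)
```

In this sketch the reward is computed once, after the last token, which is the common situation for LLMs and the source of several difficulties discussed next.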

3. Why LLM RL is harder than ordinary supervised fine-tuning

In supervised fine-tuning, the data distribution is fixed. The model sees examples and learns to match them.

In reinforcement learning, the model affects the data distribution. Once the policy changes, the model starts generating different outputs, which leads to different rewards, different future states, and different training signals.

This creates several difficulties:

Credit assignment: If a long answer is good or bad, which token or reasoning step deserves the credit or blame?
Delayed reward: The quality of an answer may only be known after the full response is generated, after a judge evaluates it, or after users interact with the result.
Exploration vs. exploitation: The model must sometimes try new behaviors to discover better outputs, but too much exploration can damage quality.
Training instability: Updating a large model too aggressively can cause it to drift away from its original language ability.

This is why LLM RL methods often include stabilizers such as KL penalties, clipped policy updates, reward normalization, or preference-based objectives.
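Two of these stabilizers are easy to sketch in isolation. The code below is an illustrative simplification, not a full training loop: it assumes per-token log-probabilities are already available, and the function names, `beta`, and `eps` defaults are choices made for this example.

```python
import math

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Subtract a sample-based KL estimate from the task reward.

    logp_policy / logp_ref: per-token log-probabilities of the sampled
    tokens under the current policy and a frozen reference model.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped objective for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min removes the incentive to push the ratio outside
    # [1 - eps, 1 + eps] in the direction that would increase the objective.
    return min(ratio * advantage, clipped * advantage)
```

The KL term lowers the reward when the policy assigns much higher probability to its samples than the reference model does, and the clip bounds how much a single update can gain from moving away from the policy that generated the data.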

4. From RLHF to modern post-training

The best-known LLM reinforcement learning recipe is RLHF: reinforcement learning from human feedback.

A typical RLHF pipeline has three stages:

Supervised fine-tuning: Train the model on human-written demonstrations.
Reward modeling: Ask humans to compare model outputs and train a reward model to predict human preference.
Policy optimization: Fine-tune the model to maximize the reward model's score while preventing it from drifting too far from the original (SFT) model.

This framing was popularized by InstructGPT, which used human feedback to make GPT-style models better at following instructions and more aligned with user intent. The InstructGPT paper showed that a 1.3B parameter model trained with human feedback could be preferred over a much larger 175B GPT-3 model on their prompt distribution, highlighting that post-training quality is not only about scale.
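The reward-modeling stage is commonly trained with a pairwise loss on human comparisons, as in the InstructGPT paper: the reward model should score the preferred response higher than the rejected one. The sketch below shows that loss for a single comparison; the function name and the example scores are assumptions for illustration.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)) for one comparison."""
    margin = score_chosen - score_rejected
    # Equivalent to log(1 + exp(-margin)), written to avoid overflow
    # when |margin| is large.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy scores (assumed): ranking the pair correctly gives a lower loss.
print(pairwise_reward_loss(2.0, 0.0) < pairwise_reward_loss(0.0, 2.0))  # True
```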

Later methods simplified or modified this pipeline. DPO, for example, removes the explicit reward-model-and-RL loop by directly optimizing on preference pairs. GRPO, introduced in DeepSeekMath, modifies PPO-style training by dropping the learned value function (critic) and instead estimating each response's advantage relative to a group of responses sampled for the same prompt, which reduces memory use in reasoning-oriented LLM training.
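Both ideas reduce to short formulas once sequence log-probabilities and rewards are available. The following is a hedged sketch for a single preference pair and a single group of sampled responses. The function names are invented for this example, and the use of the population standard deviation with a small `eps` is an assumption; implementations differ on such details.

```python
import math
import statistics

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities
    under the policy and under a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against the other
    responses sampled for the same prompt, with no learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group (assumed): four sampled answers, two judged correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In the toy group, correct answers receive positive advantages and incorrect ones negative advantages, even though no value network was trained to provide a baseline.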

5. The practical engineering view

For applied LLM systems, I find it useful to think about RL not only as an algorithm, but as a system design pattern.

The important question is not merely:

Which algorithm should I use — PPO, DPO, or GRPO?

The more practical question is:

What feedback signal do I have, and how can I convert it into better model behavior?

In real production systems, the reward signal may come from:

human preference labels,
rule-based correctness checks,
LLM-as-a-judge evaluations,
retrieval success,
user engagement,
search relevance,
safety constraints,
tool-execution success,
or downstream business metrics.
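As one concrete case, a rule-based correctness check can be turned into a reward with very little machinery. The sketch below is hypothetical: the `Answer: <value>` output convention and the function name are assumptions for this example, not a standard.

```python
import re

def rule_based_reward(response, expected_answer):
    """Hypothetical reward: 1.0 if the response ends with a line
    'Answer: <value>' whose value matches exactly, else 0.0."""
    match = re.search(r"Answer:\s*(\S+)\s*$", response.strip())
    if match is None:
        return 0.0  # no parseable answer: treat as incorrect
    return 1.0 if match.group(1) == expected_answer else 0.0

print(rule_based_reward("Let me think.\nAnswer: 42", "42"))  # 1.0
print(rule_based_reward("I think it is 42", "42"))           # 0.0
```

A check like this is cheap and deterministic, but it only rewards what it can parse, so the output format becomes part of the behavior being optimized.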

Once we have a feedback signal, the next question is how to use it safely. Sometimes SFT is enough. Sometimes preference optimization is better. Sometimes a verifier or judge model is more practical than full RL. Sometimes the best solution is not online RL at all, but offline labeling, compact-model post-training, and cache-based serving.

This is especially true in search systems, where latency, cost, evaluation stability, and serving reliability matter as much as model quality.

6. Summary

Supervised learning teaches a model to imitate examples. Reinforcement learning teaches a model to improve decisions based on feedback.

For LLMs, this distinction matters because generation is sequential, feedback is often delayed, and the best output is not always defined by a single ground-truth label.

Modern LLM post-training can be seen as a spectrum:

SFT → reward modeling → PPO-style RLHF → DPO-style preference optimization → group-relative methods such as GRPO

The key idea is not that every LLM system must use reinforcement learning. The key idea is that production LLM systems need feedback loops. RL is one powerful way to formalize and optimize those loops.

In Part II, I will go deeper into PPO, DPO, and GRPO, and explain why modern LLM RL is gradually moving from generic preference alignment toward reasoning, verification, and agentic workflows.

References

Ouyang et al., “Training language models to follow instructions with human feedback,” 2022.
Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” 2023.
Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” 2024.

Kevin Tian / AI Specialist