<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kevinmtian.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kevinmtian.github.io/" rel="alternate" type="text/html" /><updated>2026-05-03T02:32:20+00:00</updated><id>https://kevinmtian.github.io/feed.xml</id><title type="html">Kevin Tian / AI Specialist</title><subtitle>Kevin Tian, also publishing as Mu Tian / Tian, Mu, is a Senior Machine Learning Engineer at ByteDance/TikTok Singapore and ex-Meta Research Scientist working on LLM post-training, agentic/RAG search, multimodal foundation models, and generative vision.</subtitle><author><name>Kevin Tian</name><email>kevinmtian@gmail.com</email></author><entry><title type="html">Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making</title><link href="https://kevinmtian.github.io/posts/2026/05/02/rl-llm-part1/" rel="alternate" type="text/html" title="Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making" /><published>2026-05-02T00:00:00+00:00</published><updated>2026-05-02T00:00:00+00:00</updated><id>https://kevinmtian.github.io/posts/2026/05/02/rl-llm-part1</id><content type="html" xml:base="https://kevinmtian.github.io/posts/2026/05/02/rl-llm-part1/"><![CDATA[<p>Reinforcement learning has become one of the central ideas behind modern LLM post-training. Yet it is often discussed in a confusing way: sometimes as a mathematical framework, sometimes as an alignment recipe, sometimes as a set of algorithms such as PPO, DPO, GRPO, or RLHF.</p>

<p>Before going into those algorithms, it is useful to start from a simpler question:</p>

<blockquote>
  <p>Why do we need reinforcement learning at all, if supervised fine-tuning already teaches a model to imitate good answers?</p>
</blockquote>

<p>This is the question I want to unpack in this first note.</p>

<h2 id="1-supervised-learning-teaches-imitation">1. Supervised learning teaches imitation</h2>

<p>Supervised learning is the most familiar training paradigm. We collect input-output pairs and train a model to imitate the target answer.</p>

<p>For language models, this may look like:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Instruction → Desired response
</code></pre></div></div>

<p>or, in a local-service search system:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Query → Structured intent / entity labels
</code></pre></div></div>

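<p>The objective is straightforward: minimize the difference between the model output and the provided label. For language models, this is typically a token-level cross-entropy loss, which is the same as maximizing the likelihood of the demonstration tokens. In SFT, the model learns to reproduce high-quality demonstrations.</p>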
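
<p>As a minimal sketch of that objective, assuming a token-level negative log-likelihood (a common choice for SFT, though this note does not prescribe a specific loss), the function below averages the loss over the demonstration tokens. The name <code>sft_token_nll</code> and the probability values are hypothetical and only illustrate the shape of the computation.</p>

```python
import math


def sft_token_nll(target_token_probs):
    """Mean negative log-likelihood of the demonstration tokens.

    target_token_probs[i] is the probability the model assigned to the
    i-th token of the labeled response (hypothetical input for this sketch).
    """
    if not target_token_probs:
        raise ValueError("need at least one target token")
    total = sum(math.log(p) for p in target_token_probs)
    return -total / len(target_token_probs)


# A model that already agrees with the label gets a lower loss
# than a model that spreads its probability elsewhere.
confident = sft_token_nll([0.9, 0.8, 0.95])
unsure = sft_token_nll([0.2, 0.1, 0.3])
print(confident, unsure)
```

<p>The loss only measures agreement with the labeled tokens. Nothing in it depends on what happens after the output is used.</p>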

<p>This is powerful, but it has a limitation: the model is still learning from a fixed dataset. It is not directly optimizing what happens after its output is used.</p>

<p>In other words, supervised learning answers:</p>

<blockquote>
  <p>Given this input, what output should I imitate?</p>
</blockquote>

<p>It does not naturally answer:</p>

<blockquote>
  <p>If I generate this output, will it lead to a better downstream outcome?</p>
</blockquote>

<p>That distinction becomes important when we care about long-horizon behavior, user preference, reasoning quality, tool use, safety, or search-result quality.</p>

<h2 id="2-reinforcement-learning-optimizes-decisions-not-just-labels">2. Reinforcement learning optimizes decisions, not just labels</h2>

<p>Reinforcement learning changes the framing.</p>

<p>Instead of learning from static input-output pairs, an agent interacts with an environment. At each step, it observes a state, chooses an action, receives feedback, and moves to a new state.</p>

<p>In classical RL, this could be a robot walking through a maze. For an LLM, the same vocabulary maps over quite naturally:</p>

<ul>
  <li>State: the prompt, the conversation history, retrieved context, and previously generated tokens.</li>
  <li>Action: the next token, a tool call, a rewrite decision, a retrieval decision, or a final answer.</li>
  <li>Reward: a signal measuring whether the output was useful, correct, safe, preferred, or beneficial to the downstream system.</li>
  <li>Trajectory: the full sequence of decisions leading from the initial prompt to the final response.</li>
</ul>

<p>This is why LLM generation can be viewed as sequential decision making. The model does not produce an answer in one atomic step; it builds the answer token by token, and each token changes the context for the next decision.</p>
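
<p>The state, action, reward, and trajectory roles can be sketched as a small rollout loop. Everything here is an illustrative assumption: <code>toy_policy</code> is a scripted stand-in for an LLM, <code>toy_reward</code> is a made-up outcome check, and the <code>EOS</code> marker is a hypothetical stop token.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    prompt: str
    actions: list = field(default_factory=list)  # tokens chosen so far
    reward: float = 0.0                          # filled in after the rollout


def toy_policy(state):
    """Hypothetical stand-in for an LLM: picks the next token from the state."""
    script = ["Paris", "is", "the", "capital", "EOS"]
    generated = len(state) - 1  # state = [prompt, token_1, token_2, ...]
    return script[min(generated, len(script) - 1)]


def toy_reward(tokens):
    """Hypothetical outcome reward: 1.0 if the answer mentions Paris."""
    return 1.0 if "Paris" in tokens else 0.0


def rollout(prompt, policy, reward_fn, max_steps=16):
    traj = Trajectory(prompt=prompt)
    state = [prompt]                 # state: prompt plus generated tokens
    for _ in range(max_steps):
        action = policy(state)       # action: the next token
        if action == "EOS":
            break
        traj.actions.append(action)
        state = state + [action]     # each token changes the next state
    traj.reward = reward_fn(traj.actions)  # reward is scored on the full response
    return traj


traj = rollout("What is the capital of France?", toy_policy, toy_reward)
print(traj.actions, traj.reward)
```

<p>Note that the reward is computed once, after the whole sequence of actions, even though the policy made a separate decision at every step.</p>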

<h2 id="3-why-llm-rl-is-harder-than-ordinary-supervised-fine-tuning">3. Why LLM RL is harder than ordinary supervised fine-tuning</h2>

<p>In supervised fine-tuning, the data distribution is fixed. The model sees examples and learns to match them.</p>

<p>In reinforcement learning, the model affects the data distribution. Once the policy changes, the model starts generating different outputs, which leads to different rewards, different future states, and different training signals.</p>

<p>This creates several difficulties:</p>

<ol>
  <li>Credit assignment: if a long answer is good or bad, which token or reasoning step deserves the credit or blame?</li>
  <li>Delayed reward: the quality of an answer may only be known after the full response is generated, after a judge evaluates it, or after users interact with the result.</li>
  <li>Exploration vs. exploitation: the model must sometimes try new behaviors to discover better outputs, but too much exploration can damage quality.</li>
  <li>Training instability: updating a large model too aggressively can cause it to drift away from its original language ability.</li>
</ol>

<p>This is why LLM RL methods often include stabilizers such as KL penalties, clipped policy updates, reward normalization, or preference-based objectives.</p>
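
<p>Two of these stabilizers are easy to write down for a single sampled response. The sketch below assumes scalar log-probabilities and hypothetical defaults (<code>beta=0.1</code>, <code>eps=0.2</code>); real implementations work on batches of per-token tensors, so treat this as an illustration of the formulas rather than a training recipe.</p>

```python
import math


def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Task reward minus a penalty for moving away from the reference model.

    logp_policy - logp_ref is a single-sample estimate of the KL term
    for the sampled response; beta is an assumed penalty weight.
    """
    return reward - beta * (logp_policy - logp_ref)


def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for one sampled action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes the incentive to push the ratio
    # outside the interval [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy doubled the probability of a good action, but the
# objective only credits it up to a ratio of 1 + eps.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
print(kl_shaped_reward(1.0, logp_policy=-2.0, logp_ref=-3.0))
```

<p>Both mechanisms limit how far a single update can move the policy: the KL term penalizes distance from a reference model, and the clip caps how much credit one update can claim.</p>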

<h2 id="4-from-rlhf-to-modern-post-training">4. From RLHF to modern post-training</h2>

<p>The best-known LLM reinforcement learning recipe is RLHF: reinforcement learning from human feedback.</p>

<p>A typical RLHF pipeline has three stages:</p>

<ol>
  <li>Supervised fine-tuning: train the model on human-written demonstrations.</li>
  <li>Reward modeling: ask humans to compare model outputs and train a reward model to predict human preference.</li>
  <li>Policy optimization: fine-tune the model to maximize the reward model's score while preventing it from drifting too far from the original model, typically through a KL penalty.</li>
</ol>
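
<p>The reward-modeling stage is often trained with a pairwise loss over comparisons. The sketch below assumes a Bradley-Terry style objective, where the reward model emits one scalar score per response; the function name and the example scores are hypothetical.</p>

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is low when the reward model ranks the human-preferred
# response higher, and high when it ranks the pair the wrong way round.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

<p>Only the difference between the two scores matters, which is why a reward model trained this way gives relative rather than absolute quality signals.</p>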

<p>This framing was popularized by InstructGPT, which used human feedback to make GPT-style models better at following instructions and more aligned with user intent. The InstructGPT paper reported that, on its prompt distribution, human labelers preferred outputs from a 1.3B-parameter model trained with human feedback over outputs from the much larger 175B GPT-3 model, highlighting that post-training quality is not only about scale.</p>

<p>Later methods simplified or modified this pipeline. DPO, for example, removes the separate reward model and the RL loop, and instead optimizes the policy directly on preference pairs with a classification-style loss. GRPO, introduced in DeepSeekMath, modifies PPO-style training by dropping the learned value model and scoring each response relative to a group of responses sampled for the same prompt, which improves memory efficiency for reasoning-oriented LLM training.</p>
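
<p>Both ideas fit in a few lines when reduced to scalars. The sketch below is a simplified reading of the two papers, not their reference implementations: <code>dpo_loss</code> takes sequence log-probabilities under the policy and a frozen reference model, and <code>group_relative_advantages</code> standardizes rewards within one group of responses. The <code>beta</code> value, the <code>eps</code> term, and the use of the population standard deviation are assumptions made for this example.</p>

```python
import math
import statistics


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_gain = logp_chosen - ref_logp_chosen        # policy vs. reference
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # Stable form of -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a correctness check.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# The policy raised the chosen response and lowered the rejected one.
print(dpo_loss(-1.0, -3.0, ref_logp_chosen=-2.0, ref_logp_rejected=-2.0))
```

<p>In the group-relative case the baseline comes from the other samples in the group, so no separate value network has to be trained or kept in memory.</p>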

<h2 id="5-the-practical-engineering-view">5. The practical engineering view</h2>

<p>For applied LLM systems, I find it useful to think about RL not only as an algorithm, but as a system design pattern.</p>

<p>The important question is not merely:</p>

<blockquote>
  <p>Which algorithm should I use — PPO, DPO, or GRPO?</p>
</blockquote>

<p>The more practical question is:</p>

<blockquote>
  <p>What feedback signal do I have, and how can I convert it into better model behavior?</p>
</blockquote>

<p>In real production systems, the reward signal may come from:</p>

<ul>
  <li>human preference labels,</li>
  <li>rule-based correctness checks,</li>
  <li>LLM-as-a-judge evaluations,</li>
  <li>retrieval success,</li>
  <li>user engagement,</li>
  <li>search relevance,</li>
  <li>safety constraints,</li>
  <li>tool-execution success,</li>
  <li>or downstream business metrics.</li>
</ul>

<p>Once we have a feedback signal, the next question is how to use it safely. Sometimes SFT is enough. Sometimes preference optimization is better. Sometimes a verifier or judge model is more practical than full RL. Sometimes the best solution is not online RL at all, but offline labeling, compact-model post-training, and cache-based serving.</p>

<p>This is especially true in search systems, where latency, cost, evaluation stability, and serving reliability matter as much as model quality.</p>

<h2 id="6-summary">6. Summary</h2>

<p>Supervised learning teaches a model to imitate examples. Reinforcement learning teaches a model to improve decisions based on feedback.</p>

<p>For LLMs, this distinction matters because generation is sequential, feedback is often delayed, and the best output is not always defined by a single ground-truth label.</p>

<p>Modern LLM post-training can be seen as a spectrum:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SFT → reward modeling → PPO-style RLHF → DPO-style preference optimization → group-relative methods such as GRPO
</code></pre></div></div>

<p>The key idea is not that every LLM system must use reinforcement learning. The key idea is that production LLM systems need feedback loops. RL is one powerful way to formalize and optimize those loops.</p>

<p>In Part II, I will go deeper into PPO, DPO, and GRPO, and explain why modern LLM RL is gradually moving from generic preference alignment toward reasoning, verification, and agentic workflows.</p>

<h2 id="references">References</h2>

<ul>
  <li>Ouyang et al., “Training language models to follow instructions with human feedback,” 2022.</li>
  <li>Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” 2023.</li>
  <li>Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” 2024.</li>
</ul>]]></content><author><name>Kevin Tian</name><email>kevinmtian@gmail.com</email></author><category term="LLM" /><category term="Reinforcement Learning" /><category term="RLHF" /><category term="Post-training" /><category term="AI Systems" /><summary type="html"><![CDATA[Reinforcement learning has become one of the central ideas behind modern LLM post-training. Yet it is often discussed in a confusing way: sometimes as a mathematical framework, sometimes as an alignment recipe, sometimes as a set of algorithms such as PPO, DPO, GRPO, or RLHF.]]></summary></entry></feed>