Posts by Tags

Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making

6 minute read

Published: May 02, 2026

Reinforcement learning has become one of the central ideas behind modern LLM post-training. Yet it is often discussed in a confusing way: sometimes as a mathematical framework, sometimes as an alignment recipe, sometimes as a set of algorithms such as PPO, DPO, GRPO, or RLHF.

Kevin Tian

AI Systems

Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making

LLM

Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making

Post-training

Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making

RLHF

Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making

Reinforcement Learning

Understanding Reinforcement Learning for LLMs — Part I: From Supervised Learning to Sequential Decision Making