
Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

Researchers: Chongli Qin, Jost Tobias Springenberg
Links: arXiv paper / full blog post

Because there’s an endless amount of data available to us and we have a limited bandwidth to conserve, we might consider carefully curating the quality of what we allow in.
― Rick Rubin, The Creative Act: A Way of Being

For a comprehensive discussion of the work, please refer to our blog post at The Emotional Scientist. Below is a shortened summary and a collection of links.

TL;DR

We study the relationship between supervised fine-tuning (SFT)―an often-used imitation learning technique that trains machine learning models to imitate human responses from fixed datasets―and reinforcement learning (RL). We find that SFT on curated data can be interpreted as performing a weak form of RL: it optimizes a loose lower bound on the actual RL objective. Furthermore, SFT can be improved by tightening that bound via importance sampling.
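To make the idea concrete, here is a toy sketch (not the paper's exact formulation) of a standard SFT loss next to a hypothetical importance-weighted variant, where each curated sequence's negative log-likelihood is scaled by the ratio of the current policy's probability to a reference policy's probability. All function names and the exact weighting scheme here are illustrative assumptions.

```python
import math

def sft_loss(logprobs):
    # Standard SFT: mean negative log-likelihood of curated sequences
    # under the current policy. logprobs[i] = log pi_theta(sequence_i).
    return -sum(logprobs) / len(logprobs)

def iw_sft_loss(logprobs_theta, logprobs_ref):
    # Illustrative importance-weighted variant: scale each sequence's
    # NLL by the ratio pi_theta / pi_ref (in practice this weight would
    # be detached from the gradient). The weighting is meant to tighten
    # the lower bound on the RL objective that plain SFT optimizes.
    weights = [math.exp(lt - lr)
               for lt, lr in zip(logprobs_theta, logprobs_ref)]
    weighted_nll = [w * (-lt)
                    for w, lt in zip(weights, logprobs_theta)]
    return sum(weighted_nll) / len(weighted_nll)
```

When the current policy equals the reference policy, every weight is 1 and the two losses coincide; as the policies diverge, the weights re-emphasize sequences the current policy already favors.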

This research is important for two reasons:

  1. It clearly outlines the connections between SFT and RL. Over the last years we have found that many researchers intuitively understand that this connection exists, and it is mentioned in some prior work. However, we felt that a concise reference, and further exploration of this connection in relevant settings, was missing. We hope this paper provides both and can inspire more research in this direction.
  2. From an alignment perspective, the effects of fine-tuning Large Language Models (LLMs) on small, curated datasets alone may be easier to understand than training via RL, in which the training data is transient (as it is partly generated by the model we seek to improve) and cannot easily be inspected in full.

Learning more

We provide open-source implementations that explore this connection in two settings:

We also open-source the checkpoint (fine-tuned from Qwen-2.5) from our LLM reasoning experiments:

For a more comprehensive discussion of the work, please refer to the blog post:

You can also find the full paper on arXiv:

You can reach out via mail: