Reinforcement learning is structurally harder than supervised learning, and that difficulty shows up as brittleness in practice. In most RL settings, training data is generated by the current policy, so the distribution shifts as the policy changes, unlike the fixed-data setting of supervised learning. At the same time, rewards are often sparse, delayed, or imperfect proxies for the behavior we actually want. As a result, performance is highly sensitive to reward design, exploration, credit assignment, optimization, and the interaction between data collection and policy improvement.
These problems do not disappear in large-model post-training. They are joined by a separate class of systems challenges. RL for LLMs, VLMs, and other large models requires distributed training, orchestration, rollout engines, and reliable coordination between rollout and learning. Better systems expand what is feasible, for example through higher-throughput rollouts, broader sampling, and lower-variance gradients, but they do not by themselves fix the fragile algorithmic side of RL. Some of the hardest issues sit exactly at the boundary between the two layers. Policy staleness from asynchronous rollouts is a clear example: it is created by systems choices but has to be absorbed algorithmically, through truncated importance sampling, trust-region corrections, or similar stabilization mechanisms. That entanglement is exactly why the two layers should be separable.
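To make the staleness correction concrete, here is a minimal, framework-agnostic sketch of truncated importance sampling. This is not FeynRL's implementation; the function name and the cap value are illustrative. Per-token probability ratios between the current policy and the (stale) behavior policy that generated the rollout are capped, so a batch produced by an out-of-date policy cannot dominate the gradient estimate:

```python
import numpy as np

def truncated_is_weights(logp_current, logp_behavior, cap=2.0):
    """Per-token importance ratios pi_current / pi_behavior,
    truncated at `cap` to bound the variance introduced by
    stale (off-policy) rollouts from asynchronous generation."""
    ratios = np.exp(logp_current - logp_behavior)
    return np.minimum(ratios, cap)

# Tokens generated by a slightly stale policy: where the current
# policy assigns much higher probability, the raw ratio explodes;
# truncation caps that token's contribution to the update.
logp_behavior = np.array([-1.0, -2.0, -0.5])
logp_current  = np.array([-0.9, -0.2, -0.6])
w = truncated_is_weights(logp_current, logp_behavior, cap=2.0)
```

Trust-region corrections play a similar role by constraining how far the update can move from the behavior policy rather than clipping per-token ratios.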
This matters because tooling shapes what becomes easy to study. When even a modest algorithmic change requires touching rollout code, orchestration, distributed training, and data plumbing all at once, many ideas become too expensive to test seriously. In practice, that pushes experimentation toward methods that already fit the existing stack. That is not a criticism of current frameworks; it is a natural consequence of how costly large-scale RL systems are to build and maintain.
FeynRL Is Motivated by This Gap

FeynRL gives researchers the systems to run realistic large-model post-training, along with a structure where algorithms, rollouts, and orchestration can be iterated on independently. A new loss is a new loss, a new rollout engine is a new rollout engine, and the rest keeps working. FeynRL is also built with training stability in mind: methods include the practical stabilization details often omitted from other implementations, which are typically what separates an algorithm that works on paper from one that works at scale.
High-Level Overview
Separation of concerns is the central design principle. Algorithms, rollouts, and orchestration each live behind narrow interfaces, so you can change one without touching the others. At the same time, FeynRL is built for serious large-model post-training, with the systems needed to run real workloads at scale.
Concretely, FeynRL supports supervised fine-tuning, preference learning, and reinforcement learning in a shared structure. It includes methods such as SFT, DPO, PPO, GRPO, CISPO, and P3O, together with rollout engines such as vLLM, orchestration with Ray, distributed optimization with DeepSpeed, sync and async execution modes, and modular reward, data, and evaluation layers. Together these components let you run production-scale RL experiments out of the box, while keeping each layer independently modifiable.
Under the hood, FeynRL is organized along three axes that can be worked on independently: algorithms (the loss and update rules), rollouts (generation, rewards, and replay), and orchestration (distributed execution and weight synchronization between the training and inference engines). These layers communicate through narrow interfaces, so a new RL method is usually a new loss and update rule rather than a rewrite of the execution graph.
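As an illustration of what "a new method is usually a new loss and update rule" means in practice, here is a minimal, framework-agnostic sketch (not FeynRL's actual API; the function name is hypothetical) of the group-relative advantage at the core of GRPO, which replaces a learned value baseline with normalization across completions sampled for the same prompt:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: each completion's reward
    is normalized against the other completions sampled for the same
    prompt, removing the need for a learned value function."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled completions for one prompt, scored by the reward function:
# correct answers get positive advantage, incorrect ones negative.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Under a clean separation of layers, a routine like this is the bulk of what a new method contributes; rollouts and orchestration are unchanged.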
Sync vs Async
Where algorithms and systems interact most visibly is the training-rollout schedule, and FeynRL supports two modes:
- Sync mode: each epoch generates all rollouts, trains on them, synchronizes the updated weights to the rollout engines, and repeats.
- Async mode: generation and training run concurrently on separate GPU pools.
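The sync schedule can be sketched as a simple loop. The class and method names below (`RolloutEngine`, `Trainer`, `generate`, `load_weights`) are toy stand-ins, not FeynRL's interfaces; the point is the strict alternation between phases:

```python
class RolloutEngine:
    """Toy stand-in for an inference engine such as vLLM."""
    def __init__(self):
        self.weights_version = 0
    def generate(self):
        # A real engine samples completions; here we just tag the
        # batch with the policy version that produced it.
        return {"policy_version": self.weights_version}
    def load_weights(self, version):
        self.weights_version = version

class Trainer:
    """Toy stand-in for the distributed training engine."""
    def __init__(self):
        self.version = 0
    def train(self, batch):
        self.version += 1          # one policy update per epoch
    def state_dict(self):
        return self.version

def sync_training_loop(engine, trainer, num_epochs):
    """Synchronous schedule: generate all rollouts, train on them,
    sync updated weights back to the rollout engine, repeat.
    Rollouts are always on-policy, at the cost of each GPU pool
    idling while the other phase runs."""
    for _ in range(num_epochs):
        batch = engine.generate()
        trainer.train(batch)
        engine.load_weights(trainer.state_dict())

engine, trainer = RolloutEngine(), Trainer()
sync_training_loop(engine, trainer, num_epochs=3)
```

The on-policy guarantee is what this schedule buys; the idle time between phases is what async mode recovers.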
Async Mode
In async mode, the two GPU pools work in parallel and are coupled through bounded queues, replay, and periodic weight synchronization, which together make the throughput-staleness tradeoff explicit: the queue bound limits how far generation can run ahead of training, and the synchronization interval limits how stale the rollout policy can get.
Experiments
We have run extensive experiments with FeynRL across a range of models, datasets, and methods. This release surfaces the first set of those results; more will follow.
Where available, we also include the same-model framework comparison averages from the main repository. Note that we apply no reward shaping or normalization beyond what is discussed in the repository, whereas other frameworks do. Even so, this first release already achieves results comparable to other common frameworks.
Qwen2.5-1.5B-Instruct on GSM8K
Training data comes from GSM8K. Evaluation uses the shared mathematical reasoning benchmark suite reported across the release, spanning GSM8K, AIME, AMC, AMO, Brumo, HMMT, and Olympiad-style sets. The framework rows below are drawn from the comparison tables in the main repository; each framework's average uses its available reported benchmarks, and the reward snapshot reports the interpolated training reward after 1 hour.
| Run | Pass@1 | Pass@16 | Reward @ 1h |
|---|---|---|---|
| Baseline | 12.0% | 26.4% | - |
| FeynRL | 12.2% | 27.0% | 0.894 |
| AReaL | 12.2% | 28.2% | 0.654 |
| PipelineRL | 10.8% | 26.5% | 0.751 |
| TRL | 11.3% | 28.6% | 0.866 |
| veRL | 10.7% | 27.6% | 0.890 |
Qwen3-4B-Thinking-2507 on DeepScaler
Training data comes from the DeepScaler preview dataset. Evaluation again uses the same benchmark suite, with prompt formatting aligned to the model's released setup. As above, framework averages use each framework's available reported benchmarks from the comparison tables. The FeynRL row below reflects the latest async-engine evaluation artifacts under checkpoints/framework_comparisons/qwen3_4b_thinking_2507/wsp/FeynRL_async, which currently include 8 completed benchmark runs. For Qwen3, the FeynRL reward snapshot references the async/overlap comparison run and reports the interpolated training reward after 4 hours.
| Run | Pass@1 | Pass@16 | Reward @ 4h |
|---|---|---|---|
| Baseline | 12.2% | 19.7% | - |
| FeynRL | 27.0% | 40.2% | 0.565 |
| AReaL | 37.3% | 53.4% | 0.502 |
Detailed information and logs for these experiments, along with the scripts to re-run them, are available below.