AI’s First Language Was Text. Its Next Is Reward.

In the fall of 2025, reinforcement learning may be having its moment, its time in the sunshine. But while many of the recently emergent concepts in AI are genuinely new (agents, context engineering, inference-time compute, etc.), RL has been around for a while! Which got us curious: if reinforcement learning has been a popular topic in AI circles for years, what in particular about the present moment makes things different? Our short answer: RL wasn't as effective before because models weren't strong enough to earn rewards; but in a world where we've exhausted the data available for pretraining and agents are hungry to improve along the vector of real-world task completion, now might be the perfect time.

History of RL: RL Pre-LLMs

By nature, RL moves in lockstep with the outcomes an ML model is trying to achieve. Five years ago, the desired outcomes were less about completing tasks at a certain level of accuracy and more about moving the needle on specific areas of a business: RL in service of business goals.

Some examples:

  • Netflix wanted to maximize engagement across their member base, creating the conditions under which a title actually gets watched. This led them to use RL for artwork selection, automatically adjusting the artwork shown for a given title to a given member based on that member's watch history.
  • JP Morgan built an AI model, which they called LOXM, to execute equities trades at maximum speed and optimal prices. The team implemented immediate RL rewards tied to incremental spread cost, along with terminal rewards based on completion, order duration, and market impact signals.
  • Google DeepMind, in partnership with Trane Technologies (a building management system provider), applied reinforcement learning for HVAC chiller plant optimization in commercial buildings. This problem surface fit RL like a glove, with energy use minimization revealing itself as a natural reward function.

Challenges

Naturally, the returns to RL are constrained by the inherent power of the base model. RL at this time could move model performance up a few basis points, but did that justify the cost and complexity involved? The advent of LLMs would change the game, but not before another hurdle had to be cleared: the chicken-and-egg problem.

What Unlocks RL: Good Enough Pretrained Models

In their piece on artwork personalization, Netflix called out a very interesting property of reinforcement learning: the chicken-and-egg problem. Reinforcement learning is great if the model is capable of earning the reward, because it's via that reward signal that the model updates its weights. But a series of missed shots? That's just wasted compute. As Netflix wrote:

“One challenge of image personalization is that we can only select a single piece of artwork to represent each title in each place we present it... This means that image selection is a chicken-and-egg problem operating in a closed loop: if a member plays a title it can only come from the image that we decided to present to that member.”

This exact structure of problem recurred with LLMs as well: RL is great, but the return on finite compute applied to RL only becomes worthwhile once the LLM is effective enough to "earn the reward" some meaningful percentage of the time. In years prior, reward signals would have been too sparse to guide improvement. But years of pretraining have paid off; after exhausting virtually all the data online, LLMs have essentially become compressed versions of the internet. The models are RL-ready.

Not Only RL-Ready, But RL-Hungry!

The RL unlock is coming at the perfect time. The frontier of possibility isn't simply summarizing your documents or sales calls anymore; it's systems that act on that synthesized information, whether by resolving an employee's IT request in accordance with company policy or following up with a potential buyer to maximize the odds of a sale. Supervised fine-tuning wasn't built for this world! The range of possible input-output scenarios is so vast that learning by memorization isn't a viable route anymore. Agents have to learn by doing.

In some ways, inference-time compute was a step forward in this direction. What makes inference-time compute work is a verifier, often a process-based reward model (a general model that assesses the quality of another model's reasoning), that grades the candidate paths so that the best one can be selected. In both inference-time compute and RL, a reward model and simulation are used to improve a model's performance: in the former, it helps deliver the best possible answer to a user's query (local improvement); in the latter, it's a means to update the model's weights (global improvement).
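To make the "local improvement" side concrete, here is a minimal, hypothetical sketch of best-of-n selection: sample several candidate answers, score each with a stand-in verifier, and return the highest-scoring one. The `generate` and `score` callables are placeholders for a real LLM call and a real process/outcome reward model, not any particular provider's API, and nothing in this loop updates any weights.

```python
# Hedged sketch of best-of-n selection with a stand-in verifier.
# `generate` and `score` are illustrative placeholders, not real APIs.
import random
from typing import Callable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Local improvement: pick the best of n sampled answers; weights never change."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompt: str) -> str:
    return f"answer-{random.randint(0, 100)}"


def toy_score(prompt: str, candidate: str) -> float:
    return float(candidate.split("-")[1])  # pretend higher is better


if __name__ == "__main__":
    print(best_of_n("What is 2 + 2?", toy_generate, toy_score, n=4))
```

RL, by contrast, would feed those same scores back into training so that future samples are better from the very first draw.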

RL in Practice: What’s Working & What Could Be Better

However, talking about reward models and putting them into practice are vastly different things. Up until this point, RL has been great for domains with verifiable rewards, where there is a clear, objective, ground-truth signal on what a good output is and isn't. Some examples? Math, where the answer is either correct or incorrect, and coding, where the code either compiles or doesn't. Reinforcement learning in other areas isn't as clear-cut, since the lack of an objective answer means work must be done to define what constitutes a positive or negative output. The biggest constraint to RL's ubiquity is the lack of reward models and specialized environments for every use case.
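As a rough illustration (not any lab's actual reward code), a verifiable reward can be as simple as an exact-match check for math or a does-it-compile check for code. The function names and examples below are ours.

```python
# Minimal sketch of verifiable rewards: the checks are objective,
# so no learned reward model is needed. Names are illustrative.

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def code_reward(model_code: str) -> float:
    """Binary reward: 1.0 if the generated Python at least parses, else 0.0."""
    try:
        compile(model_code, "<model_output>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0


if __name__ == "__main__":
    print(math_reward("42", "42"))             # 1.0
    print(code_reward("def f(x): return x"))   # 1.0
    print(code_reward("def f(x) return x"))    # 0.0 (syntax error)
```

Outside of math and code, there's no equally crisp check to write, which is exactly where the reward-modeling and environment-building work begins.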

There are other challenges as well! For one, there's overfitting risk: the model gets really good with respect to a given reward model and task, but is fragile to small variations. Sample efficiency is another issue; these models tend to require a huge number of rollouts (in some cases hundreds of thousands) to garner enough signal to update the model weights. At the root of the sample-efficiency bottleneck is that, as it stands, the data from RL feeding the evolution of the model is a simple 0-or-1 signal: either the model earned the reward or it didn't. Increasingly, we're seeing methods that tap into more of the data exhaust from model behavior (traces, intermediate outputs, etc.) to reflect on patterns and update systems accordingly (GEPA is doing this with prompt optimization).
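To see why a pure 0-or-1 signal is so sample-hungry, here is a deliberately toy REINFORCE-style loop on a two-armed bandit where only one action ever pays out; every unrewarded rollout contributes nothing to the update. This is an illustrative stand-in under simplified assumptions, not how any production LLM RL stack is implemented.

```python
# Toy sketch: sparse binary rewards mean only "successful" rollouts
# carry learning signal, so many rollouts are needed even for a trivial task.
import math
import random

logits = [0.0, 0.0]   # policy parameters for two actions
LR = 0.1


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def rollout():
    probs = softmax(logits)
    action = 0 if random.random() < probs[0] else 1
    reward = 1.0 if action == 1 else 0.0   # binary: earned the reward or didn't
    return action, reward, probs


for step in range(5000):                   # many rollouts for a trivial problem
    action, reward, probs = rollout()
    # REINFORCE: reward * grad(log pi); zero-reward rollouts update nothing here.
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += LR * reward * grad

print(softmax(logits))  # probability mass should concentrate on action 1
```

Richer signals, such as reflecting on traces and intermediate outputs rather than just the terminal 0 or 1, are one way around this bottleneck, which is the direction methods like GEPA point toward for prompt optimization.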

The ultimate challenge with RL, however, is domain transfer. Simply put, simulation isn't equivalent to reality. As much as an environment tries to approximate it, the chaos of real-world situations is impossible to fully encode programmatically. This environment-to-production gap is at the heart of RL's imperfection: however well an agent performs in RL environments, there's no way to fully guarantee the same performance in production.

What does this mean?

Zooming out, RL is the next key vector for improving AI systems. Pretraining was the first scaling law, but it hit a plateau once we exhausted the sum total of the Internet's data. Inference-time compute is great, but it focuses on general reasoning at runtime rather than on the agent's inherent capabilities. As we bring agents into enterprises and count on them for larger and larger swaths of human work, the RL platforms that enable them to get better at that work, not via memorization but via iteration, will become more and more crucial.

Some pithy predictions to finish us off:

  1. Crafting RL environments and shaping reward models will grow to become a key expression of human expertise. It'll be a new medium by which that expertise is infused into an AI system.
  2. We're excited for RL paradigms that don't just focus on the LLM itself, but on the entire compound AI system (prompt, context, memory, multi-agent systems, etc.) that constitutes the application.
  3. What if product-market fit could be inscribed in a reward model? There may come a point where AI application companies maintain independent reward environments for each customer, and where the sales process involves collecting the data to inform that environment so that customer NPS can be optimized for. This would essentially take product-market fit (for a given customer) from a state that's serendipitously stumbled upon to a hill that's explicitly climbed, and hence make RL practitioners (the ones supplying the fuel for that climb) one of the most sought-after groups in enterprise software.

Ultimately, the demand is there and will only grow over time. Perfection is asymptotic; in a world where agentic systems can always get better, facilitating the mechanics for that intentional improvement is an area of incredible importance. It will be exciting to see the creative approaches teams bring to this problem space and what it unlocks for the AI application products we all interface with!