
15 August 2025

Experience × Evals: the build loop for self‑improving agents

by Shyamal Anadkat


author’s note: this essay summarizes and synthesizes ideas from the linked pieces and adds my own perspective to extend them.

core idea: the next capability jump comes from agents that learn in ongoing experience streams, governed by industrial‑grade evals. experience generates scalable data; evals convert it into reliable capability.

from human data to experiential data

human data is hitting diminishing returns. the shift is to streams of interaction with grounded rewards, where the signal stays informative as systems improve because it comes from continuous experience, grounded actions, and reward from the world (deepmind: era of experience).
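one way to picture a grounded experience stream is as a simple schema: each step pairs an action with a reward that comes from the environment, not a human label. a minimal sketch with hypothetical names throughout:

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceStep:
    observation: str                  # what the agent saw
    action: str                       # what it did (tool call, code edit, message)
    reward: float                     # grounded signal from the world (tests, outcomes)
    metadata: dict = field(default_factory=dict)

@dataclass
class ExperienceStream:
    steps: list[ExperienceStep] = field(default_factory=list)

    def append(self, step: ExperienceStep) -> None:
        self.steps.append(step)

    def total_reward(self) -> float:
        # cumulative grounded reward over the stream
        return sum(s.reward for s in self.steps)

stream = ExperienceStream()
stream.append(ExperienceStep("test failing", "apply patch", reward=1.0))
stream.append(ExperienceStep("lint warning", "fix style", reward=0.25))
print(stream.total_reward())  # 1.25
```

the key property: because reward is read off the environment, the stream keeps generating useful signal even after the agent surpasses human-labeled quality.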

the missing half: evals as an operating system

experience is the fuel; evals are the control system. treat evals as productized, continuous gates for deployment, ranking variants, and shaping training signals (mercor: era of evals).

evals do three jobs:

1) gate deployment: nothing ships until it clears the bar.
2) rank variants: pick the best candidate among many.
3) shape training signals: eval scores become rewards and curricula.

done right, evals form a contract that every agent, dataset, and training run must satisfy — continuously.
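a deployment gate, in its simplest form, is a dictionary of thresholds every variant must clear. a toy sketch; the metric names and thresholds are illustrative, not a real spec:

```python
# hypothetical eval gates: each metric must meet its threshold before deploy.
GATES = {
    "correctness": 0.90,   # pass rate on the task suite
    "safety": 0.99,        # rate of safe completions
    "efficiency": 0.75,    # fraction of runs within latency/cost budget
}

def passes_gates(scores: dict[str, float]) -> bool:
    """a variant deploys only if it meets every gate; missing metrics fail."""
    return all(scores.get(name, 0.0) >= threshold
               for name, threshold in GATES.items())

candidate = {"correctness": 0.93, "safety": 0.995, "efficiency": 0.80}
print(passes_gates(candidate))  # True
```

the contract framing falls out naturally: the gate dictionary is the contract, and it is checked continuously against every agent, dataset, and training run.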

the coming “gpt‑3 moment” for rl‑native systems

the claim: reinforcement learning (broadly defined) is nearing an inflection where models, environments, and compute align to produce step‑function jumps in general capability — similar to what happened when gpt‑3 made latent scale legible to the world (mechanize: the upcoming gpt‑3 moment for rl).

what unlocks it:

1) models capable enough to explore environments productively.
2) environments rich enough to supply grounded rewards.
3) compute budgets large enough to run rollouts at scale.

compute realities: rl is inference‑heavy

rl at frontier quality behaves like test‑time scaling: heavy inference for exploration, lighter training for consolidation. plan budgets for rollouts and environment compute, not just gradients (semianalysis).
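a back-of-envelope sketch of why rollout compute dominates; all numbers here are illustrative placeholders, not measurements:

```python
def rollout_compute(n_rollouts: int, tokens_per_rollout: int,
                    flops_per_token: float) -> float:
    """inference flops spent exploring the environment."""
    return n_rollouts * tokens_per_rollout * flops_per_token

def training_compute(n_selected_tokens: int, flops_per_token: float) -> float:
    """training flops, taking fwd+bwd as ~3x the forward cost per token."""
    return n_selected_tokens * 3 * flops_per_token

# explore broadly, then consolidate on a small selected fraction of traces
explore = rollout_compute(n_rollouts=10_000, tokens_per_rollout=4_000,
                          flops_per_token=2e9)
consolidate = training_compute(n_selected_tokens=400_000, flops_per_token=2e9)
print(explore / consolidate)  # ~33x: exploration dominates the budget
```

the exact ratio depends on selection rate and model size, but the shape holds: budget for rollouts and environment compute first, gradients second.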

the loop: experience → evals → selection → deployment

1) generate experience in rich envs (code, tools, sims).
2) evaluate with automatic judges for correctness, efficiency, safety, uncertainty.
3) select and train on high‑value traces; adapt curriculum.
4) deploy behind safeguards, monitor drift, repeat.
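the four steps above can be sketched as a single loop. Env, Judge, Trainer, and the agent dict are toy stand-ins, not real APIs:

```python
class Env:
    def rollout(self, agent):
        return agent["skill"]            # toy: trace quality tracks current skill
    def deploy(self, agent):
        agent["deployed"] = True         # ship behind safeguards

class Judge:
    batch_size, threshold = 8, 0.3
    def score(self, trace):
        return trace                     # automatic judge grades the trace
    def gate(self, agent):
        return agent["skill"] >= 0.6     # deployment gate

class Trainer:
    def update(self, agent, keep):
        # consolidate: skill grows with the volume of high-value traces
        agent["skill"] = min(1.0, agent["skill"] + 0.05 * len(keep))
        return agent

def build_loop(agent, env, judge, trainer, n_iters=3):
    for _ in range(n_iters):
        traces = [env.rollout(agent) for _ in range(judge.batch_size)]  # 1) generate
        scored = [(judge.score(t), t) for t in traces]                  # 2) evaluate
        keep = [t for s, t in scored if s >= judge.threshold]           # 3) select
        agent = trainer.update(agent, keep)
        if judge.gate(agent):                                          # 4) deploy
            env.deploy(agent)
    return agent

agent = build_loop({"skill": 0.4, "deployed": False}, Env(), Judge(), Trainer())
print(agent["deployed"], agent["skill"])  # True 1.0
```

the real versions of these components are large systems (environments, judge ensembles, training infra), but the control flow — and where the eval gate sits in it — is exactly this shape.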

design principles for experience × evals systems

a pragmatic blueprint (what to build now)

metrics that matter

pitfalls and how to avoid them

operating cadence: frequent updates, distillation, and data moats

what changes in teams

the opportunity

the “experience × evals” stack is a flywheel. experience gives you data that stays informative as you get better. evals give you gradient — direction and safety. together they form the build loop for self‑improving agents.

we’ve had our pre‑training era; we’ve had our rlhf era. the next era is streaming interaction optimized by industrial‑grade evals. build the environments, wire the judges, and let agents learn.

references

- deepmind: the era of experience
- mercor: the era of evals
- mechanize: the upcoming gpt-3 moment for rl
- semianalysis (on rl compute)

tags: Startups - Agents - RL - Evals - AGI
