
Teaching LLMs Through Experience

Most LLMs are frozen in time. Atropos lets them learn from feedback, improving iteratively on the tasks that matter to you. Here is how it works, from first principles.

December 2025 · 20 min read

The Problem: Static Models in a Dynamic World

Picture this: you fine-tune a model on 10,000 invoice documents. The model learns the patterns in those exact documents. But when your invoices change format, or you encounter a new vendor with strange layouts, the model has no way to adapt. It is frozen at the moment of training.

The Fundamental Limitation

Traditional fine-tuning is like teaching someone by showing them the answer key. They memorize the answers, but they do not learn how to think.

Fine-tuning gives you:
  - What you trained on
  - Nothing more
  - No improvement over time
  - Brittleness to new patterns

RL gives you:
  + Learning from feedback
  + Continuous improvement
  + Adaptation to new patterns
  + Optimization for your metric

Reinforcement learning flips the script. Instead of showing the model correct answers, you let it try things and tell it whether those things worked. The model learns not just what to output, but how to reason toward correct outputs.

The Core Insight:

You cannot teach creativity by showing examples. You teach it by letting the student try, fail, and try again with feedback.
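To make the try-and-get-feedback loop concrete, here is a self-contained toy. It is not Atropos code: the "model" is just a softmax over three canned answers, and a bare-bones policy-gradient update reinforces whichever answer earns reward.

  # Toy illustration of the try / score / reinforce loop. Not Atropos code:
  # the "policy" is one logit per canned answer, updated REINFORCE-style.
  import math
  import random

  candidates = ["42", "forty-two", "I don't know"]
  logits = [0.0, 0.0, 0.0]   # the policy: one logit per candidate answer
  correct = "42"
  lr = 0.5

  def sample(logits):
      weights = [math.exp(l) for l in logits]
      probs = [w / sum(weights) for w in weights]
      idx = random.choices(range(len(candidates)), weights=probs)[0]
      return idx, probs

  for _ in range(50):
      idx, probs = sample(logits)                          # 1. let the model try
      reward = 1.0 if candidates[idx] == correct else 0.0  # 2. score the attempt
      for i in range(len(logits)):                         # 3. reinforce what worked
          logits[i] += lr * reward * ((1.0 if i == idx else 0.0) - probs[i])

  print(candidates[logits.index(max(logits))])  # the rewarded answer wins out

After a few dozen attempts the rewarded answer dominates, and nothing in the loop ever saw an answer key. Only a score.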

The Key Insight: Decoupling Work from Learning

The clever part of Atropos is not the RL algorithm. It is the architecture. The framework separates three concerns that usually get tangled together:

Think of it like a factory:

  +-------------------+     +-------------------+
  |   ENVIRONMENT 1   |     |   ENVIRONMENT 2   |
  |   (OCR-VQA)       |     |   (Tool Calling)  |
  |                   |     |                   |
  |  Produces:        |     |  Produces:        |
  |  - prompts        |     |  - prompts        |
  |  - responses      |     |  - responses      |
  |  - scores         |     |  - scores         |
  +--------+----------+     +--------+----------+
           |                         |
           |    TRAJECTORIES         |
           +------------+------------+
                        |
                        v
           +------------------------+
           |    TRAJECTORY API      |
           |    (The Warehouse)     |
           |                        |
           |  Collects and queues   |
           |  all the scored work   |
           +------------+-----------+
                        |
                        v
           +------------------------+
           |       TRAINER          |
           |  (Quality Control)     |
           |                        |
           |  Pulls batches,        |
           |  updates the model     |
           +------------------------+
                        |
                        v
           +------------------------+
           |   INFERENCE SERVER     |
           |   (The Updated Model)  |
           |                        |
           |  Better at the task    |
           |  than before           |
           +------------------------+

Why does this separation matter? Consider the alternative: a monolithic system where data generation, scoring, and training all happen in the same process. The moment you want to add a new task, change your scoring logic, or scale up data collection, everything breaks.

Environments

Each environment is a self-contained unit. It knows how to generate prompts, collect model responses, and score them. Nothing else.

Run many in parallel.
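As a rough sketch of that contract, here is what a hypothetical invoice-extraction environment could look like. The class and method names are illustrative, not the actual Atropos base-class API; the point is that prompt generation, response collection, and scoring all live in one place.

  # Illustrative environment sketch (not the real Atropos base class):
  # it owns its prompts, its rollout logic, and its scoring function.
  from dataclasses import dataclass

  @dataclass
  class ScoredTrajectory:
      prompt: str
      response: str
      score: float

  class InvoiceFieldEnv:
      """Hypothetical OCR-style task: extract the total from an invoice."""

      def __init__(self, documents):
          self.documents = documents  # list of (invoice_text, expected_total) pairs

      def next_item(self):
          text, expected = self.documents.pop()
          return f"Extract the invoice total from:\n{text}", expected

      def score(self, response, expected):
          # The environment owns the metric; here, a simple containment check.
          return 1.0 if expected in response else 0.0

      def rollout(self, generate):
          """`generate` is any callable mapping prompt -> model response."""
          prompt, expected = self.next_item()
          response = generate(prompt)
          return ScoredTrajectory(prompt, response, self.score(response, expected))

A tool-calling environment would look completely different inside, but it hands off the same shape of output: scored trajectories.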

Trajectory API

A FastAPI server that accepts scored trajectories from any environment. It queues them, handles batching, knows nothing about tasks.

Pure infrastructure.
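A stripped-down sketch of that role; the endpoint names and payload shape here are illustrative, not the real Atropos routes, but the task-agnosticism is the point.

  # Minimal trajectory-API sketch: accept scored work from any environment,
  # queue it, hand out batches. Routes and payloads are illustrative.
  from collections import deque
  from fastapi import FastAPI
  from pydantic import BaseModel

  app = FastAPI()
  queue = deque()
  BATCH_SIZE = 32

  class Trajectory(BaseModel):
      prompt: str
      response: str
      score: float

  @app.post("/trajectories")
  def submit(traj: Trajectory):
      # Task-agnostic: the server never looks at what the trajectory is about.
      queue.append(traj.model_dump())
      return {"queued": len(queue)}

  @app.get("/batch")
  def get_batch():
      # The trainer polls this; a batch is released once enough work piles up.
      if len(queue) < BATCH_SIZE:
          return {"batch": None}
      return {"batch": [queue.popleft() for _ in range(BATCH_SIZE)]}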

Trainer

Pulls batches from the API, runs gradient updates. Does not know or care where the data came from.

Axolotl or Tinker.
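The trainer's side of the contract is just as thin. The loop below is a sketch with a placeholder update step and a hypothetical API address; in practice this role is filled by the Axolotl or Tinker integration rather than hand-written polling code.

  # Sketch of a trainer loop: pull a batch, run a gradient update, repeat.
  # `apply_gradient_update` stands in for a real RL loss + optimizer step.
  import time
  import requests

  API_URL = "http://localhost:8000"  # hypothetical trajectory-API address

  def apply_gradient_update(batch):
      mean = sum(t["score"] for t in batch) / len(batch)
      print(f"updating on {len(batch)} trajectories, mean score {mean:.2f}")

  while True:
      batch = requests.get(f"{API_URL}/batch").json()["batch"]
      if batch is None:
          time.sleep(5)  # nothing queued yet; just wait
          continue
      apply_gradient_update(batch)  # no need to know where the data came from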

This is not just elegant architecture. It is practical. You can run OCR training and tool-calling training simultaneously, feeding the same model. You can scale environments horizontally. You can swap trainers without touching environments.
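As a sketch of the "many environments, one model" point, assuming an OpenAI-compatible inference endpoint and the illustrative /trajectories route from above, two unrelated tasks can share everything downstream:

  # Hypothetical setup: two different tasks query the same inference server and
  # post their scored rollouts to the same trajectory API. URLs, prompts, and
  # expected answers are placeholders.
  import requests

  API_URL = "http://localhost:8000/trajectories"          # shared warehouse
  INFERENCE_URL = "http://localhost:9000/v1/completions"  # shared model endpoint

  def run_once(prompt, expected):
      resp = requests.post(INFERENCE_URL, json={"prompt": prompt, "max_tokens": 256})
      response = resp.json()["choices"][0]["text"]
      score = 1.0 if expected in response else 0.0
      requests.post(API_URL, json={"prompt": prompt, "response": response, "score": score})

  # OCR-style task and tool-calling task, feeding the same model:
  run_once("Extract the invoice total from: ...", "42.50")
  run_once("Call the weather tool for Paris.", '"name": "get_weather"')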
