Teaching LLMs Through Experience
Most LLMs are frozen in time. Atropos lets them learn from feedback, improving iteratively on the tasks that matter to you. Here is how it works, from first principles.
The Problem: Static Models in a Dynamic World
Picture this: you fine-tune a model on 10,000 invoice documents. The model learns the patterns in those exact documents. But when your invoices change format, or you encounter a new vendor with strange layouts, the model has no way to adapt. It is frozen at the moment of training.
The Fundamental Limitation
Traditional fine-tuning is like teaching someone by showing them the answer key. They memorize the answers, but they do not learn how to think.
With traditional fine-tuning, you get:
- What you trained on
- Nothing more
- No improvement over time
- Brittleness to new patterns

With reinforcement learning, you get:
- Learning from feedback
- Continuous improvement
- Adaptation to new patterns
- Optimization for your metric
Reinforcement learning flips the script. Instead of showing the model correct answers, you let it try things and tell it whether those things worked. The model learns not just what to output, but how to reason toward correct outputs.
The Core Insight:
You cannot teach creativity by showing examples. You teach it by letting the student try, fail, and try again with feedback.
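To make that feedback loop concrete, here is a minimal sketch of a single try-score-learn step. It is illustrative only: the names `generate`, `score_response`, and `update_policy` are placeholders, not the Atropos API.

```python
# A minimal sketch of the try / score / learn loop (illustrative only --
# the function names are hypothetical, not the Atropos API).

def rl_step(model, prompt, score_response, update_policy):
    # 1. Let the model try: sample a response instead of copying an answer key.
    response = model.generate(prompt)

    # 2. Tell it whether the attempt worked: a scalar reward from your metric.
    reward = score_response(prompt, response)

    # 3. Learn from the feedback: nudge the policy toward high-reward behavior.
    update_policy(model, prompt, response, reward)
    return reward
```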
The Key Insight: Decoupling Work from Learning
The clever part of Atropos is not the RL algorithm. It is the architecture. The framework separates three concerns that usually get tangled together:
Think of it like a factory:
+-------------------+ +-------------------+
| ENVIRONMENT 1 | | ENVIRONMENT 2 |
| (OCR-VQA) | | (Tool Calling) |
| | | |
| Produces: | | Produces: |
| - prompts | | - prompts |
| - responses | | - responses |
| - scores | | - scores |
+--------+----------+ +--------+----------+
| |
| TRAJECTORIES |
+------------+------------+
|
v
+------------------------+
| TRAJECTORY API |
| (The Warehouse) |
| |
| Collects and queues |
| all the scored work |
+------------+-----------+
|
v
+------------------------+
| TRAINER |
| (Quality Control) |
| |
| Pulls batches, |
| updates the model |
+------------------------+
|
v
+------------------------+
| INFERENCE SERVER |
| (The Updated Model) |
| |
| Better at the task |
| than before |
+------------------------+

Why does this separation matter? Consider the alternative: a monolithic system where data generation, scoring, and training all happen in the same process. The moment you want to add a new task, change your scoring logic, or scale up data collection, everything breaks.
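The thing that flows between the boxes above is the trajectory. A rough sketch of what that record contains is below; the field names are assumptions for illustration, not the exact Atropos schema.

```python
# Illustrative only: the field names are assumptions, not the exact Atropos
# trajectory schema. The point is what flows between the components: each
# environment ships prompts, responses, and scores -- nothing else.
from dataclasses import dataclass

@dataclass
class ScoredTrajectory:
    prompt: str     # what the environment asked
    response: str   # what the model answered
    score: float    # how well the answer worked, by the environment's metric
    env_name: str   # which environment produced it, e.g. "ocr-vqa"
```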
Environments
Each environment is a self-contained unit. It knows how to generate prompts, collect model responses, and score them. Nothing else.
Run many in parallel.
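A minimal environment sketch, using the invoice example from earlier. The class, the `document` fields, and the `/trajectories` endpoint are hypothetical, meant only to show the shape of the job: produce a prompt, collect a response, score it, hand it off.

```python
# Hypothetical environment sketch (not the Atropos base class): it only knows
# how to generate prompts, collect model responses, and score them.
import requests

class InvoiceOCREnv:
    def __init__(self, api_url: str, model_client):
        self.api_url = api_url      # Trajectory API endpoint (assumed URL shape)
        self.model = model_client   # inference client for the current policy

    def step(self, document) -> float:
        prompt = f"Extract the total amount from this invoice:\n{document.text}"
        response = self.model.generate(prompt)
        score = 1.0 if response.strip() == document.expected_total else 0.0
        # Hand the scored work to the Trajectory API; the environment's job ends here.
        requests.post(
            f"{self.api_url}/trajectories",
            json={"prompt": prompt, "response": response,
                  "score": score, "env_name": "invoice-ocr"},
        )
        return score
```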
Trajectory API
A FastAPI server that accepts scored trajectories from any environment. It queues them, handles batching, knows nothing about tasks.
Pure infrastructure.
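A stripped-down sketch of the idea behind such a server. This is not the real Atropos implementation; the endpoint names and payload shape are assumptions, but it shows why the component can stay task-agnostic.

```python
# Sketch of a task-agnostic trajectory queue (endpoint names and payload
# shape are assumptions, not the Atropos API).
from collections import deque
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
queue: deque = deque()

class Trajectory(BaseModel):
    prompt: str
    response: str
    score: float
    env_name: str

@app.post("/trajectories")
def submit(traj: Trajectory):
    # Pure infrastructure: store the scored work, no task-specific logic.
    queue.append(traj.dict())
    return {"queued": len(queue)}

@app.get("/batch")
def get_batch(size: int = 32):
    # The trainer pulls whatever is ready, regardless of which environment sent it.
    batch = [queue.popleft() for _ in range(min(size, len(queue)))]
    return {"batch": batch}
```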
Trainer
Pulls batches from the API, runs gradient updates. Does not know or care where the data came from.
Axolotl or Tinker.
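The trainer side reduces to a pull-and-update loop. The sketch below is an assumption about the shape of that loop, not the Axolotl or Tinker API; `run_update_step` is a hypothetical callback standing in for one gradient update.

```python
# Hedged sketch of the trainer loop: fetch a batch from the Trajectory API,
# run a gradient update. The callback and endpoint are assumptions.
import time
import requests

def training_loop(api_url: str, run_update_step, batch_size: int = 32):
    while True:
        resp = requests.get(f"{api_url}/batch", params={"size": batch_size})
        batch = resp.json()["batch"]
        if not batch:
            time.sleep(5)        # nothing queued yet; environments are still working
            continue
        # The trainer never inspects env_name -- data provenance is irrelevant here.
        run_update_step(batch)   # hypothetical callback: one gradient update
```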
This is not just elegant architecture. It is practical. You can run OCR training and tool-calling training simultaneously, feeding the same model. You can scale environments horizontally. You can swap trainers without touching environments.
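As a rough illustration of that horizontal scaling: running more environments is just starting more producer processes that post to the same Trajectory API. The helper below is hypothetical and assumes each environment exposes a `run()` method like the sketches above.

```python
# Hypothetical helper: run several environments side by side, all feeding the
# same Trajectory API. Not Atropos API; it only shows that scaling out means
# starting more producer processes.
from multiprocessing import Process

def run_environments_in_parallel(environments):
    # Each environment is assumed to expose run(), which loops over its own
    # data, scores responses, and posts trajectories (as sketched earlier).
    processes = [Process(target=env.run) for env in environments]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```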