CALVIN ABCD to D
Robotics · benchmark dataset · EN

CALVIN (Composing Actions from Language and Vision).

CALVIN is an open-source simulated benchmark for long-horizon, language-conditioned robot manipulation. It provides a multi-environment manipulation suite for evaluating agents that must follow natural language instructions and compose many short skills into longer instruction chains.

The benchmark comprises four distinct simulated manipulation environments (denoted A, B, C, D), 34 base tasks, and about 1,000 language instructions (including instruction chains). Tasks are designed to require long-horizon composition, with instruction chains of up to five subtasks. Observations support flexible sensor suites (e.g., static RGB, gripper-mounted RGB, and proprioceptive/gripper state), and the action interface used in published baselines includes delta-action / continuous-control variants.

Standard evaluation in the paper uses 500 rollouts and reports metrics such as the average number of consecutively completed subtasks (maximum value 5), along with multi-task and long-horizon task-completion variants (MTLC / LH-MTLC). The benchmark and code are provided by the authors (GitHub and project website); the accompanying paper is available on arXiv and was published in IEEE Robotics and Automation Letters.
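The chain metrics above can be sketched in a few lines. This is a minimal illustration, not the official CALVIN evaluation code: it assumes each rollout is summarized by the number of subtasks (0 to 5) completed in order before the first failure, and the function names are our own.

```python
from statistics import mean

def avg_successful_length(completed_counts, chain_len=5):
    """Average number of consecutively completed subtasks per rollout.

    completed_counts: one entry per rollout, the number of subtasks
    (0..chain_len) finished in order before the first failure.
    """
    assert all(0 <= c <= chain_len for c in completed_counts)
    return mean(completed_counts)

def chain_success_rates(completed_counts, chain_len=5):
    """Fraction of rollouts completing at least k subtasks, for k = 1..chain_len."""
    n = len(completed_counts)
    return [sum(c >= k for c in completed_counts) / n
            for k in range(1, chain_len + 1)]

# Toy example with four rollouts (real evaluation uses 500).
counts = [5, 3, 0, 2]
print(avg_successful_length(counts))   # 2.5
print(chain_success_rates(counts))     # [0.75, 0.75, 0.5, 0.25, 0.25]
```

The per-length success rates are monotonically non-increasing by construction, since completing k subtasks requires completing the first k − 1.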

Paper · Submit a result
§ 01 · Leaderboard

Best published scores.

No results indexed yet — be the first to submit a score.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with a frozen commit and seed
  • 03 · A declared evaluation environment (Python version, dependencies)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies