VERA policy provider¶
VERA (Video-to-Embodied Robot Action, MIT/CSAIL) is a two-stage, closed-loop video-to-action policy:
- Video planner (DFoT / WAN) — a diffusion model that "dreams" the next frames from the current observation (+ optional text). Embodiment-agnostic.
- Jacobian IDM — translates the dream into robot actions via a frozen visual backbone (VGGT/DINO) + a flow→action head. Embodiment-specific, data-efficient, swappable without retraining the planner.
One video planner, many IDMs — the route to zero-shot, cross-embodiment control.
The strands-robots vera provider is a thin, import-light WebSocket client +
managed GPU server, mirroring the cosmos3 service pattern. The host venv
never installs VERA's heavy/conflicting stack (PyTorch 2.6 / CUDA, VGGT, DFoT) —
that lives in the strands-vera-server container.
host venv (numpy>=2) vera-server container (torch 2.6 / CUDA)
VeraPolicy ─ VeraWebsocketClient ─ws─▶ vera.server.start_vera_server
(no vera install) DFoT/WAN planner + Jacobian IDM + /ckpts
Quick start¶
from strands_robots.policies import create_policy
# Attach to a running server (see "Server" below) ...
policy = create_policy("vera", embodiment="mimicgen", auto_launch_server=False)
chunk = policy.get_actions_sync(observation, "stack the red block on the green block")
# ... or let the provider manage the container for you:
policy = create_policy(
"vera", embodiment="mimicgen",
server_mode="docker", ckpt_root="/abs/path/vera-ckpts",
)
The MimicGen → Panda path drives a real 7-DoF arm: the WAN planner + Jacobian
IDM emit 6-DoF end-effector deltas, and the provider's IK bridge solves them onto
the Panda's joints (auto-discovering the end-effector frame — no manual wiring).
See examples/vera_mimicgen_panda/.
Embodiments¶
From VERA's adapter_factory._EMBODIMENTS:
| Embodiment | action_space | dims | views | control | ports (policy/viz) | checkpoints |
|---|---|---|---|---|---|---|
| pusht | velocity (planar) |
2, no gripper | image |
10 Hz | 8820 / 8821 | experimental — IDM du path not wired end-to-end upstream |
| mimicgen | eef_delta |
7 (6-DoF + grip) | agentview_image, robot0_eye_in_hand_image |
20 Hz | 8800 / 8801 | ✅ Wave-1 (+WAN base) |
| allegro | joint_position |
16 | 12 cameras | 15 Hz | 8802 / 8803 | 🔜 Wave-2 (code only) |
| droid | cartesian_delta |
7 | varied_1,varied_2,hand |
15 Hz | 8804 / 8805 | 🔜 Wave-2 (code only) |
Today, end-to-end: mimicgen (WAN planner; needs the frozen WAN base + a
motion tracker) is the working, faithful embodiment — it exercises the whole
eef-delta → IK path onto a real arm. pusht's server runs, but its IDM du
action path is not wired end-to-end upstream (VERA's own
configurations/dataset/pusht.yaml documents this gap), so it validates the
provider → server → action plumbing rather than producing a solving rollout —
treat it as experimental. allegro/droid are code-present but
checkpoint-absent upstream (Wave 2).
The "generalist" claim, accurately¶
VERA's architecture is cross-embodiment: one embodiment-agnostic video
planner + one cheap IDM per robot (frozen backbone, head trained from
self-play). It is not a single checkpoint that drives every robot today — a
robot is drivable iff (its action_space matches a served embodiment) and
(a checkpoint exists) and (IK/validation is done for that arm). For
eef_delta/cartesian_delta arms (mimicgen/droid), the provider includes an
IK bridge that maps the 6-DoF end-effector deltas to joint targets and
auto-discovers the end-effector frame from the compiled MuJoCo model — so any
kinematically-compatible 6/7-DoF arm can be driven once a matching IDM exists.
Checkpoints¶
hf download sizhe-lester-li/VERA --local-dir ./vera-ckpts # ~42 GB full; ~4 GB is Wave-1
export VERA_CKPT_ROOT=$PWD/vera-ckpts
MimicGen additionally needs the frozen WAN 2.1 base (text-enc + VAE + CLIP). Its IDM uses the AllTracker point tracker, which the container bundles (cloned at build time; weights auto-download). The WAN base:
The provider never auto-downloads — point it at pre-downloaded roots.
Server¶
The server holds the GPU and the two-stage model. Run it as a container:
docker build -f strands_robots/policies/vera/docker/Dockerfile -t strands-vera-server:latest .
# MimicGen (serves ws on :8800; needs the WAN base + offline resolver)
docker run --rm --gpus all --ipc=host -p 8800:8800 \
-v "$VERA_CKPT_ROOT":/ckpts:ro -v "$PWD/Wan2.1-T2V-1.3B":/wan:ro \
-e VERA_EMBODIMENT=mimicgen -e USE_OFFLINE_RESOLVE=1 \
strands-vera-server:latest
The container entrypoint maps the single mounted /ckpts root onto VERA's
per-embodiment checkpoint env vars; USE_OFFLINE_RESOLVE=1 resolves MimicGen's
wandb-run-id IDM to the locally-mounted checkpoint (via provenance.json) so the
server boots with no network. See
policies/vera/docker/.
server_mode="docker" lets the provider build/run/stop the container itself;
server_mode="subprocess" launches a local python -m vera.server... when VERA
is installed in the same env.
Configuration¶
VeraConfig maps 1:1 to VERA's server flags and is env-overridable (deploy/CI
wins over code defaults):
| kwarg | env var | maps to |
|---|---|---|
embodiment |
— | --embodiment |
server_port / vis_port |
VERA_SERVER_PORT / VERA_VIS_PORT |
--port / --vis-port |
algo_config |
VERA_ALGO_CONFIG |
--algo-config (swap to the omni planner) |
dynamics_run_id |
VERA_DYNAMICS_RUN_ID |
--dynamics-run-id |
text_prompt |
VERA_TEXT_PROMPT |
--text |
ckpt_root |
VERA_CKPT_ROOT |
container /ckpts mount |
sample_steps |
VERA_SAMPLE_STEPS |
--sample-steps |
tracker_backend |
VERA_TRACKER_BACKEND |
IDM tracker |
motion_plan_scale |
VERA_MOTION_PLAN_SCALE |
live configure |
server_mode |
VERA_SERVER_MODE |
subprocess | docker |
Wire protocol¶
The provider keeps a rolling context window of the last context_frames
camera frames (width-concatenated across views) and calls the server's chunked
infer when its local action queue drains — the same RemotePolicy contract
VERA's own eval harness uses. The server returns {"action": [H, D]}; the
provider maps each D-vector to robot actuator names (gripper binarized per the
server's gripper_dim_index/gripper_is_raw), coercing to python floats per the
Policy ABC.
QA the rollouts with a Cosmos 3 reasoner (closed loop)¶
VERA generates video-grounded actions; NVIDIA Cosmos 3 reads video and reasons in text — so it makes a natural automated QA critic for rollouts. Serve the reasoner, then have it grade a rollout MP4:
uv pip install "strands-cosmos[cosmos3]"
# serve Cosmos3-Nano on :8000 (vLLM + vllm-cosmos3); see strands-cosmos `c3-serve-reason`
python examples/vera_mimicgen_panda/critique_with_cosmos3.py examples/vera_mimicgen_panda/artifacts/mimicgen_panda.mp4
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel
agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
print(agent("Grade this robot rollout — is the motion smooth and purposeful, "
"any bugs? <video>/tmp/vera-critique/mimicgen_panda.mp4</video>"))
This closed generate → reason → fix loop surfaced (and fixed) real issues in
the MimicGen→Panda example: an initial jittery critique drove the
ik_smoothing EMA knob, and a "the arm is static" critique root-caused a
near-singular default start pose with the motion off-camera — fixed with a
tabletop-ready seed pose + camera framing. The reasoner's verdict moved
NEEDS-WORK → PASS.
Testing¶
# offline unit tests (no GPU, no vera install)
hatch run test tests/policies/vera/
# gated live integration (needs a running server)
VERA_LIVE=1 hatch run test-integ tests_integ/policies/vera/
Install¶
pip install 'strands-robots[vera]' # provider + the VERA git dep (subprocess mode)
pip install 'strands-robots[vera-sim]' # + MimicGen sim deps for the example (also pulls the experimental PushT env)
For the docker path the host needs only websockets + msgpack (the client
transport) — no vera, no torch.