Update: NVIDIA Cosmos Transfer2.5

Val Kamenski

Right after I published my article about Cosmos Transfer1, NVIDIA released Cosmos Transfer2.5.

Below is a concise, step-by-step guide on how to run it on RunPod, along with examples from my early experiments.


Disclaimer: I’m not affiliated with RunPod in any way, and this isn’t a paid promotion. Feel free to try this Docker image with any other similar service that supports NVIDIA GPUs.

Step 1: Choose a GPU

Sign in to RunPod and click this link — you should see a list of available GPUs.

For some reason, the RTX PRO 6000 no longer works; the script throws RuntimeError: FlashAttention only supports Ampere GPUs or newer.

I recommend selecting the H100 SXM. After testing several GPUs on the same video, I found that it runs about 25–30% faster than the H100 PCIe while costing only ~12% more per hour.


Step 2: Run the Pod

Click the Deploy On Demand button to start your Pod.
It takes about 10-15 minutes to download the Docker image and initialize everything.


Step 3: First Run

Once your Pod is ready, log in via SSH.
You can find the official connection guide 👉 here.
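
For reference, the SSH command looks roughly like this; the IP address, port, and key path below are placeholders, so use the values from your Pod’s Connect panel:

$ ssh root@<pod-ip> -p <port> -i ~/.ssh/id_ed25519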

After connecting, go to the /workspace directory and install the remaining dependencies:

$ uv sync --locked || true

Step 4: Hugging Face

Before you run inference, make sure you’re logged in to Hugging Face:

$ hf auth login --token [your token]

Not sure where to find your token? 👉 Here’s a guide.

You no longer need to pre-download any models. They will be downloaded automatically before inference. As a result, the first run may take a bit longer.
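
Alternatively, you can export the token as an environment variable so the automatic downloads pick it up; HF_TOKEN is the standard variable read by huggingface_hub (the value below is just a placeholder):

$ export HF_TOKEN=hf_xxxxxxxxxxxxxxxx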


Step 5: Run Inference

Upload your training episode(s) to the container using SCP (example below), create your params file, and run inference with:

$ python examples/inference.py --params_file robot_spec.json
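
A typical SCP upload from your local machine looks like this; the port, key path, and file paths are placeholders, so substitute your Pod’s connection details and your own episode files:

$ scp -P <port> -i ~/.ssh/id_ed25519 observation.images.front_000002.mp4 root@<pod-ip>:/workspace/input/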

Here’s an example params_file from my setup:

{
    "prompt_path": "input/robot_prompt.json",
    "output_dir": "outputs/robot_depth",
    "video_path": "input/observation.images.front_000002.mp4",
    "guidance": 5,
    "depth": {
        "control_weight": 0.7,
        "control_path": "input/observation.images.front_000002.depth.mp4"
    },
    "edge": {
        "control_weight": 0.3,
        "control_path": "input/observation.images.front_000002.edge.mp4"
    }
}
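
Since a failed run wastes expensive GPU time, it’s worth checking that the spec file is valid JSON before launching; Python’s built-in json.tool module is enough for that:

$ python -m json.tool robot_spec.json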

Example prompt_path file:

{
    "prompt": "The video is captured from a fixed top-down perspective with an overhead camera. On a white table, a single robotic arm (SO-101) made of orange PLA plastic is visible. Several green plastic blocks, each a few centimeters in width and height, are placed on the table. The SO-101 arm identifies and grasps individual green blocks with its gripper, lifts them smoothly, and stacks them one on top of another. The tower of green blocks grows progressively while the camera view remains stable overhead, clearly showing the precise robot movements.",
    "negative_prompt": "The video captures a game playing, with bad crappy graphics and cartoonish frames. It represents a recording of old outdated games. The lighting looks very fake. The textures are very raw and basic. The geometries are very primitive. The images are very pixelated and of poor CG quality. There are many subtitles in the footage. Overall, the video is unrealistic at all."
}

You can find a bit more detail on 👉 GitHub. At the time of writing, NVIDIA still hasn’t released official documentation on their website.


My Experiments

I can’t really call my first experiments a success. Most of the generated videos turned out a bit strange: some were almost okay, others completely missed the mark, and a few were blocked by the built-in guardrail model.

Task: Change Block Colors

If you watch up to the middle, you’ll notice some anomalies, such as blocks ending up in strange positions, and toward the end the work surface changes.

Task: Change Work Surface

In this video, you’ll notice an extra block, and around the middle, the perspective shifts and strange objects start to appear.

Generating a single video on the H100 SXM took roughly 50 minutes.
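
With runs this long, it helps to launch inference with nohup (or inside tmux) so a dropped SSH connection doesn’t kill the job; this is generic shell practice rather than anything specific to Cosmos:

$ nohup python examples/inference.py --params_file robot_spec.json > run.log 2>&1 &
$ tail -f run.log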


Conclusion

It’s clear that augmenting robot training datasets isn’t as simple as it seems — and I’m probably still missing a few key steps along the way. It definitely requires plenty of trial and error, and I suspect the model needs some fine-tuning to better align with each robot’s embodiment and setup.

While digging through the docs, I also noticed that NVIDIA mentions two Cosmos Transfer versions, 2B and 12B. It looks like the larger 12B variant might be on its way to release.

Still, it’s exciting to see even partial results.

If you’ve experimented with Cosmos Transfer or similar models, I’d love to hear your experience. Drop me a comment or message — sharing insights helps the whole community move forward faster.