NVIDIA Cosmos Transfer1

In my recent video, I showed how even a small change in the environment — like switching the background — can confuse VLA models such as SmolVLA, NVIDIA Isaac GR00T, and others.
The fix sounds simple: just feed the model bigger datasets with lots of variations. The problem is, you can’t realistically show a robot every background or environment it might encounter in the real world.
That’s where NVIDIA Cosmos Transfer comes in. It grabs your original training episode and generates a bunch of new versions with different backgrounds, fresh environments, and small changes here and there. Suddenly, your dataset of 50–60 episodes doubles in size without you recording anything new.
However, installing and running Cosmos Transfer1 isn’t exactly “plug and play.” To start, you’ll need some serious hardware with around 80 GB of VRAM and 400+ GB of disk space. And the official instructions aren’t the most beginner-friendly.
But here’s the good news: I went through the pain of building the Docker image, troubleshooting the setup, and creating a RunPod template that makes it much easier to get started. In this post, I’ll walk you through how to run your first experiments without tearing your hair out.
Let’s dive in 👇
Disclaimer: I’m not affiliated with RunPod in any way, and this isn’t a paid promotion. Feel free to try this Docker image with any similar service that supports NVIDIA GPUs.
Step 1: Choose a GPU
Sign in to RunPod and click this link — you should see a list of available GPUs to choose from. I recommend selecting the RTX PRO 6000 (96 GB).
Cosmos Transfer1 itself isn’t huge (~7B parameters), but you’re running the entire pipeline, which includes guardrail models that verify prompts and generated videos for safety, the Google T5 encoder, and other small models that also consume VRAM.
T5 is the heaviest part of the pipeline and uses more than 40 GB of VRAM. I’m hoping they swap it out for something lighter soon so I can run it locally on my RTX 5090.
Click on your selected video card.
Step 2: Add your Hugging Face Token
Scroll down to Configure Deployment -> Pod Template and click the Edit button.
In the dialog that opens, go to the Environment Variables section and add your HF_TOKEN. This logs you in to Hugging Face automatically on startup.
If you’re not sure how to get your Hugging Face access token, here’s 👉 the guide.
Step 3: Run the Pod
Click the Deploy On Demand button to start your Pod, and be patient: it takes some time (15–20 minutes) to download the Docker image and get your Pod ready.
Step 4: Run the Test Environment
Once your Pod is ready, log in via SSH. You can find the official guide on how to connect to your Pod using SSH 👉 here.
Once you're connected to the container, make sure you're in the /workspace directory and run the following command:
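If the container follows the layout of the official cosmos-transfer1 repository (an assumption on my part — check the repo's README if your paths differ), the environment check looks roughly like this:

```
cd /workspace/cosmos-transfer1   # repo location assumed from the Docker image layout

# Verify that CUDA, PyTorch, and the other dependencies are wired up correctly
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/test_environment.py
```

If every check passes, you're ready for the next step.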
Step 5: Download the Models
If you’ve seen all green lights in the previous step, you’re good to download the models.
The download is around 350 GB, so it may take some time.
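Assuming the checkpoint conventions from the official repo (verify the script name and flags against the README for your version), the download step is along these lines:

```
cd /workspace/cosmos-transfer1

# Pull the Cosmos Transfer1 checkpoints (~350 GB) from Hugging Face into
# ./checkpoints — make sure HF_TOKEN is set and you've accepted the model
# licenses on Hugging Face first, or the download will fail with a 403
PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
```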
Step 6: Run Inference
Upload your training episode(s) to the server using SCP.
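For example (the port, key path, host, and filename below are placeholders — RunPod shows the exact SSH host and port for your Pod):

```
# Copy a training episode video from your machine to the Pod's /workspace
scp -P 12345 -i ~/.ssh/id_ed25519 episode_000.mp4 root@<pod-ip>:/workspace/
```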
Then create the controlnet_specs file and run inference with the following command:
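Based on the official README (treat this as a sketch — the output folder and spec filename are placeholders, and flags may change between versions), single-GPU inference looks roughly like this:

```
cd /workspace/cosmos-transfer1
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR=checkpoints

# Runs the full pipeline (guardrails + T5 encoder + diffusion) on one GPU;
# the offload flag trades speed for fitting everything into VRAM
PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/my_episode_augmented \
    --controlnet_specs my_controlnet_specs.json \
    --offload_text_encoder_model
```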
Here’s a sample controlnet_specs file from my experiments:
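A minimal single-control spec in the repo's format looks like this — the prompt, input path, and weight below are illustrative placeholders, not my exact values:

```json
{
    "prompt": "A robot arm stacking colored cubes on a dark wooden tabletop",
    "input_video_path": "/workspace/episode_000.mp4",
    "vis": {
        "control_weight": 0.5
    }
}
```

The top-level keys other than prompt and input_video_path select which control nets to apply, so swapping "vis" for "edge", "seg", or "depth" switches the conditioning signal.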
See more examples for using other control nets (vis, edge, seg, depth) 👉 GitHub Examples and 👉 Official Documentation
For me, generating a single video took about one hour, which feels quite long. The good news is that inference can likely be sped up by spreading it across multiple GPUs.
My Experiments
Below is one of my attempts to change the color of the work surface and the blocks. The surface was replaced successfully, but there are still some anomalies with the cubes.
I tried regenerating this video multiple times, and the results were always about the same. A few times, the generation was even blocked by the guardrail model. It’s pretty frustrating to wait an hour just to get a message saying the output was “harmful” content.
I believe the next version will be released soon, so I’m hopeful that generation will become more stable.
Estimating the Cost
At around $2/hour on an RTX PRO 6000, that’s $2 per video. If your robot has two cameras, the cost doubles to about $4 per training episode. So, if you plan to generate 100 new episodes, expect the total to be roughly $400.
It’s hard to fathom how expensive it must be to create datasets for advanced humanoid or other robots.
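The arithmetic above can be sketched as a quick back-of-the-envelope calculator — the rates are the ones from this post, not official RunPod pricing:

```shell
# Back-of-the-envelope cost estimate using the numbers from this post
# ($2/hour on an RTX PRO 6000, ~1 hour of generation per video).
hourly_rate=2       # USD per GPU-hour (my observed RunPod rate)
hours_per_video=1   # observed generation time per video
episodes=100        # how many augmented episodes you want
cameras=2           # videos per episode (one per camera)

total=$(( episodes * cameras * hours_per_video * hourly_rate ))
echo "Estimated cost: \$${total}"   # prints: Estimated cost: $400
```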
Conclusion
NVIDIA Cosmos Transfer is a powerful idea — it brings generative data augmentation to robotics in a way that could really speed up training and make robotics models more robust. Still, it’s early days: the setup is tricky, and the hardware demands are high.
If you’ve experimented with Cosmos Transfer or similar models, I’d love to hear your experience. Drop me a comment or message — sharing insights helps the whole community move forward faster.