NVIDIA Cosmos Transfer1

In my recent video, I showed how even a small change in the environment — like switching the background — can confuse VLA models such as SmolVLA, NVIDIA Isaac GR00T, and others.
The fix sounds simple: just feed the model bigger datasets with lots of variations. The problem is, you can’t realistically show a robot every background or environment it might encounter in the real world.
That’s where NVIDIA Cosmos Transfer comes in. It grabs your original training episode and generates a bunch of new versions with different backgrounds, fresh environments, and small changes here and there. Suddenly, your dataset of 50–60 episodes doubles in size without you recording anything new.
However, installing and running Cosmos Transfer1 isn’t exactly “plug and play.” To start, you’ll need some serious hardware with around 80 GB of VRAM and 400+ GB of disk space. And the official instructions aren’t the most beginner-friendly.
But here’s the good news: I went through the pain of building the Docker image, troubleshooting the setup, and creating a RunPod template that makes it much easier to get started. In this post, I’ll walk you through how to run your first experiments without tearing your hair out.
Let’s dive in 👇
Disclaimer: I’m not affiliated with RunPod in any way, and this isn’t a paid promotion. Feel free to try this Docker image with any similar service that supports NVIDIA GPUs.
Step 1: Choose a GPU
Sign in to RunPod and click this link — you should see a list of available GPUs to choose from. I recommend selecting the RTX PRO 6000 (96 GB).
Cosmos Transfer1 itself isn’t huge (~7B parameters), but you’re running the entire pipeline, which includes guardrail models that verify prompts and generated videos for safety, the Google T5 encoder, and other small models that also consume VRAM.
T5 is the heaviest part of the pipeline and uses more than 40 GB of VRAM. I’m hoping they swap it out for something lighter soon so I can run it locally on my RTX 5090.
Click on your selected video card.
Step 2: Add your Hugging Face Token
Scroll down to Configure Deployment -> Pod Template and click the Edit button.
In the dialog that opens, go to the Environment Variables section and add your HF_TOKEN. This logs you in to Hugging Face automatically on startup.
If you’re not sure how to get your Hugging Face access token, here’s 👉 the guide.
Step 3: Run the Pod
Click the Deploy On Demand button to start your Pod, and be patient: it takes some time (15–20 minutes) to download the Docker image and get your Pod ready.
Step 4: Run the Test Environment
Once your Pod is ready, log in via SSH. You can find the official guide on how to connect to your Pod using SSH 👉 here.
Once you're connected to the container, make sure you're in the /workspace directory and run the following command:
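If the container follows the layout of the official cosmos-transfer1 repository (an assumption on my part — check the repo's README if your paths differ), the environment check looks roughly like this:

```
cd /workspace/cosmos-transfer1   # repo location assumed from the Docker image layout

# Verify that CUDA, PyTorch, and the other dependencies are wired up correctly
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/test_environment.py
```

If every check passes, you're ready for the next step.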
Step 5: Download the Models
If you’ve seen all green lights in the previous step, you’re good to download the models.
The download is around 350 GB, so it may take some time.
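Assuming the checkpoint conventions from the official repo (verify the script name and flags against the README for your version), the download step is along these lines:

```
cd /workspace/cosmos-transfer1

# Pull the Cosmos Transfer1 checkpoints (~350 GB) from Hugging Face into
# ./checkpoints — make sure HF_TOKEN is set and you've accepted the model
# licenses on Hugging Face first, or the download will fail with a 403
PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
```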
Step 6: Run Inference
Upload your training episode(s) to the server using SCP.
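For example (the port, key path, host, and filename below are placeholders — RunPod shows the exact SSH host and port for your Pod):

```
# Copy a training episode video from your machine to the Pod's /workspace
scp -P 12345 -i ~/.ssh/id_ed25519 episode_000.mp4 root@<pod-ip>:/workspace/
```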
Then create the controlnet_specs file and run inference with the following command:
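Based on the official README (treat this as a sketch — the output folder and spec filename are placeholders, and flags may change between versions), single-GPU inference looks roughly like this:

```
cd /workspace/cosmos-transfer1
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR=checkpoints

# Runs the full pipeline (guardrails + T5 encoder + diffusion) on one GPU;
# the offload flag trades speed for fitting everything into VRAM
PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/my_episode_augmented \
    --controlnet_specs my_controlnet_specs.json \
    --offload_text_encoder_model
```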
Here’s a sample controlnet_specs file from my experiments:
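A minimal single-control spec in the repo's format looks like this — the prompt, input path, and weight below are illustrative placeholders, not my exact values:

```json
{
    "prompt": "A robot arm stacking colored cubes on a dark wooden tabletop",
    "input_video_path": "/workspace/episode_000.mp4",
    "vis": {
        "control_weight": 0.5
    }
}
```

The top-level keys other than prompt and input_video_path select which control nets to apply, so swapping "vis" for "edge", "seg", or "depth" switches the conditioning signal.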
See more examples for using other control nets (vis, edge, seg, depth) 👉 GitHub Examples and 👉 Official Documentation
For me, generating a single video took about one hour, which feels quite long. The good news is that inference can likely be sped up by spreading it across multiple GPUs.
My Experiments
Below is one of my attempts to change the color of the work surface and the blocks. The surface was replaced successfully, but there are still some anomalies with the cubes.
I tried regenerating this video multiple times, and the results were always about the same. A few times, the generation was even blocked by the guardrail model. It’s pretty frustrating to wait an hour just to get a message saying the output was “harmful” content.
I believe the next version will be released soon, so I’m hopeful that generation will become more stable.
Estimating the Cost
At around $2/hour on an RTX PRO 6000, that’s $2 per video. If your robot has two cameras, the cost doubles to about $4 per training episode. So, if you plan to generate 100 new episodes, expect the total to be roughly $400.
It’s hard to fathom how expensive it must be to create datasets for advanced humanoid or other robots.
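The arithmetic above can be sketched as a quick back-of-the-envelope calculator — the rates are the ones from this post, not official RunPod pricing:

```shell
# Back-of-the-envelope cost estimate using the numbers from this post
# ($2/hour on an RTX PRO 6000, ~1 hour of generation per video).
hourly_rate=2       # USD per GPU-hour (my observed RunPod rate)
hours_per_video=1   # observed generation time per video
episodes=100        # how many augmented episodes you want
cameras=2           # videos per episode (one per camera)

total=$(( episodes * cameras * hours_per_video * hourly_rate ))
echo "Estimated cost: \$${total}"   # prints: Estimated cost: $400
```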
Conclusion
NVIDIA Cosmos Transfer is a powerful idea — it brings generative data augmentation to robotics in a way that could really speed up training and make robotics models more robust. Still, it’s early days: the setup is tricky, and the hardware demands are high.
If you’ve experimented with Cosmos Transfer or similar models, I’d love to hear your experience. Drop me a comment or message — sharing insights helps the whole community move forward faster.