
Conversation


@praateekmahajan praateekmahajan commented Jul 17, 2025

Currently, an existing ray cluster needs to be started with the following env variable:
RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0

However, if a user starts a ray cluster without this env variable, GPU allocation doesn't work and errors out here:

if self._worker.allocation.gpus and gpu.get_num_gpus() == 0:
    raise RuntimeError(
        "Worker is a GPU worker, but no GPUs are available. This likely means that the ray cluster was not "
        "started with 'RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0'. Xenna needs this env variable to be set "
        "before cluster creation as it works around ray's gpu allocation mechanisms."
    )

I believe this PR should allow us to attach to an existing cluster where that env variable wasn't set.

Earlier

RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0 ray start --num-cpus 16 --num-gpus 4 --port 1234

RAY_ADDRESS=localhost:1234 python main.py

Now

ray start --num-cpus 16 --num-gpus 4 --port 1234

RAY_ADDRESS=localhost:1234 python main.py

TODO: haven't really tested edge cases yet.
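
For reference, here is a minimal sketch of how the variable could be scoped to GPU workers via Ray's runtime_env instead of being required at cluster start. This is not necessarily what the PR does, and I haven't verified that the variable is honored when set per actor; GpuWorker is a hypothetical stand-in for Xenna's actual actors.

import os

import ray

ray.init(address="auto")  # attach to the already-running cluster

# Hypothetical stand-in for a Xenna GPU actor, just to illustrate the idea.
@ray.remote(num_gpus=1)
class GpuWorker:
    def cuda_visible_devices(self):
        return os.environ.get("CUDA_VISIBLE_DEVICES")

# Scope the env var to this actor via runtime_env rather than requiring it
# at `ray start` time on every node.
worker = GpuWorker.options(
    runtime_env={"env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0"}}
).remote()
print(ray.get(worker.cuda_visible_devices.remote()))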

Signed-off-by: Praateek <[email protected]>
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
# These need to be set to allow listing debug info about more than 10k actors.
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)
Collaborator

I think you don't have to delete these?

Author

We do! If we don't, and this code path is executed first, then these variables will "live" on the driver outside the ray session as well. So if any subsequent (non-xenna) ray pipeline does a ray.init(..) again, it'll still have these env variables:

run_xenna_pipeline()
run_ray_data_pipeline() # this will fail because `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0`
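
To make the leak concrete, here's a minimal sketch with hypothetical stand-ins for the two pipelines, plus one way the driver could defend itself by snapshotting and restoring its environment (not something this PR implements):

import os

import ray

def run_xenna_pipeline():
    # Hypothetical stand-in: xenna exports the variable on the driver process.
    os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
    ray.init()
    # ... run the xenna pipeline ...
    ray.shutdown()
    # The env var is still set on the driver here, after the ray session ends.

def run_ray_data_pipeline():
    ray.init()  # inherits RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
    # ... GPU scheduling misbehaves for a plain ray pipeline ...
    ray.shutdown()

# Defensive option: snapshot the driver env and restore it afterwards.
saved_env = dict(os.environ)
try:
    run_xenna_pipeline()
finally:
    os.environ.clear()
    os.environ.update(saved_env)
run_ray_data_pipeline()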


I think we need this even with Xenna. I'm not sure how, but without this the following case fails for me:

Reusing a cluster:
Start the ray cluster with ray start --port 1234 (no env variables) and run RAY_ADDRESS=localhost:1234 python xenna.py

Which means that even with this PR, xenna would fail unless we launch it with these variables set:

Start ray cluster using ray start --port 1234  (no env variables) and RAY_ADDRESS=localhost:1234 RAY_MAX_LIMIT_FROM_API_SERVER=40000  RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 python xenna.py


Maybe we unset these variables here as a solution?

Collaborator

For this

run_xenna_pipeline()
run_ray_data_pipeline() # this will fail because `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0`

I think you'd better start a separate process for each call to make sure things are clean.
I had to do this, otherwise the ray metrics get quite messed up.
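
A rough sketch of that suggestion, using hypothetical placeholders for the two pipelines: running each one in its own process means env vars (and ray state) set by one can never leak into the other.

import multiprocessing as mp

def run_xenna_pipeline():
    ...  # placeholder for the actual xenna pipeline

def run_ray_data_pipeline():
    ...  # placeholder for the actual ray data pipeline

def run_in_subprocess(fn):
    # Each pipeline gets a fresh interpreter, so variables like
    # RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0 die with the child process.
    p = mp.Process(target=fn)
    p.start()
    p.join()
    if p.exitcode != 0:
        raise RuntimeError(f"{fn.__name__} exited with code {p.exitcode}")

if __name__ == "__main__":
    run_in_subprocess(run_xenna_pipeline)
    run_in_subprocess(run_ray_data_pipeline)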

Author

Thanks @abhinavg4 for catching that. So we do need the RAY_MAX_LIMIT_* env vars when we start the cluster. The reason this wasn't caught before is that here we use a default value of 10k, which Ray clusters support anyway:

limit = int(os.environ.get("RAY_MAX_LIMIT_FROM_API_SERVER", "10000"))

However, if we changed that default to str(API_LIMIT), then the previous revision of my PR would have failed.
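
For context, a sketch of how a driver-side limit like this typically gets used when listing actors (a hypothetical call site, not the actual monitoring.py code):

import os

from ray.util.state import list_actors

# The driver-side value caps how many actors we ask for; the API server
# additionally enforces its own RAY_MAX_LIMIT_FROM_API_SERVER, which is why
# the variable has to be set on the cluster as well, not just on the driver.
limit = int(os.environ.get("RAY_MAX_LIMIT_FROM_API_SERVER", "10000"))
actors = list_actors(limit=limit)
print(f"listed {len(actors)} actors")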

Essentially there are cluster-level, driver-level, and worker-level env variables.
RAY_MAX_LIMIT_* needs to be a cluster-level variable (and, because of monitoring.py, also a driver-level variable),
while NOSET_CUDA can be a worker-level variable.
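
A short sketch of where each kind of variable would get set under that split (the values and the address="auto" attach are illustrative, not prescribed by this PR):

import os

import ray

# Cluster level: RAY_MAX_LIMIT_FROM_API_SERVER / RAY_MAX_LIMIT_FROM_DATA_SOURCE
# must be exported in the shell before `ray start`, because the head node's
# API server reads them at startup.

# Driver level: monitoring code on the driver also reads the limit, so set it
# in the driver's environment before any state-API calls.
API_LIMIT = 40000  # illustrative value
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)

ray.init(address="auto")

# Worker level: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES only needs to reach
# the GPU worker processes, e.g. via a per-actor runtime_env (see the sketch in
# the PR description above), rather than the whole cluster.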

Signed-off-by: Praateek <[email protected]>