
Conversation


@praateekmahajan praateekmahajan commented Jul 17, 2025

Currently, an existing ray cluster needs to be started with the following env variable:
RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0

However, if a user starts a ray cluster without this env variable, GPU allocation doesn't work and errors out here:

if self._worker.allocation.gpus and gpu.get_num_gpus() == 0:
    raise RuntimeError(
        "Worker is a GPU worker, but no GPUs are available. This likely means that the ray cluster was not "
        "started with 'RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0'. Xenna needs this env variable to be set "
        "before cluster creation as it works around ray's gpu allocation mechanisms."
    )

I believe this PR should allow us to attach to an existing cluster where that env variable wasn't set.

Earlier

RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0 ray start --num-cpus 16 --num-gpus 4 --port 1234

RAY_ADDRESS=localhost:1234 python main.py

Now

ray start --num-cpus 16 --num-gpus 4 --port 1234

RAY_ADDRESS=localhost:1234 python main.py

TODO: haven't really tested edge cases yet.
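
For reference, here is a minimal sketch of how the variable could be scoped to GPU workers via Ray's runtime_env instead of being required at cluster start. This is not necessarily what the PR does, and I haven't verified that the variable is honored when set per actor; GpuWorker is a hypothetical stand-in for Xenna's actual actors.

import os

import ray

ray.init(address="auto")  # attach to the already-running cluster

# Hypothetical stand-in for a Xenna GPU actor, just to illustrate the idea.
@ray.remote(num_gpus=1)
class GpuWorker:
    def cuda_visible_devices(self):
        return os.environ.get("CUDA_VISIBLE_DEVICES")

# Scope the env var to this actor via runtime_env rather than requiring it
# at `ray start` time on every node.
worker = GpuWorker.options(
    runtime_env={"env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0"}}
).remote()
print(ray.get(worker.cuda_visible_devices.remote()))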

Signed-off-by: Praateek <[email protected]>
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
# These need to be set to allow listing debug info about more than 10k actors.
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)
Collaborator

I think you don't have to delete these?

Author

We do! If we don't, and this code path is executed first, then these variables will "live" on the driver outside the ray session as well. So if any subsequent (non-xenna) ray pipeline does a ray.init(..) again, it'll still have these env variables:

run_xenna_pipeline()
run_ray_data_pipeline() # this will fail because `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0`
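
To make the leak concrete, here's a minimal sketch with hypothetical stand-ins for the two pipelines, plus one way the driver could defend itself by snapshotting and restoring its environment (not something this PR implements):

import os

import ray

def run_xenna_pipeline():
    # Hypothetical stand-in: xenna exports the variable on the driver process.
    os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
    ray.init()
    # ... run the xenna pipeline ...
    ray.shutdown()
    # The env var is still set on the driver here, after the ray session ends.

def run_ray_data_pipeline():
    ray.init()  # inherits RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
    # ... GPU scheduling misbehaves for a plain ray pipeline ...
    ray.shutdown()

# Defensive option: snapshot the driver env and restore it afterwards.
saved_env = dict(os.environ)
try:
    run_xenna_pipeline()
finally:
    os.environ.clear()
    os.environ.update(saved_env)
run_ray_data_pipeline()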


I think we need this even with Xenna. I'm not sure how, but without this the following case fails for me:

Reusing a cluster:
Start the ray cluster with ray start --port 1234 (no env variables) and run RAY_ADDRESS=localhost:1234 python xenna.py

Which means that even with this PR, xenna would fail unless we launch it with these variables set:

Start ray cluster using ray start --port 1234  (no env variables) and RAY_ADDRESS=localhost:1234 RAY_MAX_LIMIT_FROM_API_SERVER=40000  RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 python xenna.py


Maybe we unset these variables here as a solution?

Collaborator

For this

run_xenna_pipeline()
run_ray_data_pipeline() # this will fail because `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0`

I think you'd better start a separate process for each call to make sure things are clean.
I had to do this, otherwise the ray metrics get quite messed up.
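
A rough sketch of that suggestion, using hypothetical placeholders for the two pipelines: running each one in its own process means env vars (and ray state) set by one can never leak into the other.

import multiprocessing as mp

def run_xenna_pipeline():
    ...  # placeholder for the actual xenna pipeline

def run_ray_data_pipeline():
    ...  # placeholder for the actual ray data pipeline

def run_in_subprocess(fn):
    # Each pipeline gets a fresh interpreter, so variables like
    # RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0 die with the child process.
    p = mp.Process(target=fn)
    p.start()
    p.join()
    if p.exitcode != 0:
        raise RuntimeError(f"{fn.__name__} exited with code {p.exitcode}")

if __name__ == "__main__":
    run_in_subprocess(run_xenna_pipeline)
    run_in_subprocess(run_ray_data_pipeline)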

Author

Thanks @abhinavg4 for catching that. So we do need the RAY_MAX_LIMIT_* env vars when we start the cluster. The reason this wasn't caught before is that here we use a default value of 10k, which Ray clusters support anyway:

limit = int(os.environ.get("RAY_MAX_LIMIT_FROM_API_SERVER", "10000"))

However, if we changed that default to str(API_LIMIT), then the previous revision of my PR would have failed.
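
For context, a sketch of how a driver-side limit like this typically gets used when listing actors (a hypothetical call site, not the actual monitoring.py code):

import os

from ray.util.state import list_actors

# The driver-side value caps how many actors we ask for; the API server
# additionally enforces its own RAY_MAX_LIMIT_FROM_API_SERVER, which is why
# the variable has to be set on the cluster as well, not just on the driver.
limit = int(os.environ.get("RAY_MAX_LIMIT_FROM_API_SERVER", "10000"))
actors = list_actors(limit=limit)
print(f"listed {len(actors)} actors")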

Essentially there are cluster-level, driver-level, and worker-level env variables.
RAY_MAX_LIMIT_* needs to be a cluster-level variable (and, because of monitoring.py, also a driver-level variable),
while NOSET_CUDA can be a worker-level variable.
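
A short sketch of where each kind of variable would get set under that split (the values and the address="auto" attach are illustrative, not prescribed by this PR):

import os

import ray

# Cluster level: RAY_MAX_LIMIT_FROM_API_SERVER / RAY_MAX_LIMIT_FROM_DATA_SOURCE
# must be exported in the shell before `ray start`, because the head node's
# API server reads them at startup.

# Driver level: monitoring code on the driver also reads the limit, so set it
# in the driver's environment before any state-API calls.
API_LIMIT = 40000  # illustrative value
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)

ray.init(address="auto")

# Worker level: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES only needs to reach
# the GPU worker processes, e.g. via a per-actor runtime_env (see the sketch in
# the PR description above), rather than the whole cluster.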

Signed-off-by: Praateek <[email protected]>