Use runtime env var for RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
#6
Conversation
| os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0" | ||
| # These need to be set to allow listing debug info about more than 10k actors. | ||
| os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT) | ||
| os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT) |
I think you don't have to delete these?
We do! If we don't, and this code path is executed first, then these variables will "live" on the driver outside the Ray session as well. So any subsequent (non-xenna) Ray pipeline that calls ray.init(...) again will still have these env variables set:

run_xenna_pipeline()
run_ray_data_pipeline()  # this will fail because `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0`
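A minimal sketch of what "living on the driver" means here, assuming hypothetical pipeline functions like the ones above (not the actual xenna code):

```python
import os
import ray


def run_xenna_pipeline():
    # Setting this directly on the driver process means it outlives ray.shutdown().
    os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
    ray.init()
    # ... run stages ...
    ray.shutdown()


def run_ray_data_pipeline():
    # This second ray.init() happens with the leftover variable still set, which
    # changes how CUDA_VISIBLE_DEVICES is managed for its workers.
    ray.init()
    # ... run stages ...
    ray.shutdown()


if __name__ == "__main__":
    run_xenna_pipeline()
    # The variable is still present on the driver, outside any Ray session:
    assert os.environ.get("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES") == "0"
    run_ray_data_pipeline()
```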
I think we need this even with Xenna. I'm not sure how, but without this, the following case fails for me:

Reusing a cluster: start a Ray cluster using `ray start --port 1234` (no env variables) and run `RAY_ADDRESS=localhost:1234 python xenna.py`.

Which means that even with this PR, if we have these variables set, xenna would fail:

Start a Ray cluster using `ray start --port 1234` (no env variables) and run `RAY_ADDRESS=localhost:1234 RAY_MAX_LIMIT_FROM_API_SERVER=40000 RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 python xenna.py`.
Maybe we unset these variables here as a solution?
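One possible shape for that (a hedged sketch, not the actual xenna code; `API_LIMIT` and the wrapper name are illustrative):

```python
import os
import ray

API_LIMIT = 40_000  # assumed value for illustration


def init_ray_with_scoped_env_vars(**ray_init_kwargs):
    """Set the env vars just long enough for ray.init() to pick them up, then restore."""
    overrides = {
        "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0",
        "RAY_MAX_LIMIT_FROM_API_SERVER": str(API_LIMIT),
        "RAY_MAX_LIMIT_FROM_DATA_SOURCE": str(API_LIMIT),
    }
    previous = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        return ray.init(**ray_init_kwargs)
    finally:
        # Unset (or restore) so the driver process stays clean for later pipelines.
        for key, old_value in previous.items():
            if old_value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value
```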
For this case:

run_xenna_pipeline()
run_ray_data_pipeline()  # this will fail because `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0`

I think you'd better start a separate process for each call to make sure things are clean. I had to do this; otherwise the Ray metrics get quite messed up.
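A hedged sketch of that process-per-pipeline approach (the pipeline functions are placeholders for the real ones):

```python
import multiprocessing as mp


def run_xenna_pipeline():  # placeholder for the real pipeline
    ...


def run_ray_data_pipeline():  # placeholder for the real pipeline
    ...


def run_in_subprocess(pipeline_fn):
    # "spawn" gives each pipeline a fresh interpreter, so env vars set by one
    # pipeline (and Ray's per-process metrics state) cannot leak into the next.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=pipeline_fn)
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError(f"{pipeline_fn.__name__} exited with code {proc.exitcode}")


if __name__ == "__main__":
    run_in_subprocess(run_xenna_pipeline)
    run_in_subprocess(run_ray_data_pipeline)
```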
Thanks @abhinavg4 for catching that. So we need the RAY_MAX_LIMIT_* env vars when we start the cluster. The reason this wasn't caught before is that here we use a default value of 10k, which Ray clusters support anyway:

limit = int(os.environ.get("RAY_MAX_LIMIT_FROM_API_SERVER", "10000"))

However, if we changed that value to str(API_LIMIT), then my PR at the previous stage would have failed.

Essentially there are cluster-level, driver-level, and worker-level env variables. RAY_MAX_LIMIT_* needs to be a cluster-level variable (and, because of monitoring.py, also a driver-level variable), while NOSET_CUDA can be a worker-level variable.
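To make the three levels concrete, here is a hedged sketch; the exact values and the actor are illustrative, not this PR's code:

```python
# Cluster level: must be in the environment of `ray start` itself, e.g.
#   RAY_MAX_LIMIT_FROM_API_SERVER=40000 RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 \
#       ray start --head --port 1234
# so the API server will serve actor listings above its 10k default.

import os
import ray

# Driver level: monitoring code on the driver reads the same variable from the
# driver's own environment (defaulting to 10k), so the driver must also be
# launched with it if a higher limit is needed.
limit = int(os.environ.get("RAY_MAX_LIMIT_FROM_API_SERVER", "10000"))

ray.init(address="auto")  # attach to the existing cluster


# Worker level: delivered via the runtime environment, so only this job's Ray
# workers see the flag telling Ray not to set CUDA_VISIBLE_DEVICES for them,
# and nothing leaks into the driver process or later pipelines.
@ray.remote(
    num_gpus=1,
    runtime_env={"env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0"}},
)
class StageWorker:
    def visible_gpus(self) -> str:
        return os.environ.get("CUDA_VISIBLE_DEVICES", "")
```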
Currently it is required that an existing Ray cluster is started with the following env variable: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0. However, if a user starts a Ray cluster without this env variable, the GPU allocation doesn't work out and errors out here:
cosmos-xenna/cosmos_xenna/ray_utils/stage_worker.py
Lines 376 to 381 in c62aa91
I believe this PR should allow us to attach to an existing cluster where that env var wasn't set.
Earlier
Now
TODO: haven't really tested for edge cases.
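Based on the diff and the PR title, a hedged sketch of the earlier vs. new approach referenced above (the exact call site in xenna may differ):

```python
import ray

# Earlier: the variable was set process-wide on the driver before ray.init(),
# which required the existing cluster to have been started with it as well and
# left it set on the driver outside the Ray session.
#   os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
#   ray.init(address="auto")

# Now: pass it as a Ray runtime env var so it applies only to this job's workers,
# even when attaching to a cluster that was started without it.
ray.init(
    address="auto",
    runtime_env={"env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0"}},
)
```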