Issue Description
The min_gpu_recommender custom experiment fails to produce recommendations due to a Python exception (a missing module).
How to reproduce
Steps to reproduce the behavior:
uv venv .venv --python=3.12
. .venv/bin/activate
uv pip install ado-core==1.5.0 ado-autoconf==1.5.0
Create a point YAML called point.yaml:
entity:
  model_name: llama-7b
  method: lora
  gpu_model: NVIDIA-A100-80GB-PCIe
  tokens_per_sample: 8192
  batch_size: 16
  model_version: 3.1.0
experiments:
  - actuatorIdentifier: custom_experiments
    experimentIdentifier: min_gpu_recommender

Process it:
run_experiment point.yaml
Observe that the experiment does not produce a recommendation:
Point: {'model_name': 'llama-7b', 'method': 'lora', 'gpu_model': 'NVIDIA-A100-80GB-PCIe', 'tokens_per_sample': 8192, 'batch_size': 16, 'model_version': '3.1.0'}
2026-02-10 08:43:41,444 INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265/
/Users/vassiliad/projects/orchestrator/fms-autoconf-k8s-controller/.venv/lib/python3.12/site-packages/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
warnings.warn(
(raylet) It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="8f75c1ce-a6a1-4f08-8680-514732077ab5", ...)
Validating entity ...
Executing: custom_experiments.min_gpu_recommender
(custom_experiment_executor pid=33914) Found 1 mismatches between original and current metadata:
(custom_experiment_executor pid=33914) INFO: AutoGluon Python micro version mismatch (original=3.12.7, current=3.12.11)
Result:
[request_id 7aec8b
request_index 0
entity_index 0
result_index 0
batch_size 16
generatorid unk
gpu_model NVIDIA-A100-80GB-PCIe
method lora
model_name llama-7b
model_version 3.1.0
tokens_per_sample 8192
identifier model_name.llama-7b-method.lora-gpu_model.NVID...
experiment_id custom_experiments.min_gpu_recommender
valid True
can_recommend 0
dtype: object]
(custom_experiment_executor pid=33914) 2026-02-10 08:44:14,371 WARNING MainThread root : min_gpu_recommender : recommend_min_gpus_and_workers() for {'model_name': 'llama-7b', 'method': 'lora', 'gpu_model': 'NVIDIA-A100-80GB-PCIe', 'tokens_per_sample': 8192, 'batch_size': 16, 'model_version': '3.1.0', 'gpus_per_worker': 8, 'max_gpus': 8}
cannot produce a recommendation: Unable to recommend minimum number of GPUs to avoid GPU OOM: {'Rule-Based Classifier error': '', 'Predictive Model Classifier error': "No module named 'IPython'"}
The Predictive Model Classifier error field indicates a missing Python module:
{'Rule-Based Classifier error': '', 'Predictive Model Classifier error': "No module named 'IPython'"}
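The import failure can be confirmed directly in the same virtual environment with a minimal check (this assumes the predictive model classifier tries to import IPython at prediction time, which is what the error message suggests):

import importlib.util

# In the freshly created .venv, IPython is not installed, so any code path
# importing it raises ModuleNotFoundError: No module named 'IPython'.
if importlib.util.find_spec("IPython") is None:
    print("IPython is not installed in this environment")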
Running uv pip install ipython and retrying the point produces a recommendation:
Result:
[request_id 0f0eb2
request_index 0
entity_index 0
result_index 0
batch_size 16
generatorid unk
gpu_model NVIDIA-A100-80GB-PCIe
method lora
model_name llama-7b
model_version 3.1.0
tokens_per_sample 8192
identifier model_name.llama-7b-method.lora-gpu_model.NVID...
experiment_id custom_experiments.min_gpu_recommender
valid True
can_recommend 1
gpus 2
workers 1
dtype: object]
Expected behaviour
The experiment should produce a recommendation out of the box, without having to manually install additional Python packages.
Screenshots/Logs
See above
Python/ado/system info
python --version: 3.12.11
ado version: 1.5.0
OS: macOS
Additional information
I grabbed my example from: https://github.com/IBM/ado/blob/main/plugins/custom_experiments/autoconf/examples/simple.yaml
And then I changed the model_version to 3.1.0 because that's the most recent one.
- This should have been caught by a unit test (see the sketch below).
- The example should have been updated to use model_version: 3.1.0.
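A regression test along these lines would catch the missing dependency. This is a hypothetical sketch: the test name and the dependency list are assumptions based on this issue, not the actual ado test layout or the experiment's full dependency set.

import importlib.util

import pytest

# Hypothetical regression test: every runtime dependency that the
# min_gpu_recommender experiment needs must be importable in the environment
# the plugin was installed into. The list below contains only IPython because
# that is the module this issue surfaced.
RECOMMENDER_RUNTIME_DEPS = ["IPython"]

@pytest.mark.parametrize("module", RECOMMENDER_RUNTIME_DEPS)
def test_recommender_dependencies_are_importable(module):
    assert importlib.util.find_spec(module) is not None, (
        f"min_gpu_recommender requires '{module}' but it is not installed; "
        "it should be declared as a dependency of the plugin"
    )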