
bug(ado-autoconf): min_gpu_recommender unable to make recommendations due to missing ipython package #521

@VassilisVassiliadis

Issue Description

The min_gpu_recommender custom_experiment fails to produce recommendations because a Python exception is raised: the IPython module is not installed.

How to reproduce

Steps to reproduce the behavior:

uv venv .venv --python=3.12
. .venv/bin/activate
uv pip install ado-core==1.5.0 ado-autoconf==1.5.0

Create a point YAML called point.yaml:

entity:
  model_name: llama-7b
  method: lora
  gpu_model: NVIDIA-A100-80GB-PCIe
  tokens_per_sample: 8192
  batch_size: 16
  model_version: 3.1.0
experiments:
  - actuatorIdentifier: custom_experiments
    experimentIdentifier: min_gpu_recommender

Process it:

run_experiment point.yaml

Observe that the experiment does not produce a recommendation:

Point: {'model_name': 'llama-7b', 'method': 'lora', 'gpu_model': 'NVIDIA-A100-80GB-PCIe', 'tokens_per_sample': 8192, 'batch_size': 16, 'model_version': '3.1.0'}
2026-02-10 08:43:41,444	INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265/
/Users/vassiliad/projects/orchestrator/fms-autoconf-k8s-controller/.venv/lib/python3.12/site-packages/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
(raylet) It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="8f75c1ce-a6a1-4f08-8680-514732077ab5", ...)
Validating entity ...
Executing: custom_experiments.min_gpu_recommender
(custom_experiment_executor pid=33914) Found 1 mismatches between original and current metadata:
(custom_experiment_executor pid=33914) 	INFO: AutoGluon Python micro version mismatch (original=3.12.7, current=3.12.11)
Result:
[request_id                                                      7aec8b
request_index                                                        0
entity_index                                                         0
result_index                                                         0
batch_size                                                          16
generatorid                                                        unk
gpu_model                                        NVIDIA-A100-80GB-PCIe
method                                                            lora
model_name                                                    llama-7b
model_version                                                    3.1.0
tokens_per_sample                                                 8192
identifier           model_name.llama-7b-method.lora-gpu_model.NVID...
experiment_id                   custom_experiments.min_gpu_recommender
valid                                                             True
can_recommend                                                        0
dtype: object]

(custom_experiment_executor pid=33914) 2026-02-10 08:44:14,371 WARNING   MainThread           root           : min_gpu_recommender : recommend_min_gpus_and_workers() for {'model_name': 'llama-7b', 'method': 'lora', 'gpu_model': 'NVIDIA-A100-80GB-PCIe', 'tokens_per_sample': 8192, 'batch_size': 16, 'model_version': '3.1.0', 'gpus_per_worker': 8, 'max_gpus': 8}
cannot produce a recommendation: Unable to recommend minimum number of GPUs to avoid GPU OOM: {'Rule-Based Classifier error': '', 'Predictive Model Classifier error': "No module named 'IPython'"}

The Predictive Model Classifier error field points to a missing Python package:

{'Rule-Based Classifier error': '', 'Predictive Model Classifier error': "No module named 'IPython'"}
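
This is consistent with the recommender trying each classifier in turn and recording the exception message of whichever one fails. A minimal sketch of that pattern; every name below is a hypothetical illustration, not the actual ado-autoconf code:

def rule_based_classify(point: dict) -> dict:
    # Stand-in failure: the real rule-based path reported an empty message.
    raise ValueError("")

def model_based_classify(point: dict) -> dict:
    # Raises ImportError("No module named 'IPython'") when ipython is absent.
    import IPython  # noqa: F401
    return {"gpus": 2, "workers": 1}

def recommend_min_gpus(point: dict) -> dict:
    errors = {}
    for name, classify in (
        ("Rule-Based Classifier", rule_based_classify),
        ("Predictive Model Classifier", model_based_classify),
    ):
        try:
            return {"can_recommend": 1, **classify(point)}
        except Exception as exc:
            errors[f"{name} error"] = str(exc)
    raise RuntimeError(
        f"Unable to recommend minimum number of GPUs to avoid GPU OOM: {errors}"
    )

With ipython installed, the model-based path no longer raises, which matches the successful retry below.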

Running uv pip install ipython and retrying the point produces a recommendation:

Result:
[request_id                                                      0f0eb2
request_index                                                        0
entity_index                                                         0
result_index                                                         0
batch_size                                                          16
generatorid                                                        unk
gpu_model                                        NVIDIA-A100-80GB-PCIe
method                                                            lora
model_name                                                    llama-7b
model_version                                                    3.1.0
tokens_per_sample                                                 8192
identifier           model_name.llama-7b-method.lora-gpu_model.NVID...
experiment_id                   custom_experiments.min_gpu_recommender
valid                                                             True
can_recommend                                                        1
gpus                                                                 2
workers                                                              1
dtype: object]

Expected behaviour

The experiment should work out of the box, without the user manually installing ipython.
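
A quick way to check whether an installed release carries the fix is to inspect the distribution metadata for an ipython requirement. A minimal sketch; whether the dependency belongs in ado-autoconf or ado-core is an assumption:

# Checks whether the installed ado-autoconf distribution declares ipython
# as a dependency. Whether the fix lands in ado-autoconf or ado-core is an
# assumption of this sketch.
from importlib.metadata import requires

deps = requires("ado-autoconf") or []
print([req for req in deps if req.lower().startswith("ipython")])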

Screenshots/Logs

See above

Python/ado/system info

python --version: 3.12.11
ado version: 1.5.0
OS: macOS

Additional information

I grabbed my example from https://github.com/IBM/ado/blob/main/plugins/custom_experiments/autoconf/examples/simple.yaml and changed model_version to 3.1.0, as that is the most recent version.

  • This should have been caught in a unit test (see the sketch after this list).
  • The example should have been updated to use model_version==3.1.0.
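
For the first point, a minimal pytest sketch that would have caught this regression; the test and module list are hypothetical, not existing ado tests:

# Hypothetical regression test, not an existing ado test: it fails fast when
# a runtime dependency of the predictive model classifier is missing from the
# installed environment, instead of surfacing as a failed recommendation.
import importlib

import pytest

REQUIRED_RUNTIME_MODULES = [
    "IPython",  # the module this issue found missing
]

@pytest.mark.parametrize("module_name", REQUIRED_RUNTIME_MODULES)
def test_runtime_dependencies_importable(module_name: str) -> None:
    importlib.import_module(module_name)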
