-
Notifications
You must be signed in to change notification settings - Fork 221
Convert GPU Driver installation to Tool, Add amd-smi #4080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
adityagesh
commented
Oct 28, 2025
- Convert GPU Driver installation to a Tool. GPU Feature in Azure is overloaded and has multiple responsibilities.
- Add support for AMD GPU Driver installation
- Add basic amd-smi tool
7ef7268 to
09513d7
Compare
GPU feature was responsible for both capability and driver installation. This change is refactoring without any functional changes. The driver installation responsibility is now converted to a tool
_is_nvidia is initialized, but never used
0081041 to
4636902
Compare
4636902 to
9b0fa7a
Compare
505a16d to
eeee25b
Compare
eeee25b to
056479b
Compare
760b044 to
ccc7398
Compare
ccc7398 to
86355c1
Compare
203c42c to
73e3f98
Compare
| gpu_driver_factory = Factory[GpuDriver](GpuDriver) | ||
|
|
||
| driver_class = gpu_driver_factory.create_by_type_name( | ||
| compute_sdk, node=node, **kwargs | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lisa/tools/gpu_drivers.py:89:49: error: Only concrete class can be given where "Type[GpuDriver]" is expected
Found 1 error in 1 file (checked 511 source files)
nox > Command mypy -p lisa failed with exit code 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove abstract decorators, just raise NotImplementError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh okay, I thought the comment regarding abstract was only for the get_dependencies method
20d47d0 to
7a950e6
Compare
7a950e6 to
315488f
Compare
lisa/tools/gpu_drivers.py
Outdated
| GpuDriver # type: ignore[type-abstract] | ||
| ) | ||
|
|
||
| driver_class = gpu_driver_factory.create_by_type_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
driver_class -> driver, it creates an instance, not a class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, forgot to change the variable name
| compute_sdk = _get_supported_driver(node) | ||
|
|
||
| # Create GPU driver instance using virtual tool pattern | ||
| gpu_driver = node.tools.create(GpuDriver, compute_sdk=compute_sdk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use tools.get. It creates the object too, but will cache it for next use.
lisa/tools/gpu_drivers.py
Outdated
| return AmdSmi | ||
|
|
||
| @property | ||
| def driver_name(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks the driver name is duplicated with type_name, should it be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
175d62f to
347778f
Compare