Discover GPU vendor from CDI spec before injecting GPU for --gpus option #28008
base: main
Conversation
Force-pushed from 4f051d4 to 44cc798
Force-pushed from 44cc798 to 2b50a66
Luap99 left a comment
I am not sure we want to expand the gpu option further, but if other maintainers want this then it must at least be implemented in a way where CDI specs are resolved on the server side, not the client.
pkg/specgenutil/specgen.go (outdated)

	devices := c.Devices
-	for _, gpu := range c.GPUs {
-		devices = append(devices, "nvidia.com/gpu="+gpu)
+	if len(c.GPUs) > 0 {
The problem is that this cannot be handled here: FillOutSpecGen() is called on the client side, i.e. podman-remote, and the CDI configs only ever exist on the server.
That is also why the bloat check is failing; this would drag all the CDI code into the remote client, which we should avoid.
Understood. I will look into moving this code to the server side.
I have moved the GPUs-to-CDI-device resolution logic to the server side. Please have a look.
Thank you so much for taking the time to look at this, btw.
pkg/specgen/gpus.go (outdated)

	for _, knownVendor := range knownGPUVendors {
		for _, vendor := range vendors {
			if vendor == knownVendor {
				logrus.Debugf("Discovered GPU vendor from CDI specs: %s", vendor)
				return vendor, nil
			}
		}
	}
I am not sure the UX of this would make sense: what if you have both an AMD and an NVIDIA card in the same system?
I think the --gpu option is just a bad idea in general.
I am not sure we should continue expanding the --gpu option at all for this. Users should really just use the --device CDI syntax directly.
I think keeping the --gpus option is very helpful: users can just request GPUs without needing to know what CDI is and how it works.
Also, having both NVIDIA and AMD GPUs on the same machine is a rare case, and the logic implemented here favours NVIDIA (which is okay).
Another argument in favour of these changes: it is better to support AMD GPUs at least on machines that have only AMD GPUs than not to support them at all :-)
We can work out the logic for machines with mixed vendors in the future, but it would be really nice to start supporting AMD GPUs now.
(our customers love podman <3 😃)
rather than not supporting them at all :-)
I mean, my point is that we do support CDI, so if they have CDI files we already support them.
But sure, I get your point; the multiple-GPU-vendor use case can be limited to having to use the full proper CDI syntax.
I also think that handling the common/simplest case here (i.e. a single vendor's CDI spec) is fine -- given that using --device=vendor1.com/gpu=ID1 --device=vendor2.com/gpu=ID2 is already possible. With this in mind, does it make sense to log a warning if multiple supported vendors are detected?
As a follow-up: Does it make sense to make the chosen vendor configurable, or is that not control that we expect users to actually want to use?
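For illustration, the warning floated above could be as small as the following Go sketch (the helper and its priority-ordered input are my assumptions, not the PR's code):

package main

import "github.com/sirupsen/logrus"

// warnOnMultipleVendors logs a warning when more than one known GPU vendor
// has CDI specs on the system; "found" is assumed to already be sorted in
// priority order, NVIDIA first.
func warnOnMultipleVendors(found []string) {
	if len(found) > 1 {
		logrus.Warnf("multiple GPU vendors detected (%v); --gpus will use %s; use --device with a fully qualified CDI device name to select another vendor", found, found[0])
	}
}

func main() {
	warnOnMultipleVendors([]string{"nvidia.com", "amd.com"})
}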
I'd rather keep this straightforward; for any complex scenario people should use the CDI syntax with --device. It may not be optimal, but I think the need to support AMD and NVIDIA cards in the same system will be rare enough that it should not matter.
Force-pushed from 2b50a66 to 3fb8e70
Are these the right official docs for AMD CDI? https://instinct.docs.amd.com/projects/container-toolkit/en/latest/container-runtime/quick-start-guide.html#regenerating-and-validating-cdi-specifications We should likely add links to the docs for both NVIDIA and AMD in the man page on how to generate the CDI files, because this won't work without them. I'll do a proper review later.

Yes, @Luap99. This should do for now. If you'd like, I can add it to the docs in this PR itself. Also, once this is merged, we might organize our docs to have an end-to-end guide on how to get things working with podman.
Force-pushed from ab24545 to ca6e32c
ninja-quokka left a comment
Nice work @shiv-tyagi
I think this looks nice. Please update your PR description with a release note under the "Does this PR introduce a user-facing change?" section, though; I think this change is worthy of one, and users with AMD GPUs would like to know.
pkg/specgen/gpus.go (outdated)

	registry, err := cdi.NewCache(
		cdi.WithSpecDirs(cdiSpecDirs...),
		cdi.WithAutoRefresh(false),
	)
What is required to ensure that the same registry that is used to perform injection is used to resolve the vendor? Since these are both instantiated at different points in time, it may be that their view of the world (in terms of valid vendors) is not consistent.
This seems to be a valid point. I will look into this.
I guess this is already kind of broken in DevicesFromPath, which is called inside a loop, so we get a new cache each time, which seems wrong. Rework-wise, it would be best to move this up the call stack, outside the loop, so we can reuse it for all lookups.
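To make that suggestion concrete, a minimal sketch of the hoisting, using the cdi package from tags.cncf.io/container-device-interface that podman already depends on (the function name and error wrapping are mine, not podman's actual code):

import (
	"fmt"

	spec "github.com/opencontainers/runtime-spec/specs-go"
	"tags.cncf.io/container-device-interface/pkg/cdi"
)

// injectAllCDIDevices builds the CDI cache once, outside any per-device
// loop, and hands the whole device list to a single InjectDevices call.
func injectAllCDIDevices(ociSpec *spec.Spec, devices []string, cdiSpecDirs []string) error {
	registry, err := cdi.NewCache(
		cdi.WithSpecDirs(cdiSpecDirs...),
		cdi.WithAutoRefresh(false),
	)
	if err != nil {
		return fmt.Errorf("creating CDI registry: %w", err)
	}
	if _, err := registry.InjectDevices(ociSpec, devices...); err != nil {
		return fmt.Errorf("injecting CDI devices: %w", err)
	}
	return nil
}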
We now use the same registry. The GPUs are resolved to CDI devices just before the other CDI devices are injected into the container, and the same CDI cache that is used to discover the vendor is used to inject the devices.
-	for _, gpu := range c.GPUs {
-		devices = append(devices, "nvidia.com/gpu="+gpu)
-	}
+	s.GPUs = c.GPUs
Since we are not explicitly updating s.Devices based on s.GPUs here, we need to ensure that s.ResolveGPUsToCDIDevices is called before we call DevicesFromPath for the generated devices. How is this done?
DevicesFromPath is called inside SpecGenToOCI, which is called after s.ResolveGPUsToCDIDevices in the MakeContainer call, which ensures the device resolution precedes the actual injection.
I think logically it would be nicer to move this directly above the DevicesFromPath() calls and just call InjectDevices() on the path directly, without first appending it to the devices string list. That way it would not have to go through the isCDIDevice() condition of DevicesFromPath().
I have reworked the logic. Now we pass the GPUs directly into ContainerMiscConfig using libpod.WithGPUs (similar to libpod.WithCDI), where they get resolved to full device IDs just before the other CDI devices are injected. DevicesFromPath is skipped entirely this way.
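For readers unfamiliar with libpod's option pattern, a hedged sketch of what a WithGPUs-style CtrCreateOption can look like, modeled on libpod's existing options rather than copied from the PR:

// Sketch only: field and error names follow libpod conventions but are
// assumptions, not the PR's exact code.
func WithGPUs(gpus []string) CtrCreateOption {
	return func(ctr *Container) error {
		if ctr.valid {
			return define.ErrCtrFinalized
		}
		// Stored on the container config; resolved to fully qualified CDI
		// device names later, just before CDI injection.
		ctr.config.GPUs = gpus
		return nil
	}
}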
Thanks, I think this looks a lot better. The one question I have (which is probably out of scope for this PR) is how the injection we're doing here differs from the
func DevicesFromPath(g *generate.Generator, devicePath string, config *config.Config) error {
implementations that also exist in the codebase (https://github.com/shiv-tyagi/podman/blob/ddf6b4c0bcd964dd7ce28e97e051eea8a851e200/pkg/specgen/generate/config_linux.go#L27 and https://github.com/shiv-tyagi/podman/blob/ddf6b4c0bcd964dd7ce28e97e051eea8a851e200/pkg/specgen/generate/config_freebsd.go#L20-L21)? Do those also need to be extended to support GPUs? (I'm happy to move this to another issue if that requires further/separate discussion).
I think DevicesFromPath only receives the devices specified in containers.conf (see this). I am not sure that has anything to do with GPU support.
The ones we pass using the --device option are extracted early using

	func ExtractCDIDevices(s *specgen.SpecGenerator) []libpod.CtrCreateOption {

and injected within libpod, as we are doing here.
I think that even though DevicesFromPath does its own injection, I would still argue that this is out of scope for this PR, since those devices are always fully specified (as discussed in #16232). It may make sense to see if we could move to a state where we at least reuse the same registry in a follow-up, but I don't have a good feeling for the complexity of that.
I have created #28060 as a follow-up to this one. Let's discuss this there.
Force-pushed from ca6e32c to a38eec9
[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.
Force-pushed from a38eec9 to 176b4aa
Honny1 left a comment
I only checked the code structure and found just a small nit. I also have a question about Apple GPU access through libkrun: will it work, or could it break the Apple GPU? I haven't dug into it yet, but that was the first thing that came to mind.
| return "", fmt.Errorf("CDI cache cannot be nil") | ||
| } | ||
|
|
||
| knownGPUVendors := []string{ |
I would use map[string]struct{} to reduce nested loops.
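For illustration, the set-based lookup suggested here might look like this sketch (not the PR's code); note the follow-up below about why ordering matters:

// One pass over the seen vendors instead of nested loops; returns the
// first *seen* vendor that is known, in input order.
var knownGPUVendors = map[string]struct{}{
	"nvidia.com": {},
	"amd.com":    {},
}

func firstKnownVendor(seen []string) (string, bool) {
	for _, vendor := range seen {
		if _, ok := knownGPUVendors[vendor]; ok {
			return vendor, true
		}
	}
	return "", false
}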
I guess it would be even simpler to just inline that as a switch case statement.
If it is just a nit, I would prefer to keep it like this, to make sure the order in which known vendors are searched looks for NVIDIA first.
Using a map/switch for knownGPUVendors would change it to return the first known vendor among the seen vendors, which could be either, depending on which comes first in cdiCache.ListVendors().
For example, if cdiCache.ListVendors() returns ["some-vendor.com", "amd.com", "nvidia.com"], it would pick "amd.com", since that is the first vendor which is known, whereas I want it to look for NVIDIA first and then AMD.
cc @elezar
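The ordering concern restated as code: keeping a priority-ordered slice for the outer loop preserves the NVIDIA-first guarantee, and slices.Contains (Go 1.21+) flattens the inner loop (a sketch, not the PR's code):

import (
	"fmt"
	"slices"
)

// preferredVendor walks the known vendors in priority order, so NVIDIA wins
// even if cdiCache.ListVendors() happens to return AMD first.
func preferredVendor(seen []string) (string, error) {
	knownGPUVendors := []string{"nvidia.com", "amd.com"} // priority order
	for _, known := range knownGPUVendors {
		if slices.Contains(seen, known) {
			return known, nil
		}
	}
	return "", fmt.Errorf("no known GPU vendor found among %v", seen)
}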
Traditionally, the --gpus flag has only supported NVIDIA GPUs, but this was recently changed. I would argue that many users still expect --gpus to give them an NVIDIA GPU, and we should not change this behaviour now.
Even IF we decide that AMD GPUs should be preferred, we should make this explicit and deterministic and not rely on the ordering of the underlying maps.
Yes, agreed. +1 for doing it in order with NVIDIA followed by AMD.
This only makes the

@shiv-tyagi Thanks for the explanation.
Luap99 left a comment
Just some small comments; I think the design of moving this into libpod is the best way to go about it.
cc @mheon
libpod/container_internal_common.go (outdated)

	cdiDevices := c.config.CDIDevices
	cdiDevices = append(cdiDevices, gpuCDIDevices...)
FYI, slice handling in Go sucks.
This is not safe, because append may modify the underlying backing array directly as long as cdiDevices has enough capacity.
So this could alter c.config.CDIDevices, which would be quite unexpected.
I think the proper thing here is
cdiDevices := make([]string, 0, len(c.config.CDIDevices) + len(gpuCDIDevices))
cdiDevices = append(cdiDevices, c.config.CDIDevices...)
cdiDevices = append(cdiDevices, gpuCDIDevices...)
Got it. I have fixed this. Thanks.
| return "", fmt.Errorf("CDI cache cannot be nil") | ||
| } | ||
|
|
||
| knownGPUVendors := []string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it would be even simpler to just inline that as switch case statement
libpod/util.go (outdated)

	// by discovering the vendor from the provided CDI cache.
	func gpusToCDIDevices(gpus []string, cdiCache *cdi.Cache) ([]string, error) {
		if len(gpus) == 0 {
			return []string{}, nil
This can return nil instead of an empty slice, which avoids an empty allocation.
Done. Thanks for spotting.
libpod/util.go (outdated)

		return nil, fmt.Errorf("could not discover GPU vendor from CDI cache: %w", err)
	}

	cdiDevices := []string{}
I wonder why the prealloc linter is not flagging this; it can be preallocated to avoid extra memory copies/allocations:
cdiDevices := make([]string, 0, len(gpus))
Done. Thanks.
Force-pushed from 176b4aa to ddf6b4c
	gpuCDIDevices, err := gpusToCDIDevices(c.config.GPUs, registry)
	if err != nil {
		return nil, nil, fmt.Errorf("converting GPU identifiers to CDI devices: %w", err)
	}
	cdiDevices := make([]string, 0, len(c.config.CDIDevices)+len(gpuCDIDevices))
	cdiDevices = append(cdiDevices, c.config.CDIDevices...)
	cdiDevices = append(cdiDevices, gpuCDIDevices...)
	if _, err := registry.InjectDevices(g.Config, cdiDevices...); err != nil {
Would adding a helper that encapsulates this logic for a given config and registry here make things easier to follow?
func getCDIDeviceNames(registry *cdi.Cache, c *ContainerConfig) ([]string, error) {
	if len(c.GPUs) == 0 {
		return c.CDIDevices, nil
	}
	gpuCDIDevices, err := gpusToCDIDevices(c.GPUs, registry)
	if err != nil {
		return nil, fmt.Errorf("converting GPU identifiers to CDI devices: %w", err)
	}
	return append(c.CDIDevices, gpuCDIDevices...), nil
}
(Note that as a side note: I have left out the pre-allocation of the slice. Is such an optimisation worth it in go, or does the compiler generally do the right thing in simple cases such as this?)
and then we update the code here:

-	gpuCDIDevices, err := gpusToCDIDevices(c.config.GPUs, registry)
-	if err != nil {
-		return nil, nil, fmt.Errorf("converting GPU identifiers to CDI devices: %w", err)
-	}
-	cdiDevices := make([]string, 0, len(c.config.CDIDevices)+len(gpuCDIDevices))
-	cdiDevices = append(cdiDevices, c.config.CDIDevices...)
-	cdiDevices = append(cdiDevices, gpuCDIDevices...)
-	if _, err := registry.InjectDevices(g.Config, cdiDevices...); err != nil {
+	cdiDevices, err := getCDIDeviceNames(registry, c.config)
+	if _, err := registry.InjectDevices(g.Config, cdiDevices...); err != nil {
(Note that as a side note: I have left out the pre-allocation of the slice. Is such an optimisation worth it in go, or does the compiler generally do the right thing in simple cases such as this?)
Worth it? Well, in many cases no, if you measure the performance of the whole program. But it can certainly add up if you add thousands of elements and cause a bunch of memory reallocations instead of allocating the right size once. And really, doing this does not hurt; it optimises the number of allocations we need.
The bigger issue, as I described in my review comments above, is that assigning a slice with := is dangerous as hell in Go. Just like a map, this is not a deep copy: you only copy the pointer to the backing array, and if that slice has enough capacity, append will not create a new backing array.
What do you expect to happen here? Likely not what we wanted.
package main

import "fmt"

func main() {
	sliceA := make([]string, 0, 2)
	sliceA = append(sliceA, "1")
	sliceB := sliceA             // copies the slice header only; both share one backing array
	sliceA = append(sliceA, "A") // writes "A" into the shared array at index 1
	sliceB = append(sliceB, "B") // overwrites that same slot with "B"
	fmt.Println("A ", sliceA)    // A  [1 B]
	fmt.Println("B ", sliceB)    // B  [1 B]
}

https://go.dev/play/p/Ukjfci3BPzT
And then if you have callers modify elements in place you have the same issue.
Looking again into the stdlib, I guess we do have slices.Concat(s1, s2) now, so that can be preferred to join two slices instead of the manual make allocation. But never blindly append into an existing slice that you do not plan to modify.
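slices.Concat (Go 1.22+) copies its inputs into a freshly allocated backing array, which is exactly the safety property wanted here; a runnable sketch:

package main

import (
	"fmt"
	"slices"
)

func main() {
	cdiDevices := []string{"nvidia.com/gpu=0"}
	gpuCDIDevices := []string{"amd.com/gpu=0"}

	// Concat allocates a new slice, so appending to or mutating the result
	// can never reach back into either input's backing array.
	combined := slices.Concat(cdiDevices, gpuCDIDevices)
	fmt.Println(combined) // [nvidia.com/gpu=0 amd.com/gpu=0]
}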
libpod/util.go (outdated)

	// It returns the vendor domain (e.g., "nvidia.com", "amd.com") that should
	// be used to construct fully qualified CDI device names.
	// Returns an error if no known GPU vendor is found.
	func discoverGPUVendorFromCDI(cdiCache *cdi.Cache) (string, error) {
As a note: if we were to define a local interface:
type vendorLister interface {
	ListVendors() []string
}
here we can test this function without creating CDI specs on disk.
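A sketch of the kind of unit test the interface enables (the fake type is mine, and it assumes discoverGPUVendorFromCDI was refactored to accept a vendorLister):

import "testing"

// fakeLister satisfies vendorLister without any CDI specs on disk.
type fakeLister struct{ vendors []string }

func (f fakeLister) ListVendors() []string { return f.vendors }

func TestDiscoverGPUVendorFromCDI(t *testing.T) {
	vendor, err := discoverGPUVendorFromCDI(fakeLister{vendors: []string{"amd.com"}})
	if err != nil {
		t.Fatal(err)
	}
	if vendor != "amd.com" {
		t.Fatalf("expected amd.com, got %q", vendor)
	}
}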
Thanks for the suggestion. I added the interface and used it in the function and the tests.
docs/source/markdown/options/gpus.md (outdated)

	GPU devices to add to the container ('all' to pass all GPUs) Currently only
-	Nvidia devices are supported.
+	NVIDIA and AMD devices are supported.
Not a blocker, but what about updating this to the following:

-	GPU devices to add to the container ('all' to pass all GPUs) Currently only
-	Nvidia devices are supported.
-	NVIDIA and AMD devices are supported.
+	Start the container with GPU support. Where `ENTRY` can be `all` to request all GPUs, or a vendor-specific identifier. Currently NVIDIA and AMD devices are supported. If both NVIDIA and AMD devices are present the NVIDIA devices will be preferred and a CDI device name must be specified using the `--device` flag to request a set of GPUs from a *specific* vendor.
I like that suggested wording
Thanks @elezar. I have updated this.
Force-pushed from 6677acf to 8013d3d
Commit: Discover GPU vendor from CDI spec before injecting GPU for --gpus option. Signed-off-by: Shiv Tyagi <Shiv.Tyagi@amd.com>
Force-pushed from 8013d3d to 054793d
Luap99 left a comment
LGTM
elezar left a comment
Thanks @shiv-tyagi. I think this is good to go.
I've also created #28060 as a follow-up for some of our discussions.
This gets the --gpus option working in podman run (or podman create) commands with AMD GPUs, by discovering the vendor from CDI specs to construct the device ID for the GPU to inject into the container.

Checklist
Ensure you have completed the following checklist for your pull request to be reviewed:

- Sign-off on all commits (git commit -s; if needed, use git commit -s --amend). The author email must match the sign-off email address. See CONTRIBUTING.md for more information.
- Fixes: #00000 in commit message (if applicable)
- make validatepr (format/lint checks)
- Release note (None if no user-facing changes)

Does this PR introduce a user-facing change?
Yes
--gpus option now supports AMD GPUs