Skip to content

Conversation

@giuseppe
Copy link
Member

Add a new splitfdstream package that enables efficient container layer transfer between storage instances by separating tar metadata from file content. File content is passed as file descriptors.

@giuseppe giuseppe force-pushed the splitfdstream branch 2 times, most recently from 4cca2a3 to 3b6f591 Compare February 10, 2026 14:59
Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 6% of the code base for an extremely niche use case :/

I’m afraid I can’t spend time on this now; and anyway some other work on pkg/archive needs to happen first, and expect the rebase to be somewhat non-trivial.

@giuseppe
Copy link
Member Author

  • 6% of the code base for an extremely niche use case :/

I’m afraid I can’t spend time on this now; and anyway some other work on pkg/archive needs to happen first, and expect the rebase to be somewhat non-trivial.

what other work would block this feature?

I don't love either the amount of extra code, but it is almost all in new APIs/packages that are experimental and not leaking too much to the rest of the library.

We went around this problem for a long time, and in the end this seems like an ~ok solution to not expose too much of the containers storage internals augmenting the tar format that is already plugged everywhere.

@giuseppe
Copy link
Member Author

to add more context: the goal of this feature is to be able to copy a layer, similarly to what we would do with Diff, but having the possibility to deduplicate files using reflinks. We need two separate copies because OS files can't be kept on the containers storage for different reasons, a podman system reset -f should not break the system, and OS files need to be relabeled.

There are other interesting use cases enabled with a RPC like facilitate usage from different languages (Rust in our case), inject into a build container to grab R/O access to layers/images, but this is out of scope now.

I realize complexity is not trivial just to get access to the raw files, but what alternatives do we have?

From the store API PoV we could just extend it to return a map $LAYER_FILE -> $PATH_TO_THE_FILE, but how would we use it from Rust?

Any suggestions?

@mtrmac
Copy link
Contributor

mtrmac commented Feb 10, 2026

(Just to avoid a misunderstanding, I didn’t really read the PR. I have, at this point, ~no opinion on RPC vs. Go API.)

return &tarReaderIterator{
tr: tr,
trBuf: trBuf,
buffer: make([]byte, 1<<20),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like 1<<20 should be in a const somewhere with a rationale?


// splitFDStreamSocketDescriptor is the fd for the Unix socket used to
// receive file descriptors via SCM_RIGHTS in the re-exec child.
const splitFDStreamSocketDescriptor = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding this seems a bit fragile, can't we pass it as an env var or a CLI arg?

Comment on lines 34 to 35
// FDs are streamed to the child process one-at-a-time over a Unix socket
// using SCM_RIGHTS, avoiding EMFILE from inheriting too many FDs at exec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have to be truly one at a time, see also https://github.com/cgwalters/jsonrpc-fdpass?tab=readme-ov-file#41-fd-batching

In my initial PoC work here I handled this by just sending the splitfdstream inline data itself as a fd (could be O_TMPFILE or memfd or pipe), and then sending all the associated fds over jsonrpc-fdpass which automatically handles buffering.

Ensure we processing just 1 recvmsg() at a time (only ~200 fds on Linux) seems really reasonable and unlikely to get anywhere close to modern fd limits.


So if we do choose jsonrpc-fdpass, I think we also need to standardize this "splitfdstream serialization".

return err
}

// Try copy_file_range - kernel-level copy, more efficient than userspace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprising we weren't doing this before.


// ReadLine reads bytes until a newline is encountered, using recvmsg.
// Any FDs received are closed since line-reading doesn't expect them.
func (p *FDPasser) ReadLine() ([]byte, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would one want this?


// JSON-RPC 2.0 protocol constants
const (
JSONRPCVersion = "2.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extend the store with splitfdstream capabilities exposed via a
UNIX socket for JSON-RPC communication.

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe
Copy link
Member Author

still PoC, so you can avoid a review as it is not yet ready but we can use to analyze the complexity.

Moved to use https://github.com/cgwalters/jsonrpc-fdpass-go

Now the code size is almost 50% than the last revision

Implement the SplitFDStreamDriver interface for the overlay driver,
enabling efficient layer operations with reflink support.

Signed-off-by: Giuseppe Scrivano <[email protected]>
@packit-as-a-service
Copy link

Packit jobs failed. @containers/packit-build please check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

storage Related to "storage" package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants