Extend `permute_2D_sparse_data` with optional pre-allocated output buffers (#5461) by TroyGarden · Pull Request #5461 · pytorch/FBGEMM

TroyGarden · 2026-03-09T05:56:17Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2435

1. Context

In TrainPipelineSparseDist, input distribution runs on a separate data_dist_stream. Memory snapshot analysis revealed that KJT allocations happen inside torch.ops.fbgemm.permute_2D_sparse_data (called from jagged_tensor.py). These allocations on data_dist_stream require record_stream when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks).

By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to data_dist_stream), we eliminate the cross-stream allocation and the need for record_stream, recovering the ~2 GB memory overhead.

2. Approach

Optional output parameters: Added three optional tensor parameters (permuted_lengths_out, permuted_indices_out, permuted_weights_out) to the existing permute_2D_sparse_data operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default None), behavior is identical to today — fully backward compatible.
Schema extension: Updated the TORCH_LIBRARY_FRAGMENT schema for both permute_2D_sparse_data and permute_sparse_data (legacy alias) with the three new Tensor? parameters defaulting to None.
CPU and CUDA implementations: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic.
Python meta implementation: Updated the abstract/meta implementation in sparse_ops.py with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly.
Unit test: Added test_permute_indices_with_preallocated_output using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share data_ptr() with the pre-allocated buffers (zero-copy).
Benchmark: Added permute_2d_benchmark in TorchRec comparing default vs pre-allocated allocation paths.

3. Results

benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50)

short name	GPU Runtime (P90)	CPU Runtime (P90)	GPU Peak Mem alloc (P90)	GPU Peak Mem reserved (P90)	GPU Mem used (P90)	Malloc retries (P50/P90/P100)	CPU Peak RSS (P90)
permute_2d_default	0.10 ms	0.48 ms	0.01 GB	0.02 GB	1.07 GB	0.0 / 0.0 / 0.0	1.15 GB
permute_2d_preallocated	0.15 ms	0.10 ms	0.01 GB	0.02 GB	1.07 GB	0.0 / 0.0 / 0.0	1.30 GB

CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs).
GPU runtime: identical (~0.1 ms) — kernel execution unchanged.

repro commands

buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \
  --num_features=170 --batch_size=128 --mean_pooling_factor=50

trace - manifold folder

name	trace	memory
permute_2d_default	Perf Doctor / Perfetto	memory
permute_2d_preallocated	Perf Doctor / Perfetto	memory

4. Analysis

Backward compatibility: All new parameters default to None/std::nullopt. Existing callers (including permute_sequence_embeddings, permute_2D_sparse_data_input1D, and all TorchRec call sites) are unchanged and pass through the default path.
CPU speedup source: The ~5x CPU runtime improvement comes from eliminating at::empty() calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost.
GPU runtime unchanged: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation.
No validation on pre-allocated buffers: The implementation trusts callers to provide correctly sized buffers. This is consistent with other _out patterns in PyTorch/FBGEMM and avoids runtime overhead.

5. Changes

sparse_ops_cpu.cpp: Updated permute_2D_sparse_data and permute_sparse_data schema registration with 3 new optional Tensor? params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated permute_sequence_embeddings_cpu and permute_2D_sparse_data_input1D_cpu call sites to pass std::nullopt.
sparse_permute_2d.cu: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated permute_2D_sparse_data_input1D_cuda call site.
sparse_permute_embeddings.cu: Updated permute_sequence_embeddings_cuda call site to pass std::nullopt for the 3 new params.
sparse_ops.h: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by TORCH_FN macro used in FBGEMM_OP_DISPATCH).
sparse_ops.py: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility.
permute_indices_test.py: Added test_permute_indices_with_preallocated_output — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification.
permute_2d_benchmark.py (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support.
torchrec/sparse/tests/BUCK: Added permute_2d_benchmark python_binary target.

Reviewed By: q10

Differential Revision: D95757955

meta-codesync · 2026-03-09T05:56:25Z

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95757955.

…ffers (meta-pytorch#3846) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: pytorch/FBGEMM#5461 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Differential Revision: D95757955

…ffers (pytorch#5461) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: meta-pytorch/torchrec#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Differential Revision: D95757955

…ffers (meta-pytorch#3846) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: pytorch/FBGEMM#5461 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (pytorch#5461) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: meta-pytorch/torchrec#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (meta-pytorch#3846) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: pytorch/FBGEMM#5461 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (pytorch#5461) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: meta-pytorch/torchrec#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (meta-pytorch#3846) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: pytorch/FBGEMM#5461 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (pytorch#5461) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: meta-pytorch/torchrec#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (pytorch#5461) Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2435 Pull Request resolved: pytorch#5461 X-link: meta-pytorch/torchrec#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (meta-pytorch#3846) Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2435 X-link: pytorch/FBGEMM#5461 Pull Request resolved: meta-pytorch#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (pytorch#5461) Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2435 Pull Request resolved: pytorch#5461 X-link: meta-pytorch/torchrec#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (meta-pytorch#3846) Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/2435 X-link: pytorch/FBGEMM#5461 Pull Request resolved: meta-pytorch#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

…ffers (pytorch#5461) Summary: X-link: facebookresearch/FBGEMM#2435 X-link: meta-pytorch/torchrec#3846 ## 1. Context In `TrainPipelineSparseDist`, input distribution runs on a separate `data_dist_stream`. Memory snapshot analysis revealed that KJT allocations happen **inside** `torch.ops.fbgemm.permute_2D_sparse_data` (called from `jagged_tensor.py`). These allocations on `data_dist_stream` require `record_stream` when the tensors are later consumed on the default stream, which delays memory reclamation by the CUDA caching allocator (~2 GB overhead observed in production benchmarks). By allowing callers to pass in pre-allocated output buffers (allocated on the main stream before switching to `data_dist_stream`), we eliminate the cross-stream allocation and the need for `record_stream`, recovering the ~2 GB memory overhead. ## 2. Approach 1. **Optional output parameters**: Added three optional tensor parameters (`permuted_lengths_out`, `permuted_indices_out`, `permuted_weights_out`) to the existing `permute_2D_sparse_data` operator. When provided, the op writes into the pre-allocated buffers instead of allocating new ones. When not provided (default `None`), behavior is identical to today — fully backward compatible. 2. **Schema extension**: Updated the `TORCH_LIBRARY_FRAGMENT` schema for both `permute_2D_sparse_data` and `permute_sparse_data` (legacy alias) with the three new `Tensor?` parameters defaulting to `None`. 3. **CPU and CUDA implementations**: At each allocation point in both CPU and CUDA kernels, added a conditional: use the provided buffer if present, otherwise allocate as before. No changes to kernel launch parameters or compute logic. 4. **Python meta implementation**: Updated the abstract/meta implementation in `sparse_ops.py` with the same conditional allocation pattern, ensuring PT2/torch.compile FakeTensor tracing works correctly. 5. **Unit test**: Added `test_permute_indices_with_preallocated_output` using hypothesis to verify correctness on both CPU and GPU, and that returned tensors share `data_ptr()` with the pre-allocated buffers (zero-copy). 6. **Benchmark**: Added `permute_2d_benchmark` in TorchRec comparing default vs pre-allocated allocation paths. ## 3. Results * benchmark (GB200, num_features=170, batch_size=128, mean_pooling_factor=50) |short name|GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)| |--|--|--|--|--|--|--|--| |permute_2d_default|0.10 ms|0.48 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.15 GB| |permute_2d_preallocated|0.15 ms|0.10 ms|0.01 GB|0.02 GB|1.07 GB|0.0 / 0.0 / 0.0|1.30 GB| CPU runtime: 0.48 ms -> 0.10 ms (~5x faster with pre-allocated outputs). GPU runtime: identical (~0.1 ms) — kernel execution unchanged. * repro commands ``` buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:permute_2d_benchmark -- \ --num_features=170 --batch_size=128 --mean_pooling_factor=50 ``` * trace - [manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955) |name|trace|memory| |--|--|--| |permute_2d_default|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_default-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_default-rank0.pickle)| |permute_2d_preallocated|[Perf Doctor](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces) / [Perfetto](https://www.internalfb.com/intern/kernelhub/perfetto?trace_path=tree/permanent_traces/DIFF/D95757955/trace-permute_2d_preallocated-rank0.json.gz&bucket=torchrec_benchmark_traces)|[memory](https://www.internalfb.com/pytorch_memory_visualizer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D95757955/memory-permute_2d_preallocated-rank0.pickle)| ## 4. Analysis 1. **Backward compatibility**: All new parameters default to `None`/`std::nullopt`. Existing callers (including `permute_sequence_embeddings`, `permute_2D_sparse_data_input1D`, and all TorchRec call sites) are unchanged and pass through the default path. 2. **CPU speedup source**: The ~5x CPU runtime improvement comes from eliminating `at::empty()` calls inside the operator. These calls go through PyTorch's allocator dispatch, which has non-trivial overhead for small tensors. Pre-allocating outside the hot path amortizes this cost. 3. **GPU runtime unchanged**: The CUDA kernel itself is identical — only the host-side allocation is skipped. GPU compute time is dominated by the permutation kernel, not memory allocation. 4. **No validation on pre-allocated buffers**: The implementation trusts callers to provide correctly sized buffers. This is consistent with other `_out` patterns in PyTorch/FBGEMM and avoids runtime overhead. ## 5. Changes 1. **`sparse_ops_cpu.cpp`**: Updated `permute_2D_sparse_data` and `permute_sparse_data` schema registration with 3 new optional `Tensor?` params. Updated CPU implementation to use provided buffers at 3 allocation points. Updated `permute_sequence_embeddings_cpu` and `permute_2D_sparse_data_input1D_cpu` call sites to pass `std::nullopt`. 2. **`sparse_permute_2d.cu`**: Updated CUDA implementation signature and 3 allocation points with the same use-if-provided pattern. Updated `permute_2D_sparse_data_input1D_cuda` call site. 3. **`sparse_permute_embeddings.cu`**: Updated `permute_sequence_embeddings_cuda` call site to pass `std::nullopt` for the 3 new params. 4. **`sparse_ops.h`**: Updated CPU and CUDA declarations with 3 new optional params (no default values — required by `TORCH_FN` macro used in `FBGEMM_OP_DISPATCH`). 5. **`sparse_ops.py`**: Updated Python meta/abstract implementation with conditional allocation logic for PT2/torch.compile compatibility. 6. **`permute_indices_test.py`**: Added `test_permute_indices_with_preallocated_output` — hypothesis-based test covering CPU and GPU, correctness and zero-copy verification. 7. **`permute_2d_benchmark.py`** (new): Benchmark comparing default vs pre-allocated allocation paths with memory snapshot support. 8. **`torchrec/sparse/tests/BUCK`**: Added `permute_2d_benchmark` python_binary target. Reviewed By: q10 Differential Revision: D95757955

meta-cla bot added the cla signed label Mar 9, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 9, 2026

TroyGarden force-pushed the export-D95757955 branch from a880f90 to 548d6d4 Compare March 9, 2026 16:49

TroyGarden force-pushed the export-D95757955 branch from 548d6d4 to 928c793 Compare March 10, 2026 01:29

TroyGarden force-pushed the export-D95757955 branch from 928c793 to b03d61d Compare March 11, 2026 00:14

TroyGarden force-pushed the export-D95757955 branch from b03d61d to 7d3046c Compare March 11, 2026 00:15

TroyGarden force-pushed the export-D95757955 branch from 7d3046c to 9a20d7e Compare March 11, 2026 00:17

TroyGarden force-pushed the export-D95757955 branch from 9a20d7e to 6ee1735 Compare March 11, 2026 00:21

meta-codesync bot changed the title ~~Extend permute_2D_sparse_data with optional pre-allocated output buffers~~ Extend permute_2D_sparse_data with optional pre-allocated output buffers (#5461) Apr 3, 2026

TroyGarden force-pushed the export-D95757955 branch from 6ee1735 to e24ac2f Compare April 3, 2026 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend `permute_2D_sparse_data` with optional pre-allocated output buffers (#5461)#5461

Extend `permute_2D_sparse_data` with optional pre-allocated output buffers (#5461)#5461
TroyGarden wants to merge 1 commit intopytorch:mainfrom
TroyGarden:export-D95757955

TroyGarden commented Mar 9, 2026 •

edited by meta-codesync bot

Loading

Uh oh!

meta-codesync bot commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TroyGarden commented Mar 9, 2026 • edited by meta-codesync bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Context

2. Approach

3. Results

4. Analysis

5. Changes

Uh oh!

meta-codesync bot commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TroyGarden commented Mar 9, 2026 •

edited by meta-codesync bot

Loading