
Add TurboSSDInferenceModule for HSTU serving integration #5560

Open
goldcoderZ wants to merge 2 commits into pytorch:main from goldcoderZ:export-D98830743

Conversation

@goldcoderZ
Contributor

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2525

Add a serving-friendly wrapper module that enables Video Retrieval HSTU
models (VDD, New2, IFU) to use FBGEMM SSD TBE with TurboSSD v2 features
instead of EmbeddingDB.

TurboSSDInferenceModule provides:

1. Single-call forward: auto-prefetch + lookup (vs. EmbeddingDB's
   synchronous SSD reads). Uses GPU HBM cache with LRU eviction.

2. streaming_update() + load_snapshot(): streaming delta updates and
   zero-downtime snapshot transitions, matching EmbeddingDB's streaming
   delta table feature.

3. Factory method from_embedding_specs(): auto-sizes HBM cache based on
   target hit rate and optional HBM budget. Useful for capacity planning
   on H100 (96 GB) and MI350X (288 GB).

4. estimate_hbm_gb(): static method for HBM capacity planning without
   instantiating the module.
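The auto-prefetch + lookup flow of item 1 can be pictured with a toy LRU row cache. This is purely an illustrative sketch under assumed semantics — the real module runs fused GPU kernels against an HBM-resident cache, and `TinyHBMCache` and its fields are hypothetical names, not the FBGEMM API:

```python
from collections import OrderedDict

import numpy as np


class TinyHBMCache:
    """Toy LRU row cache illustrating single-call forward:
    prefetch misses from "SSD" (here just a numpy array standing in for
    RocksDB), then serve every lookup from the cache."""

    def __init__(self, backing: np.ndarray, capacity: int):
        self.backing = backing        # stands in for the SSD-resident table
        self.capacity = capacity      # max rows resident in "HBM"
        self.cache = OrderedDict()    # row id -> row, in LRU order

    def forward(self, indices):
        rows = []
        for idx in indices:
            if idx not in self.cache:                 # prefetch on miss
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)    # evict LRU row
                self.cache[idx] = self.backing[idx]
            else:
                self.cache.move_to_end(idx)           # refresh LRU position
            rows.append(self.cache[idx])
        return np.stack(rows)
```

A lookup sequence like `forward([0, 1, 0, 2])` with capacity 2 ends with rows 0 and 2 resident: row 1 is the least recently used entry when row 2 misses, so it is evicted.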

This is the integration layer between SSD TBE and the TGIF serving
framework. The module can replace SSDEmbeddingDBSplitTableBatchedEmbeddingBagsCodegen
in the model graph via DIShardingPass.
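For the capacity-planning numbers above, a back-of-the-envelope version of the estimate_hbm_gb() helper might look like the following. The PR does not show the actual formula; the uniform-access assumption and the rounding to ASSOC-wide cache sets below are guesses for illustration:

```python
ASSOC = 32  # slots per cache set in the SSD TBE kernels (per the PR)


def estimate_hbm_gb(num_rows: int, dim: int, dtype_bytes: int,
                    target_hit_rate: float) -> float:
    """Rough HBM needed to keep `target_hit_rate` of all rows cached.

    Assumes a roughly uniform access distribution, so hit rate scales
    linearly with cached rows. Real traffic is skewed toward hot rows,
    which makes this estimate an upper bound.
    """
    cached_rows = int(num_rows * target_hit_rate)
    # Round up to a whole number of ASSOC-wide cache sets.
    cached_rows = -(-cached_rows // ASSOC) * ASSOC
    return cached_rows * dim * dtype_bytes / 2**30
```

Under these assumptions, 1B rows of 128-dim int8 embeddings at a 75% target hit rate need about 89.4 GB, which just fits the 96 GB of an H100.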

Differential Revision: D98830743

Summary:

X-link: facebookresearch/FBGEMM#2521

Add streaming delta update support to the SSD TBE inference operator
(SSDIntNBitTableBatchedEmbeddingBags), closing the gap with EmbeddingDB's
streaming delta table feature.

Two new public methods:

1. streaming_update(indices, weights) — writes updated embedding rows to
   RocksDB and invalidates corresponding HBM cache entries so subsequent
   prefetch() calls reload from SSD. Uses vectorized set-associative
   cache invalidation.

2. load_snapshot(ssd_storage_directory, ...) — flushes the current
   RocksDB, opens a new instance at a fresh directory, and fully
   invalidates the HBM cache. Enables zero-downtime snapshot transitions.
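The "vectorized set-associative cache invalidation" of item 1 can be sketched in NumPy. This is illustrative only — the real implementation is a C++/CUDA kernel, and the cache layout here (a `(num_sets, assoc)` table of cached row ids, with -1 for empty slots, rows hashed to sets by modulo) is an assumption:

```python
import numpy as np


def invalidate_cache_lines(cache_index: np.ndarray,
                           updated_indices: np.ndarray,
                           num_sets: int) -> np.ndarray:
    """For each updated embedding row, clear any slot in its home set
    that currently holds that row, so the next prefetch() reloads it
    from SSD.

    cache_index: (num_sets, assoc) array of cached row ids, -1 = empty.
    """
    sets = updated_indices % num_sets                  # home set per row
    # Compare each updated row against all slots of its set at once.
    hits = cache_index[sets] == updated_indices[:, None]
    rows, slots = np.nonzero(hits)
    cache_index[sets[rows], slots] = -1                # mark slot invalid
    return cache_index
```

The comparison is fully vectorized across updated rows and slots, mirroring how a GPU kernel would assign one warp per updated index and one lane per slot in the set.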

Also adds AMD/ROCm awareness:
- IS_ROCM detection flag
- Constructor warns when running on ROCm (streaming API works but
  prefetch/forward C++ kernels are NVIDIA-only due to ASSOC=32 vs
  kWarpSize=64 mismatch)
- Docstring documents AMD support status per method

These are the minimal primitives needed for online training models with
streaming embedding updates (e.g., 45-min publish intervals).
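The load_snapshot() sequence (flush, reopen at a new directory, full HBM cache invalidation) can be mimicked with a toy store. All names and structure here are hypothetical, and RocksDB is stubbed out with plain Python containers:

```python
class ToySSDStore:
    """Toy stand-in for the RocksDB-backed embedding store, showing the
    zero-downtime snapshot transition described above."""

    def __init__(self, directory: str):
        self.directory = directory
        self.hbm_cache = {}   # stands in for the GPU HBM row cache
        self.pending = []     # buffered writes not yet on "disk"

    def flush(self) -> None:
        # Real op: flush RocksDB memtables / WAL to SSD.
        self.pending.clear()

    def load_snapshot(self, new_directory: str) -> None:
        self.flush()                     # 1. persist in-flight writes
        self.directory = new_directory   # 2. open new RocksDB instance
        self.hbm_cache.clear()           # 3. invalidate entire HBM cache
```

Because serving keeps reading from the (now cold) cache and refilling it via prefetch, no request has to wait for the whole snapshot to load, which is what makes the transition zero-downtime.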

Differential Revision: D98827795
@meta-cla bot added the cla signed label on Mar 31, 2026
@meta-codesync
Contributor

meta-codesync bot commented Mar 31, 2026

@goldcoderZ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98830743.

