
Fix explicit sharding for Deepseek #3595

Open
Shuwen-Fang wants to merge 7 commits into main from explicitpp

Conversation

@Shuwen-Fang (Collaborator) commented Apr 7, 2026

Description

This PR fixes explicit sharding for Deepseek by specifying the correct sharding for expert weights, and adds tests with ds3-test for different parallelisms.
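Explicit sharding here means each expert-weight array's logical axes are mapped onto named mesh axes. A minimal stand-alone sketch of that idea (the rule table and helper below are illustrative only, not MaxText's actual code or axis names):

```python
# Illustrative sketch: resolving logical axis names of an MoE expert weight
# into mesh axes, in the spirit of named-sharding rules. Hypothetical names.
EXPERT_AXIS_RULES = {
    "exp": "expert",   # expert dimension -> expert-parallel mesh axis
    "embed": "fsdp",   # embedding dimension -> fsdp mesh axis
    "mlp": "tensor",   # hidden dimension -> tensor-parallel mesh axis
}

def resolve_pspec(logical_axes):
  """Map each logical axis name to a mesh axis (None means unsharded)."""
  return tuple(EXPERT_AXIS_RULES.get(name) for name in logical_axes)

# An expert weight with logical shape (exp, embed, mlp):
print(resolve_pspec(("exp", "embed", "mlp")))  # ('expert', 'fsdp', 'tensor')
```

A logical axis with no rule (e.g. "batch") resolves to None, i.e. replicated along that dimension.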

Tests

  • Verified pytest tests/integration/smoke/train_smoke_test.py::Train -v passed
  • Added additional test coverage for explicit sharding with deepseek

Tested the following configs to validate correctness with deepseek3-test:

  • ici_fsdp_parallelism=2 ici_expert_parallelism=8 use_ring_of_experts=false
  • ici_fsdp_parallelism=4 ici_tensor_parallelism=4
  • ici_fsdp_parallelism=16
  • ici_fsdp_parallelism=8 ici_tensor_transpose_parallelism=2
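The ici parallelism factors of each config must multiply out to the number of devices in the slice; assuming the 16-device topology these configs imply, a quick sanity check:

```python
# Sanity check (assumes a 16-device slice, which all four configs imply):
# each config's ici_*_parallelism factors must multiply to the device count.
import math

DEVICE_COUNT = 16
configs = {
    "fsdp2_expert8": [2, 8],           # ici_fsdp=2, ici_expert=8
    "fsdp4_tensor4": [4, 4],           # ici_fsdp=4, ici_tensor=4
    "fsdp16": [16],                    # ici_fsdp=16
    "fsdp8_tensor_transpose2": [8, 2], # ici_fsdp=8, ici_tensor_transpose=2
}

for name, factors in configs.items():
  assert math.prod(factors) == DEVICE_COUNT, name
print("all configs cover", DEVICE_COUNT, "devices")
```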

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have added necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov (bot) commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@Shuwen-Fang Shuwen-Fang changed the title temp Fix explicit sharding for moe models Apr 7, 2026
@Shuwen-Fang Shuwen-Fang self-assigned this Apr 8, 2026
@Shuwen-Fang Shuwen-Fang changed the title Fix explicit sharding for moe models Fix explicit sharding for Deepseek Apr 8, 2026
]
)

@parameterized.named_parameters(
Collaborator Author

@NuojCheng these tests depend on the hardware type but should work for all topologies with >= 2 devices. wdyt?

Collaborator

Yes. I think we have v6e-4 for CI tests


w0_kernel = jnp.asarray(self.wi_0[...], self.dtype)
print("shuwen w0 kernel init:", jax.typeof(w0_kernel))
Collaborator

remember to remove

w1_bias = maybe_shard_with_name(w1_bias, w1_bias_ns, self.config.shard_mode)
if wo_bias is not None:
  wo_bias = maybe_shard_with_name(wo_bias, wo_bias_ns, self.config.shard_mode)

Collaborator

Just FYI, we did similar things but in a different style in attention_op.py, see

def _maybe_shard_with_pspec(inputs, pspec: jax.sharding.PartitionSpec | None):
  # decoder_segment_ids can be None
  if pspec is None:
    return None
  sharding = NamedSharding(self.mesh, pspec)
  return maybe_shard_with_name(
      inputs,
      sharding,
      shard_mode=self.config.shard_mode,
      debug_sharding=self.config.debug_sharding,
      extra_stack_level=1,
  )

It is optional to make the change.
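The guard-before-shard pattern above can be exercised in isolation. A minimal stand-in sketch, where the `shard_fn` callback is a hypothetical placeholder for the NamedSharding/`maybe_shard_with_name` machinery:

```python
def maybe_shard_with_pspec(inputs, pspec, shard_fn):
  """Apply shard_fn(inputs, pspec) unless pspec is None (input may be absent)."""
  # decoder_segment_ids can be None, so a None pspec skips sharding entirely
  if pspec is None:
    return None
  return shard_fn(inputs, pspec)

# Stand-in for device placement: just record the spec alongside the value.
tagged = maybe_shard_with_pspec([1, 2, 3], ("data",), lambda x, p: (x, p))
print(tagged)   # ([1, 2, 3], ('data',))
skipped = maybe_shard_with_pspec([1, 2, 3], None, lambda x, p: (x, p))
print(skipped)  # None
```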

Collaborator Author

updated


def main(argv: Sequence[str]) -> None:
  jax.config.update("jax_default_prng_impl", "unsafe_rbg")
  jax.config.update("jax_default_prng_impl", "unsafe_rbg")  # threefry2x32
Collaborator

What does it mean?

Collaborator Author

Will remove; I encountered some errors with unsafe_rbg when running on a VM.


def test_tiny_config_explicit_shardmode_deepseek(self):
  test_tmpdir = os.environ.get("TEST_TMPDIR")  # pylint: disable=unused-variable
  # Tests the Dense Matmul codepath
Collaborator

Let's remove the dense matmul test and just keep sparse matmul.

@Shuwen-Fang Shuwen-Fang requested a review from NuojCheng April 8, 2026 17:35
]
)

def test_tiny_config_explicit_shardmode_deepseek(self):
Collaborator

should we use train_compile test instead (for speed?)

    ("fsdp_expert_no_roe", ["ici_fsdp_parallelism=-1", "ici_expert_parallelism=2", "use_ring_of_experts=False"]),
    ("fsdp", ["ici_fsdp_parallelism=-1"]),
)
def test_parallelism_configs(self, parallelism_args):
Collaborator

I think this can be train_compile instead as well (compile only should be faster than running on real TPUs)
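The parameterized cases above pass overrides as `key=value` strings. A small parser sketch showing how such overrides resolve to typed config values (this is a hypothetical helper for illustration, not the test's actual plumbing):

```python
def parse_overrides(args):
  """Parse key=value override strings into a dict with light type coercion."""
  out = {}
  for arg in args:
    key, _, raw = arg.partition("=")
    if raw.lower() in ("true", "false"):
      out[key] = raw.lower() == "true"   # booleans like use_ring_of_experts
    else:
      try:
        out[key] = int(raw)              # parallelism degrees, incl. -1
      except ValueError:
        out[key] = raw                   # leave anything else as a string
  return out

cfg = parse_overrides(
    ["ici_fsdp_parallelism=-1", "ici_expert_parallelism=2", "use_ring_of_experts=False"]
)
print(cfg)
```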

Collaborator Author

@NuojCheng I think train_compile should work in this case. Was there a reason for using train instead of train_compile in the rest of this test?

Collaborator

I think either one works fine. We can merge this one and migrate all explicit sharding tests to train compile in another PR.

Collaborator Author

sounds good

3 participants