
Conversation


@tyb0807 tyb0807 commented Jan 2, 2026

When accessing shared memory allocated by wave.allocate, the read/write ops are supposed to operate on the distributed shape and not the logical shape. This PR implements the logic to handle this when lowering read/write ops to vector.load/store, transforming logical indices to distributed indices.

Fixes #659.


@ftynse ftynse left a comment


Have you thought of alternatives that don't require jumping through hoops in lowering?

Comment on lines +282 to +300
mlir::ArrayAttr indexAttr = getIndexAttr();
if (indexAttr && !indexAttr.empty())
return false; // Regular allocations have index expressions.

wave::WaveExprListAttr distributedShape = getDistributedShape();

  // The distributed shape must have an empty symbol list.
if (!distributedShape.getSymbols().empty())
return false;

// Distributed shape rank must be 1 (flattened).
if (distributedShape.getMap().getNumResults() != 1)
return false;

// Distributed shape must be constant.
if (!distributedShape.getMap().isConstant())
return false;

return true;

Don't put more than a couple of lines of code inline in a .td.
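A minimal sketch of how this could be restructured, assuming the snippet is the body of a predicate on the allocate op (the class and method names below are hypothetical): keep only a declaration in the .td and move the body into the dialect's .cpp.

```cpp
// Sketch only: AllocateOp and hasFlattenedConstantDistributedShape are
// placeholder names; the body mirrors the inlined .td code quoted above.
bool AllocateOp::hasFlattenedConstantDistributedShape() {
  // Regular allocations carry index expressions.
  mlir::ArrayAttr indexAttr = getIndexAttr();
  if (indexAttr && !indexAttr.empty())
    return false;

  // Require an empty symbol list and a rank-1 (flattened), constant shape.
  wave::WaveExprListAttr distributedShape = getDistributedShape();
  return distributedShape.getSymbols().empty() &&
         distributedShape.getMap().getNumResults() == 1 &&
         distributedShape.getMap().isConstant();
}
```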

Comment on lines +100 to +104
if (!llvm::hasSingleElement(indexAttr.getValue())) {
return op->emitError() << "'index' attribute must contain exactly one "
"dictionary for this op, got "
<< indexAttr.size();
}

I don't see a test for this error message.


auto indexDict = dyn_cast<DictionaryAttr>(indexAttr[0]);
if (!indexDict)
return success(); // Empty dictionary is valid

This logic contradicts the message above, which claims the attribute "must contain exactly one" dictionary, whereas this code clearly accepts there being zero dictionaries.

if (!indexDict)
return success(); // Empty dictionary is valid

// Check count matches

Systematically use full stops at the end of the sentence.

Comment on lines +112 to +115
return op->emitError() << "number of index expressions ("
<< indexDict.size()
<< ") must match logical shape rank ("
<< tensorShape.size() << ")";

This is a low-value diagnostic that suppresses the higher-value diagnostic below. If there is a missing symbol, it will just say "size mismatch" without saying what is missing, leaving the user to figure out something the code should have done for them. You can instead keep the diagnostic below and add a diagnostic for symbols that are present in the index expression but not used. Arguably, this should be an error and not a warning, since nothing bad will happen due to the unnecessary symbol being present.
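A sketch of the suggested direction, assuming a hypothetical logicalSymbolNames container holding the symbol names of the logical shape (not code from this PR): keep the per-symbol check and report each unused symbol by name instead of a bare size comparison.

```cpp
// Sketch only: logicalSymbolNames stands in for however the logical shape's
// symbol names are obtained; report unused symbols explicitly.
for (mlir::NamedAttribute entry : indexDict) {
  if (!llvm::is_contained(logicalSymbolNames, entry.getName().strref()))
    return op->emitError() << "index expression refers to symbol '"
                           << entry.getName().strref()
                           << "' that is not used by the logical shape";
}
```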

Comment on lines +457 to +469
// Check if the source is a MemRefType with shared memory.
if (auto sourceMemrefType =
dyn_cast<MemRefType>(sourceValue.getType())) {
if (auto addrSpace = sourceMemrefType.getMemorySpace()) {
if (auto gpuAddrSpace =
dyn_cast<gpu::AddressSpaceAttr>(addrSpace)) {
if (gpuAddrSpace.getValue() == gpu::AddressSpace::Workgroup) {
SmallVector<int64_t> shape(sourceMemrefType.getShape().begin(),
sourceMemrefType.getShape().end());
return shape;
}
}
}

This long block is similar to the one above and could have been turned into a function/lambda.
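For illustration, one possible shape of such a helper (a sketch; the function name is made up, the logic mirrors the nested block above):

```cpp
// Sketch only: returns the static shape of a memref value if it lives in GPU
// workgroup (shared) memory, std::nullopt otherwise.
static std::optional<SmallVector<int64_t>> getWorkgroupMemrefShape(Value value) {
  auto memrefType = dyn_cast<MemRefType>(value.getType());
  if (!memrefType)
    return std::nullopt;
  auto addrSpace = llvm::dyn_cast_if_present<gpu::AddressSpaceAttr>(
      memrefType.getMemorySpace());
  if (!addrSpace || addrSpace.getValue() != gpu::AddressSpace::Workgroup)
    return std::nullopt;
  return SmallVector<int64_t>(memrefType.getShape().begin(),
                              memrefType.getShape().end());
}
```

The second occurrence further down (which only assigns memoryOperand) could reuse the same helper and branch on whether it returns a value.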

Comment on lines +570 to +580
}
}
}
}
}
}
}
}
}
}
}

I counted 11 levels of indentation???????????

func.func @lower_alloc_view() attributes {wave.hyperparameters = #wave.hyperparameters<{BLOCK_M = 4, BLOCK_K = 28, M = 128, N=128, K= 128}>} {
// CHECK: %[[BUFF:.*]] = memref.alloc() : memref<256xi8, #gpu.address_space<workgroup>>
%parent = wave.allocate { distributed_shape = #wave.expr_list<[] -> (256)> }
// CHECK: %[[BUFF:.*]] = memref.alloc() : memref<2097152xi8, #gpu.address_space<workgroup>>

I would much rather add a new test.

Comment on lines +313 to +319
// Child allocations are exempt from rank constraints (this should be valid)
// distributed_shape has rank 1, logical shape has rank 2
// CHECK: %[[PARENT:.*]] = wave.allocate {distributed_shape = #wave.expr_list<[] -> (256)>}
// CHECK: wave.allocate in %[[PARENT]]
%buf = wave.allocate in %parent : !wave.tensor<[@M, @K] of i8, <shared>>
{ distributed_shape = #wave.expr_list<[#wave.symbol<"BLOCK_M">] -> (BLOCK_M)>, offset = 128}
: !wave.tensor<[@M, @K] of bf16, <shared>>

This doesn't make sense to me; why is it okay to have a lower-rank distributed shape here?

Comment on lines +562 to +569
if (auto sourceMemrefType =
dyn_cast<MemRefType>(sourceValue.getType())) {
if (auto addrSpace = sourceMemrefType.getMemorySpace()) {
if (auto gpuAddrSpace =
dyn_cast<gpu::AddressSpaceAttr>(addrSpace)) {
if (gpuAddrSpace.getValue() ==
gpu::AddressSpace::Workgroup) {
memoryOperand = sourceValue;

This yet again looks identical to the code I've seen above.


@ftynse ftynse left a comment


.


tyb0807 commented Jan 5, 2026

> Have you thought of alternatives that don't require jumping through hoops in lowering?

Yes, and I came to the conclusion that doing this transformation during lowering might be the least bad option.

Just to clarify the context/problem: when lowering wave.read/wave.write ops that access LDS memory from wave.allocate, the allocated memref has a distributed shape that is different from the logical shape (WaveTensorType shape). The type converter creates an unrealized_conversion_cast to bridge this mismatch, but the lowered read/write ops need to use the distributed memref for correct memory access.

The alternatives I considered are:

  1. Create a pre-lowering pass that propagates the distributed_shape attribute from wave.allocate to the read/write ops that access the allocated LDS memory. This could use dataflow analysis or not, though in this case I think there's exactly one source (the wave.allocate) and no conflicts are possible, so dataflow analysis would be overkill. Anyway, the problem is not "how to propagate the attribute": even with the attribute, the lowering still needs to look through the cast to get the actual memref value. So the attribute becomes just a "signal" that the cast exists, which is not really useful, and we still need to "jump through hoops" to get that memref value in this approach (see the sketch after this list).

  2. Create a post-lowering pass to reconcile the casts, i.e. walk the unrealized_conversion_cast ops and replace uses with the source memrefs. This is also messy: we can't just call replaceAllUsesWith, since we need to recreate each vector op with the correct memref type. Furthermore, we would need to handle the different flavors of vector read/write ops (e.g. vector.load, vector.store, vector.transfer_read, vector.transfer_write, vector.maskedload, vector.maskedstore). I don't think there's a trait/interface that groups these (and only these) ops, right?

  3. Make wave.read/write ops accept a memref-typed memory operand. Then we still need to split the lowering into phases: convert wave.allocate to memref.alloc first, then have the main lowering pass convert the read/write ops, which now take memref memory operands. This doesn't sound like any less work, tbh.
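To make the "look through the cast" step above concrete, here is roughly what it amounts to (an illustrative helper, not code from this PR):

```cpp
// Sketch only: recover the original memref feeding the
// unrealized_conversion_cast that the type converter inserted.
static Value lookThroughConversionCast(Value value) {
  if (auto castOp = value.getDefiningOp<mlir::UnrealizedConversionCastOp>())
    if (castOp.getInputs().size() == 1)
      return castOp.getInputs().front();
  return value;
}
```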

What other alternatives do you have in mind? Please shed some light on this matter!

@tyb0807 tyb0807 closed this Jan 5, 2026
@tyb0807 tyb0807 reopened this Jan 5, 2026

tyb0807 commented Jan 5, 2026

Ok, the more I think about this, the more I feel the cleanest solution would be:

  • Create a pre-lowering pass ResolveDistributedAllocations that transforms wave.allocate to produce a memref-typed result.
  • Make wave.read/write ops accept a memref-typed memory operand, so the pass above creates valid IR.
  • Bonus: create a wave.view op that takes a memory value created by wave.allocate and returns a memref of its distributed shape. This can only handle child allocations though, so we'll still need wave.allocate to be able to produce a memref-typed result for normal allocations. Or we can have a no-op wave.view for normal allocations, where the distributed shape equals the logical shape.

WDYT?


ftynse commented Jan 6, 2026

> Yes, and I came to the conclusion that doing this transformation during lowering might be the least bad option.

Perhaps. This is something that a commit/PR message can explain, so the reviewers and future contributors can understand the reasoning, not guess it.

> What other alternatives do you have in mind? Please shed some light on this matter!

I did not really think about it; I'm just reviewing what I see. Given the conjunction of conditions in the code that are required for the lowering to work, it looks excessively fragile, which makes me worried about its scalability and longer-term health. Hence the question: it may well be the best available approach, or there may be others; I just don't know and don't have enough information to judge, so I'm requesting that information. If I had to figure it out myself, I might as well write the code at that point.

The specific part that appears problematic to me, though it was already present in the code, is having to "look through" the unrealized conversion cast. This feels like leaking the abstraction of the dialect conversion, which may be the wrong abstraction here, as we have the same wave tensor type converted to different upstream types depending on usage.

> Ok, the more I think about this, the more I feel the cleanest solution would be: [...]

Indeed, this looks like the least "internally complex" approach. It can be similar to, or even integrate with, the one we have for register-resident tensors. Note that it is possible to call setType on a value; the only time you need to re-create an operation is when its number of results changes. Note also that calling setType from within dialect conversion will likely cause deep problems with the infra, so it will have to be a separate pass.
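To illustrate the setType point, a minimal sketch under the assumption of a standalone pass (allocOp and distributedMemrefType are placeholders):

```cpp
// Sketch only: outside of dialect conversion, the result type can be updated
// in place; the op is not re-created because its number of results is
// unchanged.
allocOp.getResult().setType(distributedMemrefType);
```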

> This doesn't sound like any less work, tbh.

It may well be the same amount of work, but it is consistent with the approach taken for register-resident tensors that are converted to vectors, so there is less overall complexity in the system (one approach rather than two). I don't necessarily insist on one approach or another; I haven't thought too much about the problem. But the more complex the logic is, the stronger the arguments behind it should be.

@tyb0807 tyb0807 closed this Jan 8, 2026

tyb0807 commented Jan 8, 2026

Superseded by #677, #684 and #686.

@tyb0807 tyb0807 deleted the dist_shape branch January 8, 2026 06:45