Conversation

@ShangkunLi (Collaborator)

In this PR:

  • Enable transforming the control flow of nested loops into data flow.
  • Fix an error in the previous code when adding the grant_predicate operation (specifically, the previous code applied grant_predicate to grant_always values).
  • Add tests for the new ctrl2data pass (2-level nested loop: bert_node1; 3-level nested loop: bert_node28).
  • Clean up some CMakeLists.txt files.

// operation.
Location loc =
    block->empty() ? block->getParent()->getLoc() : block->front().getLoc();
if (has_block_args) {
Contributor

What are we specializing here with has_block_args?

Collaborator Author

Because we need to handle blocks that do not have block arguments, like bb4 in this example.

module {
  func.func @_Z10bert_node1PA1_A1_A1_A1_A128_bPA1_A128_S1_(%arg0: memref<?x1x1x1x1x128xi8>, %arg1: memref<?x1x128x1x1x128xi8>) attributes {accelerator = "neura", llvm.linkage = #llvm.linkage<external>} {
    %0 = "neura.constant"() <{value = 1 : index}> : () -> index
    %1 = "neura.constant"() <{value = 128 : index}> : () -> index
    %2 = "neura.constant"() <{value = 0 : index}> : () -> index
    %3 = "neura.cast"(%2) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %3 : i64 to ^bb1
  ^bb1(%4: i64):  // 2 preds: ^bb0, ^bb5
    %5 = "neura.cast"(%4) <{cast_type = "int_to_index"}> : (i64) -> index
    %6 = "neura.icmp"(%5, %1) <{cmpType = "slt"}> : (index, index) -> i1
    neura.cond_br %6 : i1 then to ^bb2 else to ^bb6
  ^bb2:  // pred: ^bb1
    %7 = "neura.cast"(%2) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %7 : i64 to ^bb3
  ^bb3(%8: i64):  // 2 preds: ^bb2, ^bb4
    %9 = "neura.cast"(%8) <{cast_type = "int_to_index"}> : (i64) -> index
    %10 = "neura.icmp"(%9, %1) <{cmpType = "slt"}> : (index, index) -> i1
    neura.cond_br %10 : i1 then to ^bb4 else to ^bb5
  ^bb4:  // pred: ^bb3
    %11 = neura.load_indexed %arg0[%2, %2, %2, %2, %2, %9 : index, index, index, index, index, index] memref<?x1x1x1x1x128xi8> : i8
    neura.store_indexed %11 to %arg1[%2, %2, %5, %2, %2, %9 : index, index, index, index, index, index] memref<?x1x128x1x1x128xi8> : i8
    %12 = "neura.add"(%9, %0) : (index, index) -> index
    %13 = "neura.cast"(%12) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %13 : i64 to ^bb3
  ^bb5:  // pred: ^bb3
    %14 = "neura.add"(%5, %0) : (index, index) -> index
    %15 = "neura.cast"(%14) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %15 : i64 to ^bb1
  ^bb6:  // pred: ^bb1
    "neura.return"() : () -> ()
  }
}

The predecessor block of bb4 is bb3, and control reaches bb4 from bb3 through the cond_br. So in this implementation, we apply grant_predicate to each result in bb4 using the condition computed in bb3 (i.e., %10). The transformed code looks like:

%18 = "neura.icmp"(%17, %3) <{cmpType = "slt"}> : (!neura.data<index, i1>, !neura.data<index, i1>) -> !neura.data<i1, i1>
%19 = "neura.not"(%18) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
%20 = neura.load_indexed %arg0[%5, %5, %5, %5, %5, %17 : !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>] memref<?x1x1x1x1x128xi8> : !neura.data<i8, i1>
%21 = neura.grant_predicate %20, %18 : !neura.data<i8, i1>, !neura.data<i1, i1> -> !neura.data<i8, i1>
neura.store_indexed %21 to %arg1[%5, %5, %10, %5, %5, %17 : !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>] memref<?x1x128x1x1x128xi8> : !neura.data<i8, i1>
%22 = "neura.add"(%17, %1) : (!neura.data<index, i1>, !neura.data<index, i1>) -> !neura.data<index, i1>
%23 = neura.grant_predicate %22, %18 : !neura.data<index, i1>, !neura.data<i1, i1> -> !neura.data<index, i1>
%24 = "neura.cast"(%23) <{cast_type = "index_to_int"}> : (!neura.data<index, i1>) -> !neura.data<i64, i1>
%25 = neura.grant_predicate %24, %18 : !neura.data<i64, i1>, !neura.data<i1, i1> -> !neura.data<i64, i1>
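As a rough sketch of the idea above (the real pass is C++/MLIR; `Op`, `Block`, and the field names here are toy stand-ins, not MLIR's API), the handling of a block without block arguments could look like this: find the predecessor's branch condition and wrap each result produced in the block in a grant_predicate guarded by that condition.

```python
# Hypothetical sketch of predicating results in a block with no block
# arguments (like bb4 above). Not the actual pass implementation.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    operands: list = field(default_factory=list)
    results: list = field(default_factory=list)

@dataclass
class Block:
    label: str
    args: list = field(default_factory=list)
    ops: list = field(default_factory=list)
    # Each predecessor entry: (pred_label, branch_condition, taken_on_true)
    preds: list = field(default_factory=list)

def grant_predicates(block):
    """For a block with no block arguments, wrap each op result in a
    grant_predicate guarded by the predecessor's branch condition."""
    granted = []
    if block.args:
        return granted  # the block-arg case is handled via phi elsewhere
    for _pred, cond, on_true in block.preds:
        guard = cond if on_true else ("not", cond)
        for op in block.ops:
            for res in op.results:
                granted.append(Op("grant_predicate", [res, guard],
                                 [f"{res}.pred"]))
    return granted

# bb4's only predecessor is bb3, reached when %10 is true:
bb4 = Block("bb4", preds=[("bb3", "%10", True)],
            ops=[Op("load_indexed", ["%arg0"], ["%11"]),
                 Op("add", ["%9", "%0"], ["%12"])])
new_ops = grant_predicates(bb4)
print([op.results[0] for op in new_ops])  # ['%11.pred', '%12.pred']
```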

Contributor

  • Your test contains store_indexed, which is derived from the bert_nodexx.mlir, right? We didn't have a test with store_indexed before (except the bert ones).
  • Is has_block_args robust enough? What about a BB that has block args and also has non-block-arg live-ins?
  • Why is there %23 = neura.grant_predicate %22 -> neura.cast(%23)? The dataflow within a BB shouldn't need that grant_predicate, right?

Collaborator Author

Hmmm, I see the problem. Will fix it soon.

Comment on lines +39 to +40


Contributor

One additional line is enough.

@tancheng (Contributor)

  • Fix an error in the previous code when adding the grant_predicate operation (specifically, the previous code applied grant_predicate to grant_always values).

Can you use an example to explain this in the PR description?

@ShangkunLi (Collaborator Author)

  • Fix an error in the previous code when adding the grant_predicate operation (specifically, the previous code applied grant_predicate to grant_always values).

Can you use an example to explain this in the PR description?

Sure! In the previous implementation, the transformed branch_without_arg.mlir looked like:

func.func @test(%arg0: i64) -> f32 attributes {accelerator = "neura"} {
  %0 = "neura.constant"() <{predicate = true, value = 0 : i64}> : () -> !neura.data<i64, i1>
  %1 = "neura.constant"() <{predicate = true, value = 1.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %2 = "neura.grant_always"(%1) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %3 = "neura.constant"() <{predicate = true, value = 2.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %4 = "neura.grant_always"(%3) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %5 = "neura.constant"() <{predicate = true, value = 3.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %6 = "neura.grant_once"(%5) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %7 = "neura.constant"() <{predicate = true, value = 4.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %8 = "neura.grant_once"(%7) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %9 = "neura.icmp"(%arg0, %0) <{cmpType = "eq"}> : (i64, !neura.data<i64, i1>) -> !neura.data<i1, i1>
  %10 = "neura.grant_once"(%9) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %11 = neura.grant_predicate %6, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %12 = neura.grant_predicate %8, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %13 = "neura.not"(%10) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %14 = neura.grant_predicate %2, %13 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %15 = "neura.not"(%10) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %16 = neura.grant_predicate %4, %15 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %17 = "neura.fadd"(%14, %16) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %18 = "neura.fmul"(%11, %12) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %19 = "neura.phi"(%17, %18) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  "neura.return"(%19) : (!neura.data<f32, i1>) -> ()
}

You can see that %14 and %16 apply grant_predicate to two grant_always values, %2 and %4.

In the new implementation, the transformed IR looks like:

func.func @test(%arg0: i64) -> f32 attributes {accelerator = "neura"} {
  %0 = "neura.constant"() <{predicate = true, value = 0 : i64}> : () -> !neura.data<i64, i1>
  %1 = "neura.constant"() <{predicate = true, value = 1.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %2 = "neura.grant_always"(%1) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %3 = "neura.constant"() <{predicate = true, value = 2.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %4 = "neura.grant_always"(%3) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %5 = "neura.constant"() <{predicate = true, value = 3.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %6 = "neura.grant_once"(%5) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %7 = "neura.constant"() <{predicate = true, value = 4.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %8 = "neura.grant_once"(%7) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %9 = "neura.icmp"(%arg0, %0) <{cmpType = "eq"}> : (i64, !neura.data<i64, i1>) -> !neura.data<i1, i1>
  %10 = "neura.grant_once"(%9) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %11 = neura.grant_predicate %6, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %12 = neura.grant_predicate %8, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %13 = "neura.not"(%10) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %14 = "neura.fadd"(%2, %4) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %15 = neura.grant_predicate %14, %13 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %16 = "neura.fmul"(%11, %12) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %17 = "neura.phi"(%15, %16) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  "neura.return"(%17) : (!neura.data<f32, i1>) -> ()
}

We apply grant_predicate to the result %14 to ensure correctness.
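The difference between the two versions can be illustrated with a toy Python sketch (the real pass is C++/MLIR; the tuple-based op representation and helper names here are made up for illustration): a value produced by grant_always already carries an always-valid predicate, so the pass should predicate the computed result (%14 from the fadd) rather than re-granting the grant_always operands %2 and %4.

```python
# Hypothetical sketch of the fix described above. Ops are modeled as
# (name, operands, result) tuples; this is not the pass's actual API.

def needs_grant_predicate(op_name):
    """A grant_always result already carries an always-valid predicate,
    so wrapping it in grant_predicate again was the bug being fixed."""
    return op_name not in ("grant_always",)

def predicate_live_outs(ops, cond):
    """Predicate the results computed inside the conditional region,
    rather than their grant_always operands."""
    out = []
    for name, operands, result in ops:
        out.append((name, operands, result))
        if needs_grant_predicate(name):
            out.append(("grant_predicate", [result, cond], result + ".p"))
    return out

# Mirrors the else-branch above: %2 and %4 are grant_always values feeding
# an fadd. Only the fadd result gets predicated (with the negated cond %13):
ops = [("grant_always", ["%1"], "%2"),
       ("grant_always", ["%3"], "%4"),
       ("fadd", ["%2", "%4"], "%14")]
result = predicate_live_outs(ops, "%13")
print([r for (n, _, r) in result if n == "grant_predicate"])  # ['%14.p']
```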

@tancheng (Contributor)

We apply grant_predicate to the result %14 to ensure correctness.

  • Hmm, which lines in your code enable this feature?
  • I feel the original IR looks fine and consistent, in the sense that all of a BB's live-ins require a phi; your change makes the live-outs predicated instead.
    • Don't you think keeping the original form, but performing an additional fusion of const + grant_always + grant_predicate, sounds like a better solution?
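The fusion the reviewer suggests could be sketched as a peephole pattern (again a toy Python model, not the actual pass; `predicated_constant` is a made-up fused op name): match a constant -> grant_always -> grant_predicate chain and collapse it into one op that materializes the constant under the given predicate.

```python
# Hypothetical sketch of fusing const + grant_always + grant_predicate.
# Ops are (name, operands, result) tuples; names are illustrative only.

def fuse_chain(ops):
    """Replace grant_predicate ops whose source is a grant_always of a
    constant with a single fused 'predicated_constant' op."""
    by_result = {r: (n, o) for (n, o, r) in ops}
    fused = []
    for name, operands, result in ops:
        if name == "grant_predicate":
            src, cond = operands
            if by_result.get(src, ("",))[0] == "grant_always":
                const = by_result[src][1][0]
                if by_result.get(const, ("",))[0] == "constant":
                    fused.append(("predicated_constant", [const, cond], result))
                    continue
        fused.append((name, operands, result))
    return fused

# The chain %1 (constant) -> %2 (grant_always) -> %14 (grant_predicate)
# collapses to one fused op; the now-dead constant and grant_always would
# then be cleaned up by DCE:
ops = [("constant", [], "%1"),
       ("grant_always", ["%1"], "%2"),
       ("grant_predicate", ["%2", "%13"], "%14")]
print([n for n, _, _ in fuse_chain(ops)])
# ['constant', 'grant_always', 'predicated_constant']
```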

@ShangkunLi ShangkunLi closed this Jun 29, 2025