Conversation

@ShangkunLi (Collaborator)

In this PR:

  • Enable transforming the control flow of nested loops into data flow.
  • Fix an error in the previous code when adding the grant_predicate operation (specifically, the previous code applied grant_predicate to grant_always values).
  • Add tests for the new ctrl2data pass (2-level nested loop: bert_node1; 3-level nested loop: bert_node28).
  • Clean up some CMakeLists.txt files.

// operation.
Location loc =
    block->empty() ? block->getParent()->getLoc() : block->front().getLoc();
if (has_block_args) {
Contributor

What are we specializing here with has_block_args?

Collaborator Author

Because we need to handle blocks that do not have block arguments, like bb4 in this example.

module {
  func.func @_Z10bert_node1PA1_A1_A1_A1_A128_bPA1_A128_S1_(%arg0: memref<?x1x1x1x1x128xi8>, %arg1: memref<?x1x128x1x1x128xi8>) attributes {accelerator = "neura", llvm.linkage = #llvm.linkage<external>} {
    %0 = "neura.constant"() <{value = 1 : index}> : () -> index
    %1 = "neura.constant"() <{value = 128 : index}> : () -> index
    %2 = "neura.constant"() <{value = 0 : index}> : () -> index
    %3 = "neura.cast"(%2) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %3 : i64 to ^bb1
  ^bb1(%4: i64):  // 2 preds: ^bb0, ^bb5
    %5 = "neura.cast"(%4) <{cast_type = "int_to_index"}> : (i64) -> index
    %6 = "neura.icmp"(%5, %1) <{cmpType = "slt"}> : (index, index) -> i1
    neura.cond_br %6 : i1 then to ^bb2 else to ^bb6
  ^bb2:  // pred: ^bb1
    %7 = "neura.cast"(%2) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %7 : i64 to ^bb3
  ^bb3(%8: i64):  // 2 preds: ^bb2, ^bb4
    %9 = "neura.cast"(%8) <{cast_type = "int_to_index"}> : (i64) -> index
    %10 = "neura.icmp"(%9, %1) <{cmpType = "slt"}> : (index, index) -> i1
    neura.cond_br %10 : i1 then to ^bb4 else to ^bb5
  ^bb4:  // pred: ^bb3
    %11 = neura.load_indexed %arg0[%2, %2, %2, %2, %2, %9 : index, index, index, index, index, index] memref<?x1x1x1x1x128xi8> : i8
    neura.store_indexed %11 to %arg1[%2, %2, %5, %2, %2, %9 : index, index, index, index, index, index] memref<?x1x128x1x1x128xi8> : i8
    %12 = "neura.add"(%9, %0) : (index, index) -> index
    %13 = "neura.cast"(%12) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %13 : i64 to ^bb3
  ^bb5:  // pred: ^bb3
    %14 = "neura.add"(%5, %0) : (index, index) -> index
    %15 = "neura.cast"(%14) <{cast_type = "index_to_int"}> : (index) -> i64
    neura.br %15 : i64 to ^bb1
  ^bb6:  // pred: ^bb1
    "neura.return"() : () -> ()
  }
}

The predecessor block of bb4 is bb3, and control reaches bb4 from bb3 through the cond_br. So in this implementation, we apply grant_predicate to each result in bb4 using the condition computed in bb3 (i.e., %10). The transformed code looks like:

%18 = "neura.icmp"(%17, %3) <{cmpType = "slt"}> : (!neura.data<index, i1>, !neura.data<index, i1>) -> !neura.data<i1, i1>
%19 = "neura.not"(%18) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
%20 = neura.load_indexed %arg0[%5, %5, %5, %5, %5, %17 : !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>] memref<?x1x1x1x1x128xi8> : !neura.data<i8, i1>
%21 = neura.grant_predicate %20, %18 : !neura.data<i8, i1>, !neura.data<i1, i1> -> !neura.data<i8, i1>
neura.store_indexed %21 to %arg1[%5, %5, %10, %5, %5, %17 : !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>, !neura.data<index, i1>] memref<?x1x128x1x1x128xi8> : !neura.data<i8, i1>
%22 = "neura.add"(%17, %1) : (!neura.data<index, i1>, !neura.data<index, i1>) -> !neura.data<index, i1>
%23 = neura.grant_predicate %22, %18 : !neura.data<index, i1>, !neura.data<i1, i1> -> !neura.data<index, i1>
%24 = "neura.cast"(%23) <{cast_type = "index_to_int"}> : (!neura.data<index, i1>) -> !neura.data<i64, i1>
%25 = neura.grant_predicate %24, %18 : !neura.data<i64, i1>, !neura.data<i1, i1> -> !neura.data<i64, i1>
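As a rough sketch of the idea above (the real pass is C++/MLIR; `Op`, `Block`, and the field names here are toy stand-ins, not MLIR's API), the handling of a block without block arguments could look like this: find the predecessor's branch condition and wrap each result produced in the block in a grant_predicate guarded by that condition.

```python
# Hypothetical sketch of predicating results in a block with no block
# arguments (like bb4 above). Not the actual pass implementation.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    operands: list = field(default_factory=list)
    results: list = field(default_factory=list)

@dataclass
class Block:
    label: str
    args: list = field(default_factory=list)
    ops: list = field(default_factory=list)
    # Each predecessor entry: (pred_label, branch_condition, taken_on_true)
    preds: list = field(default_factory=list)

def grant_predicates(block):
    """For a block with no block arguments, wrap each op result in a
    grant_predicate guarded by the predecessor's branch condition."""
    granted = []
    if block.args:
        return granted  # the block-arg case is handled via phi elsewhere
    for _pred, cond, on_true in block.preds:
        guard = cond if on_true else ("not", cond)
        for op in block.ops:
            for res in op.results:
                granted.append(Op("grant_predicate", [res, guard],
                                 [f"{res}.pred"]))
    return granted

# bb4's only predecessor is bb3, reached when %10 is true:
bb4 = Block("bb4", preds=[("bb3", "%10", True)],
            ops=[Op("load_indexed", ["%arg0"], ["%11"]),
                 Op("add", ["%9", "%0"], ["%12"])])
new_ops = grant_predicates(bb4)
print([op.results[0] for op in new_ops])  # ['%11.pred', '%12.pred']
```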

Contributor

  • Your test contains store_indexed, which is derived from the bert_nodexx.mlir, right? We didn't have a test with store_indexed before (except the bert ones).
  • Is has_block_args robust enough? What about a BB that has block args and also has non-block-arg live-ins?
  • Why is there %23 = neura.grant_predicate %22 -> neura.cast(%23)? The dataflow within a BB shouldn't need that grant_predicate, right?

Collaborator Author

Hmmm, I see the problem. Will fix it soon.

Comment on lines +39 to +40


Contributor

One additional line is enough.

@tancheng (Contributor)

  • Fix an error in the previous code when adding the grant_predicate operation (specifically, the previous code applied grant_predicate to grant_always values).

Can you use an example to explain this in the PR description?

@ShangkunLi (Collaborator Author)

  • Fix an error in the previous code when adding the grant_predicate operation (specifically, the previous code applied grant_predicate to grant_always values).

Can you use an example to explain this in the PR description?

Sure! In the previous implementation, the transformed branch_without_arg.mlir looked like:

func.func @test(%arg0: i64) -> f32 attributes {accelerator = "neura"} {
  %0 = "neura.constant"() <{predicate = true, value = 0 : i64}> : () -> !neura.data<i64, i1>
  %1 = "neura.constant"() <{predicate = true, value = 1.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %2 = "neura.grant_always"(%1) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %3 = "neura.constant"() <{predicate = true, value = 2.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %4 = "neura.grant_always"(%3) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %5 = "neura.constant"() <{predicate = true, value = 3.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %6 = "neura.grant_once"(%5) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %7 = "neura.constant"() <{predicate = true, value = 4.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %8 = "neura.grant_once"(%7) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %9 = "neura.icmp"(%arg0, %0) <{cmpType = "eq"}> : (i64, !neura.data<i64, i1>) -> !neura.data<i1, i1>
  %10 = "neura.grant_once"(%9) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %11 = neura.grant_predicate %6, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %12 = neura.grant_predicate %8, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %13 = "neura.not"(%10) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %14 = neura.grant_predicate %2, %13 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %15 = "neura.not"(%10) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %16 = neura.grant_predicate %4, %15 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %17 = "neura.fadd"(%14, %16) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %18 = "neura.fmul"(%11, %12) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %19 = "neura.phi"(%17, %18) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  "neura.return"(%19) : (!neura.data<f32, i1>) -> ()
}

You can see that %14 and %16 apply grant_predicate to two grant_always values, %2 and %4.

In the new implementation, the transformed IR looks like:

func.func @test(%arg0: i64) -> f32 attributes {accelerator = "neura"} {
  %0 = "neura.constant"() <{predicate = true, value = 0 : i64}> : () -> !neura.data<i64, i1>
  %1 = "neura.constant"() <{predicate = true, value = 1.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %2 = "neura.grant_always"(%1) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %3 = "neura.constant"() <{predicate = true, value = 2.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %4 = "neura.grant_always"(%3) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %5 = "neura.constant"() <{predicate = true, value = 3.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %6 = "neura.grant_once"(%5) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %7 = "neura.constant"() <{predicate = true, value = 4.000000e+00 : f32}> : () -> !neura.data<f32, i1>
  %8 = "neura.grant_once"(%7) : (!neura.data<f32, i1>) -> !neura.data<f32, i1>
  %9 = "neura.icmp"(%arg0, %0) <{cmpType = "eq"}> : (i64, !neura.data<i64, i1>) -> !neura.data<i1, i1>
  %10 = "neura.grant_once"(%9) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %11 = neura.grant_predicate %6, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %12 = neura.grant_predicate %8, %10 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %13 = "neura.not"(%10) : (!neura.data<i1, i1>) -> !neura.data<i1, i1>
  %14 = "neura.fadd"(%2, %4) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %15 = neura.grant_predicate %14, %13 : !neura.data<f32, i1>, !neura.data<i1, i1> -> !neura.data<f32, i1>
  %16 = "neura.fmul"(%11, %12) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  %17 = "neura.phi"(%15, %16) : (!neura.data<f32, i1>, !neura.data<f32, i1>) -> !neura.data<f32, i1>
  "neura.return"(%17) : (!neura.data<f32, i1>) -> ()
}

We apply grant_predicate to the result %14 to ensure correctness.
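The difference between the two versions can be illustrated with a toy Python sketch (the real pass is C++/MLIR; the tuple-based op representation and helper names here are made up for illustration): a value produced by grant_always already carries an always-valid predicate, so the pass should predicate the computed result (%14 from the fadd) rather than re-granting the grant_always operands %2 and %4.

```python
# Hypothetical sketch of the fix described above. Ops are modeled as
# (name, operands, result) tuples; this is not the pass's actual API.

def needs_grant_predicate(op_name):
    """A grant_always result already carries an always-valid predicate,
    so wrapping it in grant_predicate again was the bug being fixed."""
    return op_name not in ("grant_always",)

def predicate_live_outs(ops, cond):
    """Predicate the results computed inside the conditional region,
    rather than their grant_always operands."""
    out = []
    for name, operands, result in ops:
        out.append((name, operands, result))
        if needs_grant_predicate(name):
            out.append(("grant_predicate", [result, cond], result + ".p"))
    return out

# Mirrors the else-branch above: %2 and %4 are grant_always values feeding
# an fadd. Only the fadd result gets predicated (with the negated cond %13):
ops = [("grant_always", ["%1"], "%2"),
       ("grant_always", ["%3"], "%4"),
       ("fadd", ["%2", "%4"], "%14")]
result = predicate_live_outs(ops, "%13")
print([r for (n, _, r) in result if n == "grant_predicate"])  # ['%14.p']
```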

@tancheng (Contributor)

We apply grant_predicate to the result %14 to ensure correctness.

  • Hmm, which lines in your code enable this feature?
  • I feel the original IR looks fine and consistent, in the sense that all of a BB's live-ins require a phi; your change makes the live-outs predicated instead.
    • Don't you think keeping the original form, but performing an additional fusion of const + grant_always + grant_predicate, sounds like a better solution?
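The fusion the reviewer suggests could be sketched as a peephole pattern (again a toy Python model, not the actual pass; `predicated_constant` is a made-up fused op name): match a constant -> grant_always -> grant_predicate chain and collapse it into one op that materializes the constant under the given predicate.

```python
# Hypothetical sketch of fusing const + grant_always + grant_predicate.
# Ops are (name, operands, result) tuples; names are illustrative only.

def fuse_chain(ops):
    """Replace grant_predicate ops whose source is a grant_always of a
    constant with a single fused 'predicated_constant' op."""
    by_result = {r: (n, o) for (n, o, r) in ops}
    fused = []
    for name, operands, result in ops:
        if name == "grant_predicate":
            src, cond = operands
            if by_result.get(src, ("",))[0] == "grant_always":
                const = by_result[src][1][0]
                if by_result.get(const, ("",))[0] == "constant":
                    fused.append(("predicated_constant", [const, cond], result))
                    continue
        fused.append((name, operands, result))
    return fused

# The chain %1 (constant) -> %2 (grant_always) -> %14 (grant_predicate)
# collapses to one fused op; the now-dead constant and grant_always would
# then be cleaned up by DCE:
ops = [("constant", [], "%1"),
       ("grant_always", ["%1"], "%2"),
       ("grant_predicate", ["%2", "%13"], "%14")]
print([n for n, _, _ in fuse_chain(ops)])
# ['constant', 'grant_always', 'predicated_constant']
```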

@ShangkunLi ShangkunLi closed this Jun 29, 2025