@ShangkunLi
Collaborator

@ShangkunLi ShangkunLi commented Jan 9, 2026

Hi~ @tancheng,

I have been working on the ConvertAffineToTaskflow pass this week. The problem I keep running into is that a single pass cannot enumerate every possible affine loop structure. I have already written a 1.7K-line conversion pass that successfully converts the following two cases (multi-nested and irregular-loop).

But each time I add a new case, the pass cannot handle its structure, and more case-specific handling code has to be added.

For the linalg dialect, such a pass is a good fit, since there are only data dependencies between tasks. For affine.for, however (especially imperfectly nested loops), the nested structures are too hard for us to analyze.

So in this PR, I only put the transformed IR of multi-nested and irregular-loop in the tests, just to make sure that the defined dialect is correct and can represent such structures.

More discussion is needed on converting from high-level representations.

@ShangkunLi ShangkunLi requested a review from guosran January 9, 2026 13:08
Contributor

@tancheng tancheng left a comment

Can you show an unsupported example? I thought we could represent any example in a naive way anyway (without analyzing dependencies, i.e., assuming all data are dependent).

@@ -0,0 +1,73 @@
// RUN: mlir-neura-opt %s | FileCheck %s
Contributor

How is the test/multi-cgra/taskflow/irregular-loop/irregular-loop.cpp compiled using lit?

Collaborator Author

It is not compiled. I just use mlir-neura-opt to parse the input IR, to make sure the syntax is correct.

@ShangkunLi
Collaborator Author

> Can you show an unsupported example? I thought we could represent any example in a naive way anyway (without analyzing dependencies, i.e., assuming all data are dependent).

Here is a blocked_gemm example.

module attributes {} {
  func.func @_Z6bbgemmPiS_S_(%arg0: memref<?xi32>, %arg1: memref<?xi32>, %arg2: memref<?xi32>) attributes {llvm.linkage = #llvm.linkage<external>} {
    // Task 1
    affine.for %arg3 = 0 to 8 {  // Loop 1
      affine.for %arg4 = 0 to 64 {  // Loop 2
        affine.for %arg5 = 0 to 64 step 8 {  // Loop 3
          affine.for %arg6 = 0 to 64 step 8 {  // Loop 4
            %0 = affine.load %arg0[%arg6 + %arg3 + %arg4 * 64] : memref<?xi32>
            // Task 2
            affine.for %arg7 = 0 to 8 {  // Loop 5
              %1 = affine.load %arg1[%arg6 * 64 + %arg5 + %arg7 + %arg3 * 64] : memref<?xi32>
              %2 = arith.muli %0, %1 : i32
              %3 = affine.load %arg2[%arg5 + %arg7 + %arg4 * 64] : memref<?xi32>
              %4 = arith.addi %3, %2 : i32
              affine.store %4, %arg2[%arg5 + %arg7 + %arg4 * 64] : memref<?xi32>
            }
          }
        }
      }
    }
    return
  }
}

The expected output looks like this:

// Task 1
%ctrl_out, %data_out = taskflow.task (<ins>) {
  affine.for %arg3 = 0 to 8 {
    affine.for %arg4 = 0 to 64 {
      affine.for %arg5 = 0 to 64 step 8 {
        affine.for %arg6 = 0 to 64 step 8 {
          %0 = affine.load %arg0[%arg6 + %arg3 + %arg4 * 64] : memref<?xi32>
          taskflow.emit %arg6, %0
        }
      }
    }
  }
  taskflow.yield
}

%ctrl = taskflow.drive %ctrl_out
%data = taskflow.channel %data_out
// Task 2
taskflow.task (%ctrl, %data, <other_ins>) {
  affine.for %arg7 = 0 to 8 {
    %1 = affine.load %arg1[%arg6 * 64 + %arg5 + %arg7 + %arg3 * 64] : memref<?xi32>
    %2 = arith.muli %0, %1 : i32
    %3 = affine.load %arg2[%arg5 + %arg7 + %arg4 * 64] : memref<?xi32>
    %4 = arith.addi %3, %2 : i32
    affine.store %4, %arg2[%arg5 + %arg7 + %arg4 * 64] : memref<?xi32>
  }
}
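
To make the intended semantics of the split concrete, here is a rough Python model of it (a sketch, not the pass itself). It transliterates the loop nest from the blocked_gemm IR, then runs the same computation as two tasks connected by a FIFO. The names `bbgemm_monolithic`, `task1`, `task2`, and `run_split` are made up for illustration, and the token layout is an assumption: it carries all four induction variables plus the loaded value, standing in for whatever `<other_ins>` would provide.

```python
from collections import deque

N = 64 * 64  # every index expression in the IR stays below 4096


def bbgemm_monolithic(a, b, c):
    """Direct transliteration of loops 1-5 from the affine IR."""
    for i3 in range(8):                      # Loop 1
        for i4 in range(64):                 # Loop 2
            for i5 in range(0, 64, 8):       # Loop 3
                for i6 in range(0, 64, 8):   # Loop 4
                    x = a[i6 + i3 + i4 * 64]
                    for i7 in range(8):      # Loop 5
                        y = b[i6 * 64 + i5 + i7 + i3 * 64]
                        c[i5 + i7 + i4 * 64] += x * y


def task1(a, channel):
    """Loops 1-4: load from a and emit one token per Loop 4 iteration."""
    for i3 in range(8):
        for i4 in range(64):
            for i5 in range(0, 64, 8):
                for i6 in range(0, 64, 8):
                    # Assumed token layout: induction variables + loaded value.
                    channel.append((i3, i4, i5, i6, a[i6 + i3 + i4 * 64]))


def task2(b, c, channel):
    """Loop 5, triggered once per token; FIFO order keeps the RAW order on c."""
    while channel:
        i3, i4, i5, i6, x = channel.popleft()
        for i7 in range(8):
            y = b[i6 * 64 + i5 + i7 + i3 * 64]
            c[i5 + i7 + i4 * 64] += x * y


def run_split(a, b):
    """Run task 1 to completion, then drain the channel with task 2."""
    c = [0] * N
    channel = deque()
    task1(a, channel)
    task2(b, c, channel)
    return c
```

Python integers do not wrap like i32, so the model only matches the IR for inputs small enough not to overflow. It also runs the two tasks back to back rather than concurrently, which is enough to check that the split preserves the result but says nothing about timing.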

Difficulties for automated conversion:

  1. We need to automatically split the nest into the master task (loops 1-4) and the slave task (loop 5).
  2. We need to insert the taskflow.emit operation at the right place in the master task to trigger the slave task.
  3. For tasks with RAW dependencies, we need to insert a taskflow.channel op to denote them (that's why I changed the trait of taskflow.task from NoMemoryEffect to IsolatedFromAbove, so that RAW dependencies are expressed explicitly).
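
As a rough illustration of the first difficulty only, the split point can be thought of as the loop whose body mixes ordinary ops with a nested loop. The sketch below works on a toy tree, not on MLIR; the `Loop` class and `split_points` function are hypothetical and do not correspond to anything in the actual pass.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    name: str
    ops: list = field(default_factory=list)       # non-loop ops in the body
    children: list = field(default_factory=list)  # directly nested loops


def split_points(root):
    """Return names of loops whose body holds both plain ops and a nested loop.

    In this toy model such a loop ends a perfectly nested band, and the
    loops nested below it are candidates for a separate task.
    """
    found = []
    stack = [root]
    while stack:
        loop = stack.pop()
        if loop.ops and loop.children:
            found.append(loop.name)
        # Reverse so children are visited in program order.
        stack.extend(reversed(loop.children))
    return found


# Toy model of the blocked_gemm nest: loops 1-3 are perfectly nested,
# Loop 4 holds a load plus Loop 5.
loop5 = Loop("loop5", ops=["affine.load", "arith.muli", "affine.load",
                           "arith.addi", "affine.store"])
loop4 = Loop("loop4", ops=["affine.load"], children=[loop5])
loop3 = Loop("loop3", children=[loop4])
loop2 = Loop("loop2", children=[loop3])
blocked_gemm = Loop("loop1", children=[loop2])
```

For this nest, `split_points(blocked_gemm)` reports only `loop4`. The toy model ignores exactly the hard parts: sibling loops, ops placed after the nested loop, and which values have to cross the task boundary.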

@ShangkunLi
Collaborator Author

> Can you show an unsupported example? I thought we could represent any example in a naive way anyway (without analyzing dependencies, i.e., assuming all data are dependent).

Also, I don't get what you mean by "assume all data are dependent". In that case, how can we guarantee the RAW dependency in a taskflow (task in a dataflow) manner?

@tancheng
Contributor

okay, let's discuss this later.

BTW, would task1 run on one CGRA and task2 on another CGRA? Or is task1 on a controller?

@ShangkunLi
Collaborator Author

> okay, let's discuss this later.
>
> BTW, would task1 run on one CGRA and task2 on another CGRA? Or is task1 on a controller?

Task 1 will run on one CGRA and task 2 will run on another. The controller only handles the perfectly nested part (like a counter).

@tancheng
Contributor

> > okay, let's discuss this later.
> > BTW, would task1 run on one CGRA and task2 on another CGRA? Or is task1 on a controller?
>
> Task 1 will run on one CGRA and task 2 will run on another. The controller only handles the perfectly nested part (like a counter).

So in this case, CGRA1 might be underutilized, as it only performs loads?

@ShangkunLi
Collaborator Author

> > > okay, let's discuss this later.
> > > BTW, would task1 run on one CGRA and task2 on another CGRA? Or is task1 on a controller?
> >
> > Task 1 will run on one CGRA and task 2 will run on another. The controller only handles the perfectly nested part (like a counter).
>
> So in this case, CGRA1 might be underutilized, as it only performs loads?

Correct. And an automated conversion could be extremely complex in such a case.
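
For a rough sense of the imbalance, the ops in the two task bodies can be counted straight from the trip counts in the blocked_gemm IR. This is back-of-the-envelope arithmetic only: it counts the ops that appear in the listing and ignores address computation, loop control, and any hardware detail. The function name `op_counts` is made up for illustration.

```python
def op_counts():
    """Count listed ops per task from the loop trip counts in the IR."""
    loop4_iters = 8 * 64 * (64 // 8) * (64 // 8)  # trip counts of loops 1-4
    loop5_iters = loop4_iters * 8                 # Loop 5 runs 8 times per token
    task1_ops = loop4_iters * 1  # one affine.load per Loop 4 iteration
    task2_ops = loop5_iters * 5  # 2 loads, muli, addi, store per Loop 5 iteration
    return task1_ops, task2_ops


task1_ops, task2_ops = op_counts()
print(task1_ops, task2_ops, task2_ops // task1_ops)  # 32768 1310720 40
```

Under this count, task 2 executes about 40 times as many ops as task 1.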
