PTO Runtime - Task Runtime Execution Framework

Modular runtime for building and executing task dependency runtimes on Ascend devices with coordinated AICPU and AICore execution. Three independently compiled programs work together through clearly defined APIs.

Architecture Overview

The PTO Runtime consists of three separate programs that communicate through well-defined APIs:

┌─────────────────────────────────────────────────────────────┐
│                    Python Application                        │
│              (examples/scripts/run_example.py)                   │
└─────────────────────────┬───────────────────────────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
    Python Bindings   (ctypes)      Device I/O
    bindings.py
         │                │                │
         ▼                ▼                ▼
┌──────────────────┐  ┌──────────────────┐
│   Host Runtime   │  │   Binary Data    │
│ (src/platform/   │  │  (AICPU + AICore)│
│  a2a3/host/)     │  └──────────────────┘
├──────────────────┤         │
│ DeviceRunner     │         │
│ Runtime          │         │
│ MemoryAllocator  │    Loaded at runtime
│ C API            │         │
└────────┬─────────┘         │
         │                   │
         └───────────────────┘
                 │
                 ▼
    ┌────────────────────────────┐
    │  Ascend Device (Hardware)   │
    ├────────────────────────────┤
    │ AICPU: Task Scheduler       │  (src/platform/a2a3/aicpu/)
    │ AICore: Compute Kernels     │  (src/platform/a2a3/aicore/)
    └────────────────────────────┘

Setup

Cloning the Repository

Simply clone the repository:

git clone <repo-url>
cd simpler

The pto-isa dependency will be automatically cloned when you first run an example that needs it.

PTO ISA Headers

The pto-isa repository provides header files needed for kernel compilation on the a2a3 (hardware) platform.

The test framework automatically handles PTO_ISA_ROOT setup:

Checks if PTO_ISA_ROOT is already set
If not, clones pto-isa to examples/scripts/_deps/pto-isa on first run
Passes the resolved path to the kernel compiler
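
The three steps above can be sketched as follows. This is a minimal illustration, not the test framework's actual code: the function name `resolve_pto_isa_root` and its `clone` callback are hypothetical, and the real framework may handle errors differently.

```python
import os
from pathlib import Path
from typing import Callable, Mapping


def resolve_pto_isa_root(
    env: Mapping[str, str],
    default_dir: Path,
    clone: Callable[[Path], None],
) -> Path:
    """Return the pto-isa path, cloning into default_dir only if needed."""
    # Step 1: an explicitly set PTO_ISA_ROOT always wins.
    configured = env.get("PTO_ISA_ROOT")
    if configured:
        return Path(configured)
    # Step 2: reuse an existing checkout, otherwise clone on first run.
    if not default_dir.exists():
        clone(default_dir)
    # Step 3: the caller passes this path on to the kernel compiler.
    return default_dir


if __name__ == "__main__":
    def fake_clone(target: Path) -> None:
        # Stand-in for `git clone`; a real callback would invoke git here.
        print(f"would clone pto-isa into {target}")

    root = resolve_pto_isa_root(
        os.environ, Path("examples/scripts/_deps/pto-isa"), fake_clone
    )
    print(root)
```

Passing the clone step in as a callback keeps the lookup logic testable without network access.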

Automatic Setup (Recommended): Just run your example - pto-isa will be cloned automatically on first run:

python examples/scripts/run_example.py -k examples/host_build_graph_example/kernels \
                                       -g examples/host_build_graph_example/golden.py \
                                       -p a2a3sim

Manual Setup (if auto-setup fails or you prefer manual control):

# Clone pto-isa manually
mkdir -p examples/scripts/_deps
git clone --branch master https://gitcode.com/cann/pto-isa.git examples/scripts/_deps/pto-isa

# Set environment variable (optional - auto-detected if in standard location)
export PTO_ISA_ROOT=$(pwd)/examples/scripts/_deps/pto-isa

Using a Different Location: If you already have pto-isa elsewhere, just set the environment variable:

export PTO_ISA_ROOT=/path/to/your/pto-isa

Troubleshooting:

If git is not available: Clone pto-isa manually and set PTO_ISA_ROOT
If clone fails due to network: Try again or clone manually
For CI/CD: Either rely on auto-clone or pre-clone in CI steps

Note: For the simulation platform (a2a3sim), PTO ISA headers are optional and only needed if your kernels use PTO ISA intrinsics.

Platforms

PTO Runtime supports multiple target platforms:

Platform	Description	Requirements
`a2a3`	Real Ascend hardware	CANN toolkit (ccec, aarch64 cross-compiler)
`a2a3sim`	Thread-based host simulation	gcc/g++ only (no Ascend SDK needed)

builder = RuntimeBuilder(platform="a2a3")      # Real hardware
builder = RuntimeBuilder(platform="a2a3sim")   # Simulation

The simulation platform (a2a3sim) uses host threads to emulate AICPU/AICore execution, enabling development and testing without Ascend hardware. Kernel .text sections are loaded into mmap'd executable memory for direct invocation.

Three Components

1. Host Runtime (`src/platform/a2a3/host/`)

C++ library - Device orchestration and management

DeviceRunner: Singleton managing device operations
Runtime: Task dependency runtime data structure
MemoryAllocator: Device tensor memory management
pto_runtime_c_api.h: Pure C API for Python bindings
Compiled to shared library (.so) at runtime

Key Responsibilities:

Allocate/free device memory
Host ↔ Device data transfer
AICPU kernel launching and configuration
AICore kernel registration and loading
Runtime execution workflow coordination

2. AICPU Kernel (`src/platform/a2a3/aicpu/`)

Device program - Task scheduler running on AICPU processor

kernel.cpp: Kernel entry points and handshake protocol
Runtime-specific executor in src/runtime/host_build_graph/aicpu/
Compiled to device binary at build time

Key Responsibilities:

Initialize handshake protocol with AICore cores
Identify initially ready tasks (fanin=0)
Dispatch ready tasks to idle AICore cores
Track task completion and update dependencies
Continue until all tasks complete
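
The scheduling responsibilities listed above amount to a ready-queue loop over fanin counters. The sketch below is a host-side Python model, not the AICPU code: `schedule` is a hypothetical name, and it simplifies by treating every dispatched batch as finishing together, whereas the real scheduler tracks completion per core.

```python
from collections import deque
from typing import Dict, List


def schedule(
    fanin: Dict[int, int],
    fanout: Dict[int, List[int]],
    num_cores: int,
) -> List[List[int]]:
    """Return the tasks dispatched in each round, at most num_cores per round."""
    if num_cores < 1:
        raise ValueError("need at least one core")
    remaining = dict(fanin)  # unmet dependency count per task
    # Initially ready tasks are those with fanin == 0.
    ready = deque(sorted(t for t, n in remaining.items() if n == 0))
    rounds: List[List[int]] = []
    completed = 0
    while ready:
        # Dispatch as many ready tasks as there are idle cores.
        batch = [ready.popleft() for _ in range(min(num_cores, len(ready)))]
        rounds.append(batch)
        # Mark the batch complete and release successors whose fanin hits 0.
        for task in batch:
            completed += 1
            for succ in fanout.get(task, []):
                remaining[succ] -= 1
                if remaining[succ] == 0:
                    ready.append(succ)
    if completed != len(fanin):
        raise ValueError("dependency cycle: some tasks never became ready")
    return rounds
```

For a diamond-shaped graph (task 0 feeds tasks 1 and 2, which both feed task 3), two cores let tasks 1 and 2 run in the same round.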

3. AICore Kernel (`src/platform/a2a3/aicore/`)

Device program - Computation kernels executing on AICore processors

kernel.cpp: Task execution kernels (add, mul, etc.)
Runtime-specific executor in src/runtime/host_build_graph/aicore/
Compiled to object file (.o) at build time

Key Responsibilities:

Wait for task assignment via handshake buffer
Read task arguments and kernel address
Execute kernel using PTO ISA
Signal task completion
Poll for next task or quit signal

API Layers

Three layers of APIs enable the separation:

Layer 1: C++ API (`src/platform/a2a3/host/device_runner.h`)

DeviceRunner& runner = DeviceRunner::Get();
runner.Init(device_id, num_cores, aicpu_bin, aicore_bin, pto_isa_root);
runner.AllocateTensor(bytes);
runner.CopyToDevice(device_ptr, host_ptr, bytes);
runner.Run(runtime);
runner.Finalize();

Layer 2: C API (`src/platform/a2a3/host/pto_runtime_c_api.h`)

int DeviceRunner_Init(device_id, num_cores, aicpu_binary, aicpu_size,
                      aicore_binary, aicore_size, pto_isa_root);
int DeviceRunner_Run(runtime_handle, launch_aicpu_num);
int InitRuntime(runtime_handle);
int FinalizeRuntime(runtime_handle);
int DeviceRunner_Finalize();

Layer 3: Python API (`python/bindings.py`)

Runtime = bind_host_binary(host_binary)
runtime = Runtime()
runtime.initialize()
launch_runtime(runtime, aicpu_thread_num=1, block_dim=1,
               device_id=device_id, aicpu_binary=aicpu_bytes,
               aicore_binary=aicore_bytes)
runtime.finalize()

Directory Structure

pto-runtime/
├── src/
│   ├── platform/                       # Platform-specific implementations
│   │   ├── a2a3/                       # Ascend A2/A3 platform
│   │   │   ├── host/                   # Host runtime program
│   │   │   │   ├── device_runner.h/cpp  # Device management
│   │   │   │   ├── memory_allocator.h/cpp # Memory allocation
│   │   │   │   ├── function_cache.h    # Kernel binary cache
│   │   │   │   └── pto_runtime_c_api.h/cpp # C API for bindings
│   │   │   ├── aicpu/                  # AICPU kernel (device program)
│   │   │   │   ├── kernel.cpp          # Entry points & handshake
│   │   │   │   └── device_log.h/cpp    # Device logging
│   │   │   ├── aicore/                 # AICore kernel (device program)
│   │   │   │   ├── kernel.cpp          # Task execution kernels
│   │   │   │   └── aicore.h            # AICore header
│   │   │   └── common/                 # Shared structures
│   │   │       └── kernel_args.h       # Kernel argument structures
│   │   │
│   │   └── a2a3sim/                    # Thread-based simulation platform
│   │       ├── host/                   # Simulation host runtime
│   │       │   ├── device_runner.h/cpp  # Thread-based device emulation
│   │       │   ├── memory_allocator.h/cpp # Host memory allocation
│   │       │   └── pto_runtime_c_api.h/cpp # Same C API as a2a3
│   │       ├── aicpu/                  # Simulation AICPU
│   │       ├── aicore/                 # Simulation AICore
│   │       └── common/                 # Shared structures
│   │
│   └── runtime/                        # Runtime implementations
│       └── host_build_graph/           # Host-built graph runtime
│           ├── build_config.py         # Build configuration
│           ├── host/
│           │   └── runtime_maker.cpp    # C++ runtime builder & validator
│           ├── aicpu/
│           │   └── aicpu_executor.cpp # Task scheduler implementation
│           ├── aicore/
│           │   └── aicore_executor.cpp # AICore task executor
│           └── runtime/
│               └── runtime.h/cpp       # Task runtime and handshake structures
│
├── python/                             # Language bindings
│   ├── bindings.py                      # ctypes wrapper (C → Python)
│   ├── runtime_builder.py              # Python runtime builder
│   ├── runtime_compiler.py              # Multi-platform runtime compiler
│   ├── kernel_compiler.py               # Kernel compiler
│   ├── elf_parser.py                   # ELF binary parser
│   └── toolchain.py                    # Toolchain configuration
│
├── examples/                           # Working examples
│   ├── scripts/                        # Test framework scripts
│   │   ├── run_example.py                   # Main test runner
│   │   ├── code_runner.py              # Test execution engine
│   │   └── README.md                   # Test framework documentation
│   │
│   ├── host_build_graph_example/       # Host-built graph example (a2a3)
│   │   ├── README.md                   # Example documentation
│   │   ├── golden.py                   # Input generation and expected output
│   │   └── kernels/
│   │       ├── kernel_config.py        # Kernel configuration
│   │       ├── aiv/                    # AIV kernels
│   │       │   ├── kernel_add.cpp
│   │       │   ├── kernel_add_scalar.cpp
│   │       │   └── kernel_mul.cpp
│   │       └── orchestration/
│   │           └── example_orch.cpp    # Orchestration kernel
│   │
│   └── host_build_graph_sim_example/   # Simulation example (a2a3sim)
│       ├── README.md                   # Example documentation
│       ├── golden.py                   # Input generation and expected output
│       └── kernels/                    # Simulation kernels (plain C++)
│
└── tests/                              # Test suite
    └── test_runtime_builder.py         # Runtime builder tests

Developer Guidelines

Each developer role has a designated working directory:

Role	Directory	Responsibility
Platform Developer	`src/platform/`	Platform-specific logic and abstractions
Runtime Developer	`src/runtime/`	Runtime logic (host, aicpu, aicore, common)
Codegen Developer	`examples/`	Code generation examples and kernel implementations

Rules:

Stay within your assigned directory unless explicitly requested otherwise
Create new subdirectories under your assigned directory as needed
When in doubt, ask before making changes to other areas

Building

Prerequisites

CMake 3.15+
CANN toolkit with:
- ccec compiler (AICore Bisheng CCE)
- Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
Standard C/C++ compiler (gcc/g++) for host
Python 3 with development headers

Environment Setup

source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest

Build Process

The RuntimeCompiler class handles compilation of all three components separately. Use the platform parameter to select the target platform:

from runtime_compiler import RuntimeCompiler

# For real Ascend hardware (requires CANN toolkit)
compiler = RuntimeCompiler(platform="a2a3")

# For simulation (no Ascend SDK needed)
compiler = RuntimeCompiler(platform="a2a3sim")

# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs)    # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs)      # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs)        # → .so file

Toolchains used:

AICore: Bisheng CCE (ccec compiler) → .o object file (a2a3 only)
AICPU: aarch64 cross-compiler → .so shared object (a2a3 only)
Host: Standard gcc/g++ → .so shared library
HostSim: Standard gcc/g++ for all targets (a2a3sim)

Each component is compiled independently with its own toolchain, allowing modular development.

Usage

Quick Start - Python Example

from bindings import bind_host_binary, launch_runtime
from runtime_compiler import RuntimeCompiler

# Compile all binaries
compiler = RuntimeCompiler()
aicore_bin = compiler.compile("aicore", [...include_dirs...], [...source_dirs...])
aicpu_bin = compiler.compile("aicpu", [...include_dirs...], [...source_dirs...])
host_bin = compiler.compile("host", [...include_dirs...], [...source_dirs...])

# Load and initialize runtime
Runtime = bind_host_binary(host_bin)
runtime = Runtime()
runtime.initialize()  # C++ builds runtime and allocates tensors

# Execute runtime on device
launch_runtime(runtime,
               aicpu_thread_num=1,
               block_dim=1,
               device_id=9,
               aicpu_binary=aicpu_bin,
               aicore_binary=aicore_bin)

runtime.finalize()  # Verify and cleanup

Running the Example

Use the test framework to run examples:

# Hardware platform (requires Ascend device)
python examples/scripts/run_example.py \
  -k examples/host_build_graph_example/kernels \
  -g examples/host_build_graph_example/golden.py \
  -p a2a3

# Simulation platform (no hardware required)
python examples/scripts/run_example.py \
  -k examples/host_build_graph_sim_example/kernels \
  -g examples/host_build_graph_sim_example/golden.py \
  -p a2a3sim

This example:

Compiles AICPU, AICore, and Host binaries using RuntimeCompiler
Loads the host runtime library
Initializes DeviceRunner with compiled binaries
Creates a task runtime: f = (a + b + 1)(a + b + 2) with 4 tasks and dependencies
Executes on device (AICPU scheduling, AICore computing)
Validates results against golden output
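
To make the four-task decomposition concrete, here is a plain-Python model of the computation. The split into one add, two scalar adds, and one multiply is an assumption inferred from the kernel names in the example directory (kernel_add, kernel_add_scalar, kernel_mul); the function names are hypothetical.

```python
from typing import List


def golden(a: List[float], b: List[float]) -> List[float]:
    """Expected output computed directly: f = (a + b + 1)(a + b + 2)."""
    return [(x + y + 1) * (x + y + 2) for x, y in zip(a, b)]


def run_as_four_tasks(a: List[float], b: List[float]) -> List[float]:
    """Same result, computed as four dependent element-wise tasks."""
    # Task 0: c = a + b   (no dependencies)
    c = [x + y for x, y in zip(a, b)]
    # Task 1: d = c + 1   (depends on task 0)
    d = [v + 1 for v in c]
    # Task 2: e = c + 2   (depends on task 0)
    e = [v + 2 for v in c]
    # Task 3: f = d * e   (depends on tasks 1 and 2)
    return [p * q for p, q in zip(d, e)]
```

Tasks 1 and 2 depend only on task 0, so a scheduler with two idle cores could run them concurrently.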

Expected output:

=== Building Runtime: host_build_graph (platform: a2a3sim) ===
...
=== Comparing Results ===
Comparing f: shape=(16384,), dtype=float32
  f: PASS (16384/16384 elements matched)

============================================================
TEST PASSED
============================================================

Execution Flow

1. Python Setup Phase

Python run_example.py
  │
  ├─→ RuntimeCompiler.compile("host", ...) → host_binary (.so)
  ├─→ RuntimeCompiler.compile("aicpu", ...) → aicpu_binary (.so)
  ├─→ RuntimeCompiler.compile("aicore", ...) → aicore_binary (.o)
  │
  └─→ bind_host_binary(host_binary)
       └─→ RuntimeLibraryLoader(host_binary)
            └─→ CDLL(host_binary) ← Loads .so into memory

2. Initialization Phase

runner.init(device_id, num_cores, aicpu_binary, aicore_binary, pto_isa_root)
  │
  ├─→ DeviceRunner_Init (C API)
  │    ├─→ Initialize CANN device
  │    ├─→ Allocate device streams
  │    ├─→ Load AICPU binary to device
  │    ├─→ Register AICore kernel binary
  │    └─→ Create handshake buffers (one per core)
  │
  └─→ DeviceRunner singleton ready

3. Runtime Building Phase

runtime.initialize()
  │
  └─→ InitRuntime (C API)
       └─→ InitRuntimeImpl (C++)
            ├─→ Compile kernels at runtime (CompileAndLoadKernel)
            │    ├─→ KernelCompiler calls ccec
            │    ├─→ Load .o to device GM memory
            │    └─→ Update kernel function address table
            │
            ├─→ Allocate device tensors via MemoryAllocator
            ├─→ Copy input data to device
            ├─→ Build task runtime with dependencies
            └─→ Return Runtime pointer

4. Execution Phase

launch_runtime(runtime, aicpu_thread_num=1, block_dim=1, device_id=device_id,
               aicpu_binary=aicpu_bytes, aicore_binary=aicore_bytes)
  │
  └─→ launch_runtime (C API)
       │
       ├─→ Copy Runtime to device memory
       │
       ├─→ LaunchAiCpuKernel (init kernel)
       │    └─→ Execute on AICPU: Initialize handshake
       │
       ├─→ LaunchAiCpuKernel (main scheduler kernel)
       │    └─→ Execute on AICPU: Task scheduler loop
       │         ├─→ Find initially ready tasks
       │         ├─→ Loop: dispatch tasks, wait for completion
       │         └─→ Continue until all tasks done
       │
       ├─→ LaunchAicoreKernel
       │    └─→ Execute on AICore cores: Task workers
       │         ├─→ Wait for task assignment
       │         ├─→ Execute kernel
       │         └─→ Signal completion, repeat
       │
       └─→ rtStreamSynchronize (wait for completion)

5. Validation Phase

runtime.finalize()
  │
  └─→ FinalizeRuntime (C API)
       └─→ FinalizeRuntimeImpl (C++)
            ├─→ Copy results from device to host
            ├─→ Verify correctness (compare with expected values)
            ├─→ Free all device tensors
            ├─→ Delete runtime
            └─→ Return success/failure
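
The "verify correctness" step is essentially an element-wise comparison against expected values. The sketch below shows one way to do that in Python. The `compare` helper and its tolerance defaults are assumptions for illustration. The report line imitates the expected output shown earlier.

```python
import math
from typing import Sequence


def compare(
    name: str,
    actual: Sequence[float],
    expected: Sequence[float],
    rtol: float = 1e-5,   # assumed relative tolerance
    atol: float = 1e-5,   # assumed absolute tolerance
) -> bool:
    """Print a PASS/FAIL line and return True when every element matches."""
    if len(actual) != len(expected):
        print(f"  {name}: FAIL (length {len(actual)} != {len(expected)})")
        return False
    matched = sum(
        1
        for x, y in zip(actual, expected)
        if math.isclose(x, y, rel_tol=rtol, abs_tol=atol)
    )
    ok = matched == len(expected)
    status = "PASS" if ok else "FAIL"
    print(f"  {name}: {status} ({matched}/{len(expected)} elements matched)")
    return ok
```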

Handshake Protocol

AICPU and AICore cores coordinate via handshake buffers (one per core):

struct Handshake {
    volatile uint32_t aicpu_ready;   // AICPU→AICore: scheduler ready
    volatile uint32_t aicore_done;   // AICore→AICPU: core ready
    volatile uint64_t task;          // AICPU→AICore: task pointer
    volatile int32_t task_status;    // Task state: 1=busy, 0=done
    volatile int32_t control;        // AICPU→AICore: 1=quit
};

Flow:

AICPU finds a ready task
AICPU writes task pointer to handshake buffer and sets aicpu_ready
AICore polls buffer, sees task, reads from device memory
AICore sets task_status = 1 (busy) and executes
AICore sets task_status = 0 (done) and aicore_done
AICPU reads result and continues
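
The flow above can be modelled on the host with two threads sharing one handshake record. This is a simplified sketch rather than the device code: a Python callable stands in for the task pointer, the worker clears aicpu_ready as an acknowledgement (an assumption), and polling uses short sleeps instead of a busy-wait.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Handshake:
    aicpu_ready: int = 0                       # AICPU -> AICore: task posted
    aicore_done: int = 0                       # AICore -> AICPU: task finished
    task: Optional[Callable[[], None]] = None  # stands in for the task pointer
    task_status: int = 0                       # 1 = busy, 0 = done
    control: int = 0                           # 1 = quit


def aicore_worker(hs: Handshake) -> None:
    # Poll for a task or the quit signal.
    while hs.control != 1:
        if hs.aicpu_ready == 1:
            hs.task_status = 1   # busy
            hs.aicpu_ready = 0   # acknowledge the task (assumed step)
            hs.task()
            hs.task_status = 0   # done
            hs.aicore_done = 1
        else:
            time.sleep(0.001)


def aicpu_dispatch(hs: Handshake, tasks: List[Callable[[], None]]) -> None:
    for task in tasks:
        hs.task = task           # write the task first,
        hs.aicore_done = 0
        hs.aicpu_ready = 1       # then publish it
        while hs.aicore_done != 1:
            time.sleep(0.001)    # wait for completion
    hs.control = 1               # tell the worker to quit


def run(tasks: List[Callable[[], None]]) -> Handshake:
    hs = Handshake()
    worker = threading.Thread(target=aicore_worker, args=(hs,))
    worker.start()
    aicpu_dispatch(hs, tasks)
    worker.join()
    return hs
```

Writing the task before setting aicpu_ready matters: the worker must never observe the flag without a valid task behind it.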

Components in Detail

Host Runtime (`src/platform/a2a3/host/`)

DeviceRunner: Singleton managing device operations

Allocate/free device tensor memory
Copy data between host and device
Launch AICPU and AICore kernels
Manage handshake buffers
Coordinate runtime execution

Runtime: Task dependency runtime

Add tasks with arguments and function IDs
Add dependencies between tasks (fanin/fanout)
Query task information and dependency structure
Calculate topologically ready tasks
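
As an illustration of these operations, the following Python class models a task dependency structure with fanin counters and fanout lists. The class and method names are hypothetical and do not mirror the C++ Runtime interface. The three limits use the values documented for runtime.h.

```python
from typing import List, Tuple

# Limits as documented for runtime.h.
RUNTIME_MAX_TASKS = 1024
RUNTIME_MAX_ARGS = 16
RUNTIME_MAX_FANOUT = 512


class TaskGraph:
    """Toy stand-in for the Runtime structure (hypothetical names)."""

    def __init__(self) -> None:
        self.func_ids: List[int] = []
        self.args: List[Tuple[int, ...]] = []
        self.fanin: List[int] = []          # unmet predecessors per task
        self.fanout: List[List[int]] = []   # successors per task

    def add_task(self, func_id: int, args: Tuple[int, ...]) -> int:
        if len(self.func_ids) >= RUNTIME_MAX_TASKS:
            raise ValueError("too many tasks")
        if len(args) > RUNTIME_MAX_ARGS:
            raise ValueError("too many arguments")
        self.func_ids.append(func_id)
        self.args.append(tuple(args))
        self.fanin.append(0)
        self.fanout.append([])
        return len(self.func_ids) - 1       # new task id

    def add_successor(self, producer: int, consumer: int) -> None:
        if len(self.fanout[producer]) >= RUNTIME_MAX_FANOUT:
            raise ValueError("too many successors")
        self.fanout[producer].append(consumer)
        self.fanin[consumer] += 1

    def initial_ready_tasks(self) -> List[int]:
        # Tasks with no predecessors can be dispatched immediately.
        return [t for t, n in enumerate(self.fanin) if n == 0]
```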

MemoryAllocator: Device memory management

Allocate blocks from device GM memory
Track allocations automatically
Free with automatic cleanup on finalization
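
The tracking behaviour described above can be sketched in a few lines. This is an illustrative Python model with hypothetical names. The `raw_alloc` and `raw_free` callbacks stand in for whatever device allocation calls the real MemoryAllocator wraps.

```python
from typing import Callable, Set


class TrackingAllocator:
    """Records live allocations so finalize() can release leftovers."""

    def __init__(
        self,
        raw_alloc: Callable[[int], int],
        raw_free: Callable[[int], None],
    ) -> None:
        self._raw_alloc = raw_alloc
        self._raw_free = raw_free
        self._live: Set[int] = set()

    def alloc(self, nbytes: int) -> int:
        ptr = self._raw_alloc(nbytes)
        self._live.add(ptr)          # track every allocation
        return ptr

    def free(self, ptr: int) -> None:
        if ptr not in self._live:
            raise ValueError(f"pointer {ptr:#x} is not a live allocation")
        self._live.remove(ptr)
        self._raw_free(ptr)

    def finalize(self) -> int:
        """Free anything still live and return how many blocks that was."""
        leftover = len(self._live)
        for ptr in sorted(self._live):
            self._raw_free(ptr)
        self._live.clear()
        return leftover
```

Tracking by pointer also lets the allocator reject a double free instead of forwarding it to the device.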

pto_runtime_c_api: Pure C interface

Enables Python ctypes bindings
Wraps C++ classes as opaque pointers
Error codes: 0=success, negative=failure
All memory management in C++
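
On the Python side, the error-code convention is easy to centralise in a small helper. The sketch below is an assumption about how a caller might do this, not part of bindings.py. `RuntimeCallError` and `check` are hypothetical names.

```python
class RuntimeCallError(RuntimeError):
    """Raised when a C API call reports failure (hypothetical helper)."""


def check(rc: int, call_name: str) -> int:
    # Convention from the C API: 0 is success, negative values are failures.
    if rc < 0:
        raise RuntimeCallError(f"{call_name} failed with error code {rc}")
    return rc
```

A caller would wrap each C API return value, for example the result of DeviceRunner_Finalize, so that failures surface as Python exceptions instead of being silently ignored.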

AICPU Kernel (`src/platform/a2a3/aicpu/`)

kernel.cpp: Kernel entry points

Initialization kernel: Sets up handshake protocol
Main scheduler kernel: Task scheduling loop
Handshake initialization and management

aicpu_executor.cpp (in src/runtime/host_build_graph/aicpu/): Task scheduler

Ready task identification
Task dispatch to cores
Dependency tracking and updates
Loop until completion

AICore Kernel (`src/platform/a2a3/aicore/`)

kernel.cpp: Computation kernels

Task execution implementations
Kernel function pointers indexed by func_id
Memory access and PTO ISA operations
Handshake buffer polling

Features

Dynamic Kernel Compilation

Compile and load kernels at runtime without rebuilding:

// In host code
runner.CompileAndLoadKernel(func_id, "path/to/kernel.cpp", core_type);

This compiles the kernel source using ccec, loads the binary to device memory, and registers it for task dispatch.

Python Bindings

Full Python API with ctypes:

No C++ knowledge required
NumPy integration for arrays
Easy data transfer between host and device

Modular Design

Three programs compile independently
Clear API boundaries
Develop components in parallel
Runtime linking via binary loading

Configuration

Compile-time Configuration (Runtime Limits)

In src/runtime/host_build_graph/runtime/runtime.h:

#define RUNTIME_MAX_TASKS 1024     // Maximum number of tasks
#define RUNTIME_MAX_ARGS 16        // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512     // Maximum successors per task

Runtime Configuration

runner.init(
    device_id=0,              # Device ID (0-15)
    num_cores=3,              # Number of cores for handshake
    aicpu_binary=...,         # AICPU .so binary
    aicore_binary=...,        # AICore .o binary
    pto_isa_root="/path/to/pto-isa"  # PTO-ISA headers location
)

Notes

Device IDs: 0-15 (the examples typically use device 9)
Handshake cores: Usually 3 (1c2v configuration: 1 cube core, 2 vector cores)
Kernel compilation: Requires ASCEND_HOME_PATH environment variable
Memory management: MemoryAllocator automatically tracks allocations
Python requirement: NumPy for efficient array operations

Logging

Device logs are written to ~/ascend/log/debug/device-<id>/

Device-side code logs through the following macros:

DEV_INFO: Informational messages
DEV_DEBUG: Debug messages
DEV_WARN: Warnings
DEV_ERROR: Error messages

Testing

./ci.sh

References

src/platform/a2a3/host/ - Host runtime implementation (real hardware)
src/platform/a2a3/aicpu/ - AICPU scheduler implementation
src/platform/a2a3/aicore/ - AICore kernel implementation
src/platform/a2a3sim/ - Thread-based simulation platform
src/runtime/host_build_graph/ - Host-built graph runtime
examples/host_build_graph_example/ - Hardware example (a2a3)
examples/host_build_graph_sim_example/ - Simulation example (a2a3sim)
python/ - Python bindings and compiler
