This repository is actively maintained and will be regularly updated with new content, improvements, and contributions.

Table of Contents

1. RISC-V
2. PULP
- 2.1. RISC-V Cores
- 2.2. Peripherals
- 2.3. Interconnects
- 2.4. Platforms
  - 2.4.1. Single core
  - 2.4.2. Multi-core (Cluster-based)
  - 2.4.3. Multi-Cluster
- 2.5. Accelerator (HWPEs)
- 2.6. Silicon Proven designs
- 2.7. Software
- 2.8. Useful libraries
  - 2.8.1. Deploying DNNs on PULP
  - 2.8.2. Digital Signal Processing
  - 2.8.3. PULP-DroNet
- 2.9. How we can develop an application on PULP
  - 2.9.1. PULP SDK
  - 2.9.2. PULP FreeRTOS
  - 2.9.3. PULP Runtime
  - 2.9.4. Snitch Runtime
  - 2.9.5. CVA6
- 2.10. Some Case Studies
  - 2.10.1. GAP8
  - 2.10.2. GAP9

This report looks at the PULP platform, which is designed to enable energy-efficient computing systems based on the RISC-V instruction set architecture. It covers the RISC-V cores that can be used with PULP, the available peripherals and interconnects, and the platforms that PULP supports, from single-core to multi-core and multi-cluster systems, along with the accelerator modules that can optimize performance. It then surveys the software tools and libraries available for PULP, including those for deploying deep neural networks and for digital signal processing, and gives an overview of the PULP software development kit and the runtime environments available for PULP-based systems. Finally, it presents several case studies of real-world applications of PULP on the GAP8 and GAP9 systems.

RISC-V

RISC-V is an open-source instruction set architecture (ISA) designed to be simple, modular, and extensible. It is based on the Reduced Instruction Set Computer (RISC) philosophy, which emphasizes using a small set of simple instructions that can be executed quickly and efficiently.

The RISC-V ISA is designed to be highly flexible and customizable, allowing for a wide range of implementations to meet specific application requirements. It supports various data types and memory models and can be extended to include custom instructions or additional features as needed.

One of the key advantages of RISC-V is its open-source nature, which allows anyone to use, modify, or distribute the ISA without any licensing fees or restrictions. This makes it a highly accessible platform for academic research, prototyping, and low-volume production. Implementations can also be customized by placing new instructions in the free opcodes that RISC-V leaves available.
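As a rough illustration of what using a free opcode looks like, the sketch below packs the fields of an R-type instruction into a 32-bit word. The field layout and the `custom-0`/`custom-1` major opcodes come from the base RISC-V opcode map; the instruction `my_op` and its `funct7` value are hypothetical and exist only for this example.

```python
# Major opcodes that the base RISC-V opcode map sets aside for custom extensions.
CUSTOM_OPCODES = {"custom-0": 0x0B, "custom-1": 0x2B}


def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack the six R-type fields into one 32-bit instruction word."""
    assert 0 <= opcode < 128 and 0 <= funct3 < 8 and 0 <= funct7 < 128
    assert all(0 <= reg < 32 for reg in (rd, rs1, rs2))
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


# Sanity check against a standard instruction: add x1, x2, x3 (major opcode OP = 0x33).
add_word = encode_r_type(0x33, rd=1, funct3=0, rs1=2, rs2=3, funct7=0)

# Hypothetical custom instruction "my_op x10, x11, x12" placed in the custom-0 space.
my_op_word = encode_r_type(CUSTOM_OPCODES["custom-0"], rd=10, funct3=0, rs1=11, rs2=12, funct7=1)

print(f"add   x1,  x2,  x3  -> 0x{add_word:08x}")
print(f"my_op x10, x11, x12 -> 0x{my_op_word:08x}")
```

A core that implements `my_op` would decode the low seven bits, see a custom opcode, and route the instruction to its own functional unit; standard cores would treat it as an illegal instruction.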

In this video, Frank Gürkaynak presented a short overview of the basics of RISC-V.

Given the increasing popularity of the RISC-V architecture, it is not surprising that several open-source platforms have emerged to support it. One such platform is PULP, which provides developers with tools and resources to create energy-efficient computing systems based on the RISC-V architecture.

PULP

PULP (Parallel Ultra Low Power) is an open-source platform for designing energy-efficient, parallel computing systems. It is based on the RISC-V instruction set architecture (ISA) and incorporates various hardware and software components to support a range of parallel processing applications.

In addition to the processor cores, the PULP platform comprises a range of other hardware components, including memory controllers, communication interfaces, and accelerators, all designed to support efficient parallel processing. It also includes various software tools and libraries, including compilers, debuggers, and performance analysis tools, designed to facilitate development on the platform.

The PULP platform is designed to be highly flexible and customizable, allowing it to be adapted to a wide range of application domains, including machine learning, signal processing, and Internet of Things (IoT) applications. It is also designed to be energy-efficient, focusing on minimizing power consumption while maintaining high performance.

This tutorial covers the following subjects about PULP:

RISC-V Cores
Peripherals
Interconnects
Platforms
- Single Core
- Multi-core (Cluster-based)
  - GAP8
  - GAP9
- Multi-Cluster
Accelerators (HWPEs)

Figure 1: An overview of PULP

Silicon Proven Designs
Software
Useful libraries
- PULP-NN
- Dory
- QuantLab
- PULP-TrainLib
- PULP-DSP
- PULP-DroNet
How can we develop an application on PULP

One of the critical components of the PULP platform is the range of available RISC-V cores, which provide the foundation for energy-efficient computing systems.

2.1. RISC-V Cores

As shown in Table 1, the PULP platform comprises several RISC-V processors, including RI5CY, Ariane, and Snitch. Here is a comparison of different processors on the PULP website. Also, you can find more details by clicking on each processor’s name.

Table 1: Processors available in PULP

Processor	Bits/Stages	Description
CV32E40P (RI5CY)	32bit / 4-stage	A 4-stage 32-bit core that implements RV32IMC, with an optional 32-bit FPU supporting the F extension and instruction set extensions for digital signal processing (DSP) operations, including hardware loops, SIMD extensions, bit manipulation and post-increment instructions.
Ibex (Zero-riscy)	32bit / 2-stage	An area-optimized 2-stage 32-bit core for control applications implementing RV32IMC.
Micro-riscy	32bit / 2-stage	A minimal area 2-stage 32-bit core with 16 registers and no hardware multiplier implementing RV32EC.
CVA6 (Ariane)	64bit / 6-stage	A 6-stage, single issue, in-order 64-bit CPU which fully implements I, M, C and D extensions as specified in Volume I: User-Level ISA V 2.1 as well as the draft privilege extension 1.10. It implements three privilege levels, M, S, and U, to fully support a Unix-like (Linux, BSD, etc.) operating system. It has a configurable size, separate TLBs, a hardware PTW and branch prediction (branch target buffer, branch history table and a return address stack). The primary design goal was to reduce critical path length to about 20 gate delays.
Snitch	32bit / 1-stage	A single-stage, single-issue 32-bit RISC-V integer core tuned for high energy efficiency. It aims to maximise the compute/control ratio by making the FPU external to the core and the dominant part of the design and mitigating the effects of deep pipelines and dynamic scheduling.

One of the most widely used RISC-V cores in the PULP platform is the RI5CY core, a 32-bit core optimized for energy efficiency and high performance. This core includes various extensions (Xpulp) to RISC-V for DSP applications.

Post-incrementing load/store instructions
Hardware Loops (lp.start, lp.end, lp.count)
ALU instructions
- Bit manipulation (count, set, clear, leading bit detection)
- Fused operations: (add/sub-shift)
- Immediate branch instructions
Multiply Accumulate (32x32 bit and 16x16 bit)
SIMD instructions (2x16 bit or 4x8 bit) with scalar replication option
- add, min/max, dot product, shuffle, pack (copy), vector comparison.
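To give a feel for what the 4x8-bit SIMD dot product in the list above buys, here is a small behavioral model. It is a sketch, not the core's actual instruction semantics: the function name `sdot4_acc` and the lane ordering are assumptions made for illustration.

```python
def pack_int8x4(values):
    """Pack four signed 8-bit lanes into one 32-bit word (lane 0 in the low byte)."""
    assert len(values) == 4 and all(-128 <= v <= 127 for v in values)
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack_int8x4(word):
    """Split a 32-bit word back into four signed 8-bit lanes."""
    lanes = []
    for lane in range(4):
        byte = (word >> (8 * lane)) & 0xFF
        lanes.append(byte - 256 if byte >= 128 else byte)
    return lanes


def sdot4_acc(acc, word_a, word_b):
    """Hypothetical model of a 4x8-bit dot product with accumulation."""
    a, b = unpack_int8x4(word_a), unpack_int8x4(word_b)
    return acc + sum(x * y for x, y in zip(a, b))


a = [1, -2, 3, -4, 5, 6, -7, 8]
b = [8, 7, -6, 5, 4, -3, 2, 1]

acc = 0
for i in range(0, len(a), 4):   # each iteration consumes four elements per operand
    acc = sdot4_acc(acc, pack_int8x4(a[i:i + 4]), pack_int8x4(b[i:i + 4]))

print(acc)
```

The loop runs two iterations instead of eight. On hardware, a hardware loop would remove the branch overhead of that loop, and post-incrementing loads would fold the pointer updates into the memory accesses.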

Here you can find more details about DSP ISA Extensions for an Open-Source RISC-V Implementation.

An overall summary of the cores is provided in Figure 2.

Figure 2: A summary of the available cores

Here you can find more details about OpenPiton+Ariane, which is the first Linux-booting open-source RISC-V Manycore.

Additionally, Prof. Luca Benini discussed Ariane.

In addition to the range of RISC-V cores available in PULP-based systems, various peripherals can be integrated to provide additional functionality and enhance overall performance.

2.2. Peripherals

The PULP team has developed customized accelerators, AXI-compatible interconnect solutions, DMA engines, and peripherals to communicate with the environment, including GPIO, SPI, I2S, JTAG, and many more. More details are available here.

2.3. Interconnects

The PULP project uses several types of interconnect to let the processing elements in its system-on-chip (SoC) designs communicate, including the logarithmic interconnect, the APB peripheral bus, and the AXI4 interconnect.

More information is available here.

2.4. Platforms

We can divide PULP platforms into three categories:

2.4.1 Single core

The simplest PULP-based systems are microcontrollers that can be configured with any of the 32-bit RISC-V cores developed by the PULP team (RI5CY, Zero-riscy, Micro-riscy), combined with memory and some peripherals, as shown in Figure 3. Advanced versions also allow accelerators to be added to the system.

Figure 3: single core components

M, R5, A, I, and O represent Memory, RISC-V Core, Accelerator, Input, and Output, respectively. PULPissimo and PULPino are two single-core MCUs in the PULP project. A comparison of these two MCUs is available on the PULP website.

2.4.1.1 PULPino

PULPino is a minimal single-core RISC-V SoC and the project's first open-source release, which attracted a lot of attention.

Figure 4 PULPino architecture

For more details about PULPino, please visit this link.

2.4.1.2 PULPissimo

PULPissimo is a more advanced version of the PULP microcontroller. The main change is the presence of the logarithmic interconnect between the core and the memory subsystem, allowing multiple access ports. These are then used by an integrated uDMA that can copy data directly between peripherals, memory, and optional accelerators called Hardware Processing Engines (HWPEs).

Figure 5 PULPissimo architecture

In this tutorial, Davide Schiavone explains the architecture of PULPissimo, the differences among individual PULP cores, Xpulp extensions, and much more.

2.4.2 Multi-core (Cluster-based)

Figure 6 cluster-based components

The more advanced systems are based on clusters of 32-bit RISC-V cores with direct access to a small and fast scratchpad memory (Tightly Coupled Data Memory, TCDM). The cluster is supported by an SoC that houses a larger second-level memory, peripherals for input and output, and, in later versions, a complete PULPissimo-class microcontroller for power management and basic operations.
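When several cores access one scratchpad in the same cycle, the way addresses map to memory banks decides how often they collide. The sketch below assumes a word-interleaved, multi-banked memory, which is a common organization for shared scratchpads. The bank count, the addresses, and the function names are made up for illustration and do not describe any specific PULP chip.

```python
from collections import Counter


def bank_of(byte_address, num_banks, word_bytes=4):
    """Word-level interleaving: consecutive words map to consecutive banks."""
    return (byte_address // word_bytes) % num_banks


def conflicts_in_cycle(addresses, num_banks):
    """Count requests that must wait because another request hit the same bank."""
    hits = Counter(bank_of(addr, num_banks) for addr in addresses)
    return sum(count - 1 for count in hits.values())


NUM_BANKS = 16  # assumed bank count, chosen only for illustration

# Eight cores reading consecutive words: every request lands in a different bank.
unit_stride = [core * 4 for core in range(8)]
# Eight cores reading with a stride of NUM_BANKS words: all requests hit bank 0.
bad_stride = [core * 4 * NUM_BANKS for core in range(8)]

print(conflicts_in_cycle(unit_stride, NUM_BANKS))
print(conflicts_in_cycle(bad_stride, NUM_BANKS))
```

Under these assumptions, data layouts in which neighbouring cores touch neighbouring words keep all cores running, while power-of-two strides that match the bank count serialize the accesses.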

Figure 7 Multi-core architecture

Most of the PULP team's research builds on architectures of this kind. Mia Wallace, Honey Bunny, Fulmine, Mr. Wolf and Vega are all such systems, and the source code for the latest system has been released as OPENPULP on their GitHub page.

During the ACACES20 summer school, Prof. Luca Benini discussed this multi-core platform and other concepts related to the cluster, including barriers, DMA, Memory, and interconnects.

Two prominent examples of PULP-based commercially available platforms are GAP8 and GAP9, which offer a range of features and benefits for specific design requirements and performance goals.

2.4.2.1 GAP8

GAP8 (GreenWaves Application Processor) is a low-power, high-performance application processor developed by GreenWaves Technologies. It is designed specifically for the efficient execution of machine learning and signal-processing algorithms in embedded systems. (More details are available here)

GAP8 is based on the RISC-V open-source instruction set architecture and is optimized for low power consumption and high performance. It features a multi-core design with eight processing cores that share a tightly coupled L1 memory and a larger L2 memory, and it includes hardware accelerators for commonly used signal processing and machine learning operations.

The processor is designed for many embedded systems, including sensor nodes, wearables, and other low-power IoT devices. Its low power consumption and high performance make it well-suited for applications such as image and audio processing, gesture recognition, and environmental monitoring.

GreenWaves Technologies also provides a comprehensive software development kit (SDK) for GAP8, including optimized machine learning and signal processing libraries, a C/C++ compiler, and tools for debugging and profiling applications. The SDK also includes support for the PULP operating system, allowing developers to leverage the full capabilities of the PULP ecosystem.

Figure 8 GAP8

2.4.2.2 GAP9

GAP9 is the latest version of the GreenWaves Application Processor (GAP) developed by GreenWaves Technologies. Like its predecessor, GAP8, GAP9 is a low-power, high-performance application processor optimized for the efficient execution of machine learning and signal processing algorithms in embedded systems.

GAP9 features a multi-core design with nine RISC-V cores, one more than GAP8. The cores are backed by shared on-chip L1 and L2 memories and complemented by hardware accelerators for signal processing and machine learning operations. GAP9 also features new hardware modules for image processing, including a multi-channel, multi-resolution image sensor interface and hardware support for convolutional neural networks (CNNs).

Figure 9 GAP9

2.4.3 Multi-Cluster

Figure 10 Multi-Cluster components

The PULP team has also expanded its work to handle larger workloads, where a PULP system containing multiple clusters is connected to a regular computing node. In this scenario, the PULP clusters are used as an energy-efficient accelerator for DSP workloads. The HERO platform release is such a system.

Figure 11 Multi-Cluster architecture

More materials on different types of multi-cluster architectures are available on

Prof. Luca Benini shared insights gained from designing open-source RISC-V hardware and software for energy-efficient computing, moving from tiny, parallel, ultra-low-power chips to high-performance many-core chipsets.

For more information, you can watch this video where Frank Gürkaynak provided more details about PULP-based chips.

2.5. Accelerator (HWPEs)

Hardware Processing Engines (HWPEs) are special-purpose, memory-coupled accelerators that can be inserted in the SoC or cluster of a PULP system to amplify its performance and energy efficiency in particular tasks.

Unlike most accelerators in literature, HWPEs do not rely on an external DMA to feed them with input and extract output, and they are not (necessarily) tied to a single core. Instead, they operate directly on the same memory shared by other PULP system elements (e.g., the L1 TCDM in a PULP cluster or the shared L2 in PULPissimo). Their control is memory-mapped and accessed through a peripheral bus or interconnect. HW-based execution on an HWPE can be readily intermixed with software code because all that needs to be exchanged between the two is a set of pointers and, if necessary, a few parameters. (More details are available here)
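The control flow described above (program a few memory-mapped registers with pointers and parameters, trigger the engine, read the result in place) can be mimicked with a toy model. Everything here is an assumption made for illustration: the register offsets, the `ToyHwpe` class, and the "double every word" operation do not correspond to any real HWPE.

```python
import struct

# Hypothetical register offsets; a real HWPE defines its own register map.
REG_IN_PTR, REG_OUT_PTR, REG_LENGTH, REG_TRIGGER = 0x00, 0x04, 0x08, 0x0C


class ToyHwpe:
    """Toy memory-coupled accelerator: doubles `length` 32-bit words in shared memory."""

    def __init__(self, shared_memory):
        self.mem = shared_memory          # same buffer the "cores" use, no DMA copy
        self.regs = {REG_IN_PTR: 0, REG_OUT_PTR: 0, REG_LENGTH: 0}

    def write_reg(self, offset, value):
        if offset == REG_TRIGGER:
            self._run()
        else:
            self.regs[offset] = value

    def _run(self):
        src, dst, n = (self.regs[r] for r in (REG_IN_PTR, REG_OUT_PTR, REG_LENGTH))
        words = struct.unpack_from(f"<{n}i", self.mem, src)
        struct.pack_into(f"<{n}i", self.mem, dst, *(2 * w for w in words))


memory = bytearray(64)                             # stands in for the shared L1/L2
struct.pack_into("<4i", memory, 0, 1, 2, 3, 4)     # software prepares the input

hwpe = ToyHwpe(memory)
hwpe.write_reg(REG_IN_PTR, 0)      # only pointers and a length cross the boundary
hwpe.write_reg(REG_OUT_PTR, 16)
hwpe.write_reg(REG_LENGTH, 4)
hwpe.write_reg(REG_TRIGGER, 1)

result = struct.unpack_from("<4i", memory, 16)     # software reads the output in place
print(result)
```

The point of the sketch is that no data is copied between software and the accelerator: both sides see the same `memory` object, so handing work over costs only a few register writes.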

The following hardware accelerators are available (find the latest papers and accelerators on their website):

HWCE: Convolution engine
XNE: Binary Neural Network Inference
RBE: Convolutions, flexible precision for weights and activations
NE16: Convolutions, flexible precision for weights
FFT Accelerator
RedMulE: floating-point GEMM accelerator
IMA: In-Memory Computing
SNE: Digital SNN Accelerator for Sparse Event-Based Convolutions (uses a modified HWPE infrastructure)

Prof. Benini provided an overview of the PULP concepts covered in the previous sections.

2.6. Silicon Proven designs

They have a long tradition of taping out ASICs at ETH Zurich; check their Chip Gallery. They have designed and tested over 40 PULP-related designs in several technologies (more details are available here).

Figure 12 chips

2.7. Software

Figure 13 PULP software

The PULP Microcontroller Software Interface Standard (PMSIS) provides the Board Support Package (BSP), the Application Programming Interface (API), and the drivers for running applications on PULP-based Microcontrollers (MCUs). It was developed by extending the older pulp-rt, which was used, for example, for the Mr. Wolf processor.

The GCC and LLVM compilers used for PULP are based on the GNU GCC and LLVM, respectively, supporting the PULP ISA based on the RISC-V standard ISA and specific extensions such as Xpulpv0, Xpulpv1, Xpulpv2, and XpulpNN, which have distinct features and application domains.

PULPOS is an optimized software library for operating system functionalities, including tasking, memory management, and interrupts. Alternatively, FreeRTOS is also ported for PULP, including drivers. The Hardware Abstraction Layer (HAL) is a set of functions that hide the register-level details of the memory map, allowing for common programming entry points for typical hardware modules.

2.8. Useful libraries

2.8.1. Deploying DNNs on PULP

2.8.1.1. PULP NN

PULP NN is a multicore computing library for Quantized Neural Network (QNN) inference on PULP clusters of RISC-V-based processors. It includes optimized kernels such as convolution, matrix multiplication, pooling, normalization, and other common state-of-the-art QNN kernels. It fully exploits the Xpulp ISA extension and the cluster's parallelism to achieve high performance and energy efficiency on PULP-based devices. It has been tested on GWT GAP8.

Work on L1 memory; data exchange with outer memory levels is managed at a higher level
Exploit parallelism + vectorization capabilities of PULP RI5CY/CV32E40P cores
Try to transform all linear operators into a GEMM (General Matrix Multiplication) form

GEMM-based convolution is a technique for implementing convolutional neural networks (CNNs) using matrix multiplication operations. It is based on the observation that the computation performed by the convolution operation can be expressed as a matrix multiplication between the input data and a set of learnable weights, followed by a bias term and an activation function. "GEMM" stands for "general matrix multiplication", a fundamental operation in linear algebra. Once the convolution is expressed as a matrix multiplication, it can be implemented efficiently using hardware or software optimized for GEMM operations.

Target Height/Width/Channel (HWC) data layout
Open-source code
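The GEMM formulation mentioned above can be sketched in a few lines of plain Python. This is a generic im2col-plus-matrix-multiply illustration for a stride-1, unpadded convolution on HWC data, not PULP-NN's actual kernels; the function names are chosen for this example only.

```python
def im2col_hwc(image, kh, kw):
    """Turn an H x W x C image into a matrix with one flattened patch per row."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            patch = []
            for dy in range(kh):
                for dx in range(kw):
                    patch.extend(image[y + dy][x + dx])   # channels are innermost (HWC)
            rows.append(patch)
    return rows


def matmul(a, b):
    """Plain matrix product: (M x K) times (K x N)."""
    return [[sum(a_row[k] * b[k][n] for k in range(len(b))) for n in range(len(b[0]))]
            for a_row in a]


def conv2d_gemm(image, weights, kh, kw):
    """Valid convolution (stride 1, no padding) computed as im2col followed by GEMM.

    `weights` has shape (kh*kw*C) x C_out, flattened in the same order as the patches.
    """
    out_w = len(image[0]) - kw + 1
    flat = matmul(im2col_hwc(image, kh, kw), weights)
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]   # back to H x W x C_out


# 3 x 3 image with 1 channel, one 2 x 2 filter of all ones (sums each window).
image = [[[1], [2], [3]],
         [[4], [5], [6]],
         [[7], [8], [9]]]
weights = [[1], [1], [1], [1]]

output = conv2d_gemm(image, weights, kh=2, kw=2)
print(output)
```

With the HWC layout, the channels of one pixel are contiguous, so each patch row is built from short contiguous runs. That is one reason this layout pairs well with SIMD dot-product instructions.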

More details are available here

2.8.1.2. DORY

DORY (Deployment Oriented to memoRY) is an automatic tool to deploy DNNs on PULP platforms. DORY abstracts the DNN tiling problem as a Constraint Programming (CP) problem, maximizing L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, DORY augments the CP formulation with heuristics that promote performance-effective tile sizes based on the PULP-NN or other custom DNN backends to maximize speed. For more details, visit here. Additionally, Alessio discusses the PULP Virtual Platform and DORY, an automated tool for deploying DNNs on memory-constrained devices.
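The core of the tiling problem can be illustrated with a brute-force search: pick the output tile whose working set fills L1 as much as possible without exceeding it. This is only a sketch of the objective. DORY uses a constraint programming solver with additional constraints and heuristics, and the cost model, layer shape, and function names below are assumptions made for this example.

```python
from itertools import product


def tile_bytes(th, tw, c_in, c_out, k):
    """L1 footprint of one tile: input patch + weights + output patch, 1 byte per element."""
    in_h, in_w = th + k - 1, tw + k - 1          # input halo for a k x k filter, stride 1
    return in_h * in_w * c_in + k * k * c_in * c_out + th * tw * c_out


def best_tile(out_h, out_w, c_in, c_out, k, l1_budget):
    """Exhaustively pick the output tile that fills L1 the most without overflowing it."""
    best = None
    for th, tw in product(range(1, out_h + 1), range(1, out_w + 1)):
        size = tile_bytes(th, tw, c_in, c_out, k)
        if size <= l1_budget and (best is None or size > best[0]):
            best = (size, th, tw)
    return best


# Hypothetical layer: 32 x 32 output, 16 -> 32 channels, 3 x 3 filter, 32 KiB of L1.
size, th, tw = best_tile(32, 32, c_in=16, c_out=32, k=3, l1_budget=32 * 1024)
print(f"tile {th}x{tw} uses {size} of {32 * 1024} bytes")
```

A real deployment flow would also reserve room for double buffering, so that the DMA can fetch the next tile while the cores compute on the current one.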

2.8.1.3. QuantLab

QuantLab is a tool for training, comparing, and deploying quantized neural networks (QNNs). It was developed on top of the PyTorch deep learning framework and is a purely command-line-based tool. For more details, see here.
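For readers new to quantized networks, the sketch below shows the simplest form of quantization: symmetric, per-tensor rounding of floating-point values to 8-bit integers. It is a generic textbook scheme written in plain Python, not QuantLab's algorithm or API, and the weight values are arbitrary.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers with one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the shared scale."""
    return [scale * v for v in q]


weights = [0.50, -1.27, 0.03, 0.99, -0.64]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why tensors with a few large outliers quantize poorly: the outliers inflate the scale for every other value.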

You can find the QuantLab video here.

In this tutorial, Matteo presents QuantLab, a PyTorch-based software tool designed to train, optimize, and prepare quantized neural networks for deployment on PULP platforms.

In this talk, Dr. Francesco Conti discusses an Open-Source Flow for DNNs on ultra-low-power RISC-V Cores.

2.8.1.4. PULP-TrainLib

PULP-TrainLib is the first Deep Neural Network training library for the PULP Platform. PULP-TrainLib features a wide set of performance-tunable DNN layer primitives for training, together with optimizers, losses, and activation functions. To enable on-device training, PULP-TrainLib is equipped with AutoTuner, a pre-deployment tool that selects the fastest configuration for each DNN layer based on the training step to be performed and the shapes of the layer tensors. To facilitate the deployment of training tasks on the target PULP device, PULP-TrainLib is equipped with the TrainLib Deployer, a code generator that generates a project folder containing all the necessary files and code to run a DNN training task on PULP.
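The selection step performed by a pre-deployment tuner can be pictured as a lookup over profiling results: for each layer and training step, keep the configuration with the lowest cost. The sketch below is not the AutoTuner's implementation; the configuration names and cycle counts are made up purely to show the idea.

```python
def pick_fastest(measurements):
    """Return, per (layer, step), the configuration with the lowest measured cycle count."""
    best = {}
    for (layer, step, config), cycles in measurements.items():
        key = (layer, step)
        if key not in best or cycles < best[key][1]:
            best[key] = (config, cycles)
    return {key: config for key, (config, _) in best.items()}


# Made-up cycle counts; in practice they would come from profiling runs.
measurements = {
    ("conv1", "forward", "unroll2"): 1200,
    ("conv1", "forward", "unroll4"): 950,
    ("conv1", "weight_grad", "unroll2"): 1800,
    ("conv1", "weight_grad", "unroll4"): 2100,
}

choice = pick_fastest(measurements)
print(choice)
```

Note that the best configuration can differ between the forward pass and the gradient computation of the same layer, which is why the selection is keyed on the training step as well as the layer.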

2.8.1.5. Deeploy: DNN Compiler for Heterogeneous SoCs

Deeploy is an ONNX-to-C compiler that generates low-level optimized C code for multi-cluster, heterogeneous SoCs. Its goal is to enable configurable deployment flows from a bottom-up compiler perspective, modeling target hardware in a fine-grained and modular manner.

2.8.2. Digital Signal Processing

2.8.2.1. PULP DSP

PULP DSP provides optimized functions for digital signal processing, such as dot product, matrix multiplication, convolution, and fast Fourier transform, for various data types (8-, 16-, 32-bit integer and fixed-point, and single-precision floating-point). The optimized implementations use SIMD instructions, hardware loops, parallel clusters, and other features. It has been tested on Mr. Wolf, which features Ibex and CV32E40P cores, and on pulp-open. It can also be run on GWT GAP8, which features CV32E40P cores. For more details, please visit the repository and refer to the documentation, which explains how to use the library and gives advice on optimizing code for PULP.
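As an example of the fixed-point arithmetic such a library deals with, here is a dot product in the common Q1.15 format, written in plain Python. It illustrates the number format only and does not reproduce the PULP DSP API; `to_q15` and `dot_q15` are names invented for this sketch.

```python
Q15_ONE = 1 << 15          # scaling factor: the integer n represents n / 32768


def to_q15(x):
    """Convert a float in [-1, 1) to a saturated 16-bit Q1.15 integer."""
    return max(-Q15_ONE, min(Q15_ONE - 1, round(x * Q15_ONE)))


def dot_q15(a, b):
    """Dot product of two Q1.15 vectors with a wide accumulator; result returned in Q1.15."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y               # each product carries 30 fractional bits
    return acc >> 15               # drop 15 fractional bits to return to Q1.15


a = [0.5, -0.25, 0.125]
b = [0.5, 0.5, -0.5]

result_q15 = dot_q15([to_q15(v) for v in a], [to_q15(v) for v in b])
print(result_q15 / Q15_ONE)
```

Accumulating the full-width products and shifting once at the end keeps more precision than shifting after every multiplication. A production kernel would also saturate the final result to 16 bits, which this sketch omits.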

2.8.3. PULP-DroNet

PULP-DroNet is a deep learning-powered visual navigation engine that enables autonomous navigation of a pocket-sized quadrotor in a previously unseen environment. Thanks to PULP-DroNet, the nano-drone can explore its environment, avoiding collisions with dynamic obstacles, in complete autonomy: no human operator, no ad-hoc external signals, and no remote laptop. This means that all complex computations are performed quickly and directly aboard the vehicle. The visual navigation engine comprises both software and hardware components.

More details are available in these videos:

2.9. How we can develop an application on PULP

Let’s now take a user's point of view. If you’d like to develop an application using machine learning or digital signal processing algorithms, you can start with PULP SDK if you intend to use mostly integer operations or with Snitch Runtime for optimized floating-point operations. If you wish to use Linux, CVA6 will be your choice. If you intend to develop Hardware (HW), e.g., an HW accelerator and would like to test some simple software code quickly, then you can go with PULP Runtime. Finally, if you prefer FreeRTOS, you can go with the PULP FreeRTOS.
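The selection rules in this paragraph can be written down as a small decision function. The order in which the conditions are checked is an assumption of this sketch (the text does not rank them), and the function name and keyword arguments are invented for illustration.

```python
def suggest_environment(*, needs_linux=False, prefers_freertos=False,
                        developing_hardware=False, floating_point_heavy=False):
    """Encode the selection rules from this section as a simple priority list."""
    if needs_linux:
        return "CVA6"
    if prefers_freertos:
        return "PULP FreeRTOS"
    if developing_hardware:
        return "PULP Runtime"
    if floating_point_heavy:
        return "Snitch Runtime"
    return "PULP SDK"            # default for mostly-integer ML/DSP applications


print(suggest_environment())
print(suggest_environment(floating_point_heavy=True))
print(suggest_environment(developing_hardware=True))
```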

Figure 14 an overall flow of PULP usage

Nazareno Bruschi introduces the Software Development Kit for PULP, while Giuseppe elaborates on the GCC Compilation Toolchain.

2.9.1. PULP SDK

PULP SDK includes the fundamental libraries, tools, and scripts necessary for developing applications for PULP chips, such as platform descriptions, operating system libraries, drivers, and simulators.

It includes the GVSoC virtual platform, which guarantees high accuracy, encompassing all PULP hardware IP models, such as cores, clusters, interconnects, caches, and uDMA. It is an event-based simulator (cycle-accurate at a core level, with statistical approximations at the interconnect level). It results in fast simulations and allows an agile reconfiguration thanks to JSON-based platform description files and Python generators.
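To show why an event-based simulator can be fast, the sketch below implements a minimal discrete-event loop: simulated time jumps directly from one scheduled event to the next instead of advancing cycle by cycle. It is a generic illustration and does not use GVSoC's API; the class, the event names, and the 40-cycle DMA latency are all hypothetical.

```python
import heapq


class EventQueue:
    """Minimal discrete-event engine: time jumps from one scheduled event to the next."""

    def __init__(self):
        self.now = 0
        self._events = []
        self._order = 0            # tie-breaker keeps same-cycle events in FIFO order
        self.trace = []

    def schedule(self, delay, name, action=None):
        heapq.heappush(self._events, (self.now + delay, self._order, name, action))
        self._order += 1

    def run(self):
        while self._events:
            self.now, _, name, action = heapq.heappop(self._events)
            self.trace.append((self.now, name))
            if action:
                action(self)


def start_dma(sim):
    # Hypothetical latency: the transfer completes 40 cycles after it is issued.
    sim.schedule(40, "dma_done")


sim = EventQueue()
sim.schedule(0, "core_issue_dma", start_dma)
sim.schedule(5, "core_compute_done")
sim.run()

for cycle, name in sim.trace:
    print(cycle, name)
```

Only three events are processed to cover 40 simulated cycles. The recorded `trace` plays the role of the architecture event dumps that a virtual platform can produce for debugging.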

Nazareno Bruschi talks about GVSoC here.

The virtual platform allows developers to dump architecture events, helping them debug their applications by providing a clearer view of what is happening in the system. For example, it can display executed instructions, DMA transfers, generated events, and memory accesses, among other details. The generated traces can be visualized using GTKWave.

2.9.2. PULP FreeRTOS

PULP FreeRTOS provides FreeRTOS and drivers for developing real-time applications on PULP-based systems. Programs can be run using RTL simulation (simulating the hardware design), e.g., QuestaSim, or the GVSoC virtual platform (software emulation of the hardware design). A book about FreeRTOS can be found here, and the official documentation is available on this website. It has been tested on PULPissimo, pulp-open, and ControlPULP.

2.9.3. PULP Runtime

PULP Runtime provides a minimal way to run a bare-metal program on PULP architectures. Programs can be run using RTL simulation, e.g., QuestaSim. You can use it, for example, when developing a new piece of hardware, such as a hardware accelerator. It has been tested on PULPissimo, pulp-open, ControlPULP, and Marsellus.

2.9.4. Snitch Runtime

Snitch Runtime provides a fundamental, bare-metal runtime for Snitch systems. It exposes a minimal API to manage the execution of code across the available cores and clusters, query information about a thread's context, and coordinate and exchange data with other threads.

It includes an LLVM-based binary translation simulator for Snitch systems, called Banshee, which can specifically emulate custom instruction set extensions (instruction-accurate).

For Snitch, the Trace-viewer or Catapult is used to visualize traces.

The DSP, NN, DORY, and QuantLab workflow support is under development.

2.9.5. CVA6

The CVA6 SDK is used for CVA6, a 6-stage, single-issue, in-order CPU that implements the 64-bit RISC-V instruction set. You can simulate CVA6 in QuestaSim, VCS, and Verilator (the Verilator output can be visualised with GTKWave) and emulate CVA6 on FPGAs.

2.10. Some Case Studies

2.10.1. GAP8

Low-Power License Plate Detection and Recognition on a RISC-V Multi-Core MCU-based Vision System

The project implements an image-based deep learning pipeline to detect license plates and read the registration number. The algorithm is based on a two-step inference model:

MobileNet SSD-Lite to detect license plates within greyscale 320x240 images.
LPRNet to read the registration number from 94x24 license plate crops.

Models are invoked in sequence to run the full pipeline.
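The sequential structure of such a pipeline can be sketched as follows. The two networks are replaced by stand-in functions, the image is a tiny placeholder, and the resizing of crops to the recognizer's input size is omitted, so this shows only the control flow under those assumptions and not the project's code.

```python
def crop(image, box):
    """Cut a (top, left, height, width) box out of a row-major greyscale image."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def run_pipeline(image, detector, recognizer):
    """Run the detector once, then the recognizer once per detected plate crop."""
    return [recognizer(crop(image, box)) for box in detector(image)]


# Stand-ins for the two networks; real models would replace these callables.
def fake_detector(image):
    return [(1, 1, 2, 3)]                      # one hypothetical box


def fake_recognizer(plate):
    return f"{len(plate)}x{len(plate[0])} crop"


image = [[0] * 8 for _ in range(6)]            # tiny placeholder instead of 320x240
results = run_pipeline(image, fake_detector, fake_recognizer)
print(results)
```

Because the stages run one after the other, only one model's activations need to be resident at a time, which matters on a memory-constrained MCU.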

Figure 15 License Plate Detection

More details are available in this talk.

Enabling perception on Nano-Robots
- Paper
- GitHub

Figure 16 Nano-Robots

2.10.2. GAP9

Mixed-Precision Speech Enhancement on multi-core MCUs

This paper presents an optimized methodology for designing and deploying Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU) featuring 1+8 general-purpose RISC-V cores. To achieve low-latency execution, we propose an optimized software pipeline that interleaves parallel computation of LSTM or GRU recurrent blocks, featuring vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16) compute units, with manually managed memory transfers of model parameters. To ensure minimal accuracy degradation compared to full-precision models, we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ) scheme that compresses the recurrent layers to 8-bit precision, while maintaining the remaining layers at FP16 precision. Experiments are conducted on multiple LSTM and GRU-based SE models trained on the Valentini dataset, which features up to 1.24 million parameters. Thanks to the proposed approaches, we can speed up the computation by up to 4 times with respect to the lossless FP16 baselines. Unlike a uniform 8-bit quantization, which degrades the PESQ score by 0.3 on average, the Mixed-Precision PTQ scheme results in a low degradation of only 0.06 while achieving a 1.4-1.7x memory savings. Thanks to this compression, we reduce the power cost of the external memory by fitting large models onto the limited on-chip non-volatile memory, and we achieve an MCU power saving of up to 2.5x by lowering the supply voltage from 0.8V to 0.65V while still meeting the real-time constraints. Our design results are 10 times more energy-efficient than state-of-the-art SE solutions deployed on single-core MCUs, which utilize smaller models and quantization-aware training.

Continual On-device Learning on Multi-Core RISC-V Microcontrollers

More details are available here.

Transprecision Floating Point Unit on PULP

An Open-Source Transprecision FPU on PULP:

ahmad-mirsalari/PULP
