The present work constitutes the preliminary development of a compact convolutional neural network intended solely for exploratory experimentation. Owing to the modest scale and limited diversity of the dataset employed, the resulting model is not yet calibrated for robust inference performance; its weights, biases, and quantisation parameters require further refinement before any substantive evaluation can be undertaken. The principal aim at this stage has therefore been to establish a functional prototype rather than to optimise predictive accuracy or generalisability.
In parallel, an initial hardware architecture has been implemented in VHDL to investigate communication pathways and data-handling protocols between a Nexys-2 FPGA platform and a host system using libftdi1. This structure remains an early-stage framework whose primary purpose is to validate end-to-end connectivity, packet exchange, and basic processing flow within a quantised-inference context. Substantial stabilisation and architectural consolidation remain necessary before a full hardware–software co-design pipeline can be realised.
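As a first host-side connectivity check, a minimal sketch along the following lines could exercise the FTDI channel through libftdi1. The vendor/product IDs (0x0403/0x6010), the single 32-bit command word, and its little-endian byte order are illustrative assumptions rather than the project's actual protocol.

    // host_link.cpp -- minimal connectivity check over libftdi1 (hypothetical framing).
    // Build with: g++ host_link.cpp -lftdi1
    #include <ftdi.h>
    #include <cstdio>

    int main() {
        ftdi_context *ftdi = ftdi_new();
        if (!ftdi) { std::fprintf(stderr, "ftdi_new failed\n"); return 1; }

        // Assumed IDs: 0x0403 is the FTDI vendor ID; the product ID depends on the chip.
        if (ftdi_usb_open(ftdi, 0x0403, 0x6010) < 0) {
            std::fprintf(stderr, "open failed: %s\n", ftdi_get_error_string(ftdi));
            ftdi_free(ftdi);
            return 1;
        }

        // Send one hypothetical 32-bit command word, least-significant byte first.
        unsigned char tx[4] = {0x01, 0x00, 0x00, 0x00};
        if (ftdi_write_data(ftdi, tx, 4) != 4)
            std::fprintf(stderr, "short write\n");

        // Poll for an echo/status word coming back from the FPGA.
        unsigned char rx[4] = {0};
        int n = ftdi_read_data(ftdi, rx, 4);
        std::printf("received %d byte(s): %02x %02x %02x %02x\n", n, rx[0], rx[1], rx[2], rx[3]);

        ftdi_usb_close(ftdi);
        ftdi_free(ftdi);
        return 0;
    }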
+------------------------------------------------------------+
| top_level.vhd                                               |
|                                                              |
|  +--------------------+        +--------------------+      |
|  | usb_rx_controller  | -----> | packet_parser      |      |
|  +--------------------+        +--------------------+      |
|           | word_out                    | opcode/payload   |
|           v                             v                  |
|  +--------------------+        +--------------------+      |
|  | processing_core    | -----> | conv2d_engine      |      |
|  +--------------------+        +--------------------+      |
|                                                              |
|  +--------------------+        +--------------------+      |
|  | packet_builder     | -----> | usb_tx_controller  |      |
|  +--------------------+        +--------------------+      |
+------------------------------------------------------------+
This project develops a compact inference engine for quantised convolutional neural networks, focusing on fixed-point computation, weight quantisation, and low-precision execution. The architecture implements 8-bit (or sub-8-bit) quantised convolutional layers, employing reusable multiply–accumulate units organised in a tiled or systolic configuration to reduce hardware cost while sustaining throughput. Activation functions are realised through lightweight lookup tables to minimise latency.
The design explores end-to-end inference pipelining, including quantised weight storage and transmission, intermediate buffering, and layer-level scheduling for performance tuning. The overarching objective is to demonstrate how quantisation-aware design, together with structured datapath reuse, can yield an efficient, low-resource CNN accelerator suitable for FPGA-based deployment.
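To make the intended arithmetic concrete, the sketch below models a single quantised multiply–accumulate followed by requantisation and a table-based activation, in the spirit of the mac_array and activation_rom blocks described later. The requantisation constants, zero point, and LUT contents are placeholders chosen for illustration; the hardware table is 256x32 bits, whereas this model keeps everything in 8-bit integers.

    // quant_mac_model.cpp -- bit-accurate sketch of one quantised MAC + activation.
    // Requantisation constants and LUT contents are illustrative placeholders.
    #include <cstdint>
    #include <array>
    #include <algorithm>
    #include <cstdio>

    // Requantise a 32-bit accumulator to int8 with a fixed-point multiplier and shift.
    static int8_t requantize(int32_t acc, int32_t mult, int shift, int32_t zero_point) {
        int64_t v = (static_cast<int64_t>(acc) * mult) >> shift;   // fixed-point scaling
        v += zero_point;
        return static_cast<int8_t>(std::clamp<int64_t>(v, -128, 127));
    }

    int main() {
        // One 3x3 window of int8 activations and weights (placeholder values).
        std::array<int8_t, 9> x = {12, -3, 7, 0, 5, -9, 4, 4, 1};
        std::array<int8_t, 9> w = {1, 2, -1, 0, 3, 1, -2, 1, 0};
        int32_t bias = 16;

        // Multiply-accumulate in 32 bits, as a hardware MAC unit would.
        int32_t acc = bias;
        for (size_t i = 0; i < x.size(); ++i)
            acc += static_cast<int32_t>(x[i]) * static_cast<int32_t>(w[i]);

        // Requantise back to 8 bits (constants chosen arbitrarily here).
        int8_t y = requantize(acc, /*mult=*/1717986918, /*shift=*/32, /*zero_point=*/0);

        // Activation via a 256-entry LUT indexed by the 8-bit value: here simply ReLU.
        std::array<int8_t, 256> lut{};
        for (int i = 0; i < 256; ++i) {
            int8_t v = static_cast<int8_t>(i - 128);
            lut[static_cast<uint8_t>(v)] = std::max<int8_t>(v, 0);
        }
        int8_t activated = lut[static_cast<uint8_t>(y)];

        std::printf("acc=%d  requantised=%d  activated=%d\n", acc, y, activated);
        return 0;
    }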
+-----------------------+        +-----------------------+        +-----------------------+
| Host PC                | <====> | USB / Adept FW        | <====> | usb_tx/rx pins        |
| (C++ app using Adept)  |        | (FTDI on Nexys-2)     |        | (mapped via top.ucf)  |
+-----------------------+        +-----------------------+        +-----------------------+
|
V (usb_rx / usb_tx signals)
+------------------------------------+
| top_level                          |
|------------------------------------|
| usb_rx_controller                  |  <-- bytes -> words (32b)
| packet_parser                      |  <-- parsed opcode/payload
| params_registers                   |  <-- store params
| top_processing_core (CNN engine)   |
|   ├─ conv_controller (FSM)         |
|   ├─ mac_array (N parallel MACs)   |
|   ├─ input_bram (dual-port)        |
|   ├─ weight_bram (ROM/RAM)         |
|   ├─ bias_bram                     |
|   └─ activation_rom (LUT 256x32)   |
| packet_builder                     |  <-- assemble 32b words
| usb_tx_controller                  |
+------------------------------------+
|
V (FPGA pins / HOST)
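Because the exact packet layout is still being stabilised, the helper below only illustrates the kind of framing the diagram implies: an opcode word and a payload-length word followed by 32-bit payload words, each serialised as four bytes for usb_rx_controller to reassemble. The opcode value, word order, and little-endian byte order are assumptions for illustration.

    // packet_framing.cpp -- hypothetical framing model for packet_parser/packet_builder.
    // Assumed layout: word 0 = opcode, word 1 = payload length (in words), then payload.
    #include <cstdint>
    #include <vector>
    #include <cstdio>

    // Serialise 32-bit words into the byte stream sent over USB (little-endian assumed).
    std::vector<uint8_t> to_bytes(const std::vector<uint32_t>& words) {
        std::vector<uint8_t> bytes;
        for (uint32_t w : words)
            for (int i = 0; i < 4; ++i)
                bytes.push_back(static_cast<uint8_t>(w >> (8 * i)));
        return bytes;
    }

    // Reassemble bytes into 32-bit words, as usb_rx_controller would in hardware.
    std::vector<uint32_t> to_words(const std::vector<uint8_t>& bytes) {
        std::vector<uint32_t> words;
        for (size_t i = 0; i + 3 < bytes.size(); i += 4) {
            uint32_t w = 0;
            for (int j = 0; j < 4; ++j)
                w |= static_cast<uint32_t>(bytes[i + j]) << (8 * j);
            words.push_back(w);
        }
        return words;
    }

    int main() {
        // Example: hypothetical "load weights" command with a two-word payload.
        std::vector<uint32_t> packet = {0x00000002 /*opcode*/, 2 /*len*/, 0xDEADBEEF, 0x00000042};
        auto stream = to_bytes(packet);
        auto parsed = to_words(stream);
        std::printf("opcode=0x%08x  len=%u  payload[0]=0x%08x\n",
                    parsed[0], parsed[1], parsed[2]);
        return 0;
    }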
- Input: 1×6×6 grayscale image
- ConvLayer: 1 -> 4 channels, kernel 3×3, valid convolution
- DenseLayer: 64 -> 3
- Output: 3 logits -> softmax probabilities
Input (1×6×6)
│
Conv1 (1→4, 3×3)
▼
Output: 4×4×4
│
Flatten
▼
64 features
│
Dense (64→3)
▼
Softmax → [p0, p1, p2]
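A plain floating-point reference of this toy pipeline is useful for checking the shape bookkeeping (6×6 input, valid 3×3 convolution to 4×4 per channel, 64 flattened features, 3 logits). The weights below are placeholders, and the model deliberately ignores the quantised arithmetic targeted on the FPGA.

    // toy_forward.cpp -- floating-point reference for the 1x6x6 demo network.
    // Weights and biases are placeholders; only shapes and dataflow match the diagram.
    #include <array>
    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    constexpr int IN = 6, K = 3, OUT = IN - K + 1, CH = 4;  // 6x6 in, 3x3 valid -> 4x4, 4 channels
    constexpr int FLAT = CH * OUT * OUT;                     // 64 features
    constexpr int CLASSES = 3;

    int main() {
        std::array<std::array<float, IN>, IN> img{};                 // 1x6x6 input
        std::array<std::array<std::array<float, K>, K>, CH> kern{};  // 4 kernels of 3x3
        std::array<float, CH> conv_bias{};
        std::array<std::array<float, FLAT>, CLASSES> dense_w{};
        std::array<float, CLASSES> dense_b{};

        // Fill with small deterministic placeholder values.
        for (int r = 0; r < IN; ++r)
            for (int c = 0; c < IN; ++c)
                img[r][c] = 0.1f * (r + c);
        for (int ch = 0; ch < CH; ++ch)
            for (int r = 0; r < K; ++r)
                for (int c = 0; c < K; ++c)
                    kern[ch][r][c] = 0.01f * (ch + r - c);
        for (int o = 0; o < CLASSES; ++o)
            for (int f = 0; f < FLAT; ++f)
                dense_w[o][f] = 0.005f * ((o + f) % 7 - 3);

        // ConvLayer: 1 -> 4 channels, 3x3 valid convolution, then flatten to 64 features.
        std::array<float, FLAT> feat{};
        for (int ch = 0; ch < CH; ++ch)
            for (int r = 0; r < OUT; ++r)
                for (int c = 0; c < OUT; ++c) {
                    float acc = conv_bias[ch];
                    for (int kr = 0; kr < K; ++kr)
                        for (int kc = 0; kc < K; ++kc)
                            acc += img[r + kr][c + kc] * kern[ch][kr][kc];
                    feat[ch * OUT * OUT + r * OUT + c] = acc;
                }

        // DenseLayer: 64 -> 3 logits.
        std::array<float, CLASSES> logits{};
        for (int o = 0; o < CLASSES; ++o) {
            logits[o] = dense_b[o];
            for (int f = 0; f < FLAT; ++f)
                logits[o] += dense_w[o][f] * feat[f];
        }

        // Numerically stable softmax over the 3 logits.
        float m = std::max({logits[0], logits[1], logits[2]});
        float sum = 0.0f;
        std::array<float, CLASSES> prob{};
        for (int o = 0; o < CLASSES; ++o) { prob[o] = std::exp(logits[o] - m); sum += prob[o]; }
        for (int o = 0; o < CLASSES; ++o) prob[o] /= sum;

        std::printf("p = [%.4f, %.4f, %.4f]\n", prob[0], prob[1], prob[2]);
        return 0;
    }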
-------------------------------
Input (N,1,28,28)
│
Conv1 1→32, k=3, pad=1
▼
ReLU
▼
MaxPool 2×2
▼
Conv2 32→64, k=3, pad=1
▼
ReLU
▼
MaxPool 2×2
▼
Flatten (N,64*7*7=3136)
▼
FC1 3136→128
▼
ReLU
▼
FC2 128→10
▼
Softmax → probabilities
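The 2×2 max-pooling stages deserve an explicit model because the backward pass below routes gradients only to the winning positions. A minimal single-channel sketch (even height and width assumed) that records the argmax index for later reuse:

    // maxpool2x2_model.cpp -- 2x2/stride-2 max pooling with argmax bookkeeping.
    // Single channel and even H, W are assumed to keep the sketch short.
    #include <vector>
    #include <cstdio>

    struct PoolResult {
        std::vector<float> out;     // (H/2) * (W/2) pooled values
        std::vector<int>   argmax;  // flat index into the input for each output element
    };

    PoolResult maxpool2x2(const std::vector<float>& in, int H, int W) {
        PoolResult r;
        for (int i = 0; i < H; i += 2)
            for (int j = 0; j < W; j += 2) {
                int best = i * W + j;
                for (int di = 0; di < 2; ++di)
                    for (int dj = 0; dj < 2; ++dj) {
                        int idx = (i + di) * W + (j + dj);
                        if (in[idx] > in[best]) best = idx;
                    }
                r.out.push_back(in[best]);
                r.argmax.push_back(best);   // remembered so the backward pass can scatter
            }
        return r;
    }

    int main() {
        std::vector<float> x = {1, 5, 2, 0,
                                3, 4, 7, 1,
                                0, 2, 9, 8,
                                6, 1, 3, 2};   // 4x4 example
        PoolResult p = maxpool2x2(x, 4, 4);
        for (size_t k = 0; k < p.out.size(); ++k)
            std::printf("out[%zu]=%.0f (from input index %d)\n", k, p.out[k], p.argmax[k]);
        return 0;
    }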
------------------------------------
dL/dOut (N,10)
│
FC2 backward → dW2: (10,128), dInput: (N,128)
│
ReLU backward (N,128)
│
FC1 backward → dW1: (128,3136), dInput: (N,3136)
│
Unflatten → (N,64,7,7)
│
MaxPool2 backward → (N,64,14,14)
│
ReLU backward → (N,64,14,14)
│
Conv2 backward → dW_conv2: (64,32,3,3), dInput: (N,32,14,14)
│
MaxPool1 backward → (N,32,28,28)
│
ReLU backward → (N,32,28,28)
│
Conv1 backward → dW_conv1: (32,1,3,3), dInput: (N,1,28,28)
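For the dense layers, the gradient shapes in the diagram follow directly from y = x·W^T + b: dW = dy^T·x, db is the sum of dy over the batch, and dx = dy·W. The sketch below uses the FC2 dimensions (batch N, 128 inputs, 10 outputs) with placeholder values.

    // fc_backward_model.cpp -- gradients of a fully connected layer y = x*W^T + b.
    // Shapes follow FC2 in the diagram: x (N,128), W (10,128), dy (N,10).
    #include <vector>
    #include <cstdio>

    int main() {
        const int N = 2, IN = 128, OUT = 10;
        std::vector<float> x(N * IN, 0.01f);    // forward input (placeholder values)
        std::vector<float> W(OUT * IN, 0.02f);  // weights
        std::vector<float> dy(N * OUT, 1.0f);   // incoming gradient dL/dOut

        std::vector<float> dW(OUT * IN, 0.0f);  // (10,128)
        std::vector<float> db(OUT, 0.0f);       // (10)
        std::vector<float> dx(N * IN, 0.0f);    // (N,128)

        for (int n = 0; n < N; ++n)
            for (int o = 0; o < OUT; ++o) {
                float g = dy[n * OUT + o];
                db[o] += g;                                   // db = sum over batch of dy
                for (int i = 0; i < IN; ++i) {
                    dW[o * IN + i] += g * x[n * IN + i];      // dW = dy^T * x
                    dx[n * IN + i] += g * W[o * IN + i];      // dx = dy * W
                }
            }

        std::printf("dW[0,0]=%.4f  db[0]=%.1f  dx[0,0]=%.4f\n", dW[0], db[0], dx[0]);
        return 0;
    }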
----------------------------------------
Conv(1→32,3x3) → ReLU → MaxPool2
Conv(32→64,3x3) → ReLU → MaxPool2
Flatten → FC(128) → ReLU → FC(10) → Softmax
-----------------------------------------
PC (C++, Adept SDK)
|
| USB
|
FTDI / USB microcontroller (on the Nexys-2 board)
|
| JTAG / Slave Serial / SPI
|
FPGA XC3S500E
Should the proposed architecture demonstrate insufficient efficiency when deployed on an FPGA, the project will be extended to encompass alternative high-performance computing paradigms. In particular, I intend to investigate the feasibility of implementing the inference pipeline on modern graphics processing units, leveraging mature parallelisation frameworks and established concepts from supercomputing practice. This progression would enable a broader exploration of throughput, latency characteristics, and scalability, thereby providing a rigorous comparative basis for determining the most appropriate computational substrate for quantised neural network inference.