- The size-16384 matrixMul on T4 and A100 GPUs, profiled with PyTorch Profiler
- Benchmark data: measured peak GPU throughput (GFLOPS) and L1/L2/HBM bandwidth (GB/s)
- DeviceQuery and bandwidth (BW) results from the CUDA samples
- Guide to determining the hardware limits of the GPUs
- Update the GPU Microbenchmark suite to take parameters from the command line and test many sizes instead of a fixed one.
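
As a rough illustration of the last item and of how the GFLOPS figures relate to a matrixMul size, here is a minimal Python sketch. It is not the suite's actual code: the function names and the `--sizes` flag are hypothetical, and it assumes a dense square N x N multiply costing 2·N³ floating-point operations, with the elapsed time measured elsewhere (for example by a profiler).

```python
import argparse


def matmul_flops(n: int) -> int:
    """FLOPs for a dense n x n by n x n multiply: n^3 multiplies plus n^3 adds.

    Assumption: square matrices and the standard (non-Strassen) algorithm.
    """
    return 2 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Convert a measured wall time for one multiply into GFLOPS."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return matmul_flops(n) / seconds / 1e9


def parse_sizes(argv=None):
    """Read the matrix sizes to test from the command line.

    The --sizes flag is a hypothetical interface, not the suite's real one.
    """
    parser = argparse.ArgumentParser(description="matrixMul size sweep (sketch)")
    parser.add_argument("--sizes", type=int, nargs="+", default=[16384],
                        help="one or more square matrix sizes to test")
    return parser.parse_args(argv).sizes


# Example sweep over two sizes, passed explicitly instead of via sys.argv.
for n in parse_sizes(["--sizes", "1024", "16384"]):
    print(f"N={n}: {matmul_flops(n)} FLOPs per multiply")
```

Dividing the FLOP count by a measured time gives a number that can be compared against the measured peak throughput listed above.
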
Last updated by xmei@jlab.org on Oct-21-2022