Merged
6 changes: 6 additions & 0 deletions .github/actions/spelling/allow/terms.txt
@@ -48,14 +48,17 @@ backpropagation
biodynamo
bioinformatics
blogs
chrono
cms
codegen
consteval
cout
cplusplus
cppyy
cytokine
cytokines
doxygen
endl
gitlab
gpu
gridlay
@@ -70,6 +73,7 @@ llm
llvm
meetinglist
microenvironments
nomarkdown
omp
openmp
oop
@@ -82,6 +86,7 @@ rntuple
samtools
samtoramntuple
sbo
setprecision
sitemap
softsusy
superbuilds
@@ -129,6 +134,7 @@ MAMODE
meetup
metaprogramming
Miapb
milli
multilanguage
omnidisciplinary
optimisation
@@ -6,11 +6,12 @@
author: Hristiyan Shterev
permalink: blogs/xeus_cpp_Hristiyan_Shterev_blog/
date: 2026-03-01
tags: xeus-cpp cuda jupyter c++ xeus
tags: xeus-cpp cuda jupyter c++ xeus internship high-school systems-programming
custom_css: jupyter
---

{% include dual-banner.html
left_logo="/images/mg-pld-logo.png"

Check failure on line 14 in _posts/2026-03-01-Creating teaching materials for C++ and CUDA with xeus-cpp.md (GitHub Actions / Check Spelling): `pld` is not a recognized word. (unrecognized-spelling)
right_logo="/images/cr-logo_old.png"
caption=""
height="20vh" %}
@@ -56,11 +57,140 @@

## Example

**CPU - std::sort vs GPU - Merge sort speed test**

The example below shows a C++ benchmark comparing the performance of sorting a large array on a CPU versus a GPU. It provides a clear visual of how parallel processing can drastically outperform traditional sequential execution for data-heavy tasks.

<img src="/images/blog/MergeSortTest.png"/>
{::nomarkdown}

<div tabindex="-1" id="notebook" class="border-box-sizing">
<div class="container" id="notebook-container">
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt"></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="CPU - std::sort vs GPU - Merge sort speed test">CPU - std::sort vs GPU - Merge sort speed test<a class="anchor-link" href="#CPU - std::sort vs GPU - Merge sort speed test">&#182;</a></h1>
<p>
The example below is a C++ benchmark that sorts an array of 1,048,576 unsigned integers on the CPU with std::sort and on the GPU with a merge sort, then compares the two run times. It illustrates how parallel processing can outperform sequential execution on data-heavy tasks.
</p>

<p>
The first cell creates the input data: two identical arrays filled in descending order, one for the CPU and one for the GPU. A compiled CUDA shared library (.so file) was loaded beforehand.
</p>
</div>
</div>
</div>

<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[1]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-c++">
<pre>
<span class="kt">unsigned int</span> <span class="n">N_bench = <span class="mi">1048576</span>;</span>
<span class="n">std<span class="o">::</span>vector&lt;</span><span class="kt">unsigned int</span><span class="n">&gt; data_cpu(N_bench);</span>
<span class="n">std<span class="o">::</span>vector&lt;</span><span class="kt">unsigned int</span><span class="n">&gt; data_gpu(N_bench);</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="n">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N_bench;</span> <span class="n">i</span><span class="o">++</span><span class="n">)</span> <span class="p">{</span>
    <span class="kt">unsigned int </span><span class="n">val = N_bench - i;</span>
    <span class="n">data_cpu[i] <span class="o">=</span> val;</span>
    <span class="n">data_gpu[i] <span class="o">=</span> val;</span>
<span class="p">}</span>
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt"></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="CPU and GPU sorting">CPU and GPU sorting<a class="anchor-link" href="#CPU and GPU sorting">&#182;</a></h1>
<p>
Next we use std::sort and the merge_sort_gpu_full function from the loaded CUDA code, and measure how long the CPU and the GPU each take to sort the data.
</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[2]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-c++">
<pre>
<span class="k">auto</span> <span class="n">start_cpu = std<span class="o">::</span>chrono<span class="o">::</span>high_resolution_clock<span class="o">::</span>now();</span>
<span class="n">std<span class="o">::</span>sort(data_cpu.begin(), data_cpu.end());</span>
<span class="k">auto</span> <span class="n">end_cpu = std<span class="o">::</span>chrono<span class="o">::</span>high_resolution_clock<span class="o">::</span>now();</span>

<span class="n">std<span class="o">::</span>chrono<span class="o">::</span>duration&lt;double, std<span class="o">::</span>milli&gt; cpu_ms <span class="o">=</span> end_cpu - start_cpu;</span>

<span class="k">auto</span> <span class="n">start_gpu = std::chrono::high_resolution_clock::now();</span>
<span class="n">merge_sort_gpu_full(data_gpu.data(), N_bench);</span>
<span class="k">auto</span> <span class="n">end_gpu = std::chrono::high_resolution_clock::now();</span>

<span class="n">std<span class="o">::</span>chrono<span class="o">::</span>duration&lt;double, std<span class="o">::</span>milli&gt; gpu_ms <span class="o">=</span> end_gpu - start_gpu;</span>
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt"></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Printing the times and comparing them">Printing the times and comparing them<a class="anchor-link" href="#Printing the times and comparing them">&#182;</a></h1>
<p>
Finally, we print both times and compute their ratio to see how much faster the GPU sort was in this run.
</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[3]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-c++">
<pre>
<span class="n"><span class="kt">double</span> speedup = cpu_ms.count() / gpu_ms.count();</span>

<span class="n">std<span class="o">::</span>cout &lt;&lt; <span class="s">"CPU (std::sort) took: "</span> &lt;&lt; std<span class="o">::</span>fixed &lt;&lt; std<span class="o">::</span>setprecision(<span class="mi">4</span>) &lt;&lt; cpu_ms.count() &lt;&lt; <span class="s">" ms"</span> &lt;&lt; std<span class="o">::</span>endl;</span>
<span class="n">std<span class="o">::</span>cout &lt;&lt; <span class="s">"GPU (Merge Sort) took: "</span> &lt;&lt; gpu_ms.count() &lt;&lt; <span class="s">" ms"</span> &lt;&lt; std<span class="o">::</span>endl;</span>

<span class="n">std<span class="o">::</span>cout &lt;&lt; std<span class="o">::</span>endl;</span>

<span class="n">std<span class="o">::</span>cout &lt;&lt; <span class="s">"GPU Speedup: "</span> &lt;&lt; speedup &lt;&lt; <span class="s">" times faster than CPU"</span> &lt;&lt; std<span class="o">::</span>endl;</span>
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="prompt output_prompt">Out[3]:</div>
<div class="output_text output_subarea output_execute_result">
<pre>
CPU (std<span class="o">::</span>sort) took: 145.3539 ms
GPU (Merge Sort) took: 9.6199 ms

GPU Speedup: 15.1097 times faster than CPU
</pre>
</div>
</div>
</div>
</div>
</div>
</div>

<br /> <br /> <br />

{:/}
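
A single timed run can be noisy, and the notebook does not check that either result is actually sorted. As a minimal sketch, the hypothetical helper `median_sort_ms` (not part of the notebook) times a sort over several runs on fresh copies of the input, verifies each result with `std::is_sorted`, and reports the median. It is written against a generic callable; wrapping `merge_sort_gpu_full` the same way assumes it takes a pointer and an element count, as the call in the second cell suggests.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper, not part of the notebook: runs `sort_fn` on a fresh copy
// of `input` `runs` times and returns the median wall-clock time in milliseconds.
// Returns -1.0 if any run leaves the data unsorted.
double median_sort_ms(const std::vector<unsigned int>& input,
                      const std::function<void(std::vector<unsigned int>&)>& sort_fn,
                      int runs = 5) {
    assert(runs > 0);
    std::vector<double> times;
    for (int r = 0; r < runs; ++r) {
        std::vector<unsigned int> work = input;  // copy is made outside the timed region
        auto start = std::chrono::steady_clock::now();
        sort_fn(work);
        auto end = std::chrono::steady_clock::now();
        if (!std::is_sorted(work.begin(), work.end())) {
            return -1.0;  // a fast but wrong sort should not count as a result
        }
        times.push_back(std::chrono::duration<double, std::milli>(end - start).count());
    }
    std::sort(times.begin(), times.end());
    return times[times.size() / 2];  // middle element of the sorted timings
}
```

For example, `median_sort_ms(data_cpu, [](std::vector<unsigned int>& v) { std::sort(v.begin(), v.end()); })` returns the median of five runs in milliseconds. Because each run sorts a copy, the input stays in its original descending order between runs.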

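The source of `merge_sort_gpu_full` is not shown in this post, so the following is only a CPU sketch of the general bottom-up merge sort idea, under a hypothetical function name. Each pass merges neighbouring sorted runs of length `width`, and the merges within one pass touch disjoint slices of the array, which is what makes this style of sort a natural fit for parallel hardware. For 1,048,576 elements the outer loop runs 20 passes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU sketch of bottom-up merge sort (hypothetical name). This is not the
// notebook's merge_sort_gpu_full; it only illustrates the pass structure.
void bottom_up_merge_sort(std::vector<unsigned int>& data) {
    const std::size_t n = data.size();
    std::vector<unsigned int> buffer(n);
    // width is the length of the already-sorted runs; it doubles every pass.
    for (std::size_t width = 1; width < n; width *= 2) {
        // Every iteration of this inner loop reads and writes its own slice,
        // so the iterations of one pass do not depend on each other.
        for (std::size_t left = 0; left < n; left += 2 * width) {
            const std::size_t mid = std::min(left + width, n);
            const std::size_t right = std::min(left + 2 * width, n);
            std::merge(data.begin() + left, data.begin() + mid,
                       data.begin() + mid, data.begin() + right,
                       buffer.begin() + left);
        }
        data.swap(buffer);  // the merged pass becomes the input of the next one
    }
    assert(std::is_sorted(data.begin(), data.end()));  // debug-only postcondition
}
```

Calling `bottom_up_merge_sort(data_cpu)` on the descending input from the first cell would leave it in ascending order. A GPU version would typically hand the independent merges of each pass to separate threads or blocks, but how the loaded CUDA code does this is not covered here.
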
## Related links

Binary file removed images/blog/MergeSortTest.png
Binary file not shown.