Skip to content

Weber State University. Computer Science 6850. Parallel Programming & Architecture. Summer 2025. Dr. Brad Peterson. Implementing Gaussian smoothing using MPI & OpenMP parallelism on large images. The solution splits the image into strips, applies smoothing with varying sigma values, and compares performance with serial execution

License

Notifications You must be signed in to change notification settings

bell-kevin/kokkos

  1. What Kokkos is

Kokkos is a performance portability library for C++ used mostly in High-Performance Computing (HPC). It lets developers write parallel code once, and then Kokkos maps it to the right backend:

OpenMP for multicore CPUs

CUDA or HIP for GPUs

Threads for general CPU threading

SYCL (experimental)

Serial fallback mode

This means the same code runs on NVIDIA GPUs, AMD GPUs, Intel GPUs, and many-core CPUs.

  1. Why Kokkos Exists

Modern HPC systems look like this:

Many CPU cores

Several GPU accelerators

Very complex memory hierarchies

Traditionally, you’d have to write different code for each:

CUDA code for NVIDIA GPUs

OpenMP code for CPUs

HIP for AMD GPUs

SYCL or DPC++ for Intel GPUs

Kokkos eliminates this problem by abstracting:

(A) Execution spaces (how/where parallelism runs) (B) Memory spaces (where data is stored) (C) Views (portable multidimensional arrays)

So you get portability AND performance, not one or the other.

CS 6850 - Assignment – Image Processing

Brad Peterson - Weber State University

Overall goal: Utilize a portability tool to implement a graphics blurring problem capable of execution on multiple threading backends.

General idea:

Use multiple MPI ranks to smooth an image. As each MPI rank needs information from its neighbors, they will participate in a round of MPI sends and MPI receives.

This assignment’s image split into 4 strips. For this assignment, keep the number of MPI ranks fixed at 4. Each MPI rank will utilize all CPU cores available within it. (This is sometimes referred to as OpenMP + MPI parallelism.)

In practice, MPI would not be used on a problem like this, as 2D image processing for smoothing easy to fit within one compute node. But this assignment lays a foundation for bigger problems. Additionally, a problem like this would supply each MPI rank its own strip of an image. Each MPI rank would then smooth its strip. All strips from all MPI ranks would then be gathered into one master image. This assignment assumes strips already exist and that the gathering process would be completed by something else.

Background

This homework uses the Gaussian Smoothing algorithm, which determines a new pixel color value by utilizing nearby pixel values. A good explanation is found here: https://homepages.inf.ed.ac.uk/rbf/HIPR2/gsmooth.htm. The closer the pixel, the larger its weight affects the target pixel. The 2D Gaussian function used here is given by the following formula, with mean (0,0) and sigma = 1:

For example, suppose a new smoothed pixel value is desired for pixel at row 2 column 2 (counting starts at index 0). Treat (2,2) as the origin point. Use a smoothing kernel or grid. An example of a 5x5 smoothing “kernel” is shown in the image below. The cell with “41” is the origin. (Note, this grid was computed by summing together a million points within each pixel, so this will be different from just merely running the formula on a single discrete integer (x,y) value.)

Thus, if your origin is a pixel at row 2 and column 2, then multiply the pixel value by 41. Then multiply the pixel value at row 2 column 3 by 26. Multiply the pixel value at row 2 column 4 by 7, and so on. You must multiply all 25 pixel values, sum the results, then divide by 273. The resulting integer is your new pixel value for row 2 and column 2.

However, this assignment will have you use a different computation to simplify.

First, do not average values within a cell by averaging finer values within a cell points like the prior link did. Also, use floating points for the grid cell values instead of rounding to ints.

So for example, in a 5x5 grid and sigma = 1, the (-2,-2) corner computes directly to 0.002915, the (-1,-2) cell computes to 0.013065, the (0,-2) cell computes to 0.02154, and so on. All 25 values add up to 0.981844 (remember this value, it will be used soon). This is close to, but not quite 1. A sum of 1 is desired. In the prior 5x5 grid example pixel values were weighted and summed and then divided by 273. By keeping the summation to 1, the algorithm would weight and sum and then divide by 1, and thus the dividing by 1 step isn’t necessary and the algorithm simpler. To get the sum of all 25 weights to become 1, compute 1 / 0.981844 = 1.018492. Then take each of those 25 values and multiply it by this new factor. Now the (-2, -2) corner is 0.002969, the (-1,-2) cell is 0.013306, the (0, -2) cell is 0.021938, and so on. All 25 values now sum to 1.

This 5x5 kernel is now used across each pixel. For simplicity, assume no smoothing any pixel outside the grid. For a 5x5 kernel, do not smooth the outer 2 pixel border of the image (if you wanted to smooth this border, just first duplicate the border pixel outward twice, then smooth).

Vary the sigma. Assume that any pixels at or beyond 3 sigma away from the original are too small to substantially affect the image. Thus, at sigma = 1, pixels 3 or more away are too small. Thus, go out -2 to 2 in the x and y directions. At sigma = 2, pixels 6 or more away are too small, and thus go out -5 to 5 in the x and y directions, giving a 11x11 kernel. Compute all integer sigmas between [1, 10]. This makes 10 total image smoothings. The sigma = 10 ignores pixels 30 or more away, giving a 59x59 kernel.

Varying the sigma also requires that the MPI communication patterns also change. At sigma = 2, ranks need to send 5 rows of pixels to neighboring ranks. At sigma = 10, ranks need to send 29 pixels to neighboring ranks.

Some bitmap caveats and details: The bitmap pixel values are not a single integer, but rather a 3-tuple of integers in a red-green-blue format. For each pixel, you must repeat the process three times, once for red, once for green, and once for blue. You will store the result as the red value, the green value, and the blue value, respectively. • For this assignment, the bitmap uses 24-bit colors, so red has a range of values 0-255, green gets 0-255, and blue gets 0-255. Thus, each color component of a pixel is 8 bits. • Do not store your results in-place. Rather, create a new bitmap that is a copy of the first. Write your results into the new bitmap. • Bitmaps store the bottom left pixel first, and the top right pixel last. The data is row-major, so the second pixel in the bitmap is going to be the pixel to the right of the bottom left pixel. • Bitmaps store the bits in blue/green/red order. Not red/green/blue. • For this assignment, do not worry about any edges. Only process those pixels 3 sigma in from an edge in any direction. • Many programmers confuse i/j with width/height. Since height is the number of rows, i should work with height, and j should work with width. That means your view should be allocated to have a height dimension first, and a width dimension second. An outer i loop should iterate up to height, and an inner j loop should iterate up to width. • If you wish to debug your pixel values against the image, remember that bitmap’s (0,0) is bottom left, and graphic viewers put (0,0) top left. I fixed this debugging dilemma by flipping my image vertically in an image viewer, as now the coordinates line up. Also, I used Gimp, which indicates the pixels in (column, row), so I had to mentally flip them in my head.

Overall, your assignment just needs to iterate across all appropriate pixels, compute a new pixel value using its neighbors, and then store the new pixel value out to file.

You can find the image at: /share/HK-7_left_H6D-400c-MS.bmp. It is 1.2 GB and is a 23200x17400 image = 404 megapixel image. You can also find each individual strip with a -0.bmp, -1.bmp, -2.bmp, or -3.bmp suffix.

You can also practice with a screw image: /share/HK-7_left_H6D-400c-MS_screw.bmp. It is a 288x328 pixel image. This image was not placed into strips.

Verification

Verify if your code is computing correctly by printing the values at these pixels for a 5x5 grid sigma = 1. (These values are close to actual values.)

The red, green, blue at (8353, 9111) (origin bottom left) is (252, 249, 248) The red, green, blue at (8351, 9113) (origin bottom left) is (141, 76, 68) The red, green, blue at (6352, 15231) (origin bottom left) is (67, 35, 18) The red, green, blue at (10559, 10611) (origin bottom left) is (191, 187, 172) The red, green, blue at (10818, 20226) (origin bottom left) is (73, 75, 69)

Submission

Obtain wall time for processing the image in these scenarios (ignore reading into an array and writing out to file):

1) Serial execution, 4 MPI ranks, sigma = 1, and 1 thread per MPI rank.

(Not part of this assignment - Parallel execution using OpenMP inside an MPI rank, 4 MPI ranks, sigma varying integer values between 1 and 10. 1, 2, 3, . . . 10, and many threads per MPI rank.)

Submit a brief report of your times as well as speedup vs serial execution.

Building

To build, use this:

/opt/openmpi-4.1.5/bin/mpic++ -g -Wall -o program.x name_of_program.cc

To run:

/opt/openmpi-4.1.5/bin/mpiexec -n 4 ./program.x

The MPI front-end now defaults to evaluating every integer sigma between 1 and 10 unless you override the list with --sigma or --sigma-range.

Reproducing assignment timing scenarios

Use the helper script after building gaussian_blur.x to gather the serial baseline and the four-rank MPI sweep:

./scripts/run_assignment_scenarios.sh <input-bmp> [output-prefix]

The script runs a serial baseline with sigma = 1 followed by a four-rank MPI execution that processes sigma values 1 through 10 while gathering each blurred image on rank 0. When the input image is too small for a particular sigma (for example, when strips have fewer rows than the kernel radius), that sigma is skipped and the tool prints a note so you can adjust your experiment.

Below is starter code which should assist you: #include #include //#include #include #include

using std::string; using std::ios; using std::ifstream; using std::ofstream;

double duration(std::chrono::high_resolution_clock::time_point start, std::chrono::high_resolution_clock::time_point end ) { return std::chrono::duration<double, std::milli>(end - start).count(); }

int main(int argc, char ** argv) { { string filename = "HK-7_left_H6D-400c-MS.bmp"; //std::uintmax_t filesize = std::filesystem::file_size(filename); //printf("The file size is %ju\n", filesize);

// Open File
ifstream fin(filename, ios::in | ios::binary);

if(!fin.is_open()) {
  printf("File not opened\n");
  return -1;
}
// The first 14 bytes are the header, containing four values. Get those four values.
char header[2];
uint32_t filesize;
uint32_t dummy;
uint32_t offset;
fin.read(header, 2);
fin.read((char*)&filesize, 4);
fin.read((char*)&dummy, 4);
fin.read((char*)&offset, 4);
printf("header: %c%c\n", header[0], header[1]);
printf("filesize: %u\n", filesize);
printf("dummy %u\n", dummy);
printf("offset: %u\n", offset);
int32_t sizeOfHeader;
int32_t width;
int32_t height;
fin.read((char*)&sizeOfHeader, 4);
fin.read((char*)&width, 4);
fin.read((char*)&height, 4);
printf("The width: %d\n", width);
printf("The height: %d\n", height);
uint16_t numColorPanes;
uint16_t numBitsPerPixel;
fin.read((char*)&numColorPanes, 2);
fin.read((char*)&numBitsPerPixel, 2);
printf("The number of bits per pixel: %u\n", numBitsPerPixel);
if (numBitsPerPixel == 24) {
  printf("This bitmap uses rgb, where the first byte is blue, second byte is green, third byte is red.\n");
}
//uint32_t rowSize = (numBitsPerPixel * width + 31) / 32 * 4;
//printf("Each row in the image requires %u bytes\n", rowSize);

// Jump to offset where the bitmap pixel data starts
fin.seekg(offset, ios::beg);

// Read the data part of the file
unsigned char* h_buffer = new unsigned char[filesize-offset];
fin.read((char*)h_buffer, filesize-offset);
std::chrono::high_resolution_clock::time_point start;
std::chrono::high_resolution_clock::time_point end;
printf("The first pixel is located in the bottom left. Its blue/green/red values are (%u, %u, %u)\n", h_buffer[0], h_buffer[1], h_buffer[2]);
printf("The second pixel is to the right. Its blue/green/red values are (%u, %u, %u)\n", h_buffer[3], h_buffer[4], h_buffer[5]);

// TODO: Read the image into a local array or arrays
   
delete[] h_buffer;

    
start = std::chrono::high_resolution_clock::now();
// TODO: Perform the blurring
end = std::chrono::high_resolution_clock::now();
printf("Time - %g ms\n", duration(start,end));

// TODO: Verification
printf("The red, green, blue at (8353, 9111) (origin bottom left) is (%d, %d, %d)\n", 0, 0, 0);
printf("The red, green, blue at (8351, 9113) (origin bottom left) is (%d, %d, %d)\n", 0, 0, 0);
printf("The red, green, blue at (6352, 15231) (origin bottom left) is (%d, %d, %d)\n", 0, 0, 0);
printf("The red, green, blue at (10559, 10611) (origin bottom left) is (%d, %d, %d)\n", 0, 0, 0);
printf("The red, green, blue at (10818, 20226) (origin bottom left) is (%d, %d, %d)\n", 0, 0, 0);

//Print out to file output.bmp
string outputFile = "output.bmp";
ofstream fout;
fout.open(outputFile, ios::binary);

// Copy of the old headers into the new output
fin.seekg(0, ios::beg);
// Read the data part of the file
char* headers = new char[offset];
fin.read(headers, offset);
fout.seekp(0, ios::beg);
fout.write(headers, offset);
delete[] headers;

fout.seekp(offset, ios::beg);
// TODO: Copy out the rest of the array or arrays to file (hint, use fout.put())
fout.close();

} }

This was given out as just extra practice after the end of summer 2025 semester.


== We're Using GitHub Under Protest ==

This project is currently hosted on GitHub. This is not ideal; GitHub is a proprietary, trade-secret system that is not Free and Open Souce Software (FOSS). We are deeply concerned about using a proprietary system like GitHub to develop our FOSS project. I have a website where the project contributors are actively discussing how we can move away from GitHub in the long term. We urge you to read about the Give up GitHub campaign from the Software Freedom Conservancy to understand some of the reasons why GitHub is not a good place to host FOSS projects.

If you are a contributor who personally has already quit using GitHub, please email me at [email protected] for how to send us contributions without using GitHub directly.

Any use of this project's code by GitHub Copilot, past or present, is done without our permission. We do not consent to GitHub's use of this project's code in Copilot.

Logo of the GiveUpGitHub campaign

back to top

About

Weber State University. Computer Science 6850. Parallel Programming & Architecture. Summer 2025. Dr. Brad Peterson. Implementing Gaussian smoothing using MPI & OpenMP parallelism on large images. The solution splits the image into strips, applies smoothing with varying sigma values, and compares performance with serial execution

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

No packages published