RidgeRun CUDA Optimisation Guide/Empirical Experiments/Multi-threaded bounding test


Introduction

This page is a follow-up to the Cuda Memory Benchmark; it adds multithreading to the testing of the different memory management modes. The test set is reduced to traditional memory, managed memory, page-locked memory (with and without an explicit copy call), and CUDA mapped memory.

Testing Setup

Memory Management Methods

The program under test can be configured to use each of the following memory management modes; a minimal allocation sketch for each one is shown after the list:

  • Traditional mode, using malloc to reserve the memory on the host, cudaMalloc to reserve it on the device, and cudaMemcpy to move the data between them. Internally, the driver allocates a non-pageable staging buffer, copies the data there, and only then transfers it to the device.
  • Managed, using cudaMallocManaged, so there is no need to copy the data manually or to handle two different pointers.
  • Non-paging memory, using cudaMallocHost to reserve a chunk of page-locked memory that the device can use directly, since it is non-pageable.
  • Non-paging memory with discrete copy, using cudaMallocHost together with an explicit call to cudaMemcpy. This is similar to the traditional model, with separate host and device pointers, but according to the NVIDIA documentation on cudaMallocHost, cudaMemcpy calls are accelerated when this type of memory is used.
  • Zero-Copy Memory, using cudaHostAlloc to reserve memory that is page-locked and directly accessible to the device. Different flags can change the properties of the memory; in this case, the flags used were cudaHostAllocMapped and cudaHostAllocWriteCombined.
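
The following is a minimal sketch of the allocation and copy pattern behind each mode, not the benchmark code itself: the buffer size N, the work kernel, and the per-mode helper functions are hypothetical names used only for illustration.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

constexpr int N = 1 << 20;  // hypothetical buffer size

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Traditional: separate host and device buffers, explicit cudaMemcpy in both directions.
void traditional() {
    float *h = static_cast<float *>(malloc(N * sizeof(float)));
    float *d = nullptr;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);  // driver stages through pinned memory
    work<<<(N + 255) / 256, 256>>>(d, N);
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);
}

// Managed: one pointer valid on host and device, no explicit copies.
void managed() {
    float *m = nullptr;
    cudaMallocManaged(&m, N * sizeof(float));
    work<<<(N + 255) / 256, 256>>>(m, N);
    cudaDeviceSynchronize();
    cudaFree(m);
}

// Page-locked memory used directly by the device (no discrete copy).
void pinned_direct() {
    float *p = nullptr;
    cudaMallocHost(&p, N * sizeof(float));
    work<<<(N + 255) / 256, 256>>>(p, N);
    cudaDeviceSynchronize();
    cudaFreeHost(p);
}

// Page-locked memory with a discrete copy; cudaMemcpy from pinned memory is accelerated.
void pinned_with_copy() {
    float *p = nullptr, *d = nullptr;
    cudaMallocHost(&p, N * sizeof(float));
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, p, N * sizeof(float), cudaMemcpyHostToDevice);
    work<<<(N + 255) / 256, 256>>>(d, N);
    cudaMemcpy(p, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    cudaFreeHost(p);
}

// Zero-copy: mapped, write-combined host memory accessed through a device alias.
void zero_copy() {
    float *h = nullptr, *d = nullptr;
    cudaHostAlloc(&h, N * sizeof(float), cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostGetDevicePointer(&d, h, 0);
    work<<<(N + 255) / 256, 256>>>(d, N);
    cudaDeviceSynchronize();
    cudaFreeHost(h);
}

int main() {
    traditional();
    managed();
    pinned_direct();
    pinned_with_copy();
    zero_copy();
    return 0;
}
```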

Platforms

Program Structure

The program is divided into three main sections: one where the input memory is filled with data, the kernel worker threads, and the verification stage. The verification stage reads all the results and uses assert to check them. Before every test, 10 iterations of the full process were run as a warm-up to avoid any initialization time penalty. After that, the average of 100 runs was taken. Each of the sections can be seen in Figure 1.

Figure 1. Measurement points on the code
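
To make that structure concrete, the sketch below shows the three sections and the warm-up/averaging loop. The names NUM_THREADS, fill_input, run_worker and verify_output, the process kernel, the problem size, and the use of managed memory here are assumptions for illustration only; the real benchmark exercises each of the memory modes listed above.

```cpp
#include <cuda_runtime.h>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int NUM_THREADS = 4;    // hypothetical worker-thread count
constexpr int WARMUP_RUNS = 10;   // iterations discarded before measuring
constexpr int TIMED_RUNS  = 100;  // iterations averaged for the result

// Section 1: fill the input memory with data.
void fill_input(float *in, int n) {
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);
}

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Section 2: one kernel worker thread, launching on its own stream.
void run_worker(const float *in, float *out, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(in, out, n);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

// Section 3: read all results and verify them with assert.
void verify_output(const float *in, const float *out, int n) {
    for (int i = 0; i < n; ++i) assert(out[i] == in[i] * 2.0f);
}

double run_once(float *in, float *out, int n) {
    auto start = std::chrono::steady_clock::now();
    fill_input(in, n);
    std::vector<std::thread> workers;
    int chunk = n / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; ++t)
        workers.emplace_back(run_worker, in + t * chunk, out + t * chunk, chunk);
    for (auto &w : workers) w.join();
    cudaDeviceSynchronize();  // make all device work visible to the host before verifying
    verify_output(in, out, n);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    const int n = 1 << 20;                       // hypothetical problem size
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float));  // managed mode used here for brevity
    cudaMallocManaged(&out, n * sizeof(float));

    for (int i = 0; i < WARMUP_RUNS; ++i) run_once(in, out, n);  // warm-up, not measured

    double total = 0.0;
    for (int i = 0; i < TIMED_RUNS; ++i) total += run_once(in, out, n);
    std::printf("average: %.3f ms\n", total / TIMED_RUNS);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```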

Each kernel block can be seen in Figure 2.

Figure 2. Composition of a kernel block