RidgeRun CUDA Optimisation Guide/Empirical Experiments/Multi-threaded bounding test

== Introduction ==
This page is a follow-up to the [https://developer.ridgerun.com/wiki/index.php?title=CUDA_Memory_Management_Benchmark CUDA Memory Management Benchmark]; it adds multithreading to the testing of the different memory management modes. The test set is reduced to traditional memory, managed memory, page-locked memory (with and without an explicit copy call), and CUDA mapped (zero-copy) memory.

== Testing Setup ==

=== Memory Management Methods ===

The tested program has the option to use each of the following memory management configurations (a minimal allocation sketch for each mode is shown after the list):
*Traditional mode: malloc reserves the memory on the host, cudaMalloc reserves it on the device, and the data has to be moved between them with cudaMemcpy. Internally, the driver allocates a non-pageable staging buffer, copies the data there, and only then transfers it to the device.
*Managed memory: cudaMallocManaged provides a single pointer that is valid on both host and device, so there is no need to manually copy the data or handle two different pointers.
*Non-paging memory: with [https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gab84100ae1fa1b12eaca660207ef585b cudaMallocHost] a chunk of page-locked memory can be reserved that the device can use directly, since it is non-pageable.
*Non-paging memory with discrete copy: [https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gab84100ae1fa1b12eaca660207ef585b cudaMallocHost] plus an explicit call to cudaMemcpy. This is similar to the traditional model, with separate pointers for host and device, but according to the NVIDIA documentation on cudaMallocHost, calls to cudaMemcpy are accelerated when this type of memory is used.
*Zero-copy memory: [https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902 cudaHostAlloc] reserves memory that is page-locked and directly accessible to the device. Different flags can change the properties of the memory; in this case, the flags used were cudaHostAllocMapped and cudaHostAllocWriteCombined.
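
The sketch below illustrates how each of these allocation modes is typically set up with the CUDA runtime API. It is only a minimal, hypothetical example: the kernel, buffer size, and launch configuration are assumptions made for illustration and are not the actual benchmark code.

<syntaxhighlight lang="cpp">
#include <cuda_runtime.h>
#include <cstdlib>

// Placeholder workload; the real benchmark kernel is not shown on this page.
__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    // Input initialization, error checking, and cleanup are omitted for brevity.
    const int n = 1 << 20;                     // assumed buffer size
    const size_t bytes = n * sizeof(float);
    const int blocks = (n + 255) / 256;

    // 1. Traditional: separate host/device buffers and explicit cudaMemcpy calls.
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    process<<<blocks, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // 2. Managed: a single pointer valid on host and device, no explicit copies.
    float *m_in, *m_out;
    cudaMallocManaged((void **)&m_in, bytes);
    cudaMallocManaged((void **)&m_out, bytes);
    process<<<blocks, 256>>>(m_in, m_out, n);
    cudaDeviceSynchronize();

    // 3. Page-locked memory used directly by the kernel (relies on unified
    //    virtual addressing, so the host pointer is also valid on the device).
    float *p_in, *p_out;
    cudaMallocHost((void **)&p_in, bytes);
    cudaMallocHost((void **)&p_out, bytes);
    process<<<blocks, 256>>>(p_in, p_out, n);
    cudaDeviceSynchronize();

    // 4. Page-locked memory with a discrete copy to a separate device buffer;
    //    cudaMemcpy from pinned memory skips the driver's internal staging copy.
    float *pc_in, *pc_out, *dc_in, *dc_out;
    cudaMallocHost((void **)&pc_in, bytes);
    cudaMallocHost((void **)&pc_out, bytes);
    cudaMalloc((void **)&dc_in, bytes);
    cudaMalloc((void **)&dc_out, bytes);
    cudaMemcpy(dc_in, pc_in, bytes, cudaMemcpyHostToDevice);
    process<<<blocks, 256>>>(dc_in, dc_out, n);
    cudaMemcpy(pc_out, dc_out, bytes, cudaMemcpyDeviceToHost);

    // 5. Zero-copy (mapped) memory: cudaHostAlloc with the mapped/write-combined
    //    flags, then fetch the device-side alias of each pointer.
    float *z_in, *z_out, *dz_in, *dz_out;
    cudaHostAlloc((void **)&z_in, bytes, cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostAlloc((void **)&z_out, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dz_in, z_in, 0);
    cudaHostGetDevicePointer((void **)&dz_out, z_out, 0);
    process<<<blocks, 256>>>(dz_in, dz_out, n);
    cudaDeviceSynchronize();

    return 0;
}
</syntaxhighlight>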

=== Platforms ===

*Discrete GPU: a desktop PC with an RTX 2070 Super, using CUDA 12 and Ubuntu 20.04.
*[https://developer.ridgerun.com/wiki/index.php/NVIDIA_Jetson_Orin Jetson AGX Orin]: using CUDA 11.4 and JetPack 5.0.2.
*[https://developer.ridgerun.com/wiki/index.php/Jetson_Nano Jetson Nano 4GB devkit]: using CUDA 10.2 and JetPack 4.6.3.

=== Program Structure ===

The program is divided into three main sections: a stage where the input memory is filled with data, the kernel worker threads, and a verification stage that reads all the results and checks them with assert. Before every test, 10 iterations of the full process were run as a warm-up to avoid any initialization time penalty; after that, the average of 100 runs was taken. Each of the sections can be seen in Figure 1, and a simplified code sketch of this structure is shown after the figures.
<br>
[[File:Time points t mem bench.png|thumb|720px|center|Figure 1. Measurement points on the code]]

Each kernel block can be seen in Figure 2.

<br>
[[File:Kernel block comp.png|thumb|720px|center|Figure 2. Composition of a kernel block ]]
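
As a rough illustration of the structure described above, a worker-thread skeleton might look like the following. This is a hedged sketch, not the benchmark source: the kernel worker_kernel, the buffer size, the thread count, the use of one CUDA stream per std::thread, and the managed allocation are all assumptions made for the example.

<syntaxhighlight lang="cpp">
#include <cuda_runtime.h>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder workload; the real benchmark kernel is not shown on this page.
__global__ void worker_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// One full pass: fill the input, run one kernel per worker thread, then verify.
static void run_once(float *in, float *out, int n, int num_threads) {
    // 1. Fill the input memory with data.
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    // 2. Kernel worker threads: each thread launches on its own stream
    //    over its slice of the buffer.
    std::vector<std::thread> workers;
    const int chunk = n / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([=]() {
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            const int offset = t * chunk;
            const int blocks = (chunk + 255) / 256;
            worker_kernel<<<blocks, 256, 0, stream>>>(in + offset, out + offset, chunk);
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
        });
    }
    for (auto &w : workers) w.join();
    cudaDeviceSynchronize();  // make the results visible to the CPU before verifying

    // 3. Verify: read every result and assert it is correct.
    for (int i = 0; i < n; ++i) assert(out[i] == in[i] + 1.0f);
}

int main() {
    const int n = 1 << 20;        // assumed buffer size
    const int num_threads = 4;    // assumed worker-thread count
    float *in, *out;
    cudaMallocManaged((void **)&in, n * sizeof(float));   // managed mode shown; swap per test
    cudaMallocManaged((void **)&out, n * sizeof(float));

    for (int i = 0; i < 10; ++i) run_once(in, out, n, num_threads);   // warm-up iterations

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i) run_once(in, out, n, num_threads);  // measured runs
    auto stop = std::chrono::steady_clock::now();
    double avg_ms = std::chrono::duration<double, std::milli>(stop - start).count() / 100.0;
    printf("Average per run: %.3f ms\n", avg_ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
</syntaxhighlight>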
