Getting started with AI on NXP i.MX8M Plus - Neural Processing Unit - Use Case experiments: Smart Parking - Results








Serial results

TinyYOLO version 3 model execution time

In Fig. 1, the time-versus-iterations graph shows that the best performance is achieved by XNNPACK with multithreading. NNAPI takes more time than the CPU because some operations of the TinyYOLO v3 model are not supported by the NPU and must fall back to the CPU; this communication between heterogeneous modules degrades performance.

Figure 1 Execution time for TinyYOLO
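
For reference, timings like those in Fig. 1 can be collected with the TensorFlow Lite Python API. The sketch below is only an illustration, not the exact script used in these experiments: the model file name, the iteration count, and the delegate library path (NXP's VX delegate, which exposes the NPU on recent eIQ BSPs) are assumptions.

import time
import numpy as np
import tflite_runtime.interpreter as tflite

def benchmark(interpreter, iterations=100):
    """Time interpreter.invoke() over several iterations with random input."""
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.random_sample(inp['shape']).astype(inp['dtype'])
    times = []
    for _ in range(iterations):
        interpreter.set_tensor(inp['index'], dummy)
        start = time.perf_counter()
        interpreter.invoke()
        times.append(time.perf_counter() - start)
    return times

# CPU path: XNNPACK is the default CPU backend in recent TFLite builds,
# and num_threads selects the multithreaded variant.
cpu_times = benchmark(
    tflite.Interpreter(model_path='tiny_yolo_v3.tflite',  # hypothetical file name
                       num_threads=4))

# NPU path: the delegate library path is an assumption for NXP's eIQ BSP.
npu_times = benchmark(tflite.Interpreter(
    model_path='tiny_yolo_v3.tflite',
    experimental_delegates=[tflite.load_delegate('/usr/lib/libvx_delegate.so')]))

print('CPU avg: %.2f ms' % (1000 * sum(cpu_times) / len(cpu_times)))
print('NPU avg: %.2f ms' % (1000 * sum(npu_times) / len(npu_times)))

Note that when a delegate is attached, any operation it does not support silently falls back to the CPU, which is exactly the fallback behavior discussed above.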

Rosetta model execution time

For the Rosetta model, Fig. 2 shows similar performance for multithreaded and single-core CPU execution, but with NNAPI the performance falls short of expectations because the model contains operations that are not compatible with the NPU module. This result shows that choosing a model optimized for the target accelerator makes a real difference in NPU performance.

Figure 2 Execution time for Rosetta
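
One way to anticipate such fallbacks before benchmarking is to list the operators a model contains and compare them against the delegate's supported-operations documentation. The short sketch below uses TensorFlow's model analyzer (available since TF 2.7); the model file name is an assumption.

# Sketch: print the operator breakdown of a .tflite model so that
# unsupported operations (NPU fallback candidates) can be spotted early.
import tensorflow as tf

tf.lite.experimental.Analyzer.analyze(model_path='rosetta.tflite')  # hypothetical file name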


Parallel results

Execution time for TinyYOLO over CPU and Rosetta over CPU

Figure 3 shows that running Rosetta while sharing the CPU with TinyYOLO decreases Rosetta's performance during the first 50 iterations, but the execution time then becomes constant, below the average. In the case of TinyYOLO, the execution time is constant, excluding the very first iterations, where the CPU cache is still warming up.

Figure 3 Execution time for TinyYOLO over CPU and Rosetta over CPU
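
A simple way to reproduce this kind of parallel run is to launch one OS process per model, so that each interpreter schedules its own CPU threads independently. This is a minimal sketch under the same assumptions as before (hypothetical model file names, random input data):

import time
from multiprocessing import Process
import numpy as np
import tflite_runtime.interpreter as tflite

def run_on_cpu(model_path, num_threads, iterations=100):
    # Each process builds its own interpreter and times invoke() per iteration;
    # TFLite interpreters are not meant to be shared across processes.
    interpreter = tflite.Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.random_sample(inp['shape']).astype(inp['dtype'])
    times = []
    for _ in range(iterations):
        interpreter.set_tensor(inp['index'], dummy)
        start = time.perf_counter()
        interpreter.invoke()
        times.append(time.perf_counter() - start)
    print('%s: avg %.2f ms' % (model_path, 1000 * sum(times) / len(times)))

if __name__ == '__main__':
    # TinyYOLO on two CPU threads, Rosetta on one, running concurrently.
    procs = [Process(target=run_on_cpu, args=('tiny_yolo_v3.tflite', 2)),
             Process(target=run_on_cpu, args=('rosetta.tflite', 1))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()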

CPU Usage for TinyYOLO over CPU and Rosetta over CPU

When both models share resources, the combined CPU usage increases until the four cores of the board are in use. After the first iterations, Rosetta uses only one core of the board, while TinyYOLO uses two.

Figure 4 CPU usage for TinyYOLO over CPU and Rosetta over CPU
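
Per-core CPU usage curves like the ones in Fig. 4 can be sampled alongside the benchmark, for example with psutil (assuming it is installed on the board; the one-second sampling interval is an arbitrary choice):

# Sketch: sample per-core CPU usage once per second while the models run.
import psutil

def sample_cpu(seconds=60):
    samples = []
    for _ in range(seconds):
        # interval=1 blocks for one second and reports usage per core
        samples.append(psutil.cpu_percent(interval=1, percpu=True))
    return samples

for i, per_core in enumerate(sample_cpu(10)):
    print('t=%ds cores=%s' % (i, per_core))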

Execution time for TinyYOLO over NNAPI and Rosetta over CPU

On the other hand, when the two models execute on different modules, the execution time is more stable than when they share the CPU. This can be seen in Fig. 5, which shows that the execution time remains near the average.

Figure 5 Execution time for TinyYOLO over NNAPI and Rosetta over CPU
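
This stability observation can be quantified by taking the mean and standard deviation of the per-iteration times while discarding a few warm-up iterations. A small sketch, assuming a list of timings in seconds such as the one produced by the benchmark loop above:

import statistics

def summarize(times, warmup=5):
    # Drop the first iterations so cache warm-up does not skew the statistics.
    steady = times[warmup:]
    mean = statistics.mean(steady)
    stdev = statistics.stdev(steady)
    print('mean %.2f ms, stdev %.2f ms (%.1f%% of mean)'
          % (1000 * mean, 1000 * stdev, 100 * stdev / mean))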

CPU Usage for TinyYOLO over NNAPI and Rosetta over CPU

Finally, the main advantage of running the model executions on different modules is the CPU usage. Fig. 6 shows that Rosetta, executing on the CPU, uses only one core per iteration, while with TinyYOLO executing on the NPU, CPU usage decreases to less than 20%. There are peaks of CPU usage during the first iterations due to cache warm-up.

Figure 6 CPU usage for TinyYOLO over NNAPI and Rosetta over CPU


Previous: Neural Processing Unit/Use Case experiments: Smart Parking/Parallel Experiments Index Next: GStreamer