Convolution is the most computationally intensive task in a Convolutional Neural Network (CNN), demanding substantial memory and computational power. Different approaches exist to compute convolution and reduce its computational complexity. In this paper, a matrix multiplication-based convolution (ConvMM) approach is fully parallelized using the concurrent resources of a GPU (Graphics Processing Unit) and optimized, considerably improving the performance of image classifiers and making them applicable to real-time embedded applications. The flow of this CUDA (Compute Unified Device Architecture)-based scheme is optimized using unified memory and hardware-dependent acceleration of matrix multiplication. The proposed flow is evaluated on two different embedded platforms: first on an Nvidia Jetson TX1 embedded board and then on the Tegra K1 GPU of an Nvidia Shield K1 tablet. The performance of this optimized and accelerated convolutional layer is compared with its sequential and heterogeneous versions. Results show that the proposed scheme significantly improves overall results, including energy efficiency, storage requirements and inference performance. In particular, the proposed scheme on embedded GPUs is hundreds of times faster than the sequential version and delivers tens of times higher performance than the heterogeneous approach.
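As a rough illustration of the ConvMM idea (not the paper's implementation), the sketch below unrolls input patches into a column matrix with an im2col CUDA kernel, keeps all buffers in unified memory, and performs the convolution as a single cuBLAS SGEMM. All names, sizes and the stride-1, no-padding setup are illustrative assumptions.

```cuda
// Minimal ConvMM sketch: im2col into unified memory, then one SGEMM.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Unroll a CxHxW input into a (C*K*K) x (H_out*W_out) column matrix (stride 1, no padding).
__global__ void im2col(const float* in, float* cols,
                       int C, int H, int W, int K, int H_out, int W_out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;          // one thread per output pixel
    int n_pix = H_out * W_out;
    if (idx >= n_pix) return;
    int out_y = idx / W_out, out_x = idx % W_out;
    for (int c = 0; c < C; ++c)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx) {
                int row = (c * K + ky) * K + kx;              // row in the column matrix
                cols[row * n_pix + idx] =
                    in[(c * H + out_y + ky) * W + out_x + kx];
            }
}

int main() {
    const int C = 3, H = 32, W = 32, K = 3, M = 16;           // channels, input size, kernel, filters (assumed)
    const int H_out = H - K + 1, W_out = W - K + 1;
    const int n_pix = H_out * W_out, patch = C * K * K;

    float *in, *cols, *filt, *out;                            // unified memory: visible to CPU and GPU
    cudaMallocManaged(&in,   C * H * W     * sizeof(float));
    cudaMallocManaged(&cols, patch * n_pix * sizeof(float));
    cudaMallocManaged(&filt, M * patch     * sizeof(float));
    cudaMallocManaged(&out,  M * n_pix     * sizeof(float));
    for (int i = 0; i < C * H * W; ++i) in[i]   = 1.0f;       // dummy input
    for (int i = 0; i < M * patch; ++i) filt[i] = 0.1f;       // dummy filters

    im2col<<<(n_pix + 255) / 256, 256>>>(in, cols, C, H, W, K, H_out, W_out);

    // out (M x n_pix, row-major) = filt (M x patch) * cols (patch x n_pix).
    // cuBLAS is column-major, so compute the transposed product out^T = cols^T * filt^T.
    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                n_pix, M, patch,
                &alpha, cols, n_pix, filt, patch, &beta, out, n_pix);
    cudaDeviceSynchronize();

    printf("out[0] = %f (expected %d * 0.1 = %.1f)\n", out[0], patch, patch * 0.1f);
    cublasDestroy(h);
    cudaFree(in); cudaFree(cols); cudaFree(filt); cudaFree(out);
    return 0;
}
```

Unified memory (cudaMallocManaged) removes explicit host-device copies from the flow, while the hardware-accelerated SGEMM carries the bulk of the arithmetic; these are the two optimizations the abstract refers to, shown here only in schematic form.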