In recent years, Convolutional Neural Networks (CNNs) have enabled unprecedented progress on a wide range of computer vision tasks. However, training large CNNs is a resource-intensive task that requires specialized Graphics Processing Units (GPUs) and highly optimized implementations to get optimal performance from the hardware. GPU memory is a major bottleneck of the CNN training procedure, limiting the size of both inputs and model architectures. In this paper, we propose to alleviate this memory bottleneck by leveraging an under-utilized resource of modern systems: the device-to-host bandwidth. Our method, termed CPU offloading, works by transferring hidden activations to the CPU upon computation, in order to free GPU memory for upstream layer computations during the forward pass. These activations are then transferred back to the GPU as needed by the gradient computations of the backward pass. The key challenge of our method is to efficiently overlap data transfers and computations in order to minimize the wall time overhead induced by the additional data transfers. On a typical workstation with an Nvidia Titan X GPU, we show that our method compares favorably to gradient checkpointing: we are able to reduce the memory consumption of training a VGG19 model by 35% with a minimal additional wall time overhead of 21%. Further experiments detail the impact of the different optimizations we propose. Our method is orthogonal to other memory-reduction techniques such as quantization and sparsification, so it can easily be combined with them for further optimization.
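The paper's code is not shown here, but the offloading idea described above can be illustrated with PyTorch's saved-tensor hooks (available in recent PyTorch versions). The following is a minimal sketch, assuming a small convolutional stack for illustration; it omits the stream-level prefetching and synchronization the paper relies on to overlap transfers with computation and hide their latency:

```python
import torch
from torch import nn

def pack_to_cpu(tensor):
    # Copy the activation saved for backward into pinned host memory;
    # the asynchronous copy can overlap with the remaining forward pass.
    cpu_tensor = torch.empty_like(tensor, device="cpu", pin_memory=True)
    cpu_tensor.copy_(tensor, non_blocking=True)
    return (tensor.device, cpu_tensor)

def unpack_to_gpu(packed):
    # Bring the activation back to the GPU when the backward pass needs it.
    device, cpu_tensor = packed
    return cpu_tensor.to(device, non_blocking=True)

# Hypothetical toy model standing in for a large CNN such as VGG19.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
).cuda()

x = torch.randn(8, 3, 224, 224, device="cuda", requires_grad=True)

# Every tensor saved for the backward pass is routed through the hooks,
# so hidden activations reside on the CPU between the two passes.
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    loss = model(x).sum()
loss.backward()
```

In this sketch the transfers happen lazily, exactly when autograd saves or retrieves a tensor; the paper's contribution lies in scheduling these transfers so that they run concurrently with forward and backward computations rather than serializing with them.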