Figure 2
NVIDIA Visual Profiler snapshot of launch times. Standard launches issued from the host upon completion (left) incur driver overhead on the order of tens of microseconds per operation. Using CUDA stream memory operations (right) reduces the latency to a few microseconds.