Figure 3
Example of a modern GPU hardware architecture (modified from Lefohn et al., 2008). The von Neumann bottleneck formed by a single memory interface is eliminated. Each green square represents a scalar processor (SP) grouped within an array of streaming multiprocessors. Memory is arranged in three logical levels. Global memory (the lowest level in the figure) can be accessed by all streaming multiprocessors through individual memory interfaces. The memory types reflect the CUDA programming model: thread-local, intra-thread block-shared and globally shared memory. This logical hierarchy is mapped onto the hardware design. Thread-local memory is implemented in registers residing within the multiprocessors, which are mapped to individual SPs (not shown). Additionally, dynamic random-access memory (DRAM) can be allocated as private local memory per thread. Intra-thread block-shared memory is implemented as a fast parallel data cache integrated into each multiprocessor. Global memory is implemented as DRAM separated into read-only and read/write regions. Two levels of caching accelerate access to global memory: L1 is a read-only cache that is shared by all SPs and speeds up reads from the constant memory space (L2), which is a read-only region of global device memory. The caching mechanism is implemented per multiprocessor to eliminate the von Neumann bottleneck. A hardware mechanism, the single-instruction multiple-threads (SIMT) controller, creates threads and performs the context switching between them (work distribution), making the SIMT approach feasible. Currently, up to 12 000 threads can be executed with virtually no overhead.
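The mapping between the logical memory levels described in the caption and CUDA language constructs can be made concrete with a short kernel. The sketch below is illustrative and not taken from the figure or the cited source: the kernel name scaleAndSum, the constant kScale, the block size of 256 and the zero-initialized input are assumptions chosen for the example. It exercises thread-local registers, __shared__ (intra-thread block-shared) memory, __constant__ (cached, read-only) memory and global DRAM, and launches thousands of threads to illustrate hardware-managed SIMT scheduling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Constant memory: a read-only region of global device memory served
// through the on-chip read-only cache. kScale is a hypothetical
// parameter used only for this illustration.
__constant__ float kScale;

__global__ void scaleAndSum(const float *in, float *out, int n)
{
    // Intra-thread block-shared memory: the fast parallel data cache
    // integrated into each multiprocessor.
    __shared__ float tile[256];

    // Thread-local variables live in per-SP registers (and spill to
    // per-thread local DRAM if the register file is exhausted).
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (gid < n) ? in[gid] * kScale : 0.0f;  // global memory read

    tile[threadIdx.x] = v;
    __syncthreads();  // make the tile visible to the whole thread block

    // Block-wide tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];  // global memory write
}

int main()
{
    const int n = 1 << 14;   // 16 384 elements -> 16 384 threads
    const int block = 256;   // must match the static tile[] size
    const int grid = n / block;

    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, grid * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));  // placeholder input data

    float scale = 2.0f;
    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));

    // Launching thousands of threads is cheap: the SIMT controller
    // creates and schedules them in hardware.
    scaleAndSum<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    float result = 0.0f;
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("block 0 sum = %f\n", result);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that tile resides in the multiprocessor's on-chip shared memory while kScale is read through the constant cache; the 16 384 threads launched here are created and context-switched by the hardware work distributor, consistent with the caption's observation that thousands of threads run with virtually no overhead.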