Figure 3
Large arrays are partitioned along the axis of rotation using a two-tier scheme. Each larger, tier-1 partition is assigned to one GPU. The smaller, tier-2 partitions within it (orange and green rectangles) are streamed asynchronously to that GPU so that data transfers overlap with computation.
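
The index arithmetic behind such a scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `split_range` and `two_tier_partition`, the near-equal tier-1 split, and the fixed tier-2 chunk size are all assumptions made for illustration. The asynchronous transfer itself is omitted; only the slice ranges along the rotation axis are computed.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, nearly equal ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` ranges get one additional slice each.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def two_tier_partition(n_slices, n_gpus, chunk):
    """Return {gpu_id: [(lo, hi), ...]} of tier-2 chunks per tier-1 range.

    Hypothetical helper: tier 1 splits the slices across GPUs, tier 2 cuts
    each GPU's range into chunks of at most `chunk` slices for streaming.
    """
    plan = {}
    for gpu, (lo, hi) in enumerate(split_range(0, n_slices, n_gpus)):
        plan[gpu] = [(c, min(c + chunk, hi)) for c in range(lo, hi, chunk)]
    return plan


plan = two_tier_partition(n_slices=10, n_gpus=2, chunk=2)
```

In a real pipeline, each GPU would walk its list of tier-2 chunks and alternate between two buffers (the two colors in the figure), copying the next chunk while the current one is being processed.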

