Programming Massively Parallel Processors部分章节重点摘录

Chap2 Heterogeneous data parallel computing

1、CUDA 代码的结构：The structure of a CUDA C program reflects the coexistence of a host (CPU) and one or more devices (GPUs) in the computer. Each CUDA C source file can have a mixture of host code and device code. By default, any traditional C program is a CUDA program that contains only host code. One can add device code into any source file.

2、CUDA runtime 提供的内存分配和回收接口

cudaMalloc：内存申请函数，用于在设备上开辟一段空间（device global memory）。该函数与cudaMallocManaged 的区别在于后者分配的空间会使用 unified memory 进行自动调度。需要注意该函数的输入的指针类型是二级指针（void**），这样该接口就不会受限于特定数据类型类型的指针
cudaFree：使用 cudaMalloc 开辟完空间后需要使用函数对空间进行释放

float* A_d
int size=n * sizeof(float);
cudaMalloc((void**)&A_d, size); // A_d为指向device global memory的地址 
...
cudaFree(A_d); // 释放分配给A_d的device global memory并放回至available pool

cudaMemcpy：用于在主机和设备之间同步数据。第一个入参是目的地址，第二个入参为源地址

3、CUDA Kernel 的运行结构

SPMD 分布式设计模式：single-program multiple-date，指的是多个计算节点执行相同的程序，但是每个节点处理的数据不同。SPMD 模型通常用于并行计算，可以将大规模的数据集分成多个小块，由不同的计算节点进行并行处理
blockDim：属于 built-in variable，用来表示一个 block 的维度（可以是一维、二维或者三维，分别对应 blockDim.x，blockDim.y 和 blockDim.z）。考虑到 hardware efficiency，最好是 32 的倍数。CUDA 3.0 以上版本中，一个 block 中最多可以有 1024 个线程
举例：在一个 block 中进行 vector add 计算。需要注意两点

调用下面这个 vecAddKernel 后在设备侧会 launch 一个新的 grid of threads（因为使用了__global__，说明该函数由 CPU 调用，并行计算任务会被发射到 GPU 的任务调用单元；而__device__则不会引起任何新的 device threads 的 launching）
下面代码中没有 loop 操作，因为一个 grid 中每个线程的运算相当于原 loop 中的一次迭代，这种模式被称为 loop parallelism，即 original sequential code are executed by threads in parallel

__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	if (i < n) {
		C[i] = A[i] + B[i];
	}
}

4、CUDA Kernel 的调用（略）

5、编译：可以使用 nvcc，该编译器会把 host code 和 device code 分开，host code 通过标准 C/C++ compiler 编译，device code 通过 nvcc 编译转变为 PTX 文件（一种二进制文件）。这类 PTX 文件会被 nvcc runtime component 进一步被编译为 real object files 最终在 GPU 设备上运行

Chap3 Multidimensional grids and data

1、grid 和 block 的多维表示

整型向量类型 dim3：该三维向量中三维参数分别为 x，y，z，不使用的维度可以设置为 1。维度参数可以常量，也可以是变量，使用变量的好处是 grid 可以在不同情况下都拥有足够多的线程来 cover 向量的值

dim dimGrid1(32, 1, 1)
// 维度也可以是变量
dim dimGrid2(ceil(n/256.0), 1, 1)
dim dimBlock(128, 1, 1)
vecAddKernelz<<<dimGrid1, dimBlock>>>(...)

单个 block 中的线程数上限一般是 1024，因此 block 三个维度上的总线程数加起来不得超过 1024
grid 中的 block 在表示时，其 label 一般使用 (blockIdx.y, blockIdx.x) 这种格式，block 中线程的表示也是类似。即 highest dimension comes first

2、多维张量降维处理的原因（以二维向量为例）

编译要求：ANSI C 标准要求编译时二维向量的列的维度必须确定，但对于 dynamically allocated arrays 这种情况列的维度是无法提前确定的。因此开发者需要手动将二维向量线性化（linearize），即将 dynamically allocated 2D array 转换为 equivalent 1D array 后编译器才能处理
内存结构和访问效率：目前的内存地址也是一维线性的，cache 的设计也是根据一维地址设计的，访存指令也是根据一维地址设计的，因此降到一维后的数据内存连续，可以提高访存效率，缓存命中率高（多维向量需要多次 malloc 和释放，容易形成内存碎片）
便于使用向量化指令：CPU 使用的 AVX 指令集以及 Tensor Core 使用的 MMA（matrix multiply-accumulate）对内存布局都有严格要求，不连续的内存很难使用这些优化指令。并且 CUDA 编程的核心在于优化内存访问，内存不连续时访问效率很低

3、二维向量线性化的两种方式

Row-major layout：每一行中的所有元素放置在连续内存上，一行接着一行地放。C 编译器一般都使用这种线性化方式
Column-major layout：每一列中的所有元素放置在连续内存上。FORTRAN 编译器一般使用这种线性化方式

Chap4 Compute architecture and scheduling

GPU 架构

1、Streaming multiprocessors 与 Streaming processor（CUDA core）

2、较早的 NVIDIA GPU 架构使用 DDR SDRAM，Pascal architecture 后使用 HBM（high-bandwidth memory）或者 HMB2

Block scheduling

1、多个 block 有可能会被分配到同一个 SM 上，但同时被分配到 SM 上的 blocks 数目有一定限制（比如 A100 GPU 支持每个 SM 上最多分配 32 个 blocks，64 个 warps，2048 个线程）。一般来说 grid 中的 block 数量要比 GPU 上 SM 的数量要多，所以当分配到 SM 上的 block 完成运算后新的 block 会被立刻分配到 SM 上

2、同个 block 中的所有线程会被分配至同一个 SM 上，从而保证了 block 中的线程间交互效率。该过程涉及到线程屏障同步（barrier synchronization），以及对共享内存（shared memory）的访问

线程同步函数：__syncthreads()

1、作用：该函数用于块内线程的通信，即阻塞 block 直至 block 内的线程全都执行到运行该函数的这一行，但不对块间进行同步

2、注意事项：一个 block 中的所有线程必须执行同一个__syncthreads() 函数。因为每次该函数被调用都表示不同的 barrier synchronization points，如果在 if-else 语句中不同条件都调用了该函数则会出现死锁（deadlock）等问题

3、barrier synchronization 对一个 block 内线程执行的要求：

应该在相近的时间内执行完毕以免产生过长的线程间彼此等待的时间
每个线程必须能够获取足够的资源并最终都能到达 barrier，不然会造成死锁。这一点 CUDA runtime 会对其进行

4、Transparent Scalability：由于不允许不同 block 间的线程同步，所以一个 grid 中的所有 blocks 可以以任意顺序被执行。这种灵活性有利于 scalable implementation，对于计算资源较少的 low-cost system，同一时刻可以选择只执行小部分 blocks，对于计算资源较多的 higher-end implementation 则可以选择执行较多的 blocks（如今 GPU 可以同时执行上百个 blocks）。这种模式下，同样的应用程序代码可在不同硬件设备上（可用资源各不相同）执行，而不需要开发者根据环境不同而修改代码，该能力被称为 Transparent Scalability

线程束与 SIMD 模式

1、线程束的定义与划分：每个 block 中的线程会被分成一个个的线程束（NVIDIA GPU 架构中称为 warp，在 AMD 中则被称为 wavefront），一般一个 warp 包含有 32 个线程。如果 block 只有一维则 warp 的划分比较简单，二维 block 则会复杂一些：首先所有 threadIdx.y 为 0 的线程进行线性化，并根据 threadIdx.x 的值进行排列；接着再对 threadIdx.y 为 1 的线程进行排列，以此类推。三维 block 也类似，会先从 z 轴开始排列

2、SIMD 模式：SM 会根据 SIMD（single-instruction multiple-data，单指令多数据）模式来执行一个 warp 里的所有线程。SIMD 在编码层把多指令砍到单指令，在硬件层面上砍掉了单指令以外的所有指令硬件，包括取值，译码，发射等硬件和流程

SIMD 模式下指令的读取和分发：SM 上每 N 个处理器核（cores，也就是 SP）会形成一个 processing block，该 block 会共享 1 个 instruction fetch/dispatch unit（例如 NVIDIA Ampere A100 架构中有 64 个 cores 被分成了 4 个 processing blocks，每个 block 有 16 个 cores）。同个 warp 内的所有线程会被分配到同个 processing block 内，这些线程会获取到同一条单指令并同时执行该指令，这也是为何 warp 的 execution behavior 会被称为 SIMD
SIMD 模式的优点：降低了硬件控制成本（hardware manufacturing cost and power consumption）。因为 instruction fetch/dispatch unit 被多个执行单元所共享，所以硬件的任务主要就集中在提升运算吞吐量上而非指令控制上
SIMD 模式的缺点：类似于底层原语，里面的操作是不允许存在分支的，所有的操作都是作为一条指令存在。因此一个 warp 内的线程不能有不同的 control flow（例如 if-else）

SIMD模式下指令和数据处理（PU=Process Unit）

3、SIMT 模式（补充）：single-instruction multiple-thread，在 SIMT 中所有线程执行同一条指令但每个线程访问的是各自的数据。SIMT 模式的优势在于：

减少了指令预取带来的等待时间
内存不需要是连续的（SIMD 因为只是对指令的简单封装，仍然要求内存连续）
允许进行逻辑判断（SIMD 不允许有分支）

Control divergence

1、定义：一般情况下，一个 warp 内的不同线程在同一个时刻都要执行相同的指令，但当这些线程 follow different execution paths 的时候就会出现 control divergence 现象，这种现象会导致不同线程拥有不同行为以及不同的资源占用

2、Control divergence 出现的两种场景

1）场景 1：if-else，例如下图中 0~23 号线程走 if 分支，24~31 号线程走 else 分支

2）场景 2：for-loop，例如下图中每个线程都要执行 for 循环，但是不同线程在执行 for 循环过程中的迭代次数各不相同

3、Control divergence 出现的原因与应对措施

表面原因：handling boundary conditions when mapping threads to data。一般情况下线程数是 block 数量的倍数，但是需要运算的数据量则不一定是 block 数量的整数倍，所以个别 block 中的某些 warp（往往是 last warp）中就可能会出现线程去做其他运算的情况，从而发生 control divergence。不过随着处理数据量的增加，control divergence 带来的 performance impact 随之减少，因为发生 control divergence 的 warp 占所有 warps 的比重会越来越少
底层原因：Volta 架构后的 GPU 上，不同分支可以进行并行执行，这种运行模式被称为独立线程调度机制（independent thread scheduling），原因在于 Volta 架构后的 GPU 中，每一个线程可以有自己独立的程序计数器 PC 以及堆栈，这使得每个线程可以独立进行线程的调度（而之前的 GPU 中一个 warp 内的不同线程使用共同的程序计数器）
应对措施：开发者不应该想当然认为一个 warp 内所有线程的执行时间均相同，所以如果想要对这些线程进行同步的话需要调用 warp-level primitives 提供的 __syncwarp 函数来确保 warp 内线程的同步。CUDA9 之后引入了三类 warp-level primitives，分别是

Synchronized data exchange：用于在 warp 中的线程之间交换数据
Active mask query：返回一个 32 位掩码，指示 warp 中的哪些线程在当前执行线程中处于活动状态
Thread synchronization：同步 warp 中的线程，并提供内存隔离（memory fence）

Latency tolerance

1、产生原因：一般来说，每个 SM 只有足够的硬件来执行在任何时间点分配给 SM 的所有线程中的一部分。在早期的 GPU 设计中，每个 SM 在任何时刻只能为单个 warp 执行一个指令。在最近的设计中，每个 SM 可以在任何时间点执行少量 warp 的指令。如果一个 SM 在任何时候都只能执行一小部分线程，那为何要在 SM 中有分配超过其处理能力上限的 warps？这就涉及到 CUDA 对 long-latency operation 的 toleration

2、Latency tolerance 定义：当某个 warp 需要执行的指令被执行前，该 warp 正在等待之前启动的 long-latency operation 时，处理器就不会选择执行该 warp，相反则是执行另一个无需等待结果的 resident warp，从而用其他线程的执行来弥补当前 warp 的 latency time。这种操作被称为 latency tolerance 或者 latency hiding

3、Latency tolerance 优势：zero-overhead thread scheduling（GPU 的 SM 会在 registers 上保持所有 assigned warps 的状态，从而在执行 ready for execution 的线程时就可以避免 processing units 上出现 idle cycles 和等待时间的浪费。传统 CPU 在线程切换过程中会涉及到当前线程状态的的保存和将要执行线程状态的加载），这也是为何 GPU 无需太多芯片来 cache memories 而是用芯片来进行浮点计算；同时也无需用到 CPU 中的 branch prediction mechanisms（CPU 中的一种机制，可以保证大部分时间下流水线都处于满负荷状态，以及正在执行的线程以最高效率运行）

4、注意事项：SM 最好能被分配到足够多的线程来尽可能提高 warps 的执行效率（即 oversubscription of threads）。例如 A100 GPU 中一个 SM 上虽然只有 64 个 cores，但同一时刻可以分配 2048 个线程

5、其他使用场景：pipelined floating-point arithmetic；branch instructions 等

资源分配

1、Occupancy：指分配到 SM 上的 warp 数量与 SM 能够支持的最大 warp 数量的比值

2、动态资源分配：以 A100 GPU 为例，其支持每个 SM 上最多分配 32 个 blocks，64 个 warps，2048 个线程（32 * 64=2048），每个 block 中最多能包含 1024 个线程。如果一个 grid 中 block size 为 1024，那么每个 SM 上 2048 个 thread slots 只能容纳 2 个 blocks。如果 block size 为 512，256，128，64，那么这些 thread slots 就分别能容纳 4，8，16，32 个 blocks。这么做的好处是“can either execute many blocks each having few threads or execute few blocks each having many threads”

3、资源分配效率的影响因素

block slots 和 thread slots 的错配：如果一个 block 包含 32 个线程，那么 2048 个 thread slots 就需要 64 个 blocks，但是像 Volta 架构的 GPU 最大只支持 32 个 blocks，也就是说单次只能加载 1024 个线程，此时 occupancy 计算出来就只有 50%
每个 block 中的线程数无法被 block 所能容纳的最大线程数整除：比如 block size 为 768，那么每个 SM 上只能容纳 2 个 block，此时该 SM 上还有 512 thread slots 未被初始化
register 和 shared memory：每个 kernel 中的线程对 register 的需求并不一致，例如A100 GPU 允许每个 SM 最多使用 65536 个 registers，所以线程对 register 的需求会影响 occupancy（参见 Chap 5）

4、Performance Cliff（补充）：a slight increase in resource usage can result in significant reduction in parallelism and performance achieve，参见论文 Program Optimization space pruning for a multithreaded GPU

资源申请

1、Compute capability：即 GPU 的计算能力，例如 A100 的算力为 8，算力越高意味着 SM 上有更多的计算资源（The compute capabilities refer to specified sets of hardware features present on the different generations of NVIDIA GPU）

2、算力的获取

cudaGetDeviceCount：该 API 接口可以获取当前设备上支持 CUDA 的设备数
cudaGetDeviceProperties：该 API 接口可以获取特定 CUDA 设备的属性
cudaDeviceProp：CUDA 内置类，包含了许多设备信息。具体源码如下

/**
 * CUDA device properties
 */
struct __device_builtin__ cudaDeviceProp
{
    char         name[256];                  /**< 设备名称，比如1080Ti ASCII string identifying device */
    cudaUUID_t   uuid;                       /**< 16-byte unique identifier  */
    char         luid[8];                    /**< 8-byte locally unique identifier. Value is undefined on TCC and non-Windows platforms */
    unsigned int luidDeviceNodeMask;         /**< LUID device node mask. Value is undefined on TCC and non-Windows platforms */
    size_t       totalGlobalMem;             /**< 设备上全局内存的总量，单位是字节 Global memory available on device in bytes  */
    size_t       sharedMemPerBlock;          /**< 在一个线程块Block中可使用的最大共享内存数量* Shared memory available per block in bytes /
    int          regsPerBlock;               /**< 每个线程块中可用的32位寄存器数量 32-bit registers available per block */
    int          warpSize;                   /**< 在一个线程束Warp中包含的线程数量 Warp size in threads */
    size_t       memPitch;                   /**< 在内存复制中最大的修正量Pitch，单位为字节 Maximum pitch in bytes allowed by memory copies */
    int          maxThreadsPerBlock;         /**< 在一个线程块中可以包含的最大线程数量 Maximum number of threads per block */
    int          maxThreadsDim[3];           /**< 在多维线程数组中，每一维可以包含的最大线程数量 Maximum size of each dimension of a block */
    int          maxGridSize[3];             /**< 在一个线程格Grid，每一维可以包含的最大线程数量 Maximum size of each dimension of a grid */
    int          clockRate;                  /**< Clock frequency in kilohertz */
    size_t       totalConstMem;              /**< 常亮内存总量 Constant memory available on device in bytes */
    int          major;                      /**< 设备计算功能集的主板号 Major compute capability */
    int          minor;                      /**< 设备计算功能集的此版本号 Minor compute capability */
    size_t       textureAlignment;           /**< 设备的纹理对齐需求Alignment requirement for textures */
    size_t       texturePitchAlignment;      /**< Pitch alignment requirement for texture references bound to pitched memory */
    int          deviceOverlap;              /**< bool类型，表示设备是否可以同时执行一个cudamemery调用和一个核函数调用 Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */
    int          multiProcessorCount;        /**< 设备上多处理器的数量 Number of multiprocessors on device */
    int          kernelExecTimeoutEnabled;   /**< bool类型，表示该设备上的核函数是否存在运行时间限制 Specified whether there is a run time limit on kernels */
    int          integrated;                 /**< bool， 设备是否是一个集成GPU Device is integrated as opposed to discrete */
    int          canMapHostMemory;           /**< bool，表示设备是否将主机内存映射到CUDA设备地址空间 Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */
    int          computeMode;                /**< 设备的计算模式，默认（Default），独占（Exclusize），禁止（Prohibited）Compute mode (See ::cudaComputeMode) */
    int          maxTexture1D;               /**< 一维纹理的最大大小 Maximum 1D texture size */
    int          maxTexture1DMipmap;         /**< Maximum 1D mipmapped texture size */
    int          maxTexture1DLinear;         /**< Maximum size for 1D textures bound to linear memory */
    int          maxTexture2D[2];            /**< 二维纹理的最大维数Maximum 2D texture dimensions */
    int          maxTexture2DMipmap[2];      /**< Maximum 2D mipmapped texture dimensions */
    int          maxTexture2DLinear[3];      /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */
    int          maxTexture2DGather[2];      /**< Maximum 2D texture dimensions if texture gather operations have to be performed */
    int          maxTexture3D[3];            /**< 三维纹理的最大维数 Maximum 3D texture dimensions */
    int          maxTexture3DAlt[3];         /**< Maximum alternate 3D texture dimensions */
    int          maxTextureCubemap;          /**< Maximum Cubemap texture dimensions */
    int          maxTexture1DLayered[2];     /**< Maximum 1D layered texture dimensions */
    int          maxTexture2DLayered[3];     /**< Maximum 2D layered texture dimensions */
    int          maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */
    int          maxSurface1D;               /**< Maximum 1D surface size */
    int          maxSurface2D[2];            /**< Maximum 2D surface dimensions */
    int          maxSurface3D[3];            /**< Maximum 3D surface dimensions */
    int          maxSurface1DLayered[2];     /**< Maximum 1D layered surface dimensions */
    int          maxSurface2DLayered[3];     /**< Maximum 2D layered surface dimensions */
    int          maxSurfaceCubemap;          /**< Maximum Cubemap surface dimensions */
    int          maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */
    size_t       surfaceAlignment;           /**< Alignment requirements for surfaces */
    int          concurrentKernels;          /**< bool，表示设备是否支持在同一个上下文中同时执行多个核函数 Device can possibly execute multiple kernels concurrently */
    int          ECCEnabled;                 /**< Device has ECC support enabled */
    int          pciBusID;                   /**< PCI bus ID of the device */
    int          pciDeviceID;                /**< PCI device ID of the device */
    int          pciDomainID;                /**< PCI domain ID of the device */
    int          tccDriver;                  /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */
    int          asyncEngineCount;           /**< Number of asynchronous engines */
    int          unifiedAddressing;          /**< Device shares a unified address space with the host */
    int          memoryClockRate;            /**< Peak memory clock frequency in kilohertz */
    int          memoryBusWidth;             /**< Global memory bus width in bits */
    int          l2CacheSize;                /**< Size of L2 cache in bytes */
    int          maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */
    int          streamPrioritiesSupported;  /**< Device supports stream priorities */
    int          globalL1CacheSupported;     /**< Device supports caching globals in L1 */
    int          localL1CacheSupported;      /**< Device supports caching locals in L1 */
    size_t       sharedMemPerMultiprocessor; /**< Shared memory available per multiprocessor in bytes */
    int          regsPerMultiprocessor;      /**< 32-bit registers available per multiprocessor */
    int          managedMemory;              /**< Device supports allocating managed memory on this system */
    int          isMultiGpuBoard;            /**< Device is on a multi-GPU board */
    int          multiGpuBoardGroupID;       /**< Unique identifier for a group of devices on the same multi-GPU board */
    int          hostNativeAtomicSupported;  /**< Link between the device and the host supports native atomic operations */
    int          singleToDoublePrecisionPerfRatio; /**< Ratio of single precision performance (in floating-point operations per second) to double precision performance */
    int          pageableMemoryAccess;       /**< Device supports coherently accessing pageable memory without calling cudaHostRegister on it */
    int          concurrentManagedAccess;    /**< Device can coherently access managed memory concurrently with the CPU */
    int          computePreemptionSupported; /**< Device supports Compute Preemption */
    int          canUseHostPointerForRegisteredMem; /**< Device can access host registered memory at the same virtual address as the CPU */
    int          cooperativeLaunch;          /**< Device supports launching cooperative kernels via ::cudaLaunchCooperativeKernel */
    int          cooperativeMultiDeviceLaunch; /**< Device can participate in cooperative kernels launched via ::cudaLaunchCooperativeKernelMultiDevice */
    size_t       sharedMemPerBlockOptin;     /**< Per device maximum shared memory per block usable by special opt in */
    int          pageableMemoryAccessUsesHostPageTables; /**< Device accesses pageable memory via the host's page tables */
    int          directManagedMemAccessFromHost; /**< Host can directly access managed memory on the device without migration. */
};

Chap5 Memory architecture and data locality

访存效率

1、FLOPS 与访存效率：Floating-point Operations Per Second，即每秒所能够进行的浮点运算数目，也被称为运算强度（arithmetic intensity）。以 A100 GPU 为例，其 global memory bandwidth 峰值为 1555 GB/s，而单精度浮点数矩阵乘算子的 CGMA（compute to global memory access ratio，用来评估单次全局内存访问后进行的浮点数计算的次数）为 0.25 FLOP/B，所以算子运算时的吞吐量为 389 GFLOPS（1555 * 0.25），它只占 A100 GPU 单精度计算吞吐量峰值的 2%（峰值为 19500 GFLOPS，如果使用 tensor cores 矩阵乘加速算子则峰值可达 156000 GFLOPS）。这种情况下矩阵乘算子的执行速度会被 memory bandwidth 严重影响。这也就是常说的 memory bound 问题，如果想要完全利用 A100 GPU 的算力，则 CGMA 的比值至少要是 12.5 FLOP/B，也就是说每获取到 4 字节的浮点数就要执行 50 次的浮点数运算

2、Roof-line Model：模型在一个计算平台的限制下，到底能达到多快的浮点计算速度。更具体的来说，Roof-line Model 解决的是“计算量为 A 且访存量为 B 的模型在算力为 C 且带宽为 D 的计算平台所能达到的理论性能上限 E 是多少”这个问题。带宽决定了斜率（红线），算力决定了性能上限（绿线），在到达性能上限前计算性能处于 memory bound 区域（浅红色区域），达到性能上限后计算速性能则处于 compute bound 区域（浅绿色区域）

CUDA 内存类型

1、几种不同类型的 memory

Global memory & constant memory：off-chip，线程共享，两者均可从 host 侧进行读写操作，前者还可以从 device 侧进行读写操作，但是后者只支持 device 侧的读操作。设立 constant memory 的主要作用是为了解决一个 warp 内多线程在访问相同数据时速度太慢的问题（其通过 cache 产生多个数据副本避免了 thread 访问冲突，从而提高并行度）
Local memory：off-chip，线程独享。位于 global memory 内，主要针对 register 不足时数据的存放（包括 statically allocated array，spilled register 以及线程调用栈中的数据），访存时延与 global memory 类似都比较慢
register & shared memory：两者均是 on-chip 类型的内存，register 线程独享，用于存放经常要访问的数据；shared memory 则被一个 block 内的所有线程共享，用于存放经常需要被线程共享的数据

2、为何 operand（操作数）要尽量放在 register 上

指令数量更少：each access to registers involves fewer instructions than an access to the global memory
获取 global memory 需要额外的开销：if an operand value is in the global memory, the processor needs to perform a memory load operation to make the operand value available to the ALU
能源消耗更少：the energy that is consumed for accessing a value from the register file is at least an order of magnitude lower than for accessing a value from the global memory

3、CUDA 中的变量声明、存储位置与生命周期

Tiling for Reduced Memory Traffic

1、Tiling 的定义与例子：global memory 空间大但是读取慢，而 shared memory 空间小但是读取快。因此通用的策略是将数据分成叫做 tiles 的子集，这样每一个 tile 可以 fits into shared memory。这些 tiles 上的 kernel computation 可以相互独立执行。下图中用到了 2 * 2 的 block 来计算 P 矩阵，比如 block(0,0) 中的 4 个线程分别计算 P(0,0)、P(0,1)、P(1,0) 和 P(1,1)。

2、global memory 中重复元素的访问：从左到右是 block (0,0) 中 4 个线程的执行顺序，注意不同线程在获取元素时会出现获取重复元素的现象（比如红色方框和蓝色方框中的元素）。因此如果线程间能够协作减少重复元素的访问那么就能降低 global memory traffic

3、矩阵乘优化方式：进行线程间协作，通过加载 M 与 N 矩阵的元素子集按照 block 的维度分成较小的 tiles（2 * 2）并加载到 shard memory 中，这样可以直接从 shared memory 中获取重复元素计算 dot product。注意 Mds 和 Nds 在不同阶段可以被重复利用，减少了对 shared memory 大小的需求（allows a much smaller shared memory to serve most of the accesses to global memory）。这种每个阶段都只关注并计算输入矩阵的一小部分的做法被称为 locality

4、矩阵乘示例代码

__global__ void MatrixMulKernel(
	float *d_M, float* d_N, float* d_P, int Width){
	__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
	__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

	int bx = blockIdx.x; int by = blockIdx.y;
	int tx = threadIdx.x; int ty = threadIdx.y;
	// Identify the row and column of the d_p element to work on
	int Row = by * TILE_WIDTH + ty;
	int Col = bx * TILE_WIDTH + tx;
	float Pvalue = 0;
	for(int ph = 0; ph < Width / TILE_WIDTH; ph++) {
		Mds[ty][tx] = d_M[Row * WIDTH + ph * TILE_WIDTH + tx];
		Nds[ty][tx] = d_N[(ty + ph * TILE_WIDTH) * WIDTH + Col];
		__syncthreads();
	
		for(int k = 0; k < TILE_WIDTH; k++)
		{
			Pvalue += Mds[ty][k] * Nds[k][tx];
		}
		__syncthreads();
	}
	d_P[Row * Width + Col] = Pvalue;
}

1）代码中的注意事项

tx，ty，bx，by 都是 automatic scalar variables，所以 runtime system 会为每个线程都创建对应的这些变量，并且将它们保存在 registers 中，生存期与对应的线程一致
每个线程负责计算 P 矩阵中的一个元素
外层循环代表 phase 的遍历，变量 ph 表示当前已经做过 dot product 的 phase 个数；内层循环代表矩阵 P 中每个元素的计算（累加）
第一个 syncthreads 的作用：read-after-write dependence，保证 block 中的所有线程将对应的元素加载到了 shared memory 之后（写操作）再进行点积运算（读操作）
第二个 syncthreads 的作用：write-after-read dependence，保证在加载下一个 block 之前（再次写操作） shared memory 中的数据已完成点积运算（读操作）
外层循环加上内层循环这一过程被称为 strip mining

2）CPU 与 GPU 中 tiling 的原理与区别：

CPU 中使用 CPU cache 来保证 reused data 隐式处于 on-chip 的状态，从而在特定时间窗口中 CPU 线程可以再 cache 中找到 reused data
GPU 则将这类数据显式地保存到了 shared memory 上
两者 tiling 做法不同的原因：CPU Core 同一时刻只运行 1-2 个线程，所以只需要用 cache 来保存数据即可，而 GPU SM 同一时刻运行大量的线程来隐藏访存时延，这些线程会竞争 cache slots 从而带来风险，因此需要使用 shared memory 来保存会被重复使用的数据

Boundary checks

1、目的：扩展矩阵乘法算子到任意矩阵宽度

2、示例代码（两个 input matrix 的维度假设为 3 * 3）

for(int ph = 0; ph < ceil(Width / (float)TILE_WIDTH; ++ph)
{
	if((ROW < Width) && (ph * TILE_WIDTH + tx) < Width)
		Mds[ty][tx] = M[Row * Width + ph * TILE_WIDTH + tx];
	else Mds[ty][tx] = 0.0f;
	
	if((ph * TILE_WIDTH + ty) < Width && Col < Width)
		Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
	else Mds[ty][tx] = 0.0f;
	 __syncthreads();
	 
	for(int k = 0; k < TILE_WIDTH; ++k) {
		Pvalue += Mds[ty][k] * Nds[k][tx];
	}
	__syncthreads();
}
if((Row < Width) && (Col < Width) 
	P[Row * Width + Col] = Pvalue;

3、通用矩阵乘法算子（假设 input matrix 的维度分别为 J * K 与 K * L）：需要修改两个 if 判断条件变为以下两条判断语句

// 原来是if((ROW < Width) && (ph * TILE_WIDTH + tx) < Width)
M: Col < J && (ph * TILE_WIDTH + tx) < K
// 原来是((ph * TILE_WIDTH + ty) < Width && Col < Width)
N: (ph * TILE_WIDTH + ty) < K && Col < L

Memory usage 对 occupancy 的影响

1、每个线程占用内存大小早餐的影响：the more resources each thread requires, the fewer the number of threads that can reside in each SM。以 A100 GPU 为例，每个 SM 对应的 shared memory 为 164 KB，最多支持 2048 个线程。因此对于所有 2048 个线程来说平均每个线程能够占用的 shared memory 为 82 B/thread（164/2048）。假设一个 kernel 中的 block 由 256 个线程组成，需要占用 32 KB 的 shared memory，则每个线程需要用到 132B 大小的 shared memory。但由于每个 SM 对应的 shared memory 可使用上限为 164 KB，一个 block 上最多能容纳 1272 个线程（164KB/132B=1272 threads），此时该 kernel 的 occupancy 只有 62%（1272/2048）。A100 GPU 参数另见 NVIDIA 官方博客Tuning CUDA Applications for NVIDIA Ampere GPU Architecture中第1.4.1.1.节）

2、不同设备的 shared memory 各不相同：host code 最好可以动态决定 shared memory 的大小从而能够按需调整分配给 kernel 的 shared memory 数量。实现这一目标可以使用 cudaGetDeviceProperties 函数，假设传入该函数的变量为&devProp，则返回后的 devProp 中就包含有 sharedMemPerBlock 属性，即每个 SM 上可用的 shared memory

3、修改代码实现动态内存分配：使用 extern 关键字来创建 Md 和 Nd 这两块 shared memory

#define TILE_WIDTH 16
__global__ void MatrixMulKernel(
	float *d_M, float* d_N, float* d_P, int Width, 
	unsigned Mdz_sz, unsigned Ndz_sz) {
	
	extern __shared__ char float Mds_Nds[];
	float *Mds = (float *)Mds_Nds
	float *Nds = (float *)Mds_Nds + Mds_sz
}
// 对MatrixMulKernel的调用如下
size_t = calculate_appropriate_SM_usage(devProp.sharedMemPerBlock,...)
MatrixMulKernel<<<dimGrid, dimBlock, size>>>(Md,Nd,Pd,Width,size/2,size/2)

Chap6 Performance considerations

Memory coalescing

1、作用：moving data between global memory and shared memories or registers in an efficient manner，一般与 tiling technique 一起从而充分利用 global memory 的访存带宽

2、DRAM 中数据获取原理：global memory 位于 DRAM cell 上，从 DRAM cell 中读取数据时，晶体管会通过微量电荷的充放电来驱动传感器的电容线进入高位（highly capacitive line），此后传感器便开始检查晶体管中的电荷数量应该被识别为数字“1”还是“0”。该过程要耗费十几纳秒，与数据获取速度（大约是 subnanosecond per byte）相比这一速度是非常慢的，因此现代 DRAM 的设计中使用了并行方法来增加数据读取速率（即 memory access throughput）

3、DRAM bursts：DRAM 在当收到一个读请求和地址后，会连续取出这个地址周围几个连续地址上的数据（具体取几个数据则被称为 Burst Length），这些连续地址上数据的读取就成 DRAM bursts。相比于从随机序列地址上读取数据，如果一个应用程序聚焦在这些 bursts 上的数据读取，那么速度就会快得多

4、Coalescing 的原理：一个 warp 中的线程在任意时刻执行的是同一条指令，例如当所有线程执行 load 指令时，硬件会检测这些指令访问的是否是连续的内存地址。换句话说，the most favorable access pattern 就是一个 warp 中的所有线程访问的是连续的 global memory locations。这种情况下硬件就会 coalesce 这些 accesses（例如当线程 0 访问 N 位置，线程 1 则访问 N+1，线程 2 则访问 N+2，以此类推），最终 DRAM 通过 burst 的方式来 deliver data

5、Row-major layout 和 Column-major layout 在数据读取上的区别（假设矩阵 M 是矩阵乘法过程中的第二个入参，第一个为矩阵 N）

使用 row-major layout 来存储矩阵时，每次迭代不同线程读取的是每一行对应的所有数据，由于这些数据保存在连续地址上，所以属于 coalesced memory accesses；
使用 column-major layout 来存储矩阵时，由于矩阵 M 是按照列优先保持，所以每次迭代不同线程读取的数据并不连续（相邻线程读取的数据所在地址相差一个 Width，在实际矩阵乘运算中 Width 的值可能是几百甚至高达几千），所以矩阵 M 中数据的获取无法在硬件中获得 coalescing
无法读取连续内存数据的优化方式：重新调整线程与数据之间的映射关系；重新调整数据存储结构；在 global memory 和 shared memory 之间以支持 coalescing 的方式传递数据，而当 shared memory 上进行不支持 coalescing access 的运算（例如 corner turning）

row-major layout

row-major layout代码

column-major layout

column-major layout代码

6、Corner turning：通过将线程访存模式从水平访问转换为垂直访问来保证数据从 global memory 读到 shared memory 的过程中处于 coalesced 的状态（即改变下图 B 矩阵中 block 所覆盖数据的读取方式）。至于 shared memory 中以何种形式保存数据并不重要，因为 shared memory 使用了 SRAM 技术，数据读取不要求 coalescing（GPU 的片上内存都是 SRAM）

7、Carpooling 的比喻：Memory coalescing 与 carpooling 很像，数据就像通勤者，DRAM access requests 就像交通工具，如果多个线程同时从一个 DRAM 地址上访问数据，那么它们就形成了 carpool——把多个 accesses 合到一个 DRAM 请求中，不过这么做就要求多个这些线程需要有类似的 execution schedule，比如同个 warp 中的线程就比较适用（通过 SIMD execution 同时执行 load 指令）

Hiding memory latency

1、DRAM bursts 存在的问题与解决方法：仅仅使用 DRAM bursts 技术不足以满足现代处理器对 DRAM access bandwidth 的要求，因此额外使用了两种 parallel organizations——banks 和 channel

channel 与 bank：channel 是一种内存和 CPU 上的内存控制器，通过总线来和计算机的其余部分（比如 bank）进行通信。在内存和内存控制器之间增加通信通道，可以加快数据传输速率。例如下图中一个处理器上包含 4 个 channels，每个 channel 通过总线与 4 个 DRAM banks 相连。实际场景中 channels 的数量通常从 1 到 8 不等，bank 的数量则要远多于 4 个。每个 bank 中都有一组 DRAM cells、用来感知电荷数量的 sensing amplifiers 以及将数据传输到 bus 上的接口。

bus 的 data transfer bandwidth：由 bus 的 width 和 clock frequency 决定。现代 double data rate (DDR) busses 在每个 clock cycle 内进行两次 data transfer。例如 64 bit、1 GHz clock frequency 的 DDR bus 的带宽就是 8 B * 2* 1 GHz=16 GB/s，而一般 CPU 要求 32 GB/，GPU 则是 256 GB/s

3）访存时延以及影响：每次读取 DRAM cell 上的数据时都需要进行电荷充电并将电荷量 share 给 sensing amplifier，只有当 sensing amplifier 完成检测工作后 burst data 才会被发送至 bus 上。相比于传输 burst data 的耗时，这一过程的时延要长许多（访存时延真正耗时的地方）。这也是为何需要多个 bank 的原因，因为可以通过不同 bank 间数据的交替发送来掩盖访存时延（下图中灰色长条代表访存耗时，黑色为发送耗时）

如何确定 banks 的数量：一般来说当 cell array access latency 和 data transfer time 两者的比值为 R 时，如果想要充分 bus 的数据传输带宽，那么至少需要 R+1 个 banks。原因有二：

更多的 bank 能减少 bank conflict 的概率，也就是 multiple simultaneous accesses targeting the same bank，更多的 bank 可以增加 data accesses 分配至多个 banks 上的概率
cell array 的大小存在 latency&manufacturability 的 trade-off，这就限制了每个 bank 上 cells 的数量

2、Interleaved data distribution：将数据分配到不同 banks 和 channels 上的一种方式

3、Parallel execution of the threads 和 parallel structure of the DRAM system 之间的共生关系（symbiotic relationship）：

有效利用 DRAM access bandwidth 要求尽可能多的线程同时访问数据
有效提升 execution throughput 要求尽可能高效地利用 DRAM 的 parallel structure，也就是 banks 和 channels，因为如果多个线程同时访问同个 channel 上的数据，那么 access throughput 和 device execution speed 就会大打折扣

Thread coarsening

1、以 finest granularity 进行跨线程并行操作的优缺点：

1）优点：增强了 transparent scalabiliy：如果硬件拥有足够的资源来 perform all the work in parallel，那么应用程序就暴露了足够的并行性来充分利用硬件。否则，如果硬件没有足够的资源来并行执行所有工作，则可以通过 serialize 的方式执行不同 blocks

2）缺点：并行化可能需要付出一些代价，例如不同 blocks 会导致数据的 redundant loading、redundant work、synchronization overhead 等。当线程由硬件并行执行时，这种并行化的代价往往是值得付出的。然而当资源不足时就会导致 blocks serializing，此时代价的付出就没有必要了。在这种情况下最好 partially serialize the work 来减少为并行化付出的代价。这可以通过为每个线程分配多个工作单元（units of work）来实现，这种做法被称为 thread coarsening（线程粗化）

2、thread coarsening 举例：下图展示了输出矩阵中两个 horizontally adjacent output tiles 的访存模式。对于这些 output tiles 会发现需要加载来自矩阵 N 的不同 input tiles，但加载是同一个来自矩阵 M 的 input tile。由于 shared memory 无法被不同 blocks 所共享，所以每个 block 必须加载其自身的 copy of the input tiles of matrix M。为了使用不同 blocks 来并行计算 output tiles，这种 data redundant 代价是必须付出的；但如果这些 blocks 被 serialized 了那么这个代价是没必要的。这种情况下最好是“a single thread block process the two output tiles, whereby each thread in the block processes two output elements”

3、注意事项：

不要在不必要时使用这种方法。并非所有 workloads 都要为并行计算付出诸如 redundant loading、redundant work、synchronization overhead 的代价，例如第二章中提到的向量加运算并不涉及并行处理不同元素（即向量中的每个元素只参与一次运算而不会同时参与多个运算）
不要过度使用这种方法以至于降低硬件资源的利用效率。从 transparent scalability 的角度来看 thread coarsening 会 reduce the amount of parallelism that is exposed to the hardware，特别是在 coarsening factor 设置过高的时候会导致 parallel execution resources being unutilized。最佳 coarsening factor 应该按照 device-specific&dataset-specific 的角度去设置
避免过度增加 resource consumption 而影响 occupancy。在 kernel 中使用 thread coarsening 可能会导致每个线程需要使用更多的 register 或每个 block 使用更多的 shared memory。Occupancy 降低带来的 performance penalty 可能比 coarsening 所带来的 performance benefit 影响更大

A checklist of optimizations

1、Maximizing occupancy：优化方法有 resource usage of their kernels tuning；shared memory tuning 等

2、Enabling coalesced global memory accesses：前面提到的矩阵乘在计算过程中的访存模式本身就属于 coalesced accesses，而更多情况下的访存模式往往是不规则的（解决方法之一就是前面提到的 Corner Turning，其他几种 patterns 包括 merge、sorting、reduction、minimizing divergence 等方法）

3、Minimizing control divergence：优化方法有 Reduction and Minimizing Divergence（rearrange 数据的分布以保证在调用下一个 warp 之前当前 warp 内的线程都被调用）；Prefix Sum；Graph Traversal；Sparse Matrix Computation

4、Tiling of reused data：Chap5 Memory architecture and data locality；还有一些 input 和 output 维度不同的情况例如 Convolution 和 Stencil

5、Privatization：用来解决多个线程同时写入某一个位置的内存的冲突。每个线程将得到的结果私有化，放在自己的 local memory 中，然后再将多个线程得到的私有化结果合并来得到最终结果。具体算法有 Parallel Histogram 等

6、Thread coarsening

7、其他优化方法可参见论文Optimization Techniques for GPU Programming

Chap7 Convolution

Parallel convolution: a basic algorithm

1、convolution kernel 的实现。如下图所示 (N 为指向 input array；F 指向 filter；P 指向 output array；r 指 filter 的半径，width 为 input 和 output arrays 的宽；height 为高)

2、上述代码中存在的问题

Control flow divergence：当 2D 卷积进行矩阵边缘某些元素的计算时会因为 boundary condition 而不会进入上述代码中的 if 判断条件。一般情况是： large input arrays&small filters 下出现计算少部分 output elements 引发 control divergence，而一般图片尺寸与 filter 相比都比较大，所以其影响不会特别严重
无法达到计算吞吐量峰值：在上述代码段第十行，每加载 8 个字节（两个 float 值）只进行了 2 次运算（乘法与加法），所以 FLOPS 只有 0.25 OP/B，这个计算速度根本无法达到峰值。具体原理可参考 Chap5 Memory architecture and data locality 中的 FLOPS 与访存效率

Constant memory and caching

1、2D 卷积的特点与优化

特点：filter size 通常很小；filter 的 contents 固定；所有线程都会访问 filter elements
优化思路：使用 constant memory（对 block 内的所有 threads 均可见）存储 filter 以避免 thread 访问冲突；
实现：假设 host code 已经对 filter 的初始化，则 kernel 只需将 host memory 上的 filter 直接拷贝至 device constant memory 上即可

在 kernel 外部使用 __constant__ float 申明 filter array，类似 C 语言中的 external declaration
使用 cudaMemcpyToSymble 函数将 host memory 上的 filter 直接拷贝至 device constant memory 上（该函数会将数据从 CPU 拷贝到常量内存中）
注意：constant memory 申明后无需作为变量传入 kernel 中，因为 kernel 会像访问 global memory 一样访问 constant memory

2、constant memory 的原理

Constant memory 对于 device 来说只读但是对于 host 是可读可写。Constant memory 和 global memory 一样都位于 DRAM 上。现代处理器为了尽量避免 memory bottleneck 会使用 on-chip cache memories 来保存常用数据，以减少直接访问 DRAM 的频率
由于 the indices for accessing F 与 thread indices 无关（这点从代码中也能看出），所以 constant caches 可以提供非常可观的访问带宽；并且 filter 因为比较小所以每次都可以放在 constant cache 中被获取，这样就不会占用 DRAM bandwidth
constant memory 的获取方式不同于其它的 GPU 内存，对于 constant memory 来说，最佳获取方式是 warp 中的 32 个 threads 获取 constant memory 中的同一个地址。如果获取的地址不同的话，只能串行地服务这些获取请求了

Tiled convolution with halo cells

1、Tiled convolution algorithms 原理

Input tile 指计算 output tile（指针 P）所需的 input elements，下图中 input tile（指针 N）为左侧矩阵中的蓝色区域，output tile 为右侧矩阵中的绿色区域
Tiled convolution algorithms 原理简单来说就是一个 block 中的所有 threads 首先集体将 input tile 中的元素加载到 shared memory 上，然后通过获取 shared memory 上的值来计算 output tile。需要注意在每个 dimension 上 input tile 的边长都要比 output tile 的边长要长

2、从上图中可以看出，卷积算子和第五章中使用tiling进行矩阵乘计算的不同在于前者的 input tile size 和 output tile size 是一样的，而卷积算子则不同，所以有两种计算方式：一种是使用与 input tile size 相同的 block，另一种是使用与 output tile size 一样的 block。两者各有利弊。下面以第一种方式为例来说明如何实现 tiled convolution algorithms

代码第 10 行和第 20 行：如果取到矩阵大小外的元素（ghost cell）则将 input tile 上对应位置的值设置为 0（col 和 row 计算出来有可能是负值，也有可能超过矩阵的 height 和 width）
代码第 15 行：每个线程完成对矩阵 N 中对应元素的加载后进行 barrier synchronization 来保证整个 input tile 都已经加载到 shared memory 中
代码第 17-18 行：output tile elements 的位置表示。减去 filter 半径是因为 input tile size 亚比 output tile size 大一圈，所以 threadIdx 并不是一一对应的关系
代码第 24-28 行：通过 filter 遍历 patch 来计算 output element

上面这段代码对应的运算强度计算如下。假设 filter 边长为 5，input tile 为 32，output tile 则为 28，则计算出来的 arithmetic-to-global memory access ratio 大约是 9.57 OP/B

不同 input tile size、filter size 下的 arithmetic-to-global memory access ratio，其中 bound 为通过(2FILTER_RADIUS+1)^2 * 2/4 计算出的 ratio upper bound。可以看出真实情况下的 ratio 基本上无法达到 upper bound，而且 small block&tile sizes 所能达到的 ratio 越小

Tiled convolution using caches for halo cells

1、在计算某个 input tile 的卷积输出时，该 tile 计算涉及到的 halo cells 可能已经被其他 tile 加载到了 L2 cache 中，这时该 tile 就不必再去 DRAM 里取这些 halo cells（即不必加载到 N_ds 中）。基于这种思路，代码如下：

Chap14 Sparse matrix computation

1、为何 compute and store the inverse matrix in solving real-world problems 是不实际的

许多 sparse matrices 的 size 会很大，不利于求逆矩阵
逆矩阵中往往会包含许多额外的非零值（fill-ins）
求解线性方程组的常见算法：共轭梯度法

2、the structure of a sparse matrix storage format

Space efficiency (or compaction): the amount of memory capacity that is required to represent the matrix using the storage format
Flexibility: the extent to which the storage format makes it easy to modify the matrix by adding or removing nonzeros Accessibility: the kinds of data that the storage format makes it easy to access Memory access efficiency: the extent to which the storage format enables an efficient memory access pattern for a particular computation (one facet of regularization)
Load balance: the extent to which the storage format balances the load across different threads for a particular computation (another facet of regularization)

A simple SpMV kernel with the COO format

1、COO format：使用额外的 rowIdx 和 colIdx 存储非零值的位置。其优势在于对这个表进行重排序也不会影响最终结果的计算：

Flexibility：从文件中读取数据时，非零数据出现的顺序并不是确定的（a file that does not provide the nonzeros in a particular order），使用 COO 是一个合适的选择；另外每次非零数据插入到这个表中时很简单，只需要加到末尾就可以
Accessibility：给定一个非零值，获取对应的行列比较容易；而给定某个行或列去获取对应行或列中所有的非零值则不太容易
缺点：需要原子操作（the same output value is updated by multiple threads），但是该操作可以避免，只要让每一行都让同个线程来获取和操作

代码如下。注意第七行中的加法为原子操作，以防多个线程同时操作输出向量的同一行的元素

2、CSR format (compressed sparse row)：非零值以行为单位进行归类

优点：比 COO 占用的空间更小（rowPtrs 的数量等于行数）；
缺点：往 CSR 中加入新的非零值比较麻烦（无法直接插入末尾，而是要插入对应的行中）；会产生比较严重的 control flow divergence（因为一个线程要执行的次数依赖于每一行中的非零值个数）；给定某个列，获取该列中所有的非零元素也比较麻烦
代码

Improving memory coalescing with the ELL format

1、基于 CSR format，对每一行中为零的值所在增加 padding element

2、ELL format 的线程便利过程

3、代码与注意事项

第 7 行和第 8 行所有线程进行了 coalesced memory access，因为所有元素是以 column-major order 排列的，adjacent threads 获取到的即为 adjacent memory locations（有些 GPU 架构对于 coalesced memory access 有着更为严格的 address alignment rules，所以在设计 ELL SPMV kernel 的时候可能需要调整来让每次迭代读取的是 specified alignment units，比如 64 个字节）
ELL 因为进行了 padding 操作所以 space efficiency 比 CSR 要差，特别如果小部分行中只有个别是非零值
往 ELL 中加入元素比 CSR 要更灵活（只要将 padding 的值替换为新加的值即可）
ELL kernel 仍存在 control divergence 的问题

Regulating padding with the hybrid ELL-COO format

1、ELL-COO 的目的：解决 ELL format 中的 low space efficiency & control divergence

2、ELL-COO 的原理和注意点：将 ELL 中的有较多非零值的行中的部分元素放置到一个 COO 中，这样在 ELL 中 padded elements 就不会太多

Educing control divergence with the JDS format

1、JDS：jagged diagonal storage format，原理是对整个矩阵中的 rows 做一个排序（降序），使得比较高负载的线程（有较多非零元素的行）都集中在同一个 thread block 内，使得每个线程不需要彼此等待太长时间，降低了计算核心的空闲率

2、JDS 的另一种格式：当 rows 被 sorted 之后，再将这些 rows 分割为不同 sections，确保每个 section 中不同 rows 包含的非零值大致相同。然后再对这些 sections 进行 ELL format 处理（相比将矩阵直接进行 ELL 这样能大幅减少 padding elements 的数量）

3、JDS 的优缺点

比 ELL 有更好的 space efficiency
Flexibility 不太行，因为添加非零元素可能会造成 rows 的重排序
JDS 允许 coalesced manner 获取矩阵非零值
effective at reducing control divergence