
The Nvidia Way: Book Excerpts

1. Nvidia's greatest enemy was never external; it was Nvidia itself:

We launched into a wide-ranging discussion of the company's history. Jensen knows that many of his former employees look back on Nvidia's beginnings with nostalgia. But he resists overly positive accounts of Nvidia's start-up period—and his own missteps.

"When we were younger, Tae, we sucked at a lot of things. Nvidia wasn't a great company on day one. We made it great over thirty-one years. It didn't come out great," he said. "You didn't build NV1 because you were great. You didn't build NV2 because you were great," he said, referring to the company's first two chip designs, both of which were flops that nearly killed Nvidia. "We survived ourselves. We were our own worst enemy."

There were several more near-death experiences. But each time, amid the stress and the pressure, the company learned from its mistakes. It retained a core of die-hard employees, many of whom remain in the fold to this day.

2. The most critical factor in Nvidia's success — its unique organizational design and work culture: Through these int...

Programming Massively Parallel Processors: Key Excerpts from Selected Chapters

Chap2 Heterogeneous data parallel computing

1. Structure of a CUDA C program: The structure of a CUDA C program reflects the coexistence of a host (CPU) and one or more devices (GPUs) in the computer. Each CUDA C source file can have a mixture of host code and device code. By default, any traditional C program is a CUDA program that contains only host code. One can add device code into any source file.

2. Memory allocation and deallocation interfaces provided by the CUDA runtime:

cudaMalloc: allocates a region of device global memory. It differs from cudaMallocManaged in that the latter allocates memory that is automatically migrated via unified memory. Note that the pointer argument is a double pointer (void**), so the interface is not restricted to pointers of any particular data type.

cudaFree: memory allocated with cudaMalloc must be released with this function once it is no longer needed.

float* A_d;
int size = n * sizeof(float);
cudaMalloc((void**)&A_d, size);  // A_d now points to device global memory
...
cudaFree(A_d);  // release A_d's device global memory back to the available pool

cudaMemcpy: copies data between host and device. The first argument is the destination address; the second is the source address.

3. Execution structure of a CUDA kernel — the SPMD distributed design pattern: SPMD (single-program multiple-data) means that multiple compute units execute the same program, but each unit processes different data. The SPMD model is widely used in parallel computing: a large data set can be split into chunks that are processed in parallel by different compute units. blockDi...
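The allocate/copy/launch/free pattern and the SPMD kernel structure described above can be put together in a minimal vector-add sketch. This follows the textbook's general shape, but the names (vecAddKernel, vecAdd) and the 256-thread block size are illustrative choices, not taken from the excerpt:

```cuda
#include <cuda_runtime.h>

// SPMD: every thread executes this same kernel, but each computes
// a different element of C, selected by its unique global index.
__global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {            // guard: the grid may have more threads than n
        C[i] = A[i] + B[i];
    }
}

void vecAdd(const float* A_h, const float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Allocate device global memory (note the void** double pointer).
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);

    // Copy inputs host -> device (destination first, then source).
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    vecAddKernel<<<(n + 255) / 256, 256>>>(A_d, B_d, C_d, n);

    // Copy the result device -> host.
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);

    // Return the allocations to the available pool.
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
```

Note that cudaMemcpy's destination-first argument order mirrors C's memcpy, and the ceiling division (n + 255) / 256 is what makes the in-kernel bounds check necessary: the last block may contain threads with no element to process.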