Multithreading in C++11/14 – Part 11

General-Purpose computing on Graphics Processing Units (GPGPU) is a well-known technique to benefit from the huge parallelism of GPUs. An article about GPU computing may seem out of place in a series about multithreading. Yet, as we are about to see, GPUs are essentially multicore architectures, programmed with a multithreading model. This is why I thought it was important to introduce GPGPU in this series.

Before getting our hands dirty with actual code, we need to focus on the differences between CPUs and GPUs. So far in this series about multithreading, all the code that has been shown runs exclusively on CPUs. Everybody knows about the existence of GPUs; we find them in almost every desktop computer, laptop, netbook, smartphone, tablet, etc. But what are they, really? Why do we need them? How do they differ from CPUs?

GPGPU

GPU architectures

The following picture provides a first set of answers.

[Figure: High-level view of the differences between CPU and GPU architectures]

Let’s start with the similarities. Both CPUs and GPUs have a dedicated global memory, depicted above as main memory and graphic memory respectively. Both have several levels of caches, control units, arithmetic and logical units, etc.

The main differences are the size and complexity of the caches, and the number of cores. Basically, GPUs spend more transistors on cores, whereas CPUs spend more transistors on memory caches. More cores enable more operations per unit of time. Bigger caches mean fewer clock cycles wasted waiting for slow global memory accesses, and therefore faster completion of a given task.

The fundamental reason lies in the performance goal of these pieces of hardware. A CPU is designed for latency, the time taken to complete a single task. A GPU is designed for throughput, the number of operations completed in a given amount of time. For example, a CPU core might finish one task in a microsecond, while a GPU might need ten microseconds per task but run thousands of them in parallel. Latency and throughput are orthogonal, and both are essential to the performance of a system. Therefore, CPUs and GPUs have different roles to play in a computer architecture.

CPUs and GPUs can relate to one another in different ways. The most common configuration links the two devices with a PCI Express bus, as shown below.

[Figure: Overview of the typical connections between a CPU and a GPU]

This picture shows, in particular, that a CPU is more general-purpose than a GPU. A program that runs on a CPU can perform all sorts of activities: I/O, networking, computation, etc. The GPU, on the other hand, is only connected to its own graphic memory, to the CPU, and potentially to a display device. Indeed, a GPU is originally a specialized hardware accelerator… for graphics processing!

To get a better feel for the differences between a modern CPU and a modern GPU (at the time of writing), it is interesting to look at some actual figures, taken from the specifications of real hardware vendors.

[Figure: Typical figures on modern hardware at the time of writing]

These figures teach us that a typical CPU core is 3 to 4 times more powerful than a GPU core, but that a GPU contains orders of magnitude more cores than a CPU. Last but not least, they reveal a major bottleneck in this architecture: the communication between the CPU and the GPU is limited by the PCIe bus speed. This limitation needs to be taken into account when implementing high-performance code. Essentially, locality should be favored and memory transfers should be minimized.

The actual implementation of a GPU architecture depends strongly on the hardware vendor and the version of the device. I provide below a simplified version of a real GPU architecture, to give a good sense of what it involves.

The cores are simpler than those of a traditional CPU architecture (see part 9). Cores are grouped in Streaming Multiprocessors (SM). A typical SM contains several cores and a dedicated shared memory, used both by the whole group of threads and as each thread’s local memory. A Streaming Multiprocessor usually also contains an instruction cache and a multithreaded instruction unit.

A small portion of the cores is specialized in computations such as cosine, sine, and logarithm. They are called SFUs, for Special Function Units. Implemented in software, these operations are usually time-consuming because they are defined as mathematical series, which involve a loop. Besides, these kinds of operations are extremely common in graphics and scientific code. Therefore, SFUs further speed up GPUs compared to CPUs.
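As an illustration (a sketch of the idea in plain C++, not actual GPU code), here is what a software approximation of the sine function looks like; an SFU replaces this kind of loop with dedicated hardware:

// Illustration: approximating sin(x) with the first terms of its Taylor
// series. Each extra term costs multiplications and a division; an SFU
// computes the whole function in a few cycles of dedicated hardware instead.
float taylor_sin(float x) {
    float term = x; // first term of the series: x
    float sum = x;
    for (int n = 1; n < 8; ++n) {
        term *= -x * x / ((2 * n) * (2 * n + 1)); // next term: * -x^2 / ((2n)(2n+1))
        sum += term;
    }
    return sum;
}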

Streaming Multiprocessors may be grouped in higher-level units called Texture Processor Clusters (TPC). A TPC may contain pieces of hardware specialized for graphics computation, such as dedicated texture caches and units.

All these pieces of hardware are orchestrated by compute work distribution units. They may communicate through an interconnection network, which might be connected to the global memory either directly or via an L2 cache.

Due to the highly parallel nature of GPUs, a special programming paradigm called Single Instruction Multiple Threads (SIMT) is used. It is closely related to the Single Instruction Multiple Data (SIMD) paradigm, in which the same instruction is executed in parallel on multiple data. With SIMT, the same instruction is executed by multiple threads concurrently.

It is important to be aware that threads are always executed in blocks; blocks of 32 threads are typical. It is therefore important to optimize with that in mind. For example, none of the code samples provided later in this article are optimized, because they deal with too little data.
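To make the SIMT model concrete, here is a tiny conceptual sketch in plain C++ (the names and the sequential loop are mine; real hardware runs the 32 lanes in parallel) of how one block executes the same instruction, with out-of-range lanes masked out:

#include <array>
#include <cstddef>

constexpr std::size_t block_size = 32;

// Emulates one block executing "result[i] = a[i] + b[i]" in SIMT style:
// every lane runs the same instruction; lanes whose global index falls
// outside the data are masked out, and their compute capacity is wasted.
void simt_block_add(const float* a, const float* b, float* result,
                    std::size_t block_id, std::size_t size) {
    for (std::size_t lane = 0; lane < block_size; ++lane) {
        const std::size_t i = block_id * block_size + lane;
        if (i < size) {
            result[i] = a[i] + b[i];
        }
    }
}

int main() {
    std::array<float, 40> a{}, b{}, r{};
    // 40 elements require 2 blocks of 32 lanes: 24 lanes do no useful work.
    for (std::size_t block = 0; block < 2; ++block) {
        simt_block_add(a.data(), b.data(), r.data(), block, a.size());
    }
}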

Now, let’s have a look at actual code. As before, I do not aim at a thorough coverage of the available technologies. I simply want to address a sensible scope of solutions and programming models. We will start with the GPGPU dinosaur.

GPU computing prehistory

GLSL : The Open Graphics Library (OpenGL) is used for 3D graphics. The OpenGL Shading Language (GLSL) was designed to provide OpenGL programmers with finer-grained access to the power of the GPU. In practice, in order to hide latency, GPUs use pipelined operations, and OpenGL accordingly relies on a pipelined model. The schema below shows a subset of the main steps of a traditional OpenGL pipeline.

[Figure: Usual OpenGL pipeline]

In the pipeline above, we can observe the process by which a triangle is rendered into a color buffer, which is typically displayed as-is on a screen. First, the positions and the colors of the 3 vertices are specified on the host, in a language like C/C++. Then, this information is sent to the device, where it is used as the input of the Vertex Shader.

A shader is a program written in GLSL, a language that looks like C with some major differences. The Vertex Shader is responsible for producing the actual colors and positions of the vertices of the shapes to render. The positions feed the Shape Assembly stage, whose output is then rasterized. Rasterization is the process by which fragments, which are more or less pixels, are computed based on the positions of the vertices and some graphics properties.

After that, the Fragment Shader is executed. This second shader can be written by the developer so as to change the color of the fragments to display. In the example above, a simple interpolation is used. Therefore, as the triangle has a red, a green and a blue vertex, it is displayed with internal colors varying across a broad color spectrum.

The Fragment Shader can also be used to manipulate textures. In the following figure, 2 textures are multiplied in the Fragment Shader to produce a simple effect.

[Figure: Texture multiplication in a Fragment Shader]

There are other kinds of shaders in OpenGL. However, for general-purpose computing, we only need the Fragment Shader. Indeed, instead of providing actual image textures, it is possible to build a texture from an array of floats. We take advantage of this capability to perform any kind of computation on the GPU with GLSL.

The following Fragment Shader program executes the addition of 2 arrays.

uniform sampler2DRect A;
uniform sampler2DRect B;

void main(void) {
    vec4 a = texture2DRect(A, gl_TexCoord[0].st);
    vec4 b = texture2DRect(B, gl_TexCoord[0].st);
    gl_FragColor = a + b;
}

There are 2 uniform variables, A and B, which are bound to values provided by the host program. The main function is short but very interesting: it is executed in parallel, following the SIMT model. Each thread deals with one single fragment, which contains red, green, blue, and alpha values. Those 4 values are fetched from the 2 textures into vec4 variables. gl_FragColor is a special GLSL variable which holds the final value of the current fragment.

The fact that the main function is executed in parallel, with a mechanism to fetch the specific value handled by the current thread, is really key. This programming technique is found almost everywhere in GPU computing.

There are several major drawbacks to GLSL for GPU computing. Let’s have a look at the complete program.

// credits: http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html

constexpr std::size_t N{16};
constexpr std::size_t texture_size{2};

using array = std::array<float, N>;

GLuint fbo_identifier;
GLuint glsl_program;
GLuint fragment_shader;
GLuint glut_window_handle;

GLint A;
GLint B;

GLuint texture_A_device_id;
GLuint texture_B_device_id;
GLuint texture_result_device_id;

array host_A;
array host_B;
array host_result;

std::iota(std::begin(host_A), std::end(host_A), 1);
std::fill(std::begin(host_B), std::end(host_B), 5);
std::fill(std::begin(host_result), std::end(host_result), 0);

{
    glutInit(&argc, argv);
    glut_window_handle = glutCreateWindow("Addition with GLSL");
}

{
    glGenFramebuffersEXT(1, &fbo_identifier);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo_identifier);
}

{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluOrtho2D(0.f, texture_size, 0.f, texture_size);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glViewport(0, 0, texture_size, texture_size);
}

{
    glGenTextures(1, &texture_A_device_id);

    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture_A_device_id);

    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_WRAP_S, GL_CLAMP);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_WRAP_T, GL_CLAMP);

    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA32F_ARB, texture_size, texture_size, 0, GL_RGBA, GL_FLOAT, 0);
    glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, texture_size, texture_size, GL_RGBA, GL_FLOAT, host_A.data());
}

{
    glGenTextures(1, &texture_B_device_id);

    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture_B_device_id);

    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_WRAP_S, GL_CLAMP);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_WRAP_T, GL_CLAMP);

    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA32F_ARB, texture_size, texture_size, 0, GL_RGBA, GL_FLOAT, 0);
    glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, texture_size, texture_size, GL_RGBA, GL_FLOAT, host_B.data());
}

{
    glGenTextures(1, &texture_result_device_id);

    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture_result_device_id);

    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_WRAP_S, GL_CLAMP);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_WRAP_T, GL_CLAMP);

    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA32F_ARB, texture_size, texture_size, 0, GL_RGBA, GL_FLOAT, 0);
    glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, texture_size, texture_size, GL_RGBA, GL_FLOAT, host_result.data());
}

{
    glsl_program = glCreateProgram();
    fragment_shader = glCreateShader(GL_FRAGMENT_SHADER_ARB);

    const GLchar* source = R"shader(
        uniform sampler2DRect A;
        uniform sampler2DRect B;

        void main(void) {
            vec4 a = texture2DRect(A, gl_TexCoord[0].st);
            vec4 b = texture2DRect(B, gl_TexCoord[0].st);
            gl_FragColor = a + b;
        }
    )shader";

    glShaderSource(fragment_shader, 1, &source, NULL);
    glCompileShader(fragment_shader);
    glAttachShader(glsl_program, fragment_shader);
    glLinkProgram(glsl_program);

    A = glGetUniformLocation(glsl_program, "A");
    B = glGetUniformLocation(glsl_program, "B");
}

{
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_TEXTURE_RECTANGLE_ARB, texture_result_device_id, 0);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT1_EXT, GL_TEXTURE_RECTANGLE_ARB, texture_B_device_id, 0);

    glUseProgram(glsl_program);

    glActiveTexture(GL_TEXTURE1);
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture_A_device_id);
    glUniform1i(B, 1);

    glDrawBuffer(GL_COLOR_ATTACHMENT0_EXT);

    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texture_B_device_id);
    glUniform1i(A, 0);

    glPolygonMode(GL_FRONT, GL_FILL);
    glBegin(GL_QUADS);
    {
        glTexCoord2f(0.0, 0.0);
        glVertex2f(0.0, 0.0);
        glTexCoord2f(texture_size, 0.0);
        glVertex2f(texture_size, 0.0);
        glTexCoord2f(texture_size, texture_size);
        glVertex2f(texture_size, texture_size);
        glTexCoord2f(0.0, texture_size);
        glVertex2f(0.0, texture_size);
    }
    glEnd();

    glFinish();
}

{
    glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
    glReadPixels(0, 0, texture_size, texture_size, GL_RGBA, GL_FLOAT, host_result.data());

    using outit = std::ostream_iterator<float>;
    std::copy(std::begin(host_result), std::end(host_result), outit{std::cout, " "});
    std::cout << std::endl;
}

{
    glDetachShader(glsl_program, fragment_shader);
    glDeleteShader(fragment_shader);

    glDeleteProgram(glsl_program);

    glDeleteFramebuffersEXT(1, &fbo_identifier);

    glDeleteTextures(1, &texture_A_device_id);
    glDeleteTextures(1, &texture_B_device_id);
    glDeleteTextures(1, &texture_result_device_id);

    glutDestroyWindow(glut_window_handle);
}

As we can see, there is a lot of boilerplate code to initialize the Fragment Shader, execute it, and get the results back. Worse, it is mandatory to actually draw 3D graphics in a window in order to compute the results.

The program starts by filling the host arrays. Then, a GLUT window is initialized. After that, one texture is defined for each host array, and each texture is bound to its respective host array. The Fragment Shader code is then defined as a string, which is sent to the GPU, compiled, linked and activated. The global variables of the Fragment Shader are bound to the textures. Then, a square is drawn, each vertex of which is associated with a corner of the final texture. The final texture is computed by the Fragment Shader, which adds the 2 textures A and B. As the viewport was suitably defined to encapsulate the drawn square, the Color Buffer contains the result of the addition of A and B. We can read this buffer and display the results on the standard output.

Although it works perfectly, writing this kind of code is a major pain in the ass, and it is extremely error-prone. Happily, the Khronos Group, which is in charge of OpenGL, has defined another technology which is much better suited to GPGPU tasks.

The Khronos family

OpenCL : The Open Computing Language is a cross-platform parallel programming language defined by the Khronos Group. Its kernel language is derived from C99, and recent versions of the standard also define a kernel language based on a subset of C++14. It can run on CPUs and on a variety of hardware accelerators, such as FPGAs (Field-Programmable Gate Arrays) and GPUs. Like with GLSL, you need to pass the host values to the device. Likewise, you need to define, compile and link the kernel code, which is the equivalent of the shader code in GLSL.

The following piece of code starts by getting a reference to a GPU device on the executing hardware. Then, it defines the kernel code in a raw string literal, compiles it, binds its input to the host array, runs it, and finally fetches the computation result and flushes it to stdout.

cl::Device device;
{
    cl::Platform platform = cl::Platform::getDefault();

    std::vector<cl::Device> all_devices;
    platform.getDevices(CL_DEVICE_TYPE_GPU, &all_devices);
    if (all_devices.empty()) {
        throw std::runtime_error{"No devices found. Check OpenCL installation!\n"};
    }

    device = all_devices.front();
}

std::string code = R"code(
    void kernel compute_square(global int* table) {
        size_t i = get_global_id(0);
        table[i] *= table[i];
    }
)code";

cl::Program::Sources sources;
sources.emplace_back(code.c_str(), code.length());

cl::Context context({device});
cl::Program program(context, sources);

try {
    program.build({device});
}
catch (const std::exception& e) {
    std::cerr << e.what() << '\n';
    std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device) << std::endl;
    throw;
}

constexpr std::size_t N = 10;
constexpr std::size_t buffer_size = sizeof(int) * N;

cl::CommandQueue queue(context, device);
{
    auto table = std::array<int, N>{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    cl::Buffer buffer(context, CL_MEM_READ_WRITE, buffer_size);

    queue.enqueueWriteBuffer(buffer, CL_TRUE, 0, buffer_size, table.data());

    cl::Kernel compute_square(program, "compute_square");
    compute_square.setArg(0, buffer);

    // Launch the kernel; the blocking read below waits for its completion
    queue.enqueueNDRangeKernel(compute_square, cl::NullRange, cl::NDRange(N));

    queue.enqueueReadBuffer(buffer, CL_TRUE, 0, buffer_size, table.data());

    using outit = std::ostream_iterator<int>;
    std::copy(std::begin(table), std::end(table), outit{std::cout, " "});
    std::cout << std::endl;
}

The flow feels much better than GLSL’s. OpenCL is a standard that can run on multiple devices from multiple manufacturers, which makes it a very strong solution for GPU computing. However, it still feels quite low-level, and there remains a certain amount of boilerplate code that could be wrapped and encapsulated in higher-level constructs.
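For instance, the recurring define/compile/link steps could be hidden behind a small helper. The following is only a sketch of the idea, using the same C++ bindings as above (the helper name and its shape are mine, not part of OpenCL):

#include <CL/cl.hpp>
#include <string>

// Hypothetical helper wrapping the kernel compilation boilerplate.
cl::Kernel make_gpu_kernel(const std::string& source,
                           const std::string& kernel_name,
                           const cl::Context& context,
                           const cl::Device& device) {
    cl::Program::Sources sources;
    sources.emplace_back(source.c_str(), source.length());

    cl::Program program(context, sources);
    program.build({device}); // throws when OpenCL exceptions are enabled

    return cl::Kernel(program, kernel_name.c_str());
}

A kernel created this way can then be enqueued exactly as in the sample above.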

One fun observation is that you can really mess up your system with GPU computing. When you enter the arena of undefined behavior on the CPU, your operating system will often simply catch the problem and nicely issue a segmentation fault and a core dump. On the GPU, without specific tooling support, you can easily end up watching your whole system reboot due to a bug in your kernel code.

SPIR-V : Early versions of OpenCL had an important drawback in terms of intellectual property: the kernel source code had to be distributed to customers, because it was always compiled just-in-time at runtime. The Standard Portable Intermediate Representation (SPIR-V) fixes this problem. As its name suggests, it provides an intermediate representation produced by compiling the kernel sources offline. Any driver that supports recent versions of OpenCL can consume SPIR-V code.
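To give a feel for the flow, here is a minimal sketch (assuming an OpenCL 2.1 driver; the file name is hypothetical) where the kernel ships as a precompiled SPIR-V blob instead of source code:

#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    cl_platform_id platform{};
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device{};
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);

    // "kernel.spv" is a hypothetical blob produced by an offline compiler.
    std::ifstream file{"kernel.spv", std::ios::binary};
    std::vector<char> il{std::istreambuf_iterator<char>{file}, {}};

    // The customer's driver consumes the intermediate representation:
    // the original kernel source never leaves the vendor.
    cl_int error{};
    cl_program program = clCreateProgramWithIL(context, il.data(), il.size(), &error);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    // ... create kernels and enqueue them as usual ...

    clReleaseProgram(program);
    clReleaseContext(context);
}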

Note that OpenCL is not the only technology which targets the generation of SPIR-V code. Vulkan, the successor of OpenGL, is another one.

To be honest, I have not investigated Vulkan so far. I assume that OpenCL is more suited to GPGPU than Vulkan, which seems rather branded as a 3D graphics API.

SYCL : This specification describes a pure modern C++ API to program a GPU (or, more generally, an accelerator). Contrary to OpenCL, you don’t have to write explicit kernel code, and contrary to OpenMP or OpenACC, you don’t have to use specific pragmas. You can simply use template algorithms parametrized with lambda functions. The kind of code that we obtain is rather neat and concise, as we can see in the vector addition sample below.

SYCL is supposed to be built on top of OpenCL and SPIR-V. The problem, nowadays, is that SYCL is mainly a specification. There is an implementation called triSYCL, built on top of OpenMP, which enables members of the Khronos committee and others interested in SYCL and C++ to try the API.

constexpr std::size_t N{16};

using array = std::array<int, N>;
using buffer = cl::sycl::buffer<int, 1>;
using range = cl::sycl::range<1>;
using id = cl::sycl::id<1>;

auto host_A = array{};
auto host_B = array{};
auto host_result = array{};

std::iota(std::begin(host_A), std::end(host_A), 1);
std::fill(std::begin(host_B), std::end(host_B), 5);

auto device_A = buffer(host_A.data(), range{N});
auto device_B = buffer(host_B.data(), range{N});
auto device_result = buffer(host_result.data(), range{N});

cl::sycl::queue queue;
queue.submit([&device_A, &device_B, &device_result](cl::sycl::handler& cl_handler) {
    using cl::sycl::access::read;
    using cl::sycl::access::write;

    auto a = device_A.get_access<read>(cl_handler);
    auto b = device_B.get_access<read>(cl_handler);
    auto result = device_result.get_access<write>(cl_handler);

    cl_handler.parallel_for(range{N}, [a, b, result](id i) {
        result[i] = a[i] + b[i];
    });
});

using outit = std::ostream_iterator<int>;
std::copy(std::begin(host_result), std::end(host_result), outit{std::cout, " "});
std::cout << std::endl;

SYCL looks very promising to me. However, it cannot really be used for GPGPU at the time of writing.

The common non-portable way

CUDA : The Compute Unified Device Architecture is more mature than OpenCL and enjoys broader library support. However, it is specific to NVIDIA hardware, and it requires a special compiler. In CUDA, the kernel code is not defined as a string or in a separate file, but inside a C++ function tagged with a special keyword such as __global__. This kernel function has to be invoked with a special triple-chevron syntax (<<<...>>>) that enables accurate control over the allocated hardware resources, in terms of blocks and threads. The following example illustrates yet another vector addition.

__global__ void kernel_add(const float* A, const float* B, float* result, int size) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < size) {
        result[i] = A[i] + B[i];
    }
}

int main() {
    struct raii_finalize {
        ~raii_finalize() {
            cudaDeviceReset();
        }
    } remember_to_finalize_cuda;

    auto cuda_deleter = [](float* device) {
        cudaFree(device);
    };
    using unique_ptr = std::unique_ptr<float, decltype(cuda_deleter)>;
    auto make_unique = [&cuda_deleter](float* device) -> unique_ptr {
        return unique_ptr(device, cuda_deleter);
    };

    constexpr int N{5};
    constexpr std::size_t size{N * sizeof(float)};
    constexpr int threadsPerBlock{256};
    constexpr int blocksPerGrid{(N + threadsPerBlock - 1) / threadsPerBlock};

    using array = std::array<float, N>;

    auto host_A = array{ 3.14f, 42.f, 0.f, 5.f, 2.5f };
    auto host_B = array{ 6.63f, 4.2f, 33.f, 0.123f, 7.f };

    float* device_A{nullptr};
    cudaMalloc((void**)&device_A, size);
    auto make_sure_to_free_device_A = make_unique(device_A);
    cudaMemcpy(device_A, host_A.data(), size, cudaMemcpyHostToDevice);

    float* device_B{nullptr};
    cudaMalloc((void**)&device_B, size);
    auto make_sure_to_free_device_B = make_unique(device_B);
    cudaMemcpy(device_B, host_B.data(), size, cudaMemcpyHostToDevice);

    float* device_result{nullptr};
    cudaMalloc((void**)&device_result, size);
    auto make_sure_to_free_device_result = make_unique(device_result);

    kernel_add<<<blocksPerGrid, threadsPerBlock>>>(device_A, device_B, device_result, N);

    auto host_result = array{};
    cudaMemcpy(host_result.data(), device_result, size, cudaMemcpyDeviceToHost);

    using outit = std::ostream_iterator<float>;
    std::copy(std::begin(host_result), std::end(host_result), outit{std::cout, " "});

    std::cout << std::endl;
}

The general purpose intermediary libraries

Boost.Compute : This library provides an implementation of most STL algorithms on the GPU. It is a header-only library of template code on top of OpenCL. At the time of writing, it has been officially part of Boost for only 3 weeks. Boost.Compute is very convenient and well-designed.

Using Boost.Compute does not require much knowledge about GPUs. Anyone who is comfortable with the STL algorithms can very quickly become rather effective with it, and with very few lines of code it is easy to go beyond the traditional vector addition.

In the following example, each element of an array of integers is squared, and the squares are then summed. It is a typical map/reduce.

const std::size_t N{10};

auto table = std::array<int, N>{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

boost::compute::device gpu = boost::compute::system::default_device();
boost::compute::context context(gpu);
boost::compute::command_queue queue(context, gpu);

boost::compute::vector<int> device_vector(N, context);
boost::compute::copy(std::begin(table), std::end(table), std::begin(device_vector), queue);

// Squares each element by multiplying it by itself, then sums the squares
int sum{};
boost::compute::transform_reduce(std::begin(device_vector),
                                 std::end(device_vector),
                                 std::begin(device_vector),
                                 &sum,
                                 boost::compute::multiplies<int>(),
                                 boost::compute::plus<int>(),
                                 queue);

std::cout << "sum: " << sum << std::endl;

In this second example, we use iota to generate the first 10 floating-point numbers into a vector. Then, this vector is reversed. After that, it is sorted and, finally, its content is copied to the standard output. All these operations (except the copy to stdout) occur on the GPU.

const std::size_t N{10};

boost::compute::device gpu = boost::compute::system::default_device();
boost::compute::context context(gpu);
boost::compute::command_queue queue(context, gpu);

boost::compute::vector<float> device_vector(N, context);
boost::compute::iota(std::begin(device_vector), std::end(device_vector), 0.f, queue);
boost::compute::reverse(std::begin(device_vector), std::end(device_vector), queue);

boost::compute::sort(device_vector.begin(), device_vector.end(), queue);

using outit = std::ostream_iterator<float>;
boost::compute::copy(std::begin(device_vector), std::end(device_vector), outit{std::cout, " "}, queue);

Boost.Compute is my favorite GPGPU technology, in terms of programming experience. An actual benchmark would be nice in order to compare the available technologies with respect to performance.

C++ AMP : C++ Accelerated Massive Parallelism is a C++ specification from Microsoft to benefit from highly parallel hardware such as GPUs. The following code performs a vector addition.

constexpr int N{3};
using array = std::array<float, N>;

auto host_A = array{ 1.f, 2.f, 3.14f };
auto host_B = array{ 42.f, 0.f, 6.33f };
auto host_result = array{};

using const_array_view = concurrency::array_view<const float, 1>;
using array_view = concurrency::array_view<float, 1>;

auto device_A = const_array_view(N, host_A);
auto device_B = const_array_view(N, host_B);
auto device_result = array_view(N, host_result);

device_result.discard_data();

auto kernel_addition = [=](concurrency::index<1> i) restrict(amp) {
    device_result[i] = device_A[i] + device_B[i];
};

concurrency::parallel_for_each(device_result.extent, kernel_addition);

for (int i{}; i < N; i++)
    std::cout << device_result[i] << " ";
std::cout << std::endl;

C++ AMP Library : A library providing STL-like algorithms, built on top of C++ AMP.

The pragma family

OpenACC : Open Accelerators is a cousin of OpenMP for GPU computing (or other kinds of accelerators). Like OpenMP, it specifies a set of pragma directives to control what should be offloaded to a hardware accelerator.

OpenACC is pretty young, and it is only partially supported by the very recent version 6 of gcc. As I mainly use clang and gcc 5.3, I could not build an actual code sample. The following piece of code relies on my understanding of the specification; hence, take it with a grain of salt.

constexpr int N{5};
using array = std::array<float, N>;

auto A = array{ 42.f, 3.14f, 0.f, -3.34f, 6.36f };

float* pA{A.data()};
float sum{};

#pragma acc kernels
#pragma acc enter data copyin(pA[0:N])
#pragma acc exit data copyout(pA[0:N])
#pragma acc parallel reduction (+:sum)
for (int i{}; i < N; i++) {
    pA[i] = std::pow(pA[i], 2);
    sum += pA[i];
}

using outit = std::ostream_iterator<float>;
std::copy(std::begin(A), std::end(A), outit{std::cout, " "});
std::cout << "\nsum: " << sum << std::endl;

OpenMP : Starting from version 4, OpenMP, which was introduced in the previous article, supports GPU computing thanks to #pragma omp target.
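I could not test device offloading either, so the following vector addition is only a sketch based on my reading of the OpenMP 4 specification. With a compliant compiler, the loop is offloaded to the accelerator; otherwise, it falls back to the host.

#include <array>
#include <iostream>

int main() {
    constexpr int N{5};
    auto A = std::array<float, N>{ 1.f, 2.f, 3.f, 4.f, 5.f };
    auto B = std::array<float, N>{ 5.f, 4.f, 3.f, 2.f, 1.f };
    auto result = std::array<float, N>{};

    float* pA{A.data()};
    float* pB{B.data()};
    float* presult{result.data()};

    // The map clauses copy the inputs to the device and the result back
    #pragma omp target teams distribute parallel for \
            map(to: pA[0:N], pB[0:N]) map(from: presult[0:N])
    for (int i = 0; i < N; i++) {
        presult[i] = pA[i] + pB[i];
    }

    for (float value : result) {
        std::cout << value << " ";
    }
    std::cout << std::endl;
}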

OpenHMPP : The Hybrid Multicore Parallel Programming standard defines a set of special pragmas for C and Fortran compilers, in the spirit of OpenMP and OpenACC.

The general purpose high-level libraries

Bolt : This is another cool library providing STL-like algorithms, built on top of OpenCL, C++ AMP, and TBB (which was introduced in the previous article). It chooses the best-suited backend at runtime, based on the size of the input. For example, a small problem should not be sent to the GPU: the overhead incurred by the transfer from the CPU, plus the potentially required just-in-time kernel compilation, would produce worse performance. Compare the following two-liner with the GLSL example…

auto v = std::vector<int>{ 5, 4, 2, 3, 1 };
bolt::cl::sort(v.begin(), v.end());

Thrust : Like Bolt, Thrust is a nice C++ library providing STL-like algorithms. This one is built on top of CUDA, TBB and OpenMP.
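To give a feel for the API, here is a minimal sketch of our usual vector addition with Thrust (assuming nvcc and a CUDA device; I have not benchmarked it):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    auto host_A = std::vector<float>{ 1.f, 2.f, 3.14f };
    auto host_B = std::vector<float>{ 42.f, 0.f, 6.33f };

    // Constructing a device_vector from host iterators copies the data to the GPU
    thrust::device_vector<float> device_A(std::begin(host_A), std::end(host_A));
    thrust::device_vector<float> device_B(std::begin(host_B), std::end(host_B));
    thrust::device_vector<float> device_result(device_A.size());

    // The addition runs on the device, STL-style
    thrust::transform(device_A.begin(), device_A.end(),
                      device_B.begin(),
                      device_result.begin(),
                      thrust::plus<float>());

    // thrust::copy transfers the result back to the host while printing it
    using outit = std::ostream_iterator<float>;
    thrust::copy(device_result.begin(), device_result.end(), outit{std::cout, " "});
    std::cout << std::endl;
}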

The scientific high-level libraries

VexCL : This is an abstraction layer built on top of several backends: OpenCL, Boost.Compute or CUDA. It aims to provide the best of the OpenCL and CUDA worlds. VexCL particularly shines for linear algebra.

Once again, to give a feeling about this library, here is our usual vector addition.

auto context = vex::Context{vex::Filter::GPU};
if (!context) throw std::runtime_error("No GPU available.");

auto host_V1 = std::vector<float>{ 5.f, 2.5f, 0.f, -42.f, 3.14f };
auto host_V2 = std::vector<float>{ -5.f, 7.5f, 0.f, 4.2f, 6.63f };

auto device_V1 = vex::vector<float>{context, host_V1};
auto device_V2 = vex::vector<float>{context, host_V2};
auto device_result = vex::vector<float>{context, host_V1.size()};

device_result = device_V1 + device_V2;

auto host_result = std::vector<float>(host_V1.size());
vex::copy(device_result, host_result);

using outit = std::ostream_iterator<float>;
std::copy(std::begin(host_result), std::end(host_result), outit{std::cout, " "});
std::cout << std::endl;

ViennaCL : ViennaCL looks, to me, a lot like VexCL. It is built on top of OpenCL, CUDA and OpenMP, and targets linear algebra on accelerators. It seems to be a mature project, with interfaces to Python, Matlab and mathematics libraries.

It seems that GPGPU has really matured lately: there is a fair bunch of strong libraries in the wild. However, it is pretty hard to tell which library really stands out. Different programming styles are available, and you may trade portability for performance. The fans of pragmas can use OpenMP. Bolt enables you to get something up and running fast, with a tiny amount of code. I like Boost.Compute, and I am convinced that both VexCL and ViennaCL are very good solutions for scientific computing. I wish I had time to back this article with some sort of benchmark, but it goes beyond the scope of this series about multithreading.

Keeping up with the evolution of accelerator hardware and programming models is hard. OpenCL, CUDA, OpenACC, C++ AMP, etc. are all actively developed. This is excellent news, because GPU computing still feels much less natural and less mature than CPU computing. In particular, it really looks like, if you want the most out of your hardware, you should avoid the high-level abstractions.

I don’t believe that high-level programming models like Bolt and Thrust can give you the best performance, because they hide the communication between the CPU and the GPU. My experience with distributed computing tells me that Bolt and Thrust are like CORBA: they hide too much of the complexity, with a likely performance penalty as a result.

As parallelism increases in common computer architectures, the odds are that the programming models that back them will need to evolve. A possible evolution path goes from manual thread spawning and management, to thread pools, to task-based parallelism, to pragmas across usual code, and eventually to purely functional kernel programming and SIMT.

This article just scratches the surface of GPGPU. There are dedicated books, blogs, and a whole community behind it. Techniques such as coalesced memory access (versus strided or random access) or stencil code are key to efficiency, but they would deserve dedicated blog posts.
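To give just a taste, here is a sketch of the two access patterns, written in CUDA for concreteness (the kernels are mine, illustrations rather than tuned code):

// Coalesced: thread i reads word i, so the 32 threads of a warp read 32
// adjacent words, which the hardware merges into a few wide transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Strided: adjacent threads read words that are far apart, which can cost
// up to one memory transaction per thread.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = in[(i * stride) % n];
    }
}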

Somehow, GPGPU is already distributed computing: you need to distribute the computation between the CPU and the GPU. In a way, the CPU performs Remote Procedure Calls (RPC) to the GPU. The next article will briefly introduce multiprocess programming and compare it to multithreading. In later articles, we will talk about distributed computing.

I will conclude with a remark: there is a plethora of web compilers supporting the same set of programming languages, yet I did not find a web compiler supporting GPU computing. AWS allows you to spawn a Hardware Virtual Machine (HVM), but it does not help when all you want is to try out a specific GPU library. That could be a startup idea!
