Reposted:
OpenCL (Open Computing Language) is a new framework for writing programs that execute in parallel on different compute devices (such as CPUs and GPUs) from different vendors (AMD, Intel, ATI, Nvidia etc.). The framework defines a language to write “kernels” in. These kernels are the functions which are to run on the different compute devices. In this post I explain how to get started with OpenCL and how to make a small OpenCL program that will compute the sum of two lists in parallel.
Installing and setting up OpenCL on your computer
First of all you need to download the newest drivers for your graphics card. This is important because OpenCL will not work if you don’t have drivers that support OpenCL.
To install OpenCL you need to download an implementation of OpenCL. The major graphics vendors Nvidia and AMD/ATI have both released implementations of OpenCL for their GPUs. These implementations come in so-called software development kits (SDKs) and often include useful tools such as a visual profiler. The next step is to download and install the SDK for the GPU you have in your computer. Note that not all graphics cards are supported. A list of supported graphics cards can be found on the vendors' websites.
For AMD/ATI GPUs, download the AMD APP SDK.
For Nvidia GPUs, download the CUDA toolkit.
The installation steps differ for each SDK and for the OS you are running. Follow the installation manual of the SDK carefully. Personally I use Ubuntu Linux and have an AMD 7970 graphics card. Below are some installation steps for this specific setup.
Installing OpenCL on Ubuntu Linux with AMD graphics card
To install the latest AMD drivers on Ubuntu 12.04, open Additional Drivers and install/activate the one called “ATI/AMD proprietary FGLRX graphic driver (post-release updates)”.
After that is done, restart, then download and extract the AMD APP SDK. AMD APP SDK 2.8 includes an installer. Run it with the command:
sudo sh Install-AMD-APP.sh
Next, install the OpenCL header files:
sudo apt-get install opencl-headers
And you're done! Note that the AMD APP SDK and its samples are located at /opt/AMDAPP.
Installing OpenCL on Ubuntu Linux with NVIDIA graphics card
Download the CUDA toolkit for Ubuntu from Nvidia's website. Open a terminal and run the installation file with the command:
sudo sh cudatoolkit_3.1_linux_64_ubuntu9.10.run
Download the Developer Drivers for Linux from the same website. To install them, stop X, run the installer, and then start X again. To stop X use:
sudo /etc/init.d/gdm stop
Then get a terminal up by pressing CTRL+ALT+F5, log in, navigate to where you downloaded the devdriver, and type:
sudo sh devdriver_3.1_linux_64_256.40.run
After the driver has been installed, start X again by typing:
startx
Before compiling an OpenCL application you need to add the path to the lib folder of CUDA to LD_LIBRARY_PATH like so:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
Your first OpenCL program – Vector addition
To demonstrate OpenCL I explain how to perform the simple task of vector addition. Suppose we have two lists of numbers, A and B, of equal size. The task of vector addition is to add each element of A to the corresponding element of B and store the result in the corresponding element of a new list, C, of the same size. The figure below explains the operation.
The naive way of performing this operation is to simply loop through the list and perform the operation on one element at a time like the C++ code below:
for(int i = 0; i < LIST_SIZE; i++) {
    C[i] = A[i] + B[i];
}
This algorithm is simple but has linear time complexity, O(n), where n is the size of the list. However, since each iteration of this loop is independent of the other iterations, the operation is data parallel, meaning that all iterations can be computed simultaneously. So if we have n cores on a processor, this operation can be performed in constant time, O(1).
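To make the data-parallel claim concrete, here is a minimal sketch in plain C (no OpenCL involved) that splits the loop into independent chunks, one per worker. The names `chunk_bounds`, `add_chunk` and `add_in_chunks` are made up for this illustration. The workers are simply visited one after another, in reverse order, to show that the order does not matter because no chunk reads anything that another chunk writes.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the half-open range [*begin, *end) handled by worker k when
   n elements are split as evenly as possible among `workers` workers. */
void chunk_bounds(size_t n, size_t workers, size_t k, size_t *begin, size_t *end) {
    assert(workers > 0 && k < workers);
    size_t base = n / workers;
    size_t extra = n % workers;  /* the first `extra` workers get one more element */
    *begin = k * base + (k < extra ? k : extra);
    *end = *begin + base + (k < extra ? 1 : 0);
}

/* Add one chunk. Only indices in [begin, end) are read or written. */
void add_chunk(const int *A, const int *B, int *C, size_t begin, size_t end) {
    for (size_t i = begin; i < end; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Sequential stand-in for parallel execution: process the chunks in
   reverse order. Any other order would give the same C. */
void add_in_chunks(const int *A, const int *B, int *C, size_t n, size_t workers) {
    for (size_t k = workers; k-- > 0; ) {
        size_t begin, end;
        chunk_bounds(n, workers, k, &begin, &end);
        add_chunk(A, B, C, begin, end);
    }
}
```

With n workers each chunk shrinks to a single element, which is the O(1) case described above.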
To make OpenCL perform this operation in parallel we need to make the kernel. The kernel is the function which will run on the compute device.
The kernel
The kernel is written in the OpenCL C language, which is based on C (C99) with some restrictions and adds many built-in math and vector functions. The kernel that performs the vector addition is defined below.
__kernel void vector_add(__global const int *A, __global const int *B, __global int *C) {

    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}
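The kernel has no loop: the runtime calls it once per work-item, and get_global_id(0) returns the index of that work-item. The plain-C sketch below imitates this on the host so the indexing can be followed without a device. `emulated_vector_add` and `run_ndrange_1d` are hypothetical names for this illustration, and the loops here run sequentially, whereas a real device runs work-items in parallel and in no guaranteed order. For a one-dimensional range with no global offset, the global id equals group id times group size plus local id.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-in for the kernel body: one call handles one element.
   `global_id` plays the role of get_global_id(0). */
static void emulated_vector_add(const int *A, const int *B, int *C, size_t global_id) {
    C[global_id] = A[global_id] + B[global_id];
}

/* Emulate a 1-D range of global_size work-items, organised in
   work-groups of local_size work-items each. */
void run_ndrange_1d(const int *A, const int *B, int *C,
                    size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; group++) {
        for (size_t local = 0; local < local_size; local++) {
            size_t global_id = group * local_size + local;
            emulated_vector_add(A, B, C, global_id);
        }
    }
}
```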
The host program
The host program controls the execution of kernels on the compute devices. The host program is written in C, but bindings for other languages like C++ and Python exist. The OpenCL API is defined in the cl.h (or opencl.h for Apple) header file. Below is the code for the host program that executes the kernel above on a compute device. I will not go into detail on each step, as this is supposed to be an introductory article, but I can recommend the book listed at the end of this post if you want to dive into the details. The main steps of a host program are as follows:
- Get information about the platform and the devices available on the computer (line 42)
- Select devices to use in execution (line 43)
- Create an OpenCL context (line 47)
- Create a command queue (line 50)
- Create memory buffer objects (lines 53-58)
- Transfer data (lists A and B) to memory buffers on the device (lines 61-64)
- Create program object (line 67)
- Load the kernel source code (lines 24-35) and compile it (line 71) (online execution) or load the precompiled binary OpenCL program (offline execution)
- Create kernel object (line 74)
- Set kernel arguments (lines 77-79)
- Execute the kernel (line 84)
- Read memory objects. In this case we read the list C from the compute device (lines 88-90)
  1  #include <stdio.h>
  2  #include <stdlib.h>
  3
  4  #ifdef __APPLE__
  5  #include <OpenCL/opencl.h>
  6  #else
  7  #include <CL/cl.h>
  8  #endif
  9
 10  #define MAX_SOURCE_SIZE (0x100000)
 11
 12  int main(void) {
 13      // Create the two input vectors
 14      int i;
 15      const int LIST_SIZE = 1024;
 16      int *A = (int*)malloc(sizeof(int)*LIST_SIZE);
 17      int *B = (int*)malloc(sizeof(int)*LIST_SIZE);
 18      for(i = 0; i < LIST_SIZE; i++) {
 19          A[i] = i;
 20          B[i] = LIST_SIZE - i;
 21      }
 22
 23      // Load the kernel source code into the array source_str
 24      FILE *fp;
 25      char *source_str;
 26      size_t source_size;
 27
 28      fp = fopen("vector_add_kernel.cl", "r");
 29      if (!fp) {
 30          fprintf(stderr, "Failed to load kernel.\n");
 31          exit(1);
 32      }
 33      source_str = (char*)malloc(MAX_SOURCE_SIZE);
 34      source_size = fread( source_str, 1, MAX_SOURCE_SIZE, fp);
 35      fclose( fp );
 36
 37      // Get platform and device information
 38      cl_platform_id platform_id = NULL;
 39      cl_device_id device_id = NULL;
 40      cl_uint ret_num_devices;
 41      cl_uint ret_num_platforms;
 42      cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
 43      ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
 44              &device_id, &ret_num_devices);
 45
 46      // Create an OpenCL context
 47      cl_context context = clCreateContext( NULL, 1, &device_id, NULL, NULL, &ret);
 48
 49      // Create a command queue
 50      cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
 51
 52      // Create memory buffers on the device for each vector
 53      cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
 54              LIST_SIZE * sizeof(int), NULL, &ret);
 55      cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
 56              LIST_SIZE * sizeof(int), NULL, &ret);
 57      cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
 58              LIST_SIZE * sizeof(int), NULL, &ret);
 59
 60      // Copy the lists A and B to their respective memory buffers
 61      ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
 62              LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
 63      ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0,
 64              LIST_SIZE * sizeof(int), B, 0, NULL, NULL);
 65
 66      // Create a program from the kernel source
 67      cl_program program = clCreateProgramWithSource(context, 1,
 68              (const char **)&source_str, (const size_t *)&source_size, &ret);
 69
 70      // Build the program
 71      ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
 72
 73      // Create the OpenCL kernel
 74      cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);
 75
 76      // Set the arguments of the kernel
 77      ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
 78      ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
 79      ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);
 80
 81      // Execute the OpenCL kernel on the list
 82      size_t global_item_size = LIST_SIZE; // Process the entire lists
 83      size_t local_item_size = 64; // Divide work items into groups of 64
 84      ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
 85              &global_item_size, &local_item_size, 0, NULL, NULL);
 86
 87      // Read the memory buffer C on the device to the local variable C
 88      int *C = (int*)malloc(sizeof(int)*LIST_SIZE);
 89      ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
 90              LIST_SIZE * sizeof(int), C, 0, NULL, NULL);
 91
 92      // Display the result to the screen
 93      for(i = 0; i < LIST_SIZE; i++)
 94          printf("%d + %d = %d\n", A[i], B[i], C[i]);
 95
 96      // Clean up
 97      ret = clFlush(command_queue);
 98      ret = clFinish(command_queue);
 99      ret = clReleaseKernel(kernel);
100      ret = clReleaseProgram(program);
101      ret = clReleaseMemObject(a_mem_obj);
102      ret = clReleaseMemObject(b_mem_obj);
103      ret = clReleaseMemObject(c_mem_obj);
104      ret = clReleaseCommandQueue(command_queue);
105      ret = clReleaseContext(context);
106      free(A);
107      free(B);
108      free(C);
109      return 0;
110  }
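The listing works as written because LIST_SIZE (1024) is a multiple of the work-group size (64). In OpenCL 1.x, clEnqueueNDRangeKernel fails with CL_INVALID_WORK_GROUP_SIZE when an explicit local size does not evenly divide the global size. A common workaround is to round the global size up to the next multiple, sketched below in plain C. `round_up_global_size` is a hypothetical helper, not part of the OpenCL API, and the approach assumes the kernel is given the real list length as an extra argument and skips indices beyond it, which the kernel in this post does not do.

```c
#include <assert.h>
#include <stddef.h>

/* Return the smallest multiple of local_size that is >= n. */
size_t round_up_global_size(size_t n, size_t local_size) {
    assert(local_size > 0);
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}
```

For example, a list of 1000 elements with groups of 64 would be launched with a global size of 1024, and the 24 surplus work-items would have to return without touching memory.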
To make OpenCL run the kernel on the GPU you can change the constant CL_DEVICE_TYPE_DEFAULT to CL_DEVICE_TYPE_GPU in line 43. To run on the CPU you can set it to CL_DEVICE_TYPE_CPU. This shows how easy OpenCL makes it to run the same program on different compute devices.
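Changing the device type is also a likely place for a first runtime error: if the platform has no device of the requested type, clGetDeviceIDs returns CL_DEVICE_NOT_FOUND, and the host program above never inspects the value stored in ret. The following is a minimal sketch of a checking helper. `cl_status_name` and `cl_check` are hypothetical names, not OpenCL functions. The numeric values are copied from the standard cl.h header so that the sketch compiles without OpenCL installed; in a real program you would include cl.h and use the CL_* constants directly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Map a few common OpenCL status codes to their names in cl.h. */
const char *cl_status_name(int status) {
    switch (status) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        default:  return "unknown OpenCL status";
    }
}

/* Print a message and exit if `status` is not CL_SUCCESS (0).
   `what` names the call that produced the status. */
void cl_check(int status, const char *what) {
    assert(what != NULL);
    if (status != 0) {
        fprintf(stderr, "%s failed: %s (%d)\n", what, cl_status_name(status), status);
        exit(EXIT_FAILURE);
    }
}
```

In the host program this could be used as cl_check(ret, "clGetDeviceIDs") right after line 44, and in the same way after the other calls that assign ret.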
Compiling an OpenCL program
If the OpenCL header and library files are located in their proper folders (/usr/include and /usr/lib) the following command will compile the vectorAddition program.
gcc main.c -o vectorAddition -l OpenCL
How to learn more
To learn more about OpenCL, I recommend the book from Fixstars. Below are some links to useful sites with information on OpenCL: