OpenCL* and OpenGL* Interoperability Tutorial

April 28, 2014, 12:43 pm

Latest and popular articles on Intel Technologies

OpenCL* and OpenGL* are two common APIs that support efficient interoperability. OpenCL is designed to increase computing efficiency across platforms, and OpenGL is a popular graphics API. This tutorial provides an overview of basic methods for resource sharing and synchronization between these two APIs, supported by performance numbers and recommendations. A few advanced interoperability topics are also introduced, along with references.

Introduction

A general interoperability scenario includes transferring data between OpenCL and OpenGL on a regular basis (often per frame) and in both directions. For example:

Physics simulation in OpenCL, producing the vertex data to be rendered with OpenGL
Generating a frame with OpenGL, with further image post-processing applied in OpenCL (e.g. HDR tone-mapping)
Procedural (noise) generation in OpenCL, followed by using the results as an OpenGL texture in the rendering pipeline

Programmers often must choose between different APIs for programming GPUs, for example, choosing either GLSL or OpenCL kernels. For short, simple tasks that interact directly with the graphics pipeline, OpenGL Compute Shaders can be a good choice. For more general (and complex) scenarios, OpenCL might have advantages over GLSL, since it executes the compute portion asynchronously to the rendering pipeline. Also, OpenCL is not limited to GPUs: it can target CPUs and, unlike conventional “native” programming in C/C++, other devices such as DSPs.

OpenCL also interoperates with other APIs, which makes the overall application flow more efficient. A good example is video transcoding, which is best handled with the Intel Media SDK. Unlike OpenGL, the Intel OpenCL implementation offers zero-copy sharing with the Media SDK.

General Execution Flow

It is important to understand the parameters and limitations of the interoperability, which are covered in this tutorial. Since most OpenCL and OpenGL calls are not executed immediately but are placed in command queues, host logic is required to coordinate the resource ownership between OpenCL and OpenGL.

True (zero-copy) interoperability is about passing ownership of a resource between the APIs, not the actual data. It is important to remember that an OpenCL memory object is always created from an OpenGL object, and not vice versa:

OpenGL texture (or render-buffer) becomes an OpenCL image (via clCreateFromGLTexture).
OpenCL images are very similar to OpenGL textures in that they support interpolation, border modes, normalized coordinates, etc.
OpenGL (vertex) buffers are transformed to OpenCL buffers in a similar way (clCreateFromGLBuffer).

Basic Approaches to the OpenGL-OpenCL Interoperability

This tutorial discusses three different ways to utilize interoperability with OpenGL for the following general scenario:

OpenCL memory object is created from the OpenGL texture.
For every frame, the OpenCL memory object is acquired, then updated with an OpenCL kernel, and finally released to provide the updated texture data back to OpenGL.
For every frame, OpenGL renders a textured screen quad to display the results.

There are three different interop modes that we compare in this tutorial:

Direct OpenGL texture sharing via clCreateFromGLTexture.
- This is the most efficient interoperability mode for an Intel HD Graphics OpenCL device. It also allows the modification of textures “in-place.”
- The number of OpenGL texture formats and targets that are possible to share via OpenCL images is limited.
Creating an intermediate (staging) Pixel-Buffer-Object for the OpenGL texture via clCreateFromGLBuffer, updating the buffer with OpenCL, and copying the results back to the texture.
- The downside of this approach is that even though the PBO itself does support zero-copy with OpenCL, the final PBO-to-texture transfer still relies on copying, so this method is slower than the direct sharing method introduced above.
- The upside is that there are fewer restrictions on the texture formats and targets beyond those imposed by the glTexSubImage2D.
Mapping the Pixel-Buffer-Object with glMapBuffer, wrapping the resulting host pointer as an OpenCL buffer for processing, and copying the results back on glUnmapBuffer.
- Similar to the PBO-based approach above, this approach allows you to perform interop for virtually any texture that can be updated with glTexSubImage2D.
- Unlike the PBO-based method above, it does not require any extension support.
- Performance-wise, it is even slower than the PBO-based method, particularly on small textures, because the OpenCL buffer is created and released in every frame.

From a performance perspective, direct OpenGL texture sharing (Texture Sharing via clCreateFromGLTexture) is the fastest way to share data with an Intel HD Graphics OpenCL device, while mapping with glMapBuffer is the slowest. However, for a CPU OpenCL device, the ranking is reversed.

We do not cover interoperability with OpenGL vertex buffers in this tutorial, but they are conceptually similar to Pixel-Buffer-Objects, in that they rely on the same clCreateFromGLBuffer for zero-copy sharing.

Also, plain data transfers from OpenGL to host memory, then to OpenCL and back (including mapping), are the most straightforward method, as they require neither extensions nor actual sharing. These methods allow the most general interoperability, and the copying overhead (though not the extra power consumption) can be hidden with a multi-buffering approach. Refer to the details of general asynchronous transfers to/from OpenGL for further reference ([3]). To be more performance- and power-efficient than a plain memory copy, an OpenCL implementation supporting cl_khr_gl_sharing is required, so we cover the extension in detail first.
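As a rough illustration of the multi-buffering idea mentioned above, the sketch below shows only the index bookkeeping: with several staging buffers, one buffer is filled while the buffer filled on the previous frame is consumed. The BufferRing type and its member names are hypothetical and not part of the tutorial's sample; the actual transfers are left out.

```cpp
#include <cstddef>

// Hypothetical index bookkeeping for multi-buffering (not from the sample):
// with `count` staging buffers, the producer fills one buffer while the
// consumer reads the buffer that was filled on the previous frame.
struct BufferRing {
    std::size_t count;  // number of staging buffers (assumed to be >= 2)
    std::size_t frame;  // number of frames completed so far

    explicit BufferRing(std::size_t n) : count(n), frame(0) {}

    // Buffer to fill during the current frame.
    std::size_t write_index() const { return frame % count; }
    // Buffer filled during the previous frame, safe to consume now.
    std::size_t read_index() const { return (frame + count - 1) % count; }
    // Call once per frame after both sides have been issued.
    void advance() { ++frame; }
};
```

With two buffers the write and read indices simply swap every frame, so the copy into one buffer can overlap with the use of the other.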

OpenCL-OpenGL Interoperability API

OpenCL-OpenGL interoperability is implemented as a Khronos extension to OpenCL [2]. The name of this extension is cl_khr_gl_sharing, and it should be listed among the supported extensions queried for the platform and the device.
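A minimal sketch of such a check, assuming the extension list has already been read into a string (for example with clGetDeviceInfo and CL_DEVICE_EXTENSIONS). The list is space-separated, so the sketch compares whole tokens instead of doing a substring search. The helper name has_extension is hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `name` appears as a whole token
// in a space-separated OpenCL extension string.
bool has_extension(const std::string& extension_list, const std::string& name)
{
    std::istringstream tokens(extension_list);
    std::string token;
    while (tokens >> token) {
        if (token == name) {
            return true;
        }
    }
    return false;
}
```

Comparing whole tokens avoids a false positive when one extension name is a prefix of another.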

The interfaces (API) for this extension are provided in the cl_gl.h header.

Just as with other vendor extension APIs, the clGetExtensionFunctionAddressForPlatform function should be used to obtain pointers to the actual functions of the specific OpenCL platform. Following is the full list of functions for the extension and a short description for each:

Function	Description
clGetGLContextInfoKHR	Queries the devices associated with the OpenGL context
clCreateFromGLBuffer	Creates an OpenCL buffer object from the OpenGL buffer object
clCreateFromGLTexture	Creates an OpenCL image object from the OpenGL texture object
clCreateFromGLRenderbuffer	Creates an OpenCL 2D image object from the OpenGL render buffer
clGetGLObjectInfo	Queries type and name for the OpenGL object used to create the OpenCL memory object
clGetGLTextureInfo	Gets additional information (target and mipmap level) about the GL texture object associated with a memory object
clEnqueueAcquireGLObjects	Acquires OpenCL memory objects from OpenGL
clEnqueueReleaseGLObjects	Releases OpenCL memory objects to OpenGL

Creating the Interoperability-Capable OpenCL Context

Since OpenCL memory objects are created from OpenGL objects, we need a shared OpenCL-OpenGL context. To avoid implicit copying via the host, it is important to create the OpenCL context for the same underlying device that drives the OpenGL context.

Via clGetGLContextInfoKHR, you can enumerate all OpenCL devices capable of sharing with the OpenGL context you want to interoperate with. First, you need to set up additional context parameters:

// Additional attributes for OpenCL context creation
// that associate an OpenGL context with the OpenCL context
cl_context_properties props[] =
{
    // OpenCL platform
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    // OpenGL context
    CL_GL_CONTEXT_KHR,   (cl_context_properties)hRC,
    // HDC used to create the OpenGL context
    CL_WGL_HDC_KHR,      (cl_context_properties)hDC,
    0
};

For the fastest interoperability, it is important to select a device currently associated with the given OpenGL context (CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR flag for the clGetGLContextInfoKHR).

Notice that there may be many more OpenCL devices that can share data with the OpenGL context (e.g. via copies). In the code example below, we use clGetGLContextInfoKHR with CL_DEVICES_FOR_GL_CONTEXT_KHR to enumerate all interoperable devices:


size_t bytes = 0;
// Notice that extension functions are accessed via pointers
// initialized with clGetExtensionFunctionAddressForPlatform.

// querying how many bytes we need to read
clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR, 0, NULL, &bytes);
// allocating the memory
size_t devNum = bytes / sizeof(cl_device_id);
std::vector<cl_device_id> devs(devNum);
// reading the info (pass the vector's storage, not the vector itself)
clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR, bytes, devs.data(), NULL);
// looping over all devices
for (size_t i = 0; i < devNum; i++)
{
    // enumerating the devices for the type, names, CL_DEVICE_EXTENSIONS, etc
    clGetDeviceInfo(devs[i], CL_DEVICE_TYPE, …);
    …
    clGetDeviceInfo(devs[i], CL_DEVICE_EXTENSIONS, …);
    …
}

This tutorial supports selecting the platform and device to run (refer to the section on controlling the sample), so after identifying the available devices (that support cl_khr_gl_sharing) for the requested platform, the “OpenGL-shared” OpenCL context is created along with a queue for the selected device:

context = clCreateContext(props,1,&device,0,0,NULL);
queue = clCreateCommandQueue(context,device,CL_QUEUE_PROFILING_ENABLE,NULL);

Once the shared OpenCL-OpenGL context is established, sharing can be implemented in one of the three basic ways described below:

Method 1: Texture Sharing via clCreateFromGLTexture

This method is the most efficient, allowing direct OpenGL-texture to OpenCL-image sharing with the help of clCreateFromGLTexture. This approach also allows modifying the content of the texture in place. Following are the required steps:

Creating OpenGL 2D texture the regular way:

// generate the texture ID
glGenTextures(1, &texture);
// bind the texture
glBindTexture(GL_TEXTURE_2D, texture);
// regular sampler params
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
// need to set GL_NEAREST
// (not GL_NEAREST_MIPMAP_* which would cause CL_INVALID_GL_OBJECT later)
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
// specify texture dimensions, format etc
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, g_width, g_height, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);

Creating the OpenCL image corresponding to the texture (once):
```
cl_mem mem = clCreateFromGLTexture(context, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0,texture,NULL);
```
Note the CL_MEM_WRITE_ONLY flag, which allows fast discarding of the data. Use CL_MEM_READ_WRITE if your kernel needs to read the current texture content. In that case, also remove the __write_only qualifier for the image access in the kernel.

Acquiring the ownership via clEnqueueAcquireGLObjects:

glFinish();
clEnqueueAcquireGLObjects(queue, 1, &mem, 0, 0, NULL);

Executing the OpenCL kernel that alters the image:

clSetKernelArg(kernel_image, 0, sizeof(mem), &mem);…
clEnqueueNDRangeKernel(queue, kernel_image, …);

Releasing the ownership via clEnqueueReleaseGLObjects:

clEnqueueReleaseGLObjects(queue, 1, &mem, 0, 0, NULL);
// wait until the kernel and the release have completed before OpenGL uses the texture
clFinish(queue);

In this approach, the relation between OpenCL image and OpenGL texture is specified just once, and only acquire/release calls are used to pass ownership of the resources between APIs, thus providing zero-copy goodness for the actual texture data. However, the number of texture formats and usages for which this sharing is possible is rather limited.

Method 2: Texture Sharing via Pixel-Buffer-Object and clCreateFromGLBuffer

This method relies on the intermediate (staging) Pixel-Buffer-Object (PBO). It is less efficient than the direct sharing previously discussed, due to the final copying of the PBO bits to the texture. However, it allows for the potential of sharing textures of more formats (refer to the formats that glTexSubImage2D supports), unlike the limited set supported by clCreateFromGLTexture.

The code sequence is as follows:

Create OpenGL 2D texture in the standard way (refer to the first step in the previous section).

Create the OpenGL Pixel-Buffer-Object (once):

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_ARRAY_BUFFER, pbo);
//specifying the buffer size
glBufferData(GL_ARRAY_BUFFER, width * height * sizeof(cl_uchar4), …);

Create the OpenCL buffer corresponding to the Pixel-Buffer-Object (once):
```
mem = clCreateFromGLBuffer(g_context, CL_MEM_WRITE_ONLY, pbo,  NULL);
```
Note the CL_MEM_WRITE_ONLY flag: the buffer contains no original texture data, so there is no point in making it readable.
Acquire the ownership via clEnqueueAcquireGLObjects, execute the kernel that updates the buffer content, and release the ownership via clEnqueueReleaseGLObjects. These steps are the same as steps 3-5 in the previous section (only the kernel itself is slightly different, as it operates on an OpenCL buffer and not an image).

Finally, streaming data from the PBO to the texture is required:

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, texture);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

Method 3: Texture Sharing with glMapBuffer

This method is similar to the previous (PBO-based) approach, but rather than relying on clCreateFromGLBuffer to share the PBO with OpenCL, it performs a straightforward mapping of the PBO to host memory, so it does not require any extension support.

Create texture and PBO (refer to the first steps in the previous section).
Map the PBO bits to the host memory:
void* p = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_READ_WRITE);
The resulting pointer is wrapped with the OpenCL buffer using the CL_MEM_USE_HOST_PTR to avoid copy (the buffer is created/destroyed in each frame):
```
cl_mem mem =
clCreateBuffer(g_context,CL_MEM_WRITE_ONLY|CL_MEM_USE_HOST_PTR,
width*height*sizeof(cl_uchar4), p, NULL);
```

Call the OpenCL kernel that alters the buffer content, changing values to green:

clSetKernelArg(kernel_buffer, 0, sizeof(mem), &mem);…
clEnqueueNDRangeKernel(queue, kernel_buffer, …);
clFinish(queue);

Upon the kernel completion, the buffer bits are copied back with glUnmapBuffer. Note that calls to clEnqueueMapBuffer/clEnqueueUnmapMemObject are needed to make sure the actual buffer memory behind the mapped pointer is updated, as discrete GPUs might mirror a buffer instead and perform an actual update (copy) on clEnqueueUnmapMemObject only.
Release the OpenCL buffer:
```
clReleaseMemObject(mem);
```
The rest of the procedure is the same as the last (fifth) step of the previous approach: PBO bits are copied to the texture with glTexSubImage2D.

Synchronization

To maintain data integrity, the application is responsible for synchronizing access to shared OpenCL/OpenGL objects. Specifically, prior to calling clEnqueueAcquireGLObjects, the application must ensure that any pending OpenGL operations that access the objects specified in mem_objects have completed in all OpenGL contexts. Note that no synchronization methods, other than glFinish, are portable between OpenGL implementations at this time, so this tutorial relies on this mechanism.

Similarly, prior to executing subsequent OpenGL commands that reference the released objects (after clEnqueueReleaseGLObjects), the application is responsible for ensuring that any pending OpenCL operations that access the objects have completed. The most portable way is to call clWaitForEvents with the event object returned by clEnqueueReleaseGLObjects, or to call clFinish as we do in this tutorial.

There is a more fine-grained mechanism provided by the cl_khr_gl_event extension, which allows creating OpenCL event objects from OpenGL fence objects. A fence can be placed in the OpenGL command stream, and the OpenCL command queue can then wait for completion of that fence. The complementary GL_ARB_cl_event extension in OpenGL provides a way of creating an OpenGL sync object from an OpenCL event.

More importantly, support for cl_khr_gl_event guarantees that the OpenCL implementation will ensure that any pending OpenGL operations are complete for the OpenGL context upon calling clEnqueueAcquireGLObjects in the OpenCL context. Similarly, clEnqueueReleaseGLObjects guarantees that OpenCL is done with the objects, so no explicit clFinish is required. This is referred to as “implicit synchronization.”

This tutorial checks for the extension support and omits calls to clFinish()/glFinish() if cl_khr_gl_event support is present for the selected device.

Notes on Textures, Formats, and Targets

It is important to understand that there is a limit to the number of OpenGL texture formats and targets that are possible to share via OpenCL images. You can find the full information on this topic on the Khronos web site: https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateFromGLTexture.html

There is a well-defined list of OpenGL texture usages (“targets”) that can be shared:

An OpenCL (1D, 2D, 3D) image object can be created from the regular OpenGL (1D, 2D, 3D) texture.
A 2D OpenCL image can be created from single face of an OpenGL cubemap texture.
An OpenCL 1D or 2D image array can be created from an OpenGL 1D or 2D texture array.

Refer to Table 9.4 in The OpenCL Extension Specification, Version 1.2 [2], which lists the GL texture internal formats and the corresponding OpenCL image formats. These are mostly the important four-channel formats, such as GL_RGBA8 or GL_RGBA32F, for which the mapping is guaranteed.

Textures created with other OpenGL internal formats may also have a mapping to a CL image format. So, if such mappings exist (they are implementation-specific), the clCreateFromGLTexture succeeds; otherwise it fails with CL_INVALID_IMAGE_FORMAT_DESCRIPTOR.

Note that single-channel textures are generally not supported for sharing.

Also note that OpenGL depth (depth-stencil) buffer sharing is covered by the separate cl_khr_depth_images and cl_khr_gl_depth_images extensions, which we do not cover in this tutorial.

Finally, multi-sampled (MSAA) textures, both color and depth, are covered by cl_khr_gl_msaa_sharing (which requires cl_khr_gl_depth_images support from the implementation).

Conclusion

It is important to follow the right approach for OpenCL-OpenGL interoperability, taking into account limitations such as texture usages and formats and caveats like synchronization between the APIs. Also, the approach to interoperability (direct sharing, PBO, or plain mapping) might be different depending on the target OpenCL device.

Still, when utilizing the right approach from those discussed in this tutorial, you can achieve the best of both worlds: graphics and computing. For Intel HD Graphics and Iris Pro Graphics OpenCL devices, the direct sharing discussed in detail in this tutorial is ultimately the right way to go.

Resources

[1] Details of the cl_khr_gl_sharing extension: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/cl_khr_gl_sharing.html

[2] The OpenCL Extension Specification, Version: 1.2, Document Revision: 19, Khronos OpenCL Working Group: http://www.khronos.org/registry/cl/specs/opencl-1.2-extensions.pdf

[3] OpenGL Insights (Asynchronous Buffer Transfers Chapter): http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-AsynchronousBufferTransfers.pdf

Intel, the Intel logo, and VTune are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

OpenCL related question

May 5, 2014, 2:59 pm

Hello! I have restarted some of my experiments on the Intel Haswell processor and some of them stopped working, namely the ones related to examples meant to be executed for the GPU.

My main question: if I use clCreateBuffer with the flag CL_MEM_USE_HOST_PTR, what does that flag actually do?

1. I create the array on the host, and the GPU uses the same address as the array allocated on the host. If this is what happens, then theoretically I can compute something on the GPU, and at kernel termination (a synchronization point) the data written by the GPU should be inside the region allocated on the host?

2. I create the array on the host, but a second memory location is allocated when clCreateBuffer is invoked, even though I am using CL_MEM_USE_HOST_PTR, and the data is copied between the host and the device allocations.

The reason I am asking is mainly the L3 cache, which is shared between the CPU and the GPU. If both operate on the same address, the data in the L3 cache can be seen by both, so cooperation between the two may be possible.

Thanks

A Basic Sample of OpenCL™ Host Code

May 5, 2014, 3:28 pm

Introduction

Programmers new to OpenCL may find that the most complete documentation—the Khronos OpenCL specification—is not the best guide to getting started programming for OpenCL. The specification describes many options and alternatives, which can be confusing at first. Other code samples written for OpenCL may focus on the device kernel code, or may use host code written with an OpenCL “wrapper” library that hides the details of how to directly use the standard OpenCL host API.

The SampleOCL sample code described in this document aims to provide a clear and readable representation of the basic elements of a non-trivial OpenCL program. The focus of the sample code is the OpenCL™ code for the host (CPU), rather than kernel coding or performance. It demonstrates the basics of constructing a fairly simple OpenCL application, using the OpenCL v1.2 specification.[1] Similarly, this document focuses on the structure of the host code and the OpenCL APIs used by that code.

About the Sample

This code sample uses the same OpenCL kernel as the ToneMapping sample (see reference below), previously published for the Intel® SDK for OpenCL Applications [2]. This simple kernel attempts to make visible features of an image that would otherwise be too dark or too bright to distinguish. It reads pixels from an input buffer, modifies them, and writes them out to the same position of an output buffer. For more information on how this kernel works, see the document High Dynamic Range Tone Mapping Post Processing Effect [3].

OpenCL Implementation

The SampleOCL sample application is not intended to "wrap" OpenCL; that is, it does not try to replace OpenCL APIs with a "higher level" API. Generally I have found that such wrappers are not much simpler or cleaner than using the OpenCL API directly and, while the original programmer of a wrapper may find the wrapper easier to work with, the wrapper will impose a burden on any OpenCL programmer called upon to maintain the code. The OpenCL APIs are a standard. To wrap them in a proprietary "improved" API is to throw away much of the value of having that standard.

With that said, the SampleOCL implementation does make use of a few C++ classes and associated methods to separate the use of OpenCL APIs into a few groups. The application is broken into two main classes to separate generic application elements from elements related to OpenCL. The former is C_SampleOCLApp; the latter is C_OCL.

Limitations

This sample code focuses only on the basics of an OpenCL application, as specified in version 1.2. It provides no insight into differences from other revisions, though most of the information should still be relevant for newer revisions.

The host-side application code of this sample is not intended to demonstrate optimal performance. For simplicity, several obvious optimizations have been left out.

OpenCL Application Basics

What follows is a fairly complete explanation of a basic OpenCL application program sequence. The emphasis is on "basic," as many options are not covered. More information can be found in the OpenCL specification [1].

An OpenCL application should be able to execute with substantial parallelism on a variety of processing devices such as multi-core CPUs with SIMD instruction support and Graphics Processing Units (GPUs), either discrete or integrated into a CPU. As such, one of the first things an OpenCL application must do is determine what devices are available and select the device or devices that will be used. A single platform might support more than one type of device, such as a CPU that has an integrated GPU, and more than one platform may be available to the application.

Each platform available to the OpenCL application will have an associated name, vendor, etc. That information can be obtained using the OpenCL APIs clGetPlatformIDs() followed by clGetPlatformInfo() and can be used to select a desired platform.

Once a platform is selected, a context must be created to encompass the OpenCL devices, memory, and other resources needed by an application. With the selected platform ID and a specification of the desired device type (CPU, GPU, etc.), an application can call clCreateContextFromType() and then use clGetContextInfo() to obtain the device IDs. Or, it can directly request deviceIDs for a given platform ID and device type using clGetDeviceIDs() and then use clCreateContext() with those device IDs to create the context. This sample uses the latter approach to create a context with a single GPU device.

With the desired device ID(s) and context, one can create a command queue for each device to be used using clCreateCommandQueue(). The command queue is used to "enqueue" operations from the host application to the GPU or other device, for example, requesting that a particular OpenCL kernel be executed. This sample code creates a single command queue for a GPU device.

With that initialization work done, a common next step is to create one or more OpenCL program objects using clCreateProgramWithSource(). Once the program is created, it must still be built (essentially compiled and linked) using clBuildProgram(). That API allows setting options to the compiler, such as #defines to modify the program source.
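As a small illustration of the compiler options mentioned above, the sketch below assembles a "-D name=value" options string of the kind that can be passed as the options argument of clBuildProgram(). The helper name make_build_options and the macro names in the usage note are hypothetical; the sample may build its options differently.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: turns (name, value) pairs into "-D name=value"
// options for the `options` argument of clBuildProgram().
// An empty value produces a plain "-D name".
std::string make_build_options(
    const std::vector<std::pair<std::string, std::string>>& defines)
{
    std::string options;
    for (const auto& define : defines) {
        if (!options.empty()) {
            options += ' ';
        }
        options += "-D " + define.first;
        if (!define.second.empty()) {
            options += "=" + define.second;
        }
    }
    return options;
}
```

For example, passing the pairs ("BLOCK_SIZE", "16") and ("USE_FAST_PATH", "") yields the string "-D BLOCK_SIZE=16 -D USE_FAST_PATH", which the kernel source can then test with #ifdef or use as a constant.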

Finally, with the program created and built, kernel objects that link to the functions in that program can be created, calling clCreateKernel() for each kernel function name.

Prior to executing an OpenCL kernel, set up the data to be processed, usually by creating linear memory buffers using the clCreateBuffer() API function. (An Image is another OpenCL memory object type not used in this sample.) The clCreateBuffer function can allocate memory for a buffer of a given size and optionally copy data from host memory, or it can set up the buffer to directly use space already allocated by the host code. (The latter can avoid copying from host memory to the OpenCL buffer, which is a common performance optimization.)

Typically, a kernel will need at least one input and one output buffer as well as other arguments. The arguments need to be set up one at a time for the kernel to access at execution time by calling the clSetKernelArg() function for each argument. The function is called with a number indexing a particular argument in the kernel function argument list. The first argument is passed with index 0, the second with index 1, etc.

With the arguments set, call the function clEnqueueNDRangeKernel() with the kernel object and a command queue to request that the kernel be executed. Once the kernel is enqueued, the host code can do other things, or it can wait for the kernel (and everything previously enqueued) to finish by calling the clFinish() function. This sample calls clFinish() because it times the total kernel execution (including any enqueue overhead) in a loop, and it needs to wait for each execution to finish before recording that iteration's contribution to the average time.
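The timing loop described above can be sketched roughly as follows. This is not the sample's actual code: the helper name average_time_ms is hypothetical, and the callable stands in for "enqueue the kernel, then call clFinish()" so that the sketch runs without an OpenCL runtime.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `run_once` `iterations` times and returns the
// average wall-clock time per run in milliseconds. In an OpenCL host
// program, `run_once` would enqueue the kernel and then call clFinish(),
// so each measurement covers a completed execution plus enqueue overhead.
double average_time_ms(const std::function<void()>& run_once, int iterations)
{
    if (iterations <= 0) {
        return 0.0;
    }
    using timer_clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int i = 0; i < iterations; ++i) {
        const auto start = timer_clock::now();
        run_once();  // must block until the work is really done
        const auto stop = timer_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / iterations;
}
```

Without the blocking wait inside the callable, the loop would only measure how long it takes to enqueue the work, not how long the kernel runs.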

That's the bulk of what goes into an OpenCL application. There are some clean-up operations, such as calling clReleaseKernel, clReleaseMemObject, clReleaseProgram, etc. These are included in the sample, even though OpenCL should automatically release all resources when the program exits. A more complex program might wish to release resources in a timely fashion to avoid memory leaks.

A final word of caution: while this sample does not use "events," they can be very useful for more complex applications that wish, for example, to overlap CPU and GPU processing. However, it is very important to note that any clEnqueueXXXXX() function (where "XXXXX" is replaced with the name for one of many possible functions) that is passed a pointer to an event will allocate an event, and the calling application code is then responsible for calling clReleaseEvent() with a pointer to that event at some point. If this is not done, the program will experience a memory leak as events accumulate.

A common mistake is to use the clCreateUserEvent() function to allocate an event to pass to any clEnqueueXXXXX() function, thinking that OpenCL will signal that event when it completes. OpenCL will not use that event, and the clEnqueueXXXXX() call will return a new event, overwriting the contents of the event variable passed by pointer. This is an easy way to create a memory leak. User events have a different purpose, beyond the scope of this sample. For more details on OpenCL events, please see the OpenCL specification.[1]

Project Structure

_tmain ( argc, argv ) - Main entry point function in the Main.cpp file.

Creates an instance of class C_SampleOCLApp.

Calls C_SampleOCLApp::Run() to start application.

That's all it does! See the C_MainApp and C_SampleOCLApp classes below for more details.

class C_MainApp - Generic OpenCL application super-class in the C_MainApp.h file.

On construction, creates instance of OpenCL class C_OCL.

Defines a generic application "run" function:

Run()

Run() is a good starting point for reading the code to understand how an OpenCL application is initialized, run, and cleaned up.

Run() calls virtual functions (see below) in a simple representative application sequence.

Declares virtual functions to be defined by C_SampleOCLApp (below):

AppParseArgs ()	Parse command line options
AppUsage ()	Print usage instructions
AppSetup ()	Application set up, including OpenCL set up
AppRun ()	Application specific operations
AppCleanup ()	Application clean up
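
The relationship between Run() and the virtual functions can be sketched as follows. This is only an illustration, not the sample's actual code: the function signatures, the return types, and the order of calls inside Run() are assumptions, and DemoApp is a hypothetical stand-in for C_SampleOCLApp that simply records which hooks were called.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical skeleton of a generic application base class.
class C_MainApp {
public:
    virtual ~C_MainApp() = default;

    // Assumed call order: parse arguments, set up, run, clean up.
    int Run(int argc, char** argv) {
        assert(argv != nullptr);
        if (!AppParseArgs(argc, argv)) {
            AppUsage();
            return 1;
        }
        if (!AppSetup()) {
            AppCleanup();
            return 2;
        }
        AppRun();
        AppCleanup();
        return 0;
    }

protected:
    virtual bool AppParseArgs(int argc, char** argv) = 0;
    virtual void AppUsage() = 0;
    virtual bool AppSetup() = 0;
    virtual void AppRun() = 0;
    virtual void AppCleanup() = 0;
};

// Hypothetical derived class that records which hooks ran.
class DemoApp : public C_MainApp {
public:
    std::vector<std::string> calls;

protected:
    bool AppParseArgs(int argc, char**) override {
        calls.push_back("parse");
        return argc >= 1;
    }
    void AppUsage() override { calls.push_back("usage"); }
    bool AppSetup() override { calls.push_back("setup"); return true; }
    void AppRun() override { calls.push_back("run"); }
    void AppCleanup() override { calls.push_back("cleanup"); }
};
```

Keeping the sequence in the base class means a derived application only fills in the five hooks and never has to repeat the set up and clean up ordering.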

class C_SampleOCLApp - Derived from C_MainApp, defines functions specific to this sample.

Implements application specific code for the C_MainApp virtual functions in the SampleApp.cpp and SampleApp.h files. (See class C_MainApp (above) for the virtual functions implemented.)

Defines "ToneMap" OpenCL kernel setup and execution functions in the ToneMap_OCL.cpp file:

RunOclToneMap ()	Does one-time set up for ToneMap, then calls ToneMap().
ToneMap ()	Sets ToneMapping kernel arguments and executes the kernel.

class C_OCL - Most of the host side OpenCL API set up and clean up code.

On construction, initializes OpenCL. On destruction, cleans up after OpenCL.

Defines OpenCL service functions in the C_OCL.cpp and C_OCL.h files:

Start ()	Sets up OpenCL device for Intel® Iris™ graphics for proper platform.
ReadAllPlatforms ()	Obtains all available OpenCL platforms, saving their names.
MatchPlatformName ()	Helper function, chooses a platform by name.
GetDeviceType ()	Helper function - is device type GPU or CPU?
CheckExtension ()	Checks if a particular OpenCL extension is supported on the current device.
ReadExtensions ()	Obtains a string listing all OpenCL extensions for the current device.
SetCurrentDeviceType ()	Sets desired device type and creates OpenCL context and command queue.
CreateProgramFromFile ()	Loads a file containing OpenCL kernels, creates an OpenCL program, and builds it.
ReadSourceFile ()	Reads OpenCL kernel source file into a string, ready to build as a program.
CreateKernelFromProgram ()	Creates an OpenCL kernel from a previously built program.
GetDeviceInfo ()	Two helper functions to get device specific information: one allocates memory to receive and return results, the other returns results via a pointer to memory provided by the caller.
ClearAllPlatforms ()	Releases everything associated with a previously selected platform.
ClearAllPrograms ()	Releases all currently existing OpenCL programs.
ClearAllKernels ()	Releases all currently existing OpenCL kernels.

OpenCL APIs Used

clBuildProgram	clCreateBuffer
clCreateCommandQueue	clCreateContext
clCreateKernel	clCreateProgramWithSource
clEnqueueMapBuffer	clEnqueueNDRangeKernel
clEnqueueUnmapMemObject	clFinish
clGetDeviceIDs	clGetDeviceInfo
clGetPlatformIDs	clGetPlatformInfo
clReleaseCommandQueue	clReleaseContext
clReleaseDevice
clReleaseKernel	clReleaseMemObject
clReleaseProgram	clSetKernelArg

Controlling the Sample

This sample is run from a Microsoft Windows* command line console. It supports the following command line, in which all parameters are optional:

ToneMapping.exe [ ? | --h ] [-c|-g] [-list] [-p "platformName"] [-i "full image filename"]

? OR --h	Prints this help message
-c	Runs OpenCL on CPU
-g	Runs OpenCL on GPU - default
-list	Displays list of platform name strings
-p "platformName"	Supplies a platform name (in quotes if it has spaces) to check for and use.
-i "full image filename"	Supplies an image file name (in quotes if it has spaces) to process.
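
One way to handle the options above is sketched below. It is not the sample's AppParseArgs() implementation: the Options structure, its field names, and the ParseArgs function are hypothetical, and the sketch assumes the shell has already stripped the quotes so that each quoted value arrives as a single argument.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical options structure; field names are illustrative only.
struct Options {
    bool showHelp = false;
    bool useGpu = true;         // -g is the documented default
    bool listPlatforms = false;
    std::string platformName;   // value of -p
    std::string imageFile;      // value of -i
    bool valid = true;          // false on an unknown or incomplete option
};

// Parses the arguments (without the program name) into Options.
Options ParseArgs(const std::vector<std::string>& args) {
    Options opt;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& a = args[i];
        if (a == "?" || a == "--h") {
            opt.showHelp = true;
        } else if (a == "-c") {
            opt.useGpu = false;
        } else if (a == "-g") {
            opt.useGpu = true;
        } else if (a == "-list") {
            opt.listPlatforms = true;
        } else if ((a == "-p" || a == "-i") && i + 1 < args.size()) {
            // Consume the following argument as this option's value.
            (a == "-p" ? opt.platformName : opt.imageFile) = args[++i];
        } else {
            opt.valid = false;
        }
    }
    return opt;
}
```

A caller would typically print the usage text when showHelp is set or valid is false, and otherwise continue with set up.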

References

OpenCL Specifications from Khronos.org:
http://www.khronos.org/registry/cl/
Intel® SDK for OpenCL™ Applications:
http://software.intel.com/en-us/vcsource/tools/opencl-sdk
High Dynamic Range Tone Mapping Post Processing Effect:
http://software.intel.com/en-us/vcsource/samples/hdr-tone-mapping

Intel, the Intel logo, and Iris are trademarks of Intel Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission from Khronos.
Copyright © 2014, Intel Corporation. All rights reserved.

↧

Adobe Photoshop* with Open Standards Enhanced by Intel® HD and Iris™ Graphics

May 7, 2014, 2:48 pm

Introduction

In this article, we’ll explore the strides Adobe engineers have made over the last few years to enhance Photoshop using OpenGL* and OpenCL™ to increase hardware utilization. The Adobe team selected two features—Blur and Smart Sharpen—as the focus of its recent efforts because both provided less than optimal processing speed and quality. We will discuss the results on those features in this paper.

Open Standards Languages, Adobe Photoshop, and Intel

OpenGL (Open Graphics Library) has been used for years to boost rendering performance and resource utilization across platforms. OpenCL (Open Computing Language) is a royalty-free open standard for portable, general-purpose parallel programming across central, graphics, and other processors. The OpenCL standard complements the existing OpenGL APIs by adding general computation routines to OpenGL’s use of graphics processors for rendering work. OpenCL gives developers a uniform programming environment to execute code on all processing resources within a given platform.

Adobe Photoshop is a leading graphics industry application used for graphics editing and manipulation. A heavy processing and memory resource user, Photoshop is a powerful application that requires the greatest performance possible from a computer. To aid in its graphics processing capabilities, Adobe has used open standards for many generations of Photoshop. It has now been updated to take advantage of OpenCL, which allows for an even higher level of performance.

Intel provided testing for this report. Intel also makes available an array of tools and SDKs to accelerate development for visual computing. These include the Developer’s Guides for Intel® Processors (with a guide for each generation of Intel graphics processors; the latest, the Graphics Developer's Guide for 4th Generation Intel® Core™ Processor Graphics, now includes OpenGL), the Intel® SDK for OpenCL Applications, and a web site dedicated to visual computing.

For a powerful application, like Photoshop, using open standards like OpenGL and OpenCL can improve performance and allow the processing routines to be used across platforms and with other Adobe products more easily.

Photoshop’s Use of OpenGL and OpenCL Standards

A few years ago, in Adobe’s Creative Suite* 4 release of Photoshop (Photoshop CS4), Adobe developers focused their OpenGL efforts on enhancing the Canvas and 3D interactions. They implemented Smooth Zoom, Panning, Canvas Rotate, Pixel Grid, and 3D Axis/Lights using the OpenGL API to improve performance. To turn these features on, open Preferences, enable “Use Graphics Processor,” select the “Advanced Settings” button, select “Advanced” from the Drawing Mode dropdown menu, and enable the checkboxes for “Use Graphics Processor to Accelerate Computation” and “Use OpenCL.” Refer to Figure 1 and Figure 2 for recommended settings in the Photoshop user interface.

Figure 1: Graphics Settings in Photoshop* Preferences Dialog

Figure 2: Select "Advanced" in the Drawing Mode Dropdown

With Photoshop CS5, the developers used OpenGL to speed up the user interface (UI) and to add the Pixel Bender plug-in. The specific UI features targeted were the Scrubby Zoom, HUD Color Picker, Color Sampling Ring, Repousse, and 3D Overlays. With these new features, the OpenGL modes were expanded to encompass basic, normal, and advanced methods.

Then in Photoshop CS6, the team enhanced content editing with standards-based features from both OpenGL and OpenCL. Developers added Adaptive Wide Angle, Liquify, Oil Paint, Puppet Warp, Lighting Effects, and 3D Enhancements using OpenGL. The OpenCL standard was used to add a Field/Iris Tilt/Shift function as well as the Blur function.

Today, with Adobe’s latest release of their Creative Suite, Creative Cloud enhances the Photoshop application even further with Smart Sharpen and the selectable modes of “Use Graphics Processor” and “Use OpenCL.”

Intel HD Graphics Devices that support OpenGL and OpenCL

The following Intel graphics devices should be enabled for OpenGL and OpenCL graphics acceleration by default:

4th generation Intel® Core™ processors
- Intel® HD Graphics 4200, 4400, 4600, P4600, P4700
- Intel® HD Graphics 5000
- Intel® Iris™ Graphics 5100
- Intel® Iris™ Pro Graphics 5200
3rd generation Intel® Core™ processors
- Intel® HD Graphics 4000, P4000
2nd generation Intel® Core™ processors
- Intel® HD Graphics 3000, P3000

Developing Specific Photoshop Enhancements

How did the developers achieve these results? Photoshop uses a system of layers to apply many of its advanced features to an image. Figure 3 illustrates this concept by showing three layers of a very simple image as the WHITE space. The EFFECTS, the RED and BLUE layers in the example, are separate layers in the stack. Effects that can be applied in layers include Sharpen, Blur, and even Red-eye Removal. Effects can be applied to the final image without changing the original file. These layers can also be ordered, stacked, and combined to provide a blended effect as the right side of the figure shows, e.g., combining red and blue to get PURPLE. Additionally, there are special layers called “mask layers” that allow you to restrict an effect’s application to a select region of the image.

Figure 3: Separate (Left) and Combined (Right) Layers

The image-combining aspect also applies to Photoshop textures. A Photoshop texture refers to the content of a layer that is then blended or overlaid with other layers to “texture” an image. Notice how the bricks from the image on the lower left below provide a texture to the cloak of the statue in the middle image when applied with a small percentage of opacity.

Figure 4: Example of Using Texture in Photoshop*

Adobe used the OpenGL API to enhance the Photoshop texture/layer effect. In the OpenGL Advanced mode, the GL_RGBA16F_ARB internal format (an OpenGL constant for 16-bit floating-point RGBA data) enables the Shader Tool to apply checkerboard compositing, tone mapping, and color matching.

Sharpen the Focus

Sports photographers use the Sharpen step extensively, as just a few increments on the controls can make a world of difference in the impact of an action shot. Figure 5 demonstrates how detail can be improved by applying a Sharpen step in photo editing. Notice that the text, stars, and even brush stroke detail are a little more pronounced in the “After” image on the right.

Figure 5: Original Image (Left) and After Sharpen (Right)

However, the Sharpen step can create some unwanted side effects. Details that are relatively sharp or insignificant in the original image can develop artifacts akin to boosting the image “noise,” producing a halo effect. For this release, Adobe renovated the legacy Smart Sharpen by introducing a patch-based “denoise and sharpen” algorithm implemented using the OpenCL standard. The new patch-based algorithm produces a sharpened image without any halo effect. Furthermore, the denoise step suppresses the “noise gets boosted when you sharpen” issue. Compare the images in Figure 6, Figure 7, and Figure 8 below. With this result, Adobe looks forward to using these standards to further improve all the sharpen tools.

Figure 6: Original text image

Figure 7: Image after applying legacy Smart Sharpen, with halo effect

Figure 8: Image after applying patch-based Smart Sharpen

Bringing Blur into Focus

Another editing tool function that was improved by using OpenCL was the Blur tool. There are numerous ways to emphasize and de-emphasize a portion of an image. Many qualities can be influenced at the time the photograph is captured, but a photograph’s impact can be improved, or at least changed, with some post-processing. Red-eye removal and cropping are very common post-processing tasks, but image sharpness can also be improved. Image area-specific sharpness can have a large impact.

Figure 9: Mona Lisa (Portrait of Lisa Gherardini, wife of Francesco del Giocondo) by Leonardo da Vinci

In his masterpiece Mona Lisa, Leonardo da Vinci (Figure 9) [8] emphasized his subject in the portrait by placing her image in the foreground with a somewhat out-of-focus rural landscape behind her. By blurring the background, he helped the viewer focus on his subject, which was the most important part of the painting, not the background. Following is an example of how blurring can improve a more modern image. Finding the photograph’s main theme can be difficult, so blurring helps refine the image’s theme. I took the sharpened image used in Figure 6 and further emphasized the central star in the image by applying a Blur tool, which results in the clarity of the star in the image on the right below (Figure 10). Blurring changed our perspective of the image so that the star is obviously the focus. Suffice it to say, there are lots of ways to blur an image (on purpose), and this is one of the newest ways.

Figure 10: Original Image (left), Sharpen (center), then Blur (right) added for emphasis

Adding Blur to an image is much like using a color crayon, except the mouse is the crayon and the color is the Blur feature. To apply the Blur, you select the Blur tool, size the tool “brush” (a cursor that can be sized from 1 screen pixel to the size of the entire image) to match the size of the image region you wish to blur, and then click-and-hold the mouse while “coloring or scrubbing” over the area of the image you wish to blur. The more coloring action performed, the more blur applied to the image region.

Intel Increased Graphics Performance

The exercise of adding an OpenCL Blur tool was somewhat challenging and provided a few good learning opportunities. The team wanted to balance the workload by utilizing all the possible resources on the host platform. Cross-platform support, including Windows* and Mac* OSs, was also critical. These factors led them to OpenCL. The team ended up taking an existing blur tool in Photoshop and porting it from optimized CPU code to OpenCL kernels.

Adobe wanted to reduce the complexity of Blur, which before OpenCL required multiple command queues running on multiple threads. The team had also experienced resource limitations, such as timeouts and out-of-memory failures, on lower-end video subsystems. Finally, the impact of platform variations, like driver stacks and the use of various compilers, would be reduced by going to the OpenCL-based solution. OpenCL allowed them to reduce these challenges by making it possible to cache a portion of an image in local memory and break images down into smaller 2k by 2k blocks for the graphics processor. These improvements resulted in higher reliability and 4 to 8 times faster filter times by utilizing the GPU.
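
The idea of breaking an image into blocks can be illustrated with a short sketch. This is not Adobe's code: the Tile structure and SplitIntoTiles function are hypothetical, "2k" is assumed to mean 2048 pixels, and the sketch ignores the overlap between neighboring blocks that a real blur filter would need around each block's edges.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one block of the image.
struct Tile { int x, y, width, height; };

// Splits a width x height image into blocks of at most tileSize x tileSize.
// Blocks on the right and bottom edges may be smaller.
std::vector<Tile> SplitIntoTiles(int width, int height, int tileSize = 2048) {
    assert(width >= 0 && height >= 0 && tileSize > 0);
    std::vector<Tile> tiles;
    for (int y = 0; y < height; y += tileSize) {
        for (int x = 0; x < width; x += tileSize) {
            tiles.push_back({x, y,
                             std::min(tileSize, width - x),
                             std::min(tileSize, height - y)});
        }
    }
    return tiles;
}
```

Each block can then be submitted to the graphics processor as a separate, bounded piece of work, which is one way to avoid the timeouts and out-of-memory failures mentioned above.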

Intel’s testing shows performance gains on the following Photoshop actions, as the available execution units and memory bandwidth have increased over the generations of Intel HD graphics as shown in the chart below (Figure 11).[1]

Figure 11: Photoshop* with OpenGL* Performance over Generations of Intel HD Graphics

When tests are run with the OpenGL or OpenCL features enabled and disabled, the routines add a significant performance improvement to both the Liquify filter and the Field Blur tool, as the graphs below show (Figure 12). Liquify and Blur processing times are normalized to 1, with GPU acceleration off and on, using Intel® HD Graphics 4600.[1]

Figure 12: Photoshop* tool performance with Standards On/Off

The effort was well worth it. When tested with OpenCL hardware acceleration ON versus OFF, the new Blur function showed a 3x faster processing time, depending on the workload and the size of the radius being blurred (Figure 13).[1]

Figure 13: Sample blur execution time (in seconds) compared

General application processing accounts for the majority of time in smaller workloads, so larger workloads show a better improvement in processing time. When OpenCL acceleration is enabled, both the CPU and the GPU are efficiently utilized, with many of the multithreaded app’s cores submitting work to the graphics processor. The graphics processing unit is utilized at a minimum 70% rate while memory utilization is 10%-36% depending on the graphics subsystem. Finally, there were no stalls in the graphics pipeline making for an improved user experience.

Summary

Adding standards-based processing routines has allowed Adobe to continue its tradition of enhancing Photoshop performance with each release. With the addition of OpenCL-based acceleration on an Intel HD Graphics device, the user experiences an improvement in performance and gains an ability to evaluate the blur filter almost real-time across the entire image. This complete image experience was not possible before OpenCL was added to these filters, and this change makes creating compelling images much more efficient. Prior to the addition of OpenCL, only a small fraction of the image could be previewed before applying the effect. Similarly, users can review their smart sharpening filter as they make adjustments full screen and get to the desired final image faster. Now with OpenCL, Photoshop is clearly better.

[1]Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

References and Resources

About the Authors

Tim Duncan is an Intel Engineer and is described by friends as “Mr. Gidget-Gadget.” Currently helping developers integrate technology into solutions, Tim has decades of industry experience, from chip manufacturing to systems integration. Find him on the Intel® Developer Zone as Tim Duncan (Intel)

Murali Madhanagopal is a member of the Intel Visual & Parallel Computing Group, where he is a Lead Graphics Architect. He received his M.S. in Computer Information Systems from Texas A&M University, College Station and has a bachelor’s degree in Computer Engineering from the College of Engineering Guindy, Anna University, India. Madhanagopal is responsible for developing and executing Intel’s workstation processor graphics strategy that enables ISV’s software to run efficiently on current and future processor graphics-based platforms. He is actively engaged in application and system optimization activities with industry-leading CAD, CAE, and Digital Content Creation ISVs and OEMs.

Intel, the Intel logo, and Iris are trademarks of Intel Corporation in the U.S. and/or other countries.
OpenCL and the OpenCL logo are trademarks of Apple Inc and are used by permission by Khronos.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

↧

OpenCL Kernel Fails only on 4th Generation Processors

May 8, 2014, 12:12 pm

Hello,

I am working on creating an N-Body gravitational simulation using the Intel OpenCL SDK. I have attached our kernel. This kernel fails to execute only on systems with Haswell (4th generation) Intel processors. This has been a highly frustrating issue that is blocking the release of our infrastructure product, which relies heavily on CPU-only Intel OpenCL. Are you aware of this issue?

Following are the configurations we have tried in order to root-cause the issue, without success:

1) Intel Core i7-4700MQ + Nvidia GPU + OpenCL SDK 2013 = Fails

2) Intel Core i7-4700MQ + OpenCL SDK 2013 = Fails

3) Intel Core i5 (3rd Generation) + OpenCL SDK 2013 = Pass

4) Intel Core i7-4770 + OpenCL SDK 2013 = Fails

5) Intel Core i5 (2nd Generation) + Nvidia GPU + OpenCL SDK 2013 = Pass

6) Intel Core i5 (3rd Generation) (notebook) + no GPU + OpenCL SDK 2013 = Pass

We have tried all three OSes: Windows 7, Windows 8, and Windows 8.1. The kernel still seems to fail only on systems with Haswell processors. Please help resolve this issue.

Another note: We are using only the CPU for execution (not Intel HD Graphics).

Attachment	Size
Download NBody_EC.txt	7.14 KB

↧

Tutorial: Getting Started with OpenCL™ on Android* OS

May 12, 2014, 12:21 am

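Download Code Sample

Download Documentation

The OpenCL™ Basic Tutorial for Android* OS provides guidelines on using OpenCL in Android applications. The tutorial is an interactive image processing Android application.
The main focus of the tutorial is to show how to use OpenCL in an Android application, how to start writing OpenCL code, and how to link to the OpenCL runtime. The tutorial shows a typical sequence of OpenCL API calls and the general workflow to get a simple image processing kernel running with animation on an OpenCL device. Advanced topics, such as efficient data sharing or Android OpenCL performance BKMs, are not covered in this tutorial.

Complexity Level: Novice
Development Platform: Any
Target Platform: Android* OS 4.2.2 and later
Target Device: GPU devices on Android* devices

Note:
The Android emulator does not provide GPU OpenCL device support. To run the sample on the Android emulator, change the target OpenCL device type from GPU to CPU by replacing CL_DEVICE_TYPE_GPU with CL_DEVICE_TYPE_CPU at line 451 of the jni/step.cpp file.

For more information about the sample, refer to the sample User's Guide inside the sample package.

* OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.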

↧

Would you provide sample(Tutorial) for OpenCL 2.0 which include SVM and Pipes,etc.

May 14, 2014, 1:09 am

Would you provide a sample (tutorial) for OpenCL 2.0 that includes SVM and pipes? The Intel® SDK for OpenCL™ Applications 2014 supports OpenCL 2.0, and we have begun using it on an Intel machine, but it does not provide a sample. Thank you!

↧

Platform/Device Capabilities Viewer Sample

May 15, 2014, 10:14 am

Download for Windows Download for Linux Download Documentation

Description

This sample demonstrates how to enumerate available OpenCL™ platforms and devices. It also lists important capabilities per device.

Supported Devices: CPU, Intel® Xeon Phi™ coprocessor
Supported Operating Systems: Windows* and Linux* OS
Complexity Level: Novice

Refer to the sample release notes for information on system requirements.
For more information about the sample refer to the sample User's Guide inside the sample package.

ZIP sample package contains sample files for Windows* OS
TAR.GZ sample package contains sample files for Linux* OS.

* OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

↧

work group with 1 work item using ~100 float8 vectors?

May 15, 2014, 10:35 am

Will the Intel HD Graphics OpenCL compiler support "1 work item" work groups that are float8 vectors?

Example:

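__kernel
__attribute__((vec_type_hint(float8), reqd_work_group_size(1, 1, 1)))
// "float8_kernel" is a placeholder name; "__kernel" is a reserved qualifier and cannot name a function
void float8_kernel(__global const float8* const restrict in, __global float8* const restrict out)
{
  ... // lots and lots of float8 vector registers
}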

The goal is to occupy as many float8 registers as possible in a single work item. The kernel I'm designing can benefit from float4 swizzling ops, and I'm assuming float8 is the narrowest width that matches the 128x8 register file found in the Ivy Bridge and Haswell architectures.

Questions:

Does the HD Graphics OpenCL compiler support allocating as many as 128 registers on Ivy Bridge and Haswell?
If this isn't supported, why not?
If this isn't supported, then what is the best work group size to acquire the most possible registers per work item?

Thanks, I'm very impressed with the HD Graphics architecture. The EUs and sub-slices appear to have *huge* amounts of resources compared to other low power GPUs.

↧

General Matrix Multiply Sample

May 15, 2014, 11:20 am

Download for Windows Download for Linux Download Documentation

Description

The General Matrix Multiply (GEMM) sample demonstrates how to efficiently utilize an OpenCL™ device to perform a general matrix multiply operation on two dense square matrices. The primary target devices for this sample are devices with cache memory: Intel® Xeon Phi™ and Intel® Architecture CPU devices.

The sample:

Optimizes the trivial nested-loop matrix multiplication to utilize the memory cache more efficiently
Supports single-precision and double-precision data types
Demonstrates how to use different storage methods for matrices
Demonstrates how to utilize the automatic vectorizer efficiently and avoid gathers
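
The cache-blocking idea behind the first point can be sketched on the host in plain C++. This is not the sample's OpenCL kernel: the GemmTiled function, the row-major storage, and the default block size of 32 are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A * B for square n x n row-major matrices,
// working on tile x tile blocks so that each block stays in cache.
void GemmTiled(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, std::size_t n, std::size_t tile = 32) {
    assert(A.size() == n * n && B.size() == n * n && tile > 0);
    C.assign(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t kk = 0; kk < n; kk += tile) {
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t iEnd = std::min(ii + tile, n);
                const std::size_t kEnd = std::min(kk + tile, n);
                const std::size_t jEnd = std::min(jj + tile, n);
                for (std::size_t i = ii; i < iEnd; ++i) {
                    for (std::size_t k = kk; k < kEnd; ++k) {
                        const float a = A[i * n + k];
                        // The innermost loop reads B and writes C with unit
                        // stride, so a vectorizer needs no gather operations.
                        for (std::size_t j = jj; j < jEnd; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The loop order inside each block (i, k, j) is a deliberate choice: with row-major storage it keeps the innermost accesses contiguous, which relates to the sample's point about avoiding gathers.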

Supported Devices: CPU, Intel® Xeon Phi™ coprocessor
Supported OS: Windows* and Linux* OS
Complexity Level: Intermediate

Refer to the sample release notes for information on system requirements.
For more information about the sample refer to the sample User's Guide inside the sample package.

ZIP sample package contains sample files for Windows* OS
TAR.GZ sample package contains sample files for Linux* OS.

* OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

↧

Monte Carlo Method for Stock Options Pricing Sample

May 15, 2014, 11:28 am

Download for Windows Download for Linux Download Documentation

Description

This sample demonstrates an implementation of Monte Carlo simulation for European stock option pricing. The algorithm is an OpenCL™ kernel that unifies three major algorithm components:

Mersenne twister - generation of uniformly distributed pseudorandom numbers
Box-Muller transform - generation of normally distributed random numbers
Option price calculation using the Black-Scholes stock pricing model

The exact Black-Scholes model is implemented as native code on the host for comparison with the results generated with Monte Carlo.
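
The three components and the closed-form comparison can be sketched in host-side C++. This is a minimal illustration under stated assumptions, not the sample's kernel: it prices a European call only, uses std::mt19937 from the standard library as the Mersenne twister, and the function names, the fixed seed, and the parameters in the usage note are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Standard normal cumulative distribution function.
double NormCdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Closed-form Black-Scholes price of a European call option.
double BlackScholesCall(double S, double K, double r, double sigma, double T) {
    const double d1 = (std::log(S / K) + (r + 0.5 * sigma * sigma) * T) /
                      (sigma * std::sqrt(T));
    const double d2 = d1 - sigma * std::sqrt(T);
    return S * NormCdf(d1) - K * std::exp(-r * T) * NormCdf(d2);
}

// Box-Muller transform: u1 in (0, 1] and u2 in [0, 1) give one
// standard normal sample.
double BoxMuller(double u1, double u2) {
    const double pi = 3.14159265358979323846;
    return std::sqrt(-2.0 * std::log(u1)) * std::cos(2.0 * pi * u2);
}

// Monte Carlo estimate of the same call price.
double MonteCarloCall(double S, double K, double r, double sigma, double T,
                      int paths, unsigned seed = 12345u) {
    assert(paths > 0);
    std::mt19937 gen(seed);  // Mersenne twister
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double sum = 0.0;
    for (int i = 0; i < paths; ++i) {
        const double u1 = 1.0 - uni(gen);  // shift to (0, 1] so log is finite
        const double u2 = uni(gen);
        const double z = BoxMuller(u1, u2);
        // Terminal stock price under geometric Brownian motion.
        const double ST = S * std::exp((r - 0.5 * sigma * sigma) * T +
                                       sigma * std::sqrt(T) * z);
        sum += std::max(ST - K, 0.0);
    }
    return std::exp(-r * T) * sum / paths;  // discounted average payoff
}
```

For example, with S = 100, K = 100, r = 0.05, sigma = 0.2 and T = 1, the closed-form price is about 10.45, and the Monte Carlo estimate approaches it as the number of paths grows.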

Supported Devices: CPU, Intel® Xeon Phi™ coprocessor
Supported OS: Windows* and Linux* OS
Complexity Level: Intermediate

Refer to the sample release notes for information on system requirements.
For more information about the sample refer to the sample User's Guide inside the sample package.

ZIP sample package contains sample files for Windows* OS
TAR.GZ sample package contains sample files for Linux* OS.

* OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

↧

Median Filter

May 15, 2014, 6:13 pm

Download Code Sample Download Documentation

Features / Description

The sample demonstrates how to implement an efficient median filter with the OpenCL™ standard. This implementation relies on auto-vectorization performed by the Intel® SDK for OpenCL Applications compiler. The kernel code minimizes the number of color buffer accesses, removes synchronization points, and uses data-level parallelism.

This sample demonstrates a CPU-optimized implementation of 2D image median filtering, showing how to:

Implement calculation kernels using OpenCL C99
Parallelize the kernels by running several work-groups in parallel
Organize host-device data exchange with final image storage on the hard drive.
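
As a reference point for what the kernel computes, a plain 3x3 median filter can be sketched in C++. This is not the sample's optimized kernel: the Median3x3 function is hypothetical, it handles a single 8-bit channel rather than a color buffer, it assumes a 3x3 window with replicated edges, and it uses std::nth_element instead of a vectorizable sorting scheme.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 median filter on a single-channel, row-major image.
// Border pixels use clamped (replicated-edge) coordinates.
std::vector<std::uint8_t> Median3x3(const std::vector<std::uint8_t>& src,
                                    int width, int height) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::array<std::uint8_t, 9> window;
            int n = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    window[n++] = src[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            // Only the middle element of the nine needs to be in place.
            std::nth_element(window.begin(), window.begin() + 4, window.end());
            dst[static_cast<std::size_t>(y) * width + x] = window[4];
        }
    }
    return dst;
}
```

Each output pixel depends only on the input image, so the pixels can be computed independently, which is what lets an OpenCL version run many work-groups in parallel without synchronization.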

Supported Devices: CPU, Intel Processor Graphics, Intel® Xeon Phi™ coprocessor
Supported OS: Windows* OS
Complexity Level: Novice

Refer to the sample release notes for information on system requirements.
For more information about the sample refer to the sample User's Guide inside the sample package.

* OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

↧

OpenCL™ Technology and Intel® Media SDK Interoperability

May 15, 2014, 6:15 pm

Download Code Sample Download Documentation

Features / Description

The Intel® Media SDK Interoperability sample demonstrates how to use Intel® Media SDK and Intel® SDK for OpenCL™ Applications together for efficient video decoding and fast post-processing.

The sample demonstrates the Intel® Media SDK pipeline combined with post-processing filters in the OpenCL technology, showing how to:

Integrate processing with Intel® SDK for OpenCL Applications into the Intel® Media SDK pipeline and benefit from hardware-accelerated video decoding (if available) in that pipeline
Organize efficient sharing between Intel® Media SDK frames and OpenCL images by using the cl_khr_dx9_media_sharing extension
Implement simple video effects in OpenCL

Supported Devices: Intel® Processor Graphics
Supported OS: Windows* OS
Complexity Level: Advanced

Refer to the sample release notes for information on system requirements.
For more information about the sample refer to the sample User's Guide inside the sample package.

* OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

↧

Cross-Device NBody Simulation Sample

May 15, 2014, 6:28 pm

Download Code Sample Download Documentation

Features / Description

The NBody Simulation Sample features a load balancing approach to compute NBody simulation across both CPU and Intel® Processor Graphics.

This sample illustrates the basic principles of working simultaneously with OpenCL™ devices on both the CPU and Intel® Processor Graphics. The source code is accompanied by a graphical visualization of the job distribution between the devices. Running OpenCL code on both the CPU and Intel Processor Graphics not only yields the combined performance of both devices, but also largely improves application power performance on Ultrabook™ devices.
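
One simple way to balance work between two devices is to split the bodies in proportion to each device's measured throughput. The sketch below shows that idea only; it is not the sample's algorithm, and the Split structure, the Rebalance function, and the use of the previous frame's timings are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical split of the work items between the two devices.
struct Split {
    std::size_t cpuItems;
    std::size_t gpuItems;
};

// Chooses the next split so that both devices would finish at about the
// same time, given how long each took for its share in the previous frame.
Split Rebalance(Split prev, double cpuSeconds, double gpuSeconds) {
    assert(cpuSeconds > 0.0 && gpuSeconds > 0.0);
    const std::size_t total = prev.cpuItems + prev.gpuItems;
    // Throughput in items per second. A device that had no work reports
    // zero throughput and is given no work by this simple rule.
    const double cpuRate = prev.cpuItems / cpuSeconds;
    const double gpuRate = prev.gpuItems / gpuSeconds;
    if (cpuRate + gpuRate <= 0.0) return prev;
    std::size_t cpu = static_cast<std::size_t>(
        total * cpuRate / (cpuRate + gpuRate) + 0.5);
    if (cpu > total) cpu = total;
    return {cpu, total - cpu};
}
```

For instance, if each device processed 500 bodies and the CPU took twice as long as the graphics device, the next frame would give the CPU 333 bodies and the graphics device 667.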

Supported Devices: CPU, Intel® Processor Graphics
Supported OS: Windows* OS
Complexity Level: Intermediate

Refer to the sample release notes for information on system requirements.
For more information about the sample, refer to the User's Guide inside the sample package.


3D Fluid Simulation Using OpenCL™ Technology

May 15, 2014, 7:16 pm

Features / Description

The sample demonstrates a shallow water solver implemented with the OpenCL™ technology. The Shallow Water sample relies on the flux splitting method to solve the approximated Navier-Stokes equations. The algorithm operates on 2D maps of velocity and height and calculates updated maps for the next time step. The updated maps are rendered with pixel and vertex shaders (Microsoft DirectX* 10).

It also demonstrates a CPU-optimized implementation of the shallow water fluid effects and shows how to perform the following:

Implement calculation kernels using OpenCL C99
Parallelize these kernels by running several work-groups in parallel
Organize host-device data exchange
Visualize results using pixel and vertex shaders (Microsoft DirectX* 10)

Supported Devices: CPU, Intel® Processor Graphics
Supported OS: Windows* OS
Complexity Level: Advanced

Refer to the sample release notes for information on system requirements.
For more information about the sample, refer to the User's Guide inside the sample package.


HDR Rendering with God Rays Using OpenCL™ Technology

May 15, 2014, 7:16 pm

Features / Description

This sample demonstrates a CPU-optimized implementation of the God Rays effect, showing how to:

Implement calculation kernels using OpenCL™ C99
Parallelize the kernels by running several work-groups in parallel
Organize data exchange between the host and the OpenCL device

Supported Devices: CPU, Intel® Processor Graphics
Supported OS: Windows* OS
Complexity Level: Intermediate

Refer to the sample release notes for information on system requirements.
For more information about the sample, refer to the User's Guide inside the sample package.


HDR Tone Mapping for Post Processing Using OpenCL™ Technology

May 15, 2014, 11:34 pm

Features / Description

The Tone Mapping sample demonstrates how to implement high dynamic range (HDR) rendering with a tone mapping effect using the OpenCL™ technology.
It also demonstrates a CPU-optimized implementation of the tone mapping effect, showing how to:

Implement calculation kernels using OpenCL C99
Parallelize the kernels by running several work-groups in parallel
Organize data exchange between the host and the OpenCL device
Store the final image on the hard drive
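
As a rough illustration of what a tone mapping step computes per pixel, here is a minimal host-side sketch. It assumes Reinhard's global operator followed by a 2.2 display gamma, which is one common choice and not necessarily the operator used in the sample. The function names and the `exposure` parameter are hypothetical, and input values are assumed to be finite.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Maps one linear HDR channel value to an 8-bit display value.
// Assumptions of this sketch: Reinhard's operator x / (1 + x),
// a simple exposure multiplier, and a display gamma of 2.2.
std::uint8_t tone_map_channel(float hdr, float exposure) {
    const float scaled = std::max(hdr, 0.0f) * exposure;      // clamp negatives
    const float compressed = scaled / (1.0f + scaled);        // result in [0, 1)
    const float encoded = std::pow(compressed, 1.0f / 2.2f);  // gamma encode
    return static_cast<std::uint8_t>(std::lround(encoded * 255.0f));
}

// Applies the operator to every channel value of an image buffer.
// Each element is independent of the others, which is why this kind of
// loop maps naturally onto one OpenCL work-item per pixel.
std::vector<std::uint8_t> tone_map(const std::vector<float>& hdr,
                                   float exposure) {
    std::vector<std::uint8_t> ldr(hdr.size());
    for (std::size_t i = 0; i < hdr.size(); ++i) {
        ldr[i] = tone_map_channel(hdr[i], exposure);
    }
    return ldr;
}
```

Because the operator compresses rather than clips, very bright input values approach 255 without overflowing, while dark values keep their relative ordering.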

Supported Devices: CPU, Intel® Processor Graphics, Intel® Xeon Phi™ coprocessor
Supported OS: Windows* OS
Complexity Level: Novice

Refer to the sample release notes for information on system requirements.
For more information about the sample, refer to the User's Guide inside the sample package.


Bitonic Sorting

May 15, 2014, 11:34 pm

Features / Description

The Bitonic Sorting sample demonstrates how to implement an efficient sorting routine with the OpenCL™ technology. The implementation:

Operates on an arbitrary input array of integer values
Utilizes properties of bitonic sequences and the principles of sorting networks
Enables efficient SIMD-style parallelism through OpenCL vector data types
Fits modern CPUs
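
The sorting-network structure behind these points can be sketched in scalar C++. This is a minimal illustration, not the sample's kernel: it assumes the input length is a power of two and does not use vector data types. The function name `bitonic_sort` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sorts `data` in ascending order with a bitonic sorting network.
// Assumption of this sketch: data.size() is zero or a power of two.
// For any other length the function returns false and leaves the data
// untouched.
bool bitonic_sort(std::vector<std::int32_t>& data) {
    const std::size_t n = data.size();
    if ((n & (n - 1)) != 0) {
        return false;  // length is not a power of two
    }
    // k is the size of the bitonic sequences being merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j is the distance between the two elements of each comparison.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            // Every iteration of this loop is independent of the others,
            // so on a device each index i could be handled in parallel.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) {
                    continue;  // each pair is handled once, by its lower index
                }
                const bool ascending = (i & k) == 0;
                const bool out_of_order = ascending
                                              ? data[i] > data[partner]
                                              : data[i] < data[partner];
                if (out_of_order) {
                    std::swap(data[i], data[partner]);
                }
            }
        }
    }
    return true;
}
```

The sequence of compared index pairs depends only on the array length, never on the values being sorted. This fixed pattern is what makes sorting networks a good fit for SIMD-style execution.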

Supported Devices: CPU, Intel® Processor Graphics, Intel® Xeon Phi™ coprocessor
Supported OS: Windows* OS
Complexity Level: Intermediate

Refer to the sample release notes for information on system requirements.
For more information about the sample, refer to the User's Guide inside the sample package.
