OpenCL: Example with Image Blurring Algorithm, Questions and Issues

 

Hello 

I'm sharing a benchmark for OpenCL that processes an image and makes it look as if it is constantly going in and out of focus.

The normal (non-OpenCL) execution is 3 times faster than the OpenCL version, which means the OpenCL version can be improved. (The 3x figure is with min 0 and max 3; I explain the parameters further down.)

I don't need to use this code. I just wanted to immerse myself in OpenCL a bit, and I believe such a simple algorithm could be helpful for starting out with OpenCL.

I'm not judging the native OpenCL library. I know there are probably issues in my approach to the solution that end up making the OpenCL blur 3 times slower.

Now, if you see an omission or a fundamental error in my approach, blatant or not, let the forum know of course.

Note: I'm not using any built-in GPU functionality for the image. The "standard" everyday poor man's solution for blurring is also available in its MQL5 form so you can compare.

With that said, once you are done laughing at my OpenCL C source code, please also share your insights publicly; they will be helpful to me and to readers alike. 😊

Anyway, here is what the algorithm does.

There is an input at the top where you select which mode of execution to run:

  1. standard - runs the MQL5 code, no OpenCL
  2. cpu - runs the OpenCL code on the CPU
  3. gpu - runs the OpenCL code on the GPU

Below that there is a parameter for the bmp file to load and play with. It's an image of a pizza 🤏, I'll attach that too.

Then you have blurPulseMin and blurPulseMax. What are those?

Suppose you set the min to 0 and the max to 10.

That creates a "blur pulse" that oscillates between 0 and 10 constantly. At each step, if bp is the current blur pulse, the code takes a region of ((bp)*2)^2 around each pixel, mixes the color data of the neighboring pixels in that region into the final blurred pixel, composes the blurred photo, and updates the display.
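For readers who want the idea without opening the attachment, here is a minimal plain C sketch of the two pieces described above: a triangle-wave blur pulse and a box blur that averages a square window around each pixel. It is not the attached source. The function names `blur_pulse` and `box_blur`, the 4-bytes-per-pixel layout, the clamped edges, and the symmetric window from -bp to +bp (so (2*bp+1)^2 samples, slightly different from the ((bp)*2)^2 figure above) are all assumptions made for illustration.

```c
#include <stdint.h>

/* Clamp v into the range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical helper: blur pulse for a non-negative step counter.
   Ramps from min up to max, then back down to min (triangle wave). */
int blur_pulse(int step, int min, int max)
{
    int span = max - min;
    if (span <= 0)
        return min;
    int phase = step % (2 * span);
    return phase <= span ? min + phase : max - (phase - span);
}

/* Hypothetical helper: box blur over 32-bit pixels (four 8-bit channels).
   Each output pixel is the per-channel average of the window
   [x-bp, x+bp] x [y-bp, y+bp], with coordinates clamped at the edges. */
void box_blur(const uint32_t *src, uint32_t *dst, int width, int height, int bp)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            uint32_t sum[4] = {0, 0, 0, 0};
            uint32_t count = 0;
            for (int dy = -bp; dy <= bp; dy++) {
                int sy = clamp_int(y + dy, 0, height - 1);
                for (int dx = -bp; dx <= bp; dx++) {
                    int sx = clamp_int(x + dx, 0, width - 1);
                    uint32_t p = src[sy * width + sx];
                    for (int c = 0; c < 4; c++)
                        sum[c] += (p >> (8 * c)) & 0xFFu;
                    count++;
                }
            }
            uint32_t out = 0;
            for (int c = 0; c < 4; c++)
                out |= ((sum[c] / count) & 0xFFu) << (8 * c);
            dst[y * width + x] = out;
        }
    }
}
```

In an OpenCL port, the two outer loops would typically disappear and each work-item would handle one (x, y) pair, which is why the per-pixel work here is written without any dependency on neighboring outputs.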

This is my 2nd day with OpenCL and my 1st with C, but I think I more or less got the structure of things. So if you have any questions about the code in the source file, ask; if I can't answer, there are many others who can.

Here is a visual of the test. The millisecond interval measured is for a full cycle of the "blurPulse" going from min to max and back to min.

That was the benchmark attempt.

The gif below runs in standard mode and takes 28 seconds per round trip with min 0 and max 10.

Cheers, and I apologize if I made you hungry. 😎 🍺 🍕

 

So you finally found a way to post gif without moderator intervention.


 
Alain Verleyen #:

So you finally found a way to post gif without moderator intervention.


🤣 🤣 

 

I found some issues myself:

  1. The output buffer size was x*y*8 bytes when it should be x*y*4 bytes.
  2. The real speed gain, the first one for now, came from following @William Roeder's suggestion on another thread about the ambiguity in the calculation of the power function.

Now the OpenCL+GPU run is 5 times faster than the default, so thank you William! I don't understand why, but I assume it had to move memory around to do the conversions and that slowed it down.

The OpenCL+CPU run is 2.5 times slower, though.

I'm attaching the updated code.
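For the buffer-size fix in item 1, a small sketch in plain C of how the byte count can be derived from the pixel type instead of being typed in by hand. The helper name `image_buffer_bytes` is hypothetical, and it assumes one 32-bit value per pixel, which is what the x*y*4 figure implies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for a width x height image
   stored as one 32-bit value per pixel (4 bytes each). */
size_t image_buffer_bytes(size_t width, size_t height)
{
    return width * height * sizeof(uint32_t);
}
```

Deriving the size from `sizeof` of the element type makes it harder to allocate for an 8-byte type by accident when the kernel only writes 4 bytes per pixel.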


 
A moderator asked me not to post new threads about OpenCL, so this thread may serve as a collection for questions. It is handy because it contains a test you can run normally or with OpenCL.


The code is far from perfect, and updates will be posted for each "gain" in speed for this particular code. (A given change may not speed up a neural net kernel, for instance, but it will show you general approaches.)

You will be looking up the device info documentation constantly, so here are the Khronos docs that MQL5 themselves suggest you refer to:

clGetDeviceInfo: https://registry.khronos.org/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html

version 1.0 docs

version 1.2 docs

version 2.0 docs

version 3.0 docs

And all the docs : https://registry.khronos.org/OpenCL/

 

I'd like to know which OpenCL API commands this parameter maps to in this function.

Function: https://www.mql5.com/en/docs/opencl/clexecute

Parameter: const uint&  local_work_size[]         // Number of tasks in the local group

For instance, if a user has a GPU with 32 warps (NVIDIA) / wavefronts (ATI/AMD), how could this array be adjusted? If you have that information, of course.

(For instance, if someone wanted to change the equivalent of this parameter in OpenCL in C++, what would they alter or add?)
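As a partial pointer for the C/C++ side: in the OpenCL C API, the work-group size is the `local_work_size` argument of `clEnqueueNDRangeKernel`, next to `global_work_size`. In OpenCL 1.x the global size in each dimension must be a multiple of the local size, so a common pattern is to round the global size up. The sketch below shows only that arithmetic in plain C. The helper names are hypothetical, the "32" is read here as the warp width, and whether CLExecute passes its `local_work_size[]` straight through to that call is an assumption.

```c
#include <stddef.h>

/* Round global up to the next multiple of local (local must be > 0). */
size_t round_up_global(size_t global, size_t local)
{
    size_t rem = global % local;
    return rem == 0 ? global : global + (local - rem);
}

/* Hypothetical helper: pick 2D sizes for a width x height image so that
   each work-group is one warp wide and one row tall. */
void pick_work_sizes(size_t width, size_t height, size_t warp,
                     size_t global[2], size_t local[2])
{
    local[0] = warp;
    local[1] = 1;
    global[0] = round_up_global(width, local[0]);
    global[1] = round_up_global(height, local[1]);
}
```

Because the global size may now be larger than the image, the kernel would need a guard such as `if (x >= width || y >= height) return;`. The upper limit for the product of the local sizes can be read with `clGetDeviceInfo` using `CL_DEVICE_MAX_WORK_GROUP_SIZE`.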
 
Lorentzos Roussos #:

I'd like to know which OpenCL API commands this parameter maps to in this function.

Function: https://www.mql5.com/en/docs/opencl/clexecute

Parameter: const uint&  local_work_size[]         // Number of tasks in the local group

For instance, if a user has a GPU with 32 warps (NVIDIA) / wavefronts (ATI/AMD), how could this array be adjusted? If you have that information, of course.

(For instance, if someone wanted to change the equivalent of this parameter in OpenCL in C++, what would they alter or add?)

Local memory is an advanced optimization technique in OpenCL (really, a hard subject). Another optimization technique is kernel vectorization.

With local memory, you have to use synchronization primitives (barriers, or memory fences) to sync the threads executing in the same work-group, and you should test various optimizations, which are not portable between graphics cards from different manufacturers.

See this article https://www.mql5.com/en/articles/407

Section: 2.7. Transferring the Column of the Second Array to Local Memory


OpenCL: From Naive Towards More Insightful Programming
 
amrali #:

Local memory is an advanced optimization technique in OpenCL (really, a hard subject). Another optimization technique is kernel vectorization.

With local memory, you have to use synchronization primitives (barriers, or memory fences) to sync the threads executing in the same work-group, and you should test various optimizations, which are not portable between graphics cards from different manufacturers.

See this article https://www.mql5.com/en/articles/407

Section: 2.7. Transferring the Column of the Second Array to Local Memory


I haven't picked the most suitable example for this, as each pixel can be processed independently, it seems.

I'm just trying to relate the function's argument to external OpenCL tutorials to grasp the operation better.

Also, I think there was no ArgMemLocal at the time this article was written; shouldn't he/she have used it?

PS: note that I am an utter noob in such matters.
 
Lorentzos Roussos #:

Also, I think there was no ArgMemLocal at the time this article was written; shouldn't he/she have used it?

There is now:

bool  CLSetKernelArgMemLocal(
   int    kernel,           // handle to a kernel of an OpenCL program
   uint   arg_index,        // number of the OpenCL function argument
   ulong  local_mem_size    // buffer size
   );

https://www.mql5.com/en/docs/opencl/clsetkernelargmemlocal
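To give a feel for what `local_mem_size` could be for the blur in this thread: if each work-group cached its own tile of pixels plus a border of bp pixels on every side, the byte count would follow from the tile dimensions. This tiling scheme is hypothetical and is not taken from the attached code; the helper name `tile_local_bytes` and the 4-bytes-per-pixel layout are assumptions. A minimal sketch in plain C:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes of local memory for one work-group that
   caches a local_w x local_h tile plus a halo of bp pixels per side,
   stored as one 32-bit value per pixel. */
size_t tile_local_bytes(size_t local_w, size_t local_h, size_t bp)
{
    size_t tile_w = local_w + 2 * bp;
    size_t tile_h = local_h + 2 * bp;
    return tile_w * tile_h * sizeof(uint32_t);
}
```

The result has to fit within the device's local memory, which `clGetDeviceInfo` reports as `CL_DEVICE_LOCAL_MEM_SIZE`. Note how quickly the halo grows with bp: a 16x16 group with bp = 10 needs a 36x36 tile.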

 
amrali #:

There is now:

bool  CLSetKernelArgMemLocal(
   int    kernel,           // handle to a kernel of an OpenCL program
   uint   arg_index,        // number of the OpenCL function argument
   ulong  local_mem_size    // buffer size
   );

https://www.mql5.com/en/docs/opencl/clsetkernelargmemlocal

Yeah, I mean, wouldn't he have used it if it had existed back then?

 
Lorentzos Roussos #:

Yeah, I mean, wouldn't he have used it if it had existed back then?

I did not play with local memory at the time I was testing OpenCL. I found it very complicated, even in other resources on the internet. I only tested vectorized kernels.

I managed to implement a bitonic sort that beats radix sort in speed (by optimizing the kernel algorithm).

But, in the end, OpenCL is a non-portable solution for parallel processing. Now it is almost dead!
