OpenCL :: Example with Image Blurring Algorithm, Questions and Issues

Lorentzos Roussos 2023.04.25 18:03

Hello

Sharing a benchmark for OpenCL that processes an image and makes it look as if it is constantly coming in and out of focus.

The normal execution is 3 times faster than OpenCL, which means the OpenCL version can be improved. (3x faster with min 0 and max 3; I explain the parameters further down.)

I don't need to use this code, I just wanted to immerse myself in OpenCL a bit, and I believe that such a simple algorithm could be helpful for starting out with OpenCL.

I'm not judging the native OpenCL library. I know there are probably issues in my approach to the solution that end up making the OpenCL blur 3 times slower.

Now, if you see an omission, blatant or not, or a fundamental error in my approach, let the forum know of course.

Note: this does not use any built-in GPU functionality for the image. The "standard" everyday poor man's solution for blurring is also available in its MQL5 form so you can compare.

With that said, once you have had a laugh reading my OpenCL C source code, please also share your insights publicly; they will be helpful to me and to readers alike. 😊

Anyway, here is what the algorithm does:

You have an input at the top where you can select which mode of execution to run:

standard - runs the MQL5 code, no OpenCL
cpu - runs the OpenCL code on the CPU
gpu - runs the OpenCL code on the GPU

Below that is a parameter for the BMP file to load and play with. It's an image of a pizza 🤏, I'll attach that too.

Then you have blurPulseMin and blurPulseMax. What are those?

Suppose you set the min to 0 and the max to 10.

That will create a "blur pulse" that constantly oscillates between 0 and 10. At each step, if bp is the current blur pulse, it takes a region of ((bp)*2)^2 around each pixel, mixes the color data of the neighboring pixels in that region into the final blurred pixel, composes the blurred photo, and updates the display.
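The per-pixel step described above can be sketched in plain C roughly as follows. This is a minimal sketch, not the code from the attached file. It assumes 32-bit 0xAARRGGBB pixels, a plain average as the "mix", and a window running from x-bp to x+bp that includes the center pixel, so it covers (2*bp+1)^2 pixels; the attached code may count the region differently. The names blur_pixel and blur_image are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Average one pixel over the square window [x-bp, x+bp] x [y-bp, y+bp],
   clipped at the image borders.
   Assumption: pixels are packed 32-bit 0xAARRGGBB values. */
uint32_t blur_pixel(const uint32_t *src, int width, int height,
                    int x, int y, int bp)
{
    uint64_t a = 0, r = 0, g = 0, b = 0, count = 0;

    for (int ny = y - bp; ny <= y + bp; ny++) {
        if (ny < 0 || ny >= height) continue;      /* skip rows outside the image */
        for (int nx = x - bp; nx <= x + bp; nx++) {
            if (nx < 0 || nx >= width) continue;   /* skip columns outside the image */
            uint32_t p = src[(size_t)ny * (size_t)width + (size_t)nx];
            a += (p >> 24) & 0xFF;
            r += (p >> 16) & 0xFF;
            g += (p >> 8) & 0xFF;
            b += p & 0xFF;
            count++;
        }
    }

    /* count >= 1 as long as (x, y) itself lies inside the image */
    return (uint32_t)(((a / count) << 24) | ((r / count) << 16) |
                      ((g / count) << 8) | (b / count));
}

/* Blur the whole image; src and dst must not overlap. */
void blur_image(const uint32_t *src, uint32_t *dst,
                int width, int height, int bp)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[(size_t)y * (size_t)width + (size_t)x] =
                blur_pixel(src, width, height, x, y, bp);
}
```

Each output pixel depends only on the source image, which is why the same loop body maps naturally onto one OpenCL work-item per pixel.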

This is my 2nd day with OpenCL and my 1st with C, but I think I more or less got the structure of things. So if you have any questions about the code in the source file, ask; if I can't answer, there are many others who can.

Here is a visual of the test. The interval measured, in milliseconds, is for a full cycle of the "blurPulse" going from min to max and back to min.
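One way to picture a full cycle is as the list of blur values it visits. The sketch below is an assumption about the stepping (steps of 1, both endpoints included), not taken from the attached file, and blur_pulse_cycle is a made-up name.

```c
#include <stddef.h>

/* Writes one full cycle min, min+1, ..., max, max-1, ..., min into out[]
   and returns how many values were written.
   Returns 0 if the range is invalid or the buffer is too small. */
size_t blur_pulse_cycle(int min, int max, int *out, size_t capacity)
{
    if (min < 0 || max < min) return 0;

    size_t needed = 2 * (size_t)(max - min) + 1;
    if (capacity < needed) return 0;

    size_t n = 0;
    for (int bp = min; bp <= max; bp++) out[n++] = bp;      /* going out of focus */
    for (int bp = max - 1; bp >= min; bp--) out[n++] = bp;  /* coming back into focus */
    return n;
}
```

Under these assumptions a cycle with min 0 and max 10 blurs the image 21 times, and the measured interval is the total time for all of them.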

That was the benchmark attempt.

The gif below runs in standard mode and takes 28 seconds for a round trip with min 0 and max 10.

Cheers, I apologize if I made you hungry. 😎 🍺 🍕

Files:

pizzas_turn_to_bmp.png 1904 kb

imageBlurTest.mq5 18 kb

Alain Verleyen 2023.04.25 18:17 #1

So you finally found a way to post a gif without moderator intervention.

Lorentzos Roussos 2023.04.25 18:21 #2

Alain Verleyen #:

So you finally found a way to post a gif without moderator intervention.

🤣 🤣

Lorentzos Roussos 2023.04.25 20:34 #3

I found some issues myself:

the output buffer size was x*y*8 bytes while it should be x*y*4 bytes
but the real speed gain, the first one for now, came after following @William Roeder's suggestion on another thread about the ambiguity in the calculation of the power function

Now the OpenCL+GPU is 5 times faster than the default! So thank you, William. I don't understand why, but I assume it has to move memory around to do the conversions and that slows it down.

The OpenCL+CPU is 2.5 times slower though.
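The buffer size fix mentioned above comes down to using the size of one pixel element. A minimal C sketch, assuming each output pixel is one 32-bit value (4 bytes); output_buffer_bytes is a made-up helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a width x height output image,
   assuming one 32-bit (4-byte) value per pixel. */
size_t output_buffer_bytes(size_t width, size_t height)
{
    return width * height * sizeof(uint32_t);
}
```

Sizing the buffer with an 8-byte element would double the memory allocated and transferred for the same image. Whether that is how the original x*y*8 came about is a guess.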

I'm attaching the updated code.

Files:

imageBlurTest.mq5 18 kb

Lorentzos Roussos 2023.04.26 15:47 #4

A moderator asked me not to post new threads about OpenCL, so this thread may serve as a collection for questions. It is handy as it contains a test which you can run normally or with OpenCL.

The code is far from perfect, and updates will be posted for each "gain" in speed for this particular code. (It may not speed up a neural net's kernel, for instance, but it will show you general approaches etc.)

Now, you will constantly be looking up the device info documentation. Here are the Khronos docs that MQL5 themselves suggest you refer to:

CLGetDeviceInfo: https://registry.khronos.org/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html

And all the docs: https://registry.khronos.org/OpenCL/

Lorentzos Roussos 2023.04.27 15:14 #5

I'd like to know which OpenCL API commands this parameter maps to in this function:

Function: https://www.mql5.com/en/docs/opencl/clexecute

Parameter: const uint& local_work_size[] // Number of tasks in the local group

For instance, if a user has a GPU with 32 warps (NVIDIA) / wavefronts (ATI/AMD), how could this array be adjusted? If you have that information of course.

(For instance, if someone wanted to change the equivalent of this parameter in OpenCL in C++, what would they alter or add?)
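For orientation: in the OpenCL C API, the work-group size is the local_work_size argument of clEnqueueNDRangeKernel, next to global_work_size. That CLExecute forwards its array to that argument is my assumption and is not confirmed by the docs linked here. In OpenCL 1.x the global size in each dimension has to be a multiple of the local size, so host code usually rounds it up. A minimal C sketch of that rounding, with made-up helper names:

```c
#include <stddef.h>

/* Smallest multiple of local_size that is >= work_items.
   Returns 0 when local_size is 0 (invalid input). */
size_t round_up_global_size(size_t work_items, size_t local_size)
{
    if (local_size == 0) return 0;
    size_t groups = (work_items + local_size - 1) / local_size;
    return groups * local_size;
}

/* Global size for a width x height image and a 2D local size. */
void image_global_size(size_t width, size_t height,
                       const size_t local[2], size_t global[2])
{
    global[0] = round_up_global_size(width, local[0]);
    global[1] = round_up_global_size(height, local[1]);
}
```

With a hypothetical local size of {8, 4} (32 work-items per group) and a 1000 x 601 image, the global size becomes {1000, 604}. The kernel then has to skip work-items whose coordinates fall outside the image.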

amrali 2023.04.27 17:09 #6

Lorentzos Roussos #:

I'd like to know which OpenCL API commands this parameter maps to in this function:

Function: https://www.mql5.com/en/docs/opencl/clexecute

Parameter: const uint& local_work_size[] // Number of tasks in the local group

For instance, if a user has a GPU with 32 warps (NVIDIA) / wavefronts (ATI/AMD), how could this array be adjusted? If you have that information of course.

(For instance, if someone wanted to change the equivalent of this parameter in OpenCL in C++, what would they alter or add?)

Local memory is an advanced optimization technique in OpenCL (really, a hard subject). Another optimization technique is kernel vectorization.

With local memory, you have to use sync objects (named barriers, or memory fences) to sync the locally executing threads, and you should test various optimizations, which are not portable between different graphics card manufacturers.

See this article https://www.mql5.com/en/articles/407

Section: 2.7. Transferring the Column of the Second Array to Local Memory

OpenCL: From Naive Towards More Insightful Programming

This article focuses on some optimization capabilities that open up when at least some consideration is given to the underlying hardware on which the OpenCL kernel is executed. The figures obtained are far from being ceiling values but even they suggest that having the existing resources available here and now (OpenCL API as implemented by the developers of the terminal does not allow to control some parameters important for optimization - particularly, the work group size), the performance gain over the host program execution is very substantial.

Lorentzos Roussos 2023.04.27 18:21 #7

amrali #:

Local memory is an advanced optimization technique in OpenCL (really, a hard subject). Another optimization technique is kernel vectorization.

See this article https://www.mql5.com/en/articles/407

Section: 2.7. Transferring the Column of the Second Array to Local Memory

I haven't picked the most suitable example for this, as each pixel can be independent, it seems.

I'm just trying to relate the argument of the function to external OpenCL tutorials to grasp the operation better.

Also, there was no ArgMemLocal at the time this article was written, I think. Shouldn't he/she have used it?

PS: note that I am an utter noob in such matters.

amrali 2023.04.27 18:59 #8

Lorentzos Roussos #:

Also, there was no ArgMemLocal at the time this article was written, I think. Shouldn't he/she have used it?

There is now:

bool  CLSetKernelArgMemLocal(
   int    kernel,           // handle to a kernel of an OpenCL program
   uint   arg_index,        // number of the OpenCL function argument
   ulong  local_mem_size    // buffer size
   );

https://www.mql5.com/en/docs/opencl/clsetkernelargmemlocal
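As an illustration of what local_mem_size could hold for the blur in this thread: if a work-group cached its tile of the image plus a border of bp pixels on every side in local memory, the size in bytes could be computed as below. This is a hypothetical layout, not something from the attached code, and tile_local_bytes is a made-up name. It assumes 4-byte pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of local memory for one work-group tile of local_w x local_h pixels
   plus a border (halo) of bp pixels on each side, assuming 4-byte pixels. */
size_t tile_local_bytes(size_t local_w, size_t local_h, size_t bp)
{
    size_t tile_w = local_w + 2 * bp;
    size_t tile_h = local_h + 2 * bp;
    return tile_w * tile_h * sizeof(uint32_t);
}
```

For an 8 x 4 work-group and bp = 10 this gives 28 * 24 * 4 = 2688 bytes. The result has to fit within the device's local memory, which clGetDeviceInfo reports as CL_DEVICE_LOCAL_MEM_SIZE.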

Lorentzos Roussos 2023.04.27 19:01 #9

amrali #:

There is now:

https://www.mql5.com/en/docs/opencl/clsetkernelargmemlocal

Yeah, I mean, wouldn't he have used it if it had existed back then?

amrali 2023.04.27 19:05 #10

Lorentzos Roussos #:

Yeah, I mean, wouldn't he have used it if it had existed back then?

I did not play with local memory at the time I was testing OpenCL. I found it very complicated, even with other resources on the internet. I only tested vectorized kernels.

I managed to implement a bitonic sort that beats radix sort in speed (by optimizing the kernel algorithm).

But in the end, OpenCL is a non-portable solution for parallel processing. Now it is almost dead!
