Your first - and mine - silly open cl program 😇

24 April 2023, 22:10
Lorentzos Roussos

Hi there .

First things first , watch the following video :

Good. Now, this is my first OpenCL program, so there may be issues in terminology etc., but the goal is to have the simplest example possible, not only because it's helpful but also because that's all I can do for now 😇.

Let's go 

As you saw in the video we need to create a context with a device and to that context we'll feed our function (kernel) 

If you know C you are in luck. I don't, so keep that in mind.

So , let's start with the basics , setup the foundation and when that works go for the calculations too.

Note: some tutorials show multiple devices assigned to one context. OpenCL in general allows that, but the MQL5 function CLContextCreate selects a single device, so here it's one device per context.

I think you can have multiple concurrent contexts, however (one per device). That is my deduction; I have not tested it and I'm not familiar with it.

First thing we do is create a context , we notice that if the creation is successful the log receives the name of the device that was assigned:

#property copyright "Lorentzos Roussos"
#property link      "https://www.mql5.com/en/users/lorio"
#property version   "1.00"
#include <OpenCL\OpenCL.mqh>

bool busy=false,loaded=false;
int ctx=INVALID_HANDLE;
int OnInit()
  {
  ctx=INVALID_HANDLE;
  busy=false;
  loaded=false;
  EventSetMillisecondTimer(44);
  return(INIT_SUCCEEDED);
  }

void OnTimer(){
  if(!busy){
  busy=true;
  
  if(!loaded){
  EventKillTimer();
  //create a context 
    ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
    if(ctx!=INVALID_HANDLE){
      Print("CL.Context Created");
      }
  }
  
  busy=false;
  }
  }
//+------------------------------------------------------------------+
//| Expert deinitialization function                                 |
//+------------------------------------------------------------------+
void OnDeinit(const int reason)
  {
//---
  if(ctx!=INVALID_HANDLE){CLContextFree(ctx);} 
  }
//+------------------------------------------------------------------+
//| Expert tick function                                             |
//+------------------------------------------------------------------+
void OnTick()
  {
//---
   
  }

This is the general structure of the code we will run; on deinit we free the context if it is valid.

Simple stuff so far.

Also note there are 2 documentation sections for OpenCL: one in the standard library and one for the native functions.

I think the standard library is the older one. I'm not sure, and it isn't clarified anywhere (not that much is on this site), so let's work with the native functions, and if something is missing we know where to look.

So, the native functions include CLProgramCreate, which receives a string of source code written in "OpenCL C" (the OpenCL language).

CLKernelCreate, however, receives the handle of a program and the name of a kernel. That implies we put the source code of all our kernels into the string we pass when creating the program, and then create each kernel by name.

For this test we will only have one "function" (kernel) so it won't matter , for now.

So , a simple function in OpenCL C , kay , and i don't know C , so how about we send an array and a number and the "kernel" multiplies the values in the array by that number . 

But as stressed in the video the best way to squeeze performance out of parallel computations is if we don't execute them in a linear fashion . Think of it as a loop with iterator i that is not actually accessing the index with the iteration variable i but with another "pool" of indexes that is available to all "compute units".

Let's think about it differently so it makes sense . If you wanted to create parallel operations in mql5 , on your own , you would need a common log of what "tasks" are still available and once one of your charts finished a task they would then pick the next one from that common pool of unfinished tasks.

So , the "function" can know which index it has in the pool when it is executed with get_global_id(0) (1st dimension)

__kernel void biscuit(__global double *array,
                               double by,
                                  int total_items){
int idx=get_global_id(0);
if(idx>=total_items){return;}
array[idx]*=by;
}

Okay, so another question that arises: does get_global_id start from 0 or 1? The example in the docs for program creation indicates it's from 1. We will see, though. One way to check is to multiply the array by the index, so let's do that.

So , do i have to typecast ?...hmmm 

So 3 questions that need answers :

  • is the get_global_id(0) (and get_local_id(0) for that matter) starting from 1?
  • if the pool of "leftover" tasks is equal to the amount of tasks we create why do i need to exit if the index goes above total tasks ?
    isn't running over the limit impossible ? 
  • do i need to typecast the int to multiply the array?

Let's then change the function to this and find out :

__kernel void biscuit(__global double *array){
int idx=get_global_id(0);
array[idx]*=idx;
}

And let's create the program with this , expecting 3 errors from the openCL compiler here.

Okay it says program created! awesome

      string biscuit_source_code="__kernel void biscuit(__global double *array){\r\n"
                                 "int idx=get_global_id(0);\r\n"
                                 "array[idx]*=idx;}\r\n";
      string build_log="";
      program_handle=CLProgramCreate(ctx,biscuit_source_code,build_log);
      if(program_handle!=INVALID_HANDLE){
        Print("Program created!");
        }else{
        Alert("Errors\n"+build_log);
        }

Then i create the memory i suppose 

buffer_handle=CLBufferCreate(ctx,1000,CL_MEM_READ_WRITE);

I'm using a lot of handles; a handle handler (and unloader) could be created here, but this is a test.

Then the kernel , we are sending the program handle here so the kernel name must be the same as the one in the source code we sent.

The docs state "The name of the kernel that execution starts from", so any "sub-functions" don't need to be "kerneled"? That's question 4, I guess.

Okay , no errors so far , they will pop up on execution probably . 

This is what i've got till now 

#property copyright "Lorentzos Roussos"
#property link      "https://www.mql5.com/en/users/lorio"
#property version   "1.00"

bool busy=false,loaded=false;
int ctx=INVALID_HANDLE;
int program_handle,kernel_handle,buffer_handle;
int OnInit()
  {
  ctx=INVALID_HANDLE;
  program_handle=INVALID_HANDLE;
  kernel_handle=INVALID_HANDLE;
  buffer_handle=INVALID_HANDLE;
  busy=false;
  loaded=false;
  EventSetMillisecondTimer(44);
  return(INIT_SUCCEEDED);
  }

void OnTimer(){
  if(!busy){
  busy=true;
  
  if(!loaded){
  EventKillTimer();
  //create a context 
    ResetLastError();
    ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
    if(ctx!=INVALID_HANDLE){
      ResetLastError();
      Print("CL.Context Created");
      string biscuit_source_code="__kernel void biscuit(__global double *array){\r\n"
                                 "int idx=get_global_id(0);\r\n"
                                 "array[idx]*=idx;}\r\n";
      string build_log="";
      program_handle=CLProgramCreate(ctx,biscuit_source_code,build_log);
      if(program_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("Program created!");
        buffer_handle=CLBufferCreate(ctx,1000,CL_MEM_READ_WRITE);
        if(buffer_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("buffer created");
        kernel_handle=CLKernelCreate(program_handle,"biscuit");
        if(kernel_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("Kernel created");
        
        }else{Print("Cannot create kernel #"+IntegerToString(GetLastError()));}
        }else{Print("Cannot create buffer #"+IntegerToString(GetLastError()));}
        }else{Alert("Errors #"+IntegerToString(GetLastError())+"\n"+build_log);}
      }else{Print("Cannot create CL.context #"+IntegerToString(GetLastError()));}
  }
  
  busy=false;
  }
  }
//+------------------------------------------------------------------+
//| Expert deinitialization function                                 |
//+------------------------------------------------------------------+
void OnDeinit(const int reason)
  {
//---
  if(kernel_handle!=INVALID_HANDLE){CLKernelFree(kernel_handle);}
  if(buffer_handle!=INVALID_HANDLE){CLBufferFree(buffer_handle);}
  if(program_handle!=INVALID_HANDLE){CLProgramFree(program_handle);}
  if(ctx!=INVALID_HANDLE){CLContextFree(ctx);} 
  
  }
//+------------------------------------------------------------------+
//| Expert tick function                                             |
//+------------------------------------------------------------------+
void OnTick()
  {
//---
   
  }

Now , i have to declare the arguments for the kernel 

There are 3 variants here :

  • CLSetKernelArg
  • CLSetKernelArgMem
  • CLSetKernelArgMemLocal

So, the first one (I assume) is for passing plain values, like a multiplier if we were sending one.

The second one is for global memory and the third one for local memory. The local memory one receives a size argument and not a buffer handle, so it allocates memory locally on the device, in the CUs.

5th question is where is the constant memory or is it handled internally? probably 

So here i have a global array so i'll use the CLSetKernelArgMem for the 1st argument

Okay 

        if(CLSetKernelArgMem(kernel_handle,0,buffer_handle)){
        ResetLastError();
        Print("Memory arg assigned to kernel");
        
        }else{Print("Cannot assign memory arg#"+IntegerToString(GetLastError()));}
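To make the difference between the first two setters concrete, here is a minimal script sketch. It assumes the two-argument form of the kernel (the array plus a multiplier called by). The script layout, the FreeAll helper and the variable names are made up for illustration; this is not the attached file.

```mql5
//--- illustrative helper: frees whichever handles were created
void FreeAll(int krn,int buf,int prog,int ctx)
  {
   if(krn!=INVALID_HANDLE){CLKernelFree(krn);}
   if(buf!=INVALID_HANDLE){CLBufferFree(buf);}
   if(prog!=INVALID_HANDLE){CLProgramFree(prog);}
   if(ctx!=INVALID_HANDLE){CLContextFree(ctx);}
  }

void OnStart()
  {
   //--- assumed kernel: argument 0 is a global array, argument 1 a plain double
   string src="__kernel void biscuit(__global double *array,double by){\r\n"
              "int idx=get_global_id(0);\r\n"
              "array[idx]*=by;}\r\n";
   string build_log="";
   int prog=INVALID_HANDLE,krn=INVALID_HANDLE,buf=INVALID_HANDLE;
   ResetLastError();
   int ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
   if(ctx!=INVALID_HANDLE){prog=CLProgramCreate(ctx,src,build_log);}
   if(prog!=INVALID_HANDLE){krn=CLKernelCreate(prog,"biscuit");}
   //--- buffer size is given in bytes
   if(ctx!=INVALID_HANDLE){buf=CLBufferCreate(ctx,1000*sizeof(double),CL_MEM_READ_WRITE);}
   if(krn==INVALID_HANDLE||buf==INVALID_HANDLE)
     {
      Print("Setup failed #",GetLastError(),"\n",build_log);
      FreeAll(krn,buf,prog,ctx);
      return;
     }
   //--- argument 0 is a __global pointer, so it takes a buffer handle
   bool ok=CLSetKernelArgMem(krn,0,buf);
   //--- argument 1 is a plain value, so it is passed directly
   double by=3.0;
   ok=ok&&CLSetKernelArg(krn,1,by);
   if(ok){Print("Both arguments assigned");}
   else{Print("Cannot assign arguments #",GetLastError());}
   FreeAll(krn,buf,prog,ctx);
  }
```

The argument index is the position of the parameter in the kernel signature, counting from 0.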

Now what ? i must fill the memory , i'm sending the array down , that's handy.

But wait, I have no array, damn it. We are testing the index at the same time (the get_global_id(0)), so let's create a mock array with the value 1.0 for all elements.

All right i hit the first error here , finally  😂 it says error 5110 

        Print("Memory arg assigned to kernel");
        double arr[];
        ArrayResize(arr,1000,0);
        ArrayFill(arr,0,1000,1.0);
        uint filled=CLBufferWrite(buffer_handle,arr,0,0,1000);
        if(filled==1000){
        Print("Filled "+IntegerToString(filled)+" items in buffer");
        }else{Print("Cannot fill buffer #"+IntegerToString(GetLastError()));}

Which error is that? Let's see: ERR_OPENCL_WRONG_BUFFER_SIZE, wrong buffer size. But why?

Okay, the buffer size when creating the buffer refers to bytes, not items! Good to know. It was in the docs, to be fair. My bad.
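A small sketch of the sizing, assuming the same 1000-item array of doubles. The script form and the variable names are mine; only the bytes-versus-items arithmetic is the point.

```mql5
void OnStart()
  {
   const int items=1000;
   ResetLastError();
   int ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
   if(ctx==INVALID_HANDLE){Print("Cannot create context #",GetLastError());return;}
   //--- the size argument is in bytes: 1000 doubles need 1000*8 = 8000 bytes
   uint bytes=(uint)(items*sizeof(double));
   int buf=CLBufferCreate(ctx,bytes,CL_MEM_READ_WRITE);
   if(buf==INVALID_HANDLE)
     {
      Print("Cannot create buffer #",GetLastError());
      CLContextFree(ctx);
      return;
     }
   double arr[];
   ArrayResize(arr,items);
   ArrayFill(arr,0,items,1.0);
   //--- the count passed here and the value returned are in array elements
   uint filled=CLBufferWrite(buf,arr,0,0,items);
   if(filled==(uint)items){Print("Filled ",filled," items (",bytes," bytes)");}
   else{Print("Cannot fill buffer #",GetLastError());}
   CLBufferFree(buf);
   CLContextFree(ctx);
  }
```

Using sizeof(double) instead of a hard-coded 8 keeps the size correct if the element type changes, for example to int for an index buffer.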

So , what now ? execute ?

Yes. Okay, so this is asynchronous by default, I guess (that's question 6), so I call the execution, then set the timer up again and query the status of the kernel execution.

Let's go with the default execute variant. I'm not seeing a blocking flag (like in the videos), so it must be async by default.

here it is :

void OnTimer(){
  if(!busy){
  busy=true;
  
  if(!loaded){
  EventKillTimer();
  //create a context 
    ResetLastError();
    ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
    if(ctx!=INVALID_HANDLE){
      ResetLastError();
      Print("CL.Context Created");
      string biscuit_source_code="__kernel void biscuit(__global double *array){\r\n"
                                 "int idx=get_global_id(0);\r\n"
                                 "array[idx]*=idx;}\r\n";
      string build_log="";
      program_handle=CLProgramCreate(ctx,biscuit_source_code,build_log);
      if(program_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("Program created!");
        buffer_handle=CLBufferCreate(ctx,1000*8,CL_MEM_READ_WRITE);
        if(buffer_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("buffer created");
        kernel_handle=CLKernelCreate(program_handle,"biscuit");
        if(kernel_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("Kernel created");
        if(CLSetKernelArgMem(kernel_handle,0,buffer_handle)){
        ResetLastError();
        Print("Memory arg assigned to kernel");
        double arr[];
        ArrayResize(arr,1000,0);
        ArrayFill(arr,0,1000,1.0);
        uint filled=CLBufferWrite(buffer_handle,arr,0,0,1000);
        if(filled==1000){
        ResetLastError();
        Print("Filled "+IntegerToString(filled)+" items in buffer");
        //so call execute
          if(CLExecute(kernel_handle)){
          Print("Executing");
          //setup the timer again
            EventSetMillisecondTimer(44);
          //and we are loaded 
            loaded=true;
          }else{Print("Cannot execute kernel #"+IntegerToString(GetLastError()));}          
        }else{Print("Cannot fill buffer #"+IntegerToString(GetLastError()));}
        }else{Print("Cannot assign memory arg#"+IntegerToString(GetLastError()));}
        }else{Print("Cannot create kernel #"+IntegerToString(GetLastError()));}
        }else{Print("Cannot create buffer #"+IntegerToString(GetLastError()));}
        }else{Alert("Errors #"+IntegerToString(GetLastError())+"\n"+build_log);}
      }else{Print("Cannot create CL.context #"+IntegerToString(GetLastError()));}
  }
  else if(loaded){
  //check execution status 
    ENUM_OPENCL_EXECUTION_STATUS status=(ENUM_OPENCL_EXECUTION_STATUS)CLExecutionStatus(kernel_handle);
    Comment("Kernel("+IntegerToString(kernel_handle)+") Status("+EnumToString(status)+")");
  }
  
  busy=false;
  }
  }

That -obviously- finished very fast , but what we want is to take a peep into the array .

So if it completed , read , print and go sprint (exit)  🤓

    if(status==CL_COMPLETE){
      double get[];
      ArrayResize(get,1000,0);
      ArrayFill(get,0,1000,0.0);
      ResetLastError();
      if(CLBufferRead(buffer_handle,get,0,0,1000)){
      //get the top 10 , they have the info we want 
        string msg="";
        for(int i=0;i<10;i++){
           msg+=DoubleToString(get[i],2)+"\n";
           }
        Alert(msg);
      }else{Print("Cannot read buffer #"+IntegerToString(GetLastError()));}
      Print("Exit");
      ExpertRemove();
      }

Aaand here's what we got back: the first element is 0, which means get_global_id(0) starts from 0? But the rest are 1.00.

Now i must find out if i need to typecast before i multiply , but let's change the multiplication line real quick to this , i have a hunch

array[idx]=array[idx]*idx;

nope , so i'll create a second buffer , an int and we'll fill it up with the index values to get to the bottom of this.

so what do we do :

  1. alter the source code string
  2. create a buffer 
  3. add the arg buffer, write only this time (I assume these enums are from the device's side, not ours)
  4. read the new int buffer and if we see 0,0,0,0,0,0 we panic

Also, note that we did not get any indication of an error and we are not checking for the limit of the array. That means either I did not understand something, or it is not immediately evident, or it does not matter given the way the work groups are relayed to the device. So many unknowns.

Anyway. I also hate this structure, and it will look even uglier with the addition of the 2nd buffer, but it's a test.

here is the updated code : 

void OnTimer(){
  if(!busy){
  busy=true;
  
  if(!loaded){
  EventKillTimer();
  //create a context 
    ResetLastError();
    ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
    if(ctx!=INVALID_HANDLE){
      ResetLastError();
      Print("CL.Context Created");
      string biscuit_source_code="__kernel void biscuit(__global double *array,__global int *idx_array){\r\n"
                                 "int idx=get_global_id(0);\r\n"
                                 "idx_array[idx]=idx;\r\n"
                                 "array[idx]=array[idx]*idx;}\r\n";
      string build_log="";
      program_handle=CLProgramCreate(ctx,biscuit_source_code,build_log);
      if(program_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("Program created!");
        buffer_handle=CLBufferCreate(ctx,1000*8,CL_MEM_READ_WRITE);
        buffer_handle2=CLBufferCreate(ctx,1000*4,CL_MEM_WRITE_ONLY);
        if(buffer_handle!=INVALID_HANDLE&&buffer_handle2!=INVALID_HANDLE){
        ResetLastError();
        Print("buffer created");
        kernel_handle=CLKernelCreate(program_handle,"biscuit");
        if(kernel_handle!=INVALID_HANDLE){
        ResetLastError();
        Print("Kernel created");
        if(CLSetKernelArgMem(kernel_handle,0,buffer_handle)&&CLSetKernelArgMem(kernel_handle,1,buffer_handle2)){
        ResetLastError();
        Print("Memory arg assigned to kernel");
        double arr[];
        ArrayResize(arr,1000,0);
        ArrayFill(arr,0,1000,1.0);
        uint filled=CLBufferWrite(buffer_handle,arr,0,0,1000);
        if(filled==1000){
        ResetLastError();
        Print("Filled "+IntegerToString(filled)+" items in buffer");
        //so call execute
          if(CLExecute(kernel_handle)){
          Print("Executing");
          //setup the timer again
            EventSetMillisecondTimer(44);
          //and we are loaded 
            loaded=true;
          }else{Print("Cannot execute kernel #"+IntegerToString(GetLastError()));}          
        }else{Print("Cannot fill buffer #"+IntegerToString(GetLastError()));}
        }else{Print("Cannot assign memory arg#"+IntegerToString(GetLastError()));}
        }else{Print("Cannot create kernel #"+IntegerToString(GetLastError()));}
        }else{Print("Cannot create buffer #"+IntegerToString(GetLastError()));}
        }else{Alert("Errors #"+IntegerToString(GetLastError())+"\n"+build_log);}
      }else{Print("Cannot create CL.context #"+IntegerToString(GetLastError()));}
  }
  else if(loaded){
  //check execution status 
    ENUM_OPENCL_EXECUTION_STATUS status=(ENUM_OPENCL_EXECUTION_STATUS)CLExecutionStatus(kernel_handle);
    Comment("Kernel("+IntegerToString(kernel_handle)+") Status("+EnumToString(status)+")");
    if(status==CL_COMPLETE){
      double get[];
      ArrayResize(get,1000,0);
      ArrayFill(get,0,1000,0.0);
      int get_idx[];
      ArrayResize(get_idx,1000,0);
      ArrayFill(get_idx,0,1000,-1);
      ResetLastError();
      if(CLBufferRead(buffer_handle,get,0,0,1000)&&CLBufferRead(buffer_handle2,get_idx,0,0,1000)){
      //get the top 10 , they have the info we want 
        string msg="";
        for(int i=0;i<10;i++){
           msg+=DoubleToString(get[i],2)+"(idx:"+IntegerToString(get_idx[i])+")\n";
           }
        Alert(msg);
      }else{Print("Cannot read buffer #"+IntegerToString(GetLastError()));}
      Print("Exit");
      ExpertRemove();
      }
  }
  
  busy=false;
  }
  }

And i'm getting 0 on all the index values ... hmmm . which begs the question , the first array value is multiplied then the rest are not ?

So does it only execute the first one ?

Okay , let's add the limit check real quick . 

No not it.

Okay what if the CLExecute executes only once ? 

Hmm, okay, so what if I create a counter of my own and keep pumping it until it finishes? That would take 44 seconds though (44ms*1000), so I'll reduce the items to 100.

But how is that parallel? wtf. This is the problem with the MQL5 docs: either the person who understands the code writes the docs and is bored, or the person who writes the docs is not the one who wrote the code. The moment you fully understand something is when it should be explained extensively to us peasants. Why? Because it grows your ecosystem faster! Imagine 10000 coders reading the docs for this: if a bigger % grasps it in less time, they will create more stuff earlier, and more stuff will attract more activity etc. We can't be guessing what their thought process was every time they deployed these tools... I mean... anyway.

Rant over. Let's look at what they have shared anyway. I'm seeing arrays there instead of integers for the sizes too. Okay.

Yeah, okay, so it takes extra studying and some crystal ball interaction to relate the global_work_offset[], global_work_size[] and local_work_size[] arrays to the tutorial video above. I used the CLExecute variant that takes these arrays: I set the offset array with one element (one dimension) to 0 and the work size array with one element (dimension) to 1000, and it worked. The plain CLExecute(kernel_handle) call apparently runs a single work item, which would explain why only the first element changed before.
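The working call described above can be sketched as a standalone script. This is my reconstruction under the same assumptions as the test (one dimension, offset 0, 1000 work items, array pre-filled with 1.0) and not a copy of the attached file; the bounded polling loop is an illustrative choice.

```mql5
void OnStart()
  {
   const int items=1000;
   string src="__kernel void biscuit(__global double *array){\r\n"
              "int idx=get_global_id(0);\r\n"
              "array[idx]*=idx;}\r\n";
   string build_log="";
   int prog=INVALID_HANDLE,krn=INVALID_HANDLE,buf=INVALID_HANDLE;
   ResetLastError();
   int ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
   if(ctx!=INVALID_HANDLE){prog=CLProgramCreate(ctx,src,build_log);}
   if(prog!=INVALID_HANDLE){krn=CLKernelCreate(prog,"biscuit");}
   if(ctx!=INVALID_HANDLE){buf=CLBufferCreate(ctx,items*sizeof(double),CL_MEM_READ_WRITE);}
   double arr[];
   ArrayResize(arr,items);
   ArrayFill(arr,0,items,1.0);
   bool ok=(krn!=INVALID_HANDLE&&buf!=INVALID_HANDLE);
   ok=ok&&CLSetKernelArgMem(krn,0,buf);
   ok=ok&&CLBufferWrite(buf,arr)==(uint)items;
   //--- one dimension, ids start at offset 0, 1000 work items in total
   uint offset[1]={0};
   uint work[1]={1000};
   ok=ok&&CLExecute(krn,1,offset,work);
   //--- poll for completion, with an upper bound on the wait
   for(int tries=0;ok&&tries<1000&&CLExecutionStatus(krn)!=CL_COMPLETE;tries++){Sleep(1);}
   ok=ok&&CLBufferRead(buf,arr)==(uint)items;
   if(ok)
     {
      //--- every element started at 1.0 and was multiplied by its own index
      for(int i=0;i<5;i++){Print("arr[",i,"]=",DoubleToString(arr[i],2));}
     }
   else{Print("Failed #",GetLastError(),"\n",build_log);}
   if(krn!=INVALID_HANDLE){CLKernelFree(krn);}
   if(buf!=INVALID_HANDLE){CLBufferFree(buf);}
   if(prog!=INVALID_HANDLE){CLProgramFree(prog);}
   if(ctx!=INVALID_HANDLE){CLContextFree(ctx);}
  }
```

If every work item runs, element i should come back holding i, with the first one at 0, which matches get_global_id(0) starting from 0.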

So , the get_global_id(0) starts from 0 , so their docs have a little error unless i'm missing something else -that is not documented anywhere either-

Here is the code 

Yours and my first OpenCL program. I'm attaching it as it went over 64k.

And here is the second part of the video above going in more detail on OpenCL C 

Questions left : 

  • if the pool of "leftover" tasks is equal to the amount of tasks we create why do i need to exit if the index goes above total tasks ?
    isn't running over the limit impossible ? 
  • do i need to typecast the int to multiply the array?
  • where are the constant memory functions?
  • do sub functions need to be created as kernels ?
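Here is a sketch that pokes at three of these questions. It rests on my understanding of standard OpenCL C rather than on anything in the MQL5 docs: an int operand is converted to double implicitly in arithmetic, a function without __kernel can be called from a kernel, and the bounds check matters when more work items are launched than there are array items (for example when the global size is rounded up to a multiple of a work-group size). The helper name scale and the sizes 1000 and 1024 are chosen for illustration; the constant memory question is not covered.

```mql5
void OnStart()
  {
   const int items=1000;
   //--- scale() is an ordinary helper, not a kernel; v*k mixes double and int without a cast
   string src="double scale(double v,int k){return v*k;}\r\n"
              "__kernel void biscuit(__global double *array,int total_items){\r\n"
              "int idx=get_global_id(0);\r\n"
              "if(idx>=total_items){return;}\r\n"
              "array[idx]=scale(array[idx],idx);}\r\n";
   string build_log="";
   int prog=INVALID_HANDLE,krn=INVALID_HANDLE,buf=INVALID_HANDLE;
   ResetLastError();
   int ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
   if(ctx!=INVALID_HANDLE){prog=CLProgramCreate(ctx,src,build_log);}
   if(prog!=INVALID_HANDLE){krn=CLKernelCreate(prog,"biscuit");}
   if(ctx!=INVALID_HANDLE){buf=CLBufferCreate(ctx,items*sizeof(double),CL_MEM_READ_WRITE);}
   double arr[];
   ArrayResize(arr,items);
   ArrayFill(arr,0,items,1.0);
   bool ok=(krn!=INVALID_HANDLE&&buf!=INVALID_HANDLE);
   ok=ok&&CLSetKernelArgMem(krn,0,buf);
   ok=ok&&CLSetKernelArg(krn,1,items);
   ok=ok&&CLBufferWrite(buf,arr)==(uint)items;
   //--- 1024 work items for 1000 array items: ids 1000..1023 must exit early
   uint offset[1]={0};
   uint work[1]={1024};
   ok=ok&&CLExecute(krn,1,offset,work);
   for(int tries=0;ok&&tries<1000&&CLExecutionStatus(krn)!=CL_COMPLETE;tries++){Sleep(1);}
   ok=ok&&CLBufferRead(buf,arr)==(uint)items;
   if(ok){Print("first=",DoubleToString(arr[0],2)," last=",DoubleToString(arr[items-1],2));}
   else{Print("Failed #",GetLastError(),"\n",build_log);}
   if(krn!=INVALID_HANDLE){CLKernelFree(krn);}
   if(buf!=INVALID_HANDLE){CLBufferFree(buf);}
   if(prog!=INVALID_HANDLE){CLProgramFree(prog);}
   if(ctx!=INVALID_HANDLE){CLContextFree(ctx);}
  }
```

When the global work size equals the item count exactly, as in the 1000-item test, no id can run past the array and the check is redundant. Note that the comparison has to be idx>=total_items, because the valid ids run from 0 to total_items-1.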

Second video . 

I typed this out as i thought it through , so , i hope it's helpful.



Files:
basics.mq5  5 kb