OpenCL::Example with Image Blurring Algorithm, Questions and Issues - page 2

 
amrali #:

I did not play with local memory at the time I was testing OpenCL. I found it very complicated, even on other resources on the internet. I only tested vectorized kernels.

I managed to implement a Bitonic sort that beats radix sort in speed (by optimizing the kernel algorithm).

But in the end, OpenCL is a non-portable solution for parallel processing. Now it is almost dead!

Non-portable? We are talking about the MT5 platform. MQL5 isn't portable either.

Maybe I misunderstood your point.

 
amrali #:

I did not play with local memory at the time I was testing OpenCL. I found it very complicated, even on other resources on the internet. I only tested vectorized kernels.

I managed to implement a Bitonic sort that beats radix sort in speed (by optimizing the kernel algorithm).

But in the end, OpenCL is a non-portable solution for parallel processing. Now it is almost dead!

Yeah, I find it daunting. Like Scarlett Johansson appears in front of me and I must flirt with her.

Dead within the MQL5 ecosystem or in general?

 
Lorentzos Roussos #:

Yeah, I find it daunting. Like Scarlett Johansson appears in front of me and I must flirt with her.

Dead within the MQL5 ecosystem or in general?

Almost dead in general, because there are still no API standards among graphics card manufacturers, unlike the situation with CPU manufacturers.

Not portable means the code you develop for your graphics card may not work at all (or may work with degraded performance) on my PC, although both run MT5, just with different GPUs.

My opinion is that OpenCL is an experimental rather than a production framework.
 

amrali #:

...

Not portable means the code you develop for your graphics card may not work at all (or may work with degraded performance) on my PC.
OK, in this sense, got it.
 
amrali #:
Almost dead in general, because there are still no API standards among graphics card manufacturers, unlike the situation with CPU manufacturers.

Not portable means the code you develop for your graphics card may not work at all (or may work with degraded performance) on my PC, although both run MT5, just with different GPUs.

My opinion is that OpenCL is an experimental rather than a production framework.

Hmm, yeah, I got the sense NVIDIA does not like OpenCL much. Implementing standards for an open library would be a big headache for the companies, I imagine, besides them running their own.

A standard between their libraries would make more sense.

Experimental or not though, if you can find the sweet spot for your algorithm, it's still faster.

 
Lorentzos Roussos #:

Experimental or not though, if you can find the sweet spot for your algorithm, it's still faster.

On your computer alone! You will never know what GPU other users have.
 
amrali #:
On your computer alone! You will never know what GPU other users have.

So the example I posted may be running slower on your GPU even though it's just the minimum basic optimization it could have?

I mean, it's not factoring in anything (apart from requiring a GPU with double support); it just sends the data down.

Unless you mean that if the optimization becomes too specific it can't scale (can't be distributed with the speed claims it was tested at), that I get.
 
Lorentzos Roussos #:

So the example I posted may be running slower on your GPU even though it's just the minimum basic optimization it could have?

I mean, it's not factoring in anything (apart from requiring a GPU with double support); it just sends the data down.

Unless you mean that if the optimization becomes too specific it can't scale (can't be distributed with the speed claims it was tested at), that I get.

From the resources I read on the subject before, I remember that the main complaint was performance non-portability.

These fine details need extended testing across different GPUs.
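
As a rough starting point for that kind of testing, here is a minimal, untested sketch. It assumes the CL_DEVICE_* integer properties documented for CLGetInfoInteger in MQL5 can be queried through the context handle; it only logs what the local device reports, so figures from different machines can at least be compared:

//--- hedged sketch: log basic characteristics of the selected OpenCL device
void LogDeviceInfo()
  {
   int ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
   if(ctx==INVALID_HANDLE){Print("Cannot create ctx");return;}
//--- property IDs taken from the MQL5 CLGetInfoInteger documentation
   Print("Compute units      : ",CLGetInfoInteger(ctx,CL_DEVICE_MAX_COMPUTE_UNITS));
   Print("Max work group size: ",CLGetInfoInteger(ctx,CL_DEVICE_MAX_WORK_GROUP_SIZE));
   Print("Local mem (bytes)  : ",CLGetInfoInteger(ctx,CL_DEVICE_LOCAL_MEM_SIZE));
   Print("Global mem (bytes) : ",CLGetInfoInteger(ctx,CL_DEVICE_GLOBAL_MEM_SIZE));
   CLContextFree(ctx);
  }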

 
amrali #:

From the resources I read on the subject before, I remember that the main complaint was performance non-portability.

These fine details need extended testing across different GPUs.

I see, I'm in no way experienced enough with it to counter that.

It's the poor man's Rocinante, I guess.

Thanks for the tips.

 

Where are the values coming from for this?

CLGetInfoInteger(ker,CL_KERNEL_PRIVATE_MEM_SIZE);

I'm running this kernel:

__kernel void memtests(__global int* x,int f,int b)
{
   int a=get_global_id(0);
   int c=a*b+f;
   x[a]=c;
}

and it says private memory 0; shouldn't it say 32 (f, b, a and c)?

Here is the full code:

int OnInit()
  {
//--- run the test once from the timer, outside OnInit
   EventSetMillisecondTimer(33);
//---
   return(INIT_SUCCEEDED);
  }

void OnDeinit(const int reason)
  {
//---
  }

void OnTimer()
  {
   EventKillTimer();
//--- create an OpenCL context on a GPU with double support
   int ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
   if(ctx!=INVALID_HANDLE)
     {
      string kernel="__kernel void memtests(__global int* x,int f,int b){int a=get_global_id(0);int c=a*b+f;x[a]=c;}";
      string errors="";
      int prg=CLProgramCreate(ctx,kernel,errors);
      if(prg!=INVALID_HANDLE)
        {
         ResetLastError();
         int ker=CLKernelCreate(prg,"memtests");
         if(ker!=INVALID_HANDLE)
           {
            //--- output buffer: one int per work item
            int buf[];
            ArrayResize(buf,100,0);
            ArrayFill(buf,0,ArraySize(buf),0);
            int b_handle=CLBufferCreate(ctx,ArraySize(buf)*sizeof(int),CL_MEM_WRITE_ONLY);
            CLSetKernelArgMem(ker,0,b_handle);   // __global int* x
            CLSetKernelArg(ker,1,2);             // int f
            CLSetKernelArg(ker,2,4);             // int b
            //--- launch one work item per buffer element
            uint offsets[]={0};
            uint works[]={ArraySize(buf)};
            CLExecute(ker,1,offsets,works);
            while(CLExecutionStatus(ker)!=CL_COMPLETE){Sleep(10);}
            Print("Kernel finished");
            //--- query the kernel memory / work-group properties
            long kernel_local_mem_size=CLGetInfoInteger(ker,CL_KERNEL_LOCAL_MEM_SIZE);
            long kernel_private_mem_size=CLGetInfoInteger(ker,CL_KERNEL_PRIVATE_MEM_SIZE);
            long kernel_work_group_size=CLGetInfoInteger(ker,CL_KERNEL_WORK_GROUP_SIZE);
            Print("Kernel local mem ("+kernel_local_mem_size+")");
            Print("Kernel private mem ("+kernel_private_mem_size+")");
            Print("Kernel work group size ("+kernel_work_group_size+")");
            CLKernelFree(ker);
            CLBufferFree(b_handle);
           }
         else{Print("Cannot create kernel");}
         CLProgramFree(prg);
        }
      else{Alert(errors);}
      CLContextFree(ctx);
     }
   else{Print("Cannot create ctx");}
  }
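
A hedged follow-up on the private-memory question, a guess rather than anything confirmed: scalars such as f, b, a and c are usually promoted to registers, so drivers commonly report 0 for CL_KERNEL_PRIVATE_MEM_SIZE. A variant kernel string with an explicit private array, dropped into the same host code above in place of the original, might be reported differently (or might still come back 0 if the array stays in registers):

string kernel="__kernel void memtests(__global int* x,int f,int b){"+
              "int tmp[64];"+                         // explicit private (per-work-item) array
              "int a=get_global_id(0);"+
              "for(int i=0;i<64;i++) tmp[i]=a*b+f+i;"+
              "x[a]=tmp[a%64];"+                      // use tmp so it is not optimized away
              "}";

Comparing CL_KERNEL_PRIVATE_MEM_SIZE for the two kernels on the same machine would at least show whether the counter reacts to private arrays at all.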