OpenCL: real challenges - page 5

 
Create a Service Desk request and attach the code (you can also send it via PM); I will analyze it.
Communicate with the developers via Service Desk!
  • www.mql5.com
Your message will immediately become available to our testing and technical support departments and to the trading platform developers.
 

What part of my code are you interested in exactly? I have a lot of dependencies on different files.

My only problem right now is the cost of writing and reading the buffer on every tick in the tester, and the following is enough to check it:

#property copyright ""
#property link      ""

int hcontext, hprogram, hkernel, hbuffer[5];

void InitGlobal()
{
   for (int cdev = (int)CLGetInfoInteger(0, CL_DEVICE_COUNT)-1; cdev>-1; cdev--)
   {
      string name;
      CLGetInfoString(cdev, CL_DEVICE_NAME, name);
      Print("Device #",cdev," = ",name);
   }
   
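   // Note: for global_index == 0 the kernel divides by zero and stores inf or NaN;
   // harmless here, since only the buffer transfer timing is being tested.
   string source =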
"kernel void tester(global double *price, global double *result)                                   \r\n"
"{                                                                                                 \r\n"
"   int global_index = get_global_id(0);                                                           \r\n"
"   result[global_index] = price[global_index] / global_index;                                     \r\n"
"}                                                                                                 \r\n"
;
   
   hcontext = CLContextCreate(CL_USE_GPU_ONLY);
   string build_log;
   hprogram = CLProgramCreate(hcontext, source, build_log);
   Print("build log = ", build_log);
   hkernel = CLKernelCreate(hprogram, "tester");
}

void DeinitGlobal()
{
   CLBufferFree(hbuffer[0]);
   CLBufferFree(hbuffer[1]);
   
   CLKernelFree(hkernel);
   CLProgramFree(hprogram);
   CLContextFree(hcontext);
}

int OnInit()
{
   InitGlobal();
   return(0);
}

void OnDeinit(const int reason)
{
   DeinitGlobal();
}

// Unlike an expert, a script can be debugged on weekends :)
//void OnStart() {  InitGlobal();
void OnTick() {
   double price[30];
   CopyClose(_Symbol,_Period,0,ArraySize(price),price);
   
   static bool firststart = true;
   if (firststart)
   {
      firststart = false;
      uint bufsize = sizeof(price);
      Print("Buffer size in bytes =",bufsize);
      hbuffer[0] = CLBufferCreate(hcontext, bufsize, CL_MEM_READ_ONLY);
      hbuffer[1] = CLBufferCreate(hcontext, bufsize, CL_MEM_WRITE_ONLY);
      
      CLSetKernelArgMem(hkernel, 0, hbuffer[0]);
      CLSetKernelArgMem(hkernel, 1, hbuffer[1]);
   }
   
   // What is missing here is clGetMemObjectInfo(buffer, CL_MEM_SIZE) to check the size.
   
   CLBufferWrite(hbuffer[0], price);
   
   uint units = (uint)CLGetInfoInteger(hcontext, CL_DEVICE_MAX_COMPUTE_UNITS);
   uint global_work_offset[] = {0};
   uint global_work_size[1];
   uint local_work_size[1];
   global_work_size[0] = ArraySize(price);
   local_work_size[0] = global_work_size[0] / units;
   bool exec = CLExecute(hkernel, 1, global_work_offset, global_work_size, local_work_size); // async
   if (exec == false) Print("Error in ",__FUNCSIG__," CLExecute: ",GetLastError());
   
   CLBufferRead(hbuffer[1], price);
   
   if (MQL5InfoInteger(MQL5_PROGRAM_TYPE) == PROGRAM_SCRIPT) DeinitGlobal();
}

Running as a script:

2013.10.30 18:55:40 OpenCL_buffer_test (EURUSD,H1) Device #1 = AMD Phenom(tm) II X4 925 Processor
2013.10.30 18:55:40 OpenCL_buffer_test (EURUSD,H1) Device #0 = Cypress
2013.10.30 18:55:40 OpenCL_buffer_test (EURUSD,H1) OpenCL: GPU device 'Cypress' selected
2013.10.30 18:55:40 OpenCL_buffer_test (EURUSD,H1) build log =
2013.10.30 18:55:40 OpenCL_buffer_test (EURUSD,H1) Buffer size in bytes =240

Running as an expert in the tester from 2013.01.09 to 2013.10.10 on M5 with "OHLC on M1":

2013.10.30 19:01:44 Core 1 EURUSD,M5: testing of experts\OpenCL_buffer_test.ex5 from 2013.01.09 00:00 to 2013.10.10 00:00 started
2013.10.30 19:01:44 Core 1 2013.01.09 00:00:00 Device #0 = Cypress
2013.10.30 19:01:44 Core 1 2013.01.09 00:00:00 OpenCL: GPU device 'Cypress' selected
2013.10.30 19:01:44 Core 1 2013.01.09 00:00:00 build log =
2013.10.30 19:01:44 Core 1 2013.01.09 00:00:00 Buffer size in bytes =240
2013.10.30 19:04:55 Core 1 EURUSD,M5: 1108637 ticks (55953 bars) generated within 192443 ms (total bars in history 131439, total time 192521 ms)
2013.10.30 19:04:55 Core 1 294 Mb memory used

Note that there is only 1 device in the tester.

If the read is commented out:

   //CLBufferRead(hbuffer[1], price);

then the test runs more than twice as fast:

2013.10.30 19:16:00 Core 1 EURUSD,M5: 1108637 ticks (55953 bars) generated within 88218 ms (total bars in history 131439, total time 88297 ms)
 
The read is required. In fact, it is the call that waits for the kernel to finish, because CLExecute only queues the job for execution and returns control to the MQL program immediately, without waiting for the job to complete.
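
The behaviour described above can be observed by timing the two calls separately. This is a minimal sketch, assuming the kernel and output buffer handles were created as in the code posted earlier in this thread; `TimedRun` and its parameter names are hypothetical.

```mql5
// Hypothetical helper: measures how long the enqueue and the read take.
// kernel     - handle returned by CLKernelCreate (arguments already set)
// out_buffer - handle of the result buffer
// result     - host array that receives the results
void TimedRun(int kernel, int out_buffer, double &result[])
{
   uint offset[] = {0};
   uint work[1];
   work[0] = (uint)ArraySize(result);

   ulong t0 = GetMicrosecondCount();
   bool queued = CLExecute(kernel, 1, offset, work);   // only queues the job
   ulong t1 = GetMicrosecondCount();
   uint read = CLBufferRead(out_buffer, result);       // waits for the kernel, then copies
   ulong t2 = GetMicrosecondCount();

   if (!queued || read == 0)
      Print("OpenCL error: ", GetLastError());
   Print("CLExecute: ", t1 - t0, " us, CLBufferRead (wait + copy): ", t2 - t1, " us");
}
```

If the explanation above holds, the first figure should stay small while the second one absorbs both the kernel run time and the transfer.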
 
Specifically in this example, the advantage of using OpenCL is eaten up by the buffer copying overhead.

If the calculations have to run on OHLC data, it is essential to write sparingly: create a larger buffer in advance and, when new data arrives, write only the new portion into it, then pass the kernel the new start offset and size of the data.
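
A sketch of what such sparing writes could look like in MQL5. The names `CAPACITY`, `WINDOW`, `filled` and `AppendBar` are hypothetical, and the sketch assumes the kernel has been extended with two extra integer arguments (`start` and `count`) that are not in the code posted above. The device buffer is assumed to have been created once with `CLBufferCreate(hcontext, CAPACITY * sizeof(double), CL_MEM_READ_ONLY)`.

```mql5
#define CAPACITY 100000   // assumed upper bound on bars in one test run
#define WINDOW   30       // number of bars the kernel works on, as in the post

int filled = 0;           // bars already stored in the device buffer

// Appends one bar to the device buffer and tells the kernel where the window starts.
// Assumes a kernel of the form
//   kernel void tester(global double *price, global double *result, int start, int count)
bool AppendBar(int kernel, int hprice, double close)
{
   if (filled >= CAPACITY)
      return(false);                       // sketch only: no wrap-around handling

   double one[1];
   one[0] = close;

   // buffer_offset is in bytes; the array offset and the count are in elements
   uint offset_bytes = (uint)filled * sizeof(double);
   if (CLBufferWrite(hprice, one, offset_bytes, 0, 1) != 1)
      return(false);
   filled++;

   int count = (filled < WINDOW) ? filled : WINDOW;
   int start = filled - count;
   return(CLSetKernelArg(kernel, 2, start) && CLSetKernelArg(kernel, 3, count));
}
```

This reduces the amount of data copied per bar to a single element, but it still costs one write call per bar, so any fixed per-call latency remains.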
OpenCL: The Bridge to Parallel Worlds
  • 2012.05.16
  • Sceptic Philozoff
  • www.mql5.com
In late January 2012, the company that develops the MetaTrader 5 terminal announced native OpenCL support in MQL5. Using a concrete example, the article lays out the basics of OpenCL programming in the MQL5 environment and gives several examples of "naive" optimization of the program for speed.
 

Even if we manage to optimize the OHLC transfer (by passing only the last bar via CLSetKernelArg), we still run into the cost of reading the results buffer:

2013.10.31 19:24:13 Core 1 EURUSD,M5: 1108637 ticks (55953 bars) generated within 114489 ms (total bars in history 131439, total time 114598 ms)
(here the line CLBufferWrite(hbuffer[0], price); was moved inside the if (firststart) block)
 
Roffild: Eh... The articles on speeding things up with OpenCL on a GPU turned out to be a fairy tale, since they do not really deal with real tasks :(

Well, who is stopping you? Go ahead and write something real that is not a fairy tale. But try to find an example where acceleration actually happens. That is the hardest part.

If you're talking about my articles... Well, I was writing a primer. And matrix multiplication is quite a useful operation.

P.S. By the way, if your CPU is from Intel, emulation on its x86 cores is much faster than on the competitor's CPU, at least when compared per core.

HD5850: basically a pretty decent card, but modern cards are better, not only because they have more "flies" (stream cores), but also thanks to OpenCL-specific optimizations. For example, global memory access time is significantly lower.

P.P.S. OpenCL is not a panacea; it is a viable tool that can speed things up significantly in some rare cases. In other, less convenient cases, the acceleration is not so impressive, if there is any at all.

 
Roffild:
Eh... The articles on speeding things up with OpenCL on a GPU turned out to be a fairy tale, since they do not really deal with real tasks :(

Not so. The fairy tale is that "any algorithm can be accelerated in OpenCL". Not any algorithm.

The first OpenCL thread describes quite well the criteria an algorithm must meet to have any potential for OpenCL acceleration.

Good luck with that.

 

The complaint is not about calculation speed: there is about a 2x speedup (0.02 ms vs 0.05 ms).

The complaint is that the articles do not mention the following:

  • The read/write latency of even a small buffer is 0.35300 ms, and this is what makes porting most algorithms to OpenCL pointless!
  • The tester does not select the CPU as an OpenCL device, and this is not stated anywhere!

I am probably the first one who wanted to speed up testing by using the GPU, having read the promises...

MetaDriver: The first OpenCL thread describes quite well the criteria an algorithm must meet to have any potential for OpenCL acceleration.

Read my post again.

The main criterion: the execution time of the MQL code written in "OpenCL style" must exceed Number_of_Buffers * 0.35300 ms per tick.

To find out the speed of an algorithm in MQL with microsecond accuracy (1000 microseconds = 1 millisecond), you have to run it several times in the tester and compute Total_Time / Number_of_Ticks (see my post above).
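
The arithmetic described above can be written as a small script. This is a sketch in MQL5 that uses the tick count and run times from the tester logs quoted earlier in this thread; `WorthOffloading` and its parameter names are hypothetical, and 0.353 ms is the per-buffer latency figure quoted in this post.

```mql5
// Hypothetical helper: the break-even rule from the post above.
// Offloading pays off only if the plain MQL5 work per tick takes longer
// than the transfer latency of all buffers touched on that tick.
bool WorthOffloading(double mql_ms_per_tick, int buffers, double transfer_ms = 0.353)
{
   return(mql_ms_per_tick > buffers * transfer_ms);
}

void OnStart()
{
   double ticks        = 1108637.0;  // ticks generated in both tester runs
   double with_read_ms = 192443.0;   // run with CLBufferRead on every tick
   double no_read_ms   = 88218.0;    // run with CLBufferRead commented out

   double per_tick_with = with_read_ms / ticks;
   double per_tick_no   = no_read_ms / ticks;

   Print("with read:    ", DoubleToString(per_tick_with, 4), " ms per tick");
   Print("without read: ", DoubleToString(per_tick_no, 4), " ms per tick");
   Print("read cost:    ", DoubleToString(per_tick_with - per_tick_no, 4), " ms per tick");

   // 0.05 ms of MQL5 work per tick against two buffers (one written, one read)
   Print("worth offloading: ", WorthOffloading(0.05, 2));
}
```

For the logged runs this works out to about 0.1736 ms per tick with the read and 0.0796 ms without it, so the read accounts for roughly 0.094 ms per tick in the 30-element test. With 0.05 ms of MQL work per tick and two buffers, the rule returns false.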

Were it not for the buffer delay, my code would pass the test in ~30 seconds - that's ~2 times faster than "OpenCL style" MQL (55 seconds) and ~11 times faster than regular code (320 seconds).

What other criteria are there?

 
Roffild: The claim is that there is no information in the articles:
  • The read/write latency of even a small buffer is 0.35300 ms, and this is what makes porting most algorithms to OpenCL pointless!

Judging by your experience in dealing with OpenCL, you must have already understood that not every algorithm is directly accelerated. One of the main problems here is minimizing global memory accesses.

By the way, I now have to solve a similar problem with random access to global device memory (this random access is too frequent, and it is a damn big overhead). I will solve it as soon as I get my brain back on track.

The tester does not select the CPU for OpenCL - this is not reported anywhere!

Write to Service Desk and justify the need for such a feature.

Outside the tester this already works (it was done at my request). I have not checked it in the tester yet.



 
Mathemat:

It has already been written that not every algorithm is directly accelerated. You have to use your brain here, and one of the main problems is to minimize global memory accesses.

Well, now I have to solve a similar problem with random access to global memory (this random access is too frequent). I'll solve it as soon as I get my brain working.

It's time to use your brain because 0.35300 ms refers exactly to clEnqueue[Read/Write]Buffer() and not to global memory accesses inside the kernel.

The latter can be solved by optimizing the kernel itself, while the former is a hardware limitation.

OpenCL: From Naive Towards More Insightful Programming
  • 2012.06.29
  • Sceptic Philozoff
  • www.mql5.com
This article focuses on some optimization capabilities that open up when at least some consideration is given to the underlying hardware on which the OpenCL kernel is executed. The figures obtained are far from being ceiling values but even they suggest that having the existing resources available here and now (OpenCL API as implemented by the developers of the terminal does not allow to control some parameters important for optimization - particularly, the work group size), the performance gain over the host program execution is very substantial.