Evaluating CPU cores for optimisation - page 6

 
Here is some information, taken from one source, on what these instruction sets do

Instruction sets that the E5-2670 lacks:
BMI (Bit Manipulation Instructions) are instruction sets used in Intel and AMD processors to accelerate bit manipulation operations. BMI instructions are not SIMD instructions; they operate only on the processor's general-purpose registers.
Bit manipulation is most often used in applications for low-level device control, error detection and correction, optimization, compression and encryption. Using BMI can speed these operations up considerably (sometimes several-fold), but it makes the code harder to write.
The bit manipulation instruction sets of Intel and AMD processors differ significantly.
Intel processors use BMI1 and BMI2 (which complements BMI1).
AMD processors use the ABM (Advanced Bit Manipulation) instructions, which are part of the SSE4a package (Intel uses these instructions too, but implements them as part of SSE4.2 and BMI1). In addition, AMD processors use the TBM (Trailing Bit Manipulation) instruction set, which is an extension of BMI1.
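To make "bit manipulation on general-purpose registers" more concrete, here is a minimal sketch in C++ (chosen because it is close to MQL5 syntax). The function names are my own, not part of any library. Each one-liner is the operation that a single BMI1 instruction computes (BLSR, BLSI, ANDN); whether a given compiler actually emits those instructions depends on the compiler and its target flags, so treat that part as an assumption.

```cpp
#include <cstdint>

// x & (x - 1): clears the lowest set bit (the operation BMI1's BLSR computes).
inline std::uint32_t clear_lowest_set_bit(std::uint32_t x) {
    return x & (x - 1u);
}

// x & -x: keeps only the lowest set bit (the operation BMI1's BLSI computes).
inline std::uint32_t isolate_lowest_set_bit(std::uint32_t x) {
    return x & (0u - x);
}

// ~a & b: the operation BMI1's ANDN computes.
inline std::uint32_t and_not(std::uint32_t a, std::uint32_t b) {
    return ~a & b;
}

// Counts set bits by clearing them one at a time.
// POPCNT (part of ABM / SSE4.2) produces the same count in one instruction.
inline int count_set_bits(std::uint32_t x) {
    int n = 0;
    while (x != 0) {
        x = clear_lowest_set_bit(x);
        ++n;
    }
    return n;
}
```

For example, count_set_bits(0x2C) walks through 0x2C, 0x28, 0x20, 0 and returns 3. Without hardware support such helpers cost several ordinary instructions or a loop, which is where the speed-up mentioned above comes from.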

F16C (16-bit Floating-Point Conversion) is an instruction set used on x86 processors to accelerate conversions between half-precision (16-bit) and standard single-precision (32-bit) binary floating-point numbers. It is essentially an extension of the basic 128-bit SSE instructions.
The use of numbers of different precision in computing is a trade-off between accuracy and the range of representable values, needed to achieve high performance across a wide range of tasks.
F16C originated at AMD, where it was developed under the name CVT16 and announced in 2009; processors supporting it followed later. CVT16 was originally planned as part of the never-released SSE5 package, which was also to include the XOP and FMA4 instructions.
Today the F16C instruction set is used in both AMD and Intel processors, significantly extending their capabilities for handling multimedia data as well as other types of data.
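To show what such a conversion involves when done without hardware support, here is a rough software sketch in C++ of the half-to-single direction. The function name half_to_float is hypothetical, and the code assumes the usual IEEE 754 layouts (1 sign bit, 5 exponent bits and 10 mantissa bits for half precision; a 32-bit float). F16C does the same job in one instruction for several values at once (VCVTPH2PS, and VCVTPS2PH for the reverse direction).

```cpp
#include <cstdint>
#include <cstring>

// Converts one IEEE 754 half-precision value (given as raw bits) to float.
float half_to_float(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x3FFu;
    std::uint32_t bits = 0;

    if (exponent == 0) {
        if (mantissa == 0) {
            bits = sign;  // signed zero
        } else {
            // Subnormal half: shift until the implicit leading 1 appears.
            int shift = 0;
            while ((mantissa & 0x400u) == 0) {
                mantissa <<= 1;
                ++shift;
            }
            mantissa &= 0x3FFu;
            bits = sign | (static_cast<std::uint32_t>(113 - shift) << 23) | (mantissa << 13);
        }
    } else if (exponent == 31) {
        bits = sign | 0x7F800000u | (mantissa << 13);  // infinity or NaN
    } else {
        // Normal number: re-bias the exponent from 15 to 127.
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    }

    float result;
    std::memcpy(&result, &bits, sizeof result);  // reinterpret the bit pattern
    return result;
}
```

For example, the half-precision bit pattern 0x3C00 is 1.0 and 0x7BFF is 65504, the largest finite half-precision value. The branching above is the work that the hardware instruction removes.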

FMA
A set of processor instructions that accelerate floating-point multiply-and-add operations. FMA stands for Fused Multiply-Add: the multiplication and the addition are performed as one operation with a single rounding.
Multiply-add operations are very common and play an important role in computing, especially in digital processing of analog signals (video and audio coding and similar operations). For this reason, FMA support is built not only into CPUs but also into the GPUs of many modern graphics cards.
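A small C++ sketch of what "single rounding" means in practice, using the standard std::fma function from <cmath>. The helper names fma_dot and product_rounding_error are my own. Whether std::fma maps to a hardware FMA instruction depends on the CPU and compiler flags; on a processor without FMA (such as the E5-2670) the library computes the same result in software, only slower.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product where each step is one fused multiply-add:
// acc = a[i] * b[i] + acc, rounded once per step.
double fma_dot(const std::vector<double>& a, const std::vector<double>& b) {
    double acc = 0.0;
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        acc = std::fma(a[i], b[i], acc);
    }
    return acc;
}

// The ordinary product p = a * b is rounded to double.
// fma(a, b, -p) evaluates a * b - p with a single rounding, so it returns
// the part of the exact product that the rounding of p discarded.
double product_rounding_error(double a, double b) {
    const double p = a * b;
    return std::fma(a, b, -p);
}
```

For a product that fits exactly in a double, such as 3.0 * 4.0, product_rounding_error returns 0. When the exact product has more significant bits than a double can hold, it returns the small discarded part, which a separate multiply followed by a subtract cannot recover.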

Instruction sets that, among the processors compared, only the i7-8700 has:
TSX (Transactional Synchronization eXtensions) is a set of instructions for multi-core processors, developed by Intel, that makes cores working on the same shared data interact more efficiently and ultimately increases overall performance.

MPX (Memory Protection Extensions) is a technology that provides enhanced protection against viruses and other threats that exploit buffer overflows.
The processor can check heap and stack buffer bounds before a memory access, ensuring that an application touches only the memory area allocated to it. This makes it much harder for an attacker or malware to slip its own code to the CPU through memory.

SGX (Software Guard Extensions) is a set of instructions developed by Intel and used in its processors starting with the Skylake architecture.
SGX lets a program place sections of code and data in protected regions (called "enclaves"), giving running programs a high level of protection against malicious applications and attacks.


BMI2 (complementary to BMI1).

Since MPX and SGX are about protection, I would venture to guess that the compiler actively uses BMI2 instructions; an effect from TSX is also possible, but less likely.

 
Roman:
Alexey, I think it would be more efficient to learn to write OpenCL code and buy a good card for it.
OpenCL on a card will a priori be much more efficient than several processors.
And you can forget about this hassle of comparing processors.
But yes, you do have to figure out how to write code for OpenCL.
I can't really say how informative it is; I've been meaning to get around to reading it myself.

Aleksey Vyazmikin:

Writing OpenCL code is not that easy; I have studied the theory a little. It would be simpler if there were an agent built on OpenCL technology, unlike now; simpler from the users' point of view.

And then, OpenCL is not always efficient. I compared Yandex's CatBoost software on a 1060 card and an FX-8350 processor, and the processor turned out to be twice as fast. If that holds, it is economically more sensible to buy one powerful processor than five 1080 Ti cards, which would certainly help but are expensive. In general, it is not clear-cut, and it is not a solution for everyone.

Besides, I think the compiler could simply offer an option to disable support for the newest instruction sets, and then everything would run faster on older processors.

Roman:

For mathematical calculations the green cards (NVIDIA) are not particularly suitable.
The red ones (AMD) are better for maths; they even have a maths mode as standard, which can be enabled through the official app.
I have an old reference Radeon 7970, and it can still mine. That is not to say that I mine on one card (it is not profitable), only that it copes with the calculations.
For maths on a card, look at the number of shaders: the more, the better. The rest (fps and so on) does not matter; the shader units are what counts.


As far as I know, the red cards handle double precision well and the green ones do not. But machine learning with CatBoost is geared towards comparison operations, which in theory should run equally fast on red and green cards. And CatBoost does not support red cards, alas.

In any case, I cannot do it myself, and finding a contractor who understands the task, at a reasonable price, has not been easy.

 

It was suggested to me that the code could be sped up by using a switch statement instead of a chain of if checks.

It used to be like this:

         if(Type_Poisk_Tree==Tree_Buy_Filter || Type_Poisk_Tree==Tree_Sell_Filter || Type_Poisk_Tree==Tree_Buy || Type_Poisk_Tree==Tree_Sell)
           {
            if(Test_P>=1000 && Test_P<5000)
              {
               if(Test_P<2500)
                 {
                  if(Test_P==1000)if(DonProc<5.5 && Levl_Down_DC<-7.5) CalcTest=CalcTest+1; //(0.4810127 0.3037975 0.2151899)
                  if(Test_P==1001)if(DonProc< 5.5 && Levl_Down_DC>=-7.5 && TimeH< 21.5 && TimeH>=16.5 && TimeH< 19.5 && Levl_Close_H1s1N< 2.5) CalcTest=CalcTest+1; //(0.4400657 0.4072250 0.1527094)
                  if(Test_P==1002)if(DonProc< 5.5 && Levl_Down_DC>=-7.5 && TimeH< 21.5 && TimeH>=16.5 && TimeH< 19.5 && Levl_Close_H1s1N>=2.5) CalcTest=CalcTest+1; //(0.3739837 0.5121951 0.1138211)
                  if(Test_P==1003)if(DonProc<5.5 && Levl_Down_DC>=-7.5 && TimeH<21.5 && TimeH>=16.5 && TimeH>=19.5) CalcTest=CalcTest+1; //(0.3390706 0.4647160 0.1962134)
                  // another 70k comparisons
                }

And now it's like this:

         if(Type_Poisk_Tree==Tree_Buy_Filter || Type_Poisk_Tree==Tree_Sell_Filter || Type_Poisk_Tree==Tree_Buy || Type_Poisk_Tree==Tree_Sell)
           {
                  switch(Test_P)
                    {
                     case 1000: if(DonProc<5.5 && Levl_Down_DC<-7.5) CalcTest=CalcTest+1; break; //(0.4810127 0.3037975 0.2151899)
                     case 1001: if(DonProc< 5.5 && Levl_Down_DC>=-7.5 && TimeH< 21.5 && TimeH>=16.5 && TimeH< 19.5 && Levl_Close_H1s1N< 2.5) CalcTest=CalcTest+1; break; //(0.4400657 0.4072250 0.1527094)
                     case 1002: if(DonProc< 5.5 && Levl_Down_DC>=-7.5 && TimeH< 21.5 && TimeH>=16.5 && TimeH< 19.5 && Levl_Close_H1s1N>=2.5) CalcTest=CalcTest+1; break; //(0.3739837 0.5121951 0.1138211)
                     case 1003: if(DonProc<5.5 && Levl_Down_DC>=-7.5 && TimeH<21.5 && TimeH>=16.5 && TimeH>=19.5) CalcTest=CalcTest+1; break; //(0.3390706 0.4647160 0.1962134)
                     // another 70k comparisons
                   }

According to first estimates, the FX-8350 runs the new version about 30% faster, and Phenom II processors about 3 times faster! I will run comparison tests later, once the machines are free from optimization.
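The difference between the two variants can be sketched in plain C++, whose syntax is close to MQL5. This is not the Expert Advisor's code: the Features struct and both function names are hypothetical, and only rules 1000 and 1003 from the listings above are reproduced. In the if chain every `rule_id == ...` test is evaluated even after a match has been found. With a switch, the compiler is free to jump to the matching case via a jump table or a binary search. How much this helps depends on the compiler and the CPU, which is presumably why the gain differs between architectures.

```cpp
// Hypothetical stand-in for the inputs the rules look at.
struct Features {
    double DonProc;
    double Levl_Down_DC;
    double TimeH;
};

// Variant 1: a chain of independent ifs. Each rule_id comparison is
// evaluated in turn, even after the matching rule has been found.
int check_if_chain(int rule_id, const Features& f) {
    int hits = 0;
    if (rule_id == 1000)
        if (f.DonProc < 5.5 && f.Levl_Down_DC < -7.5) ++hits;
    if (rule_id == 1003)
        if (f.DonProc < 5.5 && f.Levl_Down_DC >= -7.5 && f.TimeH < 21.5 &&
            f.TimeH >= 16.5 && f.TimeH >= 19.5) ++hits;
    // ...tens of thousands of further rules in the real code
    return hits;
}

// Variant 2: the same rules behind a switch. Only the matching case runs.
int check_switch(int rule_id, const Features& f) {
    int hits = 0;
    switch (rule_id) {
        case 1000:
            if (f.DonProc < 5.5 && f.Levl_Down_DC < -7.5) ++hits;
            break;
        case 1003:
            if (f.DonProc < 5.5 && f.Levl_Down_DC >= -7.5 && f.TimeH < 21.5 &&
                f.TimeH >= 16.5 && f.TimeH >= 19.5) ++hits;
            break;
        default:
            break;
    }
    return hits;
}
```

Both functions return the same result for any rule_id, so the switch version is a pure speed change. With only two rules the difference is negligible; it becomes noticeable only with a very long list of rules, as in the Expert Advisor.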

I am attaching the new version of the Tree_Brut_TestPL Expert Advisor, with "_Fast" added to its name. Please test it too: it would be very interesting to know on which architectures there is a gain. There is hope that this research will help to improve the ME compiler.

 

I have received additional data from forum member Fast528 (who is currently unable to post on the forum)

Ryzen 2700 not overclocked, memory 3333

Tree_Brut_TestPL 8 cores 16 threads

2019.08.13 10:24:14.813 Tester optimization finished, total passes 11
2019.08.13 10:24:14.824 Statistics optimization done in 1 minutes 56 seconds
2019.08.13 10:24:14.824 Statistics shortest pass 0:01:13.337, longest pass 0:01:20.403, average pass 0:01:15.853
2019.08.13 10:24:14.824 Statistics 8731 frames (3.43 Mb total, 412 bytes per frame) received
2019.08.13 10:24:14.824 Statistics local 11 tasks (100%), remote 0 tasks (0%), cloud 0 tasks (0%)
2019.08.13 10:24:14.864 Tester 11 new records saved to cache file 'tester\cache\Tree_Brut_TestPL.30.E415F787BBBCE67C438526613B41CB4F.opt'

Tree_Brut_TestPL_F 8 cores 16 threads

2019.08.13 10:31:30.562 Tester optimization finished, total passes 11
2019.08.13 10:31:30.573 Statistics optimization done in 2 minutes 32 seconds
2019.08.13 10:31:30.573 Statistics shortest pass 0:02:12.689, longest pass 0:02:31.529, average pass 0:02:21.243
2019.08.13 10:31:30.573 Statistics 11000 frames (4.32 Mb total, 412 bytes per frame) received
2019.08.13 10:31:30.573 Statistics local 11 tasks (100%), remote 0 tasks (0%), cloud 0 tasks (0%)
2019.08.13 10:31:30.626 Tester 11 new records saved to cache file 'tester\cache\Tree_Brut_TestPL_F.30.E415F787BBBCE67C438526613B41CB4F.opt'

This test is incomplete: we also need a variant with 8 cores and 8 threads, since 8 agents were activated, and 16 passes should be set in the "Optimization" tab to match the number of threads (Start 0, Step 1, Stop 15).

When running the test again, don't forget to clear the cache, which is located at ..\Tester\cache

I will add the intermediate results to the table for now as 8 cores / 8 agents.

 

Unfortunately I can't edit the first post anymore, so I'm posting the rating here

 

Here is the result for an FX-8320E at 4 GHz, memory 1866, dual channel, dual rank.

Tree_Brut_TestPL_F_Fast

4 agents 8 passes

DF      0       13:27:26.728    Core 4  pass 6 returned result 1001000.00 in 0:00:28.342
HL      0       13:27:26.732    Core 1  pass 2 returned result 1001000.00 in 0:00:28.414
PE      0       13:27:26.844    Core 3  pass 4 returned result 1001000.00 in 0:00:28.476
PJ      0       13:27:26.936    Core 2  pass 0 returned result 1001000.00 in 0:00:28.619
QP      0       13:27:53.132    Core 4  pass 7 returned result 1001000.00 in 0:00:26.406
KI      0       13:27:53.219    Core 1  pass 3 returned result 1001000.00 in 0:00:26.489
MN      0       13:27:53.337    Core 3  pass 5 returned result 1001000.00 in 0:00:26.495
ND      0       13:27:53.571    Core 2  pass 1 returned result 1001000.00 in 0:00:26.637
OR      0       13:27:53.571    Tester  optimization finished, total passes 8
OF      0       13:27:53.582    Statistics      optimization done in 0 minutes 57 seconds
PI      0       13:27:53.582    Statistics      shortest pass 0:00:26.406, longest pass 0:00:28.619, average pass 0:00:27.484
NM      0       13:27:53.582    Statistics      8000 frames (3.14 Mb total, 412 bytes per frame) received
HL      0       13:27:53.582    Statistics      local 8 tasks (100%), remote 0 tasks (0%), cloud 0 tasks (0%)

8 agents 8 passes

DI      0       13:30:59.789    Core 2  pass 1 returned result 1001000.00 in 0:00:33.072
KN      0       13:30:59.887    Core 1  pass 0 returned result 1001000.00 in 0:00:33.177
PD      0       13:31:00.132    Core 3  pass 2 returned result 1001000.00 in 0:00:33.422
PM      0       13:31:00.245    Core 4  pass 3 returned result 1001000.00 in 0:00:33.531
RR      0       13:31:00.590    Core 8  pass 7 returned result 1001000.00 in 0:00:32.922
IH      0       13:31:00.615    Core 5  pass 4 returned result 1001000.00 in 0:00:33.197
CQ      0       13:31:00.981    Core 6  pass 5 returned result 1001000.00 in 0:00:33.506
GF      0       13:31:01.111    Core 7  pass 6 returned result 1001000.00 in 0:00:33.614
CS      0       13:31:01.111    Tester  optimization finished, total passes 8
KG      0       13:31:01.122    Statistics      optimization done in 0 minutes 35 seconds
RN      0       13:31:01.122    Statistics      shortest pass 0:00:32.922, longest pass 0:00:33.614, average pass 0:00:33.305
NO      0       13:31:01.122    Statistics      8000 frames (3.14 Mb total, 412 bytes per frame) received
HJ      0       13:31:01.122    Statistics      local 8 tasks (100%), remote 0 tasks (0%), cloud 0 tasks (0%)

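With 8 agents the whole run finishes about 1.6x faster (35 seconds versus 57 seconds), even though each individual pass takes longer.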

 
Maxim Romanov:

Here is the result for an FX-8320E at 4 GHz, memory 1866, dual channel, dual rank.

Tree_Brut_TestPL_F_Fast

4 agents 8 passes

8 agents 8 passes

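With 8 agents the whole run finishes about 1.6x faster (35 seconds versus 57 seconds), even though each individual pass takes longer.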

Thanks, but please also add the Tree_Brut_TestPL_F and Tree_Brut_TestPL results for comparison!
