What data organization to use for finding... - page 3

 
Dominik Christian Egert #:
I am not sure I properly understand what you are trying to say.

This way you don't need to save (to hash) two numbers for x and y, but only one. So one(!) number identifies a single point in a two-dimensional space. A kind of 'Gödelisierung' (Gödel numbering) of the coordinates.

 
Carl Schreiber #:

This way you don't need to save (to hash) two numbers for x and y, but only one. So one(!) number identifies a single point in a two-dimensional space. A kind of 'Gödelisierung' (Gödel numbering) of the coordinates.

That is a smart simplification. Now I understand what you are thinking about...
 
Amir Yacoby #:
For triangles, for instance, does it need to return hits only from the borders or also from within?
Good question...

I think borders could be enough... But that's difficult for elliptical shapes. Areas are probably easier to solve, I guess.
 
Dominik Christian Egert #:
That is a smart simplification. Now I understand what you are thinking about...

But you have to check it. If the divisions of a ulong are performed via a cast to a double value, you might get a wrong number.

I once tried to store Chart IDs (type long) in the terminal's global variables (which are type double only), and I didn't get back the correct ID number - this works only for smaller numbers :(

 
Carl Schreiber #:
If the coordinates are limited, like 0-1000 for both, for a hash you can squeeze them together into one new number = x + 1000*y. Then x = number%1000 and y = number/1000 (one needs to check the rounding of big numbers)? The max of ulong (18 446 744 073 709 551 615) has 20 digits, so this would work up to 8 digits (99 999 999) for each coordinate.

I think this code improves your idea by avoiding the numeric type limits of the coordinates:

#include <Generic\Internal\HashFunction.mqh>

// Build a 64-bit hash code by tying together two 32-bit numbers
      
long GetHash(double price, datetime time)
  {
   // mask the low half to avoid sign-extension when GetHashCode returns a negative int
   return ((long)GetHashCode(price) << 32)
          | ((long)GetHashCode(time) & 0xFFFFFFFF);
  }

This is equivalent to: number = x + 4294967296*y

One problem is that it hashes the "exact" coordinates. To hash the "nearest bounding rectangle", we can approximate the price and time inputs (as before) before combining the two 32-bit hash values. 

Compared to using concatenated string keys directly, as in post #16, the above code does not provide any extra advantage (except that it might be a bit faster), because the CHashMap class eventually hashes the "composite" string key "price"+"time" to a single number.

Edit:

Hash collisions are not a problem here, because the CHashMap class has a built-in mechanism for resolving hash collisions (the separate chaining technique). Many hash collisions will just slow down the search, but the accuracy is not affected.

The hash function does not have to be complicated (simple functions are OK), as collisions are unavoidable. What matters is how those collisions are handled.
 

What about converting the double prices into integers by:

iPrc = int(close[i]/_Point);

and as this range is surely less than 0 - 99 999 999, one can shrink this range to enlarge the 'other side' = datetime.

For the time, you surely don't need all the seconds since 1970; subtract an offset:

sTme[i] = time[i] - D'2000.01.01 00:00:00';

As I think the %-operation is less critical concerning an internal cast than division, I would put the number with the 'longer' range (time) at the right end of the combined number and the smaller range on the left end - there is no need for both parts to have the same size.

 
amrali #:

I think this code improves your idea by avoiding the numeric type limits of the coordinates:

This is equivalent to: number = x + 4294967296*y

One problem is that it hashes the "exact" coordinates. To hash the "nearest bounding rectangle", we can approximate the price and time inputs (as before) before combining the two 32-bit hash values. 

Compared to using concatenated string keys directly, as in post #16, the above code does not provide any extra advantage (except that it might be a bit faster), because the CHashMap class eventually hashes the "composite" string key "price"+"time" to a single number.

Edit:

Hash collisions are not a problem here, because the CHashMap class has a built-in mechanism for resolving hash collisions (the separate chaining technique). Many hash collisions will just slow down the search, but the accuracy is not affected.

The hash function does not have to be complicated (simple functions are OK), as collisions are unavoidable. What matters is how those collisions are handled.
Yes, I see.

I was thinking about using float for price and using an offset for datetime to reduce the amount of bits needed. This way both would fit into a 64 bit ulong.

Instead of using an offset, a reduction to minutes or to 5 seconds per interval would probably be sufficient for coordinates.

The collisions are a bit of a concern to me, as the main goal is to have a fast search. The nature of the data, or the location of ABC patterns, means each pattern already produces 2 collisions on its own. Since ABC patterns tend to share common points, it might end up that multiple patterns share one and the same iE bar's low...

Since collisions are quite a frequent issue, I think it would make sense to take this fact into account from the beginning, rather than push it aside as an unavoidable "side effect".

100 ABC patterns would already introduce 200 collisions, not accounting for patterns sharing the same low or high point.

I think using hashes is a possible solution. But they hardly integrate with BSP trees. At least right now, I cannot see how to apply them.

A BSP tree would introduce O(log n) searches, which means that for around 300k objects a traversal of the tree would take up to 19 node visits, each node running some comparison function.

The Tree is able to give me the nearest neighbour.

A hash-based table would be constant in its access time and, with the methods for creating the hash described above, would be simple in its calculations. Thus it would be very fast.

A hash table will give a hit on an exact match, or with some blurriness, as introduced by the hash function.

To overcome this shortcoming of the hash table, additional entries could be created.
Depending on the geometric size of the object and its shape, this introduces additional computational overhead when creating an object.

For my current use case this would only affect trend lines, but it will get really heavy if complex shapes like ellipses or text come in as well.

Also, the question is still open: only the borders of a rectangle, or also its surface? Maybe that depends on whether the object has a filling or not.

Since the task is more complex and will involve quite some effort, I would like to create a more general solution rather than just target the current task at hand concerning the patterns.
