Libraries: JSON Library for LLMs - page 3

 

Thank you for the detailed review. Your points pushed the library forward. Here is the technical breakdown of what shipped in v3.4.0:

Adopted Optimizations

1. Zero-Allocation Serialization (No more strings)


You were absolutely right about IntegerToString() creating temporary MQL strings that put pressure on the memory allocator. I implemented PutRawInteger and PutRawDouble to write digits directly into the byte buffer.

// Old (Heap Allocation):
PutRaw(IntegerToString(GetInt(idx)), out, pos, cap);

// New v3.4.0 (Zero-Alloc, Direct Buffer Write):
void PutRawInteger(long value, uchar &out[], int &pos, int &cap) {
    // ... writes bytes '0'-'9' directly to out[] ...
}
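
As a rough illustration of the direct-buffer idea (not the library's actual code), here is a C++ sketch of an integer writer. The fixed-capacity buffer, the bool return value and the exact signature are assumptions made for this example; the real PutRawInteger works on an MQL5 uchar array with a growable capacity.

```cpp
#include <cstdint>

// Hypothetical C++ stand-in for PutRawInteger: writes the decimal digits of
// `value` into out[pos..] without creating any temporary string.
// Returns false (and leaves pos untouched) if the buffer is too small.
bool PutRawInteger(int64_t value, unsigned char* out, int& pos, int cap) {
    unsigned char tmp[20];  // an int64 magnitude has at most 19 digits
    int n = 0;

    // Use the unsigned magnitude so INT64_MIN does not overflow on negation.
    uint64_t mag = value < 0 ? 0 - static_cast<uint64_t>(value)
                             : static_cast<uint64_t>(value);
    do {
        tmp[n++] = static_cast<unsigned char>('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);

    const int needed = n + (value < 0 ? 1 : 0);
    if (pos + needed > cap) return false;

    if (value < 0) out[pos++] = '-';
    while (n > 0) out[pos++] = tmp[--n];  // digits were produced in reverse
    return true;
}
```

The digits land in a small stack array first because they come out least-significant first; copying them back in reverse avoids a second division pass.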

2. Hybrid Long/Double Parsing + Exp10 Table

I adopted your idea of accumulating integers as long (faster ALU ops) and using a lookup table for the fractional part (better FP precision). However, I added safety guards that were missing in your snippet:

// My implementation of your suggestion (with safety):
if (use_long && int_val < 922337203685477580) { // LONG_MAX / 10
    int_val = int_val * 10 + (c - '0');
} else {
    // Overflow guard: fall back to double once another digit could overflow long (around 19 digits)
    if (use_long) { val = (double)int_val; use_long = false; }
    val = val * 10.0 + (c - '0');
}

This gives us the speed of integers for 99.9% of cases, but safely handles massive numbers (e.g. 30-digit bigints) without silent overflow.
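
To make the fallback concrete, here is a self-contained C++ sketch of the same accumulate-then-switch loop. The function name ParseDigitsHybrid and its signature are hypothetical, and it handles only a plain run of digits (no sign, fraction or exponent).

```cpp
#include <cstdint>

// Accumulates ASCII digits in int64_t while that is safe, then switches to
// double so very long inputs lose precision instead of wrapping around.
double ParseDigitsHybrid(const char* s, int len) {
    const int64_t kSafeLimit = INT64_MAX / 10;  // 922337203685477580
    int64_t int_val = 0;
    double val = 0.0;
    bool use_long = true;

    for (int i = 0; i < len; ++i) {
        const int d = s[i] - '0';
        if (use_long && int_val < kSafeLimit) {
            int_val = int_val * 10 + d;  // cannot overflow below the limit
        } else {
            if (use_long) {              // first unsafe digit: hand over once
                val = static_cast<double>(int_val);
                use_long = false;
            }
            val = val * 10.0 + d;
        }
    }
    return use_long ? static_cast<double>(int_val) : val;
}
```

A 30-digit input therefore comes back as a double near 1.2e29 rather than as a wrapped negative integer.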


Rejected Suggestions (and why)

1. Using c > '9' as delimiter

This weakens RFC 8259 compliance. An input like 123abc would be silently accepted as 123. I kept the explicit checks:

if (c == '.' || c == 'e' || c == 'E') break; // Strict validation


2. Removing & 0xFF from GetType()
It is not redundant. long in MQL5 is signed. If a tape value has its sign bit set (bit 63, e.g. from a large offset), >> 56 performs an arithmetic shift and fills the high bits with 1s, so the result is negative. The mask is required for correctness.
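
The effect is easy to reproduce outside MQL5. The C++ sketch below assumes a tape word with the type tag in the top byte, as in the discussion; MakeWord, GetTypeUnmasked and GetTypeMasked are hypothetical helper names. It relies on arithmetic right shift of negative values, which C++20 guarantees and mainstream compilers have always done.

```cpp
#include <cstdint>

// Packs an 8-bit type tag into the top byte of a signed 64-bit word.
int64_t MakeWord(unsigned tag, uint64_t payload) {
    return static_cast<int64_t>((static_cast<uint64_t>(tag) << 56) | payload);
}

// Without the mask, a tag >= 0x80 comes back sign-extended (negative).
int GetTypeUnmasked(int64_t word) { return static_cast<int>(word >> 56); }

// With the mask, only the low 8 bits of the shifted value survive.
int GetTypeMasked(int64_t word) { return static_cast<int>((word >> 56) & 0xFF); }
```

For a tag of 0x05 both versions agree; for 0x85 the unmasked version yields -123 while the masked one yields 0x85.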


3. Hoisting if (i > 0) outside the loop
The CPU branch predictor handles this pattern (1 miss per loop) with negligible cost (~15 cycles). It does not justify code duplication.


v3.4.0 is now live. Thanks for the collaboration.

 
Jonathan Pereira #:

Rejected Suggestions (and why)

1. Using c > '9' as delimiter

This weakens RFC 8259 compliance. An input like 123abc would be silently accepted as 123. I kept the explicit checks:

I didn't understand the counterargument.
 

The RFC 8259 compliance argument is about the Robustness Principle (Postel's Law) vs. strictness in data interchange.

Ref: RFC 8259, Section 6 (Numbers)
The grammar is defined as:

number = [ minus ] int [ frac ] [ exp ]
int = zero / ( digit1-9 *DIGIT )
frac = decimal-point 1*DIGIT
exp = e [ minus / plus ] 1*DIGIT

The grammar does not allow trailing characters like 'a', 'b', '-', etc. A value like 123abc is not a number. It is a malformed token.

If the parser uses c > '9' as a stop condition, it consumes 123 and leaves abc in the buffer. In a high-speed parser, this behavior is ambiguous:

  1. Inside an array [123abc]: The parser reads 123, then the next token is abc (invalid). This eventually fails, but the error is "Invalid Token 'abc'" instead of the root cause "Malformed Number '123abc'". This makes debugging harder.
  2. Concatenated JSON (NDJSON): If I send {"val":123}abc, a loose parser might accept the object and ignore the garbage, potentially hiding data corruption issues in the stream.

By enforcing c == '.' || c == 'e' || c == 'E', we explicitly state: "These are the ONLY valid continuations for a number." Anything else triggers an immediate check against the structural delimiters (, } ] or whitespace) in the caller loop, or errors out instantly.

It is a design choice: Fail Fast vs. Fail Later. For financial data (prices/volumes), I prioritize Failing Fast on any ambiguity over saving 1 CPU cycle per digit.
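
For reference, the Section 6 grammar can be checked in one forward scan. The C++ sketch below is only an illustration of strict validation, not code from fast_json; IsJsonNumber is a hypothetical name.

```cpp
// Returns true only if s[0..len) matches the RFC 8259 number grammar exactly:
// [ minus ] int [ frac ] [ exp ], with nothing left over.
bool IsJsonNumber(const char* s, int len) {
    int i = 0;
    auto digit = [&](int k) { return k < len && s[k] >= '0' && s[k] <= '9'; };

    if (i < len && s[i] == '-') ++i;      // optional minus
    if (!digit(i)) return false;          // int needs at least one digit
    if (s[i] == '0') ++i;                 // a leading zero must stand alone
    else while (digit(i)) ++i;

    if (i < len && s[i] == '.') {         // optional frac: '.' 1*DIGIT
        ++i;
        if (!digit(i)) return false;
        while (digit(i)) ++i;
    }
    if (i < len && (s[i] == 'e' || s[i] == 'E')) {  // optional exp
        ++i;
        if (i < len && (s[i] == '+' || s[i] == '-')) ++i;
        if (!digit(i)) return false;
        while (digit(i)) ++i;
    }
    return i == len;  // leftover bytes such as "abc" make the token malformed
}
```

With this check, 123abc is rejected as a malformed number at the point where it is read, which is the fail-fast behavior described above.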

Hope that clarifies better!

 
Jonathan Pereira #:

Hope that clarifies better!

We're talking about the parser only in FastAtof, where the length n_len is already known. Therefore, my condition can't create even the slightest problem.
 
fxsaber # :
We're talking about the parser only in FastAtof , where the length n_len is already known. Therefore, my condition can't create even the slightest problem.

You are absolutely right. Since n_len is already determined by the parser before FastAtof is called, the bounds are strictly controlled, so the condition you suggested ( c > '9' ) is indeed safe and cannot cause problems.

I applied your optimization in the latest version (v3.4.2). To verify the performance impact and ensure robustness, I ran a live benchmark using real market data from the Binance API (OHLCV, Depth, Trades) against `JAson`. The results confirm that the parser is stable and very fast.


Environment: MetaTrader 5 build 5614, Intel Core i7-10750H @ 2.60GHz

| Payload type | Size | fast_json (parse) | JAson (parse) | Speedup |
|---|---|---|---|---|
| Klines (OHLCV), 100 candles x 12 fields | 17.9 KB | 134.7 µs | 2268.4 µs | 16.8x |
| Order book (Depth), 100 levels (nested arrays) | 6.4 KB | 45.3 µs | 626.0 µs | 13.8x |
| Recent trades, 100 trades (various fields) | 14.7 KB | 121.2 µs | 1747.6 µs | 14.4x |

Thanks again for insisting on the details. The code is cleaner and faster now.

🔗 Updated repository: GitHub/Forge

fast_json · 14134597 · forge.mql5.io
A JSON library designed for heavy use by LLMs and lower latency.
 
Jonathan Pereira #:
the condition you suggested ( c > '9' ) is indeed safe and cannot cause problems.

Interesting result.

template <typename T>
string ToBits( const T Value )
{
  string Str = NULL;
  
  for(uint i = sizeof(T) << 3; (bool)i--;)
    Str += (string)(int)(!!(Value & ((T)1 << i)));
    
  return(Str);
}

bool IsDigit( const uchar Char )
{
  return((bool)(Char & (1 << 4)));
}

void PrintChar( const uchar Char )
{
  Print(CharToString(Char) + ": " + ToBits(Char) + " - " + (string)IsDigit(Char));
}

void OnStart()
{
  for (uchar Char = '0'; Char <= '9'; Char++)
    PrintChar(Char);
    
  PrintChar('.');
  PrintChar('e');
  PrintChar('E');
}


Result.

0: 00110000 - true
1: 00110001 - true
2: 00110010 - true
3: 00110011 - true
4: 00110100 - true
5: 00110101 - true
6: 00110110 - true
7: 00110111 - true
8: 00111000 - true
9: 00111001 - true
.: 00101110 - false
e: 01100101 - false
E: 01000101 - false


Therefore, the two comparisons can be replaced with a single check.

// if (c == '.' || c > '9') // Pre-validated token: c>'9' catches e/E
if (!(c & (1 << 4))) 
 
A more versatile option.
bool IsDigit( const uchar Char )
{
//  return((bool)(Char & (1 << 4)));
  return((Char & 0xF0) == 0x30);
}

void OnStart()
{
  for (int Char = 0; Char <= UCHAR_MAX; Char++)
    PrintChar((uchar)Char);
}


Result.

... - false

.: 00101110 - false
/: 00101111 - false
0: 00110000 - true
1: 00110001 - true
2: 00110010 - true
3: 00110011 - true
4: 00110100 - true
5: 00110101 - true
6: 00110110 - true
7: 00110111 - true
8: 00111000 - true
9: 00111001 - true
:: 00111010 - true
;: 00111011 - true
<: 00111100 - true
=: 00111101 - true
>: 00111110 - true
?: 00111111 - true
@: 01000000 - false
A: 01000001 - false

... - false
 
I took a closer look at your source code.

It would make sense to create a counter for each condition.

int Counter[9] = {0};

if (c == '{') {
  // ...
  Counter[0]++;
} else if (c == '[') {
  // ...
  Counter[1]++;
} else if (c == '"') {
  // ...
  Counter[2]++;
} else if (c == 't') {
  // ...
  Counter[3]++;
} else if (c == 'f') {
  // ...
  Counter[4]++;
} else if (c == 'n') {
  // ...
  Counter[5]++;
} else if (g_cc[c] == CC_DIGIT) {
  // ...
  Counter[6]++;
} else {
  // ...
  Counter[7]++;
}

And arrange the conditions in descending order of the counter.
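
A rough way to try this offline is sketched below in C++. It is a simplified profiler, not the library's dispatch loop: the Branch enum, CountDispatch and RankBranches are hypothetical names, and the scanner only skips string and number bodies so that their contents are not miscounted.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>
#include <string>

enum Branch { kObject, kArray, kString, kTrue, kFalse, kNull, kDigit, kOther, kBranchCount };

// Counts how often each branch of the if/else chain would be taken.
std::array<long, kBranchCount> CountDispatch(const std::string& json) {
    std::array<long, kBranchCount> counter{};
    std::size_t i = 0;
    while (i < json.size()) {
        const unsigned char c = static_cast<unsigned char>(json[i]);
        if (c == '{') { counter[kObject]++; i++; }
        else if (c == '[') { counter[kArray]++; i++; }
        else if (c == '"') {
            counter[kString]++;
            i++;  // skip the body, stepping over backslash escapes
            while (i < json.size() && json[i] != '"') i += (json[i] == '\\') ? 2 : 1;
            i++;  // closing quote
        } else if (c == 't' || c == 'f' || c == 'n') {
            counter[c == 't' ? kTrue : c == 'f' ? kFalse : kNull]++;
            while (i < json.size() && json[i] >= 'a' && json[i] <= 'z') i++;
        } else if (c == '-' || (c >= '0' && c <= '9')) {
            counter[kDigit]++;
            while (i < json.size() && ((json[i] >= '0' && json[i] <= '9') ||
                   json[i] == '+' || json[i] == '-' || json[i] == '.' ||
                   json[i] == 'e' || json[i] == 'E')) i++;
        } else { counter[kOther]++; i++; }
    }
    return counter;
}

// Returns branch indices sorted by descending hit count (ties keep source order).
std::array<int, kBranchCount> RankBranches(const std::array<long, kBranchCount>& counter) {
    std::array<int, kBranchCount> order;
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return counter[a] > counter[b]; });
    return order;
}
```

Running it over representative payloads (for example the Binance samples from the benchmark) would show which branches deserve to be tested first; the ranking depends entirely on the data.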
 

For numbers, you make two passes over the buffer: one to find the end of the token, and a second inside FastAtof/FastAtoi to convert it.

        } else if (g_cc[c] == CC_DIGIT) {
          int start = cur;
          bool is_float = false;
          while (cur < len) {
            uchar cc = buffer[cur];
            if (cc == '.' || cc == 'e' || cc == 'E')
              is_float = true;
            else if (cc != '+' && g_cc[cc] != CC_DIGIT)
              break;
            cur++;
          }
          int n_len = cur - start;
          int idx = tape_pos++;
          if (is_float) {
            tape[idx] = ((long)J_DBL << 56) | 2;
            tape[tape_pos++] = DBL2LONG(FastAtof(start, n_len));
          } else {
            tape[idx] = ((long)J_INT << 56) | 2;
            tape[tape_pos++] = FastAtoi(start, n_len);
          }
          sp--;
        }
But you can get by with a single pass through the array.
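
One possible shape for a single-pass version is sketched below in C++. It is an assumption-laden illustration rather than a drop-in replacement: the Number struct, ScanNumber and Pow10 are hypothetical, the token is assumed to start with '-' or a digit, and the long-overflow guard discussed earlier is left out for brevity.

```cpp
#include <cstdint>

struct Number {
    bool is_float;  // true if a '.', 'e' or 'E' was seen
    int64_t i;      // value when !is_float
    double d;       // value when is_float
    int length;     // bytes consumed
};

static bool IsDigit(unsigned char c) { return (c ^ '0') <= 9; }

static double Pow10(int k) {  // exact for small k; fine for a sketch
    double p = 1.0;
    while (k-- > 0) p *= 10.0;
    return p;
}

// Finds the end of the number and accumulates its value in the same loop.
Number ScanNumber(const unsigned char* buf, int cur, int len) {
    const int start = cur;
    bool neg = false, is_float = false, exp_neg = false;
    int64_t mant = 0;  // all digits seen, decimal point ignored
    int frac_digits = 0, exp10 = 0;

    if (cur < len && buf[cur] == '-') { neg = true; ++cur; }
    while (cur < len && IsDigit(buf[cur])) mant = mant * 10 + (buf[cur++] - '0');
    if (cur < len && buf[cur] == '.') {
        is_float = true; ++cur;
        while (cur < len && IsDigit(buf[cur])) { mant = mant * 10 + (buf[cur++] - '0'); ++frac_digits; }
    }
    if (cur < len && (buf[cur] == 'e' || buf[cur] == 'E')) {
        is_float = true; ++cur;
        if (cur < len && (buf[cur] == '+' || buf[cur] == '-')) exp_neg = (buf[cur++] == '-');
        while (cur < len && IsDigit(buf[cur])) exp10 = exp10 * 10 + (buf[cur++] - '0');
    }

    Number n{is_float, 0, 0.0, cur - start};
    if (is_float) {
        const int scale = (exp_neg ? -exp10 : exp10) - frac_digits;
        const double m = static_cast<double>(mant);
        n.d = scale >= 0 ? m * Pow10(scale) : m / Pow10(-scale);
        if (neg) n.d = -n.d;
    } else {
        n.i = neg ? -mant : mant;
    }
    return n;
}
```

The caller gets the value, the int/float decision and the consumed length from one scan, so FastAtof and FastAtoi no longer need to revisit the bytes.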
 


Ideal.

bool IsDigit( const uchar Char )
{
//  return((Char >= '0') && (Char <= '9'));

//  return((Char ^ (UCHAR_MAX - '1')) >= (UCHAR_MAX - 9));
//  return((Char ^ '1') <= 9);

//  return((Char ^ (UCHAR_MAX - '0')) >= (UCHAR_MAX - 9));
  return((Char ^ '0') <= 9);
}
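
The three variants from this thread can be compared exhaustively over all 256 byte values. The C++ sketch below does that; the function names are made up for the comparison and the logic mirrors the MQL5 one-liners above.

```cpp
// Reference: the plain range check.
inline bool IsDigitRange(unsigned char c) { return c >= '0' && c <= '9'; }

// Variant 1: test bit 4 only (valid just for pre-validated number tokens).
inline bool IsDigitBit4(unsigned char c) { return (c & (1 << 4)) != 0; }

// Variant 2: compare the high nibble (also accepts ':' through '?').
inline bool IsDigitNibble(unsigned char c) { return (c & 0xF0) == 0x30; }

// Variant 3: XOR with '0' maps '0'..'9' to 0..9 and everything else above 9.
inline bool IsDigitXor(unsigned char c) { return (c ^ '0') <= 9; }

// Counts the bytes in 0..255 on which `test` disagrees with the range check.
template <typename F>
int CountMismatches(F test) {
    int mismatches = 0;
    for (int c = 0; c <= 255; ++c) {
        const unsigned char ch = static_cast<unsigned char>(c);
        if (test(ch) != IsDigitRange(ch)) ++mismatches;
    }
    return mismatches;
}
```

Only the XOR form agrees with the range check on every byte. The nibble form is wrong on the six characters ':' to '?', and the bit-4 form is safe only where the input alphabet is already restricted, as inside FastAtof.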