Read and parse an HTML page

 

Does anyone have experience with reading and parsing an HTML page starting with

<meta charset=utf-8>

to save the table rows in a csv-file.

I try to open the already downloaded, local files with

int fHdl = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON);

This works in the sense that the whole file can be read at once, but e.g. "€" gets garbled (it no longer shows up as "€") when I open the CSV file in LO Calc

:(

If I try:

int fHdl = FileOpen(fHTML,FILE_READ|FILE_TXT|FILE_ANSI|FILE_COMMON, 0, CP_UTF8);
string wrkHtml = FileReadString(fHdl,(int)FileSize(fHdl));
Print("open "+fHTML+" hdl:"+(string)fHdl+" size: "+(string)StringLen(wrkHtml)+" "+StringSubstr(wrkHtml,0,60));

it doesn't work at all, as only the first 40 characters are read:

2025.01.03 17:13:31.783    createETF-Tabelle (EURUSD,H1)    106 e:5035 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21:07:08).html Html-Size:619513 String-Size: 40 <!DOCTYPE html> <html lang=en style><!--

If I try

   int fHdl = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON, 0, CP_UTF8);
   if (fHdl==INVALID_HANDLE) {
      // nothing to close here: the handle is invalid
      Prt(fHTML+" "+(string)fHdl+"  FAILED");
      return(false);
   }
   // in this mode FileReadString returned only 4095 characters (see the log below)
   string wrkHtml = FileReadString(fHdl,(int)FileSize(fHdl));
   Prt("open "+fHTML+" Html-Size:"+(string)FileSize(fHdl)+" String-Size: "+(string)StringLen(wrkHtml)+" "+StringSubstr(wrkHtml,0,60));

I get:

2025.01.03 17:09:43.655    createETF-Tabelle (EURUSD,H1)    106 e:0 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21:05:15).html Html-Size:619471 String-Size: 4095 <!DOCTYPE html> <html lang=en style><!-- Page saved with Si


Instead of reading 619471 characters I only get the first 4095 :(

I am adding each table row with:

int fH = FileOpen(fName,FILE_READ|FILE_WRITE|FILE_BIN|FILE_COMMON);
FileSeek(fH,0,SEEK_END);
FileWriteString(fH, line, StringLen(line) );
FileClose(fH);
 
Can you post the html file ?
 
Alain Verleyen #:
Can you post the html file ?

Sure

 
Your topic has been moved to the section: Expert Advisors and Automated Trading
 
All of that is normal. If you open the file as TXT in UTF-8 and then use FileReadString, it reads only up to the next line ending, so you can't read the whole file at once this way (the length parameter is ignored in text mode).
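For illustration, a minimal sketch of reading a file line by line in text mode, assuming the file is UTF-8 and sits in the common folder as in the question. The helper name ReadTextFileUtf8 is hypothetical, and re-adding "\n" after each line is only an assumption about how the lines should be joined:

```mql5
// Sketch only: ReadTextFileUtf8 is a hypothetical helper, not part of MQL5.
// It reads a UTF-8 text file line by line and joins the lines into one string.
bool ReadTextFileUtf8(const string fileName, string &content)
  {
   content = "";
   // FILE_TXT + FILE_ANSI with CP_UTF8: every FileReadString call returns one line
   int handle = FileOpen(fileName, FILE_READ|FILE_TXT|FILE_ANSI|FILE_COMMON, 0, CP_UTF8);
   if(handle == INVALID_HANDLE)
     {
      PrintFormat("FileOpen failed for %s, error %d", fileName, GetLastError());
      return(false);
     }
   while(!FileIsEnding(handle))
     {
      // the line ending is consumed by FileReadString, so a newline is re-added (assumption)
      content += FileReadString(handle) + "\n";
     }
   FileClose(handle);
   return(true);
  }
```

For a file of about 600 KB the repeated string concatenation may be slow, which is one reason the byte-array route can be preferable.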

It seems that FileReadString on a BIN file has a limit of 4096 characters (4095 plus the terminating '\0'). I was not aware of this, but it seems understandable for reading a string from a BIN file.

Of course, as usual, the documentation is unclear or incomplete; we have to live with it.

So one way to go is to use BIN, but with a char array.

  int fHdl   = FileOpen(fName,FILE_READ|FILE_BIN|FILE_ANSI, 0, CP_UTF8);
  char array[];
  uint read  = FileReadArray(fHdl,array);   // reads the whole file as raw bytes
  FileClose(fHdl);
 
I suggest you parse your HTML code in the box provided at Blogcrowds HTML Parser

Then you can combine it with your HTML code.

 
Alain Verleyen #:

It seems that FileReadString on a BIN file has a limit of 4096 characters (4095 plus the terminating '\0'). I was not aware of this, but it seems understandable for reading a string from a BIN file.

Of course, as usual, the documentation is unclear or incomplete; we have to live with it.

So one way to go is to use BIN, but with a char array.

Well if you use "FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON);" WITHOUT ", 0, CP_UTF8"

   int fHdl = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON); 
   if (fHdl<0) {
      return(false);
   }
   string wrkHtml = FileReadString(fHdl,(int)FileSize(fHdl));
   debug Prt("open "+fHTML+" Html-Size:"+(string)FileSize(fHdl)+" String-Size: "+(string)StringLen(wrkHtml)+" "+StringSubstr(wrkHtml,0,60));

It reads the whole file at once:

2025.01.03 19:06:31.196    createETF-Tabelle (EURUSD,H1)    106 e:0 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21:05:15).html Html-Size:619471 String-Size: 619471 <!DOCTYPE html> <html lang=en style><!-- Page saved with Si

MQL5 is probably tripping itself up. I'll try your suggestion with char...

 

You can find some useful info in the following places:

None of these have problems reading Unicode text.

MQL5 Book: Common APIs / Working with files / Selecting an encoding for text mode (www.mql5.com)
For written text files, the encoding should be chosen based on the characteristics of the text or adjusted to the requirements of external programs...
 
Carl Schreiber #:

Well if you use "FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON);" WITHOUT ", 0, CP_UTF8"

It reads the whole file at once:

MQL5 is probably tripping itself up. I'll try your suggestion with char...

Interesting; probably a bug when using UTF-8 that introduces this 4096 limit.
 

SOLVED! This works:

   int fHdl   = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON ,0, CP_UTF8);
   if (fHdl==INVALID_HANDLE) {
      // nothing to close here: the handle is invalid
      Prt(fHTML+" "+(string)fHdl+"  FAILED");
      return(false);
   }
   char cArr[];
   uint read  = FileReadArray(fHdl,cArr);                            // raw bytes of the whole file
   string wrkHtml = CharArrayToString(cArr,0,WHOLE_ARRAY,CP_UTF8);   // decode the bytes as UTF-8
   Print(fHTML+" Html-Size:"+(string)FileSize(fHdl)+" Array-Size: "+(string)ArraySize(cArr)+", read:"+(string)read+", String-Size: "+(string)StringLen(wrkHtml) );
   FileClose(fHdl);

As you can see, the file size, the array size and the number of elements read are equal, and the string is only two characters shorter:

2025.01.03 19:27:11.867    createETF-Tabelle (EURUSD,H1)    114 e:0 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21:06:19).html Html-Size:619450, Array-Size: 619450, read:619450, String-Size: 619448

and the "€" remains "€":

€ 11,560,000 (EU)
...
¥ 806,380,000 (China)
...
¥ 5,130,000,000 (Japan)
...
₩ 1,010,000,000 (Korea)

It's a bit like “from behind through the chest into the eye” (a German saying for overcomplicated).
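
On the writing side, a similar byte-array route may help keep the CSV in UTF-8. This is only a sketch: AppendUtf8Line is a hypothetical helper name, and the "\r\n" line ending and the FILE_COMMON flag are assumptions carried over from the snippets above.

```mql5
// Sketch only: AppendUtf8Line is a hypothetical helper, not part of MQL5.
// It appends one line to a file as explicit UTF-8 bytes.
bool AppendUtf8Line(const string fileName, const string line)
  {
   int handle = FileOpen(fileName, FILE_READ|FILE_WRITE|FILE_BIN|FILE_COMMON);
   if(handle == INVALID_HANDLE)
      return(false);
   uchar bytes[];
   // convert the line plus an assumed "\r\n" ending to UTF-8 bytes
   int n = StringToCharArray(line + "\r\n", bytes, 0, WHOLE_ARRAY, CP_UTF8);
   // drop the terminating zero if it was copied, so it does not end up in the file
   if(n > 0 && bytes[n-1] == 0)
      ArrayResize(bytes, n-1);
   FileSeek(handle, 0, SEEK_END);               // append at the end of the file
   uint written = FileWriteArray(handle, bytes);
   FileClose(handle);
   return(written == (uint)ArraySize(bytes));
  }
```

Opening and closing the file for every row mirrors the original approach; keeping the handle open while writing all rows would also be possible.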

 
I guess you will be using WebRequest later on, and then you will probably have issues again, because the result array is also raw and not the same as the one you are getting now.

Also, it is probably more efficient to work directly with the char array instead of converting to string type.

(Are string variables still limited in size?)
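
For the WebRequest case, the same decoding step might carry over. The sketch below is hedged accordingly: DownloadHtmlUtf8 is a hypothetical helper name, the 5000 ms timeout is arbitrary, and it assumes that the server really sends UTF-8 and that the URL has been added to the terminal's list of allowed URLs.

```mql5
// Sketch only: DownloadHtmlUtf8 is a hypothetical helper, not part of MQL5.
// It fetches a page with WebRequest and decodes the raw reply bytes as UTF-8.
bool DownloadHtmlUtf8(const string url, string &html)
  {
   char   request[];          // empty body for a GET request
   char   response[];         // raw bytes of the reply
   string responseHeaders;
   ResetLastError();
   int status = WebRequest("GET", url, NULL, 5000, request, response, responseHeaders);
   if(status == -1)
     {
      PrintFormat("WebRequest failed, error %d", GetLastError());
      return(false);
     }
   // assumption: the reply body is UTF-8 encoded
   html = CharArrayToString(response, 0, WHOLE_ARRAY, CP_UTF8);
   return(status == 200);
  }
```

If the parsing is done directly on the byte array instead, the conversion line can simply be left out.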