Read and parse an HTML page

Carl Schreiber 2025.01.03 17:22

Does anyone have experience with reading and parsing an HTML page starting with

<meta charset=utf-8>

to save the table rows in a csv-file.

I try to open the already downloaded, local files with

int fHdl = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON);

This works in the sense that the whole file can be read at once, but e.g. "€" becomes "â‚¬" when I open the CSV file in LO Calc.
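A minimal sketch of what is probably going on here: the file stores "€" as three UTF-8 bytes, and decoding those bytes with an ANSI code page turns each byte into its own character. The script below is only an illustration and is not from the thread; it assumes a Windows terminal whose default ANSI code page (CP_ACP) is something like Windows-1252.

```mql5
// Script sketch: decode the UTF-8 bytes of "€" with two different code pages.
// Assumption: CP_ACP resolves to a single-byte code page such as Windows-1252.
void OnStart()
  {
   string euro = ShortToString(0x20AC);   // "€" as one UTF-16 character
   uchar  bytes[];
   // Encode as UTF-8: "€" becomes the bytes E2 82 AC (plus a terminating zero)
   StringToCharArray(euro, bytes, 0, WHOLE_ARRAY, CP_UTF8);
   // Decode the same bytes with the default ANSI code page and with UTF-8
   string asAnsi = CharArrayToString(bytes, 0, WHOLE_ARRAY, CP_ACP);
   string asUtf8 = CharArrayToString(bytes, 0, WHOLE_ARRAY, CP_UTF8);
   PrintFormat("ANSI: %s (len %d)  UTF-8: %s (len %d)",
               asAnsi, StringLen(asAnsi), asUtf8, StringLen(asUtf8));
  }
```

If the assumption holds, the ANSI decoding should come out three characters long and the UTF-8 decoding one character long.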

If I try:

int fHdl = FileOpen(fHTML,FILE_READ|FILE_TXT|FILE_ANSI|FILE_COMMON, 0, CP_UTF8);
string wrkHtml = FileReadString(fHdl,(int)FileSize(fHdl));
Print("open "+fHTML+" hdl:"+(string)fHdl+" size: "+(string)StringLen(wrkHtml)+" "+StringSubstr(wrkHtml,0,60));

it doesn't work at all, as only the first 40 characters are read:

2025.01.03 17:13:31.783 createETF-Tabelle (EURUSD,H1) 106 e:5035 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21：07：08).html Html-Size:619513 String-Size: 40 <!DOCTYPE html> <html lang=en style><!--

If I try

   int fHdl = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON, 0, CP_UTF8);
   if (fHdl==INVALID_HANDLE) {
      // nothing to close here: the handle is invalid
      Prt(fHTML+" "+(string)fHdl+"  FAILED");
      return(false);
   }
   string wrkHtml = FileReadString(fHdl,(int)FileSize(fHdl));
   Prt("open "+fHTML+" Html-Size:"+(string)FileSize(fHdl)+" String-Size: "+(string)StringLen(wrkHtml)+" "+StringSubstr(wrkHtml,0,60));

I get:

2025.01.03 17:09:43.655 createETF-Tabelle (EURUSD,H1) 106 e:0 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21：05：15).html Html-Size:619471 String-Size: 4095 <!DOCTYPE html> <html lang=en style><!-- Page saved with Si

Instead of reading 619471 characters I only get the first 4095 :(

I am adding each table row with:

int fH = FileOpen(fName,FILE_READ|FILE_WRITE|FILE_BIN|FILE_COMMON);
FileSeek(fH,0,SEEK_END);
FileWriteString(fH, line, StringLen(line) );
FileClose(fH);

Alain Verleyen 2025.01.03 17:32 #1

Can you post the html file ?

Carl Schreiber 2025.01.03 17:51 #2

Alain Verleyen #:
Can you post the html file ?

Sure

Files:

ETF_TABLE_h2024-12-28_16j52c40o.zip 166 kb

Fernando Carreiro 2025.01.03 18:11 #3

Your topic has been moved to the section: Expert Advisors and Automated Trading

Alain Verleyen 2025.01.03 18:44 #4

All of that is normal. If you open the file as TXT in UTF8 and then use FileReadString, it reads only up to the first line-end character, so you can't read everything at once this way (the length parameter is ignored).
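If TXT mode is preferred anyway, the usual workaround is to loop until the end of the file and glue the lines back together. A minimal sketch, assuming the file sits in the common folder as in the first post; ReadTextFile is a hypothetical helper name, not something from the thread:

```mql5
// Sketch: read a whole UTF-8 text file line by line in TXT mode.
// ReadTextFile is a hypothetical helper, not an MQL5 built-in.
bool ReadTextFile(const string fileName, string &text)
  {
   int h = FileOpen(fileName, FILE_READ|FILE_TXT|FILE_ANSI|FILE_COMMON, 0, CP_UTF8);
   if(h == INVALID_HANDLE)
      return(false);
   text = "";
   // In TXT mode each FileReadString call returns one line without its line break
   while(!FileIsEnding(h))
      text += FileReadString(h) + "\n";
   FileClose(h);
   return(true);
  }
```

Note that this normalizes all line breaks to "\n", which should not matter for HTML parsing but does change the byte count.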

It seems FileReadString on a BIN file has a limit of 4096 characters (4095 + the terminating '\0'). I was not aware of this, but it seems understandable for reading a string from a BIN file.

Of course, as usual, the documentation is unclear or incomplete, we have to live with it.

So one way to go is using BIN but with a char array.

  int fHdl   = FileOpen(fName,FILE_READ|FILE_BIN|FILE_ANSI, 0, CP_UTF8);
  char array[];
  uint read  = FileReadArray(fHdl,array);

Roberto Jacobs 2025.01.03 18:49 #5

I suggest you parse your HTML code in the box provided at Blogcrowds HTML Parser

Then you can combine it with your HTML code.

Carl Schreiber 2025.01.03 19:12 #6

Alain Verleyen #:

Seems the FileReadString using a BIN file had a limitation of 4096 characters (4095 + the '\0' end). I was not aware about this, but that seems understandable for a BIN file reading string.

Of course, as usual, the documentation is unclear or incomplete, we have to live with it.

So one way to go is using BIN but with a char array.

Well if you use "FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON);" WITHOUT ", 0, CP_UTF8"

   int fHdl = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON); 
   if (fHdl<0) {
      return(false);
   }
   string wrkHtml = FileReadString(fHdl,(int)FileSize(fHdl));
   debug Prt("open "+fHTML+" Html-Size:"+(string)FileSize(fHdl)+" String-Size: "+(string)StringLen(wrkHtml)+" "+StringSubstr(wrkHtml,0,60));

It reads the whole file at once:

2025.01.03 19:06:31.196 createETF-Tabelle (EURUSD,H1) 106 e:0 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21：05：15).html Html-Size:619471 String-Size: 619471 <!DOCTYPE html> <html lang=en style><!-- Page saved with Si

MQL5 is probably tripping itself up. I'll try your suggestion with char...

Stanislav Korotky 2025.01.03 19:15 #7

You can find some useful info in the following places:

algotrading book;
the article on parsing MQL5 source files, specifically the section FileReader;
and the article on parsing HTML-pages;

All of these do not have problems with reading texts in Unicode.

MQL5 Book: Common APIs / Working with files / Selecting an encoding for text mode

www.mql5.com

For written text files, the encoding should be chosen based on the characteristics of the text or adjusted to the requirements of external programs...

Alain Verleyen 2025.01.03 19:32 #8

Carl Schreiber #:

Well if you use "FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON);" WITHOUT ", 0, CP_UTF8"

It reads the whole file at once:

MQL5 is probably tripping itself up. I'll try your suggestion with char...

Interesting. Probably some bug when using UTF8 introduces this 4096 limit.

Carl Schreiber 2025.01.03 19:39 #9

SOLVED! This works:

   int fHdl   = FileOpen(fHTML,FILE_READ|FILE_BIN|FILE_ANSI|FILE_COMMON ,0, CP_UTF8);
   if (fHdl==INVALID_HANDLE) {
      // nothing to close here: the handle is invalid
      Prt(fHTML+" "+(string)fHdl+"  FAILED");
      return(false);
   }
   char cArr[];
   // read the raw bytes of the whole file, then decode them as UTF-8
   uint read  = FileReadArray(fHdl,cArr);
   string wrkHtml = CharArrayToString(cArr,0,WHOLE_ARRAY,CP_UTF8);
   Print(fHTML+" Html-Size:"+(string)FileSize(fHdl)+" Array-Size: "+(string)ArraySize(cArr)+", read:"+(string)read+", String-Size: "+(string)StringLen(wrkHtml) );
   FileClose(fHdl);

See, the sizes of the file and the array are equal, and the string is only two characters shorter:

2025.01.03 19:27:11.867 createETF-Tabelle (EURUSD,H1) 114 e:0 open ETF-Tabelle\USA\ETF TABLE (2024-12-27 21：06：19).html Html-Size:619450, Array-Size: 619450, read:619450, String-Size: 619448

and the "€" remains "€":

€ 11,560,000 (EU)
...
¥ 806,380,000 (China)
...
¥ 5,130,000,000 (Japan)
...
₩ 1,010,000,000 (Korea)

It's a bit like “from behind through the chest into the eye” (a German saying for overcomplicated).

Dominik Egert 2025.01.04 09:34 #10

Carl Schreiber #:

SOLVED! This works:

See, the sizes of the file and the array are equal, and the string is only two characters shorter:

and the "€" remains "€":

It's a bit like “from behind through the chest into the eye” (a German saying for overcomplicated).

I guess you will be using WebRequest later on, and then you will probably have issues again, because the result array is also raw and not the same as the one you are getting now.

Also, it is probably more efficient to work directly with the char array instead of converting to string type.
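As a rough illustration of working on the bytes directly, the sketch below searches a raw uchar buffer for a tag without converting it to a string first. FindBytes is a hypothetical helper written for this example, not an MQL5 built-in, and the tiny HTML snippet stands in for the array returned by FileReadArray.

```mql5
// Sketch: find a byte pattern in a raw uchar buffer.
// FindBytes is a hypothetical helper, not part of the MQL5 API.
int FindBytes(const uchar &hay[], const uchar &needle[], const int needleLen, const int from = 0)
  {
   if(needleLen <= 0)
      return(-1);
   int last = ArraySize(hay) - needleLen;
   for(int i = (from < 0 ? 0 : from); i <= last; i++)
     {
      int j = 0;
      while(j < needleLen && hay[i + j] == needle[j])
         j++;
      if(j == needleLen)
         return(i);        // index of the first matching byte
     }
   return(-1);             // pattern not found
  }

void OnStart()
  {
   uchar page[], tag[];
   // Stand-in for the bytes read from the HTML file
   StringToCharArray("<table><tr><td>1</td></tr></table>", page, 0, WHOLE_ARRAY, CP_UTF8);
   int n = StringToCharArray("<tr", tag, 0, WHOLE_ARRAY, CP_UTF8);
   if(n > 0 && tag[n - 1] == 0)
      n--;                 // do not search for the terminating zero
   Print("first <tr at byte ", FindBytes(page, tag, n));
  }
```

Only the cell contents that end up in the CSV would then need CharArrayToString with CP_UTF8, which keeps the conversions small.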

(Are string variables still limited in size?)
