Read and parse an html page - page 2

 
Carl Schreiber #:
   string wrkHtml = CharArrayToString(cArr, 0, WHOLE_ARRAY, CP_UTF8);
   Print(fHTML+" Html-Size:"+(string)FileSize(fHdl)+" Array-Size: "+(string)ArraySize(cArr)+", read:"+(string)read+", String-Size: "+(string)StringLen(wrkHtml));

In MQL5 you can use FileLoad() function:

void OnStart()
  {
   const string fHTML = "ETF TABLE (2024-12-28 16:52:40).html";
   uchar buffer[];
   long count = FileLoad(fHTML, buffer/*, FILE_COMMON */);
   if(count==-1)
     {
      PrintFormat("FileLoad() failed, error=%d",GetLastError());
      return;
     }

   string wrkHtml = CharArrayToString(buffer, 0, WHOLE_ARRAY, CP_UTF8);

   PrintFormat("%s Buffer-Size: %d, count: %I64d, String-Size: %d", fHTML, ArraySize(buffer), count, StringLen(wrkHtml));
  }

// ETF TABLE (2024-12-28 16:52:40).html Buffer-Size: 504059, count: 504059, String-Size: 503861


Also try my String Manipulation Functions Library which will be helpful for parsing the text.


Edit:

If you intend to use regular expressions to parse the text, take a look at this: Regular Expressions Tester for MQL4

 
amrali #:
...

Edit:

If you intend to use regular expressions to parse the text, take a look at this: Regular Expressions Tester for MQL4

Your code is working with the MetaQuotes regular expression library, right? Which is some kind of "clone" of the C# library, I think.

Did you already check how efficiently it is implemented? I am wondering if it is usable as-is, or if it could be worth building a custom library.

 
Alain Verleyen #:

Your code is working with the MetaQuotes regular expression library, right? Which is some kind of "clone" of the C# library, I think.

Did you already check how efficiently it is implemented? I am wondering if it is usable as-is, or if it could be worth building a custom library.

Generally, regex is slower than plain string-processing functions, but with regex you can parse text in all imaginable ways, especially when combined with some general string functions.

Yes, the implementation is a clone of the C# one, which I think should be efficient. However, we don't have another regex library in MQL to compare the performance against, and writing a custom regex lib is a tedious job.

 
Dominik Egert #:

1) I guess, you will be using WebRequest later on, and then you probably will have issues again, because the result array is also raw, and not the same as the one you are getting now.

2) Also, it is probably more efficient to work directly with the char array instead of converting to string type.

(Are string variables still limited in size?)

add 1) No, as the 413 files were already downloaded manually (web crawling is not allowed!) - maybe later with different websites.

add 2) I don't think so, as I created a CSV file to load and handle in Excel (LO Calc), and strings are easier to handle with the MQ functions like searching, substringing, replacing, adding ...
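The CSV step mentioned above can be sketched with the built-in file functions. This is a minimal, hypothetical example: the file name "etf_table.csv" and the column names/values are made up for illustration, not taken from the actual files.

```mql5
// Sketch: write parsed table rows to a semicolon-separated CSV file.
// File name and field values are hypothetical examples.
void WriteCsvExample()
  {
   int fh = FileOpen("etf_table.csv", FILE_WRITE|FILE_CSV|FILE_ANSI, ';');
   if(fh==INVALID_HANDLE)
     {
      PrintFormat("FileOpen() failed, error=%d", GetLastError());
      return;
     }
   // header row, then one data row per parsed HTML table row
   FileWrite(fh, "Name", "ISIN", "Region");
   FileWrite(fh, "Some ETF", "IE00B4L5Y983", "World");
   FileClose(fh);
  }
```

FILE_CSV with the ';' delimiter lets FileWrite() insert the separators automatically, so each call emits one complete row.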

 
amrali #:

In MQL5 you can use FileLoad() function:


Also try my String Manipulation Functions Library which will be helpful for parsing the text.


Edit:

If you intend to use regular expressions to parse the text, take a look at this: Regular Expressions Tester for MQL4

Thanks! FileLoad() is really easier, and you don't have to carefully manage the file handles. :)

I don't need RegEx. With a loop like while(StringFind(wrkHtml, "tabulator-field=", pos) >= 0) I jump from field to field until I reach the last field, "Region". Then I add the table row and start with the next row. It's quite easy.
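That field-to-field jump could look roughly like this. The attribute layout (tabulator-field="...") is an assumption about the saved HTML pages, not verified against the actual files, and the row-handling is left as a stub:

```mql5
// Sketch of the described StringFind() loop over the loaded HTML string.
// Assumes wrkHtml already holds the page text (see FileLoad() above).
int pos = 0;
while((pos = StringFind(wrkHtml, "tabulator-field=", pos)) >= 0)
  {
   // the field name sits between the quotes right after the marker
   int q1 = StringFind(wrkHtml, "\"", pos) + 1;
   int q2 = StringFind(wrkHtml, "\"", q1);
   if(q1<=0 || q2<0)
      break;                               // malformed remainder: stop
   string field = StringSubstr(wrkHtml, q1, q2-q1);

   if(field=="Region")
     {
      // last field of the row: emit the collected row, start the next one
     }
   pos = q2;                               // continue searching after this field
  }
```

Tracking the search position this way means each field is scanned only once, which is why plain StringFind() is fast enough here without regex.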

But it's good to know RegEx exists for MQL!

Thanks!

 
Carl Schreiber #:

add 1) No, as the 413 files were already downloaded manually (web crawling is not allowed!) - maybe later with different websites.

Just FYI, doing it manually is ALSO web crawling.

 
Alain Verleyen #:

Just FYI, doing it manually is ALSO web crawling.

No, not according to https://en.wikipedia.org/wiki/Web_crawler:

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).[1]

I don't consider my manual mouse clicks to be a bot.
 
Carl Schreiber #:

No, not according to https://en.wikipedia.org/wiki/Web_crawler:

I don't consider my manual mouse clicks to be a bot.
MQ's ToS do disagree with your interpretation, specifically points 3.7 and 3.9.

So, even if you do not consider your actions to be covered by Wikipedia's definition, they are covered by MQ's ToS, and specifically by point 3.9.
 
Dominik Egert #:
MQ's ToS do disagree with your interpretation, specifically points 3.7 and 3.9.

So, even if you do not consider your actions to be covered by Wikipedia's definition, they are covered by MQ's ToS, and specifically by point 3.9.

??

I didn't download any MQ webpage.

 
Carl Schreiber #:

??

I didn't download any MQ webpage.


I didn't say that, but you gave that impression. MQ's ToS are just an example, as most websites have similar clauses.

Therefore I referred to MQ's ToS to show that it's not a matter of interpretation based on looking at a wiki article and calling it a day.

Even if you are not actually crawling, you can usually expect some kind of clause in the ToS of a source/webpage to cover such "workarounds".

Edit: Actually, all browsers using a local cache violate these ToS, depending on the interpretation. Strictly speaking, caching these pages is a violation of these rules.