Discussion of article "Parsing HTML with curl"

 

New article Parsing HTML with curl has been published:

The article describes a simple HTML parsing library built from third-party components. In particular, it covers ways of accessing data that cannot be retrieved with simple GET and POST requests. We will select a website with reasonably small pages and try to extract interesting data from it.

One may ask, "What's the point?" A simple approach is to request the site page directly from an MQL script and read a known number of characters at a known position on the page, then process the received string further. This is one possible method, but it tightly binds the MQL script to the HTML code of the page. What if the HTML changes? That is why we need a parser that allows tree-like navigation of an HTML document (the details will be discussed in a separate section). If we implement the parser in MQL, will it be convenient and efficient in terms of performance? Can such code be properly maintained? That is why the parsing functionality will be implemented in a separate library. However, the parser will not solve all problems by itself. It performs its intended job, but what if the site design changes radically and starts using other class names and attributes? In that case we will need to change the search target, or even multiple targets. Therefore, one of our goals is to create the necessary code as quickly as possible and with the least effort, preferably from ready-made parts. This will let the developer maintain the code easily and edit it quickly if the above situation occurs.

We will select a website with reasonably small pages and try to extract interesting data from it. The kind of data is not important in this case; still, let us try to create a useful tool. Of course, this data must be available to MQL scripts in the terminal. The program code will be created as a standard DLL.

In this article we will implement the tool without asynchronous calls and multi-threading.

Author: Andrei Novichkov

 

Are there any restrictions on parsing?

I don't know much about this area. I want to parse, say, the data from these tables. At the same time I need to change the calendar date on the page. Is it possible to implement this with the tools from the article, or do I need something else?

Moscow Exchange - Main parameters of the futures contract
  • www.moex.com
 
Aleksey Vyazmikin:

Are there any restrictions on parsing?

I don't know much about this area. I want to parse, say, the data from these tables. At the same time I need to change the calendar date on the page. Is it possible to implement this with the tools from the article, or do I need something else?

It is impossible to create a single universal parsing mechanism. You can use the article as a basis; the libraries for parsing and retrieving pages can also be reused. The article focuses on how to work with these libraries using a concrete example, and the example is deliberately simple so the reader is not confused. We get a page and load it into the parser. After that it is purely individual work, because the structure of pages differs everywhere and has to be taken into account. Therefore, the code from the article will have to be adapted.

 

That's a dashing twist!

Andrei, can't you just give us:

1. a description of all the functions in this DLL - GETANDPARSE.dll

2. examples of calling each of the functions

This would allow us not to dig into all the details of the project. For example, I still have VS 2010, which is why I can't even open your project.


I want to use your DLL to

1. read a page from an SSL site

2. write it to a file

3. I will parse it myself, most likely...

 
Denis Sartakov:

That's a dashing twist!

Good afternoon.

A description of the functions in my DLL will not help you. Unfortunately, any such DLL will be "page dependent": my DLL parses only the page I write about in the article. There is no way around it - you have to build into the DLL an algorithm for finding the necessary information in the already parsed page, and of course that algorithm is different every time. There is something about this in the article. If you try to get away from this and "generalise" the search scheme somehow, you end up with a whole powerful standalone application that very few people will need. You need a new project: take curl for page retrieval and Gumbo for parsing. All you have to do is walk the tree that Gumbo builds and find the piece you need. Second, my DLL is a tutorial. I am counting on readers who do not know the subject well, not on people who know it better than I do. That is why the code is as light as possible - minimal checks, no exception handling - which would be unacceptable in a production version.

P.S. Still, it would be worth upgrading from Visual Studio 2010. C++20 is coming in February, so it's high time.

 
Andrei Novichkov:

Good afternoon.

A description of the functions in my DLL will not help you. Unfortunately, any such DLL will be "page dependent": my DLL parses only the page I write about in the article. There is no way around it - you have to build into the DLL an algorithm for finding the necessary information in the already parsed page, and of course that algorithm is different every time. There is something about this in the article. If you try to get away from this and "generalise" the search scheme somehow, you end up with a whole powerful standalone application that very few people will need. You need a new project: take curl for page retrieval and Gumbo for parsing. All you have to do is walk the tree that Gumbo builds and find the piece you need. Second, my DLL is a tutorial. I am counting on readers who do not know the subject well, not on people who know it better than I do. That is why the code is as light as possible - minimal checks, no exception handling - which would be unacceptable in a production version.

P.S. Still, it would be worth upgrading from Visual Studio 2010. C++20 is coming in February, so it's high time.

Yeah, thanks.

 

Andrei, can you tell me how to download these from libcurl:


libcurl-x32.dll and libcurl-x32.lib


Some rubbish gets downloaded...

libcurl.a - what kind of file is this?

 
Denis Sartakov:

Andrei, can you tell me how to download these from libcurl:


libcurl-x32.dll and libcurl-x32.lib


Some rubbish gets downloaded...

libcurl.a - what kind of file is this?

You have to look carefully there and choose the OS and bitness. The 32-bit versions will most likely be called simply libcurl.dll. As for libcurl.a, that is a library in the format used by MinGW/GCC; for Visual Studio you need a .lib file.

I have attached the file for you, but I haven't tested it.

Files:
libcurl.zip  482 kb
 
Denis Sartakov:

I want to use your DLL to

1. read a page from an SSL site

2. write it to a file

3. I will parse it myself, most likely...

You can do it without a DLL. What's wrong with downloading via WebRequest? You can parse in MQL5 as well; there are various ready-made codes for HTML/XML. As an example, here is one such option.

Extracting structured data from HTML pages using CSS selectors
  • www.mql5.com
 
Andrei Novichkov:

You have to look carefully there and choose the OS and bitness. The 32-bit versions will most likely be called simply libcurl.dll.

I have attached the file for you, but I haven't tested it.

That's only the DLL; you also need the .lib import library,

so that you can build your own C++ projects against it.

 
Stanislav Korotky:

You can do it without a DLL. What's wrong with downloading via WebRequest? You can parse in MQL5 as well; there are various ready-made codes for HTML/XML. As an example, here is one such option.

Parsing - there are no problems there at all; you can always manage it one way or another.

There is only one question:

1. how to read a page from an SSL site from the program


I have not found adequate working examples...