Comments can be posted by users who have purchased or rented the product
Stanislav Korotky  

Supported CSS selectors

  • .class - an element with the specified "class" attribute
  • #id -  an element with the specified "id" attribute
  • tag - an element with the specified "tag" name
  • container element - an element inside the specified container on an arbitrary nesting level
  • parent > element - an element with immediate parent (nesting level is 1)
  • e1 + element - an element as a sibling of e1, both within a common parent, e1 is immediately followed by element
  • e1 ~ element - an element as a sibling of e1, both within a common parent
  • [attribute] - an element with existing attribute
  • [attribute=value] - an element with the specified attribute value
  • [attr*=text] - an element with the specified attribute containing text as a substring
  • [attr^=start] - an element with the specified attribute starting with start
  • [attr$=end] - an element with the specified attribute ending with end
  • :first-child - an element which is the first child of its parent
  • :last-child - an element which is the last child of its parent
  • :nth-child(n) - a child element with index n in its parent
  • :nth-last-child(n) - a child element with reverse index n in its parent
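
The attribute tests in the list above can be illustrated with a small Python sketch. It is not the expert's code: the function name attribute_matches and the dictionary representation of an element are assumptions made for illustration only.

```python
def attribute_matches(attrs, name, op=None, operand=None):
    """Check one attribute test such as [id], [id=x], [id*=x], [id^=x], [id$=x].

    attrs is a plain dict of attribute names to values (a hypothetical
    stand-in for a DOM element).
    """
    if name not in attrs:
        return False
    if op is None:                # [attribute]: the attribute only has to exist
        return True
    value = attrs[name]
    if op == "=":                 # exact value
        return value == operand
    if op == "*=":                # substring
        return operand in value
    if op == "^=":                # prefix
        return value.startswith(operand)
    if op == "$=":                # suffix
        return value.endswith(operand)
    raise ValueError(f"unsupported operator: {op}")


row = {"id": "calRow387528", "class": "normal pointer"}
print(attribute_matches(row, "id", "^=", "calRow"))   # prefix test
print(attribute_matches(row, "class", "=", "normal")) # exact test on the whole value
```

The second call returns False because the = test compares the whole attribute value, not a single class name.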
Stanislav Korotky  

Columns settings file

This is a CSV file with 3 columns: name, CSS selector, and data locator (the part of the DOM element to extract data from).

The first line is a header. Each subsequent line defines one data field.

In addition to standard CSS selectors, a special selector '.' (dot) can be used to select the row element itself.

The data locator is the name of the attribute to read data from. If it is empty, the data is taken from the text content of the selected element. If the selector is empty, the locator is copied "as is" as a constant value.
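
As a rough illustration of these rules, the following Python sketch reads a columns settings file and prints how each line would be interpreted. The describe helper and the COLUMNS_CSV sample (a shortened version of the settings from Example N2) are assumptions for illustration, not part of the product.

```python
import csv
import io

# Shortened columns settings, assumed here for illustration.
COLUMNS_CSV = """HEADER,SELECTOR,LOCATOR
DateTime,.,event_timestamp
TimeZone,,GMT
Currency,td.flagCur,
Importance,td.sentiment,title
"""


def describe(selector, locator):
    """Explain in words where the data of one field comes from."""
    if selector == "":
        # Empty selector: the locator is used as a constant value.
        return f"constant {locator!r}"
    target = "the row element itself" if selector == "." else f"element matching {selector!r}"
    source = f"attribute {locator!r}" if locator else "text content"
    return f"{source} of {target}"


reader = csv.reader(io.StringIO(COLUMNS_CSV))
next(reader)  # the first line is a header
fields = {name: describe(selector, locator) for name, selector, locator in reader}

for name, meaning in fields.items():
    print(f"{name}: {meaning}")
```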

Some examples will follow.

Stanislav Korotky  

Substitutions settings file (optional)

This is a CSV file with 3 columns: column number (according to the columns configuration above; numbering is 1-based), the text to find, and the text to use instead.

The first line is a header. Each subsequent line defines one rule.

The purpose of the substitution rules is to unify values in the same fields received from different sites, for example, to replace country names with the corresponding currency abbreviations.
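
A minimal Python sketch of how such rules could be applied to one extracted row is shown below. The helper names load_rules and apply_rules are hypothetical, and the sketch assumes plain substring replacement; the expert may match text differently.

```python
import csv
import io

# A few rules in the format described above (sample values taken from Example N1).
RULES_CSV = """COLUMNNUMBER,FROM,TO
3,European Monetary Union,EUR
3,United States,USD
5,Medium Impact,Medium
"""


def load_rules(text):
    """Parse the rules, skipping the header line."""
    reader = csv.reader(io.StringIO(text))
    next(reader)
    return [(int(column), find, replacement) for column, find, replacement in reader]


def apply_rules(row, rules):
    """Return a copy of the row with every matching rule applied."""
    result = list(row)
    for column, find, replacement in rules:
        index = column - 1  # column numbers in the file are 1-based
        if 0 <= index < len(result):
            # Assumption: substring replacement.
            result[index] = result[index].replace(find, replacement)
    return result


rules = load_rules(RULES_CSV)
row = ["Sep 05, 00:00", "GMT", "European Monetary Union", "G20 Meeting", "Medium Impact"]
print(apply_rules(row, rules))
```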

Some examples will follow.

Stanislav Korotky  

Examples Overview

All examples are from the real world: they are taken from setups prepared for some trading sites. This means that they work at the time of writing, but may fail in the future if a site changes its page formatting. Please consider every example a "how to" guide, not a ready-made, everlasting solution.

According to MQL5.com policy, any mention of external services (sites) is prohibited. This is why the examples are published without URLs.

Please note that the HTML may contain errors (as on many web sites); the expert is designed to process such pages robustly.

Stanislav Korotky  

Example N1. Economic Calendar M

URL: http://[address is skipped due to MQL5.com policy]

HTML fragment:

<!-- a considerable but irrelevant part of the HTML code is skipped -->
...
<!-- the table with calendar events starts here -->
<table class="table center td30" cellspacing="0" border="0" id="calendar" align="center">
  <tr>                           <!-- these are the columns of the table -->
    <th>Date</th>
    <th>Time Left</th>
    <th>&nbsp;</th>
    <th>Event</th>
    <th>Impact</th>
    <th>Previous</th>
    <th>Consensus</th>
    <th>Actual</th>
    <th id="calendarAlertMain" style="width: 60px">
      <a onclick="calendarMultiEmailAlert(event, true )" class="pointer">All</a>&nbsp;
      <a onclick="calendarMultiEmailAlert(event, false)" class="pointer">None</a>
    </th>
  </tr>
  <tr class="bg1">
    <td class="bold" colspan="9">Saturday, Sep 05, 2015</td>
  </tr>
  
  <!-- here is the single row with an event -->
  <tr id="calRow387528" onmouseover="this.className='normalActive pointer';"
    onmouseout="this.className='normal pointer';" class="normal pointer"
    onclick="calClick(387528);showHide(this,'data387528',0);getCalendarChart(387528);return false" >
    <td width="100" >Sep 05, 00:00 </td>                  <!-- date and time -->
    <td width="100" >
      <input name="calendarLeft" class="center font11 tranparent" importance="2" time="1441411200000"
       value="Done" style="border:0; width:80px" readonly="readonly" isPassed="1">
    </td>                
    <td width="45" >                                      <!-- country or currency reference -->
      <span id="calendarTip0" class="European Monetary Union">
        <img src="/images/countries/EMU.png">
      </span>
    </td>                
    <td align="left" width="325" >                        <!-- description -->
      <a class="noUnderline"> &nbsp;G20 Meeting</a></td>
    <td >                                                 <!-- importance -->
      <img src="/images/medium.impact.png" alt="Medium Impact"/></td>
    <td id="previousTipbf5f3aa5-a4ef-4962-ba02-17a937efc681" pot="" unit="" relation="true" >
      <!-- previous estimation (it's empty in this case) -->
      </script>     <!-- here is the error in the HTML, the expert will handle this -->
    </td>
    <td id="concensusbf5f3aa5-a4ef-4962-ba02-17a937efc681" concensus=""  consistConcensus="false">
      <!-- forecast (it's empty in this case) -->
    </td>
    <td id="actualTipbf5f3aa5-a4ef-4962-ba02-17a937efc681"  pot="" unit="" >
      <!-- actual value (it's empty in this case) -->
    </td>
    <td width="30" ></td>
  </tr>
  
  <!-- other rows will follow here -->

</table>
To find all rows with economic events, the following RowSelector is used: table[id=calendar] tr[id^=calRow]

It searches within the table tag whose id attribute equals "calendar" for all nested tr tags whose id starts with the string "calRow".
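
To make the meaning of this RowSelector more tangible, here is a rough Python equivalent built on the standard html.parser module. It is only an illustration of the matching logic, not how the expert is implemented; the class name CalendarRowFinder, the shortened SAMPLE markup, and the second row id calRow387529 are made up for this sketch.

```python
from html.parser import HTMLParser


class CalendarRowFinder(HTMLParser):
    """Collects ids of tr[id^=calRow] elements nested in table[id=calendar]."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # 0 means "outside the calendar table"
        self.row_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Enter the calendar table, or track a table nested inside it.
            if self.table_depth or attrs.get("id") == "calendar":
                self.table_depth += 1
        elif tag == "tr" and self.table_depth:
            if (attrs.get("id") or "").startswith("calRow"):
                self.row_ids.append(attrs["id"])

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


SAMPLE = """
<table id="other"><tr id="calRow1"><td>ignored</td></tr></table>
<table class="table center td30" id="calendar">
  <tr><th>Date</th></tr>
  <tr class="bg1"><td colspan="9">Saturday, Sep 05, 2015</td></tr>
  <tr id="calRow387528" class="normal pointer"><td>Sep 05, 00:00</td></tr>
  <tr id="calRow387529" class="normal pointer"><td>Sep 05, 01:30</td></tr>
</table>
"""

finder = CalendarRowFinder()
finder.feed(SAMPLE)
print(finder.row_ids)
```

The header row and the date row are skipped because they have no id, and the row in the first table is skipped because it is outside table[id=calendar].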

Suitable selectors and locators for the table cells in every row (contents of the file referenced by ColumnSettingsFile):

HEADER,SELECTOR,LOCATOR
DateTime,td:first-child,
TimeZone,,GMT
Currency,td[width=45] > span[id^=calendarTip],class
Event,td[align=left][width=325] > a,
Importance,td > img,alt
Actual,td[id^=actualTip],
Forecast,td[id^=concensus],
Previous,td[id^=previousTip],

Here are some explanations:

  • DateTime is extracted from the inner text (because the locator is empty) of the td tag that is the first child of the row;
  • TimeZone is the constant GMT (the selector is empty, and the locator is copied "as is");
  • Currency is taken from the class attribute (the locator is "class") of the span tag with id starting with "calendarTip" and located directly inside a td with width equal to 45;
  • Event description is filled with the text inside a link (a tag) located directly inside a td with "left" align and width 325;
  • Importance is taken from the alt attribute (note the "alt" locator) of an image (img tag) directly inside a td;
  • Actual value, its Forecast, and Previous value are taken as contents of table cells with ids starting with the corresponding strings.

Optional substitution rules are (contents of the file specified in SubstitutionSettingsFile):

COLUMNNUMBER,FROM,TO
3,European Monetary Union,EUR
3,France,EUR
3,Italy,EUR
3,Germany,EUR
3,Spain,EUR
3,United Kingdom,GBP
3,United States,USD
3,Canada,CAD
3,Japan,JPY
3,Australia,AUD
3,Switzerland,CHF
3,New Zealand,NZD
5,Low Impact,Low
5,Medium Impact,Medium
5,High Impact,High

Column number 3 is Currency, and number 5 is Importance (column numbering is 1-based).

Stanislav Korotky  

Example N2. Economic Calendar F

URL: http://[address is skipped due to MQL5.com policy]

HTML fragment:

<!-- a considerable but irrelevant part of the HTML code is skipped -->
...
<!-- the table with calendar events starts here -->
<table id="ecEventsTable" class="genTable closedTable ecoCalTable">
  <thead>    
    <tr>                                <!-- these are the columns of the table -->
      <th class="time">Time</th>
      <th class="flagCur">Cur.</th>
      <th class="sentiment">Imp.</th>
      <th class="event">Event</th>
      <th class="act">Actual</th>
      <th class="fore">Forecast</th>
      <th class="prev">Previous</th>
      <th class="diamond">
      </th>
    </tr>
  </thead>
  <tbody pageStartAt>
    <tr>
      <td colspan="8" class="theDay" id="theDay1445126400">Sunday, October 18, 2015</td>
    </tr>
    
    <!-- here is the single row with an event -->
                                                              <!-- date and time inside TR itself -->
    <tr id="eventRowId_317618" event_attr_ID="188" event_timestamp="2015-10-18 21:45:00"
      onclick="javascript:changeEventDisplay(317618, this, 'overview');">
      <td class="first left time" >17:45</td>
      <td class="flagCur">                                    <!-- country or currency reference -->
        <span title="New Zealand" class=" ceFlags New_Zealand">&nbsp;
        </span> NZD
      </td>
      <td class="sentiment" title="Low Volatility Expected">  <!-- importance -->
        <i class="newSiteIconsSprite grayFullBullishIcon middle"></i>
        <i class="newSiteIconsSprite grayEmptyBullishIcon middle"></i>
        <i class="newSiteIconsSprite grayEmptyBullishIcon middle"></i>
      </td>
      <td class="left event">Labor Cost Index (QoQ) (Q3)</td>             <!-- description -->
                                                                          <!-- actual value -->
      <td class="bold act blackFont" title="In Line with Expectation" id="eventActual_317618">0.5%</td>
      <td class="fore" id="eventForecast_317618">0.5%</td>                <!-- forecast -->
      <td class="prev blackFont" id="eventPrevious_317618">0.5%</td>      <!-- previous estimation -->
      <td class="diamond" id="eventRevisedFrom_317618">&nbsp;</td>
    </tr>
    
    <!-- other rows will follow here -->

  </tbody>
</table>

RowSelector can be: table[id=ecEventsTable] tr[event_attr_ID]

Suitable ColumnSettingsFile:

HEADER,SELECTOR,LOCATOR
DateTime,.,event_timestamp
TimeZone,,GMT
Currency,td.flagCur,
Event,td.event,
Importance,td.sentiment,title
Actual,td.act,
Forecast,td.fore,
Previous,td.prev,

What it means is that:

  • DateTime is extracted from event_timestamp attribute of containing row (note selector '.' and corresponding locator);
  • TimeZone is always "GMT";
  • Currency, Event, Actual value, Forecast, and Previous forecast are taken as contents of table cells (td tags) with corresponding class names (specified after dots);
  • Importance is read from the title attribute of a td with class "sentiment".
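
The difference between the '.' selector and an ordinary cell selector can be sketched in Python with the standard html.parser module. This is a simplified illustration, not the expert's algorithm: the class name EventRowReader and the shortened SAMPLE markup are assumptions, and the table[id=ecEventsTable] part of the RowSelector is not checked.

```python
from html.parser import HTMLParser


class EventRowReader(HTMLParser):
    """Reads DateTime from the row itself and Importance from td.sentiment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current = None  # the event row being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.current = None
            # html.parser lowercases attribute names: event_attr_ID -> event_attr_id
            if "event_attr_id" in attrs:
                # Selector '.', locator event_timestamp: an attribute of the row.
                self.current = {"DateTime": attrs.get("event_timestamp", "")}
                self.rows.append(self.current)
        elif tag == "td" and self.current is not None:
            classes = (attrs.get("class") or "").split()
            if "sentiment" in classes:
                # Selector td.sentiment, locator title: an attribute of a cell.
                self.current["Importance"] = attrs.get("title", "")


SAMPLE = """
<table id="ecEventsTable">
  <tbody>
    <tr><td colspan="8" class="theDay">Sunday, October 18, 2015</td></tr>
    <tr id="eventRowId_317618" event_attr_ID="188" event_timestamp="2015-10-18 21:45:00">
      <td class="first left time">17:45</td>
      <td class="sentiment" title="Low Volatility Expected"></td>
    </tr>
  </tbody>
</table>
"""

reader = EventRowReader()
reader.feed(SAMPLE)
print(reader.rows)
```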

SubstitutionSettingsFile

COLUMNNUMBER,FROM,TO
5,Low Volatility Expected,Low
5,Moderate Volatility Expected,Medium
5,High Volatility Expected,High

This time (unlike in Example N1) we need to unify only the Importance column, because the Currency cells in the HTML table already contain proper Forex abbreviations ("NZD" in the example row).

Stanislav Korotky  

Example N3. Tester report

URL: [local file name] 

HTML fragment: 

<!-- a considerable but irrelevant part of the HTML code is skipped -->
...
<!-- here is the table with data -->
<table width=820 cellspacing=1 cellpadding=3 border=0 two>
  <tr bgcolor="#C0C0C0" align=right>
    <td>#</td>                         <!-- these are the columns of the table -->
    <td>Time</td>
    <td>Type</td>
    <td>Order</td>
    <td>Size</td>
    <td>Price</td>
    <td>S / L</td>
    <td>T / P</td>
    <td>Profit</td>
    <td>Balance</td>
  </tr>
  <tr align=right>                     <!-- here is the single row -->
    <td>1</td>
    <td class=msdate>2014.01.08 05:45</td><td>buy</td><td>1</td>
    <td class=mspt>0.05</td>
    <td style="mso-number-format:0\.00000;">1.07938</td>
    <td style="mso-number-format:0\.00000;" align=right>0.00000</td>
    <td style="mso-number-format:0\.00000;" align=right>0.00000</td>
    <td colspan=2></td>
  </tr>
  <tr bgcolor="#E0E0E0" align=right>   <!-- here is another one -->
    <td>2</td>
    <td class=msdate>2014.01.08 11:45</td><td>modify</td><td>1</td>
    <td class=mspt>0.05</td>
    <td style="mso-number-format:0\.00000;">1.07938</td>
    <td style="mso-number-format:0\.00000;" align=right>1.06933</td>
    <td style="mso-number-format:0\.00000;" align=right>0.00000</td>
    <td colspan=2></td>
  </tr>
  
  <!-- other rows will follow here -->

RowSelector is: table ~ table tr + tr

It means: select every table that follows another table (here, the second table in the file; the first one, skipped in the example, contains trading results), then inside that table select every row that immediately follows another row, which excludes the first row with headers.
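
The two sibling combinators used in this RowSelector can be modelled in a few lines of Python. The sketch below represents siblings as a flat list of (tag, payload) pairs, which is a simplification assumed for illustration; the function names and the sample data are hypothetical.

```python
def general_siblings(elements, first_tag, second_tag):
    """'a ~ b': every b that has an earlier sibling a."""
    seen_first = False
    result = []
    for tag, payload in elements:
        if tag == second_tag and seen_first:
            result.append(payload)
        if tag == first_tag:
            seen_first = True
    return result


def adjacent_siblings(elements, first_tag, second_tag):
    """'a + b': every b whose immediately preceding sibling is a."""
    result = []
    previous_tag = None
    for tag, payload in elements:
        if tag == second_tag and previous_tag == first_tag:
            result.append(payload)
        previous_tag = tag
    return result


# Two sibling tables: the first with trading results, the second with orders.
summary_rows = [("tr", "summary")]
order_rows = [("tr", "header"), ("tr", "row 1"), ("tr", "row 2")]
body = [("table", summary_rows), ("table", order_rows)]

selected = []
for rows in general_siblings(body, "table", "table"):      # table ~ table
    selected.extend(adjacent_siblings(rows, "tr", "tr"))   # tr + tr
print(selected)
```

Only the rows of the second table are considered, and its header row is dropped because it has no preceding row.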

ColumnSettingsFile

Name,Selector,Locator
DateTime,td.msdate,
Type,td:nth-child(2),
OrderN,td:nth-child(3),
Size,td:nth-child(4),
Price,td:nth-child(5),
SL,td:nth-child(6),
TP,td:nth-child(7),
Profit,td:nth-child(8),
Balance,td:nth-child(9),

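DateTime is extracted by the class name "msdate", and all other fields are extracted by their position in the row. Note that the index in the nth-child selector is 0-based here, unlike in standard CSS, where it is 1-based, so td:nth-child(2) is the third cell (Type).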
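
The 0-based indexing can be checked against the first data row of the HTML fragment with a short Python sketch. The helper nth_child_zero_based is hypothetical, and returning an empty string for a missing cell is an assumption made for this illustration.

```python
# Text of the td cells in the first data row of the fragment above.
# The last td has colspan=2, so the row contains 9 cells, not 10.
cells = ["1", "2014.01.08 05:45", "buy", "1", "0.05",
         "1.07938", "0.00000", "0.00000", ""]

# Field name -> index used in td:nth-child(n) in the columns settings.
settings = {"Type": 2, "OrderN": 3, "Size": 4, "Price": 5,
            "SL": 6, "TP": 7, "Profit": 8, "Balance": 9}


def nth_child_zero_based(children, n):
    """Return the child with 0-based index n, or '' if there is none (assumption)."""
    return children[n] if 0 <= n < len(children) else ""


record = {name: nth_child_zero_based(cells, n) for name, n in settings.items()}
print(record)
```

With standard 1-based CSS indexing, td:nth-child(2) would select the Time cell instead of Type.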

There are no substitution rules in this case (SubstitutionSettingsFile is empty). 

Stanislav Korotky  

The source code and in-depth description of the product are published in the article:

EN - Extracting structured data from HTML pages using CSS selectors;

RU - Извлечение структурированных данных из HTML страниц с помощью CSS селекторов;

XX - translations into other languages are also available.

If you need technical support or assistance with custom configuration, please purchase the product.
