Discussing the article: "Data label for timeseries mining (Part 2): Make datasets with trend markers using Python"


Check out the new article: Data label for timeseries mining (Part 2): Make datasets with trend markers using Python.

This series of articles introduces several time-series labeling methods that can produce datasets suitable for most artificial intelligence models. Targeted data labeling, tailored to your needs, can make the trained model better match the intended design, improve its accuracy, and even help it make a qualitative leap!

At this point we have completed the basic work, but if we want more precise data, further human intervention is needed. Here we will only point out a few directions rather than give a detailed demonstration.

1. Data integrity checks

Completeness refers to whether information is missing from the data: this can be the absence of entire records or of a single field within a record. Data integrity is one of the most fundamental criteria for evaluating data quality. For example, if two consecutive bars in M15-period stock market data are 2 hours apart, we need to use the appropriate tools to fill in the gap. Such gaps are generally rare in forex or stock data obtained from our client terminal, but if you get time series from other sources, such as traffic data or weather data, you need to pay special attention to this situation.
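As a minimal sketch of such a completeness check, the snippet below assumes a pandas DataFrame with a "time" column (the column name is an assumption; adapt it to your dataset) and flags any jump between consecutive M15 bars larger than one period:

```python
import pandas as pd

def find_gaps(df: pd.DataFrame, freq: str = "15min") -> pd.DataFrame:
    """Return the start/end timestamps of every gap wider than one bar."""
    t = pd.to_datetime(df["time"]).sort_values().reset_index(drop=True)
    step = pd.Timedelta(freq)
    diffs = t.diff()                      # time elapsed since the previous bar
    mask = diffs > step                   # True where a bar is missing
    gaps = pd.DataFrame({
        "gap_start": t.shift(1)[mask],    # last bar before the hole
        "gap_end": t[mask],               # first bar after the hole
    })
    return gaps.reset_index(drop=True)

# Example: three contiguous M15 bars, then a 2-hour hole.
df = pd.DataFrame({"time": [
    "2024-01-02 10:00", "2024-01-02 10:15", "2024-01-02 10:30",
    "2024-01-02 12:30",
]})
print(find_gaps(df))
```

Once the gaps are located, you can decide whether to forward-fill, interpolate, or simply drop the affected segments.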

The completeness aspect of data quality is relatively easy to assess and can generally be evaluated from the record counts and unique values in the data statistics. For example, if the Close price of one bar in a stock price series is 1000 but the Open price of the next bar is 10, you need to check whether data is missing.
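A minimal sketch of that sanity check, assuming columns named "open" and "close" and an illustrative 20% threshold (both the column names and the threshold are assumptions, not from the article):

```python
import pandas as pd

def suspicious_jumps(df: pd.DataFrame, max_ratio: float = 0.2) -> pd.DataFrame:
    """Flag bars whose Open deviates from the previous Close by more than max_ratio."""
    prev_close = df["close"].shift(1)
    rel_jump = (df["open"] - prev_close).abs() / prev_close
    return df[rel_jump > max_ratio]

# Example mirroring the text: a Close of 1000 followed by an Open of 10.
bars = pd.DataFrame({
    "open":  [995.0, 999.0, 10.0],
    "close": [998.0, 1000.0, 10.5],
})
print(suspicious_jumps(bars))   # flags the third bar
```

Flagged rows are only candidates for inspection; a large jump may be a genuine gap (e.g. a stock split or weekend open) rather than missing data.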

2. Check the accuracy of data labeling

From the perspective of this article, the labeling method implemented above may have certain weaknesses. We cannot rely solely on the methods provided by the pytrendseries library to obtain accurate labels; we also need to visualize the data and observe whether the trend classification is too sensitive or too sluggish, causing key information to be missed. In that case we need to analyze the data: split the segments that should be split, and merge the segments that should be merged. This work requires considerable effort and time to complete, and concrete examples are not provided here for the time being.
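One cheap, non-visual proxy for this inspection is to measure the length of each labeled trend segment: very short segments are candidates for merging into their neighbours, very long ones for splitting. The sketch below assumes the labels are available as a pandas Series of strings; the label values are invented for illustration:

```python
import pandas as pd

def segment_lengths(labels: pd.Series) -> pd.DataFrame:
    """Collapse a per-bar label series into (label, length) runs."""
    seg_id = (labels != labels.shift()).cumsum()   # new id at every label change
    grouped = labels.groupby(seg_id)
    return pd.DataFrame({
        "label": grouped.first(),
        "length": grouped.size(),
    }).reset_index(drop=True)

labels = pd.Series(["up", "up", "up", "down", "up", "up", "up", "up"])
segments = segment_lengths(labels)
print(segments)
# Runs much shorter than their neighbours (here the single "down" bar)
# are likely noise and candidates for merging into the surrounding trend.
print(segments[segments["length"] < 2])
```

A histogram of these lengths is often enough to spot an over-sensitive labeler (many one- or two-bar segments) before any detailed visual review.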

Accuracy refers to whether the information recorded in the data is correct, i.e. whether it contains abnormal or erroneous values. Unlike consistency, data with accuracy issues involves more than rule inconsistencies: consistency issues can be caused by inconsistent rules for data logging, which are not necessarily errors.

3. Do some basic statistical verification to see whether the labels are reasonable

  • Integrity distribution: quickly and intuitively see the completeness of the dataset.
  • Heatmap: heat maps make it easy to observe the correlation between two variables.
  • Hierarchical clustering: see whether the different classes in your data are closely related or scattered.
Of course, these are not the only methods available.
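As a small sketch of the heatmap idea, the code below builds a correlation matrix from a synthetic labeled dataset; the feature names and data are invented for illustration. The resulting matrix is exactly what you would hand to a plotting routine such as seaborn.heatmap, or to scipy's hierarchical clustering:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# A random-walk "price" plus two derived features and pure noise.
close = pd.Series(rng.normal(0, 1, n).cumsum() + 100)
df = pd.DataFrame({
    "close":   close,
    "ma_fast": close.rolling(5, min_periods=1).mean(),
    "ma_slow": close.rolling(20, min_periods=1).mean(),
    "noise":   rng.normal(0, 1, n),   # should correlate with nothing
})

corr = df.corr()
print(corr.round(2))
# Pass `corr` to seaborn.heatmap(corr) or matplotlib's imshow to visualize it;
# a near-zero row/column (like "noise") suggests an uninformative feature.
```

The same matrix also feeds the hierarchical-clustering check from the list above, e.g. via scipy.cluster.hierarchy.linkage on 1 - corr.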

Author: Yuqiang Pan