Discussing the article: "Feature selection and dimensionality reduction using principal components"

 

Check out the new article: Feature selection and dimensionality reduction using principal components.

The article describes the implementation of a modified Forward Selection Component Analysis algorithm, inspired by the paper “Forward Selection Component Analysis: Algorithms and Applications” by Luca Puggini and Sean McLoone.

Financial time series prediction often involves analyzing numerous features, many of which may be highly correlated. Dimensionality reduction techniques like Principal Component Analysis (PCA) can help create a more compact representation of these features. However, PCA has limitations, especially in the presence of highly correlated variables. In such cases, PCA tends to exhibit the grouping effect, wherein a set of highly correlated variables collectively contributes to a given principal component. Instead of highlighting any single variable, PCA distributes the influence relatively evenly across all variables in the correlated group.
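For readers who want to see the grouping effect numerically, here is a minimal Python/scikit-learn sketch (not taken from the article, which is implemented separately): five noisy copies of one latent factor plus one independent variable. The first principal component spreads its weight almost evenly across the five correlated columns instead of singling any of them out.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
latent = rng.normal(size=n)

# Five highly correlated columns: noisy copies of the same latent factor
group = np.column_stack([latent + 0.05 * rng.normal(size=n) for _ in range(5)])
# One unrelated column carrying its own independent signal
solo = rng.normal(size=n).reshape(-1, 1)

X = np.hstack([group, solo])
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2).fit(X)

# First component: weight is spread almost evenly across the five
# correlated columns (roughly 0.45 each); none of them stands out.
print(np.round(pca.components_[0], 2))
# Second component: dominated by the independent column.
print(np.round(pca.components_[1], 2))
```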

This even distribution can be beneficial for noise suppression because the principal components emphasize common patterns rather than the random fluctuations unique to individual variables. However, this noise suppression comes at a cost: it often dilutes the contribution of individual variables to each principal component. Variables that might be significant on their own can appear less important within the transformed space, as their influence is absorbed into the broader structure captured by the group. This can be a significant drawback in tasks like variable selection, where the goal is to identify the most influential features, or in root-cause analysis, where understanding the direct impact of specific variables is crucial.

Author: Francis Dube

 

The topic is, of course, timeless and always relevant.

It would be good if the article compared the effectiveness of the different methods on real data rather than synthetic data.

I tried increasing the number of features to 5000 and the number of rows to 10000. I waited three days for a result and got nothing. So I wonder: would the quality suffer significantly if we split the features into groups of, say, 100 each, ran the selection within each group, and then brought the winners from all the groups together for a final selection?
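For what it is worth, that divide-and-conquer idea can be sketched in a few lines of Python. The forward_select function below is only a simplified stand-in for the article's FSCA routine (it greedily adds the column that most increases the variance of the block explained by the selected columns), and all names and parameters here (forward_select, two_stage_select, group_size, winners_per_group) are illustrative, not from the article. Whether the two-stage screening loses much quality depends on how the informative, correlated features end up split across groups, so it would have to be validated on the actual data.

```python
import numpy as np

def forward_select(X, k):
    """Greedy forward selection (simplified FSCA-style criterion):
    at each step add the column whose inclusion maximizes the variance
    of the whole block X explained by projecting X onto the selected columns."""
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            S = X[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(S, X, rcond=None)
            score = np.linalg.norm(S @ coef) ** 2  # explained sum of squares
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

def two_stage_select(X, group_size=100, winners_per_group=5, k_final=10):
    """Stage 1: screen each block of columns independently and keep its winners.
    Stage 2: run the selection again on the pooled winners only."""
    p = X.shape[1]
    pooled = []
    for start in range(0, p, group_size):
        idx = np.arange(start, min(start + group_size, p))
        local = forward_select(X[:, idx], min(winners_per_group, len(idx)))
        pooled.extend(idx[local].tolist())
    pooled = np.array(pooled)
    final = forward_select(X[:, pooled], min(k_final, len(pooled)))
    return pooled[final]

# Tiny usage example on random data (columns are centered first)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 400))
X = X - X.mean(axis=0)
print(two_stage_select(X, group_size=100, winners_per_group=5, k_final=10))
```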