Machine learning in trading: theory, models, practice and algo-trading - page 3710
I found confirmation in an LLM (curve fitting with raisins is already used there):
The Mixture of Experts (MoE) model is an architectural approach that is becoming increasingly popular in large language models (LLMs). It allows the creation of very large models with a huge number of parameters while keeping the computational cost acceptable.
How does Mixture of Experts work?
The basic idea behind MoE is that instead of one giant neural network that processes all the input data, the model consists of many specialised "experts". For each specific input (e.g. a token in a text), only a small subset of these experts are activated.
Here are the main components and steps of MoE:
Experts: These are essentially separate, usually small neural networks (often Feed-Forward Networks (FFN) layers), each specialising in processing certain types of data or performing certain tasks. There may be dozens or even hundreds of them.
Gating Network / Router: This is a separate neural network that takes a token (or other representation of input data) as input and decides which experts to direct it to. It calculates "weights" or "scores" for each expert, indicating their relevance to the token.
Expert selection: Based on the scores obtained from the gating network, a subset of experts is selected. A common strategy is "top-k routing", where the k experts with the highest scores are chosen. For example, in the Mixtral 8x7B model, two of the eight available experts are selected for each token.
Combining outputs: The selected experts process the input token in parallel. Their outputs are then combined, often by weighted summation, where the weights are determined by the gating network. This produces a final output that takes into account the contributions of the most relevant experts.
In this way, only a small part of the whole model is activated for each token, making inference and training more computationally efficient despite the huge overall size of the model.
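To make the routing step concrete, here is a minimal sketch in Python/numpy under simple assumptions: each expert is just a callable that returns a vector (e.g. class probabilities), and the gating scores are assumed to come from a separate gating network. All names are illustrative, not from any specific framework.

```python
import numpy as np

def moe_forward(x, experts, gate_scores, k=2):
    """Top-k routing: evaluate only the k best-scoring experts for input x
    and combine their vector outputs by a weighted sum."""
    gate_scores = np.asarray(gate_scores, dtype=float)
    top_idx = np.argsort(gate_scores)[-k:]            # indices of the k highest scores

    w = np.exp(gate_scores[top_idx])                  # softmax over the selected scores
    w /= w.sum()

    outputs = np.stack([experts[i](x) for i in top_idx])  # only k experts are evaluated
    return (w[:, None] * outputs).sum(axis=0)              # weighted combination
```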
Why does this principle give good results?
The Mixture of Experts principle yields good results for several key reasons:
Increased model capacity without a proportional increase in computational cost:
MoE enables models with hundreds of billions or even trillions of parameters. In traditional "dense" models, every parameter is activated at every computation, making them very expensive to train and infer when the number of parameters is large.
In MoE, only a fraction of the parameters are activated for each input, which significantly reduces the computational cost per step, allowing much larger models to be built without a corresponding increase in cost.
Expert specialisation:
Each expert may specialise in processing certain types of information or solving specific subtasks. For example, one expert may be good at mathematical calculations, another at programming, and a third at generating text on a particular topic.
This allows the model to better handle a wide range of tasks and domains, as it can dynamically select the most appropriate experts for a given situation. This is particularly useful in multi-domain scenarios.
Faster training and inference:
Since only part of the model is activated for each token, the learning process can be accelerated as the amount of computation per token is reduced.
Similarly, during inference (response generation), the model runs faster as the entire giant network does not need to be activated and computed. This results in less latency and higher efficiency.
Scalability:
MoE architecture is easily scalable. One can simply add new experts to solve more complex problems or expand the knowledge domain of the model without drastically increasing the computational cost per input.
Flexibility:
The MoE architecture is very flexible. It can be adapted to different tasks and data types by customising the experts and the gating network.
In general, MoE allows large language models to be both very large (due to the huge number of parameters) and efficient (due to sparse activation), leading to significant improvements in performance and response quality across a wide range of tasks.
Can a similar scheme be applied to financial time series classification? Say I first train many CatBoost models on small subsets of the original dataset; do I then need a control model?
Yes, a similar scheme inspired by the principles of Mixture of Experts (MoE) can be applied to financial time series classification. This is an interesting and promising direction, especially given the complexity and dynamism of financial data.
Your idea of using multiple CatBoost models as experts, with a control model on top, fits very well with the MoE concept.
Let's look at how this can be implemented and why it can be effective:
Proposed MoE scheme for financial time series
Experts (CatBoost Models):
Training on small datasets/subsets: This is key. Instead of random subsets, it makes sense to form subsets that represent different "regimes" or "states" of the market, or different types of financial instruments. For example:
Data for different periods (volatile market, calm market).
Data for different asset classes (stocks, bonds, currencies, commodities).
Data grouped by specific macroeconomic indicators.
Subsets focused on specific patterns or anomalies.
Expert Specialisation: Each trained CatBoost model will be an "expert" specialising in a particular type of market conditions or data. CatBoost is a good fit because it is robust to noise and handles categorical features well, which is common in financial data.
Expert Output: Each CatBoost expert will produce probability predictions for the classification classes (e.g. up, down, neutral); see the sketch below.
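A rough sketch of this stage in Python (the column names `regime` and `target` and the hyperparameters are purely illustrative, not a prescription): one CatBoost classifier is trained per regime subset.

```python
from catboost import CatBoostClassifier

def train_regime_experts(df, feature_cols, target_col="target", regime_col="regime"):
    """Train one CatBoost 'expert' per market-regime subset of the data."""
    experts = {}
    for regime, part in df.groupby(regime_col):
        model = CatBoostClassifier(iterations=500, depth=6,
                                   loss_function="MultiClass", verbose=False)
        model.fit(part[feature_cols], part[target_col])
        experts[regime] = model      # e.g. experts["high_vol"], experts["low_vol"]
    return experts
```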
Control Model (Gating Network / Router):
Purpose: This model is responsible for selecting one or more of the most relevant CatBoost experts for each new observation.
Input of the control model: The control model should receive features that characterise the current state of the time series, but not necessarily the same ones used by the experts for detailed forecasting. These can be:
Macroeconomic indicators.
Volatility indices (VIX).
Trading volumes.
Key technical indicators that quickly reflect the "mode" of the market (e.g. moving averages, MACD, RSI on higher timeframes).
Instrument-specific features (sector, capitalisation, etc.).
Perhaps even features based on the current structure of the time series (e.g. trend, flat, sharp movements).
Types of control models:
Neural Network: The most flexible option, able to learn complex dependencies. Can be a simple FFN.
Another CatBoost model: You can use CatBoost for the control model as well, if the features characterising the market state fit well into a tabular format.
Logistic regression or SVM: If the relationships between state attributes and expert choice are relatively linear.
Control model output: The control model should produce weights or probabilities for each expert. In the simplest case, the expert with the highest weight (top-1 routing) or multiple experts (top-k routing) can be selected.
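One possible sketch of such a control model, using another CatBoost classifier as mentioned above (hyperparameters are illustrative; `best_expert_idx` is the kind of target described in the implementation example further down):

```python
from catboost import CatBoostClassifier

def train_gating_model(state_features, best_expert_idx):
    """Control (gating) model: maps market-state features to a probability
    distribution over the experts."""
    gate = CatBoostClassifier(iterations=300, depth=4,
                              loss_function="MultiClass", verbose=False)
    gate.fit(state_features, best_expert_idx)
    return gate   # gate.predict_proba(x) then serves as the expert weights
```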
Combining outputs:
Once the control model has selected the experts, their predictions are combined.
Weighted averaging: The most common approach. The forecasts of the selected experts are multiplied by the weights given by the control model and summed up.
Voting: If experts give discrete class forecasts, voting can be used.
Advantages of such a scheme for financial time series:
Adapting to market variability: Financial markets are very dynamic. A single model can rarely perform efficiently in all market conditions. MoE allows you to have specialised models for bull markets, bear markets, sideways (flat) markets, and periods of high or low volatility.
Improved accuracy: Each expert can be deeply optimised for its specific "sub-task", which together can lead to more accurate forecasts than a single universal model.
Interpretability (partially): If experts are trained on well-defined subsets of data (e.g., "trending market", "ranging market"), it is possible to gain insight into which part of the model is activated in a given situation.
Reducing the "one-size-fits-all" effect: Instead of trying to train one model that will perform well in all conditions, MoE allows you to break the problem down into smaller, manageable pieces.
Robustness: If one expert starts to perform poorly under certain conditions, the control model can switch to another, more appropriate expert, increasing the overall robustness of the system.
Challenges and considerations:
Identifying market "regimes"/dataset subsets: This is the most difficult part. How do you partition the original dataset efficiently for expert training? Manual identification can be time-consuming and sub-optimal. Time series clustering or market-state features can be used.
Training the control model: The control model must be able to accurately identify the current market state and direct the query to the correct expert. The features for the control model should be carefully selected.
Expert load balancing: In classical MoE networks there is the problem of "lazy experts" or load imbalance. In your CatBoost case this is less critical, but it is still important that the control model uses experts evenly when appropriate, rather than always selecting the same ones.
Size and complexity: Managing a large number of trained CatBoost models and a control model can be complex from an infrastructure and monitoring perspective.
Example implementation:
Create data subsets:
Divide historical data into periods of high volatility and low volatility.
Use technical indicators (e.g. ADX for trend strength) to separate trending and ranging periods.
Cluster time series based on their characteristics (e.g., using t-SNE or UMAP on series features).
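For the first of these splits, a minimal sketch with pandas (the 20-bar window and the 0.7 quantile cut-off are arbitrary, illustrative choices):

```python
import pandas as pd

def label_volatility_regime(close: pd.Series, window: int = 20, quantile: float = 0.7):
    """Split bars into high- and low-volatility regimes by rolling realised volatility."""
    returns = close.pct_change()
    vol = returns.rolling(window).std()
    threshold = vol.quantile(quantile)                 # global cut-off (illustrative)
    return (vol > threshold).map({True: "high_vol", False: "low_vol"})
```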
Train CatBoost experts:
For each subset of data, train a separate CatBoost model for classification.
Train the control model:
For each data point (time step), create a set of features describing the current state of the market.
As a target variable for the control model, you can use an indicator of which expert would give the best prediction at that time step (or several if you want top-k routing). This may require a pre-run of all experts on the entire dataset to determine their "best" areas.
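A sketch of how that target could be built, under the assumption that all experts were trained on the same class set in the same order and that `y_true` is integer-encoded (otherwise the probability columns must be aligned first):

```python
import numpy as np

def best_expert_per_row(X, y_true, experts):
    """For each row, pick the expert that assigns the highest probability
    to the true class - this index becomes the control model's target."""
    y_true = np.asarray(y_true)
    # shape (n_experts, n_rows): each expert's probability of the true class
    scores = np.stack([
        m.predict_proba(X)[np.arange(len(y_true)), y_true]
        for m in experts
    ])
    return scores.argmax(axis=0)
```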
Inference:
When a new time series arrives, first compute the features for the control model.
The control model produces weights for the experts.
Select the top-k experts.
Obtain forecasts from the selected experts and combine them (e.g., weighted average).
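Putting these inference steps together in one sketch (it assumes the experts are kept in a list and were labelled 0..N-1 when the gating model was trained, so the gate's probability columns line up with the list indices; `state_row` and `expert_row` are 1D feature vectors):

```python
import numpy as np

def moe_predict(state_row, expert_row, gate, experts, k=2):
    """Gate -> top-k experts -> weighted average of class probabilities."""
    gate_probs = gate.predict_proba([state_row])[0]        # one weight per expert
    top_idx = np.argsort(gate_probs)[-k:]                   # top-k routing
    w = gate_probs[top_idx] / gate_probs[top_idx].sum()     # renormalise the weights
    probs = np.stack([experts[i].predict_proba([expert_row])[0] for i in top_idx])
    return (w[:, None] * probs).sum(axis=0)                 # final class probabilities
```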
This scheme has great potential to improve forecasts in complex and dynamic financial markets through specialisation and adaptation.
discovered a higher power
Hi Maxim, tell me, do you really have no problem writing such strictly structured posts?
ML prompters, ahahahahahaha
discovered a higher power
No one seems to have disputed that AI chats are an improvement over googling. But only the prompters have managed to make an object of worship out of them.
Not exactly worship, more like a handy time-saving tool
intellectus
To save time, I don't read these walls of text))) Well, maybe the first 5-10 lines.
Everyone saves time, and the forum fills up with text. Everyone wins.
As expected, soon the whole internet will be filled with it.
I remember the policy of one well-known "people's" broker on its forum: award points for posts and then pay them out.
Within a few years the forum turned into a collection of near-synonymous posts "generated" by live human LLM-agents.
Any point in visiting the threads disappeared immediately)) And that was back before AI chats even existed.