Dependency statistics in quotes (information theory, correlation and other feature selection methods)

 

Talk about a different kind of dependency.

How does "a" depend on "b" outside of any text? It doesn't, i.e. you can't get "a" from other characters.

What about depending on, say, 1, 2, 3, 4, 5, 6? Obviously, it's not a very appropriate set for the alphabet, no matter how you label it.

Isn't that right?

 
TheXpert:

Talk about a different kind of dependency.

How does "a" depend on "b" outside of any text? It doesn't, i.e. you can't get "a" from other characters.

What about depending on say 1, 2, 3, 4, 5, 6? Obviously, it's not a very appropriate set for the alphabet, no matter how you label it.

Isn't that right?

Why not? That's just a base-6 number system. It's a perfectly normal alphabet, as good as binary :)

And the Russian letters yu, ya, yo can be derived from other letters.

 
Avals:

And the Russian letters yu, ya, yo can be derived from other letters.

Not letters, but sounds :)

Or is it like that joke? "What a simple language Russian is -- the word 'yozh' is spelled with just two letters!"

 
TheXpert:

Not letters, but sounds :)

Or is it like that joke? "What a simple language Russian is -- the word 'yozh' is spelled with just two letters!"

Well, don't grumble ))). There are examples in other languages too where one character is replaced by several, i.e. one character can be derived from others. I just don't quite understand the principle behind this. And anyway, how do you determine whether an alphabet is the right one or not?
 

I'm a bit confused too, but something tells me that HideYourRichess is right.

The comparison with a number system is probably not quite right. A number is represented in a single way, while quotes have many representations, i.e. a symbol can be expressed in a huge (infinite, to be exact) number of ways via other symbols, e.g.

a == tsdrmiikepi == fsrpl == mflncp == javlporpor == fwlfrmilfpf == ...

That's not right, imho.

 

Gentlemen, I came across an article by German researchers very much in the vein of this topic. I'll post it when I find it. That is, I am not proposing anything new; all of this has already been studied for at least 10 years.

There is a researcher named Battiti (you can find his article by searching for the words Mutual Information Feature Selection). He is the father of the methodology of selecting variables with the help of mutual information. He works with various sources of experimental data, in particular with data on solar activity (a generally popular source of values), and the results confirm the usefulness of the I(X,Y) statistic for forecasting. I'll have to read up on how he discretises the random values there and constructs the alphabet. No one there seems to have fussed over the theory as much as the local old-timers have.
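Battiti's criterion (usually called MIFS) greedily picks, at each step, the candidate feature f that maximises I(f;C) - beta * sum of I(f;s) over the already selected features s, i.e. relevance to the target minus redundancy with what has already been chosen. Below is a minimal sketch of that idea; the histogram discretisation, the number of bins, beta = 0.5 and the lagged "features" are my own illustrative assumptions, not anything taken from Battiti's paper or from this thread.

# Sketch of Battiti's MIFS criterion; all data and parameters here are illustrative.
import numpy as np

def mutual_information(x, y, bins=8):
    # Plug-in (histogram) estimate of I(X;Y) in bits for two 1-D series.
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                    # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)      # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)      # marginal p(y)
    nz = pxy > 0                             # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def mifs(features, target, n_select=3, beta=0.5):
    # Greedy selection: maximise relevance minus beta * redundancy at each step.
    remaining, selected = list(features), []
    while remaining and len(selected) < n_select:
        def score(name):
            relevance = mutual_information(features[name], target)
            redundancy = sum(mutual_information(features[name], features[s]) for s in selected)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: lagged values of a series as candidates, the sign of the next value as target.
rng = np.random.default_rng(0)
r = rng.normal(size=2000)
target = np.sign(r[5:])
features = {f"lag{k}": r[5 - k:len(r) - k] for k in range(1, 6)}
print(mifs(features, target))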

 

What's the number system got to do with it, TheXpert? I don't understand why the conversation has turned to number systems.

Honestly, I don't see any arguments from HideYourRichess that in any way prevent applying information theory to quotes.

 
Mathemat:

What's the number system got to do with it, TheXpert? I don't understand why the conversation has shifted to number systems.

Not a turn, just an observation. What's wrong with numbers as an alphabet?

Honestly, I don't see any arguments from HideYourRichess that in any way prevent applying information theory to quotes.

The choice of alphabet does.

______

Still, I'd probably rather read it after all.

 
TheXpert:

I'm a bit confused too, but something tells me that HideYourRichess is right.

The comparison with a number system is probably not quite right. A number is represented in a single way, while quotes have many representations, i.e. a symbol can be expressed in a huge (infinite, to be exact) number of ways via other symbols, e.g.

a == tsdrmiikepi == fsrpl == mflncp == javlporpor == fwlfrmilfpf == ...

That's not right, imho.


Write the word "Disorder" in different languages and you get the same thing :) And even the same alphabet can give examples: synonyms, or obsolete words.

P.S. A number can also be represented in an infinite number of ways, depending on the number system, which is in fact an alphabet.

An alphabet is a conventional thing, invented by man so that a large number of objects and phenomena can be listed with a smaller number of characters. Of course, the characters must form a discrete set. There are no other stringent requirements; it is a matter of convenience.
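A tiny illustration of that point (my own sketch, not from the thread): one and the same value written with three different symbol sets, including a six-character one like the 1..6 set discussed earlier.

# One value, three "alphabets": binary, base-6 and hexadecimal.
def to_base(n, base):
    digits = "0123456789abcdef"
    out = ""
    while n:
        out = digits[n % base] + out
        n //= base
    return out or "0"

n = 2011
print(to_base(n, 2), to_base(n, 6), to_base(n, 16))   # 11111011011 13151 7db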

 
Mathemat:

HideYourRichess, if you think the whole of probability theory comes down to Bernoulli trials or the law of large numbers, you are very much mistaken.

I don't think it, I know it for a fact.

That's five! I want two!
HideYourRichess: Don't you understand that we are talking about a sequence of independent events there?

What independent events are you talking about? A sequence of alphabet characters from the source? No, they are not necessarily independent; that has already been explained to you. An ordinary Russian literary text is a sequence of dependent letters. If they were independent, literary texts would compress much worse than they actually do. Take some literary text, shuffle its letters, and compare the results of archiving the original and the shuffled version.
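For what it's worth, the experiment described above is easy to run; here is a rough sketch (the file name is only a placeholder for any literary text): shuffling the letters destroys the dependencies between neighbouring characters, and the archiver's result degrades accordingly.

# Compare how well a text compresses before and after shuffling its characters.
import random, zlib

with open("war_and_peace.txt", "rb") as f:    # placeholder: any literary text file
    original = f.read()

shuffled = bytearray(original)
random.shuffle(shuffled)                      # destroys the order, keeps the frequencies

print("original:", len(zlib.compress(original, 9)), "bytes")
print("shuffled:", len(zlib.compress(bytes(shuffled), 9)), "bytes")
# The shuffled version compresses noticeably worse, since the archiver can no longer
# exploit the dependencies between neighbouring letters.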

Or do you think that source and receiver ensembles are independent variables?

The notion of information entropy was introduced by Shannon for independent characters. If you don't believe me, consult an academic dictionary. I will not argue with you on this subject any more. You cannot calculate the information entropy for the market: you do not know the alphabet, you do not know the symbol frequencies, and the independence of the symbols is also unknown (although we do know that the actions of market participants are highly dependent).
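For reference, the quantity being discussed is Shannon's H = -sum_i p_i * log2(p_i) over the symbol probabilities; a minimal plug-in estimate, with empirical frequencies standing in for the (unknown) true probabilities, looks like this:

# Plug-in estimate of Shannon entropy, in bits per symbol, from observed frequencies.
from collections import Counter
from math import log2

def entropy(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())

print(entropy("abracadabra"))   # entropy of the empirical letter frequencies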

The next notion, conditional entropy, is precisely the case where there are dependencies between the characters of the original alphabet. It is not the same thing as the information entropy discussed above.
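Conditional entropy can be computed from the identity H(Y|X) = H(X,Y) - H(X); a small sketch for adjacent characters of a string (again with empirical frequencies in place of true probabilities, my own illustration):

# H(next | current) for adjacent characters, via H(Y|X) = H(X,Y) - H(X).
from collections import Counter
from math import log2

def entropy_of_counts(counts):
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(text):
    pairs = Counter(zip(text, text[1:]))   # joint counts of (current, next)
    singles = Counter(text[:-1])           # marginal counts of the conditioning symbol
    return entropy_of_counts(pairs) - entropy_of_counts(singles)

text = "abracadabra abracadabra"
print(conditional_entropy(text))   # lower than the plain entropy when letters are dependent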

I do not understand what conclusions the archiver example leads you to, but I will say this. The task of an archiver is to translate conditional entropy into information entropy, that is, to create a perfectly defined, limited alphabet whose characters, in the resulting sequence, are as independent as possible. If you shuffle the ordered structure of a literary text at the letter level, those letter sequences are of course destroyed and the compression deteriorates, to the point where a completely random set of letters can no longer be compressed. So what? What does that prove?
