INFORMATION THEORY POLYNESIAN REVISITED Thomas Tiahrt, MA, PhD CSC492 – Advanced Text Analytics.
Information Theory: Polynesian Revisited
Thomas Tiahrt, MA, PhD
CSC492 Advanced Text Analytics
Hello and Welcome to CSC 492 Advanced Text Analytics. We continue our overview of information theory by revisiting simplified Polynesian.
Models vs. Reality

Simplified Polynesian:
- Not actually a random variable
- But can be modeled by a random variable
When we use statistics to represent a phenomenon, we always keep in mind that we are simplifying the phenomenon in order to work with it. We create a model, but the model is not reality. Recall the George E.P. Box adage that all models are wrong, but some are useful. We want to work with useful models.
In our earlier session, we approximated the simplified Polynesian language by assuming that we can model it as a random variable. It won't be a complete representation of reality, but it should be good enough for the purposes we want to pursue.
Models vs. Reality

Suppose that we are provided with new information. Linguists living among Polynesians have discovered that Simplified Polynesian has a syllable structure, and that every syllable is a consonant followed by a vowel. This new information allows us to construct a better model using syllables than we had using letters alone.
Polynesian Syllable Model

Given that we know that all syllables are consonant-vowel sequences, we can model the language with two random variables. With our new model we have a joint distribution and marginal distributions.
Joint and Marginal Distributions

In the upper table we have the joint distribution in the intersection of each letter pair, and the marginal distributions in the margins. The bottom table compares the per-letter probabilities to the per-syllable probabilities, noting that the per-syllable probabilities are marginal probabilities. Because the marginal probabilities are on a per-syllable basis, they are double the per-letter probabilities. We must keep that doubling factor in mind when we get to our model-to-model comparison.
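The joint and marginal tables just described can be reproduced numerically. A minimal Python sketch, assuming the joint distribution values from the Manning and Schütze simplified Polynesian example cited in the references:

```python
# Joint distribution P(C = c, V = v) over consonant-vowel syllables.
# Values are assumed from the Manning & Schutze simplified Polynesian example.
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0.0,
    ('p', 'u'): 0.0,  ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}

# Marginal distributions: sum the joint probabilities over the other variable.
p_c = {c: sum(p for (ci, _), p in joint.items() if ci == c) for c in 'ptk'}
p_v = {v: sum(p for (_, vi), p in joint.items() if vi == v) for v in 'aiu'}

print(p_c)  # consonant marginals, e.g. P(t) = 3/4
print(p_v)  # vowel marginals, e.g. P(a) = 1/2
```

Summing each marginal to 1 is a quick sanity check that the joint table was entered correctly.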
Joint Entropy

Recall that we derived equation 15 in our last session. Now we want to use that result. Because equation 15 is applicable to our new model of Polynesian, we just need to substitute our consonant and vowel notation for the generic S and T notation, giving H(C,V) = H(C) + H(V|C). We will use that previous result in our entropy calculation.
The first of our two values is the entropy of the consonant probabilities. We use the marginal probabilities to compute the entropy.
On this slide we are just finishing the calculation we began on the previous slide.
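The consonant-entropy calculation worked through on these slides can be checked in a few lines of Python. This sketch assumes the marginal consonant probabilities P(p) = 1/8, P(t) = 3/4, P(k) = 1/8 from the Manning and Schütze example:

```python
from math import log2

# Marginal consonant probabilities (assumed from the M&S example).
p_c = {'p': 1/8, 't': 3/4, 'k': 1/8}

# H(C) = -sum over c of P(c) * log2 P(c); zero-probability terms contribute
# nothing and are skipped.
h_c = -sum(p * log2(p) for p in p_c.values() if p > 0)

print(f"H(C) = {h_c:.4f} bits")  # roughly 1.0613 bits
```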
For the second component of our syllable model we need the entropy of the conditional probability of the vowels given the consonants.
Equation 16 is just equation 13 with the consonant and vowel set identifiers: H(V|C) = -sum over c in C and v in V of P(c,v) log2 P(v|c). We want to ensure that we verify where all the numbers that go into our calculation come from. We will compute the components of equation 16 separately so that it is easier to follow the computation. The table will serve as a handy reference to the probabilities we need.
Before we can compute the log of the vowel probabilities given the consonants, we first need those conditional probabilities themselves. We show the summations here, but of course they are just the marginal probabilities of each consonant.
Next we use those marginal probabilities as the denominators to calculate each of the conditional vowel probabilities. We use the nine joint probabilities over the Cartesian product of the consonant set C and the vowel set V to compute the conditional probabilities of the vowels given the consonants. We place those probabilities in the table for easy reference in our next step.
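These conditional probabilities are straightforward to compute. A sketch, again assuming the joint distribution values from the Manning and Schütze example:

```python
# Joint distribution P(C = c, V = v), assumed from the M&S example.
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0.0,
    ('p', 'u'): 0.0,  ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}

# Consonant marginals serve as the denominators.
p_c = {c: sum(p for (ci, _), p in joint.items() if ci == c) for c in 'ptk'}

# P(v | c) = P(c, v) / P(c)
cond = {(c, v): joint[(c, v)] / p_c[c] for (c, v) in joint}

print(cond[('t', 'a')])  # P(a | t) = (3/8) / (3/4) = 0.5
```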
Next we simply take the log of each value. Note that these are base-2 logarithms, even though the 2 is not shown with the log operator here.
We perform the multiplication and add up the results to obtain the second of our two components of the entropy of our Simplified Polynesian syllable model.
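The conditional-entropy component just computed can be verified in Python. This sketch assumes the same Manning and Schütze joint distribution values and implements the sum from equation 16:

```python
from math import log2

# Joint distribution P(C = c, V = v), assumed from the M&S example.
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0.0,
    ('p', 'u'): 0.0,  ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}
# Consonant marginals, summed from the joint distribution.
p_c = {c: sum(p for (ci, _), p in joint.items() if ci == c) for c in 'ptk'}

# Equation 16: H(V|C) = -sum over (c, v) of P(c, v) * log2 P(v | c),
# where P(v | c) = P(c, v) / P(c); zero-probability terms are skipped.
h_v_given_c = -sum(p * log2(p / p_c[c])
                   for (c, v), p in joint.items() if p > 0)

print(f"H(V|C) = {h_v_given_c} bits")  # 1.375 bits
```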
Joint Entropy

At last we make the final addition for the entropy calculation. We find that the entropy is 2.43625 bits per syllable. To compare that to our per-letter model we must multiply the per-letter entropy by two, because we have two-letter syllables. The reason for the reduction in entropy is that our new model reduces uncertainty. The reduction in uncertainty means that, on average, we are less surprised by Polynesian than we were before, when we used the per-letter model.
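The model-to-model comparison can be reproduced end to end. This sketch computes the joint entropy directly from the joint distribution rather than via the chain rule (the two agree), assuming the joint syllable distribution and the per-letter probabilities from the Manning and Schütze example:

```python
from math import log2

# Joint syllable distribution (assumed from the M&S example).
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0.0,
    ('p', 'u'): 0.0,  ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}
# Per-syllable entropy, computed directly: H(C,V) = -sum P(c,v) log2 P(c,v)
h_syllable = -sum(p * log2(p) for p in joint.values() if p > 0)

# Per-letter probabilities (assumed from the earlier per-letter model).
p_letter = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}
h_letter = -sum(p * log2(p) for p in p_letter.values())

# Two letters per syllable, so double the per-letter entropy to compare.
print(f"syllable model: {h_syllable:.5f} bits/syllable")
print(f"letter model:   {2 * h_letter:.5f} bits/syllable")
```

The direct computation gives approximately 2.4363 bits per syllable; the slide's 2.43625 presumably reflects rounding of intermediate values in the hand calculation. Either way, it is well below the per-letter model's two-letter total, confirming the reduction in uncertainty.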
Entropy Rate

The amount of information conveyed in a message depends on the message length: a longer message will, on average, contain more information than a shorter one. Consequently we want to use a per-unit value, where the units may be letters or words. Here the 1n subscript denotes the sequence of units 1 through n, marking this as a per-unit measure.
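As an illustration, a per-token entropy value can be estimated from empirical frequencies. This sketch uses a hypothetical helper name, `unigram_entropy_rate`, and a unigram model only; real entropy-rate estimates require far more data and longer-range models, as the next slide discusses:

```python
from collections import Counter
from math import log2

def unigram_entropy_rate(tokens):
    """Per-token entropy estimate from empirical unigram frequencies.

    A hypothetical helper for illustration; it ignores all dependence
    between tokens, so it only upper-bounds the true entropy rate.
    """
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A string is a sequence of letter tokens; the result is in bits per letter.
print(unigram_entropy_rate("patukapi"))  # 2.5 bits per letter for this sample
```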
Entropy of Human Language

We assume that human language is a stochastic process consisting of a token sequence. Imagine that we have a Web crawler that continually collects new samples of a language. As we collect more and more data, the entropy of the language approaches a limit, which we use as our entropy estimate for the language.

References

Sources:
- Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze, The MIT Press
- Fundamentals of Information Theory and Coding Design, by Roberto Togneri and Christopher J.S. deSilva, Chapman & Hall / CRC
The end of the Conditional Entropy slide show has come.

End of the Slides

This ends our Joint and Conditional Entropy slide sequence.