The classification problem (Recap from LING570) LING 572 Fei Xia, Dan Jinguji Week 1: 1/10/08 1.
Introduction to information theory LING 572 Fei Xia, Dan Jinguji Week 1: 1/10/06 1.
Information theory
• Reading: M&S 2.2
• It is the use of probability theory to quantify and measure “information”.
• Basic concepts:
– Entropy
– Cross entropy and relative entropy
– Joint entropy and conditional entropy
– Entropy of the language and perplexity
– Mutual information
Entropy
• Entropy is a measure of the uncertainty associated with a distribution.
• It is the lower bound on the average number of bits that it takes to transmit messages.
• An example: – Display the results of horse races. – Goal: minimize the number of bits to encode the results.
$$H(X) = -\sum_x p(x) \log p(x)$$
where log is base 2, so H(X) is measured in bits.
An example
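• Uniform distribution: p_i = 1/8.
  $$H(X) = 8 \cdot \left(-\frac{1}{8} \log_2 \frac{1}{8}\right) = 3 \text{ bits}$$
• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
  $$H(X) = -\left(\frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{4} \log_2 \frac{1}{4} + \frac{1}{8} \log_2 \frac{1}{8} + \frac{1}{16} \log_2 \frac{1}{16} + 4 \cdot \frac{1}{64} \log_2 \frac{1}{64}\right) = 2 \text{ bits}$$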
(0, 10, 110, 1110, 111100, 111101, 111110, 111111)
Uniform distribution has higher entropy.
MaxEnt: make the distribution as “uniform” as possible.
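The horse-race numbers above can be checked with a short script. This is a minimal sketch: the helper name `entropy` and the variable names are illustrative and do not come from the slides. It also computes the expected length of the variable-length code listed above under the non-uniform distribution.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The two distributions from the horse-race example.
uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

# The variable-length code from the slide, one codeword per horse.
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)

# Expected number of bits per race result when using the code above.
expected_len = sum(p * len(c) for p, c in zip(skewed, codes))

print(f"H(uniform) = {h_uniform:.3f} bits")
print(f"H(skewed)  = {h_skewed:.3f} bits")
print(f"expected code length = {expected_len:.3f} bits")
```

For this particular distribution the expected code length equals the entropy, because every probability is a power of 1/2 and each codeword has length -log2 p(x).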
Cross Entropy
• Entropy:
  $$H(X) = -\sum_x p(x) \log p(x)$$
• Cross Entropy:
  $$H_c(X) = -\sum_x p(x) \log q(x)$$
• Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
  $$H_c(X) \ge H(X)$$
Relative Entropy
• Also called Kullback-Leibler divergence:
  $$KL(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H_c(X) - H(X)$$
• Another “distance” measure between probability functions p and q.
• KL divergence is asymmetric (not a true distance):
  $$KL(p \| q) \ne KL(q \| p)$$
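As an illustration of the cross entropy and KL formulas, the sketch below reuses the two horse-race distributions from the entropy example, treating the non-uniform one as the true p and the uniform one as the estimate q. That pairing is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def cross_entropy(p, q):
    """H_c(X) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed pairing: p is the "true" distribution, q is our estimate.
p = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]
q = [1 / 8] * 8

h_p = cross_entropy(p, p)    # cross entropy of p with itself is H(X)
h_c = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(f"H(X)     = {h_p:.3f} bits")
print(f"H_c(X)   = {h_c:.3f} bits")
print(f"KL(p||q) = {kl_pq:.3f} bits")
print(f"KL(q||p) = {kl_qp:.3f} bits")
```

Here KL(p||q) comes out as H_c(X) - H(X), and swapping the arguments gives a different value, which is the asymmetry noted above.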
Reading assignment #1
• Read M&S 2.2: Essential Information Theory
• Questions: For a random variable X, p(x) and q(x) are two distributions. Assume p is the true distribution.
– p(X=a)=p(X=b)=1/8, p(X=c)=1/4, p(X=d)=1/2
– q(X=a)=q(X=b)=q(X=c)=q(X=d)=1/4
(a) What is H(X)?
(b) What is H(X, q)?
(c) What is KL divergence D(p||q)?
(d) What is D(q||p)?
Joint and conditional entropy
• Joint entropy:
  $$H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)$$
• Conditional entropy:
  $$H(Y|X) = \sum_x p(x) H(Y|X=x)$$
  $$= -\sum_x p(x) \sum_y p(y|x) \log p(y|x)$$
  $$= -\sum_x \sum_y p(x, y) \log p(y|x)$$
  $$= H(X, Y) - H(X)$$
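The identity H(Y|X) = H(X,Y) - H(X) can be checked numerically. The sketch below uses a small made-up joint distribution over two binary variables; the table values and all names are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y); the values are illustrative only.
joint = {
    ("x0", "y0"): 1 / 2,
    ("x0", "y1"): 1 / 4,
    ("x1", "y0"): 1 / 8,
    ("x1", "y1"): 1 / 8,
}

def entropy(probs):
    """H = -sum p * log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal_x(joint):
    """p(x) = sum_y p(x, y)."""
    px = defaultdict(float)
    for (x, _y), p in joint.items():
        px[x] += p
    return dict(px)

def conditional_entropy(joint):
    """H(Y|X) = -sum_{x,y} p(x, y) * log2 p(y|x), with p(y|x) = p(x, y) / p(x)."""
    px = marginal_x(joint)
    return -sum(p * math.log2(p / px[x]) for (x, _y), p in joint.items() if p > 0)

h_xy = entropy(joint.values())
h_x = entropy(marginal_x(joint).values())
h_y_given_x = conditional_entropy(joint)

print(f"H(X,Y) = {h_xy:.4f} bits")
print(f"H(X)   = {h_x:.4f} bits")
print(f"H(Y|X) = {h_y_given_x:.4f} bits")
```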
Entropy of a language (per-word entropy)
• The entropy of a language L:
  $$H(L) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})$$
• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as:
  $$H(L) = -\lim_{n \to \infty} \frac{\log p(x_{1n})}{n} \approx -\frac{\log p(x_{1n})}{n}$$
Perplexity
• Perplexity is 2^H.
• Perplexity is the weighted average number of choices a random variable has to make.
=> We learned how to calculate perplexity in LING570.
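To make the link between per-word entropy and perplexity concrete, here is a minimal sketch that scores a word sequence with a unigram model. The model, its probabilities, and the test sequences are all hypothetical; a real language model would be estimated from data, and the unigram independence assumption is a simplification made here.

```python
import math

def per_word_entropy(model, words):
    """Approximate H = -(1/n) * log2 p(x_1n), with p(x_1n) taken as a
    product of unigram probabilities (an assumed, simplified model)."""
    log_prob = sum(math.log2(model[w]) for w in words)
    return -log_prob / len(words)

def perplexity(model, words):
    """Perplexity = 2 ** H."""
    return 2 ** per_word_entropy(model, words)

# Hypothetical unigram model and test sequence.
model = {"a": 1 / 2, "b": 1 / 4, "c": 1 / 4}
words = ["a", "b", "a", "c"]

h = per_word_entropy(model, words)
ppl = perplexity(model, words)

# A uniform model over 8 word types: every word is one of 8 equal choices.
uniform_model = {w: 1 / 8 for w in "abcdefgh"}
ppl_uniform = perplexity(uniform_model, list("abcabc"))

print(f"per-word entropy = {h:.3f} bits, perplexity = {ppl:.3f}")
print(f"uniform model perplexity = {ppl_uniform:.3f}")
```

The uniform case shows the "number of choices" reading: with 8 equally likely words the perplexity is 8.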
Mutual information
• It measures how much is in common between X and Y:
  $$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
  $$= H(X) + H(Y) - H(X, Y)$$
  $$= I(Y; X)$$
• I(X;Y)=KL(p(x,y)||p(x)p(y))
• I(X;Y) = I(Y;X)
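The two expressions for mutual information can be compared numerically. The sketch below uses a made-up joint distribution (hypothetical values and names) and also checks that an independent joint distribution gives I(X;Y) = 0.

```python
import math
from collections import defaultdict

def entropy(probs):
    """H = -sum p * log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginals(joint):
    """Return (p(x), p(y)) computed from the joint table p(x, y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x, y) * log2( p(x, y) / (p(x) * p(y)) )."""
    px, py = marginals(joint)
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

# Hypothetical dependent joint distribution.
joint = {
    ("x0", "y0"): 1 / 2,
    ("x0", "y1"): 1 / 4,
    ("x1", "y0"): 1 / 8,
    ("x1", "y1"): 1 / 8,
}
# Hypothetical independent joint: p(x, y) = p(x) * p(y).
independent = {
    ("x0", "y0"): 3 / 8,
    ("x0", "y1"): 3 / 8,
    ("x1", "y0"): 1 / 8,
    ("x1", "y1"): 1 / 8,
}

px, py = marginals(joint)
mi = mutual_information(joint)
mi_from_entropies = entropy(px.values()) + entropy(py.values()) - entropy(joint.values())
mi_independent = mutual_information(independent)

print(f"I(X;Y) from definition  = {mi:.4f} bits")
print(f"H(X) + H(Y) - H(X,Y)    = {mi_from_entropies:.4f} bits")
print(f"I(X;Y) when independent = {mi_independent:.4f} bits")
```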
Summary on Information theory
• Reading: M&S 2.2
• It is the use of probability theory to quantify and measure “information”.
• Basic concepts:
– Entropy
– Cross entropy and relative entropy
– Joint entropy and conditional entropy
– Entropy of the language and perplexity
– Mutual information
Hw1
• Q1-Q5: Information theory
• Q6: Condor submit
• Q7: Hw10 from LING570.
– You are not required to turn in anything for Q7.
– If you want feedback on this, you can choose to turn it in.
– It won’t be graded. You get 30 points for free.
Q6: condor submission
• http://staff.washington.edu/brodbd/orientation.pdf
• See especially slides #22-#28.
For a command we can run as:
  mycommand -a -n <mycommand.in >mycommand.out
The submit file might look like this (save it as *.cmd):
  # The command
  Executable = mycommand
  Universe = vanilla
  getenv = true
  # STDIN
  input = mycommand.in
  # STDOUT
  output = mycommand.out
  # STDERR
  error = mycommand.error
  # A log file that stores the results of the condor submission
  Log = /tmp/brodbd/mycommand.log
  # The arguments for the command
  arguments = "-a -n"
  transfer_executable = false
  Queue
Submission and monitoring jobs on condor
• Submission:
condor_submit mycommand.cmd
=> get a job number
• List the job queue:
condor_q
Status changes from “I” (idle) to “R” (run), and then to one of:
– “H”: means the job fails. Look at the log file specified in *.cmd.
– Disappeared from the queue: You will receive an email.
• Use “man condor_q” etc. to learn more about those commands.