
Internat. J. Math. & Math. Sci. Vol. 6 No. 2 (1983) 403-406


RESEARCH NOTES

ON A PROPERTY OF PROBABILISTIC CONTEXT-FREE GRAMMARS

R. CHAUDHURI
Computer Science Group, Department of Mathematics
East Carolina University
Greenville, North Carolina 27834, U.S.A.

A.N.V. RAO
Centre of Applied Mathematics, Department of Mathematics
University of South Florida
Tampa, Florida 33620, U.S.A.

(Received February 14, 1983)

ABSTRACT. It is proved that for a probabilistic context-free language L(G), the population density of a character (terminal symbol) is equal to its relative density in the words of a sample S from L(G) whenever the production probabilities of the grammar G are estimated by the relative frequencies of the corresponding productions in the sample.

KEY WORDS AND PHRASES. Probabilistic grammars, stochastic expectation matrix, consistency and character density.

1980 MATHEMATICS SUBJECT CLASSIFICATION CODES. 68F05, 68F15.

1. INTRODUCTION.

The production probabilities of a context-free grammar may be estimated by the relative frequencies of the corresponding productions in the words of a sample from the language generated by the grammar. It was conjectured by Wetherell [3] that in such a case the expected derivation length and the expected word length of the words in a language would be exactly equal to the mean derivation length and the mean word length of the words in the corresponding sample. The conjecture of Wetherell was proved to be true by Chaudhuri, Pham and Garcia [2]. In this note, we prove that an analogous result holds for character densities also.

2. PRELIMINARIES.

It is assumed that the reader is familiar with elements of formal language theory. Let $G = (V_N, V_T, \sigma, P)$ be a context-free grammar. Assume that $V_N = \{\sigma, N_1, N_2, \ldots, N_r\}$


and $V_T = \{t_1, t_2, \ldots, t_s\}$ are the finite sets of non-terminal and terminal symbols, respectively. For each $N_i \in V_N$, let $N_i \to u_{ia}$ $(a = 1, 2, \ldots, n_i)$ be the productions that rewrite the non-terminal $N_i$. The set $P$ consists of all the productions and $\sigma$ is the special start symbol. The language corresponding to $G$ will be denoted by $L(G)$.

Definition 1. A probabilistic context-free grammar is a pair $(G, \phi)$ where $G$ is a context-free grammar and $\phi = (\phi_1, \phi_2, \ldots, \phi_r)$ is a vector satisfying the following properties:

i) for each $i$, $\phi_i = (p_{i1}, p_{i2}, \ldots, p_{in_i})$ where $0 \le p_{ia} \le 1$, $a = 1, 2, \ldots, n_i$, and $p_{ia} = \mathrm{Prob}(N_i \to u_{ia})$;

ii) for each $i$, $\sum_{a=1}^{n_i} p_{ia} = 1$.

In the following, we always assume $G$ to be unambiguous, i.e. each word $w \in L(G)$ has a unique derivation starting from $\sigma$. The distribution vector of a probabilistic context-free grammar $(G, \phi)$ induces a measure $\mu$ on $L(G)$. For each word $w \in L(G)$, $\mu(w)$ is precisely the product of the probabilities of the productions used to derive $w$ from the start symbol $\sigma$. For details, see [1]. A probabilistic context-free grammar $(G, \phi)$ is called consistent iff $\sum_{w \in L(G)} \mu(w) = 1$.
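As a minimal illustration (not taken from the paper), consider a hypothetical unambiguous grammar with a single non-terminal $\sigma$ and productions $\sigma \to a\sigma$ (probability 0.4) and $\sigma \to b$ (probability 0.6). The sketch below computes $\mu(w)$ for one word as the product of the probabilities of the productions used in its unique derivation; all names are illustrative.

```python
# Toy illustration (not from the paper): mu(w) is the product of the
# probabilities of the productions used in the unique derivation of w.

from functools import reduce

# Hypothetical unambiguous grammar: sigma -> a sigma (p = 0.4), sigma -> b (p = 0.6).
PROB = {("sigma", ("a", "sigma")): 0.4,
        ("sigma", ("b",)): 0.6}

def mu(derivation):
    """Product of the probabilities of the productions in the derivation."""
    return reduce(lambda acc, prod: acc * PROB[prod], derivation, 1.0)

# Derivation of the word "aab": sigma => a sigma => a a sigma => a a b.
derivation_of_aab = [("sigma", ("a", "sigma")),
                     ("sigma", ("a", "sigma")),
                     ("sigma", ("b",))]
print(mu(derivation_of_aab))  # 0.4 * 0.4 * 0.6 = 0.096
```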

Definition 2. Let $(G, \phi)$ be a probabilistic context-free grammar. Define the matrix $M = (m_{ij})$, $1 \le i, j \le r$, as follows:

$$m_{ij} = \sum_{a=1}^{n_i} p_{ia}\, f_{aj}^{(i)},$$

where $n_i$ is the number of productions of the type $N_i \to u_{ia}$ and $f_{aj}^{(i)}$ is the number of occurrences of the non-terminal $N_j$ in $u_{ia}$.

The $(r \times r)$ matrix $M$ is called the stochastic expectation matrix corresponding to $(G, \phi)$. If the spectral radius $R(M)$ of the matrix $M$ is less than unity, then $(G, \phi)$ is called strongly consistent. Also, it was noted in [1,3] that if $R(M) < 1$ then the matrix series $I + M + M^2 + \cdots$ converges to a matrix $M^{*} = (I - M)^{-1}$. The matrix $M^{*}$ must be known in order to find the expected derivation length $EDL(\sigma)$, the expected word length $EWL(\sigma)$, and many other statistical characteristics of the language $L(G)$. For details, see [1,2,3].
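A short computational sketch of Definition 2, under the same hypothetical toy grammar as above (the grammar and the variable names are my own illustration, not the paper's): it builds the stochastic expectation matrix $M$, checks $R(M) < 1$, and forms $M^{*} = (I - M)^{-1}$.

```python
# Sketch (toy grammar assumed, not from the paper): stochastic expectation
# matrix M of Definition 2, spectral-radius check, and M* = (I - M)^{-1}.

import numpy as np

nonterminals = ["sigma"]                       # N_1 = sigma
productions = {"sigma": [(0.4, ["a", "sigma"]),  # (p_{ia}, right-hand side u_{ia})
                         (0.6, ["b"])]}

r = len(nonterminals)
M = np.zeros((r, r))
for i, Ni in enumerate(nonterminals):
    for p_ia, rhs in productions[Ni]:
        for j, Nj in enumerate(nonterminals):
            # f_{aj}^{(i)} = number of occurrences of N_j in u_{ia}
            M[i, j] += p_ia * rhs.count(Nj)

spectral_radius = max(abs(np.linalg.eigvals(M)))
assert spectral_radius < 1, "grammar is not strongly consistent"

M_star = np.linalg.inv(np.eye(r) - M)   # M* = I + M + M^2 + ... = (I - M)^{-1}
print(M)        # [[0.4]]
print(M_star)   # [[1.666...]]
```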

The production probabilities of a grammar $(G, \phi)$ can be estimated by the relative frequencies of the corresponding productions in the derivation processes of all the words in a sample $S$ from $L(G)$. The estimated probability of the production $N_i \to u_{ia}$ is

$$p_{ia} = b_{ia} \Big/ \sum_{\gamma=1}^{n_i} b_{i\gamma},$$

where $b_{ia}$ is the frequency of the production $N_i \to u_{ia}$. Thus, each sample $S$ from $L(G)$ will induce a distribution vector $\phi$ and will give rise to a probabilistic context-free grammar $(G, \phi)$.
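The estimation step can be sketched as follows; the frequency counts $b_{ia}$ below are hypothetical numbers for the same toy grammar, not data from the paper.

```python
# Sketch (hypothetical counts): estimate p_{ia} = b_{ia} / sum_gamma b_{i gamma},
# the relative frequency of each production among all productions rewriting N_i
# in the derivations of the words of a sample S.

from collections import defaultdict

# b_{ia}: frequency of each production in the derivations of the sample
# (hypothetical numbers for the toy grammar sigma -> a sigma | b).
b = {("sigma", ("a", "sigma")): 14,
     ("sigma", ("b",)): 21}

totals = defaultdict(int)
for (Ni, _), count in b.items():
    totals[Ni] += count          # sum over a of b_{ia} for each non-terminal N_i

p_hat = {prod: count / totals[prod[0]] for prod, count in b.items()}
print(p_hat)   # {('sigma', ('a', 'sigma')): 0.4, ('sigma', ('b',)): 0.6}
```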


Definition 3. Let $S$ be a sample of size $\lambda$ from $L(G)$. The mean derivation length and the mean word length of the words in $S$ are defined as follows:

MDL(S) = (sum of the derivation lengths of the words in $S$)/$\lambda$,

MWL(S) = (sum of the word lengths of the words in $S$)/$\lambda$.
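A minimal sketch of Definition 3 on a hypothetical sample (my own illustration): for the toy grammar $\sigma \to a\sigma \mid b$, the word $a^k b$ has a unique derivation of length $k+1$, so derivation length and word length happen to coincide here.

```python
# Sketch (hypothetical sample): mean derivation length and mean word length
# of Definition 3 for a toy grammar where the word a^k b has derivation
# length k + 1 (one production application per step).

S = ["b", "ab", "aab", "b", "ab"]          # hypothetical sample, size lambda = 5

derivation_lengths = [len(w) for w in S]   # equals word length for this grammar
word_lengths = [len(w) for w in S]

MDL = sum(derivation_lengths) / len(S)     # = 9/5 = 1.8
MWL = sum(word_lengths) / len(S)           # = 9/5 = 1.8
print(MDL, MWL)
```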

The following conjecture of Wetherell [3] was proved in [2].

Theorem 1. Let $G$ be any unambiguous context-free grammar and let $S$ be any sample from $L(G)$. Then the probabilistic context-free grammar $(G, \phi)$ is strongly consistent and

$$EDL(\sigma) = MDL(S) \quad \text{and} \quad EWL(\sigma) = MWL(S),$$

where $\sigma$ is the start symbol of $G$ and $\phi$ is the distribution vector induced by $S$ on $G$.

3. MAIN RESULT.

The concept of character density was introduced in [1].

Definition 4. The character density $d(t_j)$ of a character (terminal symbol) $t_j \in V_T$ is the relative number of times the character $t_j$ appears in the words of $L(G)$. Let $g_{aj}^{(i)}$ be the number of $t_j$'s in the consequence $u_{ia}$ of the production $N_i \to u_{ia}$. Let $T^{(j)}$ be the column vector whose $i$-th element is

$$T_i^{(j)} = \sum_{a=1}^{n_i} p_{ia}\, g_{aj}^{(i)} \qquad (i = 1, 2, \ldots, r).$$

This term may be interpreted as the average number of $t_j$'s produced directly from the non-terminal $N_i$. If $(G, \phi)$ is a strongly consistent grammar, then it was proved in [1] that for any character $t_j \in V_T$,

$$d(t_j) = (1\ 0\ 0\ \ldots\ 0)\, M^{*}\, T^{(j)} \big/ EWL(\sigma).$$
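The character-density formula can be sketched computationally for the toy grammar used above (my own illustration, not the paper's example). One assumption here: $EWL(\sigma)$ is computed as the sum over all terminals of the expected terminal counts $(1\ 0\ \ldots\ 0)M^{*}T^{(j)}$, which is consistent with the densities summing to one.

```python
# Sketch (toy grammar assumed): character density
#   d(t_j) = (1 0 ... 0) M* T^{(j)} / EWL(sigma)
# with T_i^{(j)} = sum_a p_{ia} g_{aj}^{(i)}.

import numpy as np

nonterminals = ["sigma"]                      # N_1 = sigma (start symbol first)
terminals = ["a", "b"]
productions = {"sigma": [(0.4, ["a", "sigma"]),
                         (0.6, ["b"])]}

r = len(nonterminals)
M = np.zeros((r, r))
T = {t: np.zeros(r) for t in terminals}
for i, Ni in enumerate(nonterminals):
    for p_ia, rhs in productions[Ni]:
        for j, Nj in enumerate(nonterminals):
            M[i, j] += p_ia * rhs.count(Nj)   # expectation matrix of Definition 2
        for t in terminals:
            T[t][i] += p_ia * rhs.count(t)    # T_i^{(j)} for each terminal t_j

M_star = np.linalg.inv(np.eye(r) - M)
first_row = M_star[0]                          # (1 0 ... 0) M*
# Assumption: EWL(sigma) taken as the sum of the expected terminal counts.
EWL = sum(first_row @ T[t] for t in terminals)

for t in terminals:
    print(t, (first_row @ T[t]) / EWL)         # d(a) = 0.4, d(b) = 0.6
```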

Definition 5. The relative density of a terminal symbol (character) $t_j \in V_T$ in a sample $S$ from $L(G)$ is defined as

RD($t_j$, $S$) = (total number of times $t_j$ appears in the words of $S$)/(sum of the word lengths of all words in $S$).

Note that

$$RD(t_j, S) = \Big(\sum_{i=1}^{r} h_{ij}\Big) \Big/ \Big(\sum_{i=1}^{r} \sum_{j=1}^{s} h_{ij}\Big),$$

where $h_{ij} = \sum_{a=1}^{n_i} b_{ia}\, g_{aj}^{(i)}$ is the number of $t_j$'s produced from the non-terminal $N_i$ in the derivation processes of all the words in the sample $S$.
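The relative density of Definition 5 can be computed directly from the sample, as in the following sketch (hypothetical sample, not from the paper).

```python
# Sketch (hypothetical sample): relative density of a terminal in a sample S,
#   RD(t_j, S) = (occurrences of t_j in S) / (sum of word lengths in S).

from collections import Counter

# Hypothetical sample of words over V_T = {a, b} from the toy grammar.
S = ["b", "ab", "aab", "b", "ab"]

counts = Counter()
for w in S:
    counts.update(w)                    # occurrences of each terminal in the sample

total_length = sum(len(w) for w in S)   # sum of the word lengths of all words in S
RD = {t: counts[t] / total_length for t in counts}
print(RD)   # {'b': 5/9, 'a': 4/9}
```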

The following theorem relates the population density and relative density of a terminal symbol (character) $t_j$.

Theorem 2. Let G be any unambiguous context-free grammar and let S be any


sample from $L(G)$. Then for any terminal symbol (character) $t_j$ we have

$$d(t_j) = RD(t_j, S).$$

Proof: As noted in [2], the estimation of the production probabilities of a context-free grammar $G$ by choosing a sample from $L(G)$ gives rise to a probabilistic context-free grammar $(G, \phi)$ which is strongly consistent. Let $M$ be the stochastic expectation matrix corresponding to $(G, \phi)$. It was proved in [2] that in such a case the matrix series $I + M + M^2 + \cdots$ converges to a matrix $M^{*}$ and, moreover, for each $i$,

$$M^{*}_{1i} = e_i / \lambda \qquad (\lambda = \text{sample size}),$$

where $e_i = \sum_{a=1}^{n_i} b_{ia}$, $b_{ia}$ being the frequency of the production $N_i \to u_{ia}$ in the sample $S$. Note that

$$(1\ 0\ 0\ \ldots\ 0)\, M^{*}\, T^{(j)} = \sum_{i=1}^{r} M^{*}_{1i}\, T_i^{(j)} = \frac{1}{\lambda} \sum_{i=1}^{r} e_i\, T_i^{(j)} = \frac{1}{\lambda} \sum_{i=1}^{r} h_{ij},$$

since $e_i\, T_i^{(j)} = \sum_{a=1}^{n_i} b_{ia}\, g_{aj}^{(i)} = h_{ij}$ when $p_{ia} = b_{ia}/e_i$. By virtue of Theorem 1, we have

$$EWL(\sigma) = MWL(S) = \frac{1}{\lambda} \sum_{j=1}^{s} \sum_{i=1}^{r} h_{ij}.$$

Hence,

$$d(t_j) = (1\ 0\ \ldots\ 0)\, M^{*}\, T^{(j)} \big/ EWL(\sigma) = \Big(\sum_{i=1}^{r} h_{ij}\Big) \Big/ \Big(\sum_{j=1}^{s} \sum_{i=1}^{r} h_{ij}\Big) = RD(t_j, S).$$
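As a numerical illustration of Theorem 2 (my own toy check, not a computation from the paper), the sketch below estimates the production probabilities of the hypothetical grammar $\sigma \to a\sigma \mid b$ from a small sample, computes the population densities $d(t_j)$ from $M^{*}$ and $T^{(j)}$, and compares them with the relative densities $RD(t_j, S)$; all sample data are invented.

```python
# Numerical illustration of Theorem 2 on a toy grammar (sigma -> a sigma | b).

import numpy as np

# Hypothetical sample S; each word a^k b has the unique derivation that uses
# sigma -> a sigma  k times and sigma -> b once.
S = ["b", "ab", "aab", "b", "ab"]

b_aS = sum(w.count("a") for w in S)   # frequency of sigma -> a sigma
b_b = len(S)                          # frequency of sigma -> b
p_aS = b_aS / (b_aS + b_b)            # estimated production probabilities
p_b = b_b / (b_aS + b_b)

# Stochastic expectation matrix (1 x 1 here) and M* = (I - M)^{-1}.
M = np.array([[p_aS]])                # sigma occurs once in "a sigma", never in "b"
M_star = np.linalg.inv(np.eye(1) - M)

T = {"a": np.array([p_aS]), "b": np.array([p_b])}   # T^{(j)} of Definition 4
EWL = sum(M_star[0] @ T[t] for t in T)              # expected word length

d = {t: (M_star[0] @ T[t]) / EWL for t in T}        # population densities
total_len = sum(len(w) for w in S)
RD = {t: sum(w.count(t) for w in S) / total_len for t in T}

print(d)    # {'a': 0.444..., 'b': 0.555...}
print(RD)   # matches d, as Theorem 2 asserts
```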

REFERENCES

1. BOOTH, T.L. and THOMPSON, R.A. Applying Probability Measures to Abstract Languages, IEEE Trans. Comp. C-22 (1973) 442-450.

2. CHAUDHURI, R., PHAM, S., and GARCIA, O.N. Solution of an Open Problem on Probabilistic Context-Free Grammars, IEEE Trans. Comp. (1982), to appear.

3. WETHERELL, C.S. Probabilistic Languages: A Review and Some Open Questions, Computing Surveys 12, No. 4 (1980) 361-379.
