Representation Structures for Computational Linguistics

Gérard Huet

ESSLLI 2002, Trento

- 1 -
What the course is about

• A computational platform for Sanskrit
• The ZEN computational morphology toolkit
• Pidgin ML
• The functional programming paradigm for CL
• Concrete programming issues in Objective Caml + Camlp4
• General architecture issues for a CL platform
• Cooperation on free CL resources

Two specific applicative technologies:

• Local processing of focused data
• Sharing

- 2 -
What shall not be discussed

• ML vs C++
• ML vs Java
• ML vs Prolog

- 3 -
What shall not be discussed at length

• Objective CAML vs SML
• ML vs Haskell
• ML vs C
• Pidgin ML vs Objective CAML

- 4 -
Basics: lists vs stacks

    value l5 = [1; 2; 3; 4; 5];
    value s5 = [5; 4; 3; 2; 1];

    value rec unstack l s = match l with
      [ [] -> s
      | [h :: t] -> unstack t [h :: s]
      ];

    value rev l = unstack l [];

    value state3 = ([3; 2; 1], [4; 5]);

- 5 -
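The definitions on this slide can be tried out directly. The sketch below restates them in standard OCaml syntax rather than the Pidgin ML of the slides. The function `step_right` is a hypothetical addition, not part of the course material: it illustrates why a pair such as `state3` is a list with a cursor, the left part being kept as a stack.

```ocaml
(* Standard OCaml syntax; the slides use the revised (Pidgin ML) syntax. *)

(* unstack l s pops every element of l and pushes it on s. *)
let rec unstack l s =
  match l with
  | [] -> s
  | h :: t -> unstack t (h :: s)

let rev l = unstack l []

let l5 = [1; 2; 3; 4; 5]
let s5 = rev l5

(* A list cut after its third element: the left part is stacked
   (most recent element first), the right part is still pending. *)
let state3 = ([3; 2; 1], [4; 5])

(* Hypothetical helper: move the cut one position to the right. *)
let step_right (left, right) =
  match right with
  | [] -> failwith "step_right"
  | x :: rest -> (x :: left, rest)

(* Unstacking the left part onto the right part rebuilds the list. *)
let rebuilt = unstack (fst state3) (snd state3)
```

Moving the cursor costs one pop and one push, whatever the length of the list; this is the one-dimensional version of the zipper introduced next.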
Turing machines, Emacs, and Zippers

Zippers. First presentation at FLoC'96. Published as:
G. Huet. The Zipper. J. Functional Programming 7,5 (1997), 549-554.

Large scale implementations in syntax editors within computational linguistics platforms:

• G. Huet. Lexical morphisms with the Zen platform.
• A. Ranta. Grammatical frameworks.

- 6 -
Contexts as zippers

    type tree = [ Tree of forest ]
    and forest = list tree;

    type tree_zipper =
      [ Top
      | Zip of (forest * tree_zipper * forest)
      ];

    type focused_tree = (tree_zipper * tree);

A focused tree is a tree with a focus point of interest, i.e. a tree and a stacked context.

- 7 -
Operations on focused trees

    value down (z,t) = match t with
      [ Tree(forest) -> match forest with
          [ [] -> raise (Failure "down")
          | [hd :: tl] -> (Zip([],z,tl), hd)
          ]
      ];

    value up (z,t) = match z with
      [ Top -> raise (Failure "up")
      | Zip(l,u,r) -> (u, Tree(unstack l [t :: r]))
      ];

- 8 -
More operations on focused trees

    value left (z,t) = match z with
      [ Top -> raise (Failure "left")
      | Zip(l,u,r) -> match l with
          [ [] -> raise (Failure "left")
          | [elder :: elders] -> (Zip(elders,u,[t :: r]), elder)
          ]
      ];

    value right (z,t) = match z with
      [ Top -> raise (Failure "right")
      | Zip(l,u,r) -> match r with
          [ [] -> raise (Failure "right")
          | [younger :: youngers] -> (Zip([t :: l],u,youngers), younger)
          ]
      ];

- 9 -
Applicative updating

    value del_l (z,_) = match z with
      [ Top -> raise (Failure "del_l")
      | Zip(l,u,r) -> match l with
          [ [] -> raise (Failure "del_l")
          | [elder :: elders] -> (Zip(elders,u,r), elder)
          ]
      ];

    value replace (z,_) t = (z,t);

- 10 -
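To see the navigation and update operations work together, here is a minimal sketch in standard OCaml syntax. It transcribes `down`, `up`, `left`, `right` and `replace` from the slides; the function `top`, which climbs back to the root, is a hypothetical helper added for the illustration.

```ocaml
type tree = Tree of tree list

type tree_zipper =
  | Top
  | Zip of tree list * tree_zipper * tree list  (* elders, context, youngers *)

type focused_tree = tree_zipper * tree

(* Focus on the first child. *)
let down (z, Tree forest) =
  match forest with
  | [] -> failwith "down"
  | hd :: tl -> (Zip ([], z, tl), hd)

(* Focus on the parent, rebuilt from the stacked elder siblings. *)
let up (z, t) =
  match z with
  | Top -> failwith "up"
  | Zip (l, u, r) -> (u, Tree (List.rev_append l (t :: r)))

let left (z, t) =
  match z with
  | Zip (elder :: elders, u, r) -> (Zip (elders, u, t :: r), elder)
  | _ -> failwith "left"

let right (z, t) =
  match z with
  | Zip (l, u, younger :: youngers) -> (Zip (t :: l, u, youngers), younger)
  | _ -> failwith "right"

(* Applicative update: the old focused tree is untouched. *)
let replace (z, _) t = (z, t)

(* Hypothetical helper: zip back up to the root and return the tree. *)
let rec top ft =
  match ft with
  | (Top, t) -> t
  | _ -> top (up ft)

let leaf = Tree []
let example = Tree [leaf; Tree [leaf]; leaf]

(* Focus on the second child, then replace it by a leaf. *)
let focus = right (down (Top, example))
let edited = top (replace focus leaf)
```

Each operation touches only the focused node and the top of the context, so an edit deep inside a large tree costs the same as an edit at the root; `example` itself is never modified.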
Points of view about focused structures

• Manipulation of focused data is local
• Redundant representation - efficiency
• The Interaction Combinators Paradigm

Remark. Zippers are linear contexts. They are superior to Ω-terms, notably because the approximation ordering is substructural. The Natural Transformation from tree functors to zipper functors is Differentiation; Zippers may also be seen as the linear Functions over Trees.

- 11 -
Back to linguistics

We want to process (parse and generate) natural language sentences, dialogues, and corpora of various kinds (oral, written, news, books, web sites, etc).

We assume that the data is already digitized and discretized as a stream of letters (phonemes for oral data, letters for written data).

A fundamental entity in this processing is the word. One traditionally distinguishes processing between streams of letters and words (morphology, lexical analysis) from processing between words and sentences (syntax, parsing). However, the nature of the word is elusive.

- 12 -
What Tesnière has to say

The linguist Tesnière, in his Éléments de Syntaxe Structurale, says:

"Simple as it may seem, the notion of word is one of those whose definition is most delicate for the linguist. Perhaps this is because too often one starts from the notion of word to arrive at the notion of sentence, instead of starting from the notion of sentence to arrive at the notion of word. Now one cannot define the sentence from the word, but only the word from the sentence. For the notion of sentence is logically prior to that of word."

- 13 -
Ontological Problem

What Tesnière really says is self-evident: it is the ontological priority of the Corpus over the Lexicon. The words are found in the Corpus, then copied to the Lexicon; the Language is defined by its Corpus.

The preeminence of the Corpus over the Lexicon is undeniable. Nevertheless, the words are recognized in the corpus relative to the generative devices of morphology; the inversion of these generative relations extends the strict covering of the corpus by the generative capabilities of the grammar; and thus there is a tension between the co-inductive structure of the lexicon as a repository of utterances and the inductive structure of words as generated by morphological devices from stems in the lexicon.

- 14 -
Philosophical considerations

Anecdote. The Thamadas in Georgia.

Puzzles. The 'oui' problem. The 'oiu' problem.

Research topic. Define the functor the fixpoint of which is constructed.

Technology. Chase out hapaxes. Or rather, index properly the diachronic dimension of the language under consideration.

- 15 -
Back to the Lexicon

Words. Words are represented as lists of positive integers.

    type letter = int
    and word = list letter;

We provide coercions encode : string -> word and decode : word -> string.

Here is lexicographic ordering.

    value rec lexico l1 l2 = match l1 with
      [ [] -> True
      | [c1 :: r1] -> match l2 with
          [ [] -> False
          | [c2 :: r2] -> if c2<c1 then False
                          else if c2=c1 then lexico r1 r2
                          else True
          ]
      ];

- 16 -
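As a quick check of the ordering above, here is a sketch in standard OCaml syntax. The coercions `encode` and `decode` shown here are assumptions made for the example: they simply use ASCII character codes as letters, whereas the toolkit may use its own letter numbering.

```ocaml
type letter = int
type word = letter list

(* Assumed coercions: one letter per character, numbered by ASCII code. *)
let encode (s : string) : word =
  List.init (String.length s) (fun i -> Char.code s.[i])

let decode (w : word) : string =
  String.concat "" (List.map (fun c -> String.make 1 (Char.chr c)) w)

(* lexico l1 l2 is true when l1 comes before l2, or is equal to it. *)
let rec lexico (l1 : word) (l2 : word) : bool =
  match l1, l2 with
  | [], _ -> true
  | _ :: _, [] -> false
  | c1 :: r1, c2 :: r2 ->
      if c2 < c1 then false
      else if c2 = c1 then lexico r1 r2
      else true
```

Note that `lexico` is a non-strict order: a word is below itself, and a prefix such as "cat" is below any of its extensions such as "cats".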
Differential words

    type delta = (int * word);

A differential word is a notation for retrieving a word w from another word w′ sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n,u) we go up n times and then down along word u.

We compute the difference between w and w′ as a differential word diff w w′ = (|w1|, w2) where w = p.w1 and w′ = p.w2, with maximal common prefix p.

The converse of diff : word -> word -> delta is patch : delta -> word -> word: w′ may be retrieved from w and d = diff w w′ as w′ = patch d w.

- 17 -
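The slide specifies `diff` and `patch` without giving their code. One possible implementation, in standard OCaml syntax, is sketched below; it follows the stated specification but is not taken from the toolkit, and the local helper `drop` is hypothetical.

```ocaml
type letter = int
type word = letter list
type delta = int * word

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, p maximal:
   strip the common prefix, then count what remains of w. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* patch (n, u) w goes up n letters from the end of w, then down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop k l =
    if k = 0 then l
    else match l with
      | [] -> failwith "patch"
      | _ :: t -> drop (k - 1) t
  in
  (* Reverse w, drop its last n letters, reverse back while appending u. *)
  List.rev_append (drop n (List.rev w)) u
```

For example, with letters written as small integers, `diff [1; 2; 3; 4] [1; 2; 5]` is `(2, [5])`: from the first word, go up two letters and then down along the single letter 5.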
Tries

Tries store sparse sets of words sharing initial prefixes. They are due to René de la Briandais (1959). We use a very simple representation with lists of siblings.

    type trie = [ Trie of (bool * forest) ]
    and forest = list (Word.letter * trie);

Tries are managed (search, insertion, etc) using the zipper technology.

- 18 -
Important remarks

Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.

Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.

- 19 -
Lexicon

Here is a simplistic lexicon compiler make_lex : list string -> trie:

    value make_lex =
      List.fold_left (fun lex c -> Trie.enter lex (Word.encode c)) Trie.empty;

For instance, with english.lst storing a list of 173528 words, as a text file of size 2 MB, the command make_lex < english.lst > english.rem produces a trie representation as a file of 4.5 MB.

Tries share the words by their prefixes, but common suffixes account for a lot of redundancy in the structure. We shall eliminate this redundancy by sharing.

- 20 -
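The slides rely on `Trie.enter` and `Trie.empty` without showing them. The sketch below, in standard OCaml syntax, gives one possible reading. It is an assumption in several respects: insertion is written by plain recursion rather than with the zipper technology the slides mention, siblings are kept sorted by letter, `encode` uses ASCII codes, and `mem` is a hypothetical membership test.

```ocaml
type letter = int
type word = letter list
type trie = Trie of bool * (letter * trie) list

let empty = Trie (false, [])

(* enter t w returns a trie recognizing w in addition to the words of t.
   Siblings are kept sorted, so the result is independent of insertion order. *)
let rec enter (Trie (b, forest)) (w : word) : trie =
  match w with
  | [] -> Trie (true, forest)
  | c :: rest ->
      let rec ins = function
        | [] -> [ (c, enter empty rest) ]
        | (c', t) :: others when c' = c -> (c', enter t rest) :: others
        | (c', t) :: others when c' > c ->
            (c, enter empty rest) :: (c', t) :: others
        | arc :: others -> arc :: ins others
      in
      Trie (b, ins forest)

(* Hypothetical membership test: run the trie as a deterministic automaton. *)
let rec mem (Trie (b, forest)) (w : word) : bool =
  match w with
  | [] -> b
  | c :: rest ->
      (match List.assoc_opt c forest with
       | Some t -> mem t rest
       | None -> false)

(* Assumed coercion: ASCII codes as letters. *)
let encode (s : string) : word =
  List.init (String.length s) (fun i -> Char.code s.[i])

let make_lex : string list -> trie =
  List.fold_left (fun lex s -> enter lex (encode s)) empty

let lexicon = make_lex ["cat"; "cats"; "dog"]
```

Here "cat" and "cats" share the path c-a-t, while the final 's' nodes of "cats" and of a later "dogs" would still be stored twice; that suffix redundancy is what sharing removes.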
The Share Functor

    module Share : functor (Algebra : sig type domain = 'a;
                                          value size : int; end)
      -> sig value share : Algebra.domain -> int -> Algebra.domain;
         end;

That is, Share takes as argument a module Algebra providing a type domain and an integer value size, and it defines a value share of the stated type. We assume that the elements from the domain are presented with an integer key bounded by Algebra.size. That is, share x k will assume as precondition that 0 ≤ k < Max with Max = Algebra.size.

We shall construct the sharing map with the help of a hash table, made up of buckets (k, [e1; e2; ... en]) where each element ei has key k.

- 21 -
Memoizing

    type bucket = list Algebra.domain;

    value memo = Array.create Algebra.size ([] : bucket);

We shall use a service function search, such that search e l returns the first y in l such that y = e, or else raises the exception Not_found.

    value search e = List.find (fun x -> x=e);

- 22 -
The share function

    value share element key =
      let bucket = memo.(key) in
      try search element bucket with
      [ Not_found -> do { memo.(key) := [element :: bucket]; element } ];

Sharing is just recalling!

- 23 -
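Put together, the last three slides amount to a small functor. The sketch below restates it in standard OCaml syntax, with two substitutions that are choices of this example rather than of the slides: `Array.make` replaces the older `Array.create`, and `List.find_opt` replaces the exception-based `search`. The instance `StringShare` is hypothetical and serves only to show the effect.

```ocaml
module Share (Algebra : sig
  type domain
  val size : int
end) : sig
  val share : Algebra.domain -> int -> Algebra.domain
end = struct
  (* One bucket per key; precondition of share: 0 <= key < Algebra.size. *)
  let memo : Algebra.domain list array = Array.make Algebra.size []

  (* Return the representative already stored under this key if there is
     one structurally equal to element; otherwise record element itself. *)
  let share element key =
    let bucket = memo.(key) in
    match List.find_opt (fun x -> x = element) bucket with
    | Some shared -> shared
    | None ->
        memo.(key) <- element :: bucket;
        element
end

(* Hypothetical instance, for illustration only. *)
module StringShare = Share (struct
  type domain = string
  let size = 16
end)

(* Two equal strings built separately are distinct in memory... *)
let a = StringShare.share (String.make 3 'x') 5
(* ...but sharing under the same key recalls the first one. *)
let b = StringShare.share (String.make 3 'x') 5
```

After these two calls `a == b` holds: the second string is discarded in favour of the first. The caller supplies the key, so an element presented under a different key would not be recognized.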
Compressing trees as dags

We may for instance instantiate Share on the algebra of trees, with a size hash_max depending on the application:

    module Dag = Share (struct type domain = tree;
                               value size = hash_max; end);

And now we compress a trie into a minimal dag using share by a simple bottom-up traversal, where the key is computed along the way by hashing. For this we define a general bottom-up traversal function, which applies a parametric lookup function to every node and its associated key.

- 24 -
Dynamic programming

Bottom-up traversing with inductive hash-code computation.

    value hash1 key index sum = sum + index*key
    and hash forest = forest mod hash_max;

    value traverse lookup = travel
      where rec travel = fun
      [ Tree(forest) ->
          let f (tries,index,span) t =
            let (t0,k) = travel t
            in ([t0 :: tries], index+1, hash1 k index span)
          in let (forest0,_,span) = List.fold_left f ([],1,0) forest
          in let key = hash span
          in (lookup (Tree(rev forest0)) key, key)
      ];

- 25 -
Compressing a tree as a dag

Now, compressing a tree optimally as a minimal dag is simply effected by a sharing traversal:

    value compress = traverse Dag.share;

    value minimize tree = let (dag,_) = compress tree in dag;

- 26 -
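The same bottom-up scheme can be applied to tries, which is the case of interest for lexicons. The sketch below is in standard OCaml syntax and makes several assumptions: the sharing table is inlined instead of going through the Share functor, `hash_max` is an arbitrary value, and the hash functions `hash1` and `hash` are hypothetical variants that mix in the arc letter and the accepting flag. The helpers `chain` and `walk` exist only to build and inspect the example.

```ocaml
type trie = Trie of bool * (int * trie) list

let hash_max = 9689   (* arbitrary table size chosen for this sketch *)

let memo : trie list array = Array.make hash_max []

(* Associative memory: recall an equal node stored under key, or store t. *)
let share t key =
  match List.find_opt (fun x -> x = t) memo.(key) with
  | Some shared -> shared
  | None -> memo.(key) <- t :: memo.(key); t

(* Hypothetical hash-code combinators. *)
let hash1 letter key sum = sum + letter * (key + 1)
let hash b span = (span + (if b then 1 else 0)) mod hash_max

(* Bottom-up traversal: children are shared first, and the key of a node
   is computed from the keys of its children. *)
let rec compress (Trie (b, arcs)) =
  let f (arcs0, span) (c, t) =
    let (t0, k) = compress t in
    ((c, t0) :: arcs0, hash1 c k span)
  in
  let (arcs0, span) = List.fold_left f ([], 0) arcs in
  let key = hash b span in
  (share (Trie (b, List.rev arcs0)) key, key)

let minimize t = fst (compress t)

(* Example helpers: a single-word trie, and navigation along a word. *)
let chain letters =
  List.fold_right (fun c t -> Trie (false, [ (c, t) ])) letters (Trie (true, []))

let rec walk t w =
  match w with
  | [] -> t
  | c :: rest -> let (Trie (_, arcs)) = t in walk (List.assoc c arcs) rest

(* "cats" and "dogs", letters as ASCII codes. *)
let lexicon =
  Trie (false, [ (99, chain [97; 116; 115]); (100, chain [111; 103; 115]) ])

let dag = minimize lexicon
```

In `dag`, the node reached after "cat" and the node reached after "dog" are one and the same object in memory, while `dag` remains structurally equal to `lexicon`: the set of words is unchanged, only the storage is shared.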
Advantages and extensions

Hashing keys and size is on the client side: we do not delegate hashing to Share, which is just an associative memory. This has two advantages:

• The computation is fully linear
• It is adapted to the statistics of the data

Extension: Auto-sharing types (controlled hash-consing). Suggests a monad of shared hashed structures accommodating entropy of the data.

- 27 -
Dagified lexicons

We may dagify a lexicon a posteriori in one pass:

    value rec dagify () =
      let lexicon = (input_value stdin : Trie.trie)
      in let dag = Mini.minimize lexicon
      in output_value stdout dag;

Or we may maintain a dagified structure by sharing dynamically when inserting words, by appropriate modification of the zipper operations.

And now if we apply this technique to our English lexicon, with command dagify < english.rem > small.rem, we get an optimal representation which only needs 1 MB of storage, half of the original ASCII string representation.

- 28 -
Advertisement

The recursive algorithms given so far are fairly straightforward. They are easy to debug, maintain and modify due to the strong typing safeguard of ML, and even easy to formally certify. They are nonetheless efficient enough for production use, thanks to the optimizing native-code compiler of Objective Caml.

In our Sanskrit application, the trie of 11500 entries is shrunk from 219 KB to 103 KB in 0.1s, whereas the trie of 120000 flexed forms is shrunk from 1.63 MB to 140 KB in 0.5s on a 864 MHz PC. Our trie of 173528 English words is shrunk from 4.5 MB to 1 MB in 2.7s.

Measurements showed that the time complexity is linear with the size of the lexicon (within comparable sets of words).

- 29 -
Variations

Many variations on tries exist. Optimisations of lexical analysers for programming languages are described in the Dragon book. But the dragon book of computational linguistics has not been written yet.

Variation with ternary trees. Ternary trees are inspired by Bentley and Sedgewick. Ternary trees are more complex than tries, but use slightly less storage. Access is potentially faster in balanced trees than in tries. A good methodology seems to be to use tries for editing, and to translate them to balanced ternary trees for production use with a fixed lexicon.

The ternary version of our English lexicon takes 3.6 MB, a saving of 20% over its trie version using 4.5 MB. After dag minimization, it takes 1 MB, a saving of 10% over the trie dag version using 1.1 MB. For our Sanskrit lexicon index, the trie takes 221 KB and the tertree 180 KB. Shared as dags the trie takes 103 KB and the tertree 96 KB.

- 30 -
Decos, Lexmaps, Autos

We understand the Trie structure of a set of Words as a special case of a finitely based mapping Deco = Word → Annotation, in the case of Boolean annotations shared by prefix arguments (and by common subexpressions when shared).

We store morphology constructions as being of this type, and we investigate the reverse mapping by generalising them to relations, typically inductively defined through finite state machines.

The more sharing we get, the better we optimise this data layout. It is thus of paramount importance that the annotations be local quasi-morphism decorations.

- 31 -
Decos

    type deco 'a = [ Deco of (list 'a * dforest 'a) ]
    and dforest 'a = list (Word.letter * deco 'a);

We think of the decoration of type list 'a as information associated with the word stored at that node.

We can easily generalize sharing to decorated tries. However, substantial savings will result only if the information at a given node is a function of the subtrie at that node, i.e. if such information is defined as a trie morphism.

Definition. A deco is a tree morphism if the information at every node is a function of the corresponding sub-tree. Such decos preserve the sharing of the trees they decorate.

- 32 -
Encoding morphological parameters as decorations

We thus profit from the regularity of morphological transformations to have terse representations of the lexicon decorated by grammatical information.

Thus if all plurals are obtained by adding 's' to the singular stem except for a few exceptions, we do not pay any cost in encoding this plural information as an explicit instruction [pl: suffix s] decorating the stems, since it will not create any new node except for the few exceptions. Listing the plural forms explicitly, by contrast, would undo all sharing.

In our Sanskrit implementation, the various genders associated with a noun stem are defined in a deco used for producing the flexed forms. The flexed forms are then generated using an ad-hoc internal sandhi algorithm, difficult to encode as a finite-state process, and thus difficult to invert.

- 33 -
(Aside) The scoping structure of the lexicon

How to find the stem associated with a gender in the lexicon in one click so that morphology may be displayed - with no need of script or applet.

Simple distributed architecture - all the computation is done on the server side.

Maintaining computational invariants in the lexicon augments its robustness.

- 34 -
Explicit morphology vs implicit morphology

By explicit morphology I mean listing explicitly the forms generated by morphology operations from root stems, prefixes and suffixes.

By implicit morphology I mean just having programs which will generate these flexed forms on demand.

Implicit morphology is not enough to recognize the segments of sentences identical with a flexed form: the morphological functions must be invertible.

- 35 -
Compromise

On the other hand, the delimitation between implicit and explicit is blurred since e.g. a finite-state machine state graph may be considered both a program and a piece of data; for instance, a trie stores words, but actually the words are "recognized as being in the lexicon" by "running the lexicon over them as input data".

Thus we shall represent "explicitly" flexed forms and the information on how they are derived from root stems as a trie bearing as decorations instructions on how to "undo morphology" locally. For this purpose, we shall use the notion of differential word above.

We may now store inverse maps of lexical relations (such as morphology derivations) using the Lexmap structure. This way we bypass the (hard) problem of internal sandhi fsm axiomatisation.

- 36 -
Lexmaps

    type inverse 'a = (Word.delta * 'a)
    and inverse_map 'a = list (inverse 'a);

    type lexmap 'a = [ Map of (inverse_map 'a * mforest 'a) ]
    and mforest 'a = list (Word.letter * lexmap 'a);

Typically, if word w is stored at a node Map([...; (d,r); ...], ...), this represents the fact that w is the image by relation r of w′ = patch d w. Such a lexmap is thus a representation of the image by r of a source lexicon. This representation is invertible, while preserving maximally the sharing of prefixes, and thus being amenable to sharing.

Example: cats and dogs sharing their 's' node while implicitly referring to their respective singular stem.

- 37 -
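The cats and dogs example can be made concrete. In the sketch below (standard OCaml syntax), both plural forms end in the very same node, decorated with the differential word (1, []), meaning "go up one letter"; the singular stem is recovered from whichever form led to the node. The function `inverses`, the relation tag "pl", the helpers `path` and `encode`, and the compact definition of `patch` are all assumptions made for this illustration.

```ocaml
type word = int list
type delta = int * word

type 'a lexmap = Map of (delta * 'a) list * (int * 'a lexmap) list

(* patch (n, u) w: drop the last n letters of w, then append u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop k l = if k = 0 then l else drop (k - 1) (List.tl l) in
  List.rev_append (drop n (List.rev w)) u

(* Hypothetical lookup: all (source word, relation) pairs recorded for w,
   or the empty list when w is not stored or carries no decoration. *)
let inverses (lexmap : 'a lexmap) (w : word) : (word * 'a) list =
  let rec go (Map (invs, arcs)) = function
    | [] -> List.map (fun (d, r) -> (patch d w, r)) invs
    | c :: rest ->
        (match List.assoc_opt c arcs with
         | Some sub -> go sub rest
         | None -> [])
  in
  go lexmap w

(* Assumed coercion: ASCII codes as letters. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

(* The single node shared by all regular plurals. *)
let s_node = Map ([ ((1, []), "pl") ], [])

(* Undecorated path along the given letters, ending in the node last. *)
let rec path letters last =
  match letters with
  | [] -> last
  | c :: rest -> Map ([], [ (c, path rest last) ])

let lexmap =
  Map ([], [ (Char.code 'c', path (encode "ats") s_node);
             (Char.code 'd', path (encode "ogs") s_node) ])
```

Because the decoration is relative (a differential word) rather than absolute (the stem itself), it is identical for every regular plural, and so it does not stand in the way of sharing.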
Lexicon repositories using tries and decos

In a typical computational linguistics application, grammatical information (part of speech role, gender/number for substantives, valency and other subcategorization information for verbs, etc) may be stored as decoration of the lexicon of roots/stems. From such a decorated trie a morphological processor may compute the lexmap of all flexed forms, decorated with their derivation information encoded as an inverse map. This structure may itself be used by a tagging processor to construct the linear representation of a sentence decorated by feature structures. Such a representation will support further processing, such as computing syntactic and functional structures, typically as solutions of constraint satisfaction problems.

- 38 -
Example: Sanskrit

The main component in our tools is a structured lexical database. From this database, various hypertext documents may be produced mechanically.

The index CGI engine searches for words by navigating in a persistent trie index of stem entries. The current database comprises 12000 items, and its index has a size of 103 KB.

When computing this index, another persistent structure is created. It records in a deco all the genders associated with a noun entry. At present, this deco records genders for 5700 nouns, and it has a size of 268 KB.

We iterate on this genders structure a grammatical engine, which generates declined forms. This lexmap records about 120000 such flexed forms with associated grammatical information, and it has a size of 341 KB. A companion trie, without the information, keeps the index of flexed words as a minimized structure of 140 KB.

- 39 -
Finite State Lore

Computational phonology and morphology make extensive use of finite state technology: rational languages and relations, transducers, bimachines, etc.

• Schützenberger
• Koskenniemi
• Kaplan and Kay

Finite state toolsets have been developed, where word transformations are systematically compiled in a low-level algebra of finite-state machine operators. Such toolsets have been developed at Xerox, Paris VII, Bell Labs, Mitsubishi Labs, etc.

Compiling complex rewrite rules into rational transducers may be subtle.

We depart from this fine-grained methodology and propose more direct translations preserving the structure of the lexicon.

- 40 -
Finite State Machines as Lexicon Morphisms

We start with the remark that a lexicon represented as a trie is directly the state space representation of the (deterministic) finite state machine that recognizes its words, and that its minimization consists exactly in sharing the lexical tree as a dag. We are in a case where the state graph of such finite language recognizers is an acyclic structure. Such a pure data structure may be easily built without mutable references, which has computational and robustness advantages.

In the same spirit, we define automata which implement non-trivial rational relations (and their inversion) and whose state structure is nonetheless a more or less direct decoration of the lexicon trie. The crucial notion is that the state structure is a lexicon morphism.

- 41 -
Unglueing

We start with a toy problem which is the simplest case of juncture analysis, namely when there are no non-trivial juncture rules, and segmentation consists just in retrieving the words of a sentence glued together in one long string of characters (or phonemes). Consider for instance written English. You have a text file consisting of a sequence of words separated with blanks, and you have a lexicon complete for this text (for instance, 'spell' has been successfully applied). Now, suppose you make some editing mistake, which removes all spaces, and the task is to undo this operation to restore the original.

The transducer is defined as a functor, taking the lexicon trie structure as parameter.

- 42 -
Unglue

    module Unglue (Lexicon : sig value lexicon : Trie.trie; end) = struct

    type input = Word.word           (* input sentence as a word *)
    and output = list Word.word;     (* output is sequence of words *)

    type backtrack = (input * output)
    and resumption = list backtrack; (* coroutine resumptions *)

    exception Finished;

We define our unglueing reactive engine as a recursive process which navigates directly on the (flexed) lexicon trie (typically the compressed trie resulting from the Dag module considered above).

- 43 -
The reactive engine

The reactive engine takes as arguments the (remaining) input, the (partially constructed) list of words returned as output, a backtrack stack whose items are (input,output) pairs, the path occ in the state graph stacking (the reverse of) the current common prefix of the candidate words, and finally the current trie node as its current state. When the state is accepting, we push it on the backtrack stack, because we want to favor possible longer words, and so we continue reading the input until either we exhaust the input, or the next input character is inconsistent with the lexicon data.

- 44 -
The reactive engine code

    value rec react input output back occ = fun
      [ Trie(b,forest) ->
          if b then
             let pushout = [occ :: output] in
             if input=[] then (pushout,back) (* solution found *)
             else let pushback = [(input,pushout) :: back]
                  in continue pushback
          else continue back
          where continue cont = match input with
            [ [] -> backtrack cont
            | [letter :: rest] ->
                try let next_state = List.assoc letter forest
                    in react rest output cont [letter :: occ] next_state
                with [ Not_found -> backtrack cont ]
            ]
      ]

- 45 -
Backtrack

    and backtrack = fun
      [ [] -> raise Finished
      | [(input,output) :: back] -> react input output back [] Lexicon.lexicon
      ];

Now, unglueing a sentence is just calling the reactive engine from the appropriate initial backtrack situation.

    value unglue sentence = backtrack [(sentence,[])];

- 46 -
Remark

Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG black box?

The three golden rules of non-deterministic programming:

• Identify well your search state space
• Represent states as non-mutable data
• Prove termination

The last point is essential for understanding the granularity and enforcing completeness.

- 47 -
More on state space considerations

This non-deterministic process (recognizing L∗) uses the same state space as the lexicon/trie (recognizing L). This corresponds to the fact that an automaton for L∗ may be obtained from the automaton for L by inserting ε-moves from accepting nodes to the initial node.

But such transitions may be kept completely implicit. All you have to do is to manage the necessary non-determinism in the backtrack stack: continuing in L, which is not in general a prefix language (i.e. it may happen that both w and w·s are in L), versus iterating. You do not have to modify the state space data structure at all. It is just a shift in point of view concerning this data.

- 48 -
Still more on state space considerations

Remember that dagified tries define the minimal automaton of a finite language L.

But it is not the case that this automaton, completed with ε transitions, is minimal for L∗. Consider for instance L = {a, aa}.

However, note that we are using it as a transducer computing justifications for a word in L∗ to be a concatenation of precise words of L, and the minimal automaton does not keep enough information for that: distinct segmentations of a sentence must be separated.

- 49 -
Child talk

    module Childtalk = struct
      value lexicon = Lexicon.make_lex ["boudin"; "caca"; "pipi"];
    end;

    module Childish = Unglue(Childtalk);

    let (sol,_) = Childish.unglue (Word.encode "pipicacaboudin")
    in Childish.print_out sol;

We recover as expected: pipi caca boudin.

- 50 -
Generating several solutions

We resume a resumption with resume : (resumption -> int -> resumption).

    value resume cont n =
      let (output,resumption) = backtrack cont in
      do { print_string "\n Solution "; print_int n; print_string " :\n";
           print_out output; resumption };

    value unglue_all sentence = restore [(sentence,[])] 1
      where rec restore cont n =
        try let resumption = resume cont n
            in restore resumption (n+1)
        with [ Finished -> if n=1 then print_string "No solution found\n"
                           else () ];

- 51 -
Solving a charade

    module Short = struct
      value lexicon = Lexicon.make_lex
        ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
    end;

    module Charade = Unglue(Short);

    Charade.unglue_all (Word.encode "amiabletogether");

    Solution 1 : amiable together
    Solution 2 : amiable to get her
    Solution 3 : am i able together
    Solution 4 : am i able to get her

- 52 -
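The whole unglueing machinery fits in a short self-contained program. The sketch below, in standard OCaml syntax, follows `react` and `backtrack` from the slides but departs from them in ways that are choices of this example: letters are characters rather than integers, the lexicon is a function argument rather than a functor parameter, solutions are collected in a list of strings instead of being printed, and the helpers `explode`, `enter`, `make_lex`, `render` and `collect` are hypothetical.

```ocaml
type trie = Trie of bool * (char * trie) list

exception Finished

let explode s = List.init (String.length s) (String.get s)

(* Simple insertion; sibling order is irrelevant for lookup by letter. *)
let rec enter (Trie (b, forest)) = function
  | [] -> Trie (true, forest)
  | c :: rest ->
      let sub = try List.assoc c forest with Not_found -> Trie (false, []) in
      Trie (b, (c, enter sub rest) :: List.remove_assoc c forest)

let make_lex words =
  List.fold_left (fun lex w -> enter lex (explode w)) (Trie (false, [])) words

let unglue_all lexicon sentence =
  (* occ stacks the letters of the word being read, most recent first. *)
  let rec react input output back occ (Trie (b, forest)) =
    let continue cont =
      match input with
      | [] -> backtrack cont
      | letter :: rest ->
          (match List.assoc_opt letter forest with
           | Some next -> react rest output cont (letter :: occ) next
           | None -> backtrack cont)
    in
    if b then
      let pushout = occ :: output in
      if input = [] then (pushout, back)               (* solution found *)
      else continue ((input, pushout) :: back)         (* try longer words first *)
    else continue back
  and backtrack = function
    | [] -> raise Finished
    | (input, output) :: back -> react input output back [] lexicon
  in
  (* Words were stacked in reverse, letter by letter and word by word. *)
  let render solution =
    String.concat " "
      (List.rev_map
         (fun occ -> String.of_seq (List.to_seq (List.rev occ)))
         solution)
  in
  (* Resume the search after each solution until Finished is raised. *)
  let rec collect cont acc =
    match backtrack cont with
    | (solution, resumption) -> collect resumption (render solution :: acc)
    | exception Finished -> List.rev acc
  in
  collect [ (explode sentence, []) ] []

let short =
  make_lex ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]

let charade = unglue_all short "amiabletogether"
```

The states are immutable and the backtrack stack is an ordinary list, so a resumption can be kept, resumed later, or resumed twice, with no undo logic.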
Juncture euphony and its discretization

When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:

    [x] u | v → w

This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.

- 53 -
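Read from left to right, a rule u | v → w says how two adjacent words are glued in the generation direction. The sketch below, in standard OCaml syntax, applies one such rule to a pair of words written as plain strings. It is a simplification made for illustration: the left context [x] is ignored, and `glue`, `has_prefix` and `has_suffix` are hypothetical names. The rule used in the example, d | "s → cch, is the one that appears in the segmentation example later in these slides.

```ocaml
(* A rule (u, v, w) stands for u | v -> w; the context [x] is ignored here. *)
type rule = string * string * string

let has_prefix p s =
  String.length p <= String.length s
  && String.sub s 0 (String.length p) = p

let has_suffix p s =
  let lp = String.length p and ls = String.length s in
  lp <= ls && String.sub s (ls - lp) lp = p

(* glue rule left right: if left ends with u and right starts with v,
   replace the juncture u|v by w; otherwise the rule does not apply. *)
let glue ((u, v, w) : rule) (left : string) (right : string) : string option =
  if has_suffix u left && has_prefix v right then
    Some
      (String.sub left 0 (String.length left - String.length u)
       ^ w
       ^ String.sub right (String.length v)
           (String.length right - String.length v))
  else None

let example = glue ("d", "\"s", "cch") "tad" "\"srutvaa"
```

Segmentation is the inverse problem: given the glued string, find the words and the rules that could have produced it.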
[Figure: diagram with labels u, v, w, x]

- 54 -

[Figure: diagram with labels z, u, v, w, u, v, x]

- 55 -
Auto

    type lexicon = trie
    and rule = (word * word * word);

The rule triple (rev u, v, w) represents the string rewrite u|v → w.

Now for the transducer state space:

    type auto = [ State of (bool * deter * choices) ]
    and deter = list (letter * auto)
    and choices = list rule;

    module Auto = Share (struct type domain = auto;
                                value size = hash_max; end);

- 56 -
Compiling the lexicon to a minimal transducer

    (* build_auto : word -> lexicon -> (auto * stack * int) *)
    value rec build_auto occ = fun
      [ Trie(b,arcs) ->
        let local_stack = if b then get_sandhi occ else []
        in let f (deter,stack,span) (n,t) =
             let current = [n :: occ] (* current occurrence *)
             in let (auto,st,k) = build_auto current t
             in ([(n,auto) :: deter], merge st stack, hash1 n k span)
        in let (deter,stack,span) = fold_left f ([],[],hash0) arcs
        in let (h,l) = match stack with
             [ [] -> ([],[]) | [h :: l] -> (h,l) ]
        in let key = hash b span h
        in let s = Auto.share (State(b,deter,h)) key
        in (s, merge local_stack l, key)
      ];

- 57 -
Segmenting Transducer Data Structures

    type transition =
      [ Euphony of rule (* (rev u,v,w) st u|v -> w *)
      | Id              (* identity or no sandhi *)
      ]
    and output = list (word * transition);

    type backtrack =
      [ Next of (input * output * word * choices)
      | Init of (input * output)
      ]
    and resumption = list backtrack; (* coroutine resumptions *)

    exception Finished;

- 58 -
Running the Segmenting Transducer

    value rec react input output back occ = fun
      [ State(b,det,choices) ->
        (* we try the deterministic space first *)
        let deter cont = match input with
          [ [] -> backtrack cont
          | [letter :: rest] ->
              try let next_state = List.assoc letter det
                  in react rest output cont [letter :: occ] next_state
              with [ Not_found -> backtrack cont ]
          ]
        in let nondets = if choices=[] then back
                         else [Next(input,output,occ,choices) :: back]
        in if b then
             let out = [(occ,Id) :: output] (* opt final sandhi *)

- 59 -

             in if input=[] then (out,nondets) (* solution *)
                else let alterns = [Init(input,out) :: nondets]
                     (* we first try the longest matching word *)
                     in deter alterns
           else deter nondets
      ]
    and choose input output back occ = fun
      [ [] -> backtrack back
      | [((u,v,w) as rule) :: others] ->
          let alterns = [Next(input,output,occ,others) :: back]
          in if prefix w input then
               let tape = advance (length w) input
               and out = [(u @ occ, Euphony(rule)) :: output]
               in if v=[] (* final sandhi *) then
                    if tape=[] then (out,alterns) else backtrack alterns

- 60 -

                  else let next_state = access v
                       in react tape out alterns v next_state
             else backtrack alterns
      ]
    and backtrack = fun
      [ [] -> raise Finished
      | [resume :: back] -> match resume with
          [ Next(input,output,occ,choices) ->
              choose input output back occ choices
          | Init(input,output) -> react input output back [] automaton
          ]
      ];

- 61 -
Example of Sanskrit Segmentation

    process "tacchrutvaa";

    Chunk: tacchrutvaa
    may be segmented as:
    Solution 1 :
    [ tad with sandhi d|"s -> cch]
    [ "srutvaa with no sandhi]

- 62 -
More examples

    process "o.mnama.h\"sivaaya";

    Solution 1 :
    [ om with sandhi m|n -> .mn]
    [ namas with sandhi s|"s -> .h"s]
    [ "sivaaya with no sandhi]

    process "sugandhi.mpu.s.tivardhanam";

    Solution 1 :
    [ sugandhim with sandhi m|p -> .mp]
    [ pu.s.ti with no sandhi]
    [ vardhanam with no sandhi]

- 63 -
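To make the input/output behaviour of sandhi segmentation tangible without the compiled automaton, here is a deliberately naive sketch in standard OCaml syntax. It is not the transducer of the slides: it scans a word list and a rule list by brute force, ignores rule contexts, and works on plain strings. The names `segment`, `has_prefix`, `has_suffix` and `drop` are hypothetical, and the two guards on progress are simplifications added so that the search always terminates.

```ocaml
type rule = string * string * string        (* (u, v, w) for u | v -> w *)
type transition = Id | Euphony of rule

let has_prefix p s =
  String.length p <= String.length s
  && String.sub s 0 (String.length p) = p

let has_suffix p s =
  let lp = String.length p and ls = String.length s in
  lp <= ls && String.sub s (ls - lp) lp = p

let drop n s = String.sub s n (String.length s - n)

(* All ways of reading input as lexicon words joined by plain
   concatenation (Id) or by one of the rules (Euphony). *)
let rec segment lexicon rules input : (string * transition) list list =
  if input = "" then [ [] ]
  else
    List.concat
      (List.map
         (fun word ->
           (* No sandhi: the word appears unchanged at the front. *)
           let plain =
             if word <> "" && has_prefix word input then
               List.map (fun sol -> (word, Id) :: sol)
                 (segment lexicon rules (drop (String.length word) input))
             else []
           in
           (* Sandhi: the word appears with its final u replaced by w;
              the next word must then start with v, which we restore. *)
           let euphonic =
             List.concat
               (List.map
                  (fun ((u, v, w) as rule) ->
                    if has_suffix u word then
                      let stem =
                        String.sub word 0 (String.length word - String.length u) in
                      let surface = stem ^ w in
                      if String.length surface > String.length v
                         && has_prefix surface input
                      then
                        let rest = drop (String.length surface) input in
                        if v = "" && rest <> "" then []  (* final sandhi only *)
                        else
                          List.map (fun sol -> (word, Euphony rule) :: sol)
                            (segment lexicon rules (v ^ rest))
                      else []
                    else [])
                  rules)
           in
           plain @ euphonic)
         lexicon)

let r1 = ("m", "n", ".mn")
let r2 = ("s", "\"s", ".h\"s")
let solutions =
  segment [ "om"; "namas"; "\"sivaaya" ] [ r1; r2 ] "o.mnama.h\"sivaaya"
```

On this three-word lexicon the search finds the single reading shown on the slide: om with rule m|n, namas with rule s|"s, then "sivaaya with no sandhi. The compiled automaton of the slides reaches the same kind of answer while reading the input once per path, instead of rescanning the lexicon at every position.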
Sanskrit Tagging

    process "sugandhi.mpu.s.tivardhanam";

    Solution 1 :
    [ sugandhim
      < acc. sg. m. [sugandhi] > with sandhi m|p -> .mp]
    [ pu.s.ti
      < iic. [pu.s.ti] > with no sandhi]
    [ vardhanam
      < acc. sg. m. | acc. sg. n. | nom. sg. n. | voc. sg. n. [vardhana] > with no sandhi]

- 64 -
Statistics

The complete automaton construction from the flexed forms lexicon takes only 9s on a 864 MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746 KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6 MB!

The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.

- 65 -
Overgeneration Problems

Very short particles have to be treated differently, otherwise there would be intolerable overgeneration. Prosody will probably have to come to the rescue. The case of Vedic "u".

Compounds. The bahuvrīhi problem.

Intrinsic overgeneration. a+a = a+ā = ā+a = ā+ā = ā. Most s.m. end with a, many s.f. end with ā, the preverb ā (towards) is frequent, the prefix a is common (negation). So there is often room for interpretation! E.g.

    na asato vidyate bhāvo na abhāvo vidyate sataḥ.
vs
    na asato vidyate abhāvo na abhāvo vidyate sataḥ.

Double entendre poetry.

- 66 -
Soundness and Completeness of the Algorithms

Theorem. If the lexical system (L,R) is strict and weakly non-overlapping, s is an (L,R)-sentence iff the algorithm (segment_all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L,R)-sentence.

Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.

Cf. http://pauillac.inria.fr/~huet/FREE/tagger.ps

- 67 -
Where is the information?

Mel'čuk says "Everything is in the lexicon".

The key concept is lexicon directed. So most of the information is indeed in the lexicon. But a lot of phonological information (sandhi rules) and grammatical knowledge is in the code.

If time permits. A tour of the dictionary structures.

- 68 -
Enjoy!

• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/
• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/FREE/tagger.ps
• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps
• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps
• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar
• Objective Caml: http://caml.inria.fr/ocaml/

- 69 -
What next (on the Sanskrit front)

• Sanskrit 1: Verb morphology, Corpus testing, Lexicon acquisition mode, Segmentation training, Philology assistant (Scharf, Smith)
• Sanskrit 2: Sentinels, Prosody, Valency checking, Dependency synthesis
• Sanskrit 3: Discourse analysis: Reference, Scope, Theme, Focus, Anaphora resolution, Extra-linguistic information
• Sanskrit ∞: Distributed development of multilingual tools, Saving the Pune dictionary project

- 70 -
What next (on the Zen front)

• Zen maintenance: Distribution, Hotline, Users' club, Coordination of extensions
• Zen immediate extensions: Grafting of regular relations, Rules compiler
• Towards a more comprehensive generic platform for computational linguistics, accommodating the levels of Syntax, Semantics, and Discourse Information Dynamics

- 71 -