Hindi-Urdu Treebank
![Page 1: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/1.jpg)
The Hindi-Urdu Treebank
Lecture 7: 7/29/2011
![Page 2: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/2.jpg)
Multi-representational, Multi-layered treebank
• Traditional approach:
– Syntactic treebank: PS or DS, but not both
– Layers are added one-by-one
• Our approach:
– Syntactic treebank: both DS and PS
– DS, PS, and PB are developed at the same time
– Automatic conversion from DS+PB to PS
• Why?
– DS and PS are both useful
– Annotating them together allows us to maintain “consistency” and reduce annotation time
![Page 3: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/3.jpg)
The team
• DS team: IIIT
• PB team: University of Colorado at Boulder
• PS team: UMass, Columbia University
• Conversion: University of Washington
• Biweekly conference calls
• Group meetings every six months
![Page 4: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/4.jpg)
Outline
• Overview of the treebank
• Three Representations
– Dependency
– Proposition Bank
– Phrase Structure
• Conversion
![Page 5: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/5.jpg)
Dependency structure (DS) and Phrase Structure (PS)
• DS: all nodes are labeled with words or empty strings
• PS: leaf nodes are labeled with words or empty strings, internal nodes are labeled with non-terminal symbols (special alphabet)
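The contrast between the two representations can be sketched as two small data structures. This is only an illustrative sketch: the class names `DSNode` and `PSNode`, the arc labels `subj` and `obj`, and the non-terminals `S`, `NP`, `VP`, `V` are assumptions made for the example, not the treebank's own inventory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DSNode:
    """Dependency node: every node carries a word (or an empty string)."""
    word: str
    relation: Optional[str] = None  # label on the arc from the head; None for the root
    dependents: List["DSNode"] = field(default_factory=list)


@dataclass
class PSNode:
    """Phrase-structure node: leaves carry words, internal nodes carry non-terminals."""
    label: str
    children: List["PSNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def ds_words(node: DSNode) -> List[str]:
    """Every word in the dependency tree, head first, then its dependents."""
    words = [node.word]
    for dep in node.dependents:
        words.extend(ds_words(dep))
    return words


def ps_leaves(node: PSNode) -> List[str]:
    """The words at the leaves of the phrase-structure tree, left to right."""
    if node.is_leaf():
        return [node.label]
    return [w for child in node.children for w in ps_leaves(child)]


# Toy analysis of "Atif book read"; all labels here are hypothetical.
ds = DSNode("read", dependents=[DSNode("Atif", "subj"), DSNode("book", "obj")])
ps = PSNode("S", [
    PSNode("NP", [PSNode("Atif")]),
    PSNode("VP", [PSNode("NP", [PSNode("book")]), PSNode("V", [PSNode("read")])]),
])
```

In the DS tree the verb itself is the root node, while in the PS tree the words appear only at the leaves and the internal nodes are category symbols.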
![Page 6: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/6.jpg)
Information in PS and DS
| Information | PS (e.g., PTB) | DS (some target DS) |
|---|---|---|
| POS tag | yes | yes |
| Function tag (e.g., -SBJ) | yes | yes |
| Syntactic tag | yes | no |
| Empty category and co-indexation | Often yes | Often no |
| Allowing crossing | Often no | Often yes |
![Page 7: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/7.jpg)
Motivation 1: Two Representations
• Both phrase-structure treebanks and dependency treebanks are used in NLP
– Collins/Charniak/Bikel parser for PS
– CoNLL task on dependency parsing
• Problem: currently few treebanks (no?) with PS and DS which are independently motivated
Our project: build treebank for Hindi/Urdu for which PS and DS are linguistically motivated from the outset
– Dependency: Paninian grammar (Panini 400 BC)
– Phrase structure: variant of Minimalism (Chomsky 1995)
![Page 8: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/8.jpg)
Motivation 2: Two Content Levels
• Everyone (?) wants syntax
• Recent popularity of PropBank (Palmer et al 2002): lexical predicate-argument structure; “semantics as surfacy as it gets”
• Recent experience: PropBank may inform some treebanking decisions
Our project: build treebank with all levels from the outset
![Page 9: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/9.jpg)
Goals
• Hindi/Urdu Treebank:
– DS, PB, and PS for
• 400K-word Hindi
• 150K-word Urdu
– Unified annotation guidelines
– Frame files for PropBank
• Better understanding of DS=>PS conversion
![Page 10: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/10.jpg)
Outline
• Overview of the project
• Three Representations
– Dependency
– Proposition Bank
– Phrase Structure
• Conversion
![Page 11: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/11.jpg)
Hindi Paninian Framework (Dipti Sharma, Hyderabad)
There are 6 main karakas (karaka relations):
• karta (k1): Activity of the verb resides in karta.
• karma (k2): Result of the verb resides in karma.
• karana (k3): Instrument helping in achieving the activity of the verb is karana.
• sampradaan (k4): Receiver of the action is sampradaan.
• apaadan (k5): Point of separation from which an entity has moved away in an action is apaadan.
• adhikaran (k7): Place (k7p) or time (k7t) where the action is located.
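The six main karakas and their labels can be kept in a small lookup table, for example when checking arc labels in an annotated file. This is a minimal sketch: the names `KARAKA_LABELS` and `karaka_name` are hypothetical, and the table covers only the labels listed on this slide, not the full set of relations.

```python
# Labels and names as listed on the slide; k7p and k7t are the place and time
# variants of adhikaran.
KARAKA_LABELS = {
    "k1": "karta",
    "k2": "karma",
    "k3": "karana",
    "k4": "sampradaan",
    "k5": "apaadan",
    "k7": "adhikaran",
    "k7p": "adhikaran (place)",
    "k7t": "adhikaran (time)",
}


def karaka_name(label: str) -> str:
    """Return the karaka name for a label, or 'unknown' if it is not in the table."""
    return KARAKA_LABELS.get(label, "unknown")
```

Any label outside this table, such as one drawn from the full relation set, is reported as `unknown` rather than guessed.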
![Page 12: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/12.jpg)
Full Set of Relations
![Page 13: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/13.jpg)
Sample Paninian Analysis
![Page 14: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/14.jpg)
Basic Clause Structure
अतिफ ने किताब को पढ़ा
Atif ne kitaab ko paRhaa
Atif Erg book Acc read.Pfv
Atif read the book
![Page 15: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/15.jpg)
Basic Clause Structure: DS
पढ़ा (read)
– k1: अतिफ-ने (Atif)
– k2: किताब-को (book)
![Page 16: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/16.jpg)
Outline
• Overview of the project
• Three Representations
– Dependency
– Proposition Bank
– Phrase Structure
• Conversion
![Page 17: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/17.jpg)
PropBank: Lexical Semantic Annotation
• Dependency annotation on top of DS
– PropBank is a dependency representation, but the arc labels are different from DS
• Captures diathesis alternations:
– John loaded the cart with hay.
– John loaded hay on the cart.
hay has the same relation to the predicate load in all these sentences
• PropBank annotates verb-meaning-specific verbal roles
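The point about diathesis alternations can be made concrete with a small comparison of the two "load" sentences. This is an illustrative sketch only: the syntactic position names and the specific Arg numbers below are assumptions chosen for the example, not quoted from a PropBank frame file.

```python
# Each sentence records, per argument, its surface position ("syntax") and
# its PropBank-style role ("propbank"). Labels are hypothetical.
sent_a = {
    "text": "John loaded the cart with hay.",
    "syntax": {"John": "subject", "cart": "object", "hay": "with-PP"},
    "propbank": {"John": "Arg0", "cart": "Arg1", "hay": "Arg2"},
}
sent_b = {
    "text": "John loaded hay on the cart.",
    "syntax": {"John": "subject", "hay": "object", "cart": "on-PP"},
    "propbank": {"John": "Arg0", "cart": "Arg1", "hay": "Arg2"},
}


def stable_words(a: dict, b: dict, layer: str) -> set:
    """Words whose label on the given layer is identical in both sentences."""
    return {w for w in a[layer] if a[layer][w] == b[layer].get(w)}


surface_stable = stable_words(sent_a, sent_b, "syntax")
role_stable = stable_words(sent_a, sent_b, "propbank")
```

Under these assumed labels only `John` keeps its surface position across the alternation, while all three arguments keep their role, which is the property the slide attributes to PropBank.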
![Page 18: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/18.jpg)
Basic Clause Structure: PropBank
पढ़ा (read), Roleset: पढ़ना.01
– Arg0: अतिफ-ने (Atif)
– Arg1: किताब-को (book)
Roleset पढ़ना.01:
– Arg0: reader
– Arg1: what is read
![Page 19: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/19.jpg)
Phrase Structure
• Inspired by Chomskyan Principles-and-Parameters approach
• (Mostly) binary branching
• Small number of non-terminals
• Key structural assumptions:
– Only two marked argument positions for verbs; all other NPs are adjuncts and can appear anywhere
– Use of traces for displacement from normal position
– Case assigned under c-command
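Since case assignment is stated in terms of c-command, a short sketch of that relation may help. It uses the common textbook definition (A c-commands B if neither dominates the other and the first branching node above A dominates B), which is an assumption here since the slides do not spell out the definition; the `Node` class and the labels `S`, `VP`, `NP-subj`, `NP-obj`, `V` are likewise hypothetical.

```python
from typing import List, Optional


class Node:
    """Tree node with a parent pointer, set when the node is attached as a child."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None):
        self.label = label
        self.children = children or []
        self.parent: Optional["Node"] = None
        for child in self.children:
            child.parent = self


def dominates(a: Node, b: Node) -> bool:
    """True if a is a proper ancestor of b."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False


def c_commands(a: Node, b: Node) -> bool:
    """True if a c-commands b under the first-branching-node definition."""
    if a is b or dominates(a, b) or dominates(b, a):
        return False
    node = a.parent
    # Skip non-branching ancestors to find the first branching node above a.
    while node is not None and len(node.children) < 2:
        node = node.parent
    return node is not None and dominates(node, b)


# Toy binary-branching clause: [S NP-subj [VP NP-obj V]]
subj, obj, verb = Node("NP-subj"), Node("NP-obj"), Node("V")
vp = Node("VP", [obj, verb])
s = Node("S", [subj, vp])
```

In this toy tree the subject c-commands the object, but the object does not c-command the subject, because the first branching node above the object is `VP`.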
![Page 20: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/20.jpg)
Basic Clause Structure: Phrase Structure
[Phrase-structure tree; leaf glosses: (Atif), (book), (read)]
![Page 21: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/21.jpg)
Unaccusatives
दरवाजा खुल गया
darwaaza khul gayaa
door open go.Pfv.MSg
The door opened.
![Page 22: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/22.jpg)
Unaccusative: Dependency Structure
खुल गया (open go)
– k1: दरवाजा (door)
![Page 23: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/23.jpg)
Unaccusative: PropBank
खुल गया (open go)
– Arg1: दरवाजा (door)
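Comparing the two analyses shown so far makes the difference between the DS and PropBank labels visible: the same DS label can pair with different PropBank arguments. The sketch below records only the label pairs that appear on the slides for the transitive and unaccusative examples; the dictionary layout and the function name `pb_labels_for` are assumptions for illustration.

```python
# DS and PropBank labels per argument, as shown on the slides for two sentences.
examples = {
    "Atif ne kitaab ko paRhaa": {
        "Atif": {"ds": "k1", "pb": "Arg0"},
        "kitaab": {"ds": "k2", "pb": "Arg1"},
    },
    "darwaaza khul gayaa": {
        "darwaaza": {"ds": "k1", "pb": "Arg1"},
    },
}


def pb_labels_for(ds_label: str) -> set:
    """Collect every PropBank label that co-occurs with the given DS label."""
    return {
        arg["pb"]
        for sentence in examples.values()
        for arg in sentence.values()
        if arg["ds"] == ds_label
    }
```

Here `k1` pairs with `Arg0` for the transitive verb but with `Arg1` for the unaccusative, so the PropBank layer cannot be derived from the DS label alone.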
![Page 24: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/24.jpg)
Unaccusative: Phrase Structure
[Phrase-structure tree; leaf glosses: (door), (open), (go)]
![Page 25: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/25.jpg)
Support Verb Constructions
गहने चोरी हो गये
geheneN chorii ho gaye
jewels (m) theft do go.Pfv.MPl
The jewels got stolen
![Page 26: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/26.jpg)
Support Verb Constructions: Dependency Structure
हो गये (do go)
– k2: गहने (jewels)
– pof: चोरी (theft)
![Page 27: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/27.jpg)
Support Verb Constructions: PropBank
हो.sv (do)
– Arg0: agent of true predicate
– Arg1: true predicate
– Arg2: patient of true predicate
[Diagram node glosses: (jewels), (theft)]
![Page 28: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/28.jpg)
Support Verb Constructions: Phrase Structure
[Phrase-structure tree; leaf glosses: (jewels), (theft), (do go)]
![Page 29: Hindi-Urdu Treebank](https://reader036.fdocuments.net/reader036/viewer/2022082213/586cb2401a28abaa238b617a/html5/thumbnails/29.jpg)
Where we are now
• Guidelines:
– DS and PS guidelines are complete and checked
– PropBank guidelines under development
• Annotation:
– Finished 353K-word Hindi and 60K-word Urdu
• Automatic conversion from DS + PropBank to PS is in progress.
• Close cooperation in developing the three components is essential