WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Lothar Lemnitzer. Project review, Utrecht, 1 Feb 2007


Page 1: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Lothar Lemnitzer

Project review, Utrecht, 1 Feb 2007

Page 2: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Our Background

• Experience in corpus annotation and information extraction from texts

• Experience in grammar development

• Experience in statistical modelling

• Experience in eLearning

Page 3: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

WP2 Dependencies

• WP 1: collection and preparation of LOs

• WP 3: WP 2 results are input to this WP

• WP 4: Integration of tools

• WP 5: Evaluation and validation

Page 4: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

M12 Deliverables

Language Resources:

• > 200 000 running words per language

• With structural and linguistic annotation

• > 1000 manually selected keywords

• > 300 manually selected definitions

• Local grammars for definitory contexts

Page 5: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

M12 Deliverables

Documentation:

• Guidelines linguistic annotation

• Guidelines keyword annotation

• Guidelines annotation of definitions

• (Guidelines evaluation)

Page 6: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

M12 Deliverables

Tools:

• Prototype Keyword Extractor

• Prototype Glossary Candidate Detector

Page 7: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

[Architecture diagram. Components shown: user documents (PDF, DOC, HTML, SCORM, XML); CONVERTOR 1 (documents in SCORM and pseudo-structured form to Basic XML); LING. PROCESSOR (lemmatizer, POS, partial parser) producing linguistically annotated XML; CONVERTOR 2 (documents in HTML); metadata (keywords); glossary; ontology; lexicons for CZ, PT, RO, PL, GE, MT, BG, DT and EN; CROSSLINGUAL RETRIEVAL; LMS user profile; REPOSITORY.]

Page 8: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Linguistically annotated learning objects

• Structural annotation: par, s, chunk, tok

• Linguistic annotation: base, ctag, msd attributes

Example 1

• Specific annotation: marked term, defining text
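The annotation layers listed above can be read back with a standard XML parser. The following is a minimal sketch, assuming the element names (par, s, chunk, tok) and attribute names (base, ctag, msd) from the slide; the sample fragment and its tag values are invented for illustration and are not taken from the project corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element and attribute names follow the slide,
# the ctag/msd values are invented for illustration only.
SAMPLE = """
<par>
  <s>
    <chunk>
      <tok base="keyword" ctag="N" msd="pl">Keywords</tok>
    </chunk>
    <tok base="be" ctag="V" msd="pres">are</tok>
    <tok base="useful" ctag="A" msd="pos">useful</tok>
  </s>
</par>
"""


def read_tokens(xml_text):
    """Return (surface form, base, ctag, msd) for every tok element."""
    root = ET.fromstring(xml_text)
    return [
        (tok.text, tok.get("base"), tok.get("ctag"), tok.get("msd"))
        for tok in root.iter("tok")
    ]


tokens = read_tokens(SAMPLE)
for surface, base, ctag, msd in tokens:
    print(surface, base, ctag, msd)
```

Because `iter("tok")` walks the whole tree, tokens are found whether or not they are wrapped in a chunk.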

Page 9: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Part of the DTD

<!ELEMENT markedTerm (chunk | tok | markedTerm)+ >
<!ATTLIST markedTerm
    %a.ana;
    kw       (y|n) "n"
    dt       (y|n) "n"
    status   CDATA #IMPLIED
    comment  CDATA #IMPLIED >

<!ELEMENT definingText (chunk | tok | markedTerm)+ >
<!ATTLIST definingText
    id       ID    #IMPLIED
    xml:lang CDATA #IMPLIED
    lang     CDATA #IMPLIED
    rend     CDATA #IMPLIED
    type     CDATA #IMPLIED
    wsd      CDATA #IMPLIED
    def      IDREF #IMPLIED
    continue CDATA #IMPLIED
    part     CDATA #IMPLIED
    status   CDATA #IMPLIED
    comment  CDATA #IMPLIED >

Page 10: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Linguistically annotated learning objects

Use:

• Linguistically annotated texts are input to the extraction tools

• Marked terms and defining texts are used as training material and / or as gold standard for the evaluation

Page 11: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Characteristics of keywords

1. Good keywords have a typical, non-random distribution in and across LOs

2. Keywords tend to appear more often at certain places in texts (headings etc.)

3. Keywords are often highlighted / emphasised by authors

Page 12: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Distributional features of keywords

We are using the following metrics to measure keywordiness by distribution.

To model the inter-text distribution of keywords:

• Term frequency / inverse document frequency (tf*idf)

• Residual inverse document frequency (RIDF)

• An adjusted version of RIDF (adjustment by term frequency)

To model the intra-text distribution of keywords:

• Term burstiness
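A minimal sketch of these metrics, under stated assumptions: IDF uses a base-2 logarithm, RIDF is the observed IDF minus the IDF expected under a Poisson model, and burstiness is taken as the mean frequency of a term in the documents that contain it. The slide does not spell out the project's exact formulas, and the adjusted RIDF variant is not reproduced here because its adjustment is not specified. Function names are hypothetical.

```python
import math
from collections import Counter


def distribution_scores(docs):
    """docs: list of token lists. Returns {term: {"idf", "ridf", "burstiness"}}."""
    n_docs = len(docs)
    cf = Counter()  # collection frequency: total occurrences of a term
    df = Counter()  # document frequency: number of documents containing it
    for doc in docs:
        cf.update(doc)
        df.update(set(doc))
    scores = {}
    for term in cf:
        idf = -math.log2(df[term] / n_docs)
        # IDF expected if the occurrences were Poisson-distributed over documents
        expected_idf = -math.log2(1 - math.exp(-cf[term] / n_docs))
        scores[term] = {
            "idf": idf,
            "ridf": idf - expected_idf,
            # assumption: burstiness = mean frequency in documents with the term
            "burstiness": cf[term] / df[term],
        }
    return scores


def tf_idf(term, doc, scores):
    """tf*idf of a term in one document (raw term frequency times IDF)."""
    return doc.count(term) * scores[term]["idf"]


docs = [["a", "a", "a", "b"], ["b", "c"], ["b"], ["c"]]
scores = distribution_scores(docs)
print(scores["a"], scores["b"])
```

In this toy collection "a" and "b" both occur three times, but "a" is concentrated in one document and therefore gets the higher RIDF and burstiness.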

Page 13: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Structural and layout features of keywords

We will use:

• Knowledge of text structure used to identify salient regions (e.g., headings)

• Layout features of texts used to identify emphasised words (Example 2)

We will weigh words with such features higher
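One simple way to weigh such words higher is a multiplicative boost on the distributional score. This is only a sketch: the function name and the weight values are hypothetical, and the slides do not state how the project combines the features.

```python
def boost_scores(base_scores, heading_terms, emphasised_terms,
                 heading_weight=2.0, emphasis_weight=1.5):
    """Multiply the score of terms seen in headings or emphasised text.

    The weights are illustrative defaults, not values from the project.
    Assumes non-negative base scores, so that a boost never lowers a rank.
    """
    boosted = {}
    for term, score in base_scores.items():
        factor = 1.0
        if term in heading_terms:
            factor *= heading_weight
        if term in emphasised_terms:
            factor *= emphasis_weight
        boosted[term] = score * factor
    return boosted


boosted = boost_scores(
    {"ontology": 1.0, "page": 1.0, "glossary": 2.0},
    heading_terms={"ontology", "glossary"},
    emphasised_terms={"glossary"},
)
print(boosted)
```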

Page 14: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Complex keywords

• Complex, multi-word keywords are relevant, but their share differs between languages

• The keyword extractor allows the extraction of n-grams of any length

• Evaluation showed that including bigrams or even trigrams improves the results; with larger n-grams the performance begins to drop

• Maximum keyword length can be specified as a parameter
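Candidate generation with a configurable maximum length can be sketched as follows. The function and parameter names are hypothetical; the slides only state that the maximum keyword length is a parameter of the extractor.

```python
def ngrams(tokens, max_len=3):
    """Return all n-grams of length 1..max_len as tuples, shortest first."""
    out = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            out.append(tuple(tokens[i:i + n]))
    return out


candidates = ngrams(["style", "of", "learning"], max_len=2)
print(candidates)
```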

Page 15: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Complex keywords

Language   Single-word keywords   Multi-word keywords
German     91 %                   9 %
Polish     35 %                   65 %

Page 16: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Language settings for the keyword extractor

• Selection of single keywords is restricted to a few ctag categories and / or msd values (nouns, proper nouns, unknown words and some verbs for most languages)

• Multiword patterns are restricted with respect to the position of function words ("style of learning" is acceptable; "of learning behaviours" is not)
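These two restrictions can be sketched as a filter over tagged candidates. The tag names below are hypothetical placeholders, since the actual ctag inventories are language-specific, and the rule that function words may not stand at either edge is an assumption that is consistent with the two examples on the slide.

```python
# Hypothetical tag sets; the real ctag values differ per language.
ALLOWED_SINGLE = {"N", "NP", "UNK", "V"}   # nouns, proper nouns, unknown words, verbs
FUNCTION_TAGS = {"PREP", "ART", "CONJ"}    # function words


def is_candidate(tagged):
    """tagged: list of (word, ctag) pairs forming one candidate keyword."""
    if len(tagged) == 1:
        return tagged[0][1] in ALLOWED_SINGLE
    # Assumption: function words are allowed inside a pattern, not at its edges.
    first_tag, last_tag = tagged[0][1], tagged[-1][1]
    return first_tag not in FUNCTION_TAGS and last_tag not in FUNCTION_TAGS


accepted = is_candidate([("style", "N"), ("of", "PREP"), ("learning", "N")])
rejected = is_candidate([("of", "PREP"), ("learning", "N"), ("behaviours", "N")])
print(accepted, rejected)
```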

Page 17: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Output of Keyword Extractor

List, ordered by “keywordiness” value, with the elements

• Normalized form of keyword

• (Statistical figures)

• List of attested forms of the keyword (Example 3)
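A minimal sketch of how such an output list could be assembled, assuming each occurrence is available as a pair of attested form and normalized form and that a keywordiness score per normalized form has already been computed. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def build_keyword_list(occurrences, scores):
    """occurrences: iterable of (attested form, normalized form) pairs.

    Returns (normalized form, score, sorted attested forms) tuples,
    ordered by descending keywordiness score.
    """
    forms = defaultdict(set)
    for surface, normalized in occurrences:
        forms[normalized].add(surface)
    ranked = sorted(forms, key=lambda k: scores.get(k, 0.0), reverse=True)
    return [(k, scores.get(k, 0.0), sorted(forms[k])) for k in ranked]


keyword_list = build_keyword_list(
    [("Keywords", "keyword"), ("keyword", "keyword"), ("texts", "text")],
    {"keyword": 2.0, "text": 0.5},
)
print(keyword_list)
```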

Page 18: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Evaluation strategy

We will proceed in three steps:

1. Manually assigned keywords will be used to measure precision and recall of the keyword extractor

2. Human annotators will judge results from the extractor and rate them

3. The same document(s) will be annotated by many test persons in order to estimate inter-annotator agreement on this task
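Step 1 amounts to a set comparison between extracted and manually assigned keywords. The sketch below uses the standard definitions of precision and recall; the function name is hypothetical and exact string matching between the two lists is an assumption.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted keywords against manual keywords."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r)
```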

Page 19: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Summary

With the keyword extractor,

1. We are using several known statistical metrics in combination with qualitative and linguistic features

2. We place special emphasis on multiword keywords

3. We evaluate the impact of these features on the performance of these tools for eight languages

4. We integrate this tool into an eLearning application

5. We have a prototype user interface to this tool

Page 20: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Identification of definitory contexts

• Empirical approach based on linguistic annotation of LO

• Workflow

– Definitory contexts are searched and marked in LOs

– Recurrent patterns are characterized quantitatively and qualitatively (Example 4)

– Local grammars are drafted on the basis of these recurrent patterns

– Extraction of definitory contexts is performed by lxtransduce (University of Edinburgh - LTG)

Page 21: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Characteristics of local grammars

• Grammar rules match and wrap subtrees of the XML document tree

• One grammar rule refers to subrules which match substructures

• Rules can refer to lexical lists to constrain categories further

• The defined term should be identified and marked

Page 22: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

Example

Page 23: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

<rule name="simple_NP">
  <seq>
    <and>
      <ref name="art"/>
      <ref name="cap"/>
    </and>
    <ref name="adj" mult="*"/>
    <ref name="noun" mult="+"/>
  </seq>
</rule>

Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

Page 24: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

<query match="tok[@ctag='V' and @base='zijn' and @msd[starts-with(.,'hulpofkopp')]]"/>

Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

Page 25: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

<rule name="noun_phrase">
  <seq>
    <ref name="art" mult="?"/>
    <ref name="adj" mult="*"/>
    <ref name="noun" mult="+"/>
  </seq>
</rule>

Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

Page 26: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

<rule name="is_are_def">
  <seq>
    <ref name="simple_NP"/>
    <query match="tok[@ctag='V' and @base='zijn' and @msd[starts-with(.,'hulpofkopp')]]"/>
    <ref name="noun_phrase"/>
    <ref name="tok_or_chunk" mult="*"/>
  </seq>
</rule>

Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.
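The logic of the is_are_def rule can be traced in plain Python on the example sentence. This is a simplified, greedy re-implementation for illustration only, not lxtransduce itself: the ctag values (Art, Adj, N, ...) and the bare msd value "hulpofkopp" are invented stand-ins, and the function names are hypothetical.

```python
def tok(word, ctag, base=None, msd=""):
    """Build one annotated token; tag values are hypothetical stand-ins."""
    return {"word": word, "ctag": ctag, "base": base or word.lower(), "msd": msd}


def take(tokens, i, ctag, minimum=0, maximum=None):
    """Greedily consume tokens with this ctag from i; return new index or None."""
    j = i
    while (j < len(tokens) and tokens[j]["ctag"] == ctag
           and (maximum is None or j - i < maximum)):
        j += 1
    return j if j - i >= minimum else None


def simple_np(tokens, i):
    """Capitalised article, then adj*, then noun+ (mirrors rule simple_NP)."""
    if i >= len(tokens) or tokens[i]["ctag"] != "Art":
        return None
    if not tokens[i]["word"][0].isupper():
        return None
    j = take(tokens, i + 1, "Adj")
    return take(tokens, j, "N", minimum=1)


def noun_phrase(tokens, i):
    """art?, adj*, noun+ (mirrors rule noun_phrase)."""
    j = take(tokens, i, "Art", maximum=1)
    j = take(tokens, j, "Adj")
    return take(tokens, j, "N", minimum=1)


def is_copula(t):
    """Mirrors the query on ctag, base and the msd prefix."""
    return t["ctag"] == "V" and t["base"] == "zijn" and t["msd"].startswith("hulpofkopp")


def is_are_def(tokens):
    """Return (defined term, defining text) if the pattern matches, else None."""
    j = simple_np(tokens, 0)
    if j is None or j >= len(tokens) or not is_copula(tokens[j]):
        return None
    if noun_phrase(tokens, j + 1) is None:
        return None
    words = [t["word"] for t in tokens]
    return " ".join(words[:j]), " ".join(words[j + 1:])


SENTENCE = [
    tok("Een", "Art"), tok("vette", "Adj", "vet"), tok("letter", "N"),
    tok("is", "V", "zijn", "hulpofkopp"),
    tok("een", "Art"), tok("letter", "N"), tok("die", "Pron"),
    tok("zwarter", "Adj", "zwart"), tok("wordt", "V", "worden"),
    tok("afgedrukt", "V", "afdrukken"), tok("dan", "Conj"),
    tok("de", "Art"), tok("andere", "Adj", "ander"), tok("letters", "N", "letter"),
]

result = is_are_def(SENTENCE)
print(result)
```

The first element of the result corresponds to the defined term that the grammar should mark, and the second to the defining text that follows the copula.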

Page 27: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Output of Glossary candidate detector

• Ordered List of words

• Defined Term marked

• (Larger context – one preceding, one following sentence) (Example 10)

Page 28: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Evaluation Strategy

We will proceed in two steps:

1. Manually marked definitory contexts will be used to measure precision and recall of the glossary candidate detector

2. Human annotators will judge results from the glossary candidate detector and rate their quality / completeness

Page 29: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Results

             Precision   Recall
Own LOs      21.5 %      34.9 %
Verbs only   34.1 %      29.0 %
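To compare the two rows with a single figure, precision and recall can be combined into an F-measure. The slides do not report one, so the sketch below is an addition using the standard F-beta formula, with beta as the knob for preferring recall (beta > 1) or precision (beta < 1). With beta = 1 the two rows come out at roughly 26.6 % and 31.3 %.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1_own_los = f_measure(0.215, 0.349)
f1_verbs_only = f_measure(0.341, 0.290)
print(round(f1_own_los, 3), round(f1_verbs_only, 3))
```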

Page 30: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Evaluation

Questions to be answered by a user-centered evaluation:

• Is there a preference for higher recall or for higher precision?

• Do users benefit from seeing a larger context?

Page 31: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

Integration of functionalities

[Integration diagram. Components shown: ILIAS Server (application logic, user interface, nuSoap, ILIAS content portal, LOs); Java Webserver (Tomcat) with KW/DC/Onto Java classes and data, Webservices (Axis), Servlets/JSP and LOs; Development Server (CVS) with KW/DC code and data and Ontology code; Migration Tool; Third Party Tools. Connection labels: "Access functionalities directly", "Evaluate functionalities in ILIAS", "Nightly Updates", "Use functionalities through SOAP".]

Page 32: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

User Interface Prototypes

Page 33: WP 2: Semi-automatic metadata generation driven by Language Technology Resources

User Interface Prototypes