
Language Technologies (2)

Valentin Tablan
University of Sheffield, UK

ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY


Overview

• Examples of HLT for the Semantic Web in use

• Work in context of EU SEKT and PrestoSpace projects

• Mixed Initiative Information Extraction

• RichNews (automated annotation of news programs)


Mixed Initiative IE

- Using Machine Learning for Information Extraction

- Both the human annotator (HA) and the system can take the initiative

- HA provides some bootstrap examples
- MI Engine learns and starts suggesting annotations
- HA corrects these annotations
- And so on…
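The annotate, learn, suggest, correct cycle above can be sketched as a small simulation. This is only an illustration, not the MI engine itself: `TokenMemoryLearner` and `mixed_initiative_loop` are hypothetical names, the learner is a toy that memorises the most frequent label per token, and the gold labels stand in for the human annotator.

```python
from collections import Counter, defaultdict


class TokenMemoryLearner:
    """Toy learner (assumption): remembers the most frequent label per token."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tokens, labels):
        for tok, lab in zip(tokens, labels):
            self.counts[tok.lower()][lab] += 1

    def suggest(self, tokens):
        # Unknown tokens get "O" (outside any annotation).
        return [self.counts[t.lower()].most_common(1)[0][0]
                if t.lower() in self.counts else "O"
                for t in tokens]


def mixed_initiative_loop(documents, gold_labels, bootstrap=1):
    """Return the number of labels the human had to supply or fix per document."""
    learner = TokenMemoryLearner()
    corrections_per_doc = []
    for i, (tokens, gold) in enumerate(zip(documents, gold_labels)):
        if i < bootstrap:
            corrections = len(tokens)            # HA annotates from scratch
        else:
            suggested = learner.suggest(tokens)  # engine takes the initiative
            corrections = sum(s != g for s, g in zip(suggested, gold))
        learner.train(tokens, gold)              # learn from the corrected document
        corrections_per_doc.append(corrections)
    return corrections_per_doc
```

In this sketch the human effort per document should fall as the learner sees more corrected examples, which is the behaviour the loop is meant to produce.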


What is Mixed Initiative IE?

• Also known as adaptive IE

• Not active learning! In active learning, the system selects the next document to be annotated by the user, which improves performance.

• Active learning is not part of the MI API, but the MI API will use it


Requirements

A MI engine must:

• Work as a background task
• Suggest annotations only when a given performance level is reached
• Be easily usable by a non-expert user
• Offer fine-grained parameters for experts


OBIE example

• Find instances in a document of entities and relations from an ontology

• Usable by a non-expert end user

• No learning corpus available

• Quick adaptation to a new ontology


Specifics of the MI API

• Train a statistical model

• Use several ML algorithms: SVM, Decision Trees, Neural Nets, etc.

• Compare the ML models and use the one which performs the best at time t

• Combine the ML models
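A minimal sketch of the two strategies listed above: picking the engine that performs best at time t, and combining the models. The function names, the score dictionary and the use of majority voting are assumptions made for illustration; the slides do not say how the MI API combines models.

```python
from collections import Counter


def best_engine(scores):
    """Return the name of the engine with the highest current score."""
    return max(scores, key=scores.get)


def should_suggest(scores, min_level):
    """Suggest annotations only once the best engine reaches the tolerated level."""
    return scores[best_engine(scores)] >= min_level


def majority_vote(predictions):
    """Combine per-token labels from several engines by majority vote.

    predictions is a list with one label sequence per engine.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]
```

For example, with scores `{"svm": 0.62, "tree": 0.71, "nn": 0.55}` the sketch selects `"tree"` and allows suggestions at a minimum level of 0.7 but not at 0.8.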


Expected behaviour

[Figure: performance (y-axis) against time / size of learning corpus (x-axis), with curves for Engine 1, Engine 2, Engine 3 and the MI Engine, and a line marking the minimal performance level tolerated.]


Limitations of the ML API

• Configure a file per engine
  > not suitable for a non-expert

• Set the class definition in the file
  > problem for OBIE: ontology is not NE
  > dynamic settings

• Engine characteristics: binary, numeric, nominal
  > uniform declaration, automatic conversion

• Operate on tokens
  > cannot annotate spans

• One class per engine
  > how to set several possible values for an entity with a binary engine?


Meta Engine

• Combines several instances of ‘simple’ engines (these have to be of the same engine type, e.g. Maxent)

• Accepts rich descriptions of class & attributes

• Converts into suitable format for ‘simple’ engine

• Merges results of embedded engines

• Behaves like a simple engine

• Hides the dirty job
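One way a meta engine could merge the results of several binary engines and work around the token-only limitation is sketched below. It assumes one binary engine per class that returns a score in [0, 1] per token; the one-vs-rest merge, the 0.5 cut-off and both function names are illustrative assumptions, not the documented behaviour of the Meta Engine.

```python
def one_vs_rest(binary_scores, cutoff=0.5):
    """Merge per-class binary engine scores into one label per token.

    binary_scores maps a class name to a list of per-token scores.
    Tokens whose best score is below the cutoff are labelled "O".
    """
    classes = list(binary_scores)
    n_tokens = len(binary_scores[classes[0]])
    labels = []
    for i in range(n_tokens):
        best = max(classes, key=lambda c: binary_scores[c][i])
        labels.append(best if binary_scores[best][i] >= cutoff else "O")
    return labels


def tokens_to_spans(labels):
    """Merge runs of identical non-"O" labels into (start, end, label) spans.

    The end index is exclusive.
    """
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):  # trailing "O" closes the last span
        if start is not None and lab != labels[start]:
            spans.append((start, i, labels[start]))
            start = None
        if start is None and lab != "O":
            start = i
    return spans
```

Client code then sees span annotations with one of several classes, even though each embedded engine only answered a yes/no question per token.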


MI API Architecture

[Diagram: GUI / client code, the Mixed Initiative API, and the Mixed Initiative Engine, which is shown with an Orchestrator, an Evaluation Module, a DataSet and three Meta Engines.]


Data Set

• Information stored as examples

• No documents

• Used by Meta Engines

• Possibly converted to a native Data Set format (e.g. SVMlight)

• Possibly reuse an existing implementation (WEKA, Yale, …)



Evaluation module

• Operates on the Data Set

• Choice of corpus splitting method (including K-fold cross-validation)

• Different evaluation metrics
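A small sketch of what K-fold splitting and one common family of metrics (precision, recall and F1 for a target class) look like over a set of examples. The function names are hypothetical and the slides do not specify which metrics the Evaluation Module offers.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_examples))
    for fold in range(k):
        test = indices[fold::k]          # every k-th example, offset by the fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def precision_recall_f1(gold, predicted, target):
    """Compute precision, recall and F1 for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Each example appears in exactly one test fold, so averaging a metric over the folds uses every example for evaluation once.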



Orchestrator

• Core of a MI Engine

• Manages the Meta Engines

• Uses the Data Set and Evaluation Module

• Returns information about the Meta Engines: confusion matrix, performance, etc.

• Combines the ML models

• Converts from / to annotations


Allegory: MI Engine = Orchestra

Music School (Teacher; ME ME ME ME)

1- Learn one/some/all instruments (entities)?

2- Exams for all at the same time?

3- Good enough and better than existing orchestra?

Orchestra (Conductor; ME ME ME ME)

1- Combine their skills

2- Play for an audience


Summary of MI IE

• Required component for Ontology based Information Extraction

• State-of-the-art functionalities

• Reach high performance level by combining classification algorithms


RichNews

• RichNews aims to automate the annotation of news programs

• Starts from recordings of broadcasts.

• Produces annotations that can be included in a semantic repository (e.g. KIM/Sesame)

• Works for English but most processing resources can be adapted for other languages.


Key Problems

• Speech recognition produces poor quality transcripts with many mistakes.

• A news broadcast contains several stories. How do we work out where one starts and another one stops?

• How can we make a summary or headline for each story from a poor quality transcript? How can we work out what kind of news it reports?


Augmented Television News

News broadcasts are often augmented with textual content.

– Usually only limited content is available.
– The TV company controls content production.

Rich News finds content automatically.

– Developed on BBC news.
– Recordings of broadcasts go in one end.
– Relevant news web pages are associated with the stories in the broadcasts fully automatically.


Semantic Indexing of News

Systems already exist that can index news broadcasts in terms of ‘named entities’ that they refer to.

– e.g. Mark Maybury’s Broadcast News Navigator.

– Entities such as cities, people, organizations are marked as such.

Rich News can improve annotation:

– Annotation is in terms of an ontology.

– Uses Automatic Speech Recognition, so can be applied when no subtitles are available.

– Web pages are used to help find named entities.


Using ASR Transcripts

ASR is performed by the THISL system.

• Based on ABBOT connectionist speech recognizer.

• Optimized specifically for use on BBC news broadcasts.

• Average word error rate of 29%.

• Error rate of up to 90% for out of studio recordings.
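For reference, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between the ASR output and a reference transcript, divided by the number of reference words. The sketch below computes it with a standard dynamic-programming edit distance; the function name is hypothetical, it assumes a non-empty reference, and it is not the scoring tool used to obtain the figures above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.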


SA General Architecture

[Diagram. Components shown: Media Object, Source Detection, Story Segmentation, several Source Extractors each with its Information Source and IE component, Multi-source IE with a Merger(?), and the Semantic Index.]


RichNews Architecture

[Diagram. Components shown: Media Object, THISL ASR, ASR Transcript, Story Segmenter (Story 1, Story 2, … Story N), Web Miner, Related Web Pages, KIM Ontological IE System, Entity / Instance, Semantic Index, and GATE/ELAN Manual Annotation (optional). The diagram also carries the labels Multi-source IE and Source Detection.]


Topical Segmentation

Uses C99 segmenter:

• Removes common words from the ASR transcripts.

• Stems the other words to get their roots.

• Then looks to see in which parts of the transcripts the same words tend to occur.

These parts will probably report the same story.
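The intuition behind these steps can be shown with a deliberately simplified sketch. This is not the C99 algorithm, which works on a similarity matrix over the whole transcript; it only compares neighbouring sentences. The stop-word list, the crude suffix-stripping stemmer and all function names are assumptions for illustration, and the input is assumed to be a non-empty list of sentences.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are",
             "was", "were", "has", "have", "by", "for", "at"}


def crude_stem(word):
    """Very rough stemmer (assumption): strip one common English suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_stems(sentence):
    """Lower-case, drop punctuation and common words, then stem the rest."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return {crude_stem(w) for w in words if w and w not in STOPWORDS}


def segment(sentences):
    """Start a new segment wherever adjacent sentences share no stems."""
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if content_stems(prev) & content_stems(cur):
            segments[-1].append(cur)     # shared vocabulary: same story
        else:
            segments.append([cur])       # vocabulary shift: new story
    return segments
```

On a noisy ASR transcript a real segmenter needs a wider context than one sentence, which is one reason a similarity-matrix approach is used instead.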


Key Phrase Extraction

Term frequency inverse document frequency (TF.IDF):

• Chooses sequences of words that tend to occur more frequently in the story than they do in the language as a whole.

• Any sequence of up to three words can be a phrase.

• Up to four phrases extracted per story.
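A minimal sketch of TF.IDF key-phrase ranking under the constraints listed above (phrases of up to three words, at most four phrases per story). The background collection standing in for "the language as a whole", the smoothing of document frequency and the function names are assumptions; the actual RichNews scoring details are not given in the slides.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """All word sequences of length 1..max_n, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def key_phrases(story, background, max_phrases=4):
    """Rank the story's n-grams by TF.IDF against background documents.

    story is a token list; background is a list of token lists.
    """
    tf = Counter(ngrams(story))
    doc_grams = [set(ngrams(doc)) for doc in background]
    n_docs = len(background) + 1         # background plus the story itself

    def score(phrase):
        df = 1 + sum(phrase in grams for grams in doc_grams)
        return tf[phrase] * math.log(n_docs / df)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda p: (-score(p), p))
    return ranked[:max_phrases]
```

A phrase that is frequent in the story but rare in the background gets the highest score, so common words that occur everywhere drop to the bottom of the ranking.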


Web Search

The key-phrases are used to search the BBC website and the Times, Guardian and Telegraph newspaper websites for web pages reporting each story in the broadcast.

• Searches are restricted to the day of broadcast, or the day after.

• Searches are repeated using different combinations of the extracted key-phrases.

The text of the returned web pages is compared to the text of the transcript to find matching stories.
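The two steps above, repeating searches with different key-phrase combinations and comparing page text with the transcript, might look like the sketch below. The ordering of combinations, the bag-of-words cosine similarity, the 0.3 threshold and all function names are assumptions; the slides do not say which similarity measure RichNews uses, and the web search itself is left out.

```python
import math
from collections import Counter
from itertools import combinations


def query_combinations(phrases, min_size=2):
    """Key-phrase combinations, largest (most specific) first."""
    return [combo
            for size in range(len(phrases), min_size - 1, -1)
            for combo in combinations(phrases, size)]


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_match(transcript, pages, threshold=0.3):
    """Return the page text most similar to the transcript, or None."""
    scored = [(cosine_similarity(transcript, page), page) for page in pages]
    score, page = max(scored, default=(0.0, None))
    return page if score >= threshold else None
```

Returning None when nothing clears the threshold favours precision over recall: a story is left without a page rather than linked to a wrong one.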


Evaluation

Success in finding matching web pages was investigated.

• Evaluation based on 66 news stories from 9 half-hour news broadcasts.

• Web pages were found for 40% of stories.

• 7% of pages reported a closely related story, instead of that in the broadcast.

Results are based on an earlier version of the system, which only used BBC web pages.


Using the Web Pages

Web pages can be made available to the viewer as additional content.

The web pages contain:

• A headline, summary and section for each story.

• High quality text that is readable, and contains correctly spelt proper names.

• More in-depth coverage of the stories.

Web pages could be included in the broadcast by the TV company.

Or discovered by a device in viewers’ homes.


Semantic Annotation

• KIM can semantically annotate the text derived from the web pages:

• KIM will identify people, organizations, locations etc.

• KIM performs well on the web page text, but very poorly when run on the transcripts directly.

• This allows for semantic, ontology-aided searches for stories about particular people, locations, etc.

• So we could search for people called Sydney, which would be difficult with a text-based search.
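The difference between the two kinds of search can be illustrated with a toy index of annotations. The `Annotation` record, the class names and both search functions are hypothetical and do not reflect the KIM API; the point is only that a semantic search can filter on the ontology class as well as on the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    story_id: str
    text: str
    entity_class: str  # illustrative class names, e.g. "Person", "Location"


def text_search(annotations, query):
    """Plain text search: every story mentioning the query string."""
    return sorted({a.story_id for a in annotations
                   if query.lower() in a.text.lower()})


def semantic_search(annotations, query, entity_class):
    """Ontology-aided search: the mention must also have the given class."""
    return sorted({a.story_id for a in annotations
                   if a.entity_class == entity_class
                   and query.lower() in a.text.lower()})
```

With one story about the city of Sydney and one about a person called Sydney, the text search returns both stories, while the semantic search for the class "Person" returns only the second.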


Search for Entities


Story Retrieval


Summary of RichNews

• Rich News can automatically segment, describe and classify news broadcasts:

• Requires an on-line textual source that closely parallels the broadcasts.

• High precision, moderate recall (so far).

• Easy to adapt to other languages.