Introducing CRL


The Computing Research Laboratory at NMSU. Jim Cowie – Director; Steve Helmreich – Deputy Director. 505-646-2141, [email protected], http://crl.nmsu.edu. Established in 1983 by the New Mexico Legislature as a Center of Technical Excellence.

Transcript of Introducing CRL

Page 1: Introducing CRL

Introducing CRL

Computing

Research

Laboratory

Page 2: Introducing CRL

The Computing Research Laboratory at NMSU

Jim Cowie – Director

Steve Helmreich – Deputy Director

[email protected] http://crl.nmsu.edu

• Established in 1983 by New Mexico Legislature as a Center of Technical Excellence

• CRL is a Research Department in the College of Arts and Sciences at New Mexico State University

• From 1983 to 1989 received more than $6.5 million in state funding.

• Since 1990, entirely self-supporting on research grants and contracts.


Page 3: Introducing CRL

CRL Capabilities and Expertise

• Multi-lingual text processing

• Speech processing and generation

• Human Computer Interaction

• Team of Computer Scientists, Psychologists, Linguists, Computational Linguists, Geographers, Biochemists and Mathematicians (~40)

• Capable of delivering complex, working prototype systems


Page 4: Introducing CRL

Language Engineering at CRL

• Information retrieval

• Language learning and language teaching

• Automatic translation

• Summarization

• Question answering

• Dictionary development

• Knowledge discovery

Page 5: Introducing CRL

Overview of Talk

• Projects related to Machine Translation

– Pragmatics-based Machine Translation

– Jargon analysis project

– IL Annotation project

• Projects using Machine Translation

– Expedition / Boas

– MOQA (Question / Answering)

Page 6: Introducing CRL

Machine Translation triangle

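[Diagram: triangle with Source Language and Target Language at the base corners. Direct Translation links them along the base, Transfer crosses at mid-height, and the Interlingual (IL) representation sits at the apex. Analysis runs up the source side; Generation runs down the target side.]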

Page 7: Introducing CRL

Machine Translation triangle

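[Diagram: the same triangle made bidirectional. The base corners are labelled Source & Target Language and Target & Source Language, and the sides Analysis & Generation and Generation & Analysis. Direct Translation, Transfer and Interlingual (IL) are unchanged.]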

Page 8: Introducing CRL

Machine Translation triangle

[Diagram: bidirectional Machine Translation triangle, with the same labels as on the previous slide.]

Page 9: Introducing CRL

CRL Machine Translation Projects

• XTRA – Chinese-English IL, 1986-88

• ULTRA – five languages IL, 1988-90

• Pangloss – multi-site Spanish-English IL, 1992-95

• Mikrokosmos – Spanish-English IL, 1995-98

• Corelli – multi-lingual transfer, 1998-2001

Page 10: Introducing CRL

Characteristics of IL MT

• Analysis to and generation from “meaning” of the input

• Disambiguation to an unambiguous language-independent representation (IL)

• Use of world knowledge to disambiguate

• World knowledge stored and manipulated through an Ontology

Page 11: Introducing CRL

Jesus of Montreal

• Woman to priest guiltily coming out of her bedroom (in French): “Come on out, we’re not playing a scene from Feydeau.”

• English subtitle: “Come on out. This isn’t a bedroom farce.”

Page 12: Introducing CRL

Which floor is this?

• In a Spanish newspaper article about expensive real estate rental in Moscow:

“Nothing’s available on the ‘segundo piso’ but there’s still some space left on the ‘tercero piso.’”

• T1: second floor / third floor

• T2: third floor / fourth floor

Page 13: Introducing CRL

Earthquakes – who is to blame?

• Acumulación de víveres por anuncios sísmicos en Chile

• Hoarding Caused by Earthquake Predictions in Chile

• STOCKPILING OF PROVISIONS BECAUSE OF PREDICTED EARTHQUAKES IN CHILE

Page 14: Introducing CRL

Pragmatics-based MT hypothesis

• Translations are made on the basis of interpretations

• Interpretations are a set of coherent inferences about the content and the context of the message

• These inferences are based on

– Beliefs of the translator about the beliefs of the author

– Beliefs of the translator about the beliefs of the target audience

– Beliefs of the translator about the world

Page 15: Introducing CRL

Machine Translation triangle

[Diagram: the bidirectional triangle (Source & Target Language, Target & Source Language, Direct Translation, Transfer, Interlingual (IL), Analysis & Generation, Generation & Analysis) with one additional label: Interpretations.]

Page 16: Introducing CRL

Terrorist/Freedom Fighter

• sindicalistas: Union Members / Labor Leaders

• asesino: killer / assassin

• asesinados: murdered / assassinated

• campesinos: small farmers / peasants

• sin tierras: without land / landless

• terrateniente: landowner / landholder

Page 17: Introducing CRL

Hypothesis

• It is possible to identify an author's viewpoint from the vocabulary (jargon) used, particularly in the use of alternate lexical items referring to the same concept or object

Page 18: Introducing CRL

Hypothesis

• Social groups are organized not just around topics but also around points of view and

• develop jargons to express those points of view

• Members of those social groups generally hold to those points of view and

• use the jargons to express themselves

• THUS identifying an author’s jargon also identifies the groups he/she belongs to and the beliefs he/she is likely to hold

Page 19: Introducing CRL

Training Corpus

• Issue: Abortion

• Text Size: approximately 8000 tokens each

• Text Size (types): 2273 pro-choice / 2168 pro-life

• Significant unique vocabulary: 79 pro-choice / 68 pro-life

• Significant common vocabulary: 113 / 37

Page 20: Introducing CRL

Approach

• Unique vocabulary: 1581 pro-choice/1476 pro-life

• Common vocabulary: 692

• Significant unique vocabulary:

– 79 pro-choice

– 68 pro-life

• Significant common vocabulary: 113 (37)
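The split into unique and common vocabulary can be sketched with plain set operations. This is only an illustration: the tokenizer, the function names and the two toy sentences are assumptions, not the project's code or data. The slide's own figures are consistent with this kind of split (1581 + 692 = 2273 pro-choice types, 1476 + 692 = 2168 pro-life types).

```python
import re


def vocabulary(text):
    """Return the set of lower-cased word types in a text (simplistic tokenizer)."""
    return set(re.findall(r"[a-z][a-z'-]*", text.lower()))


def split_vocabulary(text_a, text_b):
    """Split two vocabularies into unique-to-A, unique-to-B and common types."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    return vocab_a - vocab_b, vocab_b - vocab_a, vocab_a & vocab_b


# Toy stand-ins for the two corpora (invented sentences, not the training data).
pro_choice = "Activists kept the clinic open during the blockade"
pro_life = "The unborn child was mentioned outside the clinic"

only_choice, only_life, common = split_vocabulary(pro_choice, pro_life)
print(sorted(only_choice), sorted(only_life), sorted(common))
```

Deciding which of these types count as "significant" needs a further frequency test that the slide does not specify.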

Page 21: Introducing CRL

Unique Vocabulary – Pro-life

• abnormalities, aborted, abortifacient, abortifacients, abortion-inducing, abortionist, abortionists, adultery, amniotic, bible, blessed, cancer-causing, chastisement, chastisements, chastises, complication, complications, contrite, creator, depression

Page 22: Introducing CRL

Unique Vocabulary – Pro-choice

• activism, activists, alley, anti-abortion, anti-choice, anti-democratic, antiabortionists, arson, arsonist, arsons, attorney, attorney’s, blockade, blockaders, blocked, blocking, bomb, bombing, bombings

Page 23: Introducing CRL

Significant Common Vocabulary

Term          Pro-life   Pro-choice
clinic(s)         3          46
fetus            22           7
parenthood        2          14
planned           2          15
unborn           15           1
week(s)          37           8
woman('s)         9          27
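One simple way to rank shared terms by how strongly they lean toward one side is a smoothed log ratio of the two counts. The sketch below uses the counts from this slide and relies on the two training texts being about the same size (approximately 8000 tokens each). The measure and the smoothing constant are assumptions chosen for illustration, not necessarily what the project used.

```python
import math

# Counts copied from the slide: term -> (pro-life count, pro-choice count).
COUNTS = {
    "clinic(s)": (3, 46),
    "fetus": (22, 7),
    "parenthood": (2, 14),
    "planned": (2, 15),
    "unborn": (15, 1),
    "week(s)": (37, 8),
    "woman('s)": (9, 27),
}


def log_ratio(pro_life, pro_choice, smoothing=0.5):
    """Smoothed log2 ratio: positive leans pro-life, negative leans pro-choice."""
    return math.log2((pro_life + smoothing) / (pro_choice + smoothing))


scores = {term: log_ratio(*pair) for term, pair in COUNTS.items()}

# Print terms from most pro-choice to most pro-life.
for term, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{term:12s} {score:+.2f}")
```

If the corpora differed in size, the counts would first have to be converted to relative frequencies.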

Page 24: Introducing CRL

One-year project

• Using sounder statistical measurements

– Base line corpus

– Statistically significant differences

– Other methods of measuring differences

• Using collocations as well as single words

• Looking for “synonymous” terms

– WordNet

– Ontology

– Roget’s

Page 25: Introducing CRL

Experiments

• Differentiate opinions in a binary opposition within texts on the subject of opposition

• Differentiate opinions among a plurality of views within texts on the subject

• Differentiate opinions in a binary opposition within texts on a different subject

• Differentiate opinions among a plurality of views within texts on a different subject

• Differentiate multiple viewpoints in any article

Page 26: Introducing CRL

Problems with IL approach

• Idiosyncratic – no common understanding of what IL should be or look like

• Limited automatic acquisition – most of the knowledge base and lexicon is hand-coded

Page 27: Introducing CRL

Interlingual Annotation of Multilingual Text Corpora

• Computing Research Laboratory – NMSU

• Mitre Corporation

• UMIACS – U Maryland

• Columbia University

• Language Technologies Institute – CMU

• Information Sciences Institute – USC

Page 28: Introducing CRL

Approach

• Collection of texts in six languages

• Three translations of each into English

• Tools to analyze grammatical aspects

– Morphological analysis

– Name recognition

– Chunking

Page 29: Introducing CRL

Develop IL Representation

• Through study of texts

• Through examination of current ILs

• Develop formal definition

– Rich representation

– Compatible with under-specification

• Develop coding manuals and guarantee inter-coder reliability

Page 30: Introducing CRL

Annotate the Corpus

• All sites / all texts

• One site in charge of one aspect of IL

• Frequent interaction

• Regular joint meetings

Page 31: Introducing CRL

Evaluate the results

• Inter-coder reliability

• Growth rate

• Grain size

• Quality of generation
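Inter-coder reliability is often reported with a chance-corrected agreement score. The sketch below computes Cohen's kappa for two coders. It is one common choice, shown as an assumption; the slides do not say which measure the project used, and the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each coder's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations of four items by two coders.
coder_a = ["AGENT", "THEME", "AGENT", "THEME"]
coder_b = ["AGENT", "THEME", "THEME", "THEME"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

With more than two coders, as in an all-sites annotation effort, a multi-rater measure would be needed instead.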

Page 32: Introducing CRL

Trends in HLT Research Funding

• Focus on sub-tasks

– Named entity recognition

– Coreference resolution

– Word sense disambiguation

• Bring multi-lingual capabilities to parallel technologies

– Multi-lingual IR/IE/summarization

• Bring multiple technologies into one project

Page 33: Introducing CRL

Three such projects at CRL

• Expedition / Boas

• MOQA – Meaning-Oriented Question/Answering

• Personal Profiler

Page 34: Introducing CRL

Expedition: A tool for building Machine Translation systems

The Problem

Given two people, a linguist who knows a language and a programmer, provide a support system which allows them to build a machine translation system for that language in six months.

Project is completed and we are now using it to build translation systems for Somali and Urdu.

You can try out the system at http://aiaia.nmsu.edu

Page 35: Introducing CRL

Boas: “A Linguist in the Box”

Boas is a semi-automatic knowledge elicitation system that guides a language speaker through the process of developing the static knowledge sources for a moderate-quality, broad-coverage MT system from any “low-density” language into English in about six months.

Some of the tasks include providing a list of characters and morphological features, paradigms for inflected classes, equivalents of closed-class items, translation of place names and open class items from English into the source language.

Page 36: Introducing CRL

Language knowledge acquisition has been a bottleneck for MT development and deployment for over 40 years. At the same time, the dearth of data resources has strongly limited the deployment of any of the recent corpus-based techniques in practical MT environments.

Expedition is a “quick ramp-up” MT environment between “low density” languages and English which is a step toward alleviating these problems.

Boas, the main knowledge acquisition module inside Expedition, includes resident knowledge about

• a set of potential source languages

• generalized parametric typological knowledge about languages in general and

• methods and configurations for human-computer interaction.

It is designed for use by a team which does not include trained computational linguists.

Page 37: Introducing CRL

Boas contains knowledge about human language and means of realization of its phenomena in a number of specific languages and is, thus, a kind of a “linguist in the box” that helps non-professional acquirers with the task whose complexity is well-known.

Page 38: Introducing CRL

The ethnologist and linguist Franz Boas was the founder of the American school of descriptive linguistics.

In this photo (circa 1900), he is shown posing for a model being made of a Kwakiutl Winter Ceremonial dancer, in which the dancer emerges from a circular hole cut in the dancing screen.

Page 39: Introducing CRL
Page 40: Introducing CRL
Page 41: Introducing CRL
Page 42: Introducing CRL

Meaning-Oriented Question-Answering with Ontological Semantics

An AQUAINT Project from

ILIT

Page 43: Introducing CRL

Development Strategy

• Meaning oriented question answering

• Rapid Prototyping using pre-existing components

• Evaluation of end-to-end system performance for specific tasks (collaboration with AWARE project, Bill Ogden, CRL)

• Project commenced August 2002

• Current system runs on Linux or Windows 2000

Page 44: Introducing CRL

Meaning-Oriented Question-Answering with Ontological Semantics

• Initial Domain: travel and meetings

– question understanding and interpretation

– determining the answer and

– presenting the answer

• Two kinds of data source

– Structured Fact Repository containing instances of ontological entities

– open text (in English, Arabic and Farsi)

Page 45: Introducing CRL

System Overview (V0)

[Diagram components: Document Sources; Document Retrieval; Text Analyzer; Human Acquisition; Fact Repository; Query Interface & Answer Formulation. Legend: human; machine.]

Page 46: Introducing CRL

System Overview (V1 now)

[Diagram components: Document Sources; Document Retrieval; Text Analyzer; Fact Repository; Query Interface & Answer Formulation. Other labels: human; batch; real-time; questions.]

Page 47: Introducing CRL

System Overview (V2)

[Diagram components: Document Sources; Document Retrieval; Text Analyzer; Fact Repository; Query Interface & Answer Formulation. Other labels: human; batch; real-time; questions & texts.]

Page 48: Introducing CRL

Batch Processing Overview

[Diagram: Web Spider → Documents → Keizai Indexing → Document Collection → Keizai Retrieval → Document Subset → Text Analysis → Text Meaning Representation → TMR to FR Converter → Fact Repository]

Page 49: Introducing CRL

Batch Mode - Fact Repository Population

• Spidered contemporary text

• Retrieval done using Keizai retrieval system (Unicode based)

• Uses a list of interesting people and travel keywords

• Selected documents saved and automatically processed using UMBC’s analyzer (which produces text meaning representations)

• Instances of concepts from TMR extracted and stored in Fact Repository

Page 50: Introducing CRL

Interactive Processing Overview

[Diagram labels: Query Interface; Information Server; NL Query; Analyzer; TMR; Instances; Instance Finder; Fact Repository; XML Answer; Answer formulation.]

Page 51: Introducing CRL
Page 52: Introducing CRL
Page 53: Introducing CRL
Page 54: Introducing CRL

Interactive Mode – Question Answering

• Question submitted – text or structured query

• Routed to Fact Repository (structured queries) or to Text Analyzer (NL queries)

• Question converted to TMR

• TMR to:

– Structured query (if good match, and sent to user for validation), or

– Converted to a direct Fact Repository query

• Answer retrieved from FR and displayed

• Fall back queries if basic query cannot be answered

Follow up queries can be further questions or use the multi-modal facilities of the interface.

A trace of the dialog is maintained.

Page 55: Introducing CRL

Information Server

• Mediates between User Interface and all System Components

– Fact Repository

– Question Analysis

– TMR Production

• Uses XML to communicate with Answer Formulation Component

• Java structures communicate with fact repository interface

• Java-lisp interface communicates with text analyzer

Page 56: Introducing CRL

Structured Fact Repository

• Uniform format for all kinds of data

• Uniform support for multiple applications and tools

• Semantically anchored in general ontology

• Implemented using PostgreSQL

Page 57: Introducing CRL

(REQUEST-INFO-842

(THEME (VALUE (MEMBER-OF-842.DOMAIN)))

(INSTANCE-OF (VALUE (REQUEST-INFO)))

)

(MEMBER-OF-842

(TIME (VALUE ((FIND-ANCHOR-TIME))))

(RANGE (VALUE (POLITICAL-ENTITY-842)))

(INSTANCE-OF (VALUE (MEMBER-OF)))

)

(POLITICAL-ENTITY-842

(OBJECT-NAME (VALUE ("Al Qaeda")))

(INSTANCE-OF (VALUE (POLITICAL-ENTITY)))

)

TMR for “Who is in al Qaeda?”
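A TMR in this parenthesized notation can be read into ordinary data structures with a few lines of code. The sketch below is a minimal s-expression reader written for illustration. The function names are hypothetical, and the slot layout `(SLOT (VALUE (x)))` is inferred only from the frames shown on this slide, not from the actual MOQA software.

```python
import re

# Tokens: quoted strings, single parentheses, or bare symbols.
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def parse(text):
    """Parse one balanced s-expression into nested lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing parenthesis
            return items
        return token.strip('"')

    return read()


def frame_to_dict(frame):
    """Turn [NAME, [SLOT, [VALUE, [x]]], ...] into (name, {slot: x})."""
    name, *slots = frame
    return name, {slot: facet[1][0] for slot, facet in slots}


TMR_FRAME = """
(POLITICAL-ENTITY-842
  (OBJECT-NAME (VALUE ("Al Qaeda")))
  (INSTANCE-OF (VALUE (POLITICAL-ENTITY))))
"""

name, slots = frame_to_dict(parse(TMR_FRAME))
print(name, slots)
```

Dotted references such as `MEMBER-OF-842.DOMAIN` are kept as plain strings here; resolving them against other frames would be a separate step.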

Page 58: Introducing CRL
Page 59: Introducing CRL

try-v3
  syn-struc
    root   try
    cat    v
    subj   root $var1
           cat  n
    xcomp  root $var2
           cat  v
           form OR infinitive gerund
  sem-struc
    set-1     element-type refsem-1
              cardinality  >=1
    refsem-1  sem    event
              agent  ^$var1
              effect refsem-2
    modality
      modality-type   epiteuctic
      modality-scope  refsem-2
      modality-value  < 1
    refsem-2  value ^$var2
              sem   event

Page 60: Introducing CRL

REQUEST-INFO-130
  THEME         DEVELOP-2601.PURPOSE DEVELOP-2601.REASON
  TEXT-POINTER  why
  INSTANCE-OF   REQUEST-INFO

DEVELOP-2601
  THEME         SET-2555
  AGENT         NATION-97
  PHASE         CONTINUOUS
  TIME          FIND-ANCHOR-TIME
  INSTANCE-OF   DEVELOP
  TEXT-POINTER  developing

NATION-97
  HAS-NAME      Iraq
  INSTANCE-OF   NATION
  TEXT-POINTER  Iraq

SET-2555
  ELEMENT-TYPE  WEAPON
  CARDINALITY   > 1
  INSTRUMENT-OF KILL-1864
  THEME-OF      DEVELOP-2601
  INSTANCE-OF   WEAPON
  TEXT-POINTER  weapons

KILL-1864
  THEME         SET-2556
  INSTRUMENT    SET-2555
  INSTANCE-OF   KILL
  TEXT-POINTER  destruction

SET-2556
  THEME-OF      KILL-1225
  ELEMENT-TYPE  HUMAN
  CARDINALITY   > 100
  INSTANCE-OF   HUMAN
  TEXT-POINTER  mass

“Why is Iraq developing weapons of mass destruction?”

Page 61: Introducing CRL

Resume Generator

Generating a resume for an individual:

1. Collect and prepare the data

– Gather documents from the web in English, Russian and Spanish.

– Filter the documents to reduce the data to a collection of related documents.

2. Individual Document Summarization (this is done for each document in the collection)

– Determine a date for the document.

– Select concise relevant pieces of information from the filtered collection of documents.

– Determine a date for each of the selected extracts.

– Translate the pieces of text into English (our target language).

3. Profile Generation

– Merge the translated text extracts in chronological order to produce the cross-document summary.

– Generate the output form for the end user.
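The final profile-generation step, merging dated extracts into chronological order, can be sketched as follows. The `Extract` record, the output layout and the sample entries are invented for illustration; the slides do not describe the actual data structures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Extract:
    when: date    # date determined for the extract
    text: str     # extract text, assumed already translated into English
    source: str   # hypothetical document identifier


def build_profile(extracts):
    """Merge extracts from many documents into one chronological summary."""
    ordered = sorted(extracts, key=lambda e: e.when)
    return [f"{e.when.isoformat()}  {e.text}  [{e.source}]" for e in ordered]


# Invented extracts standing in for the output of step 2.
extracts = [
    Extract(date(2001, 5, 1), "Appointed director of the laboratory.", "doc-es-17"),
    Extract(date(1998, 9, 12), "Joined the university as a researcher.", "doc-ru-03"),
    Extract(date(2003, 2, 20), "Spoke at a conference in Madrid.", "doc-en-42"),
]

profile = build_profile(extracts)
print("\n".join(profile))
```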

Page 62: Introducing CRL
Page 63: Introducing CRL
Page 64: Introducing CRL
Page 65: Introducing CRL
Page 66: Introducing CRL
Page 67: Introducing CRL
Page 68: Introducing CRL
Page 69: Introducing CRL

Language Engineers in Short Supply

• Emerging field – combining Linguistics, Computational Linguistics, Computer Science, Systems Analysis, and Human Factors

• Masters Degrees being offered at

– University of Southern California

– Arizona University

– University of Colorado

– Carnegie Mellon University

• Potential both for supporting research and for developing applications

Page 70: Introducing CRL

Former CRL Staff and Students are working on language applications at -

• Microsoft Natural Language Group

• Systran

• AT&T

• Telelogue (talking yellow pages)

• Westlaw (Spanish language processing group)

• General Electric

• Motorola Chinese Telephony Group

• The Institute for Genomic Research (TIGR) (bioinformatics)

• University of Maryland Baltimore County

• University of Sheffield

Page 71: Introducing CRL
Page 72: Introducing CRL

Appendix

Ecology Development

Page 73: Introducing CRL

Challenges and Needs

• Research into appropriate processing methods for language ecology is needed. Only a handful of languages have had any kind of research/evaluation effort for these topics (English, French, Japanese, Spanish).

• Research corpora need to be produced and made publicly available. The main source of materials at the moment is the Linguistic Data Consortium (http://www.ldc.upenn.edu/).

• Language processing resources such as proper name lists (onomastica), lexicons, morphological analyzers, and patterns of features for names need to be produced.

Page 74: Introducing CRL

Requirements for Basic Analysis

• Corpora

• Markup

• Character sets

• Punctuation

• Part of Speech Tagging

• Noun Phrase Recognition

Page 75: Introducing CRL

Requirements (Continued)

• Numbers and Dates

• Onomastica

• Un-attributed Proper Names

• Syntax

• General Guidelines

• Challenges and Problems

Page 76: Introducing CRL

Why do We Need Corpora?

• Ground our development in reality

• Provide basis for statistical processing

– testing

– learning

Page 77: Introducing CRL

Types of Corpus

• Raw - only markup from the source - e.g. newswire

• Cleaned - standardized markup - e.g. TREC corpora

• NLP specific markup - e.g. Penn treebank, Wall Street Journal

• Parallel Corpora with alignment markup

Page 78: Introducing CRL

Sources - Standard

• Linguistic Data Consortium - LDC

• European Language Resources Agency - ELRA

• National Institute for Standards and Technology - NIST

• Gutenberg Archive

• Oxford Text Archive

• International Computer Archive for Modern English - ICAME

• + Many national initiatives

Page 79: Introducing CRL

Sources - Do It Yourself

• Participation in evaluations

– TREC, MUC, Amaryllis, Semeval

– benefits are tagged corpora focused on a specific task

• Web spidering

– Site grabbing – web spiders

– Language grabbing – CRL language recognizing web spider

• Newswire capture

• Parallel Corpora

– Embassies, Company web sites

– United Nations, Pan American Health Organization

Page 80: Introducing CRL

Character Sets

• For 8 bit

– Various ISO standards – Latin 1 – Latin 5

– Microsoft variants

– Others – e.g. KOI8 for Russian

• Various 16 bit Japanese and Chinese standards – EUC, SJIS, Big5 …

• Unicode

– UTF-8 – variable width, one to four 8-bit bytes per character

– UCS-2 – 16 bit (although many characters can be composed of multiple characters)

Page 81: Introducing CRL

Character Sets

• Eight bit character sets may be simpler if processing only one language, or one language + English

• Unicode offers the possibility of universal tokenization (recognizing words), based on character classifications

• Key is to make sure resources and data being processed use the same character set
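The two points above, decoding everything to one character set and then tokenizing on character classes, can be sketched with Python's standard `unicodedata` module. The class groupings and function names are assumptions made for this example, not a description of CRL's tokenizer.

```python
import unicodedata


def char_class(ch):
    """Map a character to a coarse class using its Unicode general category."""
    category = unicodedata.category(ch)
    if category[0] in "LM":          # letters and combining marks
        return "word"
    if category[0] == "N":           # digits and other numerics
        return "number"
    if category[0] == "Z" or ch.isspace():
        return "space"
    return "other"                   # punctuation, symbols, controls


def tokenize(text):
    """Group runs of word or number characters; emit other characters singly."""
    tokens, current, current_class = [], "", None
    for ch in text:
        cls = char_class(ch)
        if cls == current_class and cls in ("word", "number"):
            current += ch
            continue
        if current:
            tokens.append(current)
        current, current_class = "", None
        if cls in ("word", "number"):
            current, current_class = ch, cls
        elif cls == "other":
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


# Bytes in an 8-bit Russian encoding must be decoded before tokenizing.
raw = "Москва, 1989 год.".encode("koi8_r")
tokens = tokenize(raw.decode("koi8_r"))
print(tokens)
```

This works for scripts that separate words with spaces or punctuation. Languages written without word separators, such as Chinese or Japanese, need a real segmenter on top of it.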

Page 82: Introducing CRL

Sentence Segmentation

• Essential step in analysis

• Complicated by ambiguous use of punctuation and by document headings and sub-headings (which should be processed separately)

• For languages in which “.” is used as an abbreviation marker, segmentation needs a list of abbreviations plus automatic recognition of abbreviations using a lexicon

• Still requires heuristics to handle abbreviations at the end of a sentence
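A minimal sketch of abbreviation-aware sentence splitting follows. The abbreviation lists are tiny and illustrative, and the end-of-sentence heuristic (an abbreviation ends a sentence only if the next word does not start in lower case) is an assumption chosen for this example.

```python
import re

# Illustrative lists only; a real system would load these from a lexicon.
TITLES = {"mr.", "mrs.", "ms.", "dr.", "prof."}   # assumed never to end a sentence
OTHER_ABBREVIATIONS = {"co.", "inc.", "corp.", "ltd.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text at ., ! or ? while skipping likely abbreviation periods."""
    sentences, start = [], 0
    # Candidate boundaries: punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        next_text = text[end:].lstrip()
        if last_word in TITLES:
            continue  # "Dr. Smith": a title is followed by a name
        if last_word in OTHER_ABBREVIATIONS and next_text[:1].islower():
            continue  # "Co. said": the sentence carries on
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


sample = ("Bridgestone Sports Co. said Friday it has set up a joint venture. "
          "Dr. Smith joined Acme Inc. He starts Monday.")
sentences = split_sentences(sample)
print(sentences)
```

The heuristic still fails when an abbreviation is followed by a proper name inside the same sentence, which is why headings and name recognition are handled separately.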

Page 83: Introducing CRL

Part of Speech Tagging

• Either statistical, based on tagged corpora, or rule based. Tags here are based on the Penn Treebank

('november','NP') ('24','CD') (',',',') ('1989','CD') (',',',') ('friday','NP') ('bridgestone','NP') ('sports','NPS') ('co','NP') ('said','VBD') ('friday','NP') ('it','PP') ('has','VBZ') ('set','VBN') ('up','RP') ('a','DT') ('joint','JJ') ('venture','NN') ('in','IN')

Page 84: Introducing CRL

Phrase Recognition

• Goal is to reduce the complexity of text processed by Semantic Analysis to processing heads of phrases

• To recognize, for example, noun phrases describing companies

– “the third Japanese electric appliance concern”

– “the new company”

• and to recognize noun phrases in general

– “golf clubs”, “metal woods”

Page 85: Introducing CRL

Morphological Analysis

• Inflection analysis + part of speech tagging

• Needed to detect various features

– Number, tense, gender, role …

• And to produce a citation form for lexical lookup

• MORE?

Page 86: Introducing CRL

Numbers and Dates

• Numbers in numeric and alphabetic form can be recognized and grouped with punctuation and qualifiers using simple regular expressions

– Percentages, money, temperatures, weights etc.

• Dates can also be recognized by regular expressions by adding months and a few separator characters to the set of tokens used by the regular expressions

– Thus NUM SLASH NUM SLASH NUM would be an acceptable date expression; tests on ranges could be added

• Many languages support multiple calendars and these all need to be supported (Japanese, Arabic)
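The NUM SLASH NUM SLASH NUM pattern with range tests can be sketched as below, together with a month-name pattern. The month/day/year order, the English month list and the function name are assumptions for this example. Other calendars are not handled.

```python
import re

MONTHS = ("january february march april may june july august "
          "september october november december").split()

# NUM SLASH NUM SLASH NUM, read here as month/day/year (an assumption).
NUMERIC_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2,4})\b")
# MONTH NUM COMMA NUM, e.g. "November 24, 1989".
MONTH_DATE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b", re.IGNORECASE
)


def find_dates(text):
    """Return date expressions whose month and day values pass range tests."""
    found = []
    for match in NUMERIC_DATE.finditer(text):
        month, day = int(match.group(1)), int(match.group(2))
        if 1 <= month <= 12 and 1 <= day <= 31:
            found.append(match.group(0))
    for match in MONTH_DATE.finditer(text):
        if 1 <= int(match.group(2)) <= 31:
            found.append(match.group(0))
    return found


dates = find_dates("Filed 11/24/1989, revised November 24, 1989; ratio 45/99/1989.")
print(dates)
```

The range tests reject `45/99/1989` even though it matches the token pattern, which is the point of adding them.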

Page 87: Introducing CRL

Onomastica

• Lists of proper names are an essential resource for text processing.

• Do not need to be huge as many can be recognized automatically using context patterns – e.g. “we enjoyed our visit to Plaster, Texas”

• A large list of place names + well known people and company names that can be regularly found in abbreviated form (Ford, Bush etc.)

• Transliteration software may be useful to help understanding in translated texts

Page 88: Introducing CRL

Un-attributed Proper Names

For each language the following resources are required

• databases of proper name components

– Human names, company terminators, company start and end words, all the contents of the Onomasticon

• patterns to combine proper name components

– Mostly regular expressions

• name abbreviation algorithms

Toyota Motor Corporation -> Toyota Motor -> Toyota

International Business Machines -> IBM

• context based patterns

– A spokesman for eBay said ….
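Two of the resources above, name abbreviation and context-based patterns, can be sketched in a few lines. The terminator list, the function names and the single context pattern are illustrative assumptions, not the contents of an actual onomasticon.

```python
import re

# Illustrative company terminators; a real list would be language specific.
COMPANY_TERMINATORS = {"corporation", "corp.", "company", "co.", "inc.", "ltd."}

# Context-based pattern: the text between "spokesman for" and "said" is a name.
SPOKESMAN = re.compile(r"\bspokes(?:man|woman|person) for (.+?) said\b")


def short_forms(name):
    """Successively shorter forms: drop the terminator, then trailing words."""
    words = name.split()
    forms = []
    if words and words[-1].lower() in COMPANY_TERMINATORS:
        words = words[:-1]
        forms.append(" ".join(words))
    while len(words) > 1:
        words = words[:-1]
        forms.append(" ".join(words))
    return forms


def acronym(name):
    """Initial letters of each word, upper-cased."""
    return "".join(word[0].upper() for word in name.split() if word[0].isalpha())


toyota_forms = short_forms("Toyota Motor Corporation")
ibm = acronym("International Business Machines")
speaker_org = SPOKESMAN.search("A spokesman for eBay said the site was down.").group(1)
print(toyota_forms, ibm, speaker_org)
```

Once a full name has been seen in a text, its generated short forms can be matched later in the same text even though they are not in any name list.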

Page 89: Introducing CRL

Syntax

• Simple syntax probably sufficient before semantic (user oriented) steps

– Noun phrases

– Compound verbs

– Subordinate clauses

Page 90: Introducing CRL

General Guidelines

• The main guideline is to preserve a “reasonable” amount of ambiguity for resolution by the semantic analysis process

– Toyota – might be a product or a company

– Washington – might be a place or a person

– Taj Mahal – might be a mausoleum, or a casino

• But definite decisions should be made where possible to reduce the load on the analyzer