NLDB-04

Lightweight Natural Language Database Interfaces

Jun. 23, 2004

In-Su Kang*, Seung-Hoon Na, Jong-Hyeok Lee, Gijoo Yang

Dept. of Computer Science & Engineering

Pohang University of Science and Technology (POSTECH)

R. of KOREA

Contents

Motivations

Introduction to NLDBI

Issues & our concerns

Two motivations

Lightweight architecture

Lightweight NLDBI

Domain adaptation

Question answering

Conclusion

Introduction

Natural Language DataBase Interfaces (NLDBI)

Access database data using natural language [Androutsopoulos, 1995]

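Main components: Analysis and Translation

[Figure: NLDBI pipeline. A natural language question goes through Analysis (using linguistic knowledge) to produce a meaning representation, which goes through Translation (using translation knowledge) to produce a database query; the DBMS executes the query and returns the answer.]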

Terminology

Domain class

Refers to a table or a column

(e.g.) T_Customer, C_ID, C_Name

Domain class instance

Individual column value

(e.g.) 1034, 1035, “Bill Clinton”, “Jimmy Carter”

Class term

A lexical term referring to a domain class, such as “customer”

Value term

A lexical term indicating a domain class instance, such as “Bill”

Example table: T_Customer

C_ID    C_Name
1034    Bill Clinton
1035    Jimmy Carter
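
As a rough illustration of the four terms, the sketch below stores the example table in plain Python dictionaries. The class terms "id" and "name" and the helper `describe` are hypothetical additions for this example; the slide itself only gives "customer" and "Bill".

```python
# Domain classes (a table and its columns) with their instances (column values).
DOMAIN_CLASSES = {
    "T_Customer": [],  # a table has no column values of its own
    "C_ID": ["1034", "1035"],
    "C_Name": ["Bill Clinton", "Jimmy Carter"],
}

# Class terms: lexical terms that refer to a domain class.
# "customer" comes from the slide; "id" and "name" are assumed for illustration.
CLASS_TERMS = {"customer": "T_Customer", "id": "C_ID", "name": "C_Name"}


def describe(term):
    """Classify a term as a class term, a value term, or unknown."""
    key = term.lower()
    if key in CLASS_TERMS:
        return ("class term", CLASS_TERMS[key])
    for domain_class, instances in DOMAIN_CLASSES.items():
        for instance in instances:
            # A value term may name only part of an instance ("Bill" for "Bill Clinton").
            if key in instance.lower().split():
                return ("value term", domain_class)
    return ("unknown", None)


print(describe("customer"))
print(describe("Bill"))
```

Matching a value term against whitespace-separated tokens is only a placeholder; the n-gram indexing described on later slides is the deck's actual mechanism.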

NLDBI Issues & Our Concerns

Process

Natural language understanding

Spoken language & meaning representation

Discourse analysis & dialogue model

Database query conversion (NL → DB)

Paraphrase problem: M-to-1

Translation ambiguity problem: 1-to-N

Natural language generation

Co-operative answering

Knowledge management

Linguistic knowledge

Translation knowledge

Representation, acquisition

Domain transportability problem

Motivation 1

Previous translation knowledge acquisition

Complex translation knowledge representation

Expensive expertise required (AI/NLP/DBMS/Domain knowledge)

– (e.g.) Devise conversion rules from parse trees to database query expressions

– (e.g.) Define database relations for logical predicates

Difficulty in initial creation and scalable expansion

– Causes the domain transportability problem

No general solution

As one solution, domain-tool methods exist, but they are tightly coupled with the underlying NLDBI systems

– (e.g.) IRUS, CHAT-80, ASK, EUFID, TEAM, MASQUE, …

Our proposal

Semi-Automatic Acquisition by Simplifying Translation Knowledge Structures

Motivation 2

Translation ambiguity

Class term ambiguity

A class term refers to several domain classes

– (e.g.) ‘address’ may refer to TB_Customer.Address or TB_Employee.Address

Value term ambiguity

A value term refers to several domain class instances

– (e.g.) ‘London’ may refer to TB_Flight.departure or TB_Flight.arrival

Resolution of translation ambiguity

So far, no systematic disambiguation scheme

We propose a Noun Translation Technique based on an Information Retrieval Framework

Lightweight NLDBI Architecture

Contents

Motivations

Lightweight NLDBI

Domain adaptation

Semi-automatic acquisition of translation knowledge

Physical Entity-Relationship Schema (pER schema)

Translation knowledge structures

Translation knowledge construction

Examples

Question answering

Conclusion

Semi-Automatic Acquisition

Procedure

[Figure: DB → Reverse Engineering → physical schema → Linguistic Annotation → pER schema → Automatic Extraction → initial translation knowledge]

Linguistic annotation by domain experts (within a DB modeling tool)

Input guidelines

1. To each domain class, give a Linguistic Name (in the form of an NP)

2. Make any linguistic description (called a Domain Sentence) about or among domain classes (in the form of simple sentences)

3. In 2, an NP referring to a domain class should be either its linguistic name defined in 1, or the domain class itself

Physical Entity-Relationship (pER) Schema

pER schema = pER graph + pER description

pER graph = a physical schema

Encodes structural constraints among DB objects

– Property-of relationship between an entity and its attributes

– Semantic relationships among entities and/or attributes

pER description = linguistic annotations on a pER graph

Bridge between DB objects and natural language expressions

Translation Knowledge Structures

Class-referring info. (for paraphrase problem)

Class document for each domain class

Synonymous class terms, and their concept codes

Value document for each column

All-length n-grams / pattern-based 2-grams generated from column data

Class-constraining info. (for translation ambiguity problem)

Valency-based selection restrictions

Restrictions that domain verbs or case markers impose on domain classes

– (e.g.) order: {T_Customer, T_Product, T_Order.Date}

– (e.g.) from: {T_Flight.Departure}

– (e.g.) to: {T_Flight.Arrival}

Collocation document for each domain class

Linguistic collocations of a domain class

Translation Knowledge Construction

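[Figure: translation knowledge construction. DB column data is turned into value documents (value terms) by n-gram value indexing. The NL description (linguistic names and domain sentences) is turned into class documents and collocation documents (class terms) by class term extraction with a concept hierarchy, and into valency-based selection restrictions by syntactic analysis. Class and value documents form the class-referring information; valency-based restrictions and collocation documents form the class-constraining information.]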

Semi-Automatic Acquisition: Physical DB Schema

Physical DB schema for a university course domain

Physical database schema object    Examples of tuples
T1 (T1C1, T1C2, T1C3)              Student(1999-0011, Richard, 1999); (2001-0027, Tom, 2001)
T2 (T2C1, T2C2, T2C3)              Course(ST201A, Statistics, Richard); (ST310B, Algorithms, Joan)
T3 (T3C1, T3C2, T3C3, T3C4)        Grade(1999-0011, ST201A, 1999, A); (2001-0027, ST201A, 2003, C)

Reverse-engineering by a DB modeling tool

Semi-Automatic Acquisition: NL Descriptions

Domain experts annotate NL descriptions on the physical schema

Domain Class    Linguistic Name
T1              Student
T1C1            Student identification number
T1C2            Student name
T1C3            Entrance year
T2              Course
T2C1            Course number
T2C2            Course name
T2C3            Instructor, Professor
T3              Grade
T3C1            Student identification number
T3C2            Course number
T3C3            Grade year
T3C4            Grade

Domain Sentences

Students take courses in ‘T3C3’

Students get grades in ‘T3C3’

Students enter a school in ‘T1C3’

Instructors teach courses

Instructors give grades

Courses are open in ‘T3C3’

Semi-Automatic Acquisition: Initial Translation Knowledge

Class-Referring Translation Knowledge

Domain class: T2C2

Linguistic name ‘Course name’ → Class document T2C2C: ‘Course name’, ‘Name’

All column values (non-alphanumeric) Statistics, Algorithms → Value document T2C2V: Statistics, Algorithms

Semi-Automatic Acquisition: Initial Translation Knowledge

Class-Referring Translation Knowledge

Class and Value documents from linguistic names and DB tuples

Domain Class    Class Document                                                  Value Document
T1              Student                                                         NULL
T1C1            Student identification number, identification number, number    n4s1n4, n4, s1, n4s1, s1n4
T1C2            Student name, name                                              Richard, Tom
T1C3            Entrance year, year                                             1999, 2001
T2              Course                                                          NULL
T2C1            Course number, number                                           c2n3c1, c2, n3, c1, c2n3, n3c1
T2C2            Course name, name                                               Statistics, Algorithms
T3C4            Grade                                                           A, C
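
The class documents in the table look like every word suffix of a linguistic name that still ends in its head noun. That reading is an inference from the examples, not something the slides state. A minimal sketch under that assumption, with hypothetical helper names `class_terms`, `class_document` and `retrieve_classes`:

```python
# Linguistic names taken from the annotation table (T3 omitted for brevity).
LINGUISTIC_NAMES = {
    "T1": ["Student"],
    "T1C1": ["Student identification number"],
    "T1C2": ["Student name"],
    "T1C3": ["Entrance year"],
    "T2": ["Course"],
    "T2C1": ["Course number"],
    "T2C2": ["Course name"],
    "T2C3": ["Instructor", "Professor"],
}


def class_terms(linguistic_name):
    """Return every word suffix of the name, longest first."""
    words = linguistic_name.split()
    return [" ".join(words[i:]) for i in range(len(words))]


def class_document(names):
    """Collect the class terms of all linguistic names of one domain class."""
    terms = []
    for name in names:
        for term in class_terms(name):
            if term not in terms:
                terms.append(term)
    return terms


# Class documents carry a trailing "c", following the slides' labels (T1C2c, ...).
CLASS_DOCS = {f"{cls}c": class_document(names) for cls, names in LINGUISTIC_NAMES.items()}


def retrieve_classes(term):
    """Return the class documents that contain the term (case-insensitive)."""
    wanted = term.lower()
    return [doc for doc, terms in CLASS_DOCS.items()
            if wanted in (t.lower() for t in terms)]


print(class_terms("Student identification number"))
print(retrieve_classes("name"))
```

With these documents, the class term "name" retrieves both T1C2c and T2C2c, which is the class term ambiguity that the disambiguation step later resolves.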

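Semi-Automatic Acquisition: Initial Translation Knowledge

Class-Constraining Translation Knowledge

[Figure: two worked examples. Valency-based: the domain sentence “Students take courses in ‘T3C3’” is analyzed into take-student, take-course, take-(in) T3C3; the arguments student, course, T3C3 are mapped through the class documents to T1, T2, T3C3, giving Take: {T1, T2, T3C3}. Collocation-based: for T1C3, the linguistic name ‘Entrance year’ and the class document of T1 (‘Student’) give the collocation document T1C3: Entrance, student.]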

Semi-Automatic Acquisition: Initial Translation Knowledge

Class-Constraining Translation Knowledge

Valency-based selection restriction

Predicate or postposition    Set of Domain Classes
Take                         T1, T2, T3C3
Get                          T1, T3, T3C4, T3C3
Enter                        T1, T1C3
Teach                        T2C3, T2
Give                         T2C3, T3, T3C4
Open                         T2, T3C3
In                           T1C3, T3C3

Collocation-based selection restriction

Domain Class    Collocation Document
T1              NULL
T1C1            Student, identification
T1C2            Student
T1C3            Student, Entrance
T2              NULL
T2C1            Course
…               …
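
Both tables can be reproduced from the annotated schema with a few lines of Python. The sketch below is an illustration only: the domain sentences are parsed by hand into tuples, since the slides do not describe the parser, and the collocation rule (modifier words of the linguistic name plus the linguistic name of the parent table) is inferred from the rows shown. All function and variable names are hypothetical.

```python
# Linguistic names (lowercased); only the classes needed for the example.
LINGUISTIC_NAMES = {
    "T1": "student", "T1C1": "student identification number",
    "T1C2": "student name", "T1C3": "entrance year",
    "T2": "course", "T2C1": "course number", "T2C2": "course name",
    "T2C3": "instructor", "T3": "grade", "T3C4": "grade",
}

# Hand-parsed domain sentences: (predicate, noun arguments, [(preposition, quoted class)]).
PARSED_SENTENCES = [
    ("take", ["student", "course"], [("in", "T3C3")]),
    ("get", ["student", "grade"], [("in", "T3C3")]),
    ("enter", ["student", "school"], [("in", "T1C3")]),
    ("teach", ["instructor", "course"], []),
    ("give", ["instructor", "grade"], []),
    ("open", ["course"], [("in", "T3C3")]),
]


def classes_named(noun):
    """Domain classes whose linguistic name is exactly this noun."""
    return [cls for cls, name in LINGUISTIC_NAMES.items() if name == noun]


def add_unique(target, items):
    for item in items:
        if item not in target:
            target.append(item)


def build_valency(parsed):
    """Map each predicate or preposition to the domain classes it was seen with."""
    table = {}
    for predicate, nouns, prep_args in parsed:
        classes = table.setdefault(predicate, [])
        for noun in nouns:
            add_unique(classes, classes_named(noun))  # "school" maps to nothing
        for preposition, domain_class in prep_args:
            add_unique(classes, [domain_class])
            add_unique(table.setdefault(preposition, []), [domain_class])
    return table


def collocation_document(domain_class):
    """Modifier words of a column's name plus its table's name; empty for tables."""
    if "C" not in domain_class:
        return set()
    modifiers = LINGUISTIC_NAMES[domain_class].split()[:-1]
    table_name = LINGUISTIC_NAMES[domain_class.split("C")[0]]
    return set(modifiers) | {table_name}


VALENCY = build_valency(PARSED_SENTENCES)
print(VALENCY["get"])
print(collocation_document("T1C3"))
```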

Semi-Automatic Acquisition: Expansion of Initial Translation Knowledge

Instructor, professor

Initial Class Document: T2C3

Instructor, teacher0, educator1, pedagogue1, professional2, professional_person2, adult3, grownup3, person4, individual4, someone4, somebody4, mortal4, human4, soul4

Professor, academician1, academic1, faculty_member1, educator2, pedagogue2, professional3, professional_person3, adult4, grownup4, person5, individual5, someone5, somebody5, mortal5, human5, soul5

Extended Class Document: T2C3

Instructor, …

Educator, …

Adult, …

Person, …

Paraphrase expansion by WordNet
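
The expansion can be pictured as merging hypernym chains into one extended class document. In the sketch below the chains are copied by hand from the slide rather than queried from WordNet, and the trailing digits are read as hypernym levels (0 for the term's own synset); that reading, the function name `expand`, and the choice to keep the smallest level per word are assumptions.

```python
# Hypernym chains copied from the slide; index = hypernym level.
HYPERNYM_CHAINS = {
    "instructor": [
        ["instructor", "teacher"],
        ["educator", "pedagogue"],
        ["professional", "professional_person"],
        ["adult", "grownup"],
        ["person", "individual", "someone", "somebody", "mortal", "human", "soul"],
    ],
    "professor": [
        ["professor"],
        ["academician", "academic", "faculty_member"],
        ["educator", "pedagogue"],
        ["professional", "professional_person"],
        ["adult", "grownup"],
        ["person", "individual", "someone", "somebody", "mortal", "human", "soul"],
    ],
}


def expand(class_terms):
    """Return {word: smallest hypernym level} over all given class terms."""
    expanded = {}
    for term in class_terms:
        for level, synset in enumerate(HYPERNYM_CHAINS[term]):
            for word in synset:
                expanded[word] = min(level, expanded.get(word, level))
    return expanded


extended_doc = expand(["instructor", "professor"])
print(extended_doc["teacher"], extended_doc["educator"], extended_doc["person"])
```

With the extended document, a question noun such as "teacher" can match the same domain class as "instructor" even though the annotator never wrote it.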

Contents

Motivations

Lightweight NLDBI

Domain adaptation

Question answering

Question analysis

Noun translation

– Class retrieval

– Class disambiguation

Query graph & SQL generation

Conclusion

Question Analysis & Noun Translation

Question analysis by parsing

A set of question nouns

Each noun has features: question focus, value operator, etc.

A set of predicate-argument (P-A) pairs

Noun translation (or Domain class tagging)

Given a question noun, find the most probable domain class

Class retrieval

Retrieve candidate domain classes for each question noun

Lexically or conceptually equivalent domain classes

Class disambiguation

Select the most likely domain class

Question Analysis & Noun Translation

Question: “Show me the names of students who got A in statistics from 1999”

Question Analysis

Question Noun: Name, Student, A, Statistics, 1999

Head Verb: Show, Get, Get, Get, Get

Question Focus: Yes, No, No, No, No

Value Operator: =, =, =, >=

Noun Translation

Question Noun    Relevant Domain Classes    Disambiguated Domain Classes
Name             T1C2c, T2C2c               T1C2c
Student          T1c                        T1c
A                T3C4v                      T3C4v
Statistics       T2C2v                      T2C2v
1999             T1C3v, T3C3v               T3C3v

Class Retrieval

Information Retrieval (IR) framework

Translation knowledge → a target document collection

Class/value/collocation documents, valency-based selection restrictions

A question noun → an IR query

Class term → a surface word form & concept codes

– (e.g.) ‘customer’, ‘product’

Linguistic value term → all-length n-grams (for Korean)

– (e.g.) ‘Bill’, ‘Bush’

Alphanumeric value term → pattern-based 2-grams

– C1: 1-byte character, C2: 2-byte character, N: decimal, S: special character
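
To make the "all-length n-grams" idea concrete, the sketch below indexes a few column values from the university example and retrieves value documents for a query term. It is an illustration under assumptions: the slides do not say how a query's n-grams are matched against the index, so here a class is retrieved when the whole (lowercased) query term occurs as one of its indexed n-grams. The function names are hypothetical.

```python
def all_length_ngrams(text):
    """Every contiguous character substring of text, shortest first, no duplicates."""
    grams = {}
    for n in range(1, len(text) + 1):
        for start in range(len(text) - n + 1):
            grams.setdefault(text[start:start + n], None)
    return list(grams)


# Column values from the example tuples (linguistic, non-alphanumeric columns only).
COLUMN_VALUES = {
    "T1C2": ["Richard", "Tom"],
    "T2C2": ["Statistics", "Algorithms"],
    "T2C3": ["Richard", "Joan"],
}

# Value documents carry a trailing "v", following the slides' labels (T2C2v, ...).
VALUE_DOCS = {
    f"{cls}v": {gram for value in values for gram in all_length_ngrams(value.lower())}
    for cls, values in COLUMN_VALUES.items()
}


def retrieve_value_classes(term):
    """Value documents whose index contains the whole query term."""
    wanted = term.lower()
    return sorted(doc for doc, grams in VALUE_DOCS.items() if wanted in grams)


print(all_length_ngrams("Bill"))
print(retrieve_value_classes("Statistics"))
print(retrieve_value_classes("Richard"))
```

Because every substring is indexed, a partial mention such as "Stat" still reaches T2C2v, and "Richard" retrieves two candidates, which is exactly the value term ambiguity the next step has to resolve.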

Class Disambiguation

Definition of a class retrieval function

Notation: $R_C(t)$ denotes the set of domain classes retrieved from a document collection $C$ using a query term $t$

$R_{ref}(t)$: retrieves from ref (the set of class/value documents)

$R_{val}(t)$: retrieves from val (valency-based constraints)

– Valency-based constraints are treated as documents

$R_{col}(t)$: retrieves from col (collocation documents)

Class disambiguation by Boolean retrieval model

Valency-based: $R_{ref}(t) \cap R_{val}(head(t))$, where $head(t)$ is the head verb of $t$

Collocation-based: $R_{ref}(t) \cap R_{col}(adjacent(t))$, where $adjacent(t)$ is a word adjacent to $t$

Class Retrieval & Class Disambiguation

Value term ambiguity

Q: Show me the names of students who got A in statistics from 1999

Question noun: ‘1999’

Relevant domain classes (from class/value documents): {T1C3v, T3C3v}

Head verb of ‘1999’: ‘Get’

Valency-based constraint (from valency-based constraints): Get: {T1, T3, T3C4, T3C3}

Disambiguation: $R_{ref}(\text{‘1999’}) \cap R_{val}(head(\text{‘1999’})) = \{T3C3v\}$

Class Retrieval & Class Disambiguation

Class term ambiguity

Q: Show me the names of students who got A in statistics from 1999

Question noun: ‘Name’

Relevant domain classes (from class/value documents): {T1C2c, T2C2c}

Adjacent word of ‘Name’: ‘Student’

Collocation-based constraint (from collocation documents): {T1C1, T1C2, T1C3}

Disambiguation: $R_{ref}(\text{‘Name’}) \cap R_{col}(adjacent(\text{‘Name’})) = \{T1C2c\}$
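
The two worked examples reduce to a set intersection once the retrieval results are known. A minimal sketch, assuming the retrieved sets are given as on the slides and that the trailing c/v on a label only marks whether the hit came from a class or a value document; `disambiguate` and `base` are hypothetical names:

```python
# Retrieval results as listed on the slides.
R_REF = {"1999": {"T1C3v", "T3C3v"}, "name": {"T1C2c", "T2C2c"}}
R_VAL = {"get": {"T1", "T3", "T3C4", "T3C3"}}
R_COL = {"student": {"T1C1", "T1C2", "T1C3"}}


def base(label):
    """Strip the document-type suffix: 'T3C3v' -> 'T3C3'."""
    return label[:-1]


def disambiguate(candidates, constraint):
    """Boolean AND: keep candidates whose domain class satisfies the constraint."""
    return {label for label in candidates if base(label) in constraint}


# Valency-based: '1999' depends on the head verb 'get'.
print(disambiguate(R_REF["1999"], R_VAL["get"]))
# Collocation-based: 'name' is adjacent to 'student'.
print(disambiguate(R_REF["name"], R_COL["student"]))
```

The slides do not say what happens when the intersection is empty, so the sketch simply returns the empty set in that case.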

Query Graph & SQL Generation

Query graph

A minimal connected sub-graph

Each node is the disambiguated domain class of a question noun

The query graph is located in the physical schema graph using Meng’s method (Meng et al. 1999)

SQL generation from a query graph

Entity nodes → SQL-FROM

Arcs between entity nodes → join operations in SQL-WHERE

From question analysis

Domain class having the question focus feature → SQL-SELECT

Domain class having a value operator feature → SQL-WHERE

Query Graph & SQL Generation

SELECT T1C2
FROM T1, T2, T3
WHERE T1.T1C1 = T3.T3C1
  and T2.T2C1 = T3.T3C2
  and T2C2 = 'Statistics'
  and T3C4 = 'A'
  and T3C3 >= 1999

Question Noun: Name, Student, A, Statistics, 1999

Question Focus: Yes, No, No, No, No

Value Operator: =, =, =, >=

Domain Class: T1C2c, T1c, T3C4v, T2C2v, T3C3v
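
The mapping rules can be checked end to end against the sample tuples. The sketch below is only an illustration: the query graph (tables and join arcs) is given by hand rather than located with Meng's method, value operators are attached to the three value terms only, and the helper names are hypothetical. It uses Python's built-in `sqlite3` with an in-memory database.

```python
import sqlite3

# Disambiguated question nouns: (domain class, question focus, operator, value).
NOUNS = [
    ("T1C2", True, None, None),
    ("T1", False, None, None),
    ("T3C4", False, "=", "A"),
    ("T2C2", False, "=", "Statistics"),
    ("T3C3", False, ">=", 1999),
]

# Query graph, assumed to be already located in the physical schema graph.
TABLES = ["T1", "T2", "T3"]
ARCS = ["T1.T1C1 = T3.T3C1", "T2.T2C1 = T3.T3C2"]


def qualified(column):
    """'T3C4' -> 'T3.T3C4' (the table name is the part before the first 'C')."""
    return f"{column.split('C')[0]}.{column}"


def generate_sql(nouns, tables, arcs):
    """Focus -> SELECT, entity nodes -> FROM, arcs and value operators -> WHERE."""
    select = [qualified(cls) for cls, focus, _, _ in nouns if focus]
    conditions = list(arcs)
    params = []
    for cls, _, operator, value in nouns:
        if operator is not None:
            conditions.append(f"{qualified(cls)} {operator} ?")
            params.append(value)
    sql = (f"SELECT {', '.join(select)} FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(conditions)}")
    return sql, params


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (T1C1, T1C2, T1C3);
CREATE TABLE T2 (T2C1, T2C2, T2C3);
CREATE TABLE T3 (T3C1, T3C2, T3C3, T3C4);
INSERT INTO T1 VALUES ('1999-0011', 'Richard', 1999), ('2001-0027', 'Tom', 2001);
INSERT INTO T2 VALUES ('ST201A', 'Statistics', 'Richard'), ('ST310B', 'Algorithms', 'Joan');
INSERT INTO T3 VALUES ('1999-0011', 'ST201A', 1999, 'A'), ('2001-0027', 'ST201A', 2003, 'C');
""")

sql, params = generate_sql(NOUNS, TABLES, ARCS)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

On the two sample students, only Richard has an A in Statistics with a grade year of 1999 or later, so the query returns a single row.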

Conclusion

Lightweight NLDBI

Domain adaptation (to deal with the paraphrase problem)

Simplification of translation knowledge in the form of documents

Semi-automatic construction of translation knowledge

Expansion of translation knowledge by dictionary

Question answering (to resolve translation ambiguities)

Noun translation technique based on an IR framework

– Class retrieval

– Class disambiguation

Semi-Automatic Acquisition: Initial Translation Knowledge

Class-Referring Translation Knowledge

Domain class: T1C1

Linguistic name ‘Student identification number’ → Class document T1C1C: ‘Student identification number’, ‘Identification number’, ‘Number’

All column values (alphanumeric) 1999-0011, 2001-0027 → pattern n4s1n4 → n-grams → Value document T1C1V: n4s1n4, n4, s1, n4s1, s1n4

Pattern symbols: 1-byte char: C1, 2-byte char: C2, special char: S, decimal: N
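
The pattern encoding in this example can be sketched as a run-length encoding of character classes followed by n-gram extraction over the encoded units. The sketch is an assumption-laden reading of the slide: it uses the lowercase symbols seen in the examples (n for digits, c for 1-byte letters, s for everything else), ignores 2-byte characters, and generates n-grams of every length because that is what reproduces the listed value documents. The function names are hypothetical.

```python
from itertools import groupby


def char_class(ch):
    """Map a character to its pattern symbol."""
    if ch.isascii() and ch.isdigit():
        return "n"  # decimal
    if ch.isascii() and ch.isalpha():
        return "c"  # 1-byte character
    return "s"      # special character (2-byte characters are not handled here)


def pattern_units(value):
    """Run-length encode character classes: '1999-0011' -> ['n4', 's1', 'n4']."""
    return [f"{symbol}{len(list(run))}" for symbol, run in groupby(value, key=char_class)]


def pattern_ngrams(value):
    """All contiguous unit n-grams of the pattern, shortest first, no duplicates."""
    units = pattern_units(value)
    grams = []
    for n in range(1, len(units) + 1):
        for start in range(len(units) - n + 1):
            gram = "".join(units[start:start + n])
            if gram not in grams:
                grams.append(gram)
    return grams


print(pattern_ngrams("1999-0011"))
print(pattern_ngrams("ST201A"))
```

Both student numbers on the slide share the pattern n4s1n4, so a single value document covers the whole column however many rows it has.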