NLDB-04
Lightweight Natural Language Database Interfaces
Jun. 23, 2004
In-Su Kang*, Seung-Hoon Na, Jong-Hyeok Lee, Gijoo Yang
Dept. of Computer Science & Engineering
Pohang University of Science and Technology (POSTECH)
R. of KOREA
Contents
Motivations
Introduction to NLDBI
Issues & our concerns
Two motivations
Lightweight architecture
Lightweight NLDBI
Domain adaptation
Question answering
Conclusion
Introduction
Natural Language DataBase Interfaces (NLDBI)
Access database data in natural languages [Androutsopoulos,1995]
Main components
Natural Language Question → Analysis (using Linguistic Knowledge) → Meaning Representation → Translation (using Translation Knowledge) → Database Query → DBMS → Answer
Terminology
Domain class
Refers to a table or a column
(e.g.) T_Customer, C_ID, C_Name
Domain class instance
Individual column value
(e.g.) 1034, 1035, “Bill Clinton”, “Jimmy Carter”
Class term
A lexical term referring to a domain class, such as “customer”
Value term
A lexical term indicating a domain class instance, such as “Bill”
Example table: T_Customer
C_ID | C_Name
1034 | Bill Clinton
1035 | Jimmy Carter
NLDBI Issues & Our Concerns
Process
Natural language understanding
Spoken language & meaning representation
Discourse analysis & dialogue model
Database query conversion (NL → DB)
Paraphrase problem : M-to-1
Translation ambiguity problem: 1-to-N
Natural language generation
Co-operative answering
Knowledge management
Linguistic knowledge
Translation knowledge
Representation, acquisition
Domain transportability problem
Motivation 1
Previous translation knowledge acquisition
Complex translation knowledge representation
Expensive expertise required (AI/NLP/DBMS/Domain knowledge)
– (e.g.) Devise conversion rules from parse trees to database query expressions
– (e.g.) Define database relations for logical predicates
Difficulty in initial creation and scalable expansion
– Cause domain transportability problem
No general solution
Domain tools are one solution, but they are tightly coupled with the underlying NLDBI systems
– (e.g.) IRUS, CHAT-80, ASK, EUFID, TEAM, MASQUE, …
Our proposal
Semi-Automatic Acquisition by Simplifying Translation Knowledge Structures
Motivation 2
Translation ambiguity
Class term ambiguity
A class term refers to several domain classes
– ‘address’ → TB_Customer.Address or TB_Employee.Address
Value term ambiguity
A value term refers to several domain class instances
– ‘London’ → TB_Flight.departure or TB_Flight.arrival
Resolution of translation ambiguity
So far, no systematic disambiguation scheme
We propose a Noun Translation Technique based on an Information Retrieval Framework
Contents
Motivations
Lightweight NLDBI
Domain adaptation
Semi-automatic acquisition of translation knowledge
Physical Entity-Relationship Schema (pER schema)
Translation knowledge structures
Translation knowledge construction
Examples
Question answering
Conclusion
Semi-Automatic Acquisition
Procedure
DB → (reverse engineering) → physical schema → (linguistic annotation) → pER schema → (automatic extraction) → initial translation knowledge
Linguistic annotation by domain experts
Within a DB modeling tool
Input guidelines
1. To each domain class, give a Linguistic Name (in the form of an NP)
2. Make any linguistic description (called a Domain Sentence) about or among domain classes (in the form of simple sentences)
3. In 2, an NP referring to a domain class should be either its linguistic name defined in 1, or a domain class itself
Physical Entity-Relationship (pER) Schema
pER schema = pER graph + pER description
pER graph = a physical schema
Encode structural constraints among DB objects
– Property-of b/w an entity and its attributes
– Semantic relationship among entities and/or attributes
pER description = linguistic annotations on a pER graph
Bridge b/w DB objects and natural language expressions
Translation Knowledge Structures
Class-referring info. (for paraphrase problem)
Class document for each domain class
Synonymous class terms, and their concept codes
Value document for each column
All-length n-grams / pattern-based 2-grams generated from column data
Class-constraining info. (for translation ambiguity problem)
Valency-based selection restrictions
Restrictions that domain verbs or case markers impose on domain classes
– order → {T_Customer, T_Product, T_Order.Date}
– from → {T_Flight.Departure}
– to → {T_Flight.Arrival}
Collocation document for each domain class
Linguistic collocations of a domain class
Translation Knowledge Construction
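[Diagram: DB column data → n-gram value indexing → value documents (value terms). NL description (linguistic names, domain sentences) → class term extraction, with a concept hierarchy, and syntactic analysis → class documents (class terms), collocation documents, and valency-based selection restrictions. Class and value documents form the class-referring information; collocation documents and valency-based restrictions form the class-constraining information.]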
Semi-Automatic Acquisition: Physical DB Schema
Physical DB schema for a university course domain (reverse-engineered by a DB modeling tool)
Physical Database Schema Object | Examples of Tuples
T1 (T1C1, T1C2, T1C3) | Student(1999-0011, Richard, 1999), (2001-0027, Tom, 2001)
T2 (T2C1, T2C2, T2C3) | Course(ST201A, Statistics, Richard), (ST310B, Algorithms, Joan)
T3 (T3C1, T3C2, T3C3, T3C4) | Grade(1999-0011, ST201A, 1999, A), (2001-0027, ST201A, 2003, C)
Semi-Automatic Acquisition: NL Descriptions
Domain experts annotate NL descriptions on the physical schema
Domain Class | Linguistic Name
T1 | Student
T1C1 | Student identification number
T1C2 | Student name
T1C3 | Entrance year
T2 | Course
T2C1 | Course number
T2C2 | Course name
T2C3 | Instructor, Professor
T3 | Grade
T3C1 | Student identification number
T3C2 | Course number
T3C3 | Grade year
T3C4 | Grade
Domain Sentences
– Students take courses in ‘T3C3’
– Students get grades in ‘T3C3’
– Students enter a school in ‘T1C3’
– Instructors teach courses
– Instructors give grades
– Courses are open in ‘T3C3’
Semi-Automatic Acquisition: Initial Translation Knowledge
Class-Referring Translation Knowledge
Domain class: T2C2
Linguistic name ‘Course name’ → class document T2C2C: ‘Course name’, ‘Name’
All column values (non-alphanumeric): Statistics, Algorithms → value document T2C2V: Statistics, Algorithms
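The class documents shown on these slides appear to contain each linguistic name together with its shorter word suffixes (‘Course name’ gives ‘Course name’ and ‘name’). The following is a minimal sketch of that pattern, assuming simple whitespace tokenization; the function names `class_terms` and `build_class_document` are hypothetical and not taken from the slides.

```python
def class_terms(linguistic_name: str) -> list[str]:
    """Return the linguistic name and each shorter word suffix, longest first."""
    words = linguistic_name.split()
    return [" ".join(words[i:]) for i in range(len(words))]


def build_class_document(linguistic_names: list[str]) -> list[str]:
    """Collect the class terms of all linguistic names of one domain class."""
    document: list[str] = []
    for name in linguistic_names:
        for term in class_terms(name):
            if term not in document:  # keep first occurrence only
                document.append(term)
    return document


print(class_terms("Student identification number"))
print(build_class_document(["Instructor", "Professor"]))
```

Suffixes are used here because the head noun of an English noun phrase is normally its last word, so ‘number’ alone can still refer to ‘Student identification number’. This is also why such terms are ambiguous across classes and need the disambiguation step.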
Semi-Automatic Acquisition: Initial Translation Knowledge
Class-Referring Translation Knowledge
Class and value documents from linguistic names and DB tuples
Domain Class | Class Document | Value Document
T1 | Student | NULL
T1C1 | Student identification number, identification number, number | n4s1n4, n4, s1, n4s1, s1n4
T1C2 | Student name, name | Richard, Tom
T1C3 | Entrance year, year | 1999, 2003
T2 | Course | NULL
T2C1 | Course number, number | c2n3c1, c2, n3, c1, c2n3, n3c1
T2C2 | Course name, name | Statistics, Algorithms
… | … | …
T3C4 | Grade | A, C
Semi-Automatic Acquisition: Initial Translation Knowledge
Class-Constraining Translation Knowledge
Valency-based selection restrictions, from all domain sentences
– “Students take courses in ‘T3C3’” → take-student, take-course, take-(in) T3C3
– Arguments student, course, T3C3 → (via class documents) T1, T2, T3C3
– Take: {T1, T2, T3C3}
Collocation documents
– Linguistic name of T1C3: ‘Entrance year’
– Class document of T1: ‘Student’
– Collocation document of T1C3: Entrance, student
Semi-Automatic Acquisition: Initial Translation Knowledge
Class-Constraining Translation Knowledge
Valency-based selection restriction
Predicate or postposition | Set of Domain Classes
Take | T1, T2, T3C3
Get | T1, T3, T3C4, T3C3
Enter | T1, T1C3
Teach | T2C3, T2
Give | T2C3, T3, T3C4
Open | T2, T3C3
In | T1C3, T3C3
Collocation-based selection restriction
Domain Class | Collocation Document
T1 | NULL
T1C1 | Student, identification
T1C2 | Student
T1C3 | Student, Entrance
T2 | NULL
T2C1 | Course
… | …
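The valency-based table can be reproduced from the domain sentences once each sentence has been parsed. The sketch below assumes the parse is already available as hand-written tuples (the real system uses syntactic analysis, which is not shown), and assumes a noun maps to every domain class that has it as a full linguistic name. The names `LINGUISTIC_NAMES`, `DOMAIN_SENTENCES`, `classes_named`, and `valency_restrictions` are hypothetical.

```python
from collections import defaultdict

# Linguistic names from the university course example (lowercased, singular).
LINGUISTIC_NAMES = {
    "T1": ["student"],
    "T1C1": ["student identification number"],
    "T1C2": ["student name"],
    "T1C3": ["entrance year"],
    "T2": ["course"],
    "T2C1": ["course number"],
    "T2C2": ["course name"],
    "T2C3": ["instructor", "professor"],
    "T3": ["grade"],
    "T3C1": ["student identification number"],
    "T3C2": ["course number"],
    "T3C3": ["grade year"],
    "T3C4": ["grade"],
}

# Assumed parser output: (verb, noun arguments, class quoted in the "in" phrase).
DOMAIN_SENTENCES = [
    ("take", ["student", "course"], "T3C3"),
    ("get", ["student", "grade"], "T3C3"),
    ("enter", ["student", "school"], "T1C3"),
    ("teach", ["instructor", "course"], None),
    ("give", ["instructor", "grade"], None),
    ("open", ["course"], "T3C3"),
]


def classes_named(noun: str) -> list[str]:
    """Domain classes whose linguistic names include the noun exactly."""
    return [cls for cls, names in LINGUISTIC_NAMES.items() if noun in names]


def valency_restrictions(sentences):
    """Map each predicate or postposition to the domain classes it governs."""
    restrictions = defaultdict(set)
    for verb, nouns, in_class in sentences:
        for noun in nouns:
            restrictions[verb].update(classes_named(noun))
        if in_class is not None:
            restrictions[verb].add(in_class)
            restrictions["in"].add(in_class)
    return dict(restrictions)


for predicate, classes in valency_restrictions(DOMAIN_SENTENCES).items():
    print(predicate, sorted(classes))
```

Nouns that name no domain class, such as ‘school’, simply contribute nothing, which is consistent with the ‘Enter’ row of the table.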
Semi-Automatic Acquisition: Expansion of Initial Translation Knowledge
Paraphrase expansion by WordNet
Initial class document (T2C3): Instructor, Professor
WordNet entries
– Instructor, teacher0, educator1, pedagogue1, professional2, professional_person2, adult3, grownup3, person4, individual4, someone4, somebody4, mortal4, human4, soul4
– Professor, academician1, academic1, faculty_member1, educator2, pedagogue2, professional3, professional_person3, adult4, grownup4, person5, individual5, someone5, somebody5, mortal5, human5, soul5
Extended class document (T2C3)
– Instructor, …
– Educator, …
– Adult, …
– Person, …
Contents
Motivations
Lightweight NLDBI
Domain adaptation
Question answering
Question analysis
Noun translation
– Class retrieval
– Class disambiguation
Query graph & SQL generation
Conclusion
Question Analysis & Noun Translation
Question analysis by parsing
A set of question nouns
Each noun has features: question focus, value operator, etc.
A set of predicate-argument (P-A) pairs
Noun translation (or Domain class tagging)
Given a question noun, find the most probable domain class
Class retrieval
Retrieve candidate domain classes for each question noun
Lexically or conceptually equivalent domain classes
Class disambiguation
Select the most likely domain class
Question Analysis & Noun Translation
Question: “Show me the names of students who got A in statistics from 1999”
Question analysis (noun, head verb, question focus, value operator) and noun translation (relevant and disambiguated domain classes)
Question Noun | Head Verb | Question Focus | Value Operator | Relevant Domain Classes | Disambiguated Domain Classes
Name | Show | Yes | | T1C2c, T2C2c | T1C2c
Student | Get | No | = | T1c | T1c
A | Get | No | = | T3C4v | T3C4v
Statistics | Get | No | = | T2C2v | T2C2v
1999 | Get | No | >= | T1C3v, T3C3v | T3C3v
Class Retrieval
Information Retrieval (IR) framework
Translation knowledge → a target document collection
Class/value/collocation documents, valency-based selection restrictions
A question noun → an IR query
Class term → a surface word form & concept codes
– ‘customer’, ‘product’
Linguistic value term → all-length n-grams for Korean
– ‘Bill’, ‘Bush’
Alphanumeric value term → pattern-based 2-grams
– C1: 1-byte character, C2: 2-byte character, N: decimal, S: special character
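All-length n-gram indexing of linguistic value terms can be sketched as below: every substring of a column value becomes an index term, so a partial mention of a value still retrieves its column. This is a minimal illustration using values from the university course example, not the authors' indexer; `all_length_ngrams`, `build_value_documents`, and `retrieve` are hypothetical names, and lowercasing is an added assumption.

```python
def all_length_ngrams(text: str) -> list[str]:
    """All character n-grams of every length from 1 to len(text)."""
    n = len(text)
    return [text[i:i + k] for k in range(1, n + 1) for i in range(n - k + 1)]


def build_value_documents(columns: dict[str, list[str]]) -> dict[str, set[str]]:
    """One value document (a set of n-grams) per column."""
    return {
        cls: {gram for value in values for gram in all_length_ngrams(value.lower())}
        for cls, values in columns.items()
    }


def retrieve(term: str, value_documents: dict[str, set[str]]) -> set[str]:
    """Columns whose value document contains the query term."""
    return {cls for cls, grams in value_documents.items() if term.lower() in grams}


VALUE_COLUMNS = {
    "T1C2": ["Richard", "Tom"],
    "T2C2": ["Statistics", "Algorithms"],
}
documents = build_value_documents(VALUE_COLUMNS)
print(all_length_ngrams("Bill"))
print(retrieve("Stat", documents))
```

Indexing every substring trades index size for recall, which may matter for large columns; the slides do not say how this is bounded.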
Class Disambiguation
Definition of a class retrieval function
Notation: R_C(t) means the set of domain classes retrieved from a document collection C using a query term t
R_ref(t): retrieves from ref (a set of class/value documents)
R_val(t): retrieves from val (valency-based constraints)
– Consider valency-based constraints as documents
R_col(t): retrieves from col (collocation-based documents)
Class disambiguation by Boolean retrieval model
Valency-based
R_ref(t) ∩ R_val(head(t))
Collocation-based
R_ref(t) ∩ R_col(adjacent(t))
Class Retrieval & Class Disambiguation
Q: Show me the names of students who got A in statistics from 1999
Value term ambiguity
Question noun: ‘1999’
Relevant domain classes (from class/value documents): {T1C3v, T3C3v}
Head verb of ‘1999’: ‘Get’
Valency-based constraint (from valency-based constraints): Get: {T1, T3, T3C4, T3C3}
Disambiguation: R_ref(‘1999’) ∩ R_val(head(‘1999’)) = {T3C3v}
Class Retrieval & Class Disambiguation
Q: Show me the names of students who got A in statistics from 1999
Class term ambiguity
Question noun: ‘Name’
Relevant domain classes (from class/value documents): {T1C2c, T2C2c}
Adjacent word of ‘Name’: ‘Student’
Collocation-based constraint (from collocation documents): {T1C1, T1C2, T1C3}
Disambiguation: R_ref(‘Name’) ∩ R_col(adjacent(‘Name’)) = {T1C2c}
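Both worked examples reduce to a set intersection between the candidate classes of a noun and the classes allowed by its head verb or adjacent word. The sketch below hard-codes the retrieval results from the slides rather than running a real retrieval step. Representing a hit as a (class, document kind) pair, so that T3C3v matches the constraint entry T3C3, is an assumption about how the c/v suffixes are compared.

```python
# Retrieval results copied from the worked examples; "c" = class document hit,
# "v" = value document hit.
R_REF = {
    "1999": {("T1C3", "v"), ("T3C3", "v")},
    "name": {("T1C2", "c"), ("T2C2", "c")},
}
R_VAL = {"get": {"T1", "T3", "T3C4", "T3C3"}}      # valency-based constraints
R_COL = {"student": {"T1C1", "T1C2", "T1C3"}}      # collocation-based constraints


def disambiguate(candidates: set[tuple[str, str]], constraint: set[str]) -> set[tuple[str, str]]:
    """Boolean AND: keep candidates whose domain class satisfies the constraint."""
    return {(cls, kind) for cls, kind in candidates if cls in constraint}


# Value term ambiguity: R_ref('1999') AND R_val(head('1999'))
print(disambiguate(R_REF["1999"], R_VAL["get"]))
# Class term ambiguity: R_ref('Name') AND R_col(adjacent('Name'))
print(disambiguate(R_REF["name"], R_COL["student"]))
```

A pure Boolean model can return an empty set or more than one class; the slides do not describe a fallback for those cases, so none is sketched here.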
Query Graph & SQL Generation
Query graph
A minimal connected sub-graph
A node is a disambiguated domain class for each question noun
The query graph is located in the physical schema graph using the method of Meng et al. (1999)
SQL generation from a query graph
Entity nodes → SQL-FROM
Arcs b/w entity nodes → Join operations in SQL-WHERE
From question analysis
Domain class having question focus feature → SQL-SELECT
Domain class having value operator feature → SQL-WHERE
Query Graph & SQL Generation
SELECT T1C2
FROM T1, T2, T3
WHERE T1.T1C1 = T3.T3C1
  and T2.T2C1 = T3.T3C2
  and T2C2 = 'Statistics'
  and T3C4 = 'A'
  and T3C3 >= 1999

Question Noun | Question Focus | Value Operator | Domain Class
Name | Yes | | T1C2c
Student | No | = | T1c
A | No | = | T3C4v
Statistics | No | = | T2C2v
1999 | No | >= | T3C3v
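The mapping rules for SQL generation can be sketched as follows, using the disambiguated classes of the example question (A → T3C4, Statistics → T2C2, 1999 → T3C3). This is a simplified illustration: it assumes the entities of the query nodes are already connected by the listed join arcs, and it does not implement the minimal connected sub-graph search of Meng et al. The names `Node`, `JOIN_ARCS`, `entity_of`, `qualified`, and `generate_sql` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    domain_class: str                 # table ("T1") or column ("T1C2")
    focus: bool = False               # question focus feature
    operator: Optional[str] = None    # value operator feature
    value: Optional[str] = None       # SQL literal of the value term


# Join arcs between entity nodes, taken from the example query.
JOIN_ARCS = [("T1.T1C1", "T3.T3C1"), ("T2.T2C1", "T3.T3C2")]


def entity_of(domain_class: str) -> str:
    """'T1C2' -> 'T1'; a table name such as 'T1' is returned unchanged."""
    return domain_class.split("C")[0]


def qualified(domain_class: str) -> str:
    return f"{entity_of(domain_class)}.{domain_class}"


def generate_sql(nodes: list[Node], join_arcs: list[tuple[str, str]]) -> str:
    entities = sorted({entity_of(n.domain_class) for n in nodes})     # SQL-FROM
    select = [qualified(n.domain_class) for n in nodes if n.focus]    # SQL-SELECT
    where = [f"{a} = {b}" for a, b in join_arcs                       # joins
             if a.split(".")[0] in entities and b.split(".")[0] in entities]
    where += [f"{qualified(n.domain_class)} {n.operator} {n.value}"   # conditions
              for n in nodes if n.operator and n.value is not None]
    return (f"SELECT {', '.join(select)} FROM {', '.join(entities)} "
            f"WHERE {' AND '.join(where)}")


nodes = [
    Node("T1C2", focus=True),                          # Name
    Node("T1"),                                        # Student: class term, no value
    Node("T3C4", operator="=", value="'A'"),           # A
    Node("T2C2", operator="=", value="'Statistics'"),  # Statistics
    Node("T3C3", operator=">=", value="1999"),         # 1999
]
print(generate_sql(nodes, JOIN_ARCS))
```

Column names are table-qualified here for clarity; string values are interpolated directly only because this is an illustration, and a real implementation would use bound parameters.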
Conclusion
Lightweight NLDBI
Domain adaptation (to deal with a paraphrase problem)
Simplification of translation knowledge in the form of documents
Semi-automatic construction of translation knowledge
Expansion of translation knowledge by dictionary
Question answering (to resolve translation ambiguities)
Noun translation technique based on an IR framework
– Class retrieval
– Class disambiguation
Semi-Automatic Acquisition: Initial Translation Knowledge
Class-Referring Translation Knowledge
Domain class: T1C1
Linguistic name ‘Student identification number’ → class document T1C1C: ‘Student identification number’, ‘Identification number’, ‘Number’
All column values (alphanumeric): 1999-0011, 2001-0027 → pattern n4s1n4 → n-grams → value document T1C1V: n4s1n4, n4, s1, n4s1, s1n4
Pattern symbols: 1-byte char C, 2-byte char C, special char S, decimal N
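The pattern-based n-grams for alphanumeric values can be sketched as a run-length encoding of character types followed by unit and pair extraction. The sketch below reproduces the two value documents shown in the slides (‘1999-0011’ and ‘ST201A’). As a simplification it treats all letters as one type and does not distinguish 1-byte from 2-byte characters; `char_type`, `pattern_units`, and `pattern_ngrams` are hypothetical names.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """'n' for a decimal digit, 'c' for a letter, 's' for anything else."""
    if ch.isdigit():
        return "n"
    if ch.isalpha():
        return "c"
    return "s"


def pattern_units(value: str) -> list[str]:
    """Run-length encode character types: '1999-0011' -> ['n4', 's1', 'n4']."""
    return [f"{kind}{len(list(run))}" for kind, run in groupby(value, key=char_type)]


def pattern_ngrams(value: str) -> list[str]:
    """Full pattern, then single units, then adjacent unit pairs, without repeats."""
    units = pattern_units(value)
    grams = ["".join(units)]
    grams += units
    grams += [units[i] + units[i + 1] for i in range(len(units) - 1)]
    return list(dict.fromkeys(grams))  # de-duplicate, keep first-seen order


print(pattern_ngrams("1999-0011"))
print(pattern_ngrams("ST201A"))
```

Indexing the pattern rather than the literal value lets an unseen identifier such as a new student number retrieve the right column, because only its shape has to match.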