
    NANYANG TECHNOLOGICAL UNIVERSITY

    FINAL YEAR PROJECT

    NLP Search Engine

    Author:

THAI BINH DUONG

    Supervisor:

PROF. CHAN CHEE KEONG

    Examiner:

PROF. CHONG YONG KIM

    April 23, 2011


    Abstract

Even though many natural language systems have been developed successfully and commercialized, none has yet proved versatile enough for a wide variety of tasks. One exception is probably IBM's Watson, which during the course of this project won against two human champions in a Jeopardy! contest and showed for the first time that full-scale interaction and reasoning in natural language were finally within the reach of modern technology. In this project, a user input query in natural language form is analyzed and presented in FOPC (First Order Predicate Calculus), a representation suitable as input for higher-layer tasks such as logic. In the process, referents or implications from the question-answer pair may also be deduced.


    Acknowledgment

I owe my deepest gratitude to my supervisor, professor Chan Chee Keong, who is considerate and cheerful at the same time, and who gave me the opportunity to work on this project, which has been very enjoyable.

I also want to express my gratitude to my counsellor Mr. Frank Boon Pink, Ms. Jasmine, Ms. Joanne Quek, and professor Gwee Bah Hwee for being behind me all the time.

My big thanks to professor Francis Bond, whose teaching got me interested in natural language processing.

And last but not least, my no-word-can-describe-this gratitude and love to my family and to my friends Long and John for their support.

Thank you all for everything. All the mistakes in this project are my own.


    Contents

Abstract
Acknowledgment

1 Introduction
1.1 Literature Review
1.2 Goals and Objectives
1.3 Scope
1.4 Report Organization

2 System Design
2.1 Requirements
2.2 Designs

3 Tools and Resources
3.1 Tools
3.1.1 Python
3.1.2 Natural Language Processing Toolkit
3.2 Resources
3.2.1 Wordnet
3.2.2 Brown
3.2.3 Semcor
3.2.4 Question Corpus
3.3 Download and Installation
3.4 Getting Started

4 Text Clean Up
4.1 Introduction
4.2 Tokenizer
4.3 Spell Checker
4.3.1 Introduction
4.3.2 Method
4.3.3 Results
4.3.4 Evaluation
4.3.5 Future Work

5 Part of Speech Tagger
5.1 Introduction
5.2 Method
5.3 Mathematical Background
5.3.1 Language Model
5.3.2 Estimating Lambda Value
5.4 Operation Explanation
5.5 Results
5.6 Evaluation and Discussion

6 Meaning Representation
6.1 Introduction
6.2 Background Theory
6.2.1 First Order Predicate Calculus
6.2.2 Formal Logic
6.3 Semantic Analysis: A Study Case
6.4 Operation Explanation
6.4.1 Implication Derivation
6.4.2 Presentation
6.5 Results
6.5.1 Semantics Analysis
6.5.2 Meaning Representation
6.6 Evaluations

7 Conclusion


    List of Figures

2.1 Simplified system DFD
4.1 Spell checker flow chart
4.2 Stemmer flow chart
4.3 Spelling correction module flow chart
5.1 POS tagger flow chart


    List of Tables

4.1 Transformation steps and their patterns
4.2 Weighted transformations
5.1 Accuracies
5.2 Accuracies


    Chapter 1

    Introduction

A search engine is vital for fast and accurate information retrieval. Most search engines to date rely on keywords, metadata and ranking algorithms to return the results most likely to match the input queries. The results can be very irrelevant, as in the early search engines of the 90s, or very effective, as in Google's case, but the ranking can still be tricked into rating a page higher than it really should be, a practice also known as search engine optimization (SEO). This suggests that in Google's method, the real content of a web page plays a less significant role than it should. A natural language search engine, on the other hand, will in theory be able to respond to questions from users, as opposed to keywords only, and be able to analyze the actual contents of a web page to determine its level of relevance.

To be fair, Google's power lies in its gigantic knowledge base. To process such a huge knowledge base and half a billion queries per day, a traditional search engine is probably the most suitable choice, since it leaves the final and usually most difficult task to the user: reading through the contents and choosing whatever suits their needs. Other interesting search engines that might be more useful than Google for specific tasks are GazoPa, TinyEyes, and Stock photography, all of which are similar-image search engines, each with their distinct features; Bing, which is great for lifestyle; and Wolfram Alpha, the world's most academic search engine, which is also a natural language search engine.

A natural language search engine can be broken down into basic natural language tasks that we perform daily: analysis, sense disambiguation, language generation, and so on. In fact, any natural language task can be grouped into the categories below:

Phonetics and Phonology: The study of linguistic sounds.

Morphology: The study of the meaningful components of words.

Syntax: The study of structural relationships between words.

Semantics: The study of meaning.

Pragmatics: The study of how language is used to accomplish goals or in different situations.

Discourse: The study of linguistic units larger than a single utterance.


Human attempts to build automata that mimic humanlike behavior date back thousands of years and are still going strong, from ancient machines programmable by pegs and ropes, to a mechanical marvel of a robotic lion by Leonardo da Vinci in 1515; and what kind of science fiction would it be without a humanlike machine, either in the form of a lovable and talkative android or an intelligent supercomputer with its own evil will and desire? Despite this long history of envisioning, striving and many brilliant minds, it was not until about 80 years ago, in 1936, when the first freely programmable computer, the Z1, was invented, that humankind had the means to realize this long-standing dream.

As technology evolves, humans also invent more methods to communicate effectively with these systems: from keyboard and mouse to touch screens, eye and motion tracking, and even brain signals. But the holy grail of communication will be what we have developed through generations and what we are most naturally familiar with: our mother tongue, or more generally, natural languages. Even though many systems have been successfully developed and commercialized [7], none has proved versatile enough for a wide variety of tasks. The linguistic tasks that we humans perform almost effortlessly every day turn out to be challenging indeed.

In this project, the main goal is to derive the implication given a pair of question and answer, which I believe is how a machine should, and can, learn, just as we ourselves learned when we were kids. Along the way, minor tasks such as spelling and syntax analysis will also be explored.

    1.1 Literature Review

Research on linguistics was carried out in many other fields long before the computer science era and is known under different names in different fields: computational linguistics in linguistics, speech recognition in electrical engineering, and computational psycholinguistics in psychology. In computer science, it is known as natural language processing.

In 1966, Weizenbaum proposed ELIZA, a computer simulation of a Rogerian psychotherapist [21]. Using almost no information on human thought or feeling, and only clever decomposition and recomposition rules triggered by a system of ranked keywords, it sometimes produced amazingly human-like conversations. It was so clever that Weizenbaum reported some attendants refused to stop believing in ELIZA even after the mechanism had been explained to them [21]. Many chat bots were later developed based on ELIZA, each with its unique discourse, hence their different conversation styles.

In 1997, the famous PC game Fallout was released with a feature that allowed the player to interact with in-game characters using natural language input. It performed poorly even within a closed context and was discarded in the following sequel. It was still highly exciting to see such a feature, though.

In May 2009, Wolfram Alpha was launched. Users around the world for the first time had access to a search engine which can operate using natural language both as input and output. Information retrieval and display are thorough and well organized, as if carefully prepared by a human. The drawback is that it is of little use for non-academic purposes.

In early 2011, an IBM machine once more defeated human champions in an intellectual contest. This time it was IBM's Watson against two champions on the Jeopardy! quiz show, which requires contestants to figure out which question should have been asked given a statement. Despite the fact that Watson had the entire data of Wikipedia loaded in its RAM, and exhibited weird behavior at some points, chances are that, not so far into the future, things such as interacting freely with an intelligent system in natural language will start to penetrate and change the way we use machines today.

    1.2 Goals and Objectives

The goal is to be able to analyse a natural language input query, which is typically a question or a question-answer pair, and return its meaning representation in a format suitable for logic operations. Additional spelling correction may be performed if required.

    1.3 Scope

Currently, the program is able to perform word stemming, spelling correction, part-of-speech tagging, chunking and partial meaning representation from a natural language input.

    1.4 Report Organization

The main body of this report consists of four chapters. The first, chapter 2, covers the overall design of the system, as well as the requirements, specifications, tools and resources; a guide for installation is also provided. Next, chapter 4 covers spelling correction. Part-of-speech (POS) tagging methodology, results and performance are discussed in chapter 5. Finally, chapter 6 discusses how meaning can be extracted and represented using syntax analysis information and the FOPC scheme.


    Chapter 2

    System Design

    2.1 Requirements

At the end of the project, the program should be able to analyse a natural language query, which is typically a question or a pair of question and answer, and return its meaning representation in a format suitable for logic operations.

Some necessary intermediate processes are:

Text cleaning up: including tokenizing and spelling correction if required.

Part-of-speech (POS) tagging using a second order hidden Markov model (HMM): accuracy should approach 90%.

Meaning deduction and representation using first order predicate calculus (FOPC).

    2.2 Designs

At the core of the program are three separate modules and a central database.

The three modules are:

spelling module: for spelling correction.

tagging module: for POS tagging.

sense module: for semantics analysis.

A simplified data flow diagram is shown in figure 2.1.


    Figure 2.1: Simplified system DFD
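To make the data flow concrete, here is a minimal sketch of how the three modules could be chained; the callables spell, tag and sense are placeholders standing in for the spelling, tagging and sense modules described in chapters 4 to 6, not the modules' actual interfaces.

def process_query(query, spell, tag, sense):
    # Chain the three core modules over a raw user query,
    # mirroring the data flow of figure 2.1.
    tokens = spell(query)    # text clean-up and spelling correction (chapter 4)
    tagged = tag(tokens)     # POS tagging (chapter 5)
    return sense(tagged)     # semantic analysis (chapter 6)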


    Chapter 3

    Tools and Resources

    3.1 Tools

    3.1.1 Python

Python, named after Monty Python, a British comedy group, is an easy-to-use, flexible, object-oriented, mature, popular, and open source programming language designed to optimize development speed. Python emphasizes concepts such as quality, productivity, portability, and integration. In short, these terms mean (but are not limited to) readability, fast development speed, text processing power and suitability for web work.

As a general-purpose language, Python can be used for almost anything computers are capable of. A few organizations currently using Python are:

Web development: Google, Yahoo, Zope, ...

Games: Civilization 4, Battlefield 2, ...

Graphics: Pixar, Disney, Blender, Paint Shop Pro, ...

Numerical applications: NASA, National Weather Service, ...

And many more...

To be fair, it is unlikely that Python will ever be as fast as C. However, since Python programs can use highly optimized structures and libraries, they sometimes run near, or even above, the speed of equivalent C code. Both C and Python have their distinct strengths and roles. In the modern software context, Python's speed of development is just as important as C's speed of execution.

    3.1.2 Natural Language Processing Toolkit

The Natural Language Processing Toolkit (NLTK) is an open source collection of natural language processing libraries, software and data for Python. In this project, NLTK was used mainly for its corpora and probability module.


    3.2 Resources

Below is the collection of corpora used in the process. Some of them are shipped with NLTK.

    3.2.1 Wordnet

Wordnet is a lexical database for the English language. Nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of semantic and lexical relations. The result is a meaningful hierarchical network of related words, which is useful for text analysis and artificial intelligence.

NLTK is shipped with Wordnet 3.0.

    3.2.2 Brown

The Brown University Standard Corpus of Present-Day American English (or just the Brown corpus) consists of 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961. The samples are divided into categories and subdivisions. A tagged version of the corpus, in which every word is tagged with its part of speech, is also available in NLTK.

    3.2.3 Semcor

A subset of the Brown corpus in which words are also tagged with their sense along with their part of speech. Available at Rada Mihalcea's page (http://www.cse.unt.edu/~rada/downloads.html).

    3.2.4 Question Corpus

Collections of which and what questions tagged with the part of speech and the intention of the question. Available at Rada Mihalcea's page (http://www.cse.unt.edu/~rada/downloads.html).

    3.3 Download and Installation

The download size is about 17 MB. A full installation will require approximately 800 MB of free disk space.

Python 2.7: http://www.python.org/ftp/python/2.7/python-2.7.msi

PyYAML: http://pyyaml.org/download/pyyaml/PyYAML-3.09.win32-py2.6.exe

NLTK: http://nltk.googlecode.com/files/nltk-2.0b9.win32.msi

    After installing all packages, run the Python IDLE (see Getting Started), and type the commands:

    >>> import nltk

    >>> nltk.download()


A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. For central installation, set this to C:\nltk_data. Next, select the packages or collections you want to download.

If you did not install the data to one of the above central locations, you will need to set the NLTK_DATA environment variable to specify the location of the data. Right click on My Computer, then select Properties > Advanced > Environment Variables > User Variables > New...

Test that the data has been installed as follows (this assumes you downloaded the Brown corpus):

>>> from nltk.corpus import brown
>>> brown.words()[:50]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

    3.4 Getting Started

The simplest way to run Python is via IDLE, the Integrated Development Environment. It opens a window in which you can enter commands at the >>> prompt. You can also open an editor with File -> New Window, type in a program, and run it using Run -> Run Module. Save your program to a file with a .py extension.

In Windows:

Start -> All Programs -> Python 2.7 -> IDLE

Check that the Python interpreter is listening by typing the following command at the prompt. It should print Monty Python!

>>> print "Monty Python!"


    Chapter 4

    Text Clean Up

    4.1 Introduction

Text clean-up is usually the very first step of any text processing task. A clever clean-up can benefit the project in many ways. Even if it is just a simple procedure which filters out undesired characters, the memory saving can be substantial for a very large and noisy corpus such as HTML documents.

In this project's context, the user input query is tokenized (punctuation included), filtered of odd characters, and spell checked before being used for further processing.

    4.2 Tokenizer

Here is the code for the tokenizer:

import re

# A word is a run of word characters that may contain internal hyphens
# or apostrophes; anything that is neither whitespace nor a word
# character is captured separately as punctuation.
TOK = r"(?:\b([\w][\w\-\']*[\w]))|([^\s\w])"

def tokenizer(sent, pattern=TOK):
    return [item[0] and item[0] or item[1]
            for item in re.findall(pattern, sent)]

The function uses a regular expression defined in pattern (default value TOK) to search for words and punctuation in the input string sent.

The expression A and B or C is equivalent to:

if A is True: return B
else: return C

However, this is not the case if B is False.
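The following plain Python session (nothing project-specific) illustrates both the normal behavior of the idiom and the pitfall:

>>> 'word' and 'word' or ''    # A is truthy: behaves like the if branch
'word'
>>> '' and '' or '?'           # A is falsy: behaves like the else branch
'?'
>>> True and 0 or 'C'          # pitfall: B is falsy, so C is returned instead of B
'C'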

A result example for a noisy input is shown below.

>>> tokenizer("W-H-A-T y're looking/gaping at \"huh\" ??? ")
['W-H-A-T', "y're", 'looking', '/', 'gaping', 'at', '"', 'huh', '"', '?', '?', '?']


    4.3 Spell Checker

    4.3.1 Introduction

[17] reports a rate of 25 errors per thousand words for handwritten essays by secondary school students, though there was considerable variation between good spellers and poor spellers. On the other hand, a rate of 0.5% to 23% for bibliographic databases was reported in [2]. The spelling error rate and its significance vary depending on the application field. For this project's context, where the user types in the input query, the rate of errors will be high, and they will have a significant impact on the output results.

There are many sources of spelling errors:

Spelling by sound: wuns (ones), sed (said).

Not hearing the sound: umrella (umbrella), lirbary (library).

Confusion of homophones: there/their, two/too/to.

Suffix and prefix: stopt (stopped), happend (happened), realy (really).

Order of letters in words: gril (girl), brid (bird).

Finger slip: qhat (what), no3 (now).

Similar shape in optical recognition: rn (m), 1 (number one) or l (letter l).

Real word error: hole (hope), them (then). [17] found that real-word errors account for about a quarter to a third of all spelling errors, perhaps more if word-division errors are included.

Depending on the kind of spelling error, there are different methods for detection and correction [14]. Some methods are:

N-gram techniques

Rule based techniques

Minimum edit distance techniques

Probability techniques

Neural net techniques

Acceptance based techniques

Expectation based techniques

...

As for this project, we will only focus on errors that result in nonexistent words. Section 4.3.2 explains in detail the method used in this project, while the rest of this chapter evaluates and discusses the performance.


>>> stemmer.stem('groceries')
'groceri'

For this reason, a more sophisticated stemmer was developed. Its flow chart is shown in figure 4.2.

Figure 4.2: Stemmer flow chart

The stemmer utilises extra information such as exceptions and part of speech tags for validation, and hence achieves better accuracy. It can also return the definition associated with the part of speech of the token during the process, if required. The source code can be found in module morphy.py. Another version, which also returns the meaning associated with the word, can be found in module wn_dict_fast.py.

Spelling Correction

The underlying approach was to figure out the similarity between two tokens. A few features were chosen and combined by either linear or cascade combination. These features are:

fitl(word, can): length difference.

fitorder(word, can): level of character disorder.

transmuteI(word, can): steps required to transform word into can. The four basic (distance 1) transformation steps are: delete, add, change, and swap (adjacent letters only).

fitc(word, can): scans for matching strings between word and can, sums up their lengths, then weights.

fitm(word, can): in some sense the opposite of fitc; it weights the unique, rather than the common, characters.

The experimental results indicated that the feature transmuteI alone was sufficient. The other features can be used for selecting potential candidates to speed up the process. The feature fitc was chosen because of its high speed, and because it usually resulted in a candidate set that was neither too broad nor too restricted.


Detailed Operation Explanation

Refer to figure 4.2 for the stemmer operation. The source code can be found in module morphy.py. A flow chart for the spelling correction is shown in figure 4.3.

Figure 4.3: Spelling correction module flow chart

Let's explore the module at its very top layer.

properties = [
    #(fitorder, 1.0),
    #(transmute, 1.0),
    (transmuteI, 1.0),
    #(fitc, 1.0),
    #(fitl, 1.0),
    #(fitm, 1.0),
    (fitc, 0.5),
    #(fitl, 0.7),
    #(fitm, 0.67),
    ]

def interpolation(word, can, classifiers=properties):
    """Linear interpolation of various classifiers;
    the last element is used for candidate selection."""
    sig = sum(wei for (prop, wei) in classifiers[:-1])
    return 1.0*sum(prop(word, can)*wei for (prop, wei) in classifiers[:-1])/sig

properties is a vector of features together with their weights for linear interpolation, as can be seen in the function interpolation. The last element in the feature vector is used for candidate filtering. As stated above, only transmuteI will be used as a feature, with fitc as the candidate filter.

Below is the implementation code for fitc:

def fitc(word, can):
    """The function scans for matching strings between word and
    candidate, sums up their lengths, then weights."""
    (word, can) = (word.lower(), can.lower())
    common = common_strings_mk2(word, can)
    l = 0
    for string in common:
        l += len(string)
    return 0.5*l*(1.0/len(word) + 1.0/len(can))

The code is self explanatory. fitc is calculated using equation (4.1):

fitc = (1/2) * length(common) * (1/length(word) + 1/length(candidate))    (4.1)

Note that fitc is in the range [0, 1]. The function implies that the bigger the value, the more similar the words are. Furthermore, as word or can becomes longer, the lengths of their common substrings must approach their lengths for fitc to remain significant.
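As a worked example, take the pair used in the alignment demonstrations later in this section, abcdefghklm and 21becdfhlkm: the common substrings are b, cd, f, h, k and m, whose lengths sum to 7, so fitc = 0.5 * 7 * (1/11 + 1/10) ≈ 0.67.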

Calculating transmuteI is not as straightforward as fitc. The transformation from word to can can be assumed to go through a series of basic transformation steps. These steps are:

Add one character.

Delete one character.

Swap two adjacent characters.

Change one character into a different one.

If 4 steps are required for the transformation, we say the distance between the two is 4. Obviously, there is an infinite number of ways to transform one word into another, but there should be a limited number of shortest paths. The function transmuteI calculates the shortest distance, and can also figure out the intermediate transformation steps if required.

This is achieved through a four step process:

1. Search for the common strings between the two words.

2. Segment and align the two words based on the common strings.

3. Assign transformation steps.

4. Weight the transformations for calculating transmuteI.


Firstly, search for the common strings between the two words. This is a greedy process, which means it tries to group as many adjacent characters as possible, for instance 'abc' as opposed to 'a', 'b', 'c' or 'ab', 'c'.

Sample code for the searching function:

def common_strings_mk2(word, can):
    """Return list of substrings that match, in order of
    appearance from left to right.

    >>> common_strings_mk2('abcdefghklm', '21becdfhlkm')
    ['b', 'cd', 'f', 'h', 'k', 'm']
    """
    ...

Secondly, segment and align the two words based on the common strings. The desired output of this process is shown in the example below.

def align(word, can, verbose=False):
    """...
    >>> align('abcdefghklm', '21becdfhlkm')
    [['a', '',  'b', '',  'cd', 'e', 'f', 'g', 'h', '',  'k', 'l', 'm'],
     ['2', '1', 'b', 'e', 'cd', '',  'f', '',  'h', 'l', 'k', '',  'm'],
     [2, 4, 6, 8, 10, 12]]
    """
    ...

Thirdly, assign transformation steps. This process relies on the fact that each transformation step determines what its column in the alignment looks like. Table 4.1 summarizes the patterns and their corresponding steps.

Pattern (word row / can row)        Transformation step
c / (empty)                         Delete c
c / d                               Change c to d
d c / c d                           Swap c and d
(empty) / c, not part of a swap     Add c

Table 4.1: Transformation steps and their patterns

Finally, weight the transformations for calculating transmuteI. Some transformations are more likely to happen than others, so these transformations should be weighted less, i.e. given a nearer distance. Table 4.2 shows the weighted transformations.


Step        Condition                                                     Weight
Swap        Between vowels                                                0.5
Swap        Between (g,h), (r,t), (c,h), (h,p), (h,s), (h,k), and (h,r)   0.5
Change      a to e, o to u, e to a, u to o, i to e or y, y to i           0.5
All steps   Not the above cases                                           1.0

Table 4.2: Weighted transformations

After weighting the transformation steps, transmuteI is calculated as follows:

transmuteI = 6 / (6 + sum(weighted transformations))    (4.2)

Equation (4.2) implies that the returned value is equal to 1.0 only if the distance is zero, converges to zero as the distance approaches infinity, and is reduced by one-third when the distance is 3.0.

Some sample code and outputs of a few key functions are shown below:

def tagseq(word, can, verbose=False):
    """Figure out the transmute steps given aligned sequences.
    Return value: [[weight, type, involved chars, index]]"""
    ...

>>> tagseq('abcdefghklm', '21becdfhlkm')
[[1.0, 'Change', 'a to 2', 0],
 [1.0, 'Add', '1', 1],
 [1.0, 'Add', 'e', 3],
 [1.0, 'Del', 'e', 5],
 [1.0, 'Del', 'g', 7],
 [1.0, 'Swap', 'l and k', 11]]

def transmuteI(word, can):
    """Measure distance (steps required to transform word into can)."""
    # tagseq(align(word, can)) originally returned the steps required to
    # transform word into can. Basic steps are add, delete, change, and
    # swap (adjacent chars). It has been modified to weight the steps
    # instead. For example: swap a and e: 0.5, change a to e: 1.0.
    return 6.0/(6 + sum(val[0] for val in tagseq(word, can)))

>>> transmuteI('abcdefghklm', '21becdfhlkm')
0.5

    4.3.3 Results

    Below are a few demonstrations for the stemmer.


>>> morphy('goes')
[['go', 'v']]
>>> morphy('propose')
[['propose', 'v']]
>>> morphy('grocery')
[['grocery', 'n']]
>>> morphy('groceries')
[['grocery', 'n']]

Some demonstrations for the spelling correction. The error samples were taken from section 4.3.1.

>>> correct('umrella')
['umbrella']
>>> correct('libary')
['library']
>>> correct('qhat')
['what', 'that', 'quat', 'qat', 'khat', 'hat', 'ghat', 'chat']
>>> correct('no3')
['nox', 'now', 'nov', 'not', 'nos', 'nor', 'non', 'noi', 'nog',
 'noe', 'nod', 'noc', 'nob', 'no.', 'no']
>>> correct('brid')
['rid', 'grid', 'brit', 'bris', 'brio', 'brim',
 'brig', 'brie', 'bride', 'braid', 'brad', 'bird', 'bid', 'arid']
>>> correct('gril')
['grit', 'gris', 'grip', 'grin', 'grim', 'grill', 'grid',
 'grail', 'girl', 'aril']
>>> correct('wuns')
['uns', 'buns']
>>> correct('sed')
['sad']
>>> correct('stopt')
['stout', 'stops', 'stop', 'stoat']
>>> correct('happend')
['happen', 'append']
>>> correct('realy')
['reply', 'rely', 'relay', 'redly', 'realty', 'realm',
 'really', 'real', 'ready', 'mealy']

    4.3.4 Evaluation

    The stemmer worked perfectly well for the test samples.


On the other hand, it is hard to define a baseline or perform an accuracy test for the spell checker because, due to the undetermined nature of spelling errors, there might be no unique correct answer. The best solution therefore would be to let the user choose from a list of suggested words. So we would say a spell checker fails only if it leaves out the intended solution.

From the above testing experiment, the spell checker failed in four cases: wuns, sed, stopt, and happend. However, these four cases couldn't be helped, since they were errors due to spelling by sound, and the spell checker wasn't designed for this type of error.

Further experiments, done by quickly skimming through a dictionary, catching one glance at a long word and rapidly typing it into the computer, showed that the spell checker rarely failed. One of the few failed cases was lurve, which was one of Woody Allen's words for love, since he thought love was too weak a word.

>>> correct('lurve')
['lure', 'curve']

So in conclusion, the spell checker has done its job well.

    4.3.5 Future Work

A spelling corpus is freely available at http://www.ota.ox.ac.uk/headers/0643.xml. The corpus can be used for a statistical spell checker or an accuracy test. However, the corpus is not readily usable; a corpus reader to make it readily available in NLTK would be very useful.

    Chapter 5

    Part of Speech Tagger

    5.1 Introduction

Most of the time, we can understand an utterance while paying little to no regard to its syntax. For example, "I am driving a car", or some random jungle talk: "George good, George no hurt you!". However, when the syntax becomes complicated, or there is ambiguity, some syntax analysis will be necessary to understand the utterance's sense correctly. In fact, most natural language tasks can be viewed as resolving ambiguity at some point.

Let's consider a few examples:

1. I want to eat someplace that's close to NTU.

2. He won't be in town until 4PM.

3. He didn't marry her because she was rich.

4. When Ulysses S. Grant and Robert E. Lee met in the parlor of a modest house at Appomattox Court House, Virginia to work out the terms for the surrender of Lee's Army of Northern Virginia, a great chapter in American life came to a close and a greater chapter began.

Considering example 1, given the usual structure of the verb eat, this utterance can either mean that the speaker wants to eat at some nearby location, or that the speaker wants to swallow the place. The latter is much less likely in the real world.

Considering example 2, depending on which part of the sentence the prepositional phrase until 4PM modifies, the meaning can be being out of town before 4PM and arriving only then, or arriving in town at some earlier time but not staying as late as 4PM.

The same goes for example 3: the clause because she was rich can modify either the state of being married (the whole utterance) or just the cause (the verb only).

Example 4 might look complicated at first, but closer inspection reveals a quite simple structure: a sentence pre-modifier (When Ulysses S. Grant and Robert E. Lee met in the parlor of a modest house at Appomattox Court House, Virginia to work out the terms for the surrender of Lee's Army of Northern Virginia,) followed by a series of simple sentences ([a great chapter in American life came to a close] [and a greater chapter began]).

As the examples above illustrate, a good syntax analysis makes the task of meaning extraction easier and more precise than intuition alone. In this chapter, we'll explore syntax analysis, i.e. POS tagging. Section 5.2 provides some information about the method used. The necessary mathematical background is covered in section 5.3. The following sections elaborate on the program operation and its performance.

    5.2 Method

The language is assumed to be a second order hidden Markov model (HMM), which means that the choice of a word depends only on its previous two words. This of course is not true, but it is not totally false either. Let's consider an example:

I painted my neighbour's whole new ... green except for the front door yesterday.

Which word is likely to fill the empty space? It should be something that can be painted on, such as face, house, board, pants, or cow. But only house is related to door, which appears quite far after the empty location. Our current HMM model will fail in this case unless the term whole new house happens to appear more often than the others in the usual context.

However, in our context, which is part of speech tagging, the HMM model works much more efficiently, since grammar rules restrict which word classes can stand next to each other. For example, adjectives usually precede nouns, and to must be followed by a noun phrase or a bare infinitive verb.

A tagger based on the HMM model will estimate the probability of various sequences and return the most probable sequence or sequences.

An HMM model has some major advantages compared to a decision tree model:

It would require adept knowledge of linguistics to capture all the grammar rules for use in a decision tree, since a strict definition of a grammar rule is not easy to state.

Most words in English are unambiguous, that is, they have only one part of speech. But many of the most common words in English are not. In fact, [9] stated that over 40% of word tokens are ambiguous.

A single HMM tagger can be reused for many languages or applications if provided with proper training data, which in this case is a set of sentences tagged with each word's part of speech. This kind of database is usually simple to prepare, but time consuming.

The HMM model also has drawbacks:

The true model of the language is not known and can only be approximated.

It is impossible to have a database of every instance of a language, so the HMM model should account for unseen events (also known as smoothing).

The application field should be similar to the training data.

A database of sufficient size for training must be available.

It is unable to validate the returned solutions.

In this project, trigram, bigram and unigram classifiers were linearly interpolated into a tagger called lihmmtagger. In case this tagger fails to tag a word, a prefix tagger, a regular expression tagger and a default tagger will be called, in that order, to prevent the error propagation caused by failing to tag a word. The source code can be found in module tagger.py.

    5.3 Mathematical Background

    5.3.1 Language Model

The tagging problem can be defined for the trigram model as follows: given the sequence [t1, t2] (POS 1 and 2), find the probability of the sequence being followed by word w3 which has POS t3, i.e.

P(t3, w3 | t1, t2) = P(w3 | t1, t2, t3) P(t3 | t1, t2) = P(w3 | t3) P(t3 | t1, t2)    (5.1)

The formula is derived by making the assumption that w3 only depends on t3.

Similarly, the bigram and unigram models can be defined:

P(t3, w3 | t2) = P(w3 | t3) P(t3 | t2)    (5.2)

P(t3, w3) = P(w3 | t3) P(t3)    (5.3)

The solution can be found using a linear interpolation of (5.1), (5.2), and (5.3):

argmax over tag sequences of  prod(1 <= i <= n) P(wi | ti) (lambda3 P(ti | ti-2, ti-1) + lambda2 P(ti | ti-1) + lambda1 P(ti))    (5.4)

in which lambda1, lambda2, lambda3 are the weights of the classifiers, lambda1 + lambda2 + lambda3 = 1, and t-1 = t0 = BOS (Beginning of Sentence). Logarithms are used when the numbers get small.

It should be pointed out that the returned solution is the most probable path, i.e. the highest product of probabilities. This is not the same as the product of the highest probabilities. One method for finding this path is known as the Viterbi algorithm [10].
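A minimal sketch of this search under the interpolated model is shown below, assuming the probability estimates are supplied as callables (p_word, p_tri, p_bi, p_uni and lam are placeholder names, not the project's actual API) and with no beam limit:

import math

def score(w, t3, t1, t2, lam, p_word, p_tri, p_bi, p_uni):
    # Log of the per-word factor of equation (5.4):
    # P(w|t3) * (lambda3 P(t3|t1,t2) + lambda2 P(t3|t2) + lambda1 P(t3))
    trans = lam[2]*p_tri(t3, t1, t2) + lam[1]*p_bi(t3, t2) + lam[0]*p_uni(t3)
    return math.log(p_word(w, t3) * trans + 1e-300)   # guard against log(0)

def best_path(words, tagset, lam, p_word, p_tri, p_bi, p_uni):
    # Dynamic programming over states (the previous two tags); for each
    # state keep the best log probability and the tag sequence achieving it.
    paths = {('BOS', 'BOS'): (0.0, [])}
    for w in words:
        nxt = {}
        for (t1, t2), (lp, seq) in paths.items():
            for t3 in tagset:
                cand = lp + score(w, t3, t1, t2, lam, p_word, p_tri, p_bi, p_uni)
                if (t2, t3) not in nxt or cand > nxt[(t2, t3)][0]:
                    nxt[(t2, t3)] = (cand, seq + [t3])
        paths = nxt
    return max(paths.values())[1]   # tags of the most probable path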

    5.3.2 Estimating Lambda Value

For each trigram in the training data, compare the following values:

(C(t1,t2,t3) - 1) / (C(t1,t2) - 1),   (C(t2,t3) - 1) / (C(t2) - 1),   (C(t3) - 1) / (N - 1)

Depending on which of them is the maximum, increase the corresponding lambda by a certain amount. The amounts tried were: 1, C(t1,t2,t3), and C(t1,t2,t3)/C(t1,t2).

The reason for the minus 1 is that we treat the trigram in question as an observed event, so the actual counts must be reduced by 1. For this reason we skip trigrams which have been seen only once.
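A sketch of this estimation procedure, assuming the counts are held in dictionaries C3, C2 and C1 (hypothetical names) and using C(t1,t2,t3) as the increment, one of the amounts listed above:

def estimate_lambdas(C3, C2, C1, N):
    # C3, C2, C1 map tag trigrams, bigrams and unigrams to their counts;
    # N is the total number of tag tokens in the training data.
    lam = [0.0, 0.0, 0.0]            # unigram, bigram, trigram weights
    for (t1, t2, t3), c in C3.items():
        if c < 2:
            continue                 # trigrams seen only once are skipped
        # Subtract 1: the trigram in hand is treated as held-out data.
        cases = [(C1[t3] - 1.0) / (N - 1),
                 (C2[(t2, t3)] - 1.0) / (C1[t2] - 1),
                 (c - 1.0) / (C2[(t1, t2)] - 1)]
        lam[cases.index(max(cases))] += c   # increment by C(t1,t2,t3)
    total = sum(lam)
    return [l / total for l in lam]  # normalize so the lambdas sum to 1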


    5.4 Operation Explanation

The tagger flow chart is shown in figure 5.1.

    Figure 5.1: POS tagger flow chart

There are four tagger classes:

A default tagger, which automatically assigns the most common tag, NN (147,169 counts out of 1,071,233, i.e. about 14%).

retagger: a tagger based on regular expressions.

affixtagger: a tagger based on the prefixes and suffixes of words.

lihmmtagger: a linear interpolation of trigram, bigram and unigram taggers.

Each tagger is a separate class, and they share some common methods:

train: takes a tagged corpus as training data and exports the training information for later use, since training might take a long time.

setimport: imports previous training information.

tagit: tags a word.

tagthem: tags a set of words.

accuracy: takes a tagged corpus as a test set and returns the percentage of correctly tagged words.


test: an improved accuracy test. Takes a tagged corpus as a test set, divides it into smaller sets and performs an accuracy test on each set. Returns the average accuracy and the standard deviation.
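A hypothetical usage session built on this interface (the file name and corpus variables are illustrative assumptions, not the project's actual names):

>>> tagger = lihmmtagger()
>>> tagger.train(tagged_corpus)         # slow; also exports training info
>>> tagger.setimport('training_info')   # later sessions reload it instead
>>> tagger.tagthem(['I', 'am', 'driving', 'a', 'car'])
>>> tagger.test(test_corpus)            # mean accuracy and standard deviation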

There are 170 possible tags, and since the tagger searches for the most probable path, the search space can easily reach several hundred thousand paths in just 10 or 15 words, which is the most common sentence length in the Brown corpus. A search beam (1000 to 2000) must be applied to limit the search space, but this affects the accuracy as well. In order to improve the processing speed while limiting the effect on accuracy, complex sentences which consist of more than one clause are broken up and tagged clause by clause, since words across clauses have little syntactic relation.

    5.5 Results

Below are some scores returned by the accuracy test.

Tagger                               Mean (percent)   Standard deviation
Regular expression                   7.14             2.23
Suffix tagger                        28.56            6.38
Prefix tagger                        27.15            6.84
HMM tagger (most likely tags only)   85.68            4.42

Table 5.1: Accuracies

Table 5.2 shows the accuracy test for the linear interpolation HMM tagger.

Testset size   Mean    Mean w/o testset segmenting   Variance   Best    Worst
120            63.25   64.72                         132        78.04   45.45
1010           67.54   67.45                         24         78.03   58.6

Table 5.2: Accuracies

    5.6 Evaluation and Discussion

The suffix tagger had slightly better accuracy and deviation than the prefix tagger, and an interpolation of the two fell somewhere in between, so the suffix tagger was chosen. It was surprising to achieve such decent accuracy just from knowing the starting or ending characters.

As for the HMM tagger, simply going from left to right and choosing the most likely tags yields decent accuracy and speed. However, the outcome might not be very useful for other tasks such as information extraction, due to misleading tags. Therefore, searching for the most probable path is more desirable. The drawback is that it requires much more computational effort.

Even though the mean accuracies of the two methods of testing approximate each other, segmenting the test set gives more insight into the accuracy scores. Despite the large difference in testing scale, the scores were close to each other. Hence it can be assumed that the tagger will work 67% of the time. This is a much lower score than the other HMM tagger, which shouldn't be the case. This implies that the language model might go wrong at some point, such as in the smoothing process or in estimating the lambda values.

The speed is much slower compared to the other HMM tagger, especially for lengthy sentences. By segmenting the sentences into clauses, the speed was practically doubled. The accuracy improvement, however, was unknown. On some occasions during the debugging process, segmenting the sentence into smaller clauses proved to give better accuracy. The experiment logs are provided in the database.

Analyzing the experiment logs suggested that punctuation can improve, but can also degrade, the performance dramatically. For instance, let's consider a sample from the log:

[('``', '``'), ('What', 'WDT'), ('is', 'BEZ'), ('with', 'IN'),
 ('this', 'DT'), ('vow', 'NN'), ('jazz', 'NN'), ("''", "''"), ('?', '.'), ('?', '.')]
NoCls: 2

The tagger assigned What a quotation-mark tag instead of WDT. This might be because punctuation marks often have a very high frequency of appearance; more specifically, that tag has a count of 6160, as opposed to a count of 4834 for WDT.

On the other hand, since punctuation marks are more likely to be tagged correctly, they also have the effect of limiting error propagation. Therefore, whether punctuation should be considered during the training or tagging process, and whether its benefits outweigh its disadvantages, is unresolved at the moment.

Even though it is possible to modify the program to carry out experiments to verify the above problem, considering the project's context, where the inputs are user search queries, punctuation will not be a big issue. Furthermore, when applied to the semantics module (discussed in the next chapter), the returned results were promising. So in conclusion, the program is good enough for practical usage, even though the performance score was not as high as expected.


    Chapter 6

    Meaning Representation

    6.1 Introduction

In the previous chapter, a few examples illustrated that in many cases syntax analysis is useful, but not necessary, for meaning comprehension. This implies that a system other than grammar is necessary for meaning representation. This makes sense, since most grammar rules are about how different words can be combined to form a sentence. One exception, for example, is that some verbs must not stand alone, like give; this class of verbs is known as transitive verbs, and in some way represents the sense of the verbs. However, consider some everyday language tasks that require some form of semantic processing:

Answering a question.

Following a recipe.

Realizing that you've been insulted.

It is clear that no morphological or syntactical representation so far will get us very far on these tasks. What is needed is a representation that can bridge the gap between linguistic inputs, their meaning, and the kind of real world knowledge that is needed to perform the tasks involved.

Over the years, a fair number of representational schemes have been invented to capture the meaning of natural language inputs for use in processing systems. Three notable schemes are First Order Predicate Calculus (FOPC), Semantic Networks and Frames.

In this chapter, some background theory on FOPC and formal logic is covered in section 6.2. Section 6.3 explores how the theory can be applied to a specific case. The rest of the chapter then explains how the program achieves the desired result described in section 6.3.


    6.2 Background Theory

    6.2.1 First Order Predicate Calculus

Desiderata For a Representation Scheme

There are basic requirements that a meaning representation must fulfill.

Verifiability

The system's ability to compare a state of affairs described by a representation to states of affairs modeled in a knowledge base.

Unambiguous Representations

Ambiguities exist in all aspects of all languages. Some means of determining that certain interpretations are more or less preferable than others is needed.

Canonical Form

The phenomenon whereby distinct inputs should be assigned the same meaning representation.

Inference

The system's ability to derive valid conclusions based on the meaning representations of inputs and its knowledge base. The conclusions might not be explicitly represented in the knowledge base, but are logically derivable from the available propositions.

Variables

Allow the system to match an unknown entity to a known object in the knowledge base so that the entire proposition is matched.

Expressiveness

Finally, to be useful, a representation scheme must be expressive enough to cover a wide range of subject matter, such as time and tense.

The semantics of FOPC

Capturing the meaning of a sentence involves identifying the terms and predicates corresponding to the various grammatical elements of the sentence. Some basic building elements of FOPC are:

Constants

Refer to specific objects in the world. As in programming languages, FOPC constants refer to exactly one object. Objects can, however, have more than one constant referring to them.


Functions

Functions in FOPC can express attributes of objects. FOPC functions have the same syntax as single-argument predicates; however, they are in fact terms, since they refer to unique objects.

Example: LocationOf(NTU)

Variables

Give the system the ability to draw inferences or make assertions about objects without having to refer to any particular ones.

Formulas

The equivalent of sentences in grammatical representation. A FOPC formula is a representation of objects, properties or relations between terms. Formulas can therefore be assigned True or False values, depending on whether the information they encode is in accord with the knowledge base or not. Note that the arguments of a formula must be terms, i.e. constants, functions, and variables, not formulas.

Quantifiers

Variables can be used to make statements either about a particular unknown object, or about all the objects in some arbitrary class of objects. Quantifiers make these two uses possible. The two basic operators in FOPC are the exists quantifier, which denotes one particular unknown object, and the all quantifier, which refers to all objects in a class.

Lambda Notation

Enables a formal mechanism for replacing a specific variable with a specific term.

Example: lambda x P(x)(A) --> P(A)
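Beta-reduction can be mimicked with Python's own lambda, which gives a rough mental model of the rule above (an illustration only, not the project's representation):

>>> P = lambda x: ('P', x)    # a one-place predicate as a function
>>> expr = lambda x: P(x)     # the abstraction lambda x. P(x)
>>> expr('A')                 # applying it to A reduces to P(A)
('P', 'A')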

    6.2.2 Formal Logic

The word formal means by form, rather than by meaning. Some terms necessary to read and understand formal logic are:

Argument: a line of reasoning that starts with premises and ends with a conclusion.

Proof and disproof: a formal proof is a logical argument which convinces by following formal rules.

Claims: declarative remarks about states of the world.

Contradiction: the simultaneous acceptance and rejection of some remark. Contradiction isn't allowed in formal logic (this is also known as the consistency principle).

Monotonicity principle: a proof cannot be invalidated by adding premises, since a proof obeys rules, not meaning or facts.


Formal rules: the intermediate steps in a logical argument. Rules show how larger proofs can be made out of one or more smaller proofs.

Connectives: the logic connectives are used to build larger claims out of smaller claims. There are four connectives:

And: if we accept A and B, then we are forced to accept both A and B simultaneously.

If-then: if we accept if A then B and A, we are forced to accept B.

Or: if we accept A or B, then we can either accept only A, only B, or both.

Not: if we accept not A, then accepting A leads to a contradiction.

Each connective has two rules associated with it: introduction and elimination. The names indicate whether the connective is introduced or eliminated in the final proof. For instance, here is an example of the If-then elimination rule: (If A then B), also A, therefore B.

6.3 Semantic Analysis: A Study Case

Semantic analysis is the process whereby the meaning representation of linguistic inputs is created. Consider the following excerpt from the movie Monty Python and the Holy Grail:

- Q1: Tell me: What do you do with witches?
- A1: Burn them!
- Q2: What do you burn apart from witches?
- More witches!
- A2: Wood.
- Q3: So, why do witches burn?
- A3: 'Cause they're made of wood?
- Q4: How do we tell if she's made of wood? Does wood sink in water?
- A4: No. It floats.
- Q5: What also floats in water?
- A5: A duck!
- So, logically... If she... weighs the same as a duck... she's made of wood.
- And, therefore... A witch!
- A witch!
- We shall use my largest scales. ... Remove the supports!

A representation scheme based on FOPC and logic proof for the above excerpt will look something like this:

    lambda_x (witches(x) --> burn(x)) (1) premise

    lambda_x (burn(x)-->wood(x)) (2) premise

    28

  • 8/7/2019 Thai FYP Report

    36/45

    lambda_x (witches(x)-->wood(x)) (3) implication-introduction (1) (2)

    lambda_x (wood(x)-->float(x)) (4) premise

    float(duck) (5) premise

    lambda_x lambda_y ((weigh(x)=weigh(y)) and float(y) -->float(x)) (6) premise

    "We shall use my largest scales. ... Remove the supports!

    weigh(duck)=weigh(girl) (7) given

    lambda_x ((weigh(x)=weigh(duck)) and float(duck)

    -->float(x)) (8) lambda-reduction (6)

    (weigh(girl)=weigh(duck)) and float(duck)

    -->float(girl) (9) lambda-reduction (8)

    (weigh(girl)=weigh(duck)) and float(duck) (10) And-introduction (7) (5)

    float(girl) (11) Implication-elimination (10) (9)

    wood(girl) (12) Implication-elimination (11) (4)

    Lets assume: lamda_x (not_witches(x)-->not_wood(x)) (13) given

lambda_x (wood(x)-->witches(x)) (14) backward chaining (13) (3)

    wood(girl)-->witches(girl) (15) lambda-reduction (14)

witches(girl) (16) Implication-elimination (15) (12)

    "A witch!- A witch!"

The claim at the end is nonsensical according to our intuition. In the logic shown above, this shows up as an invalid step at (12), which affirms the consequent of (4) by concluding wood(girl) from float(girl). However, it is still quite possible for the claim to be true if we rewrite the proof as below:

    lambda_x (wood(x)-->float(x)) (4) premise


    lambda_x (human(x) and not_wood(x) --> not_float(x)) (4.1) premise

lambda_x (human(x) and float(x) --> wood(x)) (4.2) backward chaining (4.1) (4)

human(girl) and float(girl) --> wood(girl) (4.3) lambda-reduction (4.2)

    human(girl) (4.4) given

    wood(girl) (12) Implication-elimination (11) (4.4) (4.3)

A new premise (4.1), which is reasonable, is introduced. Amazingly enough, the final claim, which is supposed to be humorous and nonsensical, is logically true.
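As a quick sanity check, the revised chain of steps can be replayed by naive forward chaining over a toy propositional encoding (my own assumption, not part of the project code):

facts = {"human(girl)", "float(girl)"}                    # (4.4) and (11)
rules = [({"human(girl)", "float(girl)"}, "wood(girl)")]  # instance (4.3)

changed = True
while changed:                 # apply rules until nothing new is derived
    changed = False
    for body, head in rules:
        if body <= facts and head not in facts:
            facts.add(head)
            changed = True

print("wood(girl)" in facts)   # True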

6.4 Operation Explanation

    6.4.1 Implication Derivation

    Design

The program derives the implication by figuring out the symbols, the referents and their syntactic roles in the sentences. Consider an example:

    Q: Who is Albert?

    A: He is a genius.

In the above example, who and he are symbols, and they refer to genius and Albert respectively. Logically, the implication must satisfy the following two rules:

    1. The left hand side of the implication must be either the argument or predicate of the question, and

    the right hand side of the implication must be either the argument or predicate of the answer.

    2. The left hand side must be meaningful, or the implication will be pretty much useless.

Following the above two rules, a sound implication would be Albert --> genius or Albert --> he, while an implication such as who --> he is not very informative.
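The filtering implied by the two rules can be sketched as below (a simplified illustration; the symbol list and function name are my assumptions, not the project's actual code):

SYMBOLS = {"who", "what", "which", "whom", "why", "he", "she", "it", "they"}

def implications(question_parts, answer_parts):
    # Rule 1: the LHS comes from the question, the RHS from the answer.
    # Rule 2: discard pairs whose LHS is a mere symbol.
    return [(lhs, rhs)
            for lhs in question_parts
            for rhs in answer_parts
            if lhs.lower() not in SYMBOLS]

>>> implications(["who", "Albert"], ["he", "genius"])
[('Albert', 'he'), ('Albert', 'genius')]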

The above two rules and the referent identification process are embedded in the deduct class, while predicate and argument extraction are handled by the Sentence class.

    phrase Class

As the name suggests, phrases such as noun phrases, prepositional phrases or verb phrases can be modeled using the class phrase.

    Some key attributes of the class phrase are:


    string - the actual string of the phrase.

root - beginning and ending indexes of the root of the phrase.

type - Type of phrase. Currently noun phrases, prepositional phrases, verb phrases and adjective phrases are supported.

    span - beginning and ending indexes of the phrase.

    modifiers - modifiers of the phrase, including postmodifiers and premodifiers.

    Each phrase has an associated function to search for that particular phrase in a sentence.
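For illustration, the attributes above suggest a structure along these lines (a hypothetical reconstruction, not the report's actual code):

from dataclasses import dataclass, field

@dataclass
class phrase:
    string: str     # the actual string of the phrase
    type: str       # noun, prepositional, verb or adjective phrase
    span: tuple     # beginning and ending indexes of the phrase
    root: tuple     # beginning and ending indexes of its root
    modifiers: list = field(default_factory=list)  # pre-/postmodifiers

np = phrase("a duck", "NP", (3, 5), (4, 5), ["a"])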

    Sentence Class

This class is used to represent an instance of a sentence. It provides convenient access to many components of a sentence such as POS tags, subject, predicate and sentence type.

    Some key attributes of the class Sentence are:

type - Sentence type. Currently declarative sentences, yes/no questions and some Wh-questions are supported.

    wttups - Tuples of POS tags.

    wtstr - String of POS tags.

    jungle - Jungle talk version of the original sentence.

    SubPre - Lists of subjects and predicates.

    jdiscourse - Discourse of the sentence.

negate - Whether the sentence is negative or not.

    referent - Symbols and referents.
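A hypothetical session exercising these attributes (the constructor signature and the exact values shown are my assumptions; the Brown-style tags follow the results in Section 6.5):

>>> s = Sentence("He is a genius")
>>> s.type
'declarative'
>>> s.wttups
[('he', 'PPS'), ('is', 'BEZ'), ('a', 'AT'), ('genius', 'NN')]
>>> s.negate
False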

    deduct Class

This class derives implications from a question and answer pair. It is worth noting that the process relies solely on syntactic information provided by the Sentence class and pays no regard to the actual meaning of the tokens, hence the name syntax-driven semantic analysis.

    Some key attributes of the class deduct are:

    lhs - Instance of the sentence on the left hand side (the question).

    rhs - Instance of the sentence on the right hand side (the answer).

    presentation - List of possible implications.

    FOPC - FOPC representation. Unimplemented at the moment.

refpairs - Symbol and referent pairs.
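A hypothetical session exercising these attributes (the presentation output follows the format of Section 6.5; the refpairs value is my assumption based on the design described above):

>>> d = deduct("who is Albert", "He is a genius")
>>> d.presentation
[albert/NP-->genius/NN, albert/NP-->he/PPS]
>>> d.refpairs
[('who', 'genius'), ('he', 'albert')]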

    31

  • 8/7/2019 Thai FYP Report

    39/45

    6.4.2 Presentation

    Design

At the moment, the representation module is not integrated with the semantic analysis modules discussed above, since the extracted information is not detailed enough for the module to generate an accurate representation. More specifically, the module needs information such as what the quantifier should be and what the term should be (variable, constant, or function).

However, if we are able to provide this information, the class atom can be used to generate a proper representation.

atom Class

    On top of generating basic atomic representations, lambda reduction and combining smaller expressions

    are also supported.

    preamble - FOPC formulas with quantifiers.

    amble - FOPC formulas without quantifiers.

    Lambda - Trigger lambda reduction.

quantifier - Formula's quantifier.

connective - Formula's connective.

term - Formula's term.

predicate - Formula's predicate.

    var - Variable symbol.

    presentations - List of possible representations.

    6.5 Results

6.5.1 Semantic Analysis

Below are the output implications for sample pairs of questions and answers. Some of them are from the case study in Section 6.3; some statements were modified into questions for compatibility, or manually tagged if they had been tagged wrongly.

    q1="What do you do with witches?"

a1="Burn them."

    >>> deduct(q1,a1).presentation

    [witches/NNS-->burn/VB]


q2="What do you burn apart from witches?"

a2="Wood."

    >>> deduct(q2,a2).presentation

    [burn/VB-->wood/NN]

q3="Why do witches burn?"

    a3="Because they are made of wood"

    >>> deduct(q3,a3).presentation

    [witches/NNS-->made/VBN of/IN wood/NN]

q4="does/DOZ wood/NN sink/VB in/IN water/NN"

a4="No, it floats."

    >>> deduct(q4,a4,1).presentation

    [wood/NN-->floats/VBZ, water/NN-->floats/VBZ]

    q5="What also floats in water?"

a5="A duck"

    >>> deduct(q5,a5).presentation

    [floats/VBZ in/IN water/NN-->duck/NN]

q6="why/WRB does/DOZ she/PPS weight/VB the/AT same/AP as/CS a/AT duck/NN"

a6="Because she is made of wood"

    >>> deduct(q6,a6,1).presentation

[weight/VB same/AP as/CS a/AT duck/NN-->wood/NN, duck/NN-->made/VBN of/IN wood/NN]

q8="who is Albert"

a8="He is a genius"

    >>> deduct(q8,a8).presentation

    [albert/NP-->genius/NN, albert/NP-->he/PPS]

q9="which country is north of America"

a9="Canada is north of America"

    >>> deduct(q9,a9).presentation

    [north/NR of/IN america/NP-->canada/NP]

q10="who is the most stupid guy on earth"

a10="Dummies"


    >>> deduct(q10,a10).presentation

[the/AT most/QL stupid/JJ guy/NN on/IN earth/NN-->dummies/NNS]

    6.5.2 Meaning Representation

    Below are a few demonstrations including atomic terms, lambda reduction and combining smaller atoms

    into a bigger representation.

>>> t1=atom('witches',None,'var')

    >>> t1.presentation

    lambda_x_witches Isa(x_witches,witches) connective

>>> t2=atom('burn',None,'var')

    >>> t2.presentation

    lambda_x_burn Isa(x_burn,burn) connective

    >>> t3=atom(t2,t1)

    >>> t3.presentation

lambda_x_burn Isa(x_burn,burn) lambda_x_witches Isa(x_witches,witches) connective Role_of_x_witches(x_burn,x_witches)

>>> t4=atom(t3,Lambda=[[t1,'girl']])

    >>> t4.Apresentation

    Isa(x_burn,burn) Isa(girl,witches) connective Role_of_girl(x_burn,girl)

    6.6 Evaluations

Further experiments on the deduction program showed that even though the program analyses sentences in a rather simple-minded way, it worked unexpectedly well for simple sentences, and even for somewhat more complex ones.

>>> Q="Why is he kicking that poor dog?"

>>> A="Because it bites him."

    >>> deduct(Q,A).presentation

    [dog/NN-->bites/VBZ him/PP, dog/NN-->bites/VBZ him/PP]

Q="Why does someone die?"

A="Because he is old"

    >>> deduct(Q,A).presentation

    [someone/PN-->old/JJ]


    Some current limitations are:

Not all types of questions are supported, nor are sentences with clauses and commas. Currently only yes/no questions, what, which, who, whom and why questions, and simple sentences are available.

Unable to extract information such as tense, quantity or entity relationships, which are necessary for the FOPC representation scheme.

The analysis is based only on syntactic roles and does not consider the actual meaning or the relative locations of the tokens.

Despite the limitations, the program does work in simple contexts. Hence it can either be used as a layer in a multi-level process in which each layer solves a particular problem, or be improved to deal with complex sentences and extract more useful information.

As for the representation module, it worked as expected for simple formulas. As formulas get more complex, quantifier scoping proved to be an ambiguity issue. Let's consider an example:

    - Every restaurant has a menu

The meaning representation of the sentence might take either of the two forms below:

- all x Restaurant(x) then exist e, y Having(e) and Haver(e, x) and Isa(y, Menu) and Had(e, y)

- exist y Isa(y, Menu) and all x Isa(x, Restaurant) then exist e Having(e) and Haver(e, x) and Had(e, y)

    In the worst case, a sentence with n quantifiers will have n! representations.
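A quick illustration of this factorial growth (a toy enumeration, not project code):

from itertools import permutations
from math import factorial

quantifiers = ["all x", "exist y", "exist e"]
print(factorial(len(quantifiers)))          # 6 possible scope orderings
for ordering in permutations(quantifiers):  # enumerate every scoping
    print(" ".join(ordering))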


    Chapter 7

    Conclusion

This has been a very enjoyable project for me. I have had fun learning Python and studying some very interesting aspects of language that I used to take for granted. Though I failed to achieve the initial goals, I gained other precious things in return: I became less arrogant, realized how magnificent this world and its people are, and collected memorable moments, such as the excitement of getting the program running for the first time, or the way my heart sank to my feet on hearing of IBM Watson's triumph.

Though I feel that I could have done much more had I received more formal training in linguistics, I am contented with what was achieved. I am truly glad to have had the opportunity to work on this project. My deepest gratitude is owed to my supervisor, Professor Chan Chee Keong. Thank you.

