Computational Language Andrew Hippisley. Computational Language Computational language and AI...

Post on 21-Dec-2015


Computational Language

Andrew Hippisley

Computational Language

• Computational language and AI
• Language engineering: applied computational language
• Case study: spell checkers

Computational language & AI

• Artificial Intelligence: “the simulation on computer of distinctly human mental functions.” Wilks (1993)

Computational language & AI

• Language integral to intelligent systems
• Artificial Intelligence
• Turing Test
• ELIZA

ELIZA

Eliza: Hello. I am ELIZA. How can I help you?
Me: I am feeling a bit depressed today
Eliza: Do you think coming here will help you not to be depressed?
Me: yes
Eliza: You are sure?
Me: a friend told me I should talk to you
Eliza: Do you wish to you should talk to me?
Me: I am willing to try
Eliza: How long have you been willing to try?
Me: I have been willing to try for about two weeks
Eliza: Can you elaborate on that?
Me: a friend said I should try to talk to you about my problem
Eliza: Why do you say your problem?
Me: I have a problem with depression
Eliza: Can you elaborate on that?
Me: no, I don't wish to discuss it
Eliza: You are being a bit negative

Computational language & AI

Why language engineering?
• Language integral to intelligent systems
• Artificial Intelligence
• Turing Test
• ELIZA
• Expert systems: natural language interface, natural language database

Computational language & AI

Methods shared across systems
• Finite State Transition Networks (FSTN)
• Logic
• Formal rules
• Probability
• Data: you know it!

Applied computational language

History of the field
• Machine Translation: 1960, 1966, post 1966
• Database access
• Text interpretation
• Information retrieval
• Text categorisation

Language engineering

• Information overload
• Need a way of automatically processing text documents
• Information extraction

Language engineering

• Information extraction
• GIDA: system for automatically monitoring financial market sentiment

GIDA

[Figure: chart of actual closing % change and calculated % change (y-axis: % change, -5 to 5) against trading day (x-axis: 1 to 10). Output obtained for the period 1st to 12th July 2002.]

Language engineering

• Information overload
• Need a way of automatically processing text documents
• Information extraction
• Summarisation

Automatic summarisation (courtesy of Paulo FERNANDES de OLIVEIRA, PhD)

• Get information source;

• Extract some content from it;

• Present the most important part to the user.

Lexical Cohesion

Sentence 23:

J&J's stock added 83 cents to $65.49.

Sentence 26:

Flagging stock markets kept merger activity and new stock offerings on the wane, the firm said.

Sentence 42:

Lucent, the most active stock on the New York Stock Exchange, skidded 47 cents to $4.31, after falling to a low at $4.30.

Sentence 15:

"For the stock market this move was so deeply discounted that I don't think it will have a major impact".

Links Example

Text title: U.S. stocks hold some gains.

Collected from Reuters’ Website on 20 March 2002.

Lexical Cohesion

17. In other news, Hewlett-Packard said preliminary estimates showed shareholders had approved its purchase of Compaq Computer -- a result unconfirmed by voting officials.

19. In a related vote, Compaq shareholders are expected on Wednesday to back the deal, catapulting HP into contention against International Business Machines for the title of No. 1 computer company.

Bonds Example

Text title: U.S. stocks hold some gains.

Collected from Reuters’ Website on 20 March 2002.

Language engineering

• Information overload
• Need a way of automatically processing text documents
• Information extraction
• Summarisation
• Translation
• Retrieve only relevant documents
• Voice processing

Language engineering

Two main approaches
• Symbolic
• Stochastic

Case study: spell checkers

Spelling dictionaries

Aim: given a sequence of symbols,
1. identify misspelled strings
2. generate a list of possible ‘candidate’ correct strings
3. select the most probable candidate from the list

Spelling dictionaries

Implementation:
• Probabilistic framework
• Bayesian rule
• Noisy channel model

Spelling dictionaries

Types of spelling error
• Actual word errors
  • /piece/ instead of /peace/
  • /there/ instead of /their/
• Non-word errors
  • /graffe/ instead of /giraffe/
• Of all errors in typewritten texts, 80% are non-word errors

Spelling dictionaries

Non-word errors
• Cognitive errors
  • /seperate/ instead of /separate/
  • A phonetically equivalent sequence of symbols has been substituted due to lack of knowledge about spelling conventions

Spelling dictionaries

Non-word errors
• Cognitive errors
• Typographic (‘typo’) errors
  • Influenced by the keyboard
  • e.g. substitution of /w/ for /e/ due to its adjacency on the keyboard: /thw/ instead of /the/

Spelling dictionaries

Non-word errors: noisy channel model
• The actual word has been passed through a noisy communication channel
• This has distorted the word, thereby changing it in some way
• The misspelled word is the distorted version of the actual word
• Aim: recover the actual word by hypothesising about the possible ways in which it could have been distorted

Spelling dictionaries

Non-word errors: noisy channel model

What are the possible distortions?
• insertion
• deletion
• substitution
• transposition

All of these are viewed as transformations that take place in the noisy channel.
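The four distortions can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function name `single_edits` and the restriction to lowercase ASCII letters are not from the slides. Because the channel has already distorted the word, the sketch works backwards from the typo: a typo caused by an insertion is undone by removing a letter, one caused by a deletion is undone by adding a letter, and so on.

```python
import string

# Assumption: the alphabet is limited to lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def single_edits(typo):
    """Return every string that is exactly one transformation away from `typo`."""
    # Every way of cutting the typo into a left part and a right part.
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]

    # Remove one letter (undoes an insertion made by the channel).
    removed = {left + right[1:] for left, right in splits if right}
    # Add one letter (undoes a deletion made by the channel).
    added = {left + ch + right for left, right in splits for ch in ALPHABET}
    # Replace one letter (undoes a substitution).
    replaced = {left + ch + right[1:]
                for left, right in splits if right
                for ch in ALPHABET}
    # Swap two adjacent letters (undoes a transposition).
    swapped = {left + right[1] + right[0] + right[2:]
               for left, right in splits if len(right) > 1}

    # The typo itself is not a correction, so it is excluded.
    return (removed | added | replaced | swapped) - {typo}
```

For the slide's examples, `single_edits("thw")` contains "the" (one substitution) and `single_edits("graffe")` contains "giraffe" (one added letter). Most of the generated strings are not words at all, which is why a dictionary filter is needed afterwards.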

Spelling dictionaries

Implementing spelling identification and correction algorithm
• STAGE 1: compare each string in the document with a list of legal strings; if there is no corresponding string in the list, mark it as misspelled
• STAGE 2: generate a list of candidates
  • Apply any single transformation to the typo string
  • Filter the list by checking against a dictionary
• STAGE 3: assign probability values to each candidate in the list
• STAGE 4: select the best candidate
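Stage 1 and the filtering step of Stage 2 amount to lookups in a word list. The Python sketch below is a minimal illustration, not the lecturer's implementation: the function names, the tiny `LEXICON`, the tokenising pattern and the hand-written list of raw candidates are all assumptions made for the example.

```python
import re

# Toy list of legal strings, invented for this example.
LEXICON = {"the", "giraffe", "ate", "leaves", "thaw", "tow"}


def find_misspelled(text, lexicon):
    """STAGE 1: return the tokens of `text` that are not in `lexicon`."""
    # Assumption: a token is a run of lowercase letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in lexicon]


def filter_candidates(raw_candidates, lexicon):
    """STAGE 2 (filter step): keep only the strings that are legal words."""
    return sorted(set(raw_candidates) & lexicon)


misspelled = find_misspelled("Thw giraffe ate the leaves", LEXICON)
# Raw candidates written by hand here; in practice they would come from
# applying single transformations to the typo.
candidates = filter_candidates(["the", "thq", "thaw", "tjw", "tow"], LEXICON)
```

Here `misspelled` is `["thw"]` and `candidates` is `["thaw", "the", "tow"]`. A lookup of this kind only catches non-word errors; /piece/ typed for /peace/ passes Stage 1 because both are legal strings.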

Spelling dictionaries

STAGE 3
• Prior probability
  • Given all the words in English, is this candidate more likely to be what the typist meant than that candidate?
  • P(c) = count(c)/N, where count(c) is the number of times candidate c occurs in a corpus and N is the number of words in that corpus
• Likelihood
  • Given the possible errors, or transformations, how likely is it that error y has operated on candidate x to produce the typo?
  • P(t|c), calculated using a corpus of errors, or transformations
• Bayesian rule
  • Get the product of the prior probability and the likelihood: P(c) × P(t|c)
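The reason a simple product is enough can be shown in one line of derivation. The symbols t (the typo) and c (a candidate) follow the slide; the notation \hat{c} for the selected candidate and C for the candidate list is an assumption added here.

```latex
\hat{c} = \arg\max_{c \in C} P(c \mid t)
        = \arg\max_{c \in C} \frac{P(t \mid c)\, P(c)}{P(t)}
        = \arg\max_{c \in C} P(t \mid c)\, P(c)
```

The denominator P(t) is the same for every candidate, so dropping it does not change which candidate scores highest.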

Spelling dictionaries

Non-word errors: implementing spelling identification and correction algorithm
• STAGE 1: identify misspelled words
• STAGE 2: generate list of candidates
• STAGE 3a: rank candidates for probability
• STAGE 3b: select best candidate
• Implement: noisy channel model, Bayesian Rule
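Stages 3a and 3b can be sketched in a few lines of Python for the typo /thw/. Every number below is invented purely for illustration: the corpus size, the word counts and the channel probabilities do not come from any real corpus, and the names `WORD_COUNTS`, `CHANNEL_PROB`, `prior`, `score` and `best_candidate` are hypothetical.

```python
# Toy figures, invented for illustration only.
CORPUS_SIZE = 1_000_000                                  # N
WORD_COUNTS = {"the": 60_000, "thaw": 40, "tow": 120}    # count(c)

# Assumed likelihoods P(t | c): how likely the channel is to turn
# candidate c into the typo "thw". A real system would estimate these
# from a corpus of errors.
CHANNEL_PROB = {"the": 0.0005, "thaw": 0.0002, "tow": 0.00001}


def prior(candidate):
    """P(c): relative frequency of the candidate in the corpus."""
    return WORD_COUNTS[candidate] / CORPUS_SIZE


def score(candidate):
    """P(c) x P(t | c): the product used to rank candidates."""
    return prior(candidate) * CHANNEL_PROB[candidate]


def best_candidate(candidates):
    """STAGE 3b: pick the candidate with the highest score."""
    return max(candidates, key=score)


# STAGE 3a: rank the candidates from most to least probable.
ranking = sorted(["thaw", "tow", "the"], key=score, reverse=True)
```

With these toy figures "the" ranks first, because it is both far more frequent than the other candidates and reachable through a plausible keyboard slip.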

Next week

Finite state machines and regular expressions