N-gram Search Engine on Wikipedia

Satoshi Sekine (NYU), Kapil Dalwani (JHU)

July 30th, 2009

Lexical Knowledge from Ngrams

Hammer: Fast and multi-functional n-gram search engine

Search ngram:

• FAST
• INPUT: token, POS, chunk, NE
• OUTPUT: frequency to text

Characteristics

• Search up to 7-grams with wildcards
• Multi-level input
  – Token, POS, chunk, NE, combinations
  – NOT, OR for POS, chunk, NE
• Multi-level output
  – Token, POS, chunk, NE
  – Document information
  – Original sentences, KWIC, ngram
• Display
  – Show the results in order of frequency
• Running environment
  – Single CPU, PC-Linux, 400MB process, 500GB disk

Demo

• http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2

Available for you

• Web system
  – At NYU
    • http://nlp.cs.nyu.edu/nsearch
  – At JHU?
• USB Hard drive

Implementation: Overview

Search request:

1. Search candidates
2. Filtering
3. Display

Data components:

• Wikipedia text
• Wikipedia POS, chunk, NE
• N-gram data
• Inverted index for n-gram data
• Suffix array for text
• POS, chunk, NE for N-gram data

1. Search candidates

From n-gram to Inverted Index

• Example: 3-grams

Ngram ID   Position=1   Position=2   Position=3
1          A            B            C
2          A            B            B
3          B            A            C

• Posting list

Key        Ngram IDs
A pos=1    1 2
A pos=2    3
B pos=1    3
B pos=2    1 2
B pos=3    2
C pos=3    1 3
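
The construction above can be sketched in a few lines of Python. This is a minimal illustration using the three example 3-grams from the slide, not the Hammer implementation; the function names and the use of `None` as a wildcard marker are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs (positions are 1-based)."""
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):
        for position, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)

def candidates(index, query):
    """IDs of n-grams matching `query`; a None slot is treated as a wildcard (assumption)."""
    lists = [set(index.get((token, position), ()))
             for position, token in enumerate(query, start=1)
             if token is not None]
    if not lists:
        raise ValueError("query needs at least one fixed token")
    return sorted(set.intersection(*lists))

# The three example 3-grams from the slide.
NGRAMS = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}
INDEX = build_inverted_index(NGRAMS)

if __name__ == "__main__":
    for key in sorted(INDEX):
        print(key, INDEX[key])
    print(candidates(INDEX, ("A", "B", None)))
```

Keying each posting list by both token and position is what lets a wildcard slot simply be left out of the intersection.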

Posting list

• Wide variation of posting list size (in 7-gram: 1.27B)
  – “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
  – conscipcuous, consiety, Mizuk (1)
• 3 types for faster speed and smaller index size
  – Bitmap (freq >1%): #EOS# 1.27B bits (bitmap) <-> 3.2B bits (list)
  – List of ngramID
  – Encoded into pointer (freq=1)

Examples:

  List of ngramID:       C pos=3 -> 1 3
  Bitmap:                1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1
  Encoded into pointer:  C pos=3 -> 5
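
The size comparison on this slide can be checked with a short calculation. The sketch below assumes each n-gram ID in a list takes 32 bits (an assumption; the slides do not state the ID width, but it reproduces the 3.2B figure) and rounds the number of 7-grams to 1.27B. The function names are hypothetical.

```python
ID_BITS = 32                      # assumed width of one n-gram ID in a list
TOTAL_7GRAMS = 1_270_000_000      # "1.27B" from the slide, rounded

def list_bits(freq):
    """Bits needed to store a posting list as explicit n-gram IDs."""
    return freq * ID_BITS

def bitmap_bits(total_ngrams):
    """Bits needed for a bitmap: one bit per n-gram, whatever the frequency."""
    return total_ngrams

def choose_representation(freq, total_ngrams, threshold=0.01):
    """Pick one of the three posting-list types named on the slide."""
    if freq == 1:
        return "pointer"          # the single ID is encoded into the pointer itself
    if freq > threshold * total_ngrams:
        return "bitmap"
    return "list"

def to_bitmap(ngram_ids, total_ngrams):
    """Turn a list of 1-based n-gram IDs into a list of 0/1 flags."""
    bits = [0] * total_ngrams
    for ngram_id in ngram_ids:
        bits[ngram_id - 1] = 1
    return bits

if __name__ == "__main__":
    eos = 100_906_888             # posting list size of "#EOS#"
    print(list_bits(eos), bitmap_bits(TOTAL_7GRAMS))
    print(choose_representation(eos, TOTAL_7GRAMS))
```

Under the 32-bit assumption, the “#EOS#” list would need 100,906,888 × 32 = 3,229,020,416 bits, against 1.27B bits for the bitmap.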

Search

• Given an n-gram request (A B C)
  – Get posting lists for A, B and C
  – Search intersections of posting lists
  – Use “look ahead” to speed up the search
    • Look ahead size = Sqrt(size of posting list), Moffat and Zobel (1996)

Example posting lists (look ahead lets the search SKIP entries in the longer list):

  4 33 34 55 76 80 89 92 99
  4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
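
The intersection with look ahead can be sketched as follows. This is a simplified illustration of the skipping idea with a look-ahead step of sqrt(list size), using the two example lists from the slide; it is not the Hammer code, and the function names are hypothetical.

```python
import math

def intersect_with_skips(short, long):
    """Intersect two sorted ID lists, jumping through `long` in steps of ~sqrt(len(long))."""
    skip = max(1, math.isqrt(len(long)))
    result = []
    j = 0
    for target in short:
        # Look ahead: jump while the entry `skip` positions ahead is not past the target.
        while j + skip < len(long) and long[j + skip] <= target:
            j += skip
        # Finish with a short linear scan inside the current block.
        while j < len(long) and long[j] < target:
            j += 1
        if j < len(long) and long[j] == target:
            result.append(target)
    return result

def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest (needs at least one list)."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_with_skips(result, other)
    return result

LIST_A = [4, 33, 34, 55, 76, 80, 89, 92, 99]
LIST_B = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]

if __name__ == "__main__":
    print(intersect_with_skips(LIST_A, LIST_B))
```

Starting from the shortest list keeps the intermediate result small, so each later list is probed only for the IDs that are still candidates.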

2. Filtering

Filtering

• Not all candidate ngram IDs match the request
• We need frequency and sentence information for the matched n-grams
• POS, chunk and NE information is represented as IDs
  – Reduces the index by more than 200GB

[Figure: n-gram “A B” with Freq=123; labels NN, VB, PERSON, LOC; entries with Freq=10 and Freq=5]
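
A filtering step of this kind might look like the sketch below. Everything in it is illustrative: the label-to-ID dictionary, the `Variant` records, the sample data and the query helpers `any_of` / `none_of` are assumptions made for the example, since the slides do not describe the actual storage layout or query syntax.

```python
from typing import NamedTuple, Optional

# Hypothetical dictionary: each POS / NE label is stored as a small integer ID.
LABEL_ID = {"NN": 0, "VB": 1, "PERSON": 2, "LOC": 3, "O": 4}

class Variant(NamedTuple):
    pos: tuple   # POS label IDs, one per token
    ne: tuple    # NE label IDs, one per token
    freq: int    # how often the n-gram occurs with these labels

class Slot(NamedTuple):
    allowed: Optional[frozenset] = None   # OR: any of these label IDs
    forbidden: frozenset = frozenset()    # NOT: none of these label IDs

    def accepts(self, label_id):
        if self.allowed is not None and label_id not in self.allowed:
            return False
        return label_id not in self.forbidden

ANY = Slot()

def any_of(*labels):
    return Slot(allowed=frozenset(LABEL_ID[name] for name in labels))

def none_of(*labels):
    return Slot(forbidden=frozenset(LABEL_ID[name] for name in labels))

def filter_candidates(candidate_ids, variants, pos_query, ne_query):
    """Keep candidates with at least one variant satisfying every slot; sum matching freqs."""
    hits = []
    for ngram_id in candidate_ids:
        freq = sum(
            v.freq for v in variants.get(ngram_id, ())
            if all(s.accepts(p) for s, p in zip(pos_query, v.pos))
            and all(s.accepts(n) for s, n in zip(ne_query, v.ne))
        )
        if freq:
            hits.append((ngram_id, freq))
    return hits

NN, VB, PERSON, LOC, O = (LABEL_ID[k] for k in ("NN", "VB", "PERSON", "LOC", "O"))
# Made-up sample data: 2-gram ID -> labelled variants.
VARIANTS = {
    1: [Variant((NN, NN), (PERSON, O), 10), Variant((NN, VB), (LOC, O), 5)],
    2: [Variant((VB, NN), (O, O), 7)],
}
```

For instance, `filter_candidates([1, 2], VARIANTS, [ANY, none_of("VB")], [ANY, ANY])` keeps only variants whose second token is not a verb. Comparing small integer IDs instead of label strings is one way the compact ID representation mentioned above can be used at query time.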

3. Display

Display

• N-grams are displayed in descending order of frequency
  – N-gram IDs are ordered by frequency
• Sentences are searched using a suffix array
• POS, chunk, NE are displayed with sentence, KWIC, ngram
• Doc ID, title of Wikipedia (and possibly other features of the doc) are displayed with sentences and KWIC
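
The two ideas on this slide, frequency-ordered IDs and suffix-array lookup of sentences, can be illustrated with a small sketch. It uses a naive token-level suffix array on a toy text and is only meant to show the technique; the function names, the toy data and the choice that ID 1 is the most frequent n-gram are assumptions.

```python
def assign_ids_by_frequency(counts):
    """Give ID 1 to the most frequent n-gram, so sorting by ID is sorting by frequency."""
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: i for i, gram in enumerate(ranked, start=1)}

def build_suffix_array(tokens):
    """Naive suffix array over tokens: start positions sorted by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, query):
    """Binary-search the suffix array for all start positions of `query`."""
    n = len(query)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                       # first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                       # first suffix whose prefix is > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

def kwic(tokens, position, n, width=3):
    """Format one hit as keyword-in-context with `width` tokens on each side."""
    left = " ".join(tokens[max(0, position - width):position])
    key = " ".join(tokens[position:position + n])
    right = " ".join(tokens[position + n:position + n + width])
    return f"{left} [{key}] {right}"

TOKENS = "the cat sat on the mat and the cat slept".split()
SUFFIX_ARRAY = build_suffix_array(TOKENS)

if __name__ == "__main__":
    for pos in find_occurrences(TOKENS, SUFFIX_ARRAY, ["the", "cat"]):
        print(kwic(TOKENS, pos, 2))
```

Because all suffixes that begin with the query are adjacent in the suffix array, two binary searches are enough to find every occurrence without scanning the text.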

Size of data

Components: Wikipedia text; Wikipedia POS, chunk, NE; N-gram data; inverted index for n-gram data; suffix array for text; POS, chunk, NE for N-gram data; others

Sizes shown: 108 GB, 6 GB, 8 GB, 8 GB, 260 GB, 100 GB, 40 GB (others)

Total: 530GB

Text: 1.7G words, 200M sentences, 2.4M articles

Ngram: 1: 8M, 2: 93M, 3: 377M, 4: 733M, 5: 1.00B, 6: 1.17B, 7: 1.27B

Future Work

• Other information (ex: parse, coref, relation, genre, discourse…)

• Longer n-gram
• Compress index, dictionary
• Ease the indexing load
  – Now we need a big memory machine
  – Distributing indexing

• Union operation for tokens

Available for you

• Web demo
  – At NYU
    • http://nlp.cs.nyu.edu/nsearch
  – At JHU?
• USB Hard drive