N-gram Search Engine on Wikipedia

Satoshi Sekine (NYU), Kapil Dalwani (JHU)

July 30th, 2009

Lexical Knowledge from Ngrams

Hammer: a fast, multi-functional n-gram search engine

Search n-grams:

INPUT: token, POS, chunk, NE

OUTPUT: frequency to text


Characteristics

• Search up to 7-grams with wildcards
• Multi-level input
– Token, POS, chunk, NE, combinations
– NOT, OR for POS, chunk, NE
• Multi-level output
– Token, POS, chunk, NE
– Document information
– Original sentences, KWIC, ngram
• Display
– Show the results in order of frequency
• Running environment
– Single CPU, PC-Linux, 400MB process, 500GB disk


• http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2


Available for you

• Web system
– At NYU
  • http://nlp.cs.nyu.edu/nsearch
– At JHU?
• USB hard drive


Implementation: Overview

• Processing steps for a search request
– 1. Search candidates
– 2. Filtering
– 3. Display
• Data and indexes
– Wikipedia text
– Wikipedia POS, chunk, NE
– N-gram data
– Inverted index for n-gram data
– Suffix array for text
– POS, chunk, NE for n-gram data


From n-gram to Inverted Index

• Example: 3-grams

Ngram ID   Position=1   Position=2   Position=3
1          A            B            C
2          A            B            B
3          B            A            C

• Posting list

A pos=1: 1, 2
A pos=2: 3
B pos=1: 3
B pos=2: 1, 2
B pos=3: 2
C pos=3: 1, 3
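The mapping from the example 3-grams to position-specific posting lists can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the Hammer implementation: the function names and the "*" wildcard syntax are assumptions made for the example.

```python
from collections import defaultdict

# The three example 3-grams from the slide, keyed by n-gram ID.
NGRAMS = {
    1: ("A", "B", "C"),
    2: ("A", "B", "B"),
    3: ("B", "A", "C"),
}

def build_inverted_index(ngrams):
    """Map each (token, position) pair to the sorted list of n-gram IDs containing it."""
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):
        for position, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)

def lookup(index, pattern):
    """Return IDs of n-grams matching the pattern.

    "*" marks a wildcard slot (hypothetical syntax): that position simply
    contributes no posting list to the intersection.
    """
    lists = [
        set(index.get((token, position), []))
        for position, token in enumerate(pattern, start=1)
        if token != "*"
    ]
    if not lists:
        return []  # an all-wildcard pattern is not handled in this sketch
    return sorted(set.intersection(*lists))

INDEX = build_inverted_index(NGRAMS)
print(INDEX[("B", 2)])                  # [1, 2]
print(lookup(INDEX, ("A", "*", "C")))   # [1]
```

Because each posting list is keyed by token and position, a wildcard needs no special index: the request "A * C" intersects only the lists for A pos=1 and C pos=3.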


Posting list

• Wide variation in posting list size (7-grams: 1.27B n-grams)
– “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
– “conscipcuous”, “consiety”, “Mizuk” (1 each)
• 3 types of posting list, for faster search and a smaller index
– Bitmap (freq > 1%): for “#EOS#”, 1.27B bits (bitmap) vs. 3.2B bits (list)
  1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1
– List of ngramIDs
  C pos=3: 1, 3
– Encoded into the pointer (freq = 1)
  C pos=3: 5
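The choice among the three posting list types, and the size comparison quoted for “#EOS#”, can be illustrated with a short Python sketch. The 32 bits per n-gram ID is an assumption (it is consistent with 100,906,888 IDs taking about 3.2B bits), as are the function names and the use of 1-based IDs in the bitmap.

```python
TOTAL_7GRAMS = 1_270_000_000  # approximate number of 7-grams, from the slide
BITS_PER_ID = 32              # assumption: 100,906,888 IDs * 32 bits is about 3.2B bits

def choose_representation(list_size, total_ngrams=TOTAL_7GRAMS, threshold=0.01):
    """Pick a posting list type from the list size, following the slide's three cases."""
    if list_size == 1:
        return "pointer"   # the single n-gram ID is stored where the pointer would be
    if list_size / total_ngrams > threshold:
        return "bitmap"    # one bit per n-gram is cheaper for very frequent tokens
    return "list"          # plain list of n-gram IDs

def list_size_bits(list_size, bits_per_id=BITS_PER_ID):
    """Size of a plain ID list in bits."""
    return list_size * bits_per_id

def to_bitmap(ngram_ids, total_ngrams):
    """Turn a list of 1-based n-gram IDs (an assumption) into a bitmap."""
    bits = [0] * total_ngrams
    for ngram_id in ngram_ids:
        bits[ngram_id - 1] = 1
    return bits

print(choose_representation(100_906_888))   # bitmap
print(list_size_bits(100_906_888))          # 3229020416
print(to_bitmap([1, 5, 6, 8, 13, 16], 16))
```

For “#EOS#” the bitmap costs one bit per 7-gram (about 1.27B bits) regardless of how many n-grams contain the token, which is why it beats the list once a token is frequent enough.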


Search

• Given an n-gram request (A B C)
– Get posting lists for A, B and C
– Search intersections of posting lists
– Use “look ahead” to speed up the search
• Look-ahead size = sqrt(size of posting list), following Moffat and Zobel (1996)

4 33 34 55 76 80 89 92 99

4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
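A minimal Python sketch of intersecting the two example lists with a look-ahead of sqrt(list size) follows. It is written in the spirit of the skipping idea cited on the slide, not taken from the Hammer code; the function name and the in-memory list layout are assumptions, and the IDs in each list are assumed to be sorted and unique.

```python
import math

def intersect_with_lookahead(short, long):
    """Intersect two sorted lists of unique IDs, skipping through the longer one.

    The look-ahead step is floor(sqrt(len(long))): before scanning linearly,
    jump ahead by whole steps while the element one step ahead is still
    no larger than the ID being searched for.
    """
    step = max(1, math.isqrt(len(long)))
    result = []
    i = 0  # current position in the long list
    for target in short:
        # Look ahead: everything before long[i + step] is smaller than target.
        while i + step < len(long) and long[i + step] <= target:
            i += step
        # Linear scan inside the current block.
        while i < len(long) and long[i] < target:
            i += 1
        if i == len(long):
            break  # the long list is exhausted
        if long[i] == target:
            result.append(target)
    return result

LIST_A = [4, 33, 34, 55, 76, 80, 89, 92, 99]
LIST_B = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
print(intersect_with_lookahead(LIST_A, LIST_B))  # [4, 33, 76, 89]
```

With 16 entries in the longer list the step is 4, so each probe skips at most a few blocks and then scans at most one block linearly. For a three-token request the same routine would be applied pairwise, ideally starting from the shortest lists.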

[Overview diagram, repeated: steps 1. Search candidates and 2. Filtering]

Filtering

• Not all candidate ngramIDs match the request
• We need frequency and sentence information for the matched n-grams
• POS, chunk and NE information is represented as IDs
– Reduces the index by more than 200GB

[Figure: n-gram “A B” with Freq=123; annotated variants, including PERSON, with Freq=10 and Freq=5]
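The filtering step can be pictured with a small Python sketch. Everything concrete in it is hypothetical: the NE tag IDs, the annotation table layout (each n-gram ID mapped to its annotated variants and their frequencies) and the constraint format are assumptions chosen to mirror the NOT / OR options and the frequency figure on the slides, not the actual Hammer data structures.

```python
# Hypothetical NE tag inventory, stored as small integer IDs.
NE_IDS = {"O": 0, "PERSON": 1, "LOCATION": 2}

# Hypothetical annotation table: n-gram ID -> list of (NE ID per token, frequency).
# N-gram 7 stands for "A B" with a total frequency of 123 across its variants.
ANNOTATIONS = {
    7: [((1, 0), 10), ((2, 0), 5), ((0, 0), 108)],
    9: [((0, 0), 42)],
}

def slot_ok(tag_id, allowed=None, forbidden=None):
    """Check one token slot: 'allowed' models OR, 'forbidden' models NOT."""
    if allowed is not None and tag_id not in allowed:
        return False
    if forbidden is not None and tag_id in forbidden:
        return False
    return True

def filter_candidates(candidate_ids, annotations, constraints):
    """Keep candidates with at least one matching variant; return their matched frequency.

    constraints holds one dict per token slot; an empty dict means "anything".
    """
    matched = {}
    for ngram_id in candidate_ids:
        freq = sum(
            variant_freq
            for tags, variant_freq in annotations.get(ngram_id, [])
            if all(slot_ok(tag, **rule) for tag, rule in zip(tags, constraints))
        )
        if freq > 0:
            matched[ngram_id] = freq
    return matched

person_first = [{"allowed": {NE_IDS["PERSON"]}}, {}]
print(filter_candidates([7, 9], ANNOTATIONS, person_first))  # {7: 10}
```

Comparing integer IDs instead of tag strings keeps both the table and the check small, which is the point of storing the annotation as IDs.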

[Overview diagram, repeated: steps 2. Filtering and 3. Display]

Display

• N-grams are displayed in descending order of frequency
– N-gram IDs are ordered by frequency
• Sentences are searched using a suffix array
• POS, chunk, NE are displayed with sentence, KWIC, ngram
• Doc ID and title of the Wikipedia article (and possibly features of the doc) are displayed with sentences and KWIC
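How a suffix array supports sentence and KWIC lookup can be shown with a toy Python example. It is a sketch under simplifying assumptions: the text is a short token list held in memory, the suffix array is built by naive sorting (far too slow for 1.7G words), and the function names and bracketed KWIC format are invented for the illustration.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, query):
    """Binary-search the suffix array for every position where query starts."""
    n = len(query)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

def kwic(tokens, position, length, width=3):
    """Format one hit as keyword-in-context with `width` tokens on each side."""
    left = " ".join(tokens[max(0, position - width):position])
    key = " ".join(tokens[position:position + length])
    right = " ".join(tokens[position + length:position + length + width])
    return f"{left} [{key}] {right}"

TOKENS = "the cat sat on the mat and the cat slept".split()
SUFFIX_ARRAY = build_suffix_array(TOKENS)
HITS = find_occurrences(TOKENS, SUFFIX_ARRAY, ["the", "cat"])
print(HITS)                      # [0, 7]
print(kwic(TOKENS, HITS[1], 2))  # the mat and [the cat] slept
```

All occurrences of an n-gram sit in one contiguous range of the suffix array, so two binary searches locate every hit without scanning the text.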


Size of data

Text: 1.7G words, 200M sentences, 2.4M articles

N-gram counts:
n       1    2     3      4      5       6       7
count   8M   93M   377M   733M   1.00B   1.17B   1.27B

[Diagram: Wikipedia text; n-gram data; suffix array for text; POS, chunk, NE for n-gram data; others. Size labels: 108 GB, 260 GB, 100 GB]

Total: 530 GB


Future Work

• Other information (e.g. parse, coref, relation, genre, discourse…)
• Longer n-grams
• Compress index, dictionary
• Ease the indexing load
– Currently a machine with large memory is needed
– Distributed indexing
• Union operation for tokens


Available for you

• Web demo
– At NYU
  • http://nlp.cs.nyu.edu/nsearch
– At JHU?
• USB hard drive
