An Efficient Language Model Using Double-Array Structures
![Page 1: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/1.jpg)
An Efficient Language Model
Using Double-Array Structures
Makoto Yasuhara, Toru Tanaka
Jun-ya Norimatsu, Mikio Yamamoto
University of Tsukuba, Japan
EMNLP 2013
![Page 2: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/2.jpg)
Introduction(1)
Bigger and Bigger LMs
Have you ever encountered these problems?
LMs cannot be loaded into memory because of their size
The query speed of LMs becomes a bottleneck of your system
Store compactly, query fast!
![Page 3: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/3.jpg)
Our System Overview
We call our LM “DALM”
• LM implementation based on double-array structures
• Modified double-array structure to store backward suffix trees
• Two optimization methods to improve efficiency
![Page 4: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/4.jpg)
Double-Array Structures
(Aoe, 1989)
A fast and compact representation of a trie
What is a double-array structure?
A trie is represented by two arrays (BASE and CHECK)
[Figure: abstract image of a trie (ROOT with child nodes A and B) beside its double-array representation (BASE and CHECK)]
![Page 5: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/5.jpg)
2D Array Implementation of a Trie
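[Figure: a trie with nodes 1 to 7 and edges labeled A, B, C, beside its 2D transition table (one row per Node#, one column per symbol A, B, C); most cells are empty, so the table is a sparse array]
Simple and fast but consumes a lot of memory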
![Page 6: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/6.jpg)
Compact Representation of a Sparse 2D Array
[Figure: the rows of the sparse 2D table (Node# 1 to 7, columns A, B, C) are shifted (Shift 3, Shift 3, Shift 4) and merged into one array, Merged-NEXT: 2 3 4 6 5 7]
Merging alone causes information loss!
The double-array structure modifies this merged array so that it includes all information about the original trie
![Page 7: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/7.jpg)
Details of Double-Array Structures
(Aoe, 1989)
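[Figure: definition of BASE and CHECK, with an example trie (ROOT, edges labeled A, B, C) and its arrays over indices 0 to 7. Extracted cell values, alignment lost: BASE 0 3 3 0 0 4 0; CHECK 0 0 2 3 2 6]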
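The definition on this slide did not survive as text. The usual double-array transition rule (Aoe, 1989) is that a child `t` of node `s` on a symbol with code `c` satisfies `t = BASE[s] + c` and `CHECK[t] = s`. The Python sketch below illustrates that rule with a deliberately naive builder. The symbol codes, the keys, the `FREE` marker, and the function names are assumptions made for illustration, and the arrays it produces are not the ones shown on the slide.

```python
from collections import deque

FREE = -1  # marks an unused CHECK slot (an assumption of this sketch)


def build_double_array(keys, codes):
    """Naive builder: give each node the first offset where all its children fit."""
    # Build an ordinary nested-dict trie first.
    trie = {}
    for key in keys:
        node = trie
        for sym in key:
            node = node.setdefault(sym, {})

    base = [0, 0]
    check = [FREE, 0]  # slot 0 is never used; slot 1 is ROOT
    queue = deque([(trie, 1)])
    while queue:
        node, s = queue.popleft()
        if not node:
            continue  # leaf: BASE stays 0
        child_codes = [codes[sym] for sym in node]
        b = 1
        while True:
            need = b + max(child_codes) + 1
            if len(check) < need:  # grow both arrays together
                grow = need - len(check)
                base.extend([0] * grow)
                check.extend([FREE] * grow)
            if all(check[b + c] == FREE for c in child_codes):
                break
            b += 1
        base[s] = b
        for sym, child in node.items():
            t = b + codes[sym]
            check[t] = s  # CHECK records the parent, which avoids information loss
            queue.append((child, t))
    return base, check


def has_path(base, check, codes, key):
    """Follow `key` from ROOT using t = BASE[s] + c and CHECK[t] == s."""
    s = 1
    for sym in key:
        t = base[s] + codes[sym]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return True


# Hypothetical symbol codes and keys, not the example on the slide.
CODES = {"A": 1, "B": 2, "C": 3}
BASE, CHECK = build_double_array(["BA", "BCC", "CB"], CODES)
print(has_path(BASE, CHECK, CODES, "BCC"))  # True
print(has_path(BASE, CHECK, CODES, "CC"))   # False
```

The second query lands on an occupied slot, but `CHECK` shows that the slot belongs to a different parent, so the lookup is rejected. A production builder would search for free offsets far more efficiently than this linear scan.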
![Page 8: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/8.jpg)
Efficient Trie Representations for Ngram Model
Backward suffix trees (Bell et al., 1990; Stolcke, 2002; Germann et al., 2009)
• History words are stored in reverse order
• Target words are stored in separate lists
• Efficient back-off
[Figure: a backward suffix tree (ROOT; history nodes A, B, C; target-word lists over X, Y, Z), annotated "The B node is not found" to illustrate back-off]
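The back-off behavior described on this slide can be sketched with an ordinary pointer-based tree. This is a minimal illustration, not the DALM implementation: the `Node` class, the function names, and all log-probabilities and back-off weights are hypothetical. History words are followed newest first, each node keeps its own target list, and when a history node is missing the walk stops and the query falls back to shorter contexts.

```python
class Node:
    def __init__(self):
        self.children = {}  # previous history word -> Node
        self.targets = {}   # target word -> log10 probability
        self.bow = 0.0      # log10 back-off weight of this context


def context_node(root, history):
    """Return (creating if needed) the node for `history`, stored newest word first."""
    node = root
    for word in reversed(history):
        node = node.children.setdefault(word, Node())
    return node


def score(root, history, target):
    """log10 P(target | history) with back-off to shorter histories."""
    path = [root]
    node = root
    for word in reversed(history):
        node = node.children.get(word)
        if node is None:
            break  # e.g. "the B node is not found": stop at the longest known context
        path.append(node)
    total = 0.0
    for node in reversed(path):  # longest context first
        if target in node.targets:
            return total + node.targets[target]
        total += node.bow  # pay the back-off weight and shorten the context
    raise KeyError(target)


# Hypothetical model: all values are made up for illustration.
root = Node()
root.targets.update({"X": -1.0, "Y": -1.5})
a = context_node(root, ["A"])
a.bow = -0.25
a.targets["X"] = -0.5
ba = context_node(root, ["B", "A"])
ba.bow = -0.125
ba.targets["Y"] = -0.25

print(score(root, ["B", "A"], "Y"))  # -0.25
print(score(root, ["B", "A"], "X"))  # -0.625
print(score(root, ["C", "A"], "Y"))  # -1.75
```

Because the most recent history word sits closest to the root, every shorter context needed for back-off is already on the path that was walked, so no second lookup is required.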
![Page 9: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/9.jpg)
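Endmarker Symbols for Backward Suffix Trees
Endmarker symbols (Aoe, 1989) are placed after history words
The target word follows the endmarker symbol
[Figure: the backward suffix tree (history nodes A, B, C; target words X, Y, Z) before and after endmarker symbols (#) are inserted between history words and target words]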
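One way to read the endmarker rule is as a recipe for turning an n-gram into a single trie key: the history in reverse order, then the endmarker, then the target word. The sketch below assumes that reading; the function name `ngram_to_key` and the constant `ENDMARKER` are hypothetical.

```python
ENDMARKER = "#"


def ngram_to_key(history, target):
    """Reversed history words, then the endmarker, then the target word."""
    return list(reversed(history)) + [ENDMARKER, target]


print(ngram_to_key(["B", "A"], "X"))  # ['A', 'B', '#', 'X']
print(ngram_to_key([], "X"))          # ['#', 'X']
```

With keys of this shape, history words and target words live in one trie, and the endmarker edge separates a context node from its target list.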
![Page 10: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/10.jpg)
Double-array Representation of
Backward Suffix Trees
Endmarker symbols are treated as words
A word ID is assigned to the endmarker symbol
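[Figure: example backward suffix tree (ROOT; B, X, Y, Z) with its double array over indices 0 to 7. Extracted cell values, alignment lost: BASE 0 2 4 0 0 4 0; CHECK 0 2 2 3 3 3]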
![Page 11: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/11.jpg)
Double-array Language Model:
Simple Structures
Introducing a VALUE array
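The VALUE array contains the corresponding probabilities and back-off weights (BOW)
[Figure: BASE, CHECK, and VALUE arrays over indices 0 to 7 for a tree with nodes ROOT, A, B, #, X. Extracted cell values, alignment lost: BASE 0 2 5 4; CHECK 0 3 6]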
![Page 12: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/12.jpg)
Double-array Language Model:
Embedding structures (1)
Filling unused slots with values
[Figure: the same BASE and CHECK arrays (indices 0 to 7; nodes ROOT, A, B, #, X) with the unused slots highlighted]
These empty slots are used to store values
![Page 13: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/13.jpg)
Double-array Language Model:
Embedding structures (2)
Using the BASE and CHECK arrays to store values
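[Figure: BASE and CHECK arrays over indices 0 to 7 (nodes A, B, #, X) pointing into a VALUE array. Extracted cell values, alignment lost: BASE 0 2 5 4; CHECK 0 3 -2 6]
Index of the VALUE array with a negative sign
Lossless quantization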
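The negative-sign convention can be sketched as follows. The slide does not make clear which slots carry the negated index, so this example simply assumes that a slot whose BASE entry is not needed as an offset holds `-i`, where `i` is an index into VALUE. The function name and all array contents are hypothetical.

```python
def read_value(base, value, node):
    """Return the value stored for `node`, or None if BASE holds an ordinary offset."""
    b = base[node]
    if b < 0:
        return value[-b]  # a negative entry is an index into VALUE
    return None


# Hypothetical arrays. VALUE[0] is padding because index 0 cannot carry a sign.
BASE = [0, 2, 5, -1, 4, -2]
VALUE = [None, -0.5, -1.25]

print(read_value(BASE, VALUE, 3))  # -0.5
print(read_value(BASE, VALUE, 5))  # -1.25
print(read_value(BASE, VALUE, 1))  # None
```

The sign bit tells offsets and value indices apart at no extra storage cost, which is why no separate flag array appears in this sketch.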
![Page 14: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/14.jpg)
Double-array Language Model:
Ordering method (1)
Tuning for word IDs
We assign word IDs in order of unigram probability
| P(Word) | Word | Word ID |
|---|---|---|
| - | # | 1 |
| 0.0413 | B | 2 |
| 0.0300 | X | 3 |
| 0.0284 | A | 4 |
| 0.0201 | Y | 5 |
| 0.0101 | C | 6 |
| 0.0050 | Z | 7 |
| 0.0020 | D | 8 |

Sort the words in order of descending probability
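The ID assignment in the table above can be reproduced with a short sketch. The function name `assign_word_ids` is hypothetical, and giving the endmarker ID 1 follows the table rather than any stated rule.

```python
def assign_word_ids(unigram_probs, endmarker="#"):
    """Give the endmarker ID 1, then number words by descending unigram probability."""
    ordered = sorted(unigram_probs, key=lambda w: -unigram_probs[w])
    ids = {endmarker: 1}
    for word_id, word in enumerate(ordered, start=2):
        ids[word] = word_id
    return ids


# Unigram probabilities copied from the table on this slide.
probs = {"A": 0.0284, "B": 0.0413, "C": 0.0101, "D": 0.0020,
         "X": 0.0300, "Y": 0.0201, "Z": 0.0050}
word_ids = assign_word_ids(probs)
print(word_ids["B"], word_ids["X"], word_ids["D"])  # 2 3 8
```

Frequent words receive small IDs, so the slots they occupy cluster near the BASE offsets, which is the intuition behind the ordering method.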
![Page 15: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/15.jpg)
Double-array Language Model:
Ordering method (2)
Modifying the 2D array
[Figure: the 2D transition table (Node# 1 to 4) before and after ordering; after ordering, the columns follow the word-ID order # B X A Y C Z D]
![Page 16: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/16.jpg)
Experiments: Datasets
| Model | Corpus size [words] | Unique types [words] | N-grams (unigrams to 5-grams) |
|---|---|---|---|
| 100 Mwords | 100 M | 195 K | 31 M |
| 5 Gwords | 5 G | 2,140 K | 936 M |
| Test set | 100 M | 198 K | - |

Data source: Publication of unexamined Japanese patent applications
Distributed with the NTCIR 3,4,5,6 patent retrieval task
(Iwayama et al., 2003; Fujii et al., 2004;2005;2007)
![Page 17: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/17.jpg)
Comparison: Proposed Methods
Results for 100-Mword corpus
![Page 18: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/18.jpg)
Division Method
Building a large double-array structure needs a lot of time
It is impractical to wait for the 5-Gword model to get built
Dividing the trie into several parts (Nakamura and Mochizuki, 2006)
[Figure: a trie (ROOT; nodes A, C, #) divided into several tries, each with its own ROOT]
![Page 19: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/19.jpg)
Experiments: Division Methods
Results for 100-Mword corpus
![Page 20: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/20.jpg)
Experiments: Other Methods
Results for 100-Mword and 5-Gword corpora
![Page 21: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/21.jpg)
Discussion
DALM is smaller and faster than KenLM Probing
The smallest LM is KenLM Trie
The differences between KenLM Probing and DALM are
smaller for the 5-Gword model than for the 100-Mword model
Large language models require shorter back-off time
![Page 22: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/22.jpg)
Conclusion
We proposed an efficient language model using double-array structures
We proposed two optimization methods: embedding and ordering
In experiments, DALM achieved the best speed among the compared
LMs while keeping a modest model size.
• Double-array structures are a fast and compact representation of tries
• We use double-array structures to represent backward suffix trees
• Embedding: using empty slots in the double-array to store values
• Ordering: tuning word IDs to make LMs smaller and faster
![Page 23: An Efficient Language Model Using Double-Array Structures](https://reader033.fdocuments.net/reader033/viewer/2022042701/55af7e971a28ab2d368b461b/html5/thumbnails/23.jpg)
Questions…
My English skills are limited
Please speak slowly if you have any questions.