Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden...
-
Upload
posy-stafford -
Category
Documents
-
view
221 -
download
0
Transcript of Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden...
![Page 1: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/1.jpg)
Trustworthy Keyword Search for Regulatory Compliant Record Retention
Windsor W. Hsu
IBM Almaden
Marianne Winslett
University of Illinois
Soumyadeb Mitra
University of Illinois
IBM Almaden
![Page 2: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/2.jpg)
There is a need for trustworthy record keeping
HIPAA
Focus on ComplianceFocus on Compliance
DigitalDigitalInformationInformationExplosionExplosion
SoaringSoaringDiscoveryDiscovery
CostsCosts
Spending oneDiscovery Growing
at 65% CAGR
Average F500Company Has 125
Non-FrivolousLawsuits at Any
Given Time
IDC Forecasts60B Business
Emails AnnuallyBy 2006
Instant Messaging
Files
Records
Corporate Corporate MisconductMisconduct
Sources: IDC, Network World (2003), Socha / Gelbmann (2004)
![Page 3: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/3.jpg)
What is trustworthy record keeping?
Alice
Storage Device
time
Bob
QueryRegret
Adversary
CommitRecord
Establish solid proof of events that have occurred
Bob should get back Alice’s data
![Page 4: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/4.jpg)
This leads to a unique threat model
time
Query is trustworthy
Commit is trustworthy
Adversary has super-user privileges
Record is createdproperly
Record is queriedproperly
• Access to storage device• Access to any keys
Adversary could be Alice herself
![Page 5: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/5.jpg)
Traditional schemes do not work
time
Cannot rely on Alice’s signature
![Page 6: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/6.jpg)
WORM storage helps address the problem
New Record Record Overwrite/
Delete
Write Once Read Many
Adversary cannotdelete Alice’s record
![Page 7: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/7.jpg)
Index required due to high volume of records
Query from Index
Alice
time
Bob
Regret
Adversary
Commit RecordUpdate Index
Index
![Page 8: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/8.jpg)
In effect, records can be hidden/altered by modifying the index
Or replace B with B’
The index must also be secured (fossilized)
Hide record B from the
index A B B’B
![Page 9: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/9.jpg)
Most business records are unstructured, searched by inverted index
Query
Data
Base
Worm
Index
1 3 11 17
3 9
3 19
7 36
3
Keywords Posting Lists
One WORM file for each posting list
![Page 10: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/10.jpg)
Index must be updated as new documents arrive
Query
Data
Base
Worm
Index
1 3 11 17
3 9
3 19
7 36
3
Keywords Posting Lists
Data
Index
Query
Doc: 7979
79
79
500 keywords = 500 disk seeks ~1 sec per document
![Page 11: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/11.jpg)
Amortize cost by updating in batch
Query
Data
Base
Worm
Index
1 3 11 17
3 9
3 19
7 36
3
Keywords Posting Lists
Query
Doc: 79
Query
Doc: 82
Query
Doc: 83
79 81 83
Buffer
1 seek per keyword in batch Large buffer to benefit infrequent terms
Over 100,000 documents to achieve 2 docs/sec
Doc: 80Doc: 81
![Page 12: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/12.jpg)
Index is not updated immediately
Alice
Adversary
Index
Buffer
time
OmitAlter
Buffer
Prevailing practice – email must be committed before it is delivered
CommitRecord
![Page 13: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/13.jpg)
Can storage server cache help?
Storage servers have huge cache Data committed into cache is effectively on
disk Is battery backed-up Inside the WORM box, so is trustworthy
![Page 14: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/14.jpg)
Query
Data
Base
Worm
Index
1 3 11 17
3 9
3 19
7 36
3
Base
Index
Query
Doc: 7979
Cache Miss
Caching does not benefit infrequent terms
79 Cache Miss
79 Cache Miss
Query
Doc: 80
80
Cache Hit
Caching works in blocks
![Page 15: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/15.jpg)
Simulation results show caching is not enough
Cache Misses Per Doc
0
50
100
150
200
250
300
350
400
450
500
0 4 8 16 32 64 128
256
512
1024
2048
4096
Cache Size
I/O
Per
Do
c
Cache Miss
![Page 16: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/16.jpg)
Simulation results show caching is not enough
What if number posting lists ≤ Number of cache blocks?
Each update will hit the cache
![Page 17: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/17.jpg)
So, merge posting lists so that the tails blocks fit in cache
Query
Data
Base
Worm
Index
1 3 11
3 9 31
3 19
7 36
3
Only 1 random I/O per document, for 4K block size
00 1 00 3 01 3 10 3
01 9 00 11 10 19 01 31
Document IDs
Keyword Encodings
![Page 18: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/18.jpg)
VLDB 2006 : Query answered by scanning both posting lists
The tradeoff is longer lists to scan during lookup
Workload lookup cost before merging:
∑ tw qw
After merging into A = {A1, …, An} :
∑ ( ∑ tw ) (∑ qw )w € A w € AA
w
length of posting list for keyword w
# of times w is queried in workload
length of A
# of times A is searched
![Page 19: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/19.jpg)
Which lists to merge? We need a heuristic solution
Choose A={A1, A2 .. An} n = Cache blocks Minimize ∑ ( ∑ tw ) * (∑ qw )
Problem is NP-complete Reduction from minimum-sum-of-squares
So, try some merging heuristics on a real-world workload 1 million documents from IBM’s intranet 300,000 queries
![Page 20: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/20.jpg)
A few terms contribute most of the query workload cost
0.E+00
1.E+09
2.E+09
3.E+09
4.E+09
5.E+09
6.E+09
0 5000 10000 15000 20000 25000
Term Rank
Wo
rklo
ad C
ost
QF
TF
(tw *qw)
![Page 21: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/21.jpg)
Different merging heuristics were tried
Separate lists for high contributor terms Merging heuristics
Based on qw tw
Random merging Details of heuristics, evaluation in paper
![Page 22: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/22.jpg)
Additional index support is needed to answer conjunctive queries quickly
2
7
13
24
31
3
24
Merge Join : O (m+n)
24
2
7
13
24
31
3
24
24
Index Join : m log(n)
2
13
31
2
31
VLDB and 2006
VLDB 2006
n
m
![Page 23: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/23.jpg)
How to maintain B+trees on WORM B+trees require node split and join B+tree on posting list are special case
Documents IDs are inserted in an increasing order Can be built bottom up without split/joins Please refer to our paper
![Page 24: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/24.jpg)
B+tree index is insecure, even on WORM
2 4 7 11 13 19 23 29 31
7 13
23
31
25
25 26 30
27
Path to an element depends on elements inserted later – Adversary can attack it
![Page 25: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/25.jpg)
Our solution is jump indexes
Path to an element only depends on elements inserted before
Jump index is provably trustworthy Leverages the fact that document IDs are increasing O(log N) lookup : N - # of documents
Supports range queries too Reasonable performance as compared to B+ trees
for conjunctive queries in experiments with real-workload
For details, see our paper
![Page 26: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/26.jpg)
Conclusions
WORM storage by itself is not enough We need a trustworthy index too
Trustworthy inverted indexes can be built efficiently 10-15% slowdown, for non-conjunctive queries Within 1.5x of optimal B-tree performance, for
conjunctive queries Other possible uses
Indexing time
![Page 27: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/27.jpg)
Future work
Ranking attacks Adversary can attack a lot of junk documents
Secure generic index structure Path to an element is fixed Supports range queries
![Page 28: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/28.jpg)
Questions
![Page 29: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/29.jpg)
A new index structure required: Jump Index
n 0 1 2 3 4 5
Element Pointers
ith pointer points to an element ni
n + 2i ≤ ni < n + 2(i+1)
n+2
n + 2 1 ≤ n+2 < n + 2 2
![Page 30: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/30.jpg)
Jump index in action
1
2
1 + 2 0 ≤ 2 < 1 + 2 1 1 + 2 2 ≤ 5 < 1 + 2 3
0 1 2 3 4
71 + 2 2 ≤ 7 < 1 + 2 3
5 + 2 1 ≤ 7 < 5 + 2 2
5
0 1 2 3 4
Already Set
Follow Pointer
log(N) pointers to N
![Page 31: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/31.jpg)
Path to an element does not depend on future elements
1
2
0 1 2 3 4
7
5
0 1 2 3 4
1 + 2 2 ≤ 7 < 1 + 2 3
Follow Pointer
Start here
5 + 2 1 ≤ 7 < 5 + 2 2
Got 7
Lookup (7)
![Page 32: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/32.jpg)
Jump index elements are stored in blocks
Jump Pointersp entries(0
,1)
(0,2
)
(0,B
-1)
(1,0
)
(0,1
)
( i ,
j )
.. ..l
l + j*Bi ≤ x < l + (j+1)*Bi
Storing pointers with every element is inefficient p entries are grouped together Brach factor B.
Pointer (i,j) from block b points to b’ having smallest x
![Page 33: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/33.jpg)
Jump index evaluation parameters p : Elements grouped together B : Branching factor L : Block size
p + (B-1) logB N ≤ L
Evaluation Index update performance
Pointers have to be set Query performance
![Page 34: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/34.jpg)
Update performance levels off at reasonable cache size
0
10
20
30
40
50
60
128MB
160MB
192MB
224MB
256MB
288MB
320MB
352MB
384MB
Cache Size
I/O
s p
er
Do
c 16
32
64
128
![Page 35: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/35.jpg)
Query performance is close to optimal (Btree)
0
1
2
3
4
5
6
7
2 3 4 5 6
Terms in Query
Sp
ee
du
p
1-8
1-16
1-32
1-64
1-128
"btree"
![Page 36: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/36.jpg)
Btree on WORM
![Page 37: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra.](https://reader035.fdocuments.net/reader035/viewer/2022062515/56649d145503460f949e83a3/html5/thumbnails/37.jpg)
Is this a real threat?
Would someone want to delete a record after a day its created?
Intrusion detection logging Once adversary gain control, he would like to
delete records of his initial attack Record regretted moments after creation
Email best practice - Must be committed before its delivered