Interactive Document Indexing Method Based on Explicit Semantic Analysis (PART I: INDEX)

Andrzej Janusz, Wojciech Swieboda, Adam Krasuski, Hung Son Nguyen
Faculty of Mathematics, Informatics and Mechanics, The University of Warsaw
RSCTC 2012

Dec 19, 2013
Hee-gook Jun


Outline

- Introduction
- Index Structure
- Index Optimization


Index

- Index
- Inverted Index


Basic Concept: Disk Access

- Sequential Access
- Random Access


Basic Concept: Disk IO

[Figure: READ X. The disk block containing x is transferred by block I/O from disk into shared memory, and x is then copied into program memory.]


Sample Table

Singer data
- Tuples: 20
- 1 block = 4 tuples (5 blocks)

Name    Gender  DebutDate  MgtCode
강민준  M       20010105   A0015
강서연  F       20020206   A0065
고민재  M       20030506   A0035
고민서  F       20040607   A0065
고지훈  M       20050708   A0054
구준서  M       20060901   A0065
금수빈  F       20071015   A0098
…       …       …          …


Create Index

CREATE INDEX ON Singer (name);
- RowID = block address + offset

B-tree leaf node:

Index Key  RowID
강민준     AAPLO/AAUAAAAPAAA
강서연     AAPLO/AAUAAAAPAAD
고민재     AAPLO/AAUAAAAPAAH
고민서     AAPLO/AAUAAAAPAAI
고지훈     AAPLO/AAUAAAAPAAJ
구준서     AAPLO/AAUAAAAPAB
금수빈     AAPLO/AAUAAAAPAAE
…          …

[Figure: B-tree over name; keys shown in the tree: 안성민, 강민준, 김현우, 나민지]
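As a rough illustration of "RowID = block address + offset", the Python sketch below builds sorted (index key, RowID) pairs from the five sample disk blocks used in the query slides. The `(block_no, offset)` tuple is an assumed stand-in for a RowID, not Oracle's actual RowID encoding, and `build_index` is a hypothetical helper name.

```python
# Disk layout of the sample Singer table: block number -> names in that block.
BLOCKS = {
    1: ["안성민", "마현준", "소지민", "박승민"],
    2: ["박지원", "노예은", "고지훈", "금수빈"],
    3: ["나민지", "여유진", "김서현", "김현우"],
    4: ["고민서", "배민성", "소동현", "강민준"],
    5: ["고민재", "서수민", "구준서", "강서연"],
}


def build_index(blocks):
    """Return (key, rowid) entries sorted by key.

    rowid is modelled as (block_no, offset); this is an assumption made
    for illustration, not a real RowID format.
    """
    entries = [
        (name, (block_no, offset))
        for block_no, names in blocks.items()
        for offset, name in enumerate(names)
    ]
    return sorted(entries)


index = build_index(BLOCKS)
first_key, first_rowid = index[0]
print(first_key, first_rowid)
```

Because the entries are sorted by key while the table rows are not, neighbouring index entries can point into different blocks, which is what the later random-access slides are concerned with.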


Point Query: Without Index

SELECT * FROM Singer WHERE name = '구준서'
- 20 tuples, 5 blocks (1 block = 4 tuples)
- Block reads = 5

Disk blocks:
Block 5: 고민재 (M), 서수민 (F), 구준서 (M), 강서연 (F)
Block 4: 고민서 (M), 배민성 (M), 소동현 (M), 강민준 (M)
Block 3: 나민지 (F), 여유진 (F), 김서현 (F), 김현우 (M)
Block 2: 박지원 (M), 노예은 (F), 고지훈 (M), 금수빈 (F)
Block 1: 안성민 (M), 마현준 (M), 소지민 (F), 박승민 (M)

[Figure: all five disk blocks are read into memory and scanned; the matching tuple 구준서 is returned.]


Point Query: With Index

SELECT * FROM Singer WHERE name = '구준서'
- 20 tuples, 5 blocks (1 block = 4 tuples)
- Block reads = 1

[Figure: the index entry for 구준서 (RowID AAPLO/AAUAAAAPAB) points into disk block 5 (고민재, 서수민, 구준서, 강서연); only that block is read into memory, and the tuple 구준서 is returned.]
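The two point-query cases can be compared with a small Python sketch that counts table block reads. It assumes the five-block layout from the slides; the dictionary `INDEX` is a simplified stand-in for the (index key, RowID) entries, and reads of the index itself are not counted, following the slides.

```python
# Disk layout of the sample Singer table: block number -> names in that block.
BLOCKS = {
    1: ["안성민", "마현준", "소지민", "박승민"],
    2: ["박지원", "노예은", "고지훈", "금수빈"],
    3: ["나민지", "여유진", "김서현", "김현우"],
    4: ["고민서", "배민성", "소동현", "강민준"],
    5: ["고민재", "서수민", "구준서", "강서연"],
}

# Assumed stand-in for the index: name -> block number holding that tuple.
INDEX = {name: block_no for block_no, names in BLOCKS.items() for name in names}


def point_query_full_scan(name):
    """Read every block and filter; returns (matches, block_reads)."""
    matches, reads = [], 0
    for names in BLOCKS.values():
        reads += 1
        matches.extend(n for n in names if n == name)
    return matches, reads


def point_query_with_index(name):
    """Look the key up in the index, then read only the block it points to."""
    block_no = INDEX.get(name)
    if block_no is None:
        return [], 0
    return [n for n in BLOCKS[block_no] if n == name], 1


print(point_query_full_scan("구준서"))
print(point_query_with_index("구준서"))
```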


Range Query: Without Index

SELECT * FROM Singer WHERE name LIKE '강%'
- Block reads = 5

[Figure: all five disk blocks are read into memory and scanned; the matching tuples are 강민준 (block 4) and 강서연 (block 5).]


Range Query: With Index

SELECT * FROM Singer WHERE name LIKE '강%'
- Block reads = 2

[Figure: the index entries 강민준 and 강서연 point into disk blocks 4 and 5; only those two blocks are read into memory, and the tuples 강민준 and 강서연 are returned.]
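A prefix search such as LIKE '강%' can use the sorted index as a range scan. The Python sketch below is one way to model it, assuming the five-block layout from the slides: `bisect_left` finds the first key at or after the prefix, and the scan stops at the first key that no longer starts with it. The function names are hypothetical, and one table block read is charged per matching index entry.

```python
from bisect import bisect_left

# Disk layout of the sample Singer table: block number -> names in that block.
BLOCKS = {
    1: ["안성민", "마현준", "소지민", "박승민"],
    2: ["박지원", "노예은", "고지훈", "금수빈"],
    3: ["나민지", "여유진", "김서현", "김현우"],
    4: ["고민서", "배민성", "소동현", "강민준"],
    5: ["고민재", "서수민", "구준서", "강서연"],
}

# Sorted (key, block_no) pairs standing in for the index leaf entries.
INDEX = sorted((n, b) for b, names in BLOCKS.items() for n in names)
KEYS = [key for key, _ in INDEX]


def like_without_index(prefix):
    """Full scan: every block is read once, whatever the predicate."""
    matches = [n for names in BLOCKS.values() for n in names if n.startswith(prefix)]
    return matches, len(BLOCKS)


def like_with_index(prefix):
    """Range scan over the index, then one table block read per entry found."""
    i = bisect_left(KEYS, prefix)
    matches, reads = [], 0
    while i < len(INDEX) and KEYS[i].startswith(prefix):
        matches.append(KEYS[i])
        reads += 1
        i += 1
    return matches, reads


print(like_without_index("강"))
print(like_with_index("강"))
```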


Read All Data: Without Index

Block reads = number of blocks

[Figure: all five disk blocks are read from disk into memory.]


Read All Data: With Index

In this case, do not use the index
- Block reads = number of tuples

[Figure: each B-tree leaf entry (index key, RowID) triggers its own block read; for example, block 5 (고민재, 서수민, 구준서, 강서연) is read into memory to fetch a single tuple.]
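The gap between the two ways of reading the whole table can be reproduced with the sample layout. This Python sketch assumes, as the slide does, that every index entry costs one table block read (no buffer pinning); the function names are hypothetical.

```python
# Disk layout of the sample Singer table: block number -> names in that block.
BLOCKS = {
    1: ["안성민", "마현준", "소지민", "박승민"],
    2: ["박지원", "노예은", "고지훈", "금수빈"],
    3: ["나민지", "여유진", "김서현", "김현우"],
    4: ["고민서", "배민성", "소동현", "강민준"],
    5: ["고민재", "서수민", "구준서", "강서연"],
}


def read_all_full_scan():
    """Sequentially read each block once; returns (rows, block_reads)."""
    rows, reads = [], 0
    for names in BLOCKS.values():
        reads += 1
        rows.extend(names)
    return rows, reads


def read_all_via_index():
    """Walk the sorted index and fetch each tuple's block separately."""
    index = sorted((n, b) for b, names in BLOCKS.items() for n in names)
    rows, reads = [], 0
    for name, block_no in index:
        reads += 1  # the block is fetched again for every index entry
        rows.append(next(n for n in BLOCKS[block_no] if n == name))
    return rows, reads


scan_rows, scan_reads = read_all_full_scan()
index_rows, index_reads = read_all_via_index()
print(scan_reads, index_reads)
```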


Outline

- Introduction
- Index Structure
- Index Optimization


Index Structure

[Figure: RowID → hash function → cache buffer chain (latch) → buffer block]


Indexing Cost

1. Vertical search
2. Horizontal search
3. Random access

[Figure: steps 1 and 2 take place in the index; step 3 accesses the data.]


Outline

- Introduction
- Index Structure
- Index Optimization
  - Reduce Random Access
  - Reduce Horizontal Search


Reduce Random Access: Buffer Pinning [1/2]

Buffer pinning
- Access to a pinned block does not count as a logical read
- Prevents age-out of the shared block

Before buffer pinning
- Clustering factor = number of index keys

Idx Key  RowID
강민준   AAPLO/AAUAAAAPAAA
강서연   AAPLO/AAUAAAAPAAD
고민재   AAPLO/AAUAAAAPAAH
고민서   AAPLO/AAUAAAAPAAI
고지훈   AAPLO/AAUAAAAPAAJ
구준서   AAPLO/AAUAAAAPAB
금수빈   AAPLO/AAUAAAAPAAC
김서현   AAPLO/AAUAAAAPAAP
김현우   AAPLO/AAUAAAAPAAG
나민지   AAPLO/AAUAAAAPAAE

Disk blocks:
Block 5: 고민재 (M), 서수민 (F), 구준서 (M), 강서연 (F)
Block 4: 고민서 (M), 배민성 (M), 소동현 (M), 강민준 (M)
Block 3: 나민지 (F), 여유진 (F), 김서현 (F), 김현우 (M)
Block 2: 박지원 (M), 노예은 (F), 고지훈 (M), 금수빈 (F)
Block 1: 안성민 (M), 마현준 (M), 소지민 (F), 박승민 (M)


Reduce Random Access: Buffer Pinning [2/2]

After buffer pinning
- Clustering factor = 7

[Figure: the same ten index entries and five disk blocks as before pinning. Consecutive index entries that fall in the same block (강서연 and 고민재 in block 5; 김서현, 김현우, and 나민지 in block 3) reuse the pinned block instead of reading it again.]
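The value 7 can be reproduced by walking the ten index keys in the order listed on the slide and counting a block read only when the next key lives in a different block from the previous one. The Python sketch below assumes a single pinned block at a time; `block_reads` is a hypothetical helper, and the key order is copied from the slide rather than re-sorted.

```python
# Disk layout of the sample Singer table: block number -> names in that block.
BLOCKS = {
    1: ["안성민", "마현준", "소지민", "박승민"],
    2: ["박지원", "노예은", "고지훈", "금수빈"],
    3: ["나민지", "여유진", "김서현", "김현우"],
    4: ["고민서", "배민성", "소동현", "강민준"],
    5: ["고민재", "서수민", "구준서", "강서연"],
}
BLOCK_OF = {name: b for b, names in BLOCKS.items() for name in names}

# The ten index keys, in the order they appear on the slide.
INDEX_KEYS = ["강민준", "강서연", "고민재", "고민서", "고지훈",
              "구준서", "금수빈", "김서현", "김현우", "나민지"]


def block_reads(keys, pinning):
    """Count table block reads while following index keys in order.

    With pinning, the most recently read block stays pinned (assumption:
    only one block is pinned at a time), so a key in that block is free.
    """
    reads, pinned = 0, None
    for key in keys:
        block_no = BLOCK_OF[key]
        if pinning and block_no == pinned:
            continue
        reads += 1
        pinned = block_no
    return reads


print(block_reads(INDEX_KEYS, pinning=False))
print(block_reads(INDEX_KEYS, pinning=True))
```

The visited blocks are 4, 5, 5, 4, 2, 5, 2, 3, 3, 3, so three of the ten accesses hit the block that is already pinned.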


Reduce Random Access

Minimize clustering factor
- SQL Server: Clustered Index
- Oracle: Cluster Index, IOT (Index-Organized Table)

[Figure: a separate Index and Data, compared with a clustered index that stores Index + Data together]


Outline

- Introduction
- Index Structure
- Index Optimization
  - Reduce Random Access
  - Reduce Horizontal Search


Driving Condition

Consider distinct values (how many rows share each value)
- debutDate: Low
- gender: High
- mgtCode: Medium

Equal (or IN) condition
- Driving condition, along with the next condition

Range condition
- Check condition

name    gender  debutDate  mgtCode
강민준  M       20100305   A0015
강서연  F       20020206   A0065
고민재  M       20030506   A0035
고민서  F       20100607   A0065
고지훈  M       20050708   A0054
구준서  M       20060901   A0065
금수빈  F       20071015   A0098
김서현  F       20080909   A0015
김현우  M       20090516   A0054
나민지  F       20100123   A0065
노예은  F       20111109   A0065
마현준  M       20101005   A0065
박지원  F       20130608   A0073
박승민  M       20050709   A0073
배민성  M       20100908   A0065
서수민  F       20101215   A0065
소동현  M       20080130   A0098
소지민  F       20090807   A0035
안성민  M       20100626   A0035
여유진  F       20110519   A0035

SELECT *
FROM Singer
WHERE debutDate BETWEEN '20100101' AND '20131212'
  AND gender = 'F'
  AND mgtCode = 'A0065'
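The Low / High / Medium ratings can be checked against the 20 sample rows by counting distinct values per column: the fewer distinct values a column has, the more rows share each value. This Python sketch is only a sanity check on the sample data; `distinct_counts` is a hypothetical helper.

```python
# (gender, debutDate, mgtCode) for the 20 sample singers, in table order.
ROWS = [
    ("M", "20100305", "A0015"), ("F", "20020206", "A0065"),
    ("M", "20030506", "A0035"), ("F", "20100607", "A0065"),
    ("M", "20050708", "A0054"), ("M", "20060901", "A0065"),
    ("F", "20071015", "A0098"), ("F", "20080909", "A0015"),
    ("M", "20090516", "A0054"), ("F", "20100123", "A0065"),
    ("F", "20111109", "A0065"), ("M", "20101005", "A0065"),
    ("F", "20130608", "A0073"), ("M", "20050709", "A0073"),
    ("M", "20100908", "A0065"), ("F", "20101215", "A0065"),
    ("M", "20080130", "A0098"), ("F", "20090807", "A0035"),
    ("M", "20100626", "A0035"), ("F", "20110519", "A0035"),
]
COLUMNS = ("gender", "debutDate", "mgtCode")


def distinct_counts(rows):
    """Number of distinct values in each column."""
    return {col: len({row[i] for row in rows}) for i, col in enumerate(COLUMNS)}


counts = distinct_counts(ROWS)
for col in COLUMNS:
    print(col, counts[col], "distinct values,",
          len(ROWS) / counts[col], "rows per value on average")
```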


Index Matching Performance

Distinct (gender > mgtCode > debutDate)

WHERE debutDate BETWEEN '20100101' AND '20131212'
  AND gender = 'F'
  AND mgtCode = 'A0065'

Role of each index column, per column order:
Index (debutDate, gender, mgtCode): check, check, check
Index (mgtCode, debutDate, gender): driving, driving, check
Index (gender, mgtCode, debutDate): driving, driving, driving

Index entries for the three column orders, sorted by key:

debutDate  gender  mgtCode  |  mgtCode  debutDate  gender  |  gender  mgtCode  debutDate
20020206   F       A0065    |  A0015    20080909   F       |  F       A0015    20080909
20030506   M       A0035    |  A0015    20100305   M       |  F       A0035    20090807
20050708   M       A0054    |  A0035    20030506   M       |  F       A0035    20110519
20050709   M       A0073    |  A0035    20090807   F       |  F       A0065    20020206
20060901   M       A0065    |  A0035    20100626   M       |  F       A0065    20100123
20071015   F       A0098    |  A0035    20110519   F       |  F       A0065    20100607
20080130   M       A0098    |  A0054    20050708   M       |  F       A0065    20101215
20080909   F       A0015    |  A0054    20090516   M       |  F       A0065    20111109
20090516   M       A0054    |  A0065    20020206   F       |  F       A0073    20130608
20090807   F       A0035    |  A0065    20060901   M       |  F       A0098    20071015
20100123   F       A0065    |  A0065    20100123   F       |  M       A0015    20100305
20100305   M       A0015    |  A0065    20100607   F       |  M       A0035    20030506
20100607   F       A0065    |  A0065    20100908   M       |  M       A0035    20100626
20100626   M       A0035    |  A0065    20101005   M       |  M       A0054    20050708
20100908   M       A0065    |  A0065    20101215   F       |  M       A0054    20090516
20101005   M       A0065    |  A0065    20111109   F       |  M       A0065    20060901
20101215   F       A0065    |  A0073    20050709   M       |  M       A0065    20100908
20110519   F       A0035    |  A0073    20130608   F       |  M       A0065    20101005
20111109   F       A0065    |  A0098    20071015   F       |  M       A0073    20050709
20130608   F       A0073    |  A0098    20080130   M       |  M       A0098    20080130
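How much of each index has to be scanned horizontally can be estimated with a short Python sketch over the same 20 rows. It is a simplified model under stated assumptions: leading equality conditions narrow the scan, the first range condition bounds it, and any column after that is only checked entry by entry. Letting a leading range condition bound the scan is a modelling choice made here, and `scan` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right

# (gender, debutDate, mgtCode) for the 20 sample singers.
ROWS = [
    ("M", "20100305", "A0015"), ("F", "20020206", "A0065"),
    ("M", "20030506", "A0035"), ("F", "20100607", "A0065"),
    ("M", "20050708", "A0054"), ("M", "20060901", "A0065"),
    ("F", "20071015", "A0098"), ("F", "20080909", "A0015"),
    ("M", "20090516", "A0054"), ("F", "20100123", "A0065"),
    ("F", "20111109", "A0065"), ("M", "20101005", "A0065"),
    ("F", "20130608", "A0073"), ("M", "20050709", "A0073"),
    ("M", "20100908", "A0065"), ("F", "20101215", "A0065"),
    ("M", "20080130", "A0098"), ("F", "20090807", "A0035"),
    ("M", "20100626", "A0035"), ("F", "20110519", "A0035"),
]
COLS = {"gender": 0, "debutDate": 1, "mgtCode": 2}
EQUALS = {"gender": "F", "mgtCode": "A0065"}
RANGES = {"debutDate": ("20100101", "20131212")}


def matches(entry):
    """True if an index entry (column name -> value) satisfies the WHERE clause."""
    return (all(entry[c] == v for c, v in EQUALS.items())
            and all(lo <= entry[c] <= hi for c, (lo, hi) in RANGES.items()))


def scan(index_columns):
    """Return (entries_scanned, rows_matched) for one index column order."""
    entries = sorted(tuple(r[COLS[c]] for c in index_columns) for r in ROWS)
    lo, hi = [], []
    for col in index_columns:
        if col in EQUALS:
            lo.append(EQUALS[col])
            hi.append(EQUALS[col])
        elif col in RANGES:
            lo.append(RANGES[col][0])
            hi.append(RANGES[col][1])
            break  # columns after a range no longer narrow the scan
        else:
            break
    prefixes = [e[:len(lo)] for e in entries]
    start = bisect_left(prefixes, tuple(lo))
    end = bisect_right(prefixes, tuple(hi))
    scanned = entries[start:end]
    matched = sum(matches(dict(zip(index_columns, e))) for e in scanned)
    return len(scanned), matched


for order in [("debutDate", "gender", "mgtCode"),
              ("mgtCode", "debutDate", "gender"),
              ("gender", "mgtCode", "debutDate")]:
    print(order, scan(order))
```

Under these assumptions all three column orders return the same four rows, but the number of index entries scanned shrinks as the equality columns move to the front and the range column moves to the end.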