Search for Approximate Matches in Large Databases Eugene Fink Jaime Carbonell Aaron Goldstein Philip Hayes


Search for Approximate Matches in Large Databases

Eugene Fink, Jaime Carbonell

Aaron Goldstein, Philip Hayes

Motivation

Fast identification of approximate matches in large sets of records.

Applications:

• Medical databases

• Customer records

• National security

Outline

• Records and queries

• Search for matches

• Experimental results

Table of records

We specify a table of records by a list of attributes.

Example

We can describe patients in a hospital by their sex, age, and diagnosis.

Records and queries

A record includes a specific value for each attribute.

A query may include lists of values and numeric ranges.

Example

Record
Sex: female
Age: 30
Dx: asthma

Query
Sex: male, female
Age: 20..40
Dx: asthma, flu

Query types

A point query includes a specific value for each attribute.

A region query includes lists of values or numeric ranges.

Example

Point query
Sex: female
Age: 30
Dx: asthma

Region query
Sex: male, female
Age: 20..40
Dx: asthma, flu

Exact matches

A record is an exact match for a query if every value in the record belongs to the respective range in the query.

[Figure: a record and a query region plotted on the Age, Sex, and Dx axes]
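The exact-match test can be sketched in a few lines of Python. The representation is an assumption made for illustration, not taken from the slides: a record is a dictionary of attribute values, and a query maps each attribute either to a set of allowed values or to an inclusive (low, high) range. A point query is then a region query with one value per attribute.

```python
def value_matches(value, constraint):
    """Return True if one attribute value satisfies one query constraint."""
    if isinstance(constraint, tuple):      # inclusive numeric range (low, high)
        low, high = constraint
        return low <= value <= high
    return value in constraint             # set of allowed values


def is_exact_match(record, query):
    """A record matches if every value lies in the query's respective range."""
    return all(value_matches(record[attr], c) for attr, c in query.items())


# Example data from the slides, in the assumed representation.
record = {"sex": "female", "age": 30, "dx": "asthma"}
region_query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
point_query = {"sex": {"female"}, "age": (30, 30), "dx": {"asthma"}}
```

Here `is_exact_match(record, region_query)` and `is_exact_match(record, point_query)` both hold, while a 50-year-old patient falls outside the age range of the region query.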

Approximate matches

A record is an approximate match for a query if it is “close” to the query region.

[Figure: a record and a query region plotted on the Age, Sex, and Dx axes]

Approximate queries

An approximate query includes:

• Point or region

• Distance function

• Number of matches

• Distance limit
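The four parts of an approximate query can be illustrated with a brute-force scan. The slides do not define the distance function, so the one below is purely an assumption: a numeric value contributes its gap to the range, a nominal value contributes 1 when it is not in the list, and the contributions are summed. The function names are hypothetical.

```python
import heapq


def attribute_distance(value, constraint):
    """Assumed distance from one value to one constraint (0 if it matches)."""
    if isinstance(constraint, tuple):          # inclusive numeric range
        low, high = constraint
        return max(low - value, value - high, 0)
    return 0 if value in constraint else 1     # nominal value list


def distance(record, region):
    """Assumed distance function: sum of per-attribute distances."""
    return sum(attribute_distance(record[a], c) for a, c in region.items())


def approximate_matches(records, region, max_matches, distance_limit):
    """Return up to max_matches closest records within distance_limit."""
    scored = [(distance(r, region), i) for i, r in enumerate(records)]
    within = [pair for pair in scored if pair[0] <= distance_limit]
    return [(d, records[i]) for d, i in heapq.nsmallest(max_matches, within)]


records = [
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
]
region = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
nearest = approximate_matches(records, region, max_matches=3, distance_limit=5)
```

A linear scan like this touches every record; it only shows what an approximate query asks for, not how to answer it quickly.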

Outline

• Records and queries

• Search for matches

• Experimental results

Indexing structure

[Figure: index tree over six example records. The root branches on sex (male, female), the next level on age, and the last level on diagnosis. The leaves are the records (male, 30, asthma), (male, 40, flu), (female, 30, asthma), (female, 30, ulcer), (female, 30, fracture), and (female, 50, flu).]

• Maintain a PATRICIA tree of records

• Group nodes into fixed-size disk blocks
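As a rough illustration of the index, the sketch below builds a plain trie with one level per attribute. It is a simplification and not the structure from the slides: it omits the path compression that makes a tree a PATRICIA tree, and it does not group nodes into disk blocks. Records are assumed to be (sex, age, diagnosis) tuples.

```python
# The six example records from the figure, as (sex, age, diagnosis) tuples.
RECORDS = [
    ("male", 30, "asthma"),
    ("female", 30, "asthma"),
    ("male", 40, "flu"),
    ("female", 50, "flu"),
    ("female", 30, "ulcer"),
    ("female", 30, "fracture"),
]


def build_index(records):
    """Build nested dicts, one level per attribute; leaves are record lists."""
    root = {}
    for record in records:
        node = root
        for value in record[:-1]:              # descend through sex, then age
            node = node.setdefault(value, {})
        node.setdefault(record[-1], []).append(record)   # diagnosis level
    return root


index = build_index(RECORDS)
```

In this layout `index["female"][30]` holds the three diagnoses recorded for 30-year-old female patients.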

Search for matches

[Figure: index tree over the six example records, branching on sex, then age, then diagnosis]

• Depth-first search for exact matches

• Best-first search for approximate matches
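Both search strategies can be sketched on a small in-memory tree. Everything here is an assumption made for illustration: the index is a plain trie of nested dicts rather than a PATRICIA tree on disk, the per-attribute distance is the range gap for numbers and 0 or 1 for nominal values, and the function names are hypothetical.

```python
import heapq
from itertools import count

RECORDS = [
    ("male", 30, "asthma"), ("female", 30, "asthma"), ("male", 40, "flu"),
    ("female", 50, "flu"), ("female", 30, "ulcer"), ("female", 30, "fracture"),
]


def build_index(records):
    """Nested dicts, one level per attribute; the last level maps to record lists."""
    root = {}
    for record in records:
        node = root
        for value in record[:-1]:
            node = node.setdefault(value, {})
        node.setdefault(record[-1], []).append(record)
    return root


def gap(value, constraint):
    """Assumed per-attribute distance: 0 inside the constraint."""
    if isinstance(constraint, tuple):          # inclusive numeric range
        low, high = constraint
        return max(low - value, value - high, 0)
    return 0 if value in constraint else 1     # nominal value list


def exact_search(node, query, depth=0):
    """Depth-first search: descend only into branches that satisfy the query."""
    if isinstance(node, list):                 # reached the records
        return list(node)
    matches = []
    for value, child in node.items():
        if gap(value, query[depth]) == 0:
            matches.extend(exact_search(child, query, depth + 1))
    return matches


def approximate_search(root, query, max_matches, distance_limit):
    """Best-first search: always expand the node with the smallest distance."""
    tie = count()                              # tie-breaker; dicts are not comparable
    frontier = [(0, next(tie), 0, root)]
    results = []
    while frontier and len(results) < max_matches:
        dist, _, depth, node = heapq.heappop(frontier)
        if isinstance(node, list):
            results.extend((dist, record) for record in node)
            continue
        for value, child in node.items():
            d = dist + gap(value, query[depth])
            if d <= distance_limit:            # prune branches beyond the limit
                heapq.heappush(frontier, (d, next(tie), depth + 1, child))
    return results[:max_matches]


index = build_index(RECORDS)
query = ({"male", "female"}, (20, 40), {"asthma", "flu"})
exact = exact_search(index, query)
near = approximate_search(index, query, max_matches=4, distance_limit=5)
```

Because the assumed distances are non-negative and additive, the distance accumulated along a path never exceeds the distance of any record below it, so the best-first loop returns records in order of increasing distance and can stop as soon as it has enough matches.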

Outline

• Records and queries

• Search for matches

• Experimental results

Performance

Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002:

• Twenty-one attributes

• 1.6 million records

Use of a Pentium computer:

• 2.4 GHz CPU

• 1 Gbyte memory

• 400 MHz bus

Variables

Control variables:

• Number of records

• Memory size

• Query type

Measurements:

• Retrieval time

Small memory

• Number of records: 100 to 1,672,016

• Memory size: 4 MByte

[Plot: retrieval time (msec, log scale, 1 to 100) versus number of records (log scale, 10^2 to 10^6) for exact, range, and approximate queries. The plot marks the available memory, and its curves carry the growth-rate labels lg n, n^0.15, lg n, and n^0.5.]

Large memory

• Number of records: 1,672,016

• Memory size: 64 to 1,024 MByte

[Plot: retrieval time (msec, log scale, 1 to 10,000) versus memory size (64, 128, 256, 512, and 1,024 MBytes) for exact, range, and approximate queries.]

Summary

• Retrieval time grows as a fractional power (about 0.5) of the database size

• If we extrapolate this growth rate, retrieval times remain reasonable for very large databases

Number of records (n)     Time (seconds), growing as n^0.5

1,000,000                  0.05
100,000,000                0.50
10,000,000,000             5.00
1,000,000,000,000         50.00
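The extrapolated times follow from scaling the first table row by the square root of the growth in database size. The sketch below assumes that row (0.05 seconds at 1,000,000 records) as the base point and the exponent 0.5 from the summary; the function name is hypothetical.

```python
def extrapolated_time(n, base_n=1_000_000, base_seconds=0.05, exponent=0.5):
    """Scale the base retrieval time by (n / base_n) ** exponent."""
    return base_seconds * (n / base_n) ** exponent


for n in (10**6, 10**8, 10**10, 10**12):
    print(f"{n:>17,}  {extrapolated_time(n):6.2f} s")
```

Each hundredfold increase in the number of records multiplies the estimated time by ten, which reproduces the four rows of the table.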