IMAGE DATABASES Prof. Hyoung-Joo Kim OOPSLA Lab. Computer Engineering Seoul National University.

28
IMAGE DATABASES Prof. Hyoung-Joo Kim OOPSLA Lab. Computer Engineering Seoul National University

Transcript of IMAGE DATABASES Prof. Hyoung-Joo Kim OOPSLA Lab. Computer Engineering Seoul National University.

IMAGE DATABASES

Prof. Hyoung-Joo Kim

OOPSLA Lab.

Computer Engineering

Seoul National University

Contents

• Introduction• Image database systems• Indexing issues and basic mechanisms• A taxonomy on image indexes • Color-spatial hierarchical indexes• Signature-based color-spatial retrieval• Summary

Introduction

• Traditional DBMS

– effective in managing structured data

– not effective for images that are non-alphanumeric and unstructured

• Applications to manage images

– medical application

– geographic information system

– satellite image database

– criminal database

– interior design

– art galleries and museum management

– etc...

Introduction(2)

• Content-based image retrieval techniques– techniques that retrieve image based on their

visual properties such as texture, color, shape, etc– sequentially comparing image feature is time-

consuming and impractical– need content-based image index

Image Database System

• Additional functionalities

– feature extraction• the system must be able to analyze an image to extract key

features such as shape, color, texture

– feature-based indexing• the system must build indexes based on the features

extracted

– content-based retrievals• the system should support a wide range of queries that

involve the contents of the image

– measure of similarity• the system requires a measure to capture what we humans

perceive as similarity between two images

Architecture of Image DBMSInput Image

PREPROCESSING MODULE

QUERY MODULE

ImageInput/Scanner

FeatureExtraction

UpdateIndex/

Database

InteractiveQuery

Formulation

Browsing&

Feedback

FeatureExtraction

FeatureMatching

RuntimeProcessor

ConcurrencyControl &RecoveryManager

Feature/ImageDatabaseUser

Query

OutputRetrievedImage

Indexing Issues

• Three issues in content-based index design– determine a representation for the indexing feature– determine a similarity measure between two images

based on their representation– determine an appropriate index organization

Indexing Issues

• Desirable properties of a representation(first issue)– exactness– space efficiency– computationally inexpensive similarity matching– preservation of the similarity between the features– automatic extraction– insensitivity to noise, distortion, rotation

Indexing Issues

• Main criterion of the similarity measure(second issue)– if two images are similar under the indexing feature,

then their representations should remain so.• Several alternatives to determine the similarity

– exact match• the representation of an image feature is usually coarse, in

sense that images with similar feature will be mapped to the same representation

– approximate match• the degree of similarity between the image representations

is computed based on some approximation techniques

Indexing Issues

• Criteria for selection of an index structure– the similarity measure can be supported efficiently– storage efficiency– maintenance(update) overhead

• Appropriate index structure– be determined by the representation and similarity

measure– example

• if image feature is represented as a vector, an the similarity measure is the Euclidean distance, then a natural choice is the multi-dimensional point access method

Basic Index Schemes

• Spatial access methods(SAMs)– basic idea

• extract k image features for each image• map images into points in a k-dimensional feature space• use SAMs such as the grid file, quad-tree, the family of R-

tree

– problem: “high-dimensionality curse”• these techniques perform no better than sequential

scanning as the dimension becomes sufficiently large

Basic Index Schemes

• Inverted file– basic idea

• an inverted list is created for each distinct key(indexed features)

• the inverted list consists of a list of pointers to the objects that contain features that are similar to the indexed feature

– problem• high storage overhead & expensive update

• Signature file– refer [Text Database]

Taxonomy on Image Indexes

• Shape feature• Spatial relationship• Texture• Color

Taxonomy on Image IndexesContent-based indexes

Spatial relationship

Texture

Objects

Shape

1-D string

Inverted file

Numerical vectors

Color

Color histogram Color-Spatial

Signature 2-D string

Multi-levelsignature file

Sequential file

Multi-dimensionalindex

Multi-dimensionalindex

Multi-levelhistogram

2-levelB+-tree

3-tiercolor index

Tamura features

Multi-dimensionalindex

Signaturefile

Similarity againstrepresentativeobjects

Signature

Inverted file

Shape Feature

• Representation– boundary information– a collection of rectangle that forms a rectangular cover of the

shape– mathematical morphology– pattern spectrum– numerical vector using 2-D Fourier transformation or

Wavelet transformation

• Problems– shape vary widely from object to object– unless the images have very distinct shape, the

performance may suffer

Spatial relationship

• Representation– 2-D string

• semantic representation for spatial relationship using a two-dimensional string

• projection of the symbols along the x-axis and y-axis• example : “O1 is to left of O2 which is left of O3”• the projection on the x-axis - O1 < O2 < O3• variations and extensions to the 2-D string have been

proposed

– Problem• spatial relationships can be drastically affected by the

orientation of the image

Texture and Color

• Texture– can be represented by the coarseness, contrast, directionality– the extraction of text information is a computationally intensive

operations

• Color– color histogram that captures the color composition of images– color alone is not sufficient to characterize an image

• consider two image - one with the top half blue and bottom half red, while the other’s left half is red and it’s right half is blue

• although these two images are similar in color composition, the are entirely different to a human observer.

– recent studies have proposed to integrate color and its spatial distribution

Color-spatial hierarchical Indexes

• Hierarchical Indexes– multiple indexing mechanisms are integrated to form

a single index structure– three indexes that have been proposed to integrated

color and spatial information for retrieval• Two-level B+-tree structure• Three-tier color index• Sequential Multi-Attribute Tree(SMAT)

Two-level B+-tree

• Feature extraction– the color-spatial information of an image is modeled

by splitting the image into 9 equal sub-areas(33)– the color information within each sub-area is

represented by a color histogram– one can obtain a more accurate similarity by

matching the corresponding color histogram of two image

– the color histogram within each sub-area is mapped into a numerical key

• Retrieval technique

Two-level B+-tree

• Retrieval technique– use two level information– the first level

• describes the composition of colors corresponding to the histogram of the region

• the colors are grouped into 11 bins• each group is assigned a range which bounds the

percentage of pixels in the mage with colors of group

– the second level• contains the average H, average V, and average C values

of all the 11 histogram bins

Two-level B+-treeLevel 1:B+-tree on Normal Pixel Count

Level 2:B+-tree on Average H,V, andC values

Three-tier Color Index

• Layer 1:– dominant color classification

• a fixed number of dominant colors is extracted• the dominant colors are those with the largest number of pixel

count

• Layer 2:– multidimensional R-tree structure

• the image can be assigned to a partition• prune away image within the candidate partitions that are not

relevant

• Layer 3:– multi-level color histogram(quad-tree structure)

• compare the histograms of the query image with those of remaining potential candidate image

Tree-tier Color IndexTier 1: Dominant Color Classification

Tier 2: R-tree

Tier 1: Multi-Level Color Histogram

K = 1

K = 2

K = 3

0 1 2 n(no.of colors)

(0,1) (0,2) .. (0,n) (1,2) …...…(2,n) ….

(0,1,2) ... (0,1,n) ……. (0,2,n) …...… …. K : dominant colors

SMAT

• Height-balanced color-spatial index– the problem with the two approaches

• individual tree structures(B+-tree, R-tree, Dominant Color Classification) are height-balanced, the entire hierarchical index structure may not be so.

– height-balancing of SMAT• SMAT is a multi-tier index structure similar to two

approaches, but remains height-balancing of the entire index structure by controlling the growth of the tree using a certain threshold which is application-specific value.

• refer to the VLDB Journal(1998) 7: 115-128 for more information

Signature-based Color-Spatial Retrieval

• Representation of color-spatial information– an image is partitioned into a grid of mn cells of equal size– the colors that can be used to represent a cell are determined– for a given color, each cell is examined to determine the

percentage of the total number of pixels in the cell having that color

0 1 2

31

An image partitioned into a 4x8 grid

Cell does not satisfy the threshold

Cell does satisfy the threshold

- an image is represented by 32-bit color signature, 10001111001000110000000001100011

• The retrieval process– an image with k colors has k color signatures

– let Qi and Di denote the signature of color i for a query image Q and a database image D.

– let the representative color sets of Q and D be CQ and CD.

Signature-based Color-Spatial Retrieval

QCi

basicbasic i)D,(Q,SIM D)(Q,SIM

otherwise

C icolor if D

0)BitSet(Q

)DBitSet(Q{i)D,(Q,SIM

i

ii

basic

Summary

• Promising area that require further research– performance evaluation

• comparative study will be useful for application designers and practitioners to pick the best method for their applications

– more on access method• designing efficient access methods will make the content-

based retrieval techniques more practical and useful

– concurrent access and distributed indexing• we expect to see more real-time application as well as

application running in parallel or distributed environment

Summary

– Integration and optimization• most content-based image retrieval techniques only capture

a part of the image’s semantics• important issues

– selecting an “optimal” set of image features that fits best for an application

– developing techniques that can integrate them into achieve the optimal results

• one promising method(semantic-based retrieval technique)– use content-based techniques as the basis, but also

exploits semantic meanings of the image and queries to support concept-based queries