Bitmap Indexes. Bitmap Indices Bitmap indices are a special type of index designed for efficient...
Date posted: 21-Dec-2015. Category: Documents.
Bitmap Indexes
Bitmap Indices
Bitmap indices are a special type of index designed for efficient querying on multiple keys.
Very effective on attributes that take on a relatively small number of distinct values.
E.g. gender, country, state, …
E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
A bitmap is simply an array of bits.
For each gender, we associate a bitmap, where each bit represents whether or not the corresponding record has that gender.
Bitmap Indices (Cont.)
In its simplest form, a bitmap index on an attribute has a bitmap for each value of the attribute.
Each bitmap has as many bits as there are records.
In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise.
Bitmap Indices (Cont.)
Bitmap indices are useful for queries on multiple attributes; they are not particularly useful for single-attribute queries.
Queries are answered using bitmap operations:
Intersection (and)
Union (or)
Complementation (not)
Intersection and union take two bitmaps of the same size and apply the operation to corresponding bits to get the result bitmap; complementation flips each bit of a single bitmap.
E.g. 100110 AND 110011 = 100010
100110 OR 110011 = 110111
NOT 100110 = 011001
Males with income level L1: AND the Males bitmap with the Income Level L1 bitmap
- 10010 AND 10100 = 10000
The required tuples can then be retrieved. Counting the number of matching tuples is even faster, since it needs only the result bitmap.
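The bitmap operations above can be sketched in a few lines of Python. This is only an illustration, assuming each bitmap is stored as a Python integer with record 0 as the leftmost bit; the sample records and the helper names (`build_bitmap_index`, `bitmap_not`, `to_bits`, `count_matches`) are hypothetical and not part of the slides.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int bitmap; record 0 is the leftmost bit."""
    n = len(values)
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << (n - 1 - i))
    return index

def bitmap_not(bitmap, n):
    """Complement restricted to n bits (Python ints are unbounded)."""
    return ~bitmap & ((1 << n) - 1)

def to_bits(bitmap, n):
    """Render a bitmap as an n-character string of 0s and 1s."""
    return format(bitmap, f"0{n}b")

def count_matches(bitmap):
    """Count matching tuples without touching the relation itself."""
    return bin(bitmap).count("1")

# Hypothetical five-record relation, chosen to reproduce the 10010 / 10100 example.
gender = ["m", "f", "f", "m", "f"]
income = ["L1", "L2", "L1", "L4", "L3"]
n = len(gender)
by_gender = build_bitmap_index(gender)
by_income = build_bitmap_index(income)

males_l1 = by_gender["m"] & by_income["L1"]
print(to_bits(by_gender["m"], n), "AND", to_bits(by_income["L1"], n), "=", to_bits(males_l1, n))
print("matching tuples:", count_matches(males_l1))
```

Counting only inspects the result bitmap, which is why it is cheaper than fetching the tuples.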
Bitmap Indices (Cont.)
Bitmap indices are generally very small compared with relation size.
E.g. if a record is 100 bytes, the space for a single bitmap is 1/800 of the space used by the relation.
If the number of distinct attribute values is 8, the bitmap index is only 1% of the relation size.
Deletion needs to be handled properly.
An existence bitmap notes whether there is a valid record at a record location.
Needed for complementation: not(A=v) is computed as (NOT bitmap-A-v) AND ExistenceBitmap
Should keep bitmaps for all values, even the null value.
To correctly handle SQL null semantics for NOT(A=v), intersect the above result with (NOT bitmap-A-Null)
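A minimal sketch of the complementation rule, assuming bitmaps are Python integers over n record slots and that deleted slots carry 0 in every value bitmap. The function name `not_equal` and the six-slot example are hypothetical.

```python
def not_equal(bitmap_v, bitmap_null, existence, n):
    """NOT(A = v): complement, then drop deleted slots and NULLs."""
    mask = (1 << n) - 1                    # keep the result within n bits
    complemented = ~bitmap_v & mask        # NOT bitmap-A-v
    live = complemented & existence        # AND ExistenceBitmap
    return live & (~bitmap_null & mask)    # AND (NOT bitmap-A-Null)

n = 6
bitmap_v = 0b100100     # slots 0 and 3 have A = v
bitmap_null = 0b000010  # slot 4 has A = NULL
existence = 0b110111    # slot 2 was deleted

result = not_equal(bitmap_v, bitmap_null, existence, n)
print(format(result, f"0{n}b"))
```

Without the existence bitmap the deleted slot 2 would wrongly appear in the answer, and without the null bitmap slot 4 would.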
Index Definition in SQL
Create a B-tree index (default in most databases)
create index <index-name> on <relation-name>(<attribute-list>)
-- create index b_index on branch(branch_name)
-- create index ba_index on branch(branch_name, account) -- concatenated index
-- create index fa_index on branch(func(balance, amount)) -- function index
Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.
Hash indexes: not supported by every database (though hashing is used implicitly in joins, …). PostgreSQL has them but discourages their use because of performance.
Create a bitmap index
create bitmap index <index-name> on <relation-name>(<attribute-list>)
- For attributes with few distinct values
- Mainly for decision-support(query) and not OLTP (do not support updates efficiently)
To drop any index
drop index <index-name>
Query Processing
General Overview
Relational model - SQL
Formal & commercial query languages
Functional Dependencies
Normalization
Physical Design
Indexing
Query Processing and Optimization
Review
Data retrieval at the physical level:
Indices: data structures to help with some query evaluation:
SELECTION queries (ssn = 123)
RANGE queries (100 <= ssn <= 200)
Index choices: primary vs secondary, dense vs sparse, ISAM vs B+-tree vs Extendible Hashing vs Linear Hashing
But what about join queries? Or other queries not directly supported by the indices? How do we evaluate these queries?
Sometimes indexes are not useful, even for SELECTION queries. When? What decides when to use them?
A: Query Processing (one of the most complex components of a database system)
QP & O
[Figure: SQL Query -> Query Processor -> Data (result of the query)]
QP & O
[Figure: inside the Query Processor: SQL Query -> Parser -> Algebraic Expression -> Query Optimizer -> Execution Plan -> Evaluator -> Data (result of the query)]
QP & O
[Figure: inside the Query Optimizer: Algebraic Representation -> Query Rewriter -> Algebraic Representation -> Plan Generator (using Data Stats) -> Query Execution Plan]
Query Processing and Optimization
Parser / translator (1st step)
Input: SQL Query (or OQL, …)
Output: Algebraic representation of query (relational algebra expression)
E.g. SELECT balance
FROM account
WHERE balance < 2500
$\pi_{balance}(\sigma_{balance < 2500}(account))$
or, as an operator tree read from the root down:
$\pi_{balance}$
$\sigma_{balance < 2500}$
account
QP & O
Plan Evaluator (last step)
Input: Query Execution Plan
Output: Data (Query results)
Query execution plan: algorithms of operators that read from disk:
Sequential scan
Index scan
Merge-sort join
Nested loop join
…
QP & O
Query Rewriting
Input: Algebraic representation of query
Output: Algebraic representation of query
Idea: Apply heuristics to generate an equivalent expression that is likely to lead to a better plan
e.g.: $\sigma_{amount > 2500}(borrower \bowtie loan)$
$borrower \bowtie (\sigma_{amount > 2500}(loan))$
Why is 2nd better than 1st?
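One way to see the answer is to count how many tuple pairs a naive join has to compare under each expression. The sketch below assumes tiny, made-up borrower(cname, loan_no) and loan(loan_no, amount) relations and a hypothetical helper `join`; it illustrates the heuristic and is not how a real optimizer measures cost.

```python
# Hypothetical sample data: borrower(cname, loan_no), loan(loan_no, amount).
borrower = [("Jones", "L-17"), ("Smith", "L-23"), ("Hayes", "L-15"), ("Adams", "L-16")]
loan = [("L-17", 1000), ("L-23", 2000), ("L-15", 3000), ("L-16", 4000)]

def join(borrowers, loans):
    """Naive natural join on loan_no; also returns the number of pairs compared."""
    pairs = 0
    out = []
    for cname, b_loan in borrowers:
        for l_loan, amount in loans:
            pairs += 1
            if b_loan == l_loan:
                out.append((cname, b_loan, amount))
    return out, pairs

# Expression 1: join first, then select amount > 2500.
joined, pairs_join_first = join(borrower, loan)
plan1 = [t for t in joined if t[2] > 2500]

# Expression 2: select amount > 2500 on loan first, then join.
big_loans = [l for l in loan if l[1] > 2500]
plan2, pairs_select_first = join(borrower, big_loans)

print(pairs_join_first, pairs_select_first)
print(sorted(plan1) == sorted(plan2))
```

Both expressions return the same tuples, but pushing the selection down shrinks the join input, so less work (and, on disk, fewer block reads) is needed.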
QP & O
Plan Generator
Input: Algebraic representation of query
Output: Query execution plan
Idea:
1) Generate alternative plans for evaluating a query
2) Estimate cost for each plan
3) Choose the plan with the lowest cost
[Figure: alternative plans for $\sigma_{amount > 2500}$: Sequential scan vs. Index scan]
QP & O
Goal: generate plan with minimum cost
(i.e., as fast as possible)
Cost factors:
1. CPU time (trivial compared to disk time)
2. Disk access time
main cost in most DBs
3. Network latency
Main concern in distributed DBs
Our metric: count disk accesses
Cost Model
How do we predict the cost of a plan?
Ans: Cost model
For each plan operator and each algorithm we have a cost formula
Inputs to formulas depend on relations, attributes
Database maintains statistics about relations for this (Metadata)
Metadata
Given a relation r, DBMS likely maintains the following metadata:
1. Size (# of tuples): $n_r$
2. Size (# of blocks): $b_r$
3. Block size (# of tuples per block): $f_r$
(typically $b_r = \lceil n_r / f_r \rceil$)
4. Tuple size (in bytes): $s_r$
5. Attribute Variance (for each attribute att in r, # of different values): V(att, r)
6. Selection Cardinality (for each attribute att in r, expected size of the selection $\sigma_{att = K}(r)$): SC(att, r)
Example
account:
bname    acct_no  balance
Dntn     A-101    500
Mianus   A-215    700
Perry    A-102    500
R.H.     A-305    900
Dntn     A-200    700
Perry    A-301    500
$n_{account} = 6$
$s_{account} = 33$ bytes
$f_{account} = 4K / 33$
V(balance, account) = 3
V(acct_no, account) = 6
SC(balance, account) = 2 (= $n_r$ / V(att, r))
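The statistics in this example can be recomputed from the six account tuples. The sketch below is illustrative only: the helper names `V` and `SC` mirror the slide notation, and reading 4K as 4096 bytes is an assumption.

```python
# The account relation from the example: (bname, acct_no, balance).
account = [
    ("Dntn", "A-101", 500),
    ("Mianus", "A-215", 700),
    ("Perry", "A-102", 500),
    ("R.H.", "A-305", 900),
    ("Dntn", "A-200", 700),
    ("Perry", "A-301", 500),
]
COLUMNS = {"bname": 0, "acct_no": 1, "balance": 2}

def V(att, rel):
    """Number of distinct values of attribute att in rel."""
    return len({t[COLUMNS[att]] for t in rel})

def SC(att, rel):
    """Selection cardinality: expected result size of att = K, assuming uniform values."""
    return len(rel) / V(att, rel)

n_account = len(account)
s_account = 33                      # tuple size in bytes, from the example
f_account = 4096 // s_account       # assumes a 4K block means 4096 bytes

print(n_account, V("balance", account), V("acct_no", account), SC("balance", account))
print("tuples per block:", f_account)
```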
Some typical plans and their costs
Query: $\sigma_{att = K}(r)$
A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
Cost estimate (number of disk blocks scanned) = $b_r$
$b_r$ denotes the number of blocks containing records from relation r
If selection is on a key attribute, cost = $b_r / 2$
stop on finding the record (on average, in the middle of the file)
Linear search can be applied regardless of
selection condition, or
ordering of records in the file, or
availability of indices
Selection Operation (Cont.)
Query: $\sigma_{att = K}(r)$
A2 (binary search). Applicable if the selection is an equality comparison on the attribute on which the file is ordered.
Requires that the blocks of a relation are stored contiguously
Cost estimate:
$\lceil \log_2(b_r) \rceil$: cost of locating the first tuple by a binary search on the blocks
Plus the number of blocks containing records that satisfy the selection condition
$E_{A2} = \lceil \log_2(b_r) \rceil + \lceil SC(att, r) / f_r \rceil - 1$
What is the cost if att is a key?
$E_{A2} = \lceil \log_2(b_r) \rceil$
Example
Query: $\sigma_{bname = \text{"Perry"}}(account)$
V(bname, account) = 50
$n_{account}$ = 10K, $f_{account}$ = 20 tuples/block
Primary index on bname
Key: acct_no
Cost Estimates:
A1: $E_{A1} = n_{account} / f_{account} = 500$ I/Os
A2: $E_{A2} = \lceil \log_2(b_r) \rceil + \lceil SC(att, r) / f_r \rceil - 1 = 9 + 9 = 18$ I/Os
(SC(bname, account) = 10,000 / 50 = 200, so 200 / 20 - 1 = 9)
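The two estimates can be checked with a short calculation. This is a sketch of the A1 and A2 formulas as stated in the slides; the function names are hypothetical, and rounding with ceilings is an assumption about how fractional blocks are counted.

```python
import math

def cost_linear(b_r, on_key=False):
    """A1: scan every block; on average half of them if att is a key."""
    return b_r / 2 if on_key else b_r

def cost_binary(b_r, sc, f_r):
    """A2: binary search for the first block, then read the rest of the matching blocks."""
    return math.ceil(math.log2(b_r)) + math.ceil(sc / f_r) - 1

n_account = 10_000
f_account = 20
b_account = n_account // f_account          # 500 blocks
sc_bname = n_account / 50                   # V(bname, account) = 50

e_a1 = cost_linear(b_account)
e_a2 = cost_binary(b_account, sc_bname, f_account)
print(e_a1, e_a2)

# If att were a key, SC = 1 and A2 reduces to the binary-search term alone.
print(cost_binary(b_account, 1, f_account))
```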
More Plans for selection
What if there is an index on att?
We need metadata on the size of the index (i). DBMS keeps track of:
1. Index height: $HT_i$
2. Index "Fan Out": $f_i$
Average # of children per node (not the same as order)
3. Index leaf nodes: $LB_i$
Note: $HT_i \approx \log_{f_i}(LB_i) + 1$
More Plans for selection
Query: $\sigma_{att = K}(r)$
A3: Index scan, Primary Index
What: Follow primary index, searching for key K
Prereq: Primary index on att, i
Cost:
$E_{A3} = HT_i + 1$, if att is a candidate key
$E_{A3} = HT_i + SC(att, r) / f_r$, if not
A4: Index scan, Secondary Index
What: Follow secondary index, searching for key K
Prereq: Secondary index on att, i
Cost:
If att is not a key: $E_{A4} = HT_i + 1 + SC(att, r)$
where $HT_i$ counts index block reads, 1 is the bucket read, and SC(att, r) counts file block reads (in the worst case, each tuple is on a different block)
Else, if att is a key: $E_{A4} = HT_i + 1$
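To compare the two index plans, the cost formulas can be written as small functions. This is a sketch only: the function names are hypothetical, the index height of 3 is an assumed value (the slides do not give one), and the ceiling on SC / f_r is an assumption about rounding.

```python
import math

def cost_primary_index(ht, sc, f_r, is_key):
    """Primary index scan: matching tuples are stored contiguously."""
    return ht + 1 if is_key else ht + math.ceil(sc / f_r)

def cost_secondary_index(ht, sc, is_key):
    """Secondary index scan: worst case, every matching tuple is on its own block."""
    return ht + 1 if is_key else ht + 1 + math.ceil(sc)

HT = 3            # assumed index height, for illustration only
sc_bname = 200    # SC(bname, account) = 10,000 / 50
f_account = 20

primary = cost_primary_index(HT, sc_bname, f_account, is_key=False)
secondary = cost_secondary_index(HT, sc_bname, is_key=False)
print(primary, secondary)
```

Under these assumed numbers the secondary index is far more expensive than the primary one, because each of the 200 matching tuples may need its own block read.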
Cardinalities
Cardinality: the size (number of tuples) of the query result
Why do we care?
Ans: Cost of every plan depends on $n_r$
e.g.
Linear scan: $b_r = n_r / f_r$
Primary Index: $HT_i + 1 \approx \log_{f_i}(LB_i) + 2 \le \log_{f_i}(n_r / f_r) + 2$
But, what if r is the result of another query?
Must know the size of query results as well as the cost
Size of $\sigma_{att = K}(r)$? SC(att, r)
Selections Involving Comparisons
Query: $\sigma_{Att \ge K}(r)$
A5 (primary index, comparison). (Relation is sorted on Att)
For $\sigma_{Att \ge V}(r)$ use index to find first tuple $\ge v$ and scan relation sequentially from there
For $\sigma_{Att \le V}(r)$ just scan relation sequentially till first tuple > v; do not use index
Cost: $E_{A5} = HT_i + c / f_r$ (where c is the cardinality of result)
[Figure: index of height $HT_i$ leading to the first tuple with value k, followed by a sequential scan of the sorted relation]
Query: $\sigma_{Att \ge K}(r)$
Cardinality: More metadata on r are needed:
min(att, r): minimum value of att in r
max(att, r): maximum value of att in r
Then the size of $\sigma_{Att \ge K}(r)$ is estimated as:
$n_r \cdot \dfrac{\max(att, r) - K}{\max(att, r) - \min(att, r)}$
(or $n_r / 2$ if min, max unknown)
Intuition: assume uniform distribution of values between min and max
[Figure: number line from min(att, r) to max(att, r) with K marked between them]
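The uniform-distribution estimate for a selection of the form Att >= K can be sketched as follows. The function name `estimate_ge` is hypothetical, clamping K to the [min, max] range is an added assumption, and the balance range 0 to 1000 is made up for the example.

```python
def estimate_ge(n_r, k, lo=None, hi=None):
    """Estimated number of tuples with att >= k, assuming uniformly distributed values."""
    if lo is None or hi is None:
        return n_r / 2                 # fallback when min/max are unknown
    if k <= lo:
        return n_r                     # assumption: every tuple qualifies
    if k > hi:
        return 0                       # assumption: no tuple qualifies
    return n_r * (hi - k) / (hi - lo)  # fraction of the value range at or above k

# Hypothetical statistics: 10,000 tuples with balances between 0 and 1000.
print(estimate_ge(10_000, 750, lo=0, hi=1000))
print(estimate_ge(10_000, 750))
```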
Plan generation: Range Queries
Query: $\sigma_{Att \ge K}(r)$
A6: (secondary index, comparison).
Cost:
$E_{A6} = HT_i - 1$ + # of leaf nodes to read + # of file blocks to read
$= HT_i - 1 + LB_i \cdot (c / n_r) + c$, if att is a candidate key
[Figure: index of height $HT_i$ whose leaf entries k, k+1, ..., k+m point to tuples scattered across file blocks]
Plan generation: Range Queries
A6: (secondary index, range query). If att is NOT a candidate key
Cost: $E_{A6} = HT_i - 1$ + # of leaf nodes to read + # of file blocks to read + # of buckets to read
$= HT_i - 1 + LB_i \cdot (c / n_r) + c + x$ (where x is the number of buckets read)
[Figure: index of height $HT_i$ whose leaf entries k, k+1, ..., k+m point to buckets, which in turn point to tuples in file blocks]
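Plugging numbers into the A6 formula shows why a secondary index is not always worth using for a range query. The sketch below uses made-up index statistics (height 3, 100 leaf blocks, 10 buckets) and a hypothetical function name; only the formula itself comes from the slides.

```python
import math

def cost_secondary_range(ht, lb, n_r, c, buckets=0):
    """A6: descend the index, read the needed leaves, then one file block per result tuple."""
    leaf_reads = math.ceil(lb * c / n_r)   # fraction of the leaf level covered by the range
    return ht - 1 + leaf_reads + c + buckets

# Assumed statistics, for illustration only.
HT, LB, n_r, f_r = 3, 100, 10_000, 20
c = 2_500                                   # estimated result cardinality

key_case = cost_secondary_range(HT, LB, n_r, c)
non_key_case = cost_secondary_range(HT, LB, n_r, c, buckets=10)
linear_scan = n_r // f_r

print(key_case, non_key_case, linear_scan)
```

With a result this large, the estimate for the index plan is several times the 500 block reads of a plain linear scan, which is one case where the optimizer would skip the index.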
Join Operation
Size and plans for join operation
Running example: $depositor \bowtie customer$
depositor(cname, acct_no)
customer(cname, cstreet, ccity)
Metadata:
$n_{customer}$ = 10,000    $n_{depositor}$ = 5000
$f_{customer}$ = 25    $f_{depositor}$ = 50
$b_{customer}$ = 400    $b_{depositor}$ = 100
V(cname, depositor) = 2500 (each customer has on average 2 accts)
cname in depositor is a foreign key for customer
Cardinality of Join Queries
What is the cardinality (number of tuples) of the join?
E1: Cartesian product: $n_{customer} \cdot n_{depositor}$ = 50,000,000
E2: Attribute cname common in both relations, 2500 different cnames in depositor
Size: $n_{customer}$ * (avg # of tuples in depositor with same cname)
= $n_{customer} \cdot (n_{depositor} / V(cname, depositor))$
= 10,000 * (5000 / 2500)
= 20,000
Cardinality of Join Queries
E3: cname in depositor is a foreign key referencing customer
Size: $n_{depositor}$ * (avg # of tuples in customer with same cname)
= $n_{depositor}$ * 1
= 5000
Note: If cname is a key for customer but NOT a foreign key for depositor, then 5000 is an UPPER BOUND
Some customer names in depositor may not match any customers in customer
Cardinality of Joins in general
Assume join: $R \bowtie S$
1. If R, S have no common attributes: $n_r \cdot n_s$
2. If R, S have attribute A in common:
$n_r \cdot \dfrac{n_s}{V(A, s)}$ or $n_s \cdot \dfrac{n_r}{V(A, r)}$ (take min)
3. If R, S have attribute A in common and:
1. A is a candidate key for R: $\le n_s$
2. A is candidate key in R and candidate key in S: $\le \min(n_r, n_s)$
3. A is a key for R, foreign key for S: $= n_s$
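The general rules can be tried on the customer/depositor statistics. This is a sketch with a hypothetical function name; V(cname, customer) = 10,000 is an assumption that follows from treating cname as a key of customer.

```python
def join_size(n_r, n_s, v_a_r=None, v_a_s=None):
    """Estimated size of R join S on a common attribute A (or the Cartesian product)."""
    if v_a_r is None and v_a_s is None:
        return n_r * n_s                        # no common attribute
    estimates = []
    if v_a_s is not None:
        estimates.append(n_r * n_s / v_a_s)     # each R tuple meets n_s / V(A, s) S tuples
    if v_a_r is not None:
        estimates.append(n_s * n_r / v_a_r)     # each S tuple meets n_r / V(A, r) R tuples
    return min(estimates)                       # take the lower estimate

n_customer, n_depositor = 10_000, 5_000

print(join_size(n_customer, n_depositor))                              # Cartesian product
print(join_size(n_customer, n_depositor, v_a_s=2_500))                 # only V(cname, depositor)
print(join_size(n_customer, n_depositor, v_a_r=10_000, v_a_s=2_500))   # both, take min
```

Taking the minimum recovers the foreign-key answer of 5000, since each depositor tuple matches exactly one customer.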
Nested-Loop Join
Algorithm 1: Nested Loop Join
Query: $R \bowtie S$
Idea:
[Figure: a block of R holding tuples t1, t2, t3 and a block of S holding tuples u1, u2, u3, with matching pairs written to blocks of results]
Compare: (t1, u1), (t1, u2), (t1, u3) ...
Then: GET NEXT BLOCK OF S
Repeat: for EVERY tuple of R
Nested-Loop Join
Algorithm 1: Nested Loop Join
Query: $R \bowtie S$
for each tuple tr in R do
  for each tuple us in S do
    test pair (tr, us) to see if they satisfy the join condition
    if they do (a "match"), add tr • us to the result.
R is called the outer relation and S the inner relation of the join.
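The pseudocode translates almost line for line into Python. This is an in-memory sketch with hypothetical sample tuples; a real system iterates over disk blocks rather than lists.

```python
def nested_loop_join(R, S, matches):
    """Tuple-at-a-time nested loop join: R is the outer relation, S the inner one."""
    result = []
    for tr in R:                      # for each tuple tr in R
        for us in S:                  # for each tuple us in S
            if matches(tr, us):       # test the join condition
                result.append(tr + us)  # tr concatenated with us
    return result

# Hypothetical sample data: depositor(cname, acct_no), customer(cname, ccity).
depositor = [("Jones", "A-101"), ("Smith", "A-215"), ("Jones", "A-102")]
customer = [("Jones", "Harrison"), ("Smith", "Rye"), ("Hayes", "Rye")]

joined = nested_loop_join(depositor, customer, lambda d, c: d[0] == c[0])
for row in joined:
    print(row)
```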
Nested-Loop Join (Cont.)
Cost:
Worst case, if buffer size is 3 blocks: $n_r \cdot b_s + b_r$ disk accesses.
Best case, if the buffer is big enough for the entire INNER relation + 2 blocks: $b_r + b_s$ disk accesses.
Assuming worst-case memory availability, the cost estimate is
5000 * 400 + 100 = 2,000,100 disk accesses with depositor as the outer relation, and
10000 * 100 + 400 = 1,000,400 disk accesses with customer as the outer relation.
If the smaller relation (depositor) fits entirely in memory, the cost estimate will be 500 disk accesses. (Actually we need 2 more blocks.)
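These figures follow directly from the two cost formulas, as the short check below shows. The function names are hypothetical; the statistics are the ones from the running example.

```python
def nlj_worst_case(n_outer, b_outer, b_inner):
    """Worst case: the inner relation is rescanned once per outer tuple."""
    return n_outer * b_inner + b_outer

def nlj_best_case(b_outer, b_inner):
    """Best case: the inner relation stays in memory, so each block is read once."""
    return b_outer + b_inner

n_customer, b_customer = 10_000, 400
n_depositor, b_depositor = 5_000, 100

depositor_outer = nlj_worst_case(n_depositor, b_depositor, b_customer)
customer_outer = nlj_worst_case(n_customer, b_customer, b_depositor)
in_memory = nlj_best_case(b_customer, b_depositor)

print(depositor_outer, customer_outer, in_memory)
```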
Join Algorithms
Algorithm 2: Block Nested Loop Join
Query: $R \bowtie S$
Idea:
[Figure: a block of R holding tuples t1, t2, t3 and a block of S holding tuples u1, u2, u3, with matching pairs written to blocks of results]
Compare: (t1, u1), (t1, u2), (t1, u3)
(t2, u1), (t2, u2), (t2, u3)
(t3, u1), (t3, u2), (t3, u3)
Then: GET NEXT BLOCK OF S
Repeat: for EVERY BLOCK of R
Block Nested-Loop Join
for each block BR of R do
  for each block BS of S do
    for each tuple tr in BR do
      for each tuple us in BS do
        check if (tr, us) satisfy the join condition
        if they do (a "match"), add tr • us to the result.
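A small in-memory sketch of the block version, which also counts simulated block reads. The helper `to_blocks`, the block size of 2 tuples, and the sample data are hypothetical; lists of lists stand in for disk blocks.

```python
def to_blocks(tuples, per_block):
    """Split a list of tuples into fixed-size 'blocks' (lists)."""
    return [tuples[i:i + per_block] for i in range(0, len(tuples), per_block)]

def block_nested_loop_join(R_blocks, S_blocks, matches):
    """Pair every block of R with every block of S, then join tuples within the pair."""
    result = []
    block_reads = 0
    for BR in R_blocks:
        block_reads += 1                  # read one outer block
        for BS in S_blocks:
            block_reads += 1              # read one inner block per outer block
            for tr in BR:
                for us in BS:
                    if matches(tr, us):
                        result.append(tr + us)
    return result, block_reads

# Hypothetical sample data: depositor(cname, acct_no), customer(cname, ccity).
depositor = [("Jones", "A-101"), ("Smith", "A-215"), ("Jones", "A-102")]
customer = [("Jones", "Harrison"), ("Smith", "Rye"), ("Hayes", "Rye")]

R_blocks = to_blocks(depositor, 2)
S_blocks = to_blocks(customer, 2)
joined, reads = block_nested_loop_join(R_blocks, S_blocks, lambda d, c: d[0] == c[0])
print(len(joined), "result tuples,", reads, "block reads")
```

The inner relation is now rescanned once per outer block rather than once per outer tuple, which is where the savings over the tuple-at-a-time version come from.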
Block Nested-Loop Join (Cont.)
Cost:
Worst case estimate: $b_r \cdot b_s + b_r$ block accesses.
Best case: $b_r + b_s$ block accesses. Same as nested loop.
Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:
In block nested-loop, use M - 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
Cost = $\lceil b_r / (M - 2) \rceil \cdot b_s + b_r$
If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match
Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer (with LRU replacement)
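The effect of the buffer size M on the cost can be checked numerically. The sketch uses the block counts of the running example (depositor as the outer relation); the function names and the choice M = 52 are assumptions made for illustration.

```python
import math

def bnlj_worst_case(b_outer, b_inner):
    """One scan of the inner relation per outer block."""
    return b_outer * b_inner + b_outer

def bnlj_with_buffer(b_outer, b_inner, M):
    """Use M - 2 blocks for the outer relation; one block each for inner input and output."""
    if M < 3:
        raise ValueError("need at least 3 buffer blocks")
    return math.ceil(b_outer / (M - 2)) * b_inner + b_outer

b_depositor, b_customer = 100, 400

print(bnlj_worst_case(b_depositor, b_customer))
print(bnlj_with_buffer(b_depositor, b_customer, M=3))    # same as the worst case
print(bnlj_with_buffer(b_depositor, b_customer, M=52))   # assumed buffer of 52 blocks
```

With M = 3 the formula reduces to the worst-case estimate, and each additional buffer block for the outer relation cuts the number of inner scans.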