Chapter 10 Sorting and Searching - Lakehead...

CS 2412 Data Structures

Chapter 10

Sorting and Searching

Some concepts

• Sorting is one of the most common data-processing

applications.

• Sorting algorithms are classed as either internal or external.

• Sorting order can be either ascending sequence or descending

sequence.

• Sort stability is an attribute of a sort, indicating that data with

equal keys maintain their relative input order in the output.

• Sort efficiency is usually measured by the number of comparisons and moves required for the sorting. The best possible comparison-based sorting algorithms are O(n log n).

• During the sorting process, each traversal of the data is

referred to as a sort pass.

Data Structure 2016 R. Wei 2

Selection sorts

• Heap sort: discussed earlier. First build a heap. Then repeatedly remove the root of the heap, move the last element to the root, and reheap down.

• Straight selection sort: in each pass, the smallest element is selected from the unsorted sublist and exchanged with the element at the beginning of the unsorted sublist.


Algorithm selectionSort (list, last)
  set current to 0
  loop (until last element sorted)
    set smallest to current
    set walker to current + 1
    loop (walker <= last)
      if (walker key < smallest key)
        set smallest to walker
      end if
      increment walker
    end loop
    exchange (current, smallest)
    increment current
  end loop
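The straight selection sort can be sketched in C as follows. This is a minimal illustration for an array of int keys, where last is the index of the final element; it is not necessarily the exact code used in the course.

```c
/* Straight selection sort: sorts list[0..last] in ascending order. */
void selectionSort(int list[], int last)
{
    for (int current = 0; current < last; current++) {
        int smallest = current;

        /* Scan the unsorted sublist for the smallest key. */
        for (int walker = current + 1; walker <= last; walker++) {
            if (list[walker] < list[smallest])
                smallest = walker;
        }

        /* Exchange it with the first element of the unsorted sublist. */
        int temp = list[current];
        list[current] = list[smallest];
        list[smallest] = temp;
    }
}
```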


The efficiency of selection sort

• Straight selection sort: O(n^2). The algorithm has two nested loops, each of which executes about n times.

• Heap sort: O(n log n). Building the heap takes about n log n steps, and sorting from the heap takes another n log n steps. In big-O notation, the complexity is O(n log n).
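For comparison with the straight selection sort, a heap sort might look like the sketch below. It assumes a max-heap stored in the array itself and builds the heap bottom-up with a hypothetical helper named reheapDown, which is one common way to do it and may differ from the heap-building procedure discussed earlier in the course.

```c
/* Move list[root] down until the subtree rooted there is a max-heap. */
static void reheapDown(int list[], int root, int last)
{
    int hold = list[root];
    int child = 2 * root + 1;

    while (child <= last) {
        /* Pick the larger of the two children. */
        if (child < last && list[child + 1] > list[child])
            child++;
        if (hold >= list[child])
            break;
        list[root] = list[child];
        root = child;
        child = 2 * root + 1;
    }
    list[root] = hold;
}

/* Heap sort: sorts list[0..last] in ascending order. */
void heapSort(int list[], int last)
{
    /* Build the heap, starting from the last parent node. */
    for (int i = (last - 1) / 2; i >= 0; i--)
        reheapDown(list, i, last);

    /* Repeatedly move the root (largest key) to the end and reheap. */
    for (int end = last; end > 0; end--) {
        int temp = list[0];
        list[0] = list[end];
        list[end] = temp;
        reheapDown(list, 0, end - 1);
    }
}
```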


Insertion sorts

• Straight insertion sort: the list is divided into sorted and unsorted sublists. In each pass the first element of the unsorted sublist is inserted into the sorted sublist at its correct position.

• Shell sort: the list is divided into K segments and each segment is sorted (the segments are dispersed through the list). After each pass, the number of segments is reduced according to an increment. When the number of segments is reduced to 1, the list is sorted.


Algorithm insertionSort(list, last)

set current to 1

loop (until last element sorted)

move current element to hold

set walker to current - 1

loop (walker >= 0 AND hold key < walker key)

move walker element right one element

decrement walker

end loop

move hold to walker + 1 element

increment current

end loop
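The insertionSort pseudocode translates almost line for line into C. The sketch below assumes an array of int keys, with last as the index of the final element.

```c
/* Straight insertion sort: sorts list[0..last] in ascending order. */
void insertionSort(int list[], int last)
{
    for (int current = 1; current <= last; current++) {
        int hold = list[current];      /* first unsorted element */
        int walker = current - 1;

        /* Shift larger sorted elements one position to the right. */
        while (walker >= 0 && hold < list[walker]) {
            list[walker + 1] = list[walker];
            walker--;
        }
        list[walker + 1] = hold;       /* insert at its correct position */
    }
}
```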


The main idea of the Shell sort is to divide the list into segments and use insertion sort to sort each segment.

The elements of a segment are located one increment apart. In the following example, the list is of size 10. The 5 segments for increment K = 5 are as follows:

Segment 1. A[0], A[5]

Segment 2. A[1], A[6]

Segment 3. A[2], A[7]

Segment 4. A[3], A[8]

Segment 5. A[4], A[9]

Then for increment K = 2

Segment 1. A[0], A[2], A[4], A[6], A[8]

Segment 2. A[1], A[3], A[5], A[7], A[9]
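To make the segment layout concrete, the following sketch lists the indexes that belong to one segment. The function name segmentIndices and its 0-based segment numbering are assumptions made for this illustration only.

```c
/* Stores in out[] the indexes of segment seg (0-based) of a list of
   size n for increment k, and returns how many indexes were stored. */
int segmentIndices(int n, int k, int seg, int out[])
{
    int count = 0;

    for (int i = seg; i < n; i += k)
        out[count++] = i;
    return count;
}
```

For a list of size 10, calling it with k = 2 and seg = 1 yields the indexes 1, 3, 5, 7, 9, which is Segment 2 for increment K = 2.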


Algorithm shellSort (list, last)

set incre to (last + 1) / 2

loop (incre not 0)

set current to incre

loop(until last element sorted)

move current element to hold

set walker to current - incre

loop (walker>=0 AND hold key < walker key)

move walker element one increment right

set walker to walker - incre

end loop

move hold to walker + incre element

increment current

end loop

set incre to incre / 2

end loop


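void shellSort (int list [], int last)
{
    int hold;
    int incre;
    int walker;

    // Start with n / 2, where n = last + 1 is the list size.
    // (Using last / 2 would leave a two-element list unsorted.)
    incre = (last + 1) / 2;
    while (incre != 0)
    {
        // Insertion sort on the elements that are incre apart.
        for (int curr = incre; curr <= last; curr++)
        {
            hold = list [curr];
            walker = curr - incre;
            while (walker >= 0 && hold < list [walker])
            {
                // Move the larger element one increment right.
                list [walker + incre] = list [walker];
                walker = ( walker - incre );
            } // while
            list [walker + incre] = hold;
        } // for curr
        incre = incre / 2;
    } // while
    return;
} // shellSort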

Note

In the above algorithm, the increment starts at n/2 and is halved after each pass. This is simple, but it is not the most efficient choice. Ideally, the increments should be set so that no two elements appear in the same segment more than once, but this is not easy in general.
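As an illustration of a different increment choice, the sketch below uses the sequence 1, 4, 13, 40, ... (each increment is three times the previous one plus 1), which is often attributed to Knuth. This is a swapped-in alternative, not the method of these notes, and the function name shellSortKnuth is made up for the example.

```c
/* Shell sort with increments 1, 4, 13, 40, ... instead of halving. */
void shellSortKnuth(int list[], int last)
{
    int n = last + 1;
    int incre = 1;

    /* Find the largest increment of the sequence to start from. */
    while (incre < n / 3)
        incre = 3 * incre + 1;

    while (incre >= 1) {
        /* Insertion sort on the elements that are incre apart. */
        for (int curr = incre; curr <= last; curr++) {
            int hold = list[curr];
            int walker = curr - incre;

            while (walker >= 0 && hold < list[walker]) {
                list[walker + incre] = list[walker];
                walker -= incre;
            }
            list[walker + incre] = hold;
        }
        incre /= 3;   /* 13 -> 4 -> 1 -> 0 */
    }
}
```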


Insertion sort efficiency:

• Straight insertion sort: O(n^2). The algorithm has two nested loops. The number of executions is about n(n + 1)/2.

• Shell sort: the complexity is difficult to analyze. Empirical studies show that the average sort complexity is about O(n^1.25).


Exchange sorts

• Bubble sort: the list is divided into two sublists: sorted and unsorted. In each pass, the smallest element is bubbled from the unsorted sublist to the sorted sublist.

• Quick sort: in each pass a pivot is selected. The elements less than the pivot and the elements greater than or equal to the pivot are separated into two sublists. The pivot is put at its ultimately correct location in the list.


Example (bubble sort; ∥ marks the boundary between the sorted and unsorted sublists):

23 78 45 8 56 32

8 ∥23 78 45 32 56

8 23 ∥32 78 45 56

8 23 32 ∥45 78 56

8 23 32 45 ∥56 78


Algorithm bubbleSort(list, last)

set current to 0

set sorted to false

loop (current <= last AND sorted false)

set walker to last

set sorted to true

loop (walker > current)

if (walker data < walker - 1 data)

set sorted to false

exchange (list, walker, walker -1)

end if

decrement walker

end loop

increment current

end loop


Note for quick sort

• There are different methods for selecting the pivot.

– Select the first element.

– Select the middle element.

– Select the median value of three elements: left, right and

the element in the middle of the list. This text uses this

method.

• When the partition becomes small, a straight insertion sort can

be used, which may be more efficient.


Example for one pass of a quick sort:


Algorithm medianLeft(sortData, left, right)

set mid to (left + right ) /2

if (left key > mid key)

exchange (sortData, left, mid)

end if

if (left key > right key)

exchange ( sortData, left, right)

end if

if(mid key > right key)

exchange (sortData, mid, right)

end if

exchange (sortData, left, mid) //put pivot in left.
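A possible C rendering of medianLeft, together with a quick sort that uses it, is sketched below. The partition loop and the names quickSort, sortLeft, and sortRight are assumptions made for this example; the notes also suggest switching to a straight insertion sort for small partitions, which this sketch leaves out for brevity.

```c
static void exchange(int a[], int i, int j)
{
    int temp = a[i];
    a[i] = a[j];
    a[j] = temp;
}

/* Order the left, middle, and right keys, then put the median in left. */
static void medianLeft(int sortData[], int left, int right)
{
    int mid = (left + right) / 2;

    if (sortData[left] > sortData[mid])
        exchange(sortData, left, mid);
    if (sortData[left] > sortData[right])
        exchange(sortData, left, right);
    if (sortData[mid] > sortData[right])
        exchange(sortData, mid, right);
    exchange(sortData, left, mid);          /* put pivot in left */
}

/* Quick sort of sortData[left..right] in ascending order. */
void quickSort(int sortData[], int left, int right)
{
    if (left >= right)
        return;

    medianLeft(sortData, left, right);
    int pivot = sortData[left];
    int sortLeft = left + 1;
    int sortRight = right;

    /* Keys less than the pivot go left, the others go right. */
    while (sortLeft <= sortRight) {
        while (sortLeft <= sortRight && sortData[sortLeft] < pivot)
            sortLeft++;
        while (sortLeft <= sortRight && sortData[sortRight] >= pivot)
            sortRight--;
        if (sortLeft < sortRight) {
            exchange(sortData, sortLeft, sortRight);
            sortLeft++;
            sortRight--;
        }
    }

    /* sortRight is now the pivot's final location. */
    exchange(sortData, left, sortRight);
    quickSort(sortData, left, sortRight - 1);
    quickSort(sortData, sortRight + 1, right);
}
```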


The list in Figure 12-15 is sorted as follows:


The exchange sort efficiency:

• Bubble sort: O(n^2). There are two nested loops in the algorithm. The number of comparisons is about n(n + 1)/2.

• Quick sort: O(n log n) on average. The algorithm has 5 loops. However, in each pass the partitions are generally about half the size of those in the previous pass. Roughly speaking, there are log2 n passes in total.


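void bubbleSort (int list [], int last)
{
    int temp;
    int sorted = 0;

    // Each pass bubbles the smallest unsorted element down to list[current].
    // sorted is declared once, so the inner loop updates the same flag
    // that the outer loop tests.
    for (int current = 0; current <= last && !sorted; current++)
    {
        sorted = 1;     // assume sorted until an exchange occurs
        for (int walker = last; walker > current; walker--)
            if (list[ walker ] < list[ walker - 1 ])
            {
                sorted = 0;
                temp = list[walker];
                list[walker] = list[walker - 1];
                list[walker - 1] = temp;
            } // if
    } // for current
    return;
} // bubbleSort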


External sorts

In external sorting, portions of the data may be stored in secondary

memory during the sorting process.

One important method for external sorting is to merge sorted files into one sorted file.


Merge sorts

A simple merge combines two sorted files into one file. For example, we have two sorted lists:

• 1, 3, 5

• 2, 4, 6, 8, 10

After merging these two lists, we should obtain the following list:

1, 2, 3, 4, 5, 6, 8, 10.


The following algorithm merges two sorted files, file1 and file2. The combined data are written to file3.

Algorithm mergeFiles
  open files
  read (file1 into record1)
  read (file2 into record2)
  loop (not end file1 OR not end file2)
    if (record1.key <= record2.key)
      write (record1 to file3)
      read (file1 into record1)
      if (end of file1)
        set record1.key to infinity
      end if
    else
      write (record2 to file3)
      read (file2 into record2)
      if (end of file2)
        set record2.key to infinity
      end if
    end if
  end loop
  close files
end mergeFiles
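A rough C version of mergeFiles is sketched below. It assumes each file holds whitespace-separated int keys, that the caller has already opened the three streams, and that INT_MAX never occurs as a real key so it can play the role of infinity. The helper readKey is hypothetical.

```c
#include <limits.h>
#include <stdio.h>

/* Read the next key; return INT_MAX ("infinity") at end of file. */
static int readKey(FILE *fp)
{
    int key;

    return (fscanf(fp, "%d", &key) == 1) ? key : INT_MAX;
}

/* Merge the sorted keys of file1 and file2 into file3. */
void mergeFiles(FILE *file1, FILE *file2, FILE *file3)
{
    int key1 = readKey(file1);
    int key2 = readKey(file2);

    /* Loop until both inputs are exhausted. */
    while (key1 != INT_MAX || key2 != INT_MAX) {
        if (key1 <= key2) {
            fprintf(file3, "%d\n", key1);
            key1 = readKey(file1);
        } else {
            fprintf(file3, "%d\n", key2);
            key2 = readKey(file2);
        }
    }
}
```

Because an exhausted file reports an infinite key, the remaining keys of the other file are copied by the same loop without a separate clean-up step.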


Merge unsorted files:

• Form merge runs in the files. Each run is ordered.

• The end of each run is identified by a stepdown (a key that is smaller than the key before it).

• Merge the corresponding runs of the two files.

• When one run reaches its stepdown, the rest of the other run is rolled out (copied to the merged file).
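The idea of a stepdown can be illustrated by counting the merge runs in a sequence of keys. The sketch below uses an in-memory array in place of a file, and the name countRuns is made up for the example.

```c
/* Count the merge runs in data[0..n-1]. A new run starts at every
   stepdown, that is, wherever a key is smaller than its predecessor. */
int countRuns(const int data[], int n)
{
    if (n <= 0)
        return 0;

    int runs = 1;
    for (int i = 1; i < n; i++) {
        if (data[i] < data[i - 1])   /* stepdown */
            runs++;
    }
    return runs;
}
```

For the keys 3, 7, 9, 2, 5, 8, 1, 4 there are stepdowns at 2 and at 1, so the function reports three runs.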


The sorting process:

• Sort phase: divide the file into merge runs according to the size of memory. For example, suppose we have 2300 records but memory can handle only 500 records. We first read in 500 records and sort them as the first merge run. Then we read and sort records 501-1000 as the first run of merge file 2, and so on.

• Merge phase: merge the sorted runs.
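The number of runs produced by the sort phase follows from a ceiling division, as the small sketch below shows. The function name initialRuns is hypothetical.

```c
/* Number of sorted runs created when a file of `records` records is
   sorted in memory-sized pieces of at most `capacity` records each. */
int initialRuns(int records, int capacity)
{
    return (records + capacity - 1) / capacity;   /* ceiling division */
}
```

With 2300 records and room for 500 in memory, this gives 5 runs: four full runs of 500 records and a final run of 300.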


There are different merge concepts. We discuss three of them as examples:

• Natural merge: after each merge, all data are written to one file, and a distribution phase is needed to redistribute the data to two files.

• Balanced merge: uses a constant number of input merge files and the same number of output merge files.

• Polyphase merge: a constant number of input merge files are merged into one output merge file, and an input merge file is immediately reused once its input has been completely merged.


Searching

• Binary search: for a sorted list.

• Sequential search:

– Straight sequential search: at each step, check whether the key equals the target AND whether it is the last key.

– Sentinel sequential search: add the target at the end of the list so that each step only has to check whether the key equals the target.

– Probability search: when a target is found, move the element containing the target up one location. In this way, the most frequent targets become easier to find.
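The searches listed above could be written in C roughly as follows. These are sketches for arrays of int keys; each function returns the index of the target or -1 if it is absent, which is a convention chosen for this example. The sentinel version assumes the array has one spare slot at list[last + 1].

```c
/* Binary search on a sorted list[0..last]. */
int binarySearch(const int list[], int last, int target)
{
    int first = 0;

    while (first <= last) {
        int mid = first + (last - first) / 2;
        if (list[mid] < target)
            first = mid + 1;
        else if (list[mid] > target)
            last = mid - 1;
        else
            return mid;
    }
    return -1;
}

/* Sentinel search: the target is stored in the spare slot
   list[last + 1], so the loop needs only one test per element. */
int sentinelSearch(int list[], int last, int target)
{
    int i = 0;

    list[last + 1] = target;
    while (list[i] != target)
        i++;
    return (i <= last) ? i : -1;
}

/* Probability search: a found element moves up one location.
   Returns the element's new index. */
int probabilitySearch(int list[], int last, int target)
{
    for (int i = 0; i <= last; i++) {
        if (list[i] == target) {
            if (i == 0)
                return 0;
            int temp = list[i - 1];
            list[i - 1] = list[i];
            list[i] = temp;
            return i - 1;
        }
    }
    return -1;
}
```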


Hashed list searches

• Hashing is a method that uses a key-to-address mapping to find data quickly.

• The basic idea is to use a hash function to map a key (drawn from a large range) to an index (drawn from a small range) of the data.

• Some keys may be mapped to the same index (synonyms). We then need some method to resolve the collision.

• The main part of hashing is finding good hashing methods.


Hashing methods:

• Direct method: the range of keys and the range of indexes are the same.

• Subtraction method: subtract a fixed number from the key. This also requires that both ranges be the same size.

• Modulo-division method: index = key MODULO listSize.

• Digit-extraction method: select the digits at certain positions as the index.

• Midsquare method: the key is squared and the middle digits are used as the index.

• Folding method: in fold shift, the key is divided into parts whose size matches the size of the index, and the left and right parts are shifted and added to the middle part; in fold boundary, the left and right parts are folded on a fixed boundary between them and the center part, so the digits of the two outside parts are reversed before they are added.


• Rotation method: rotate the last character to the front of the key. It is usually used in combination with other methods.

• Pseudorandom method: the key is used as the seed of a pseudorandom number generator, and the resulting random number is then scaled into the possible index range.
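Several of the hashing methods above can be sketched as small C functions. The key lengths, the digit positions, and the three-digit index size used here are assumptions chosen only to make the examples concrete.

```c
/* Modulo-division: index = key MODULO listSize. */
unsigned hashModulo(unsigned long key, unsigned listSize)
{
    return (unsigned)(key % listSize);
}

/* Digit extraction: digits 1, 3, and 4 (from the left) of a
   six-digit key, e.g. 379452 -> 394. */
unsigned hashDigitExtraction(unsigned long key)
{
    unsigned d1 = (unsigned)(key / 100000 % 10);
    unsigned d3 = (unsigned)(key / 1000 % 10);
    unsigned d4 = (unsigned)(key / 100 % 10);

    return d1 * 100 + d3 * 10 + d4;
}

/* Midsquare: middle four digits of the square of a four-digit key,
   e.g. 9452 * 9452 = 89340304 -> 3403. */
unsigned hashMidsquare(unsigned long key)
{
    return (unsigned)(key * key / 100 % 10000);
}

/* Fold shift: add the three-digit parts of the key and keep the
   last three digits, e.g. 123 + 456 + 789 = 1368 -> 368. */
unsigned hashFoldShift(unsigned long key)
{
    unsigned long sum = 0;

    while (key > 0) {
        sum += key % 1000;
        key /= 1000;
    }
    return (unsigned)(sum % 1000);
}

/* Reverse the digits of a three-digit part, e.g. 123 -> 321. */
static unsigned reverse3(unsigned part)
{
    return (part % 10) * 100 + (part / 10 % 10) * 10 + part / 100;
}

/* Fold boundary for a nine-digit key: the outer parts are reversed,
   e.g. 321 + 456 + 987 = 1764 -> 764. */
unsigned hashFoldBoundary(unsigned long key)
{
    unsigned left = (unsigned)(key / 1000000);
    unsigned mid = (unsigned)(key / 1000 % 1000);
    unsigned right = (unsigned)(key % 1000);

    return (reverse3(left) + mid + reverse3(right)) % 1000;
}
```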


Some concepts used in collision resolution methods:

• Load factor: the number of elements in the list divided by the number of physical elements allocated for the list, expressed as a percentage (preferably less than 75):

α = (k / n) × 100,

where k is the number of filled elements and n is the number of elements allocated.

• Clustering: as data are added to a list and collisions are

resolved, some hashing algorithms tend to cause data to group

within the list.
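As a quick check of the load factor formula, a one-line helper (the name loadFactor is an assumption) could be:

```c
/* Load factor as a percentage: k filled elements out of n allocated. */
double loadFactor(int k, int n)
{
    return 100.0 * k / n;
}
```

For example, 30 elements stored in a list with 50 allocated elements give a load factor of 60, which is below the suggested limit of 75.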


Open addressing to resolve collisions (disadvantage: each collision

resolution increases the probability of future collisions).

• Linear probe: when data cannot be stored in the home address,

we resolve the collision by adding 1 to the current address.


• Quadratic probe: the increment is the collision probe number

squared.


• Pseudorandom collision resolution (double hashing): use a pseudorandom number to resolve the collision. The collision address is used as the key of the pseudorandom generator.


• Key offset (double hashing): calculate the new address as a function of the old address and the key.

For example:

offSet = key / listSize

address = (offSet + old address) modulo listSize
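The probing rules above can be sketched in C as follows. The sketch assumes non-negative int keys, a modulo-division hash for the home address, and a hypothetical EMPTY marker for unused slots; the function names are made up for this example.

```c
#define EMPTY (-1)   /* hypothetical marker; assumes keys are >= 0 */

/* Quadratic probe: the increment is the probe number squared. */
int quadraticProbe(int address, int probe, int listSize)
{
    return (address + probe * probe) % listSize;
}

/* Key offset: the new address depends on the old address and the key. */
int keyOffset(int key, int address, int listSize)
{
    int offSet = key / listSize;

    return (offSet + address) % listSize;
}

/* Insert with a linear probe; returns the address used, or -1 if full. */
int insertLinear(int table[], int listSize, int key)
{
    int address = key % listSize;            /* home address */

    for (int probes = 0; probes < listSize; probes++) {
        if (table[address] == EMPTY) {
            table[address] = key;
            return address;
        }
        address = (address + 1) % listSize;  /* add 1, wrap around */
    }
    return -1;
}

/* Search with the same linear probe; returns the address or -1. */
int searchLinear(const int table[], int listSize, int key)
{
    int address = key % listSize;

    for (int probes = 0; probes < listSize; probes++) {
        if (table[address] == EMPTY)
            return -1;
        if (table[address] == key)
            return address;
        address = (address + 1) % listSize;
    }
    return -1;
}
```

With a list of size 7, the keys 10, 17, and 24 all have home address 3, so a linear probe stores them at addresses 3, 4, and 5. This also shows how collisions resolved by open addressing tend to produce clusters.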


Linked list collision resolution: uses a separate area to store collisions and chains all synonyms together in a linked list (usually in LIFO sequence). Two storage areas are used: the prime area and the overflow area.
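A simplified sketch of linked list collision resolution is shown below. To keep it short, the prime area here is just an array of chain heads and every key lives in a dynamically allocated node, which differs slightly from a prime area that stores the first record itself. The names Node, chainInsert, and chainSearch are assumptions, and keys are assumed to be non-negative.

```c
#include <stdlib.h>

typedef struct node {
    int key;
    struct node *next;
} Node;

/* Insert key at the head of its chain (LIFO sequence).
   Returns 1 on success, 0 if memory could not be allocated. */
int chainInsert(Node *table[], int size, int key)
{
    Node *newNode = malloc(sizeof *newNode);

    if (newNode == NULL)
        return 0;

    int home = key % size;
    newNode->key = key;
    newNode->next = table[home];
    table[home] = newNode;
    return 1;
}

/* Walk the chain of synonyms; returns the node or NULL if not found. */
Node *chainSearch(Node *table[], int size, int key)
{
    for (Node *p = table[key % size]; p != NULL; p = p->next) {
        if (p->key == key)
            return p;
    }
    return NULL;
}
```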


Bucket hashing: keys are hashed to buckets, which are nodes that accommodate multiple data occurrences (disadvantages: it uses more space, much of which may be empty, and a collision still occurs when a bucket is full).


Combination approaches may be used: for example, bucket hashing first, then a linear probe if the bucket is full.

