Advanced Data Structures Notes

8/11/2019 Advanced Data Structures Notes


We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step, S[3] is a space and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting m = 4 and i = 0.

             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456

We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus, not only do we omit previously matched characters of S, but also previously matched characters of W.

             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456

This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of W and begin searching at the next character of S: m = 11, reset i = 0.

Once again we immediately hit upon a match "ABCDAB", but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before, we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.

This time we are able to complete the match, whose first character is S[15].

Algorithm:


algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)

    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            let m ← m + i - T[i]
            if T[i] > -1 then
                let i ← T[i]
            else
                let i ← 0

    (if we reach here, we have searched all of S unsuccessfully)
    return the length of S
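The pseudocode above relies on a table T that is "computed elsewhere". The following is a minimal C++ sketch of both steps, assuming the convention T[0] = -1 that the search loop depends on. The names kmp_table and kmp_search and the use of std::string are illustrative choices, not part of the original notes.

```cpp
#include <string>
#include <vector>

// Build the partial-match table for W, with T[0] = -1 (assumed convention).
std::vector<int> kmp_table(const std::string& W) {
    std::vector<int> T(W.size() < 2 ? 2 : W.size(), 0);
    T[0] = -1;
    T[1] = 0;
    std::size_t pos = 2;  // entry of T currently being computed
    int cnd = 0;          // index in W of the next character of the candidate prefix
    while (pos < W.size()) {
        if (W[pos - 1] == W[cnd]) {
            T[pos++] = ++cnd;      // the candidate prefix grows by one character
        } else if (cnd > 0) {
            cnd = T[cnd];          // fall back to a shorter candidate
        } else {
            T[pos++] = 0;          // no candidate prefix ends here
        }
    }
    return T;
}

// Returns the zero-based position of W in S, or the length of S if W is absent.
int kmp_search(const std::string& S, const std::string& W) {
    if (W.empty()) return 0;
    const std::vector<int> T = kmp_table(W);
    const int n = static_cast<int>(S.size());
    const int k = static_cast<int>(W.size());
    int m = 0;  // beginning of the current match in S
    int i = 0;  // position of the current character in W
    while (m + i < n) {
        if (W[i] == S[m + i]) {
            if (i == k - 1) return m;
            ++i;
        } else {
            m = m + i - T[i];
            i = (T[i] > -1) ? T[i] : 0;
        }
    }
    return n;
}
```

With the strings of the worked example, kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD") follows the same sequence of m values as the trace (4, 8, 11, 15) and reports the match at position 15.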

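Efficiency of KMP:

Since the two portions of the algorithm (building the table and searching) have, respectively, complexities of O(k) and O(n), where k is the length of W and n is the length of S, the complexity of the overall algorithm is O(n + k). These complexities are the same, no matter how many repetitive patterns are in W or S.

Implementation of KMP:

#include <iostream>
#include <cstring>
using namespace std;

class str {
private:
    int f[45];          // failure function of the pattern
    int i, j, m, n;
public:
    void failure(char[]);
    int kmpmatch(char[], char[]);
};

// f[i] = length of the longest proper prefix of x[0..i] that is also a suffix of it
void str::failure(char x[]) {
    m = strlen(x);
    j = 0;
    i = 1;
    f[0] = 0;
    while (i < m) {
        if (x[i] == x[j]) {
            f[i] = j + 1;
            i++;
            j++;
        } else if (j > 0)
            j = f[j - 1];
        else {
            f[i] = 0;
            i++;
        }
    }
}

// returns the position of the first occurrence of p in t, or -1 if there is none
int str::kmpmatch(char t[], char p[]) {
    failure(p);
    i = 0;
    j = 0;
    n = strlen(t);
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == m - 1)
                return i - m + 1;
            i++;
            j++;
        } else if (j > 0)
            j = f[j - 1];   // fall back in the pattern, not in the text
        else
            i++;
    }
    return -1;
}

int main() {
    str b;
    char t[50], p[20];
    // prompt and result wording is assumed; the original strings did not survive
    cout << "Enter the text: ";
    cin.getline(t, sizeof t);
    cout << "Enter the pattern: ";
    cin.getline(p, sizeof p);
    int a = b.kmpmatch(t, p);
    if (a != -1)
        cout << "Pattern found at position " << a << endl;
    else
        cout << "Pattern not found" << endl;
    return 0;
}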

The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Here n is the length of the pattern P, m is the length of the text T, indices start at 1, and an alignment k places the last character of P under T[k]. Instead of a brute-force search of all alignments (of which there are m - n + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible.

The algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index n in P and k in T, moving backward: the strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted to the right according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.

The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.

The Good Suffix Rule

Description

- - - - X - - K - - - - -
M A N P A N A M A N A P -
A N A M P N A M - - - - -
- - - - A N A M P N A M -

Demonstration of good suffix rule with pattern ANAMPNAM.

The good suffix rule is markedly more complex in both concept and implementation than the bad character rule. It is the reason comparisons begin at the end of the pattern rather than the start, and is formally stated thus:[3]

Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P is below substring t in T. If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, that is, shift P past T.

Preprocessing

The good suffix rule requires two tables: one for use in the general case, and another for use when either the general case returns no meaningful result or a match occurs. These tables will be designated L and H respectively. Their definitions are as follows:[3]

For each i, L[i] is the largest position less than n such that string P[i..n] matches a suffix of P[1..L[i]] and such that the character preceding that suffix is not equal to P[i-1]. L[i] is defined to be zero if there is no position satisfying the condition.

Let H[i] denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If none exists, let H[i] be zero.

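Both of these tables are constructible in O(n) time and use O(n) space. The alignment shift for index i in P is given by n - L[i] or n - H[i]. H should only be used if L[i] is zero or a match has been found.

Performance:

The Boyer-Moore algorithm as presented in the original paper has worst-case running time of O(n+m) only if the pattern does not appear in the text.

Implementation of Boyer-Moore:

#include <iostream>
#include <cstring>
using namespace std;

// This program uses only a last-occurrence (bad character) table, not the
// good suffix tables L and H. The bodies of last() and psize() and the start
// of boyer() did not survive in the source; they are filled in here with the
// standard last-occurrence formulation, which is an assumption.
class bm {
public:
    char M[20], P[15];   // M: the text, P: the pattern
    int last(char);
    int psize();
    int boyer(int, int);
    int min(int, int);
};

// index of the last occurrence of ch in P, or -1 if ch does not occur in P
int bm::last(char ch) {
    int l[256], i, k;
    for (i = 0; i < 256; i++)
        l[i] = -1;
    for (k = 0; k < psize(); k++)
        l[(unsigned char)P[k]] = k;
    return l[(unsigned char)ch];
}

int bm::psize() {
    return strlen(P);
}

// n: length of the text M, m: length of the pattern P
// returns the position of the first match in M, or -1 if there is none
int bm::boyer(int n, int m) {
    if (m == 0 || m > n)
        return -1;
    int i = m - 1, j = m - 1;
    do {
        if (P[j] == M[i]) {
            if (j == 0)
                return i;
            else {
                i = i - 1;
                j = j - 1;
            }
        } else {
            i = i + m - min(j, 1 + last(M[i]));
            j = m - 1;
        }
    } while (i <= n - 1);
    return -1;
}

int bm::min(int i, int j) {
    return ((j > i) ? i : j);
}

int main() {
    bm b1;
    int m, n, x;
    // prompt and result wording is assumed; the original strings did not survive
    cout << "Enter the text: ";
    cin.getline(b1.M, sizeof b1.M);
    cout << "Enter the pattern: ";
    cin.getline(b1.P, sizeof b1.P);
    m = strlen(b1.P);
    n = strlen(b1.M);
    x = b1.boyer(n, m);
    if (x == -1)
        cout << "Pattern not found" << endl;
    else
        cout << "Pattern found at position " << x << endl;
    return 0;
}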

Ex: bit one of 1000 is 1, and bits two, three, and four are 0.

All keys in the left subtree of a node at level i have bit i equal to zero, whereas those in the right subtree of nodes at this level have bit i equal to one.

Assume a fixed number of bits per key. If the tree is not empty:
  The root contains one dictionary pair (any pair).
  All remaining pairs whose key begins with a 0 are in the left subtree.
  All remaining pairs whose key begins with a 1 are in the right subtree.
  The left and right subtrees are digital subtrees on the remaining bits.

This digital search tree contains the keys 1000, 0010, 1001, 0001, 1100, 0000.

Example: Start with an empty digital search tree and insert a pair whose key is 0110.

Now, insert a pair whose key is 0010.

Now insert a pair whose key is 1011.

Search and Insert: The digital search tree functions to search and insert are quite similar to the corresponding functions for binary search trees. The essential difference is that the subtree to move to is determined by a bit in the search key rather than by the result of the comparison of the search key and the key in the current node.
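The search and insert operations just described can be sketched in C++ as follows. This is a minimal illustration under stated assumptions: keys are unsigned integers of a fixed width kBits = 4 (as in the examples), only keys are stored rather than full dictionary pairs, and the names DstNode, bit, dst_insert, and dst_search are hypothetical.

```cpp
#include <memory>

constexpr int kBits = 4;  // fixed key width in bits (assumed)

struct DstNode {
    unsigned key;
    std::unique_ptr<DstNode> left, right;
    explicit DstNode(unsigned k) : key(k) {}
};

// Bit i of key, counting from 1 at the leftmost (most significant) bit.
inline bool bit(unsigned key, int i) { return (key >> (kBits - i)) & 1u; }

// Inserts key; returns false if the key was already in the tree.
bool dst_insert(std::unique_ptr<DstNode>& root, unsigned key) {
    std::unique_ptr<DstNode>* cur = &root;
    int level = 1;  // bit number that decides the branch at the current node
    while (*cur) {
        if ((*cur)->key == key) return false;
        cur = bit(key, level) ? &(*cur)->right : &(*cur)->left;
        ++level;
    }
    *cur = std::make_unique<DstNode>(key);
    return true;
}

// Returns true if key is stored in the tree.
bool dst_search(const std::unique_ptr<DstNode>& root, unsigned key) {
    const DstNode* cur = root.get();
    int level = 1;
    while (cur) {
        if (cur->key == key) return true;  // one key comparison per node visited
        cur = (bit(key, level) ? cur->right : cur->left).get();
        ++level;
    }
    return false;
}
```

Inserting 1000, 0010, 1001, 0001, 1100, 0000 in that order places 1000 at the root, 0010 as its left child, and 1001 as its right child, because the branch at the root is chosen by bit one of each key.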

Try to build the digital search tree:

A 00001   S 10011   E 00101   R 10010   C 00011   H 01000
I 01001   N 01110   G 00111   X 11000   M 01101   P 10000

When we are dealing with very long keys, the cost of a key comparison is high. We can reduce the number of key comparisons to one by using a related structure called Patricia.

We shall develop this structure in three steps. First, we introduce a structure called a binary trie. Then we transform binary tries into compressed binary tries. Finally, from compressed binary tries we obtain Patricia.

7.5 Binary Trie

A binary trie is a binary tree that has two kinds of nodes: branch nodes and element nodes.

A branch node has the two data members LeftChild and RightChild. It has no data member.

An element node has the single data member data.

Branch nodes are used to build a binary tree search structure similar to that of a digital search tree. This leads to element nodes.

Six element binary trie:

Compressed Binary trie: The binary trie contains branch nodes whose degree is one. By adding another data member, BitNumber, to each branch node, we can eliminate all degree-one branch nodes from the trie. The BitNumber data member of a branch node gives the bit number of the key that is to be used at this node.

Binary Trie with degree one nodes eliminated:

7.6 Patricia (Practical Algorithm To Retrieve Information Coded In Alphanumeric)

Compressed binary tries may be represented using nodes of a single type. The new nodes, called augmented branch nodes, are the original branch nodes augmented by the data member data. The resulting structure is called Patricia and is obtained from a compressed binary trie in the following way:

(1) Replace each branch node by an augmented branch node.
(2) Eliminate the element nodes.
(3) Store the data previously in the element nodes in the data members of the augmented branch nodes. Since every nonempty compressed binary trie has one less branch node than it has element nodes, it is necessary to add one augmented branch node. This node is called the head node. The remaining structure is the left subtree of the head node. The head node has BitNumber equal to zero. Its right-child data member is not used. The assignment of data to augmented branch nodes is done so that the BitNumber in the augmented branch node is less than or equal to that in the parent of the element node that contained this data.
(4) Replace the original pointers to element nodes by pointers to the respective augmented branch nodes.

#include <stdio.h>
#include <stdlib.h>

/* The type element (with an unsigned member key) and the function
   bit(k, i), which returns bit i of k, are assumed to be declared elsewhere. */
#define IS_FULL(ptr) (!(ptr))

typedef struct patricia_tree *patricia;
struct patricia_tree {
    int bit_number;
    element data;
    patricia left_child, right_child;
};
patricia root;

Patricia search:

patricia search(patricia t, unsigned k)
{
    /* search the Patricia tree t; return the last node y encountered;
       if k == y->data.key, the key is in the tree */
    patricia p, y;
    if (!t) return NULL;   /* empty tree */
    y = t->left_child;
    p = t;
    while (y->bit_number > p->bit_number) {
        p = y;
        y = (bit(k, y->bit_number)) ? y->right_child : y->left_child;
    }
    return y;
}

Patricia insert:

void insert(patricia *t, element x)
{
    /* insert x into the Patricia tree *t */
    patricia s, p, y, z;
    int i;
    if (!(*t)) {   /* empty tree: x becomes the head node */
        *t = (patricia)malloc(sizeof(struct patricia_tree));
        if (IS_FULL(*t)) {
            fprintf(stderr, "The memory is full\n");
            exit(1);
        }
        (*t)->bit_number = 0;
        (*t)->data = x;
        (*t)->left_child = *t;
        return;    /* without this, the search below would report x as a duplicate */
    }
    y = search(*t, x.key);
    if (x.key == y->data.key) {
        fprintf(stderr, "The key is in the tree. Insertion fails.\n");
        exit(1);
    }
    /* find the first bit where x.key and y->data.key differ */
    for (i = 1; bit(x.key, i) == bit(y->data.key, i); i++)
        ;
    /* search tree using the first i-1 bits */
    s = (*t)->left_child;
    p = *t;
    while (s->bit_number > p->bit_number && s->bit_number < i) {
        p = s;
        s = (bit(x.key, s->bit_number)) ? s->right_child : s->left_child;
    }
    /* add x as a child of p */
    z = (patricia)malloc(sizeof(struct patricia_tree));
    if (IS_FULL(z)) {
        fprintf(stderr, "The memory is full\n");
        exit(1);
    }
    z->data = x;
    z->bit_number = i;
    z->left_child = (bit(x.key, i)) ? s : z;
    z->right_child = (bit(x.key, i)) ? z : s;
    if (s == p->left_child)
        p->left_child = z;
    else
        p->right_child = z;
}

Assignment Questions:

Pattern matching and Tries

1. Explain pattern matching algorithms
   i. the Boyer-Moore algorithm
   ii. the Knuth-Morris-Pratt algorithm
2. Define tries. Give the concepts of a digital search tree. What are the applications of tries?
3. Write short notes on
   i. binary trie
   ii. Patricia
   iii. Multi-way trie
4. Describe an efficient algorithm to find the longest palindrome that is a suffix of a string T of length n.

UNIT-VIII

Topics: File Structures: Fundamental File Processing Operations - opening files, closing files, reading and writing file contents, special characters in files. Fundamental File Structure Concepts - field and record organization, managing fixed-length, fixed-field buffers.

8.1 Fundamental file processing operations

physical file: A file as seen by the operating system, and which actually exists on secondary storage.
logical file: A file as seen by a program.

Programs read and write data from logical files.
Before a logical file can be used, it must be associated with a physical file.
This act of connection is called "opening" the file.
Data in a physical file is persistent.
Data in a logical file is temporary.
A logical file is identified (within the program) by a program variable or constant.

C++ supports file access on three levels:
  Unbuffered, unformatted file access using handles.
  Buffered, formatted file access using the FILE structure.
  Buffered, formatted file access using classes.

8.1.1 Opening Files

open: To associate a logical program file with a physical system file.
protection mode: The security status of a file, defining who is allowed to access a file and which access modes are allowed.
access mode: The type of (file) access allowed.
file descriptor: A cardinal number used as the identifier for a logical file by operating systems such as UNIX and PC-DOS.

In C++, a file is opened by using library functions.
The name of the physical file must be supplied to an open function.
The open function must also be supplied with an access mode.
The open function can also be supplied with a protection mode.
The access mode has several aspects:
  Is the file to be accessed by reading, by writing, or by both?
  What should be done with existing contents of the file?
  Should a new file be created if none exists?
  Should any character translation be done?

For handle level access, the logical file is declared as an int. The handle is also known as a file descriptor. The open function is used to open a file for handle level access. The value returned by open is assigned to the file variable.
For FILE level access, the logical file is declared as a FILE *. The fopen function is used to open a file for FILE level access. The value returned by fopen is assigned to the file variable.
For class level access, the logical file is declared as an fstream (or as an ifstream or an ofstream). The open method of the class is used to open a file for class level access.

8.1.2 Closing Files

close: To disassociate a logical program file from a physical system file.

Closing a file frees system resources for reuse.
Data may not be actually written to the physical file until a logical file is closed.
A program should close a file when it is no longer needed.
The close function is used to close a file for handle level access. The logical file to be closed is an argument to the handle close function.
The fclose function is used to close a file for FILE level access. The logical file to be closed is an argument to the FILE fclose function.
The close method of the class is used to close a file for class level access.
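As a brief illustration of opening and closing at two of the three levels, here is a hedged sketch using only the standard library: fopen/fclose for FILE level access and ifstream::open/close for class level access. The function names write_with_FILE and read_with_ifstream are hypothetical. Handle level access is left out because open and close on integer handles come from the operating system's headers rather than from standard C++.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// FILE level access: open the physical file `path`, write `text`, close it.
bool write_with_FILE(const char* path, const char* text) {
    std::FILE* fp = std::fopen(path, "w");  // access mode "w": create or truncate
    if (!fp) return false;                  // the open failed
    bool ok = std::fputs(text, fp) >= 0;
    // fclose flushes buffered data; only then is it certain to be in the file
    return std::fclose(fp) == 0 && ok;
}

// Class level access: open the same physical file and read its first line.
std::string read_with_ifstream(const char* path) {
    std::ifstream in;
    in.open(path);                          // default access mode: reading, text
    std::string line;
    if (in.is_open()) std::getline(in, line);
    in.close();
    return line;
}
```

Writing "hello\n" with the first function and reading the file back with the second yields "hello", which shows the same physical file being attached to two different logical files in turn.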

8.1.3 Reading and writing

read: To transfer data from a file to program variable(s).
write: To transfer data to a file from program variable(s) or constant(s).
end-of-file: A physical location just beyond the last datum in a file.

The read and write operations are performed on the logical file with calls to library functions.
For read, one or more variables must be supplied to the read function, to receive the data from the file.
For write, one or more values (as variables or constants) must be supplied to the write function, to provide the data for the file.
For unformatted transfers, the amount of data to be transferred must also be supplied.

The read function is used to read a file with handle level access.
The write function is used to write a file with handle level access.
The fread function is used to read a file with FILE level access.
The fwrite function is used to write a file with FILE level access.
The read method of the class is used to read a file with class level access.
The write method of the class is used to write a file with class level access.

The acronym for end-of-file is EOF.
When a file reaches EOF, no more data can be read.
Data can be written at or past EOF.
The eof function is used to detect end-of-file with handle level access. The handle eof function detects when the file pointer is at end-of-file.
The feof function is used to detect end-of-file with FILE level access. The FILE feof function detects when the file pointer is past end-of-file.
The eof method of the class is used to detect end-of-file with class level access. The stream class eof function detects when the file pointer is past end-of-file.



Seeking:

seek: To move to a specified location in a file.
byte offset: The distance, measured in bytes, from the beginning.

Seeking moves an attribute in the file called the file pointer.
C++ library functions allow seeking.
In DOS, Windows, and UNIX, files are organized as streams of bytes, and locations are in terms of byte count.
Seeking can be specified from one of three reference points:
  o The beginning of the file.
  o The end of the file.
  o The current file pointer position.

The lseek function is used to seek with handle level access.
The fseek function is used to seek with FILE level access.
The seekg method of the class is used to seek with class level access for read (get).
The seekp method of the class is used to seek with class level access for write (put).
C++ allows, but does not require, separate file pointers and seek functions for reading and writing.
Most implementations of C++ do not have separate file pointers.

8.1.4 Special characters in files

Specifics of files can vary with the operating system.
The C++ language was written originally for the UNIX operating system.
UNIX and DOS (Windows) systems handle separators between lines differently.
In UNIX files, lines are separated by a single new line character (ASCII line feed).
In DOS (Windows) files, lines are separated by two characters (ASCII carriage return and line feed).
When DOS files are opened in text mode, the internal separator ('\n') is translated to the external separator (<CR><LF>) during read and write.
When DOS files are opened in binary mode, the internal separator ('\n') is not translated to the external separator (<CR><LF>) during read and write.
In DOS (Windows) files, end-of-file can be marked by a "control-Z" character (ASCII SUB).
In C++ implementations for DOS, a control-Z in a file is interpreted as end-of-file.
Other operating systems may handle line separation and end-of-file differently.
Implementations of C++ should treat text files so that the internal representations are the same as UNIX.
On UNIX systems, files opened in text mode or binary mode behave the same way.

8.2 Fundamental file structure concepts

Persistent = Retained after execution of the program which created it.

When we build file structures, we are making it possible to make data persistent. That is, one program can store data from memory to a file, and terminate. Later, another program can retrieve the data from the file, and process it in memory.
In this chapter, we look at file structures which can be used to organize the data within the file, and at the algorithms which can be used to store and retrieve the data sequentially.

8.2.1 Field and Record organization

Record: A subdivision of a file, containing data related to a single entity.
Field: A subdivision of a record containing a single attribute of the entity which the record describes.
Stream of bytes: A file which is regarded as being without structure beyond separation into a sequential set of bytes.

Within a program, data is temporarily stored in variables.
Individual values can be aggregated into structures, which can be treated as a single variable with parts.
In C++, classes are typically used as an aggregate structure.

C++ Person class (version 0.1):

class Person {
public:
    char FirstName [11];
    char LastName [11];
    char Address [21];
    char City [21];
    char State [3];
    char ZIP [6];   // five digits plus the terminating '\0'
};

With this class declaration, variables can be declared to be of type Person. The individual fields within a Person can be referred to as the name of the variable and the name of the field, separated by a period (.).

C++ Program:

#include <iostream>
#include <cstring>
using namespace std;

class Person {
public:
    char FirstName [11];
    char LastName [11];
    char Address [31];
    char City [21];
    char State [3];
    char ZIP [6];   // five digits plus the terminating '\0'
};

void Display (Person);

int main () {
    Person Clerk;
    Person Customer;

    strcpy (Clerk.FirstName, "Fred");
    strcpy (Clerk.LastName, "Flintstone");
    strcpy (Clerk.Address, "4444 Granite Place");
    strcpy (Clerk.City, "Rockville");
    strcpy (Clerk.State, "MD");
    strcpy (Clerk.ZIP, "00001");

    strcpy (Customer.FirstName, "Lily");
    strcpy (Customer.LastName, "Munster");
    strcpy (Customer.Address, "1313 Mockingbird Lane");
    strcpy (Customer.City, "Hollywood");
    strcpy (Customer.State, "CA");
    strcpy (Customer.ZIP, "90210");

    Display (Clerk);
    Display (Customer);
}

void Display (Person Someone) {
    // Prints every field; the output layout is an assumption, since the
    // body of this function was cut off after "cout" in the source.
    cout << Someone.FirstName << ' ' << Someone.LastName << '\n'
         << Someone.Address << '\n'
         << Someone.City << ", " << Someone.State << ' ' << Someone.ZIP << '\n';
}

Fixed length records: A record which is predetermined to be the same length as the other records in the file.

| Record 1 | Record 2 | Record 3 | Record 4 | Record 5 |

The file is divided into records of equal size.
All records within a file have the same size.
Different files can have different length records.
Programs which access the file must know the record length.
Offset, or position, of the nth record of a file can be calculated.
There is no external overhead for record separation.
There may be internal fragmentation (unused space within records).
There will be no external fragmentation (unused space outside of records) except for deleted records.
Individual records can always be updated in place.

Example (80 byte records):

 0  66 69 72 73 74 20 6C 69 6E 65  0  0  1  0  0  0   first line......
10   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0   ................
20  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0   h.....@.


Offset, or position, of the nth record of a file cannot be calculated.
There is external overhead for record separation equal to the size of the delimiter per record.
There should be no internal fragmentation (unused space within records).
There may be no external fragmentation (unused space outside of records) after file updating.
Individual records cannot always be updated in place.

Example (Delimiter = ASCII 30 (1E hex) = RS character):

 0  66 69 72 73 74 20 6C 69 6E 65 1E 73 65 63 6F 6E   first line.secon
10  64 20 6C 69 6E 65 1E                              d line.

Example (Delimiter = '\n'):

 0  46 69 72 73 74 20 28 31 73 74 29 20 4C 69 6E 65   First (1st) Line
10   D  A 53 65 63 6F 6E 64 20 28 32 6E 64 29 20 6C   ..Second (2nd) l
20  69 6E 65  D  A                                    ine..

Disadvantage: the offset of each record cannot be calculated from its record number. This makes direct access impossible.
Disadvantage: there is space overhead for the delimiter suffix.
Advantage: there will probably be no internal fragmentation (unusable space within records).

Length prefixed variable length records:

| 110 | Record 1 | 40 | Record 2 | 100 | Record 3 | 80 | Record 4 | 70 | Record 5 |

The records within a file are prefixed by a length byte or bytes.
Records within a file can have different sizes.
Different files can have different length records.
Programs which access the file must know the size and format of the length prefix.
Offset, or position, of the nth record of a file cannot be calculated.
There is external overhead for record separation equal to the size of the length prefix per record.
There should be no internal fragmentation (unused space within records).
There may be no external fragmentation (unused space outside of records) after file updating.
Individual records cannot always be updated in place.

Example:

 0   A  0 46 69 72 73 74 20 4C 69 6E 65  B  0 53 65   ..First Line..Se
10  63 6F 6E 64 20 4C 69 6E 65 1F  0 54 68 69 72 64   cond Line..Third
20  20 4C 69 6E 65 20 77 69 74 68 20 6D 6F 72 65 20    Line with more 
30  63 68 61 72 61 63 74 65 72 73                     characters

Disadvantage: the offset of each record cannot be calculated from its record number. This makes direct access impossible.
Disadvantage: there is space overhead for the length prefix.
Advantage: there will probably be no internal fragmentation (unusable space within records).
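The layout in the dump above, a two byte length stored least significant byte first and followed by the record bytes, can be produced and parsed with the sketch below. It works on an in-memory byte vector to stay self-contained; the names pack_records and unpack_records are hypothetical, and records longer than 65535 bytes are assumed not to occur.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append each record preceded by its length as two bytes, low byte first.
std::vector<std::uint8_t> pack_records(const std::vector<std::string>& records) {
    std::vector<std::uint8_t> out;
    for (const std::string& r : records) {
        std::uint16_t len = static_cast<std::uint16_t>(r.size());  // assumes size <= 65535
        out.push_back(static_cast<std::uint8_t>(len & 0xFF));
        out.push_back(static_cast<std::uint8_t>(len >> 8));
        out.insert(out.end(), r.begin(), r.end());
    }
    return out;
}

// Walk the buffer sequentially: read a prefix, then that many record bytes.
std::vector<std::string> unpack_records(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::string> records;
    std::size_t pos = 0;
    while (pos + 2 <= bytes.size()) {
        std::size_t len = bytes[pos] | (static_cast<std::size_t>(bytes[pos + 1]) << 8);
        pos += 2;
        if (pos + len > bytes.size()) break;  // truncated final record: stop
        records.emplace_back(bytes.begin() + pos, bytes.begin() + pos + len);
        pos += len;
    }
    return records;
}
```

Reaching record n requires reading the prefixes of all earlier records, which is the sequential access cost named in the first disadvantage.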





Indexed variable length records:

An auxiliary file can be used to point to the beginning of each record.
In this case, the data records can be contiguous.
If the records are contiguous, the only access is through the index file.

Example:

Index File:

 0  12  0  0  0 25  0  0  0 47  0  0  0               ....%...G...

Data File:

 0  46 69 72 73 74 20 28 31 73 74 29 20 53 74 72 69   First (1st) Stri
10  6E 67 53 65 63 6F 6E 64 20 28 32 6E 64 29 20 53   ngSecond (2nd) S
20  74 72 69 6E 67 54 68 69 72 64 20 28 33 72 64 29   tringThird (3rd)
30  20 53 74 72 69 6E 67 20 77 68 69 63 68 20 69 73    String which is
40  20 6C 6F 6E 67 65 72                               longer

Advantage: the offset of each record is contained in the index, and can be looked up from its record number. This makes direct access possible.
Disadvantage: there is space overhead for the index file.
Disadvantage: there is time overhead for the index file.
Advantage: there will probably be no internal fragmentation (unusable space within records).

The time overhead for accessing the index file can be minimized by reading the entire index file into memory when the files are opened.

Fixed field count records: Records can be recognized if they always contain the same (predetermined) number of fields.

Delineation of fields in a record:

Fixed length fields:

| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 |

Each record is divided into fields of correspondingly equal size.
Different fields within a record have different sizes.
Different records can have different length fields.
Programs which access the record must know the field lengths.
There is no external overhead for field separation.
There may be internal fragmentation (unused space within fields).

Delimited variable length fields:

Field 1 ! Field 2 ! Field 3 ! Field 4 ! Field 5 !

The fields within a record are followed by a delimiting byte or series of bytes.
Fields within a record can have different sizes.
Different records can have different length fields.
Programs which access the record must know the delimiter.
The delimiter cannot occur within the data.
If used with delimited records, the field delimiter must be different from the record delimiter.
There is external overhead for field separation equal to the size of the delimiter per field.
There should be no internal fragmentation (unused space within fields).

Length prefixed variable length fields:

| 12 | Field 1 | 4 | Field 2 | 10 | Field 3 | 8 | Field 4 | 7 | Field 5 |

The fields within a record are prefixed by a length byte or bytes.
Fields within a record can have different sizes.
Different records can have different length fields.
Programs which access the record must know the size and format of the length prefix.
There is external overhead for field separation equal to the size of the length prefix per field.
There should be no internal fragmentation (unused space within fields).

Representing record or field length:

Record or field length can be represented in either binary or character form.
The length can be considered as another hidden field within the record.
This length field can be either fixed length or delimited.
When character form is used, a space can be used to delimit the length field.
A two byte fixed length field could be used to hold lengths of 0 to 65535 bytes in binary form.
A two byte fixed length field could be used to hold lengths of 0 to 99 bytes in decimal character form.
A variable length field delimited by a space could be used to hold effectively any length.
In some languages, such as strict Pascal, it is difficult to mix binary values and character values in the same file.
The C++ language is flexible enough so that the use of either binary or character format is easy.

Tagged fields:

Tags, in the form "Keyword=Value", can be used in fields.
Use of tags does not in itself allow separation of fields, which must be done with another method.
Use of tags adds significant space overhead to the file.
Use of tags does add flexibility to the file structure.
Fields can be added without affecting the basic structure of the file.
Tags can be useful when records have sparse fields, that is, when a significant number of the possible attributes are absent.

Mixing numbers and characters: use of a file dump

A file dump gives us the ability to look inside a file at the actual bytes that are stored.
Octal dump: od -xc filename
e.g. the number 40, stored as ASCII characters and as a short integer.

Byte order:

The byte order of integers (and floating point numbers) is not the same on all computers.
This is hardware dependent (CPU), not software dependent.
Many computers store numbers as might be expected: 40 (decimal) = 28 (hexadecimal) is stored in a four byte integer as 00 00 00 28.
PCs reverse the byte order, and store numbers with the least significant byte first: 40 (decimal) = 28 (hexadecimal) is stored in a four byte integer as 28 00 00 00.
On most computers, the number 40 would be stored in character form in its ASCII values: 34 30.
IBM mainframe computers use EBCDIC instead of ASCII, and would store "40" as F4 F0.
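The byte order of the machine running a program can be observed by copying an integer into a byte array, and a file format can avoid the dependence altogether by writing bytes in a fixed order. The following is a small sketch of both ideas; the names is_little_endian and store_big_endian are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// True if this machine stores the least significant byte first (28 00 00 00).
bool is_little_endian() {
    std::uint32_t value = 40;                  // 00000028 in hexadecimal
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);  // inspect the in-memory layout
    return bytes[0] == 0x28;
}

// Store a 32-bit value most significant byte first (00 00 00 28 for 40),
// whatever the byte order of the host machine is.
void store_big_endian(std::uint32_t value, unsigned char out[4]) {
    out[0] = static_cast<unsigned char>(value >> 24);
    out[1] = static_cast<unsigned char>(value >> 16);
    out[2] = static_cast<unsigned char>(value >> 8);
    out[3] = static_cast<unsigned char>(value);
}
```

Writing the bytes produced by store_big_endian, instead of the raw integer, gives files that read back identically on machines with either byte order.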

8.3 Managing fixed-length, fixed-field buffers

With fixed-length, fixed-field buffers, instead of writing the size of each field or each record to the file, we can write methods that control the fixed length of each field.
The class FixedLengthBuffer is a subclass of IOBuffer. It supports both fixed-length and fixed-field buffers. An object of the FixedLengthBuffer class can record the size of each record.
The FixedFieldBuffer class, which derives from FixedLengthBuffer, is given below:

class FixedFieldBuffer : public FixedLengthBuffer {
public:
    FixedFieldBuffer(int maxFields, int RecordSize = 3000);
    FixedFieldBuffer(int maxFields, int *fieldSize);
    int AddField(int fieldSize);                  // define the next field
    int Pack(const void *field, int size = -1);
    int Unpack(void *field, int maxBytes = -1);
    int NumberOfFields() const;                   // return number of defined fields
protected:
    int *FieldSize;   // array to hold field sizes
    int maxFields;    // max number of fields
    int NumFields;    // actual number of defined fields
};

The AddField method is used to specify the size of a field. The total number of fields can be obtained using the NumberOfFields method.

Assignment Questions:

File Structures

1. Explain the fundamental file processing operations
   i. opening files
   ii. closing files
   iii. reading and writing file contents
   iv. special characters in files
2. Discuss the fundamental file structure concepts
   i. field and record organization
   ii. managing fixed-length, fixed-field buffers

