Indexer Documentation


Indexer Project CS50

Author: Joon H. Cho
Assignment: Lab 5 – Indexer
Class: CS 50
Date: 05/08/2014

Design Specification

(1) *Input*

Command Input: ./indexer [TARGET_DIRECTORY] [RESULTS FILENAME] [RESULTS FILENAME] [REWRITTEN FILENAME]

Example command input: ./indexer ../crawler/output/ index.dat index.dat new_index.dat

[TARGET_DIRECTORY] ../crawler/output
Requirement: The directory must exist and be readable. The names of the files inside must consist only of digits. The files are text files, with the url on the first line, the depth on the second line, and the html document beginning on the third line.
Usage: The indexer needs to inform the user if the directory is not found.

[RESULTS FILENAME] index.dat
Requirement: Must not already exist. Needs to be writeable.
Usage: The indexer must inform the user if that file already exists.

[RESULTS FILENAME] index.dat
Requirement: A file to read from.
Usage: The indexer will alert the user if the file is not the same as the previously created results filename or if it cannot be opened.

[REWRITTEN FILENAME] new_index.dat
Requirement: A new file to be written to.
Usage: The indexer will alert the user if the filename already exists or cannot be written to.

(2) *Output*

For each file in the given directory, the indexer converts the html inside the document into a string and then parses that string for the words in the html. It adds the words to a hashtable of word nodes, where each word node holds a linked list of the documents that contain the word. These documents are document nodes, each of which contains the document id and the frequency of the word in that document. The indexer then writes all words in the hashtable to [RESULTS FILENAME]. Each line of the written text file has the format [word] [number of documents it appears in], followed by pairs of document id and frequency, one pair for each document the word appears in. For example, the pair 897 2 means that the word appears in doc 897 2 times.

If the user inputs 4 arguments and tests the indexer, then [RESULTS FILENAME] is used to recreate a hashtable of word nodes. This hashtable is then written out again to [REWRITTEN FILENAME]. This way, the user can compare the two files to see if they have the same contents.

(3) *Data Flow*

The number of documents in the given directory is found, and the filenames (with path) are stored in a filename array. We loop through the filename array, convert the html code inside each document into a string, and then parse the html for the words inside it. For each word, we add it to a hashtable of words and adjust its list of document nodes. After all documents have been parsed, we write the hashtable to a file as specified in (2).

If we are testing the indexer, then we recreate the hashtable from the [RESULTS FILENAME] given by the user, which must be the same as the [RESULTS FILENAME] that we have just written to. We add to the hashtable by parsing each line for the word, then adding a document node for each document id and frequency pair on the line. Then we write the second hashtable to another file, [REWRITTEN FILENAME]. The contents of the two files are compared to check that the indexer is working properly.

(4) *Data Structures*

GeneralHashTable
Contains an array of GeneralHashTableNodes. We can store any kind of object in the GeneralHashTable.

GeneralHashTableNode
a. char *hash_key: the string which we pass to the hash function to determine the index at which to place the hash node
b. void *object: the object to place into the GeneralHashTableNode
c. GeneralHashTableNode *next: a pointer to the next GeneralHashTableNode in case of collisions (these nodes form a list)

We need a data structure to hold the words and their document information, and a way to quickly retrieve the data for each individual word. This is why we use a hashtable and insert the word nodes described below into it, which gives O(1) average-case access.
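As an illustration of the GeneralHashTable described above, here is a minimal sketch of a chained hashtable in C. The three node fields come from this specification. The slot count TABLE_SLOTS, the hash function HashString, and the helpers FindNode, HashAdd, and HashLookup are assumptions made for the example, not the names or sizes used in the actual indexer.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SLOTS 1024  /* assumed size; the specification does not give one */

typedef struct GeneralHashTableNode {
    char *hash_key;                     /* string passed to the hash function */
    void *object;                       /* stored object, e.g. a word node */
    struct GeneralHashTableNode *next;  /* next node in the collision list */
} GeneralHashTableNode;

typedef struct GeneralHashTable {
    GeneralHashTableNode *table[TABLE_SLOTS];
} GeneralHashTable;

/* Hypothetical string hash (djb2 style); any reasonable string hash works. */
static unsigned long HashString(const char *key) {
    unsigned long hash = 5381;
    for (; *key != '\0'; key++) {
        hash = hash * 33 + (unsigned char)*key;
    }
    return hash % TABLE_SLOTS;
}

/* Walk the collision list of the key's slot; NULL if the key is absent. */
static GeneralHashTableNode *FindNode(const GeneralHashTable *ht, const char *key) {
    GeneralHashTableNode *node = ht->table[HashString(key)];
    while (node != NULL && strcmp(node->hash_key, key) != 0) {
        node = node->next;
    }
    return node;
}

/* Return the object stored under key, or NULL if there is none. */
void *HashLookup(const GeneralHashTable *ht, const char *key) {
    GeneralHashTableNode *node = FindNode(ht, key);
    return node != NULL ? node->object : NULL;
}

/* Return 1 if the object was added, 0 if the key already exists
 * or memory could not be allocated. */
int HashAdd(GeneralHashTable *ht, const char *key, void *object) {
    if (FindNode(ht, key) != NULL) {
        return 0;
    }
    GeneralHashTableNode *node = malloc(sizeof *node);
    if (node == NULL) {
        return 0;
    }
    node->hash_key = malloc(strlen(key) + 1);
    if (node->hash_key == NULL) {
        free(node);
        return 0;
    }
    strcpy(node->hash_key, key);
    node->object = object;
    unsigned long slot = HashString(key);
    node->next = ht->table[slot];  /* push onto the front of the slot's list */
    ht->table[slot] = node;
    return 1;
}
```

A table allocated with calloc(1, sizeof(GeneralHashTable)) starts with every slot empty. HashAdd returning 0 for an existing key mirrors the "attempt to add, then update if it was not added" pattern that the indexer functions rely on.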
WordNode
Contains the word and the list of documents that contain it.
a. char *word: the string of the word
b. DocumentNode *page: a pointer to the first document that contains this word

DocumentNode
a. int doc_id: the document id
b. int freq: the frequency of the word in the document
c. DocumentNode *next: a pointer to the next document node that contains the word. This allows us to keep a linked list of documents for a word that appears in multiple documents.

(5) *Pseudocode*

1. Check arguments
2. Initialize variables to hold arguments

3. Initialize an array to hold the filenames of the documents and fill the array
4. Go through the array of filenames: load each document's html into a string, get the document id, parse the html for words, and add each word and the corresponding document information to the hashtable
5. Write the contents of the hashtable into [RESULTS FILENAME]
6. Free all memory we have used
7. If we are testing, recreate the hashtable from [RESULTS FILENAME], parsing it for each word and the document information associated with the word
8. Save the hashtable into [REWRITTEN FILENAME]
9. Free all the memory we have used

Functional Specification

indexer.c:

int CheckArguments(int argument_count, char **argument_array)
Using the argc and argv passed into main, it checks the number of arguments, checks that we have a valid data file directory, checks whether the index file already exists, and, if testing, checks that the result index is the same as the previously created index and that no new index already exists in the current directory.

int CheckIndexFile(char *path_to_index)
This checks for a valid path and then opens the file, reading it line by line to check that every line follows the format [word] [number of docs containing word] [doc_id] [frequency]. It also checks that the number of doc_id frequency pairs matches the number of docs that immediately follows the word.

char *LoadDocument(char *filename)
Using the filename, it opens the file, skips the first 2 lines, and reads each remaining line into a document string. It then returns that string, which contains the html code of the url inside the document.

int GetDocumentId(char *full_path_name, char *directory_name)
It takes the full filename of the document and the name of the directory that contains the documents, removes the directory name from the path name, and converts the remaining string into an integer. It returns the document id as an integer.
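The GetDocumentId step just described could look roughly like the sketch below. It is not the actual indexer source: the const qualifiers, the tolerance for a directory name given without a trailing slash, and the -1 return value for a path that does not reduce to an all-digit name are assumptions made for the example.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Strip directory_name from the front of full_path_name and convert the
 * rest to an integer. Returns -1 (an assumed error value) if the path does
 * not start with the directory or the remainder is not all digits. */
int GetDocumentId(const char *full_path_name, const char *directory_name) {
    size_t dir_len = strlen(directory_name);
    if (strncmp(full_path_name, directory_name, dir_len) != 0) {
        return -1;
    }

    const char *name = full_path_name + dir_len;
    if (*name == '/') {
        name++;  /* directory was given without a trailing slash */
    }
    if (*name == '\0') {
        return -1;  /* nothing left after the directory */
    }
    for (const char *p = name; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p)) {
            return -1;  /* filenames must be integers */
        }
    }

    errno = 0;
    long id = strtol(name, NULL, 10);
    if (errno == ERANGE || id > INT_MAX) {
        return -1;  /* too large to fit in an int */
    }
    return (int)id;
}
```

With the example directory from the Input section, GetDocumentId("../crawler/output/897", "../crawler/output/") yields 897, while a file such as ../crawler/output/notes.txt is rejected.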

int UpdateIndex(char *word, int document_Id, HashTable *index)
This function uses the word and document id to allocate a word node and a document node. The word node contains the word; the document node's frequency is initialized to 1 and its doc_id is set to document_Id. The function attempts to add the word node to the hashtable. If the WordNode is not added, meaning that the word already exists in the hashtable, then we search for the existing WordNode, search for the DocumentNode containing our document_Id, and increment its frequency.

int SaveIndexToFile(HashTable *index, char *filename)
We write the contents of the HashTable index to the file named by filename, which is given as an argument. We do this by checking every possible slot for a HashTableNode. Then we go through each WordNode at each slot, counting the number of document nodes and writing the word and the number of documents to the file. Then we go through each document node and print its doc_id and frequency. We return 1 if everything went okay, or 0 if something went wrong.
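To make the frequency update and the output line format concrete, here is a small sketch built on the WordNode and DocumentNode fields from the Data Structures section. AddOccurrence and FormatIndexLine are hypothetical helpers, not functions of the indexer: the first shows the per-word bookkeeping that UpdateIndex performs once the WordNode has been found, and the second builds one output line in a buffer rather than writing it to a file. Appending a new DocumentNode when the document is not yet in the list is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct DocumentNode {
    int doc_id;                 /* document id */
    int freq;                   /* occurrences of the word in this document */
    struct DocumentNode *next;  /* next document containing the word */
} DocumentNode;

typedef struct WordNode {
    char *word;                 /* the word itself */
    DocumentNode *page;         /* first document containing the word */
} WordNode;

/* Hypothetical helper: record one more occurrence of the word in doc_id.
 * Increments the matching DocumentNode, or appends a new one with freq 1.
 * Returns 1 on success, 0 if memory could not be allocated. */
int AddOccurrence(WordNode *wn, int doc_id) {
    DocumentNode **link = &wn->page;
    for (; *link != NULL; link = &(*link)->next) {
        if ((*link)->doc_id == doc_id) {
            (*link)->freq++;
            return 1;
        }
    }
    DocumentNode *dn = malloc(sizeof *dn);
    if (dn == NULL) {
        return 0;
    }
    dn->doc_id = doc_id;
    dn->freq = 1;
    dn->next = NULL;
    *link = dn;  /* attach to the last next pointer in the list */
    return 1;
}

/* Hypothetical helper: build "[word] [count] [doc_id] [freq] ..." in buf.
 * Returns 1 on success, 0 if buf is too small. */
int FormatIndexLine(const WordNode *wn, char *buf, size_t size) {
    int count = 0;
    for (const DocumentNode *dn = wn->page; dn != NULL; dn = dn->next) {
        count++;
    }

    int used = snprintf(buf, size, "%s %d", wn->word, count);
    if (used < 0 || (size_t)used >= size) {
        return 0;
    }
    for (const DocumentNode *dn = wn->page; dn != NULL; dn = dn->next) {
        size_t left = size - (size_t)used;
        int n = snprintf(buf + used, left, " %d %d", dn->doc_id, dn->freq);
        if (n < 0 || (size_t)n >= left) {
            return 0;
        }
        used += n;
    }
    return 1;
}
```

For a word seen twice in doc 897 and once in doc 3, the formatted line reads "cat 2 897 2 3 1": the word, the number of documents, then one doc_id frequency pair per document.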

struct HashTable *ReadFile(char *filepath)
Returns a HashTable struct containing the words inside [RESULTS FILENAME]. We go through each line and parse it, storing the first space-delimited string as the word and the next string as the number of documents. Then, for each document_id and frequency pair, we call UpdateIndexFromFile on the newly allocated hashtable with the word, the document id, and the frequency.

int UpdateIndexFromFile(char *word, int document_Id, int freq, HashTable *index)
We create a new WordNode and DocumentNode with the given document_Id and frequency. We then attempt to add the WordNode to the hashtable; if we cannot add it, we know that a WordNode for that word already exists. In that case, we loop through the list of documents for that WordNode and check that there are no duplicates. We then attach the DocumentNode to the last next pointer in the word's list of DocumentNodes.

Error Conditions Tested

Indexer makes sure that there are 2 or 4 arguments (that argc == 3 or 5) and exits on an incorrect number. It checks the data directory for existence and readability, and exits if the directory is invalid. It then checks that [RESULTS FILENAME] and [REWRITTEN FILENAME] do not already exist. Indexer opens every document in the data directory and prints to stderr if the document is wrong or if the document file is invalid (nonexistent or corrupt). If [RESULTS FILENAME] cannot be opened for testing, then we exit. Indexer checks that the filenames are all integers. If a document is invalid, then indexer skips it and the function that tests the document returns 1. Indexer also checks that the format of [RESULTS FILENAME] is correct. If the format is wrong, the program exits.

Test Cases and Expected Results

Test cases are described in detail in BATS.sh and the logfile that is produced by BATS.sh. The tests are:
a. Correct number of arguments. It tests 1, 3, and 5 arguments. If the number of arguments is incorrect, indexer prints an error and exits.
b. Correct directory. It tests for a nonexistent directory, a directory without a slash at the end, and a non-readable directory. If the directory is invalid, indexer prints an error and exits.
c. Correct [RESULTS FILENAME]. Indexer makes sure that [RESULTS FILENAME] and [REWRITTEN FILENAME] do not already exist. If testing, it makes sure that the [RESULTS FILENAME] we read from is the same one that was created, and tests that the file to read is in the proper format. If these conditions aren't satisfied, the program prints an error and exits.
d. Tests with documents in crawler/output/. If the [RESULTS FILENAME] and the [REWRITTEN FILENAME] are different, then BATS.sh prints an erro