AI - a short summary


    Associative text retrieval from a large document collection using unorganized neural networks.

    1. Importance of information

    The previous year the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel was awarded to James Mirrlees and William Vickrey for their fundamental contributions to the theory of incentives under asymmetric information.

    With their work they validated not only the importance of information but also the importance of access to that information [http://www.nobel.se/economics/laureates/2001/ecoadv.pdf].

    Nowadays everyone in the Western world, especially after the development of the Internet, has the ability to access a large amount of data, in electronic or paper form. The main problem that we usually face is that the volume of this information is so large that we cannot easily handle it, or, worse, that it is of no use to us.

    In order to take advantage of this information we need to categorize it into thematically coherent groups and so turn it into manageable data. Until a few decades ago this was the librarian's job, but as we mentioned before the volume of data has increased so dramatically [http://www.isoc.org/inet2000/cdproceedings/8e/8e_4.htm#s6] that traditional indexing methods are no longer in a position to face this new challenge.

    The problem becomes bigger when we need to categorize new documents based on their content. Of course, many documents carry an abstract at the top, but in fact only scientific papers written for a special purpose have this form; for example, an abstract is essential for a paper but not for a newspaper or a magazine.

    Some people believe that when we talk about retrieving data through the Internet things are very easy, because there we have the assistance of search engines. The Internet search engines are some of the largest and most commonly used search engines on the World Wide Web. Their huge databases of millions of Web pages typically index every word on each of the pages. By using them, searchers expect to find every page that contains an occurrence of their search term, while the public in general hopes to find pages on the subject of the terms they enter. Unfortunately, the Web search engines and their databases do neither. They can certainly find some pages that contain the search terms and an occasional page that is actually about the concept represented by the search terms. But search engines do not understand the content of a page; they only play with statistics and probability. Unfortunately, many web designers, in order to attract more visitors to their pages, use common words in meta-keys or in the body of their web pages that have no relation to the content of the page [http://webreference.com/content/search/how.html].

    Matters are easier for classical bibliographical databases, because there we have an ordinary rate of growth and of course we already have very good indexes (thanks to librarians). But the amount of data still grows, and we need tools that will search the available data and categorize it based on its content, so that it can be categorized in real time according to what the researcher of the database wants at this particular time; in other words, to deliver customized services.


    2. Text Retrieval Methods

    In a conventional information retrieval system the stored text is normally identified by sets of keywords known as index terms. Requests for information are typically expressed by Boolean combinations of index terms, consisting of search terms and the Boolean operators and, or, and not. The terms characterizing the stored text may be assigned manually by trained personnel, or, alternatively, automatic indexing methods can be used. In some systems one can avoid the content analysis, or indexing operation, by using words contained in the text of the documents for content identification. When all text words are used for document identification (except for common words) we consider such a system to be a full text retrieval system [1].

    All existing approaches to text retrieval are based on relevant terms found in the text. A typical approach would be to identify the individual words occurring in the documents. A stop list of common function words (an, of, the, and, but, etc.) is used to delete the high frequency words that are insufficiently specific to represent the document content. A suffix stripping routine would be applied to reduce the remaining relevant words to word stem form. At that point the vector system would assign a weighting factor to each term in the document to indicate term importance, and it would represent each document by a set, or vector, of weighted word stems.
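    As a concrete illustration of this pipeline, here is a minimal sketch in Python (the stop list, the crude suffix stripper and the plain term-frequency weighting are simplifications invented for the example, not the components of any particular system):

        # Sketch of the classic vector-space indexing pipeline:
        # stop-list removal -> suffix stripping -> weighted word-stem vector.
        from collections import Counter

        STOP_WORDS = {"an", "of", "the", "and", "but", "a", "in", "to", "is"}  # tiny illustrative stop list
        SUFFIXES = ("ing", "ed", "es", "s")                                    # crude suffix stripping, not a real stemmer

        def stem(word):
            for suffix in SUFFIXES:
                if word.endswith(suffix) and len(word) > len(suffix) + 2:
                    return word[:-len(suffix)]
            return word

        def document_vector(text):
            words = [w.lower().strip(".,;:!?") for w in text.split()]
            stems = [stem(w) for w in words if w and w not in STOP_WORDS]
            counts = Counter(stems)
            total = sum(counts.values())
            # simple tf weighting: term frequency normalized by document length
            return {term: count / total for term, count in counts.items()}

        print(document_vector("The cats were sitting on the mat and the dog was barking"))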

    Some other systems, like the signature files, would generate the signatures and store them in some sort of access structure such as a sequential file or an S-tree.

    The most straightforward way of locating the documents that contain a certain search string (term) is to search all documents for the specified string (substring test). ''String'' here means a sequence of characters without ''don't care'' characters. If the query is a complicated Boolean expression that involves many search strings, then we need an additional step, namely to determine whether the term matches found by the substring tests satisfy the Boolean expression (query resolution). More complex patterns (regular expressions) can be searched with a finite automaton: the search time for this automaton is linear in the document size, but the number of states of the automaton may be exponential in the size of the regular expression. The obvious algorithm for the substring test is as follows:

    Compare the characters of the search string against the corresponding characters of the document.

    If a mismatch occurs, shift the search string by one position to the right and continue until either the string is found or the end of the document is reached.

    Although simple to implement, this algorithm is too slow. If m is the length of the search string and n is the length of the document (in characters), then it needs up to O(m * n) comparisons [2].
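    The naive algorithm can be written down directly; the following sketch (Python, for illustration only) makes the O(m * n) behaviour visible as two nested loops:

        def naive_substring_search(document, term):
            """Return the positions where term occurs in document (worst case O(m * n) comparisons)."""
            n, m = len(document), len(term)
            positions = []
            for i in range(n - m + 1):           # shift the search string one position at a time
                for j in range(m):               # compare character by character
                    if document[i + j] != term[j]:
                        break                    # mismatch: shift right and start over
                else:
                    positions.append(i)          # all m characters matched
            return positions

        print(naive_substring_search("text retrieval from text collections", "text"))  # -> [0, 20]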

    Another approach to text retrieval has to do with signature files. In this method, each document yields a bit string ('signature'), using hashing on its words and superimposed coding. The resulting document signatures are stored sequentially in a separate file (the signature file), which is much smaller than the original file and can be searched much faster.

    The main advantage of signature files is their low storage overhead. However, the signature file size is proportional to the text database size, and that becomes a problem for massive text databases like digital libraries.


    The signature files contain hashed relevant terms from documents. Such hashed terms are called signatures and the files containing them are signature files. There are several ways to extract the signatures from the documents, four basic methods being WS (Word Signatures), SC (Superimposed Coding), BC (Bit-Block Compression) and RL (Run-Length Compression). For example, in Superimposed Coding (SC) the text database is divided into a number of blocks. A block Bi is associated with a signature Si, which is a fixed-length bit vector. Si is obtained by hashing each nontrivial word in the text block into a word signature and ORing them into the block signature. The query is hashed, using the same signature extraction method used for the documents, into the query signature Sq. The document search is then done by searching the signature file and retrieving the set of qualified signatures [Si such that Si AND Sq = Sq]. There are designs for signature file storage structures or organizations such as sequential organization, transposed file organization or bit-slice organization [ROBE79], and single and multilevel organizations [OTOO86, SACK87] like S-trees [DEPP86]. The most recent signature file organization is called the partitioning approach, whereby the signatures are divided into partitions and then the search is limited to the relevant partitions. The motivation for the partitioning approach is a reduction in the search space as measured either by the signature reduction ratio (ratio of the number of signatures searched to the maximum number of signatures) or by the partition reduction ratio (ratio of the number of partitions searched to the maximum number of partitions) [LEE89]. Two approaches to partitioned signature files have been published [ZEZU91, LEE89]. One uses linear hashing to hash a sequential signature file into partitions or data buckets containing similar signatures [ZEZU91]. The second approach [LEE89] uses the notion of a key, which is a substring selected from the signature by specifying two parameters: the key starting position and the key length. The signature file is then partitioned so that the signatures containing one key are in one partition. The published performance results on partitioned signature files are based mostly on simulations of small text databases and are not conclusive. There has been no attempt to address the scalability of partitioned signature files to massive text databases. Partitioned signature files grow linearly with the text database size and thus they exhibit the same scalability problem as other text access structures.
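    To make the superimposed coding idea concrete, here is a minimal sketch in Python (the signature width, the number of bits set per word and the use of MD5 as the hash are arbitrary choices for the example, not the parameters of any published design): word signatures are ORed into a block signature Si, and a block qualifies for a query when Si AND Sq = Sq.

        import hashlib

        SIGNATURE_BITS = 64      # fixed signature length (arbitrary for this sketch)
        BITS_PER_WORD = 3        # how many bits each word sets in its signature

        def word_signature(word):
            sig = 0
            for k in range(BITS_PER_WORD):
                digest = hashlib.md5(f"{word}:{k}".encode()).hexdigest()
                sig |= 1 << (int(digest, 16) % SIGNATURE_BITS)
            return sig

        def block_signature(words):
            sig = 0
            for w in words:
                sig |= word_signature(w)        # superimpose (OR) the word signatures
            return sig

        def may_contain(block_sig, query_words):
            query_sig = block_signature(query_words)
            return block_sig & query_sig == query_sig   # qualified signature test: Si AND Sq = Sq

        blocks = [["neural", "network", "retrieval"], ["signature", "file", "organization"]]
        signatures = [block_signature(b) for b in blocks]
        print([may_contain(s, ["neural"]) for s in signatures])  # the first block qualifies; false drops remain possible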

    On the other hand, some scientists believe that we can use the inversion method in order to achieve very good retrieval results. Each document can be represented by a list of (key)words, which describe the contents of the document for retrieval purposes. Fast retrieval can be achieved if we invert on those keywords. The keywords are stored, e.g., alphabetically, in the 'index file'; for each keyword we maintain a list of pointers to the qualifying documents in the 'postings file'. This method is followed by almost all the commercial systems [61].
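    A minimal sketch of that inversion step (Python, purely illustrative): the sorted list of keywords plays the role of the 'index file' and the per-keyword document lists play the role of the 'postings file'.

        from collections import defaultdict

        def build_inverted_index(documents):
            """documents: dict mapping a document id to its list of keywords."""
            postings = defaultdict(set)
            for doc_id, keywords in documents.items():
                for kw in keywords:
                    postings[kw.lower()].add(doc_id)    # postings file: keyword -> qualifying documents
            index = sorted(postings)                    # index file: keywords kept in alphabetical order
            return index, postings

        documents = {
            "doc1": ["neural", "network", "retrieval"],
            "doc2": ["signature", "file", "retrieval"],
        }
        index, postings = build_inverted_index(documents)
        print(postings["retrieval"])   # -> {'doc1', 'doc2'}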

    Starting from the Internet, a method called metadata-based indexing has gained a prominent position in library science. Metadata is not fully data, but it is a kind of fellow traveler with data, supporting it from the sidelines. One definition is that 'an element of metadata describes an information resource or helps provide access to an information resource' [Castro 1997].

    In the context of Web pages on the Internet, the term 'metadata' usually refers to an invisible tag attached to a Web page which facilitates the collection of information by automatic indexers; the tag is invisible in the sense that it has no effect on the visual appearance of the page when viewed using a standard Web browser such as Netscape TM or Microsoft's Internet Explorer TM.


    2.1 Natural Language Processing

    Natural language processing techniques seek to enhance performance by matching the semantic content of queries with the semantic content of documents [33, 49, 76]. Although it has often been claimed that deeper semantic interpretation of texts and/or queries will be required before information retrieval can reach its full potential, a significant performance improvement from automated semantic analysis techniques has yet to be demonstrated.

    The boundary between natural language processing and shallower information retrieval techniques is not as sharp as it might first appear, however. The commonly used stoplists, for example, are intended to remove words with low semantic content. The use of phrases as indexing terms is another example of the integration of a simple natural language processing technique with more traditional information retrieval methods.

    2.2 Neural Networks as Infrastructure in Retrieval

    The main idea in this class of methods is to use spreading activation methods. The usual technique is to construct a thesaurus, either manually or automatically, and then create one node in a hidden layer to correspond to each concept in the thesaurus. Jennings and Higuchi have reported results for a system designed to filter USENET news articles in [34]. Their implementation achieves reasonable performance in a large scale information filtering task.
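    A toy sketch of the spreading activation idea (Python; the concept network, the weights and the decay factor are invented for illustration and are not those of Jennings and Higuchi): the query activates its concept nodes, activation then flows along weighted links, and the concepts (or the documents attached to them) with the highest activation are returned first.

        # Toy spreading activation over a hand-made concept network (illustrative only).
        links = {
            "neural": {"network": 0.9, "learning": 0.6},
            "network": {"retrieval": 0.4},
            "learning": {},
            "retrieval": {},
        }

        def spread(query_terms, steps=2, decay=0.5):
            activation = {node: 0.0 for node in links}
            for term in query_terms:
                if term in activation:
                    activation[term] = 1.0                           # source activation from the query
            for _ in range(steps):
                updated = dict(activation)
                for node, act in activation.items():
                    for neighbour, weight in links[node].items():
                        updated[neighbour] += decay * weight * act   # pass damped activation along links
                activation = updated
            return sorted(activation.items(), key=lambda kv: -kv[1])

        print(spread(["neural"]))   # concepts ranked by how much activation reached them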

    2.2.1 Latent Semantic Indexing

    Latent Semantic Indexing (LSI) is a vector space information retrieval method which has demonstrated improved performance over the traditional vector space. We begin with a basic implementation which captures the essence of the technique. From the complete collection of documents a term-document matrix is formed in which each entry consists of an integer representing the number of occurrences of a specific term in a specific document. The Singular Value Decomposition (SVD) of this matrix is then computed and small singular values are eliminated. The effectiveness of LSI depends on the ability of the SVD to extract key features from the term frequencies across a set of documents. In order to understand this behaviour it is first necessary to develop an operational interpretation of the three matrices which make up the SVD.
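    The basic computation can be expressed in a few lines of numpy (a minimal sketch; the tiny term-document matrix and the choice of rank k = 2 are made up for the example): compute the SVD of the term-document matrix, drop the small singular values, and compare documents in the reduced space.

        import numpy as np

        # Rows = terms, columns = documents; entries are raw occurrence counts (toy data).
        A = np.array([
            [2, 0, 1, 0],
            [1, 1, 0, 0],
            [0, 2, 0, 1],
            [0, 0, 1, 2],
        ], dtype=float)

        U, s, Vt = np.linalg.svd(A, full_matrices=False)

        k = 2                                    # keep only the k largest singular values
        U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

        doc_vectors = (np.diag(s_k) @ Vt_k).T    # documents represented in the k-dimensional latent space

        def cosine(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

        print(cosine(doc_vectors[0], doc_vectors[1]))   # latent-space similarity of documents 0 and 1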

    2.2.2 Latent Semantic Analysis

    Latent semantic analysis (LSA) is used to define the theme of a text and to generate summaries automatically. The theme information - the already known information in a text - can be represented as a vector in semantic space; the text provides new information about this theme, potentially modifying and expanding the semantic space itself. Vectors can similarly represent subsections of a text. LSA can be used to select from each subsection the most typical and most important sentence, thus generating a kind of summary automatically.
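    That selection step can be sketched as follows (Python/numpy; the sentence vectors are made up here, whereas a real system would obtain them from the LSA space built as in the previous section): represent each sentence of a subsection as a vector, compute the subsection's centroid, and pick the sentence closest to it as the most typical one.

        import numpy as np

        def most_typical_sentence(sentence_vectors):
            """sentence_vectors: array of shape (n_sentences, dim) in the semantic space."""
            centroid = sentence_vectors.mean(axis=0)
            sims = sentence_vectors @ centroid / (
                np.linalg.norm(sentence_vectors, axis=1) * np.linalg.norm(centroid)
            )
            return int(np.argmax(sims))          # index of the sentence closest to the section's theme

        vectors = np.array([[0.9, 0.1, 0.0],     # toy sentence vectors
                            [0.8, 0.2, 0.1],
                            [0.1, 0.0, 0.9]])
        print(most_typical_sentence(vectors))    # -> 1, the sentence nearest the centroid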

    2.2.3 Advantages of Neural Network Models over Traditional IR Models

    In neural network models, information is represented as a network of weighted, interconnected nodes. In contrast to traditional information processing methods, neural network models are "self-processing" in that no external program operates on the network: the network literally processes itself, with "intelligent behavior" emerging from the local interactions that occur concurrently between the numerous network components (Reggia & Sutton, 1988). According to Doszkocs,


    Reggia and Lin (1990), neural network models in general are fundamentally different from traditional information processing models in at least two ways.

    First, they are self-processing. Traditional information processing models typically make use of a passive data structure, which is always manipulated by an active external process/procedure. In contrast, the nodes and links in a neural network are active processing agents. There is typically no external active agent that operates on them. "Intelligent behavior" is a global property of neural network models.

    Second, neural network models exhibit global system behaviors derived from concurrent local interactions among their numerous components. The external process that manipulates the underlying data structures in traditional IR models typically has global access to the entire network/rule set, and processing is strongly and explicitly sequentialized (Doszkocs, Reggia & Lin, 1990). Pandya and Macy (1996) have summarized that neural networks are natural classifiers with significant and desirable characteristics, which include, but are not limited to, the following:

    Resistance to noise
    Tolerance to distorted images/patterns (ability to generalize)
    Superior ability to recognize partially occluded or degraded images
    Potential for parallel processing

    Furthermore, the LSA algorithm has a unique ability (Dumais) to work in a cross-language environment with fully automatic corpus analysis.

    3. Information Retrieval in the Greek Language

    Different human languages exhibit significantly different linguistic and grammatical characteristics which strongly affect how information is structured and represented in modern databases. This is particularly true of the contrasts between western languages (e.g., English, French, German, etc.) and other languages (e.g., Greek, Chinese, etc.). Despite these differences, there are common problems associated with online information management and retrieval across databases created in different languages. In our project we attempted to address the information management and retrieval issues related to a multilingual database containing mainly Greek texts.

    Greek, like English, is a phonographic language in which almost every word has one or more independent meanings [C3]. However, there are some significant differences that make the Greek language more complicated. The Greek vocabulary is larger than the English one, and furthermore Greek syntax is quite different. Grammatical phenomena are more complicated; all this makes semantic processing more difficult and also causes awkwardness in IR on Greek texts.

    With classical methods of IR it is quite hard to create an effective information retrieval system, and we have to face great difficulties when we would like to migrate to a multilingual IR system.


    4. Scope of the Project

    Our aim is to create a prototype that will be a helpful and accurate tool for IR over texts in the Greek language; a second point will be high flexibility concerning cross-language platforms. But most important, we will use some neural network algorithms. The source (inputs) of information will be the Greek WWW with a focus on news sites; there is also a request to the National Documentation Center to give us access to its Ph.D. thesis abstracts. This second option is more interesting, because there is a greater possibility of finding documents with gobbledygook structure, so it will be a question what outputs our neural network will give, and of course it is more challenging.

    At a second stage we will benchmark the results of our prototype against traditional information retrieval methods. An interface will also be constructed in order to present the results of a user query in a visual way over the World Wide Web.

    4.1 Plan of work

    First of all we have to create a tool that will collect the texts for us; in case we use the WWW, a web spider will be created (a minimal sketch is given below).
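    The spider can be sketched roughly as follows (Python and its standard library are used here for brevity, although the project proposes Perl for the actual implementation; the start URL is a placeholder): fetch a page, keep its text for the later stages, extract its links and follow them up to a fixed limit.

        from html.parser import HTMLParser
        from urllib.parse import urljoin
        from urllib.request import urlopen

        class PageParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links, self.text = [], []
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)
            def handle_data(self, data):
                self.text.append(data)

        def crawl(start_url, max_pages=10):
            seen, queue, pages = set(), [start_url], {}
            while queue and len(pages) < max_pages:
                url = queue.pop(0)
                if url in seen:
                    continue
                seen.add(url)
                try:
                    html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
                except OSError:
                    continue                                     # skip unreachable pages
                parser = PageParser()
                parser.feed(html)
                pages[url] = " ".join(parser.text)               # collected text, ready for the parser stage
                queue.extend(urljoin(url, link) for link in parser.links)
            return pages

        # pages = crawl("http://example.gr/")   # placeholder start URL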

    At a second level a parser will work over the texts in order to omit common words [like (and), (in), (for), (with), etc.]; of course this is an optional action, in that some neural networks like LSA are capable, through SVD (singular value decomposition), of doing that job on their own, but it is possible to get better performance this way.

    After that stage some text data will be sent as input to our neural network in order to train it; during this phase we have to perform an evaluation of our parser in order to prevent unnecessary data from reaching our network.

    After some training we will start to run real queries; at this stage we need a morphological analyzer.

    Morphological analyzers are software tools used to assist the process of identifying and extracting only those words in the query that can be used to retrieve information. During the process of tagging, each word of the query will be identified according to its function (e.g. noun) and then tagged:

    Word in the query: "find"    Word after tagging: (Verb "find")
    Word in the query: "cat"     Word after tagging: (Noun "cat")

    Queries in a natural language form will certainly contain words in many forms, i.e. nouns in the plural, verbs in their past tense, etc. Stemming is the process of identifying the root of each word of the query and returning that word:

    Word in the query: "was"    Word after stemming: ("be+ed")


    Word in the query: "sitting"    Word after stemming: ("sit+ing")
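    In the spirit of the examples above, the two steps can be sketched as follows (Python; the tiny lexicon and suffix rules are invented placeholders - a real morphological analyzer for Greek would need a full lexicon and far richer rules):

        # Toy tagger and stemmer (illustrative lexicon and rules only).
        LEXICON = {"find": "Verb", "cat": "Noun", "was": "Verb", "sitting": "Verb"}
        IRREGULAR = {"was": "be+ed"}                            # irregular forms handled by lookup
        SUFFIX_RULES = [("tting", "t+ing"), ("ing", "+ing"), ("s", "+s")]

        def tag(word):
            return (LEXICON.get(word, "Unknown"), word)         # e.g. ("Verb", "find")

        def stem(word):
            if word in IRREGULAR:
                return IRREGULAR[word]
            for suffix, replacement in SUFFIX_RULES:
                if word.endswith(suffix):
                    return word[:-len(suffix)] + replacement    # e.g. "sitting" -> "sit+ing"
            return word

        for w in ["find", "cat", "was", "sitting"]:
            print(w, tag(w), stem(w))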

    Figure 1: Prototype Flowchart

    The system is required to accept a query about a subject in a natural language format, then extract those words that could be used as keywords to retrieve information from the Verity database, and finally display the findings on the screen.

    That is, the process identifies words that can be used as keywords - usually nouns with some verbs and adjectives. When all these words have been identified, they need to be translated to an appropriate default language (in our case, English).


    Dictionaries contain only the root of each word, or its basic form when it has to do with verbs. However, natural language queries will probably contain words in various forms and tenses. In order to extract the root word, each word has to go through the process of stemming.

    At the end we have to present the outputs of our neural network visually; for this reason we are going to create an interface, supplementary to the query interface, that will present all relevant texts in a plane. It is important to give the end user the ability to understand how relevant each document is, so the layout must show the relevance.

    Figure 2: Proposed Output

    4.1.1 Software and Hardware requirements

    Perl has been extensively applied to text manipulation applications with very good results; furthermore, we can spot some other advantages that make Perl our first choice as far as the construction of the parser and stemmer is concerned:

    Perl is available with a free license for both commercial and non-commercial purposes. The Perl community has advanced this far because it takes care of itself - equal to a savings of as much as $1800 per programmer year.

    Script writing efficiency. Conservative estimates rank Perl as a third faster than most other languages for script writing. Let's assume that a programmer who makes $40,000 per year and spends 1/3 of his time writing scripts were to use Perl. The result would be an added $4,400 annually to the corporate bottom line. This benefit expands when you consider that it's often said (less conservatively) that two hours of script writing in Perl can produce the equivalent of a week of scripting in many other languages. The savings skyrocket.

    Perl utilities. There are over 200 modules and other utilities available on CPAN, and the number is increasing by an average of two per week. By using (and reusing) these modules, programmers can save weeks.

  • 7/31/2019 AI - a short summary

    9/10

    Reliability and security. Perl is easier to debug than other languages. Andwhen properly written, it's more secure too. It's difficult to put a dollarvalue on security, and for most of us it's not likely to be an issue. Butthere are times and places where the increased reliability and securityfactor may prove invaluable.

    Perl is fun. Perl adds humor to programming - a great buffer againstburnout.

    Perl offers creative options. Perl allows the same end result to be achievedby a variety of routes, rather than one way only. Programmers arecreative human beings. The value of incorporating creativity within work isimmeasurable.

    Perl is flexible. Changes are incorporated into the language quickly,usually within 30 days. Making changes in most other languagesfrequently requires filtering through standards committees. That takesyears.

    Documentation. There are more than 15 Perl books currently in print,several scheduled to come off the presses during the next few months,and a dozen more being written.

    Number of major applications. Most of the interactive sites on the web use Perl to drive search engines and interactive forms. The number of applications is growing every day because of the amount of interest in the language and because Perl does some things that other languages don't do very well, such as string handling, advanced pattern checking, and object oriented programming.

    Availability. Thanks to a unique group of Perl hackers, Perl has been ported to most of the major operating systems currently in use. There are more than 20 publicly available mirror archive sites around the world for everything you would want to know about Perl, including modules, text files, ports of Perl to the various major operating systems (and a number of minor ones), source code, and example programs.

    There are also advantages in the programming area, in that Perl includes features that are required for large projects (modularization, object orientation, arbitrary data structures, and a large number of built-in checks by the compiler and at run time).


    [1] Badal, D.Z. and Davis, M.L., ''Investigation of Unstructured Text Indexing'', Proceedings of the DEXA'95 International Conference and Workshop on Database and Expert Systems, London, September 4-8, 1995, pp. 387-396.

    [2] Christos Faloutsos and Douglas W. Oard, ''A Survey of Information Retrieval and Filtering Methods''. http://citeseer.nj.nec.com/rd/40804594%2C12992%2C1%2C0.25%2CDownload/http%3AqSqqSqciteseer.nj.nec.comqSqcacheqSqpapersqSqcsqSq45qSqhttp%3AzSzzSzwwwai.cs.unidortmund.dezSzLEHREzSzAGENTEN97zSzSeminar_PaperszSzFaloutsos__Survey.pdf/faloutsos96survey.pdf