Open Source Web Crawler


A Distributed Network-Bound Data-Intensive Application

Van A. Norris

Department of Computer Science and Software Engineering

Auburn University

Technical Report CSSE04-04

April 5, 2004


ABSTRACT

Open Source Web Crawler is an automated, user-interactive, two-pass search tool that "crawls" the world-wide web for websites containing information of interest to a researcher. By using a two-pass search, the application produces results more rapidly and accurately than traditional human-centered approaches.

The application accepts user input in the form of a subject, which is passed to the Google1 organization database; the database returns a list of web sites that contain the subject within the body of the web site text. The files returned from the Google organization are then parsed and searched a second time for user-defined keywords. When a file contains both the subject and a keyword, the URL address and the keyword are returned to the user through a graphical user interface and are saved on the local disk for later use. The first pass consists of a search of the Google database for instances of the subject within the body of a website's text. The second pass is a search through the results of the first pass to locate the presence of user-defined keywords. This approach automates the search and accurately locates web sites with content specific to the research effort.

Formally, this application performs the following Boolean search:

Result = User_Defined_Subject AND (keyword_1 OR keyword_2 OR ... OR keyword_n)

This application accepts user input, which is translated into SOAP/XML distributed message traffic by the local Google APIs and sent across the Internet to the Google organization application server. The Google application server returns those records from the Google database servers which match the user query.

1 Google is a trademark of Google, Inc., Mountain View, California.


The Google organization APIs were chosen for this application over a custom design and implementation because the Google index spans 4.28 billion web pages [1] and stores the contents of the sites in an easily parsed HTML format. The Google algorithm, while proprietary, crawls millions of web sites per week, evaluating the sites for content and for potential inclusion in the central data store. Those with a high link-ranking score are converted into HTML format, linked, and stored in the central server farm [1].

The application development uses an incremental, component-based software engineering approach. The system architecture follows the C2 design style, which requires distributed message passing. The model-view-controller pattern, which includes the "event" style, is used to incorporate the extensive use of graphical user interfaces. In addition, a pipe-filter style of program flow is used to facilitate speed in select components. The request-response pattern is present in the SOAP/XML message interactions, and factory patterns are present in the graphical user interfaces.

The project is extensible and has been used by researchers in literature reviews, as well as serving as the base application for research into network security for web-enabled distributed databases.


TABLE OF CONTENTS

LIST OF FIGURES

1.0 INTRODUCTION

2.0 PROJECT REQUIREMENTS AND CHOICES
  2.1 Project Requirements
  2.2 Component Action Requirements
  2.3 Components Chosen

3.0 OPEN SOURCE SOFTWARE
  3.1 Open Source Internet Search Components
    3.1.1 Eclipse Integrated Development Environment
    3.1.2 Open Source Parsers
    3.1.3 Re-Usability and Extensibility

4.0 APPLICATION SOFTWARE DESIGN

5.0 COMPONENT BASED SOFTWARE ENGINEERING
  5.1 Component Based Software Engineering Overview
    5.1.1 Frameworks, Middleware, and Design Patterns
    5.1.2 Engineering Problems with Components and Distributed Applications
      5.1.2.1 Connection Management
      5.1.2.2 Service Initialization
      5.1.2.3 Error Handling
      5.1.2.4 Flow and Congestion Control
      5.1.2.5 Event Demultiplexing
      5.1.2.6 Distribution
      5.1.2.7 Concurrency and Synchronization
      5.1.2.8 Fault Tolerance
      5.1.2.9 Scheduling and Persistence

6.0 QUALITY ASSURANCE

7.0 REQUIREMENTS AUDIT
  7.1 Accept User Input of Subjects
  7.2 Accept User Input of Parameters
  7.3 Provide a Listing of URL's Related by Subject and Parameter as Output
  7.4 Application Progress Screen
  7.5 Use of Freely Available Software Tools

8.0 SOAP AND XML

9.0 PROGRAM STRUCTURE AND DESIGN PATTERNS
  9.1 C2 Architectural Style
  9.2 Model View Controller
  9.3 Event Style
  9.4 Factory Patterns
  9.5 CASE Structure of Open Source Web Crawler
    9.5.1 Use of Google API
      9.5.1.1 GUI.java
      9.5.1.2 SearchGoogle.java
      9.5.1.3 OpenGoogleCache.java
    9.5.2 Second Pass Search
  9.6 Graphical View of Application
  9.7 Hierarchy View of Application

10.0 CORRECTNESS

11.0 CONCLUSIONS

12.0 SOURCES

APPENDIX 1: PROJECT SOURCE CODE
  1.0 Java Package - GUI
    1.1 TopGUI.java
    1.2 UserEntrySubject.java
    1.3 UserEntryParameter.java
    1.4 SubjectGUI.java
    1.5 ParameterGUI.java
  2.0 Java Package - FileParser
    2.1 Parser.java
    2.2 UserParseCache.java
    2.3 UserParamParseCache.java
  3.0 Java Package - GoogleAP
    3.1 ParseCache.java
    3.2 PrintResults.java
    3.3 OpenGoogleCache.java
    3.4 SearchGoogle.java
    3.5 UserOpenGoogleCache.java
    3.6 UserParseCache.java
    3.7 UserSearchGoogle.java
  4.0 Final Output - TableBuild.java


List of Figures

Figure 1: Data Flow Diagram
Figure 2: Project UML Diagram
Figure 3: Software Architecture and Middleware
Figure 4: Components, Frameworks and Patterns
Figure 5: Project Software Quality Metrics
Figure 6: Introductory GUI
Figure 7: User Provides Subject Input
Figure 8: User Default Choice Table GUI
Figure 9: User Parameter Input GUI
Figure 10: Application Output
Figure 11: Console Output and Persistent Storage
Figure 12: Example of URL Stored Pair
Figure 13: C2 Architectural Style
Figure 14: Model View Controller
Figure 15: Action Listener Launcher
Figure 16: Factory Pattern
Figure 17: CASE View of Project
Figure 18: Graphical View of Application
Figure 19: Tree View of Project
Figure 20: Singleton Instance Verification
Figure 21: Comparison of Project with Regular Google Search


1.0 INTRODUCTION

One sponsor of the Information Assurance Laboratory of Auburn University contracted with Auburn University to provide a literature search for potentially sensitive parameters relating to sensitive software that was being considered for release to allied countries and coalition partners. This research is being conducted by the Information Assurance Laboratory, Department of Computer Science and Software Engineering, within the College of Engineering. Specifically, the sponsor, the Missile Defense Agency of the U.S. Department of Defense, directed that an effort be undertaken to locate open source information relating to the missile defense simulation applications being investigated. Open source information is defined as information readily available to the general public through electronic media or traditional on-paper resources at no cost [2]. Prior to this project, the research was being undertaken "by hand," which was a laborious and time consuming task. It was recognized that this task could be mechanized with a proper set of software tools and access to the Internet.

The widening acceptance of the world-wide web as a method of information dissemination allows a researcher to search through vast quantities of data in a compressed time scale. The breadth and depth of available material has improved to the extent that most information can be obtained by a methodical search of the world-wide web using wide area networks, commonly referred to as the Internet [3].

While the quantity and ease of access to information has improved dramatically, the researcher must still review the available source material for applicable content. The researcher's manual inspection of electronic source material is both inefficient and time consuming. The Open Source Web Crawler provides a component-based computing solution for locating specific web pages containing content of interest to the Missile Defense Agency.

Component-based software engineering development using open source applications and their associated application program interfaces (APIs) has been a goal of software development for two decades [4]. The use of previously developed and tested software components will never entirely replace traditional software development, since each problem solution has unique characteristics. However, common tasks associated with many development activities can be packaged into components and shared with other developers through source code libraries. The source libraries allow developers to save development time and reduce costs. Linking library components with locally developed software code to satisfy problem requirements is rapidly becoming the normal practice within the software engineering profession [5].

The proliferation of distributed software applications which use computer and telecommunication networks forced software engineers to develop methods and procedures for communication between physically separated software components. Network software development has progressed from hardware-specific programmable interfaces to higher-level language protocols which are easier for the average person to understand. The TCP/IP primitives remain the underlying backbone of the modern Internet for physically separated applications [6], but the rapid rise and popularity of user-friendly, higher-level message passing protocols such as XML and SOAP has allowed the average developer to incorporate distributed applications into a local solution.


The standardization of message passing interfaces between networked applications remains elusive. There are a handful of vendor-specific standardized protocols, such as Microsoft's COM and DCOM, the Object Management Group's CORBA, and Java2 RMI. The specific standard chosen for a particular project depends upon the software language being utilized in the source code. Multiple standards and interfaces create confusion [7]. The most widely used multi-tiered Internet message passing protocol in use today is SOAP, an interface to the W3C XML specification.

While the debate over message passing standards continues, developers are providing components stored in digital libraries readily available for download using the Internet. Sites such as sourceforge.com, freshmeat.com, and sun.java.com/swing provide components which satisfy portions of developers' unique problems. Among the more common components are graphical user interface (GUI) components, search engines, parsers, and software development tools. The components can be linked with locally developed software to solve parts of the overall problem.

There were four primary activities involved in the completion of this project.

1. Research the available methods and tools to "crawl" the Internet to identify the subjects and parameters needed to meet the requirements. The tools selected included the Google application program interface, the Eclipse integrated development environment, and the Java language components and development kit.

2. Prepare a design for the project software system using readily available open source software design applications and tools.

3. Software coding for the project.

4. Testing. Although testing was the fourth step in the process, testing activities were present in every phase of the process.

2 Java is a trademark of Sun Microsystems, 901 San Antonio Road, Palo Alto, CA 94303.

The project was developed using an incremental engineering approach in which components were analyzed, planned, developed, coded, tested, and audited for requirements coverage separately. The components were then assembled and linked into an overall structure for evaluation and testing. Only after an incremental step was complete was any software code developed for the next component in the development plan. Once all the incremental components were developed and consolidated, the entire system was subjected to an extensive review and testing process. The final testing and auditing activities were in addition to those undertaken during the individual and incremental steps.

2.0 Project Requirements and Choices

2.1 Project Requirements

Provide a general world-wide web search agent to identify specific web pages which contain information specific to research efforts, while remaining extensible for use by other researchers in other fields of study.

2.1.1 Using freely available software tools, search the world-wide web for URI's3 containing user defined subjects and keywords.

3 URI: Uniform Resource Identifier - short strings that identify resources on the web: documents, images, downloadable files, services, electronic mailboxes, and other resources.


2.1.2 The system should allow breadth by allowing user input for any specific subject, while providing specificity by evaluating the file contents against user defined keywords.

2.1.3 The system should return subject URL4 locations, if available, for extraction and parsing.

2.1.4 The system shall maintain a record of URL's returned during each search for future reference.

2.1.5 The solution should return the keyword-URL pair to the user for further human-centered review.

2.2 Component Action Requirements

Once potential software sources are identified using open-source search engines, those files exhibiting the greatest promise will have their Uniform Resource Locators (URLs) stored locally. A URL is the network address used to locate the server and file which contain the requested information. The files associated with the URLs will be opened and examined by an HTML parser. Words within each file will be evaluated to locate specific parameters relevant to the sponsor requirements. It should be noted that changing the user-defined evaluation keyword parameters allows this project to be utilized by any researcher within any research area.

The application enables a researcher to rapidly identify web pages that contain content relating to the research sponsor's requirements.

4 URL: Uniform Resource Locator - the address of a resource that is available via the Internet. RFC 2396 defines the general syntax and semantics of URIs, of which URLs are a subset.


2.3 Components Chosen

The primary components used for this project included the Google organization's open-source search software, SOAP/XML open source connection middleware, Java Swing open-source graphical user interface modules, and the open-source Java developers' kit (JDK). The component modules were linked together with custom designed and implemented software using the open source integrated development environment.

This application makes extensive use of the Google Search engine API, plus the Google organization's storage cache of HTML files (4.28 billion); therefore coverage of the "world wide web" is not totally inclusive.

The Google APIs were chosen for their ability to return the contents of HTML, PDF, DOC, TXT, and most other file formats. The Google API utilizes the SOAP message passing protocol, which was designed as an interface to the XML document content standard for the world-wide web and for multi-tiered, network-intensive data repositories.

The Google APIs were written to use the Java language interface to the SOAP protocol. As the basic crawler was written in the Java language, this project was developed using the Java J2EE graphical user interface classes, the Eclipse integrated development environment for Java, and the Java development toolkit (JDK).

3.0 Open Source Software

3.1 Open-Source Internet Search Components

Integrating a search of the world-wide web using the Internet requires specialized software to extract files from various Internet database servers throughout the world. This software must be able to rapidly evaluate a set of files from an associated user input. To accomplish this task, many search engines utilize large, locally stored data caches of address links and HTML records to allow rapid responses to most search requests.

The Google organization (http://www.google.com) is well regarded within the Internet community as a leader in search engines for information retrieval. Google™ provides geographically separated servers containing links on select subjects. The information is obtained by periodically crawling and examining billions of web sites for content relating to a ranked scale of user requests during a set time period5.

As part of its ongoing efforts to support the research community, the Google™ organization has made available an open-source Java™ Application Program Interface (API) to facilitate specialized searches. One open-source download of this product can be packaged as a plug-in API for use in the Eclipse Integrated Development Environment (IDE).

The Eclipse IDE project is an open-source development environment allowing various

tool suites to be loaded together to provide a unique problem solution. The Eclipse

environment allows for interoperability of components written with the common Eclipse

interface [8].

Google™ provides a set of methods to extract up to ten URL's per request, with the associated HTML pages per query. The daily no-fee maximum is set at 1000 requests, or 10,000 pages, per day. In addition, the Google™ organization allows specificity on the types of content returned from each request. This allows the exclusion of certain types of information, allowing the researcher to avoid the evaluation of vacuous sources [1].

5 The Google organization searches world-wide web Internet servers for files of interest. The files are converted to HTML format and stored.


3.1.1 Eclipse Integrated Development Environment (IDE) and Tool Integrator

Eclipse is more than an open source IDE; it is also a foundation or technology platform for tool integration. Eclipse can be considered a tool chest: it provides a way for all of your tools to work together in an integrated environment for increased productivity. Eclipse is platform- and methodology-independent [9].

Software developers using the Eclipse platform are able to develop additional tools which interoperate with other tools using the basic Eclipse platform. Some of the open source Eclipse toolkits available today include the MySQL™ Java API plug-in, the JUnit Java test suite plug-in, the Google Java API plug-in, HTML parsers with various features, a Java application metrics plug-in for software quality assurance, and hundreds more.

The Eclipse environment allows the developer to import a set of tool suites into a

common working area, and then write code to integrate the needed pieces of each

package to provide a solution.

3.1.2 Open-Source Parsers

Software to parse files has been available for many years. The ability to parse multiple files almost simultaneously can be accomplished using threaded software technologies. Since this project uses Eclipse Java™ language plug-ins as the component code base, the project was implemented using the Java™ language and associated APIs. The parser for this project is a custom design, as a parser with the needed general functionality was not found within the source code libraries.


This custom-coded solution was necessary because the functionality was specific to the project requirements, and because the project needed to accept responses from the Google™ APIs in base64-encoded HTML format. An additional factor in providing a custom parser was the overall goal of extensibility, which mandated that the custom component be able to search for different subjects with only minor changes to the source code.

3.1.3 Re-Usability and Extensibility

One of the primary goals of this project was to provide a re-usable, easily modifiable research tool that can be used in any field of research. The specific search of the world-wide web should be a function of the researcher's needs; therefore the subject of the search and the specific parameters should be provided by the researcher. This required that the application be easily modifiable by researchers with only moderate programming skills. Within this project, other researchers need to modify at most two files to change the research direction. The first file, should the researcher desire a pre-defined list, is GUI.GUI.java, which contains the set subject list from which to choose; researchers who prefer to enter each subject manually do not need to change it. The second file, GoogleAP.ParseCache.java, contains the keywords that provide specificity in the search. Either modification can be made within five minutes using the Eclipse source editors.
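The sketch below illustrates, under our assumptions, the kind of edit a researcher would make; the class name, field name, and example values are ours and do not come from the project source.

public class KeywordList {
    // Hypothetical stand-in for the keyword list held in GoogleAP.ParseCache.java.
    // Replace these values with domain-specific terms to redirect the second pass.
    static final String[] KEYWORDS = { "simulation", "interceptor", "radar" };
}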

4.0 Application Software Design

The basic designs for this project are shown using flow charts and Unified Modeling Language (UML) views in Figures 1 and 2. Figure 1 represents the basic design and overall flow of the program, from the initial user interaction through the display of the final graphical interface containing the results.

Figure 1: Data Flow Diagram


Figure 1 represents a high-level view of the flow of data through the application. The GUI accepts strings representing the subject and parameters. The strings are passed across the computer network to the Google server, which returns a set of header fields in XML format that include the URLs for 10 web pages. The application parses the URLs from the XML document, storing the results. The application then sends a second request to the Google server requesting the contents of all the URLs obtained in the first pass. These XML documents are then parsed a second time by the application, examining the documents for the user-provided keywords. If the keywords are contained in a document, a record of the URL and keyword pair is stored for user display at the conclusion of processing.

Figure 2: UML Diagram


Figure 2 represents a CASE tool representation of the classes and their interactions for the project. The basic project design as shown in Figure 1 has been directly translated to source code as shown in Figure 2. TopGUI, SubjectGUI, UserEntrySubject, ParameterGUI, and UserEntryParameter represent the GUI shown in Figure 1. PrintResult represents the storage of the URL result in Figure 1. SearchGoogle and UserSearchGoogle represent the SearchGoogle web interaction shown in Figure 1. OpenGoogleCache and UserOpenGoogleCache represent the GoogleSearch network interaction shown in Figure 1. The parsing functions in Figure 1 are represented by Parser, UserParser, ParseCache, and UserParseCache in Figure 2.

The direct mapping of design documents to code is the preferred methodology for software development. Mapping code to design focuses the application developer's efforts on tasks directly involved with the satisfaction of user requirements and prevents the introduction of additional functionality not required by the specification.

5.0 Component Based Software Engineering

5.1 Component Based Software Engineering Overview

The movement from client-server computing to multi-tiered computing has created the need for a collection of products to translate the language of one tier into the language used by other tiers. The applications that perform the translation are known as software middleware. Nenad Medvidovic defines middleware as the "interconnections of software components which are the building blocks for software reuse through the use of connectors" [5] [10]. Middleware products carry a variety of names based upon their functionality, including application servers, workflow products, enterprise application integration (EAI) systems, extract, load, and transform (ELT) systems, and federated data systems [11]. These applications can also be called packaged components.

The proliferation of integration solutions designed to solve specific integration problems has in fact created an entirely new set of problems relating to the integration of different software packages [12]. The myriad of packaged component connectors has created the need for another set of middleware to translate message passing between middleware products; in other words, middleware to connect middleware [13]. Some in the software engineering community feel a single standard should be agreed upon to integrate different platforms and allow interoperability [14].

Almost yearly, different software vendors and research groups introduce a new product line which claims to solve the interoperability problem. These products generally introduce new interfaces aimed at allowing portability of data structures and code base from one operating system or programming language to another, using an interface which allows a component to solve specific needs. The proliferation of interfaces and components leads to complexity and inefficient environments.

As an example within this project, the base Java application code passes information through the Google™ APIs (the interface), which is then transported across the network to the Google™ server using the XML and SOAP message protocols. The answer to the query is returned as HTML text using the XML and SOAP protocols. The Google API running on the resident user's machine translates the result into a language format usable by the host application. The Java line of code which initiated the call to the Google API accepts the returned data as a byte stream. This byte stream is then manipulated using the Java language.

In essence, this project utilizes various pieces of middleware to pass simple strings of information between different technologies. The developers of the Google API could have easily used a different message passing middleware protocol rather than SOAP; examples of other middleware message passing protocols are the Object Management Group's CORBA [15], Java's RMI [16], and Microsoft's COM and DCOM [17]. SOAP-XML wrappers are readily available as part of the Java 1.4.2 JDK, and have become the de facto standard for message passing across the world wide web.

5.1.1 Frameworks, Middleware, and Design Patterns

Current software architecture design attempts to utilize connectable, reusable components to satisfy the user requirements. Figure 3 represents a graphical view of most distributed software architectures.

Figure 3: Software Architecture with Middleware [18]


Frameworks provide expertise in the form of reusable algorithms, component

implementations, and extensible architectures. Figure 4 represents the use of

frameworks.

Middleware codifies expertise in the form of standard interfaces and components that

provide applications with a simpler façade to access the complex and powerful

capabilities of frameworks.


Figure 4: Components, Frameworks, and Patterns [18]


5.1.2 Engineering Problems with Components and Distributed Applications

Challenges facing the engineer in developing distributed applications include:

• Connection Management
• Service Initialization
• Error Handling
• Flow and Congestion Control
• Event Demultiplexing
• Distribution
• Concurrency and Synchronization
• Fault Tolerance
• Scheduling and Persistence

5.1.2.1 Connection Management

Initializing and maintaining a network connection is the vital first step for any set of distributed applications. The launch of the Open Source Web Crawler requires the ability to make and maintain a wide area network connection with the Google organization's Internet, application, and database servers.

5.1.2.2 Service Initialization

Service initialization involves the set of protocols necessary to initialize and maintain a connection between distributed applications. The services can be as simple as a TCP/IP three-way handshake, or can require additional service setup at multiple endpoints to use protocols such as SOAP, CORBA, DOM, or Java RMI. This step is critical for maintaining timely and accurate data flow between applications.

5.1.2.3 Error Handling

The number of errors that can arise when using distributed applications is much greater than for stand-alone applications, and errors must be caught and handled at multiple levels. The errors can include network errors, interface usage errors, middleware usage errors, message string format errors, application logic errors, and language-specific errors.

Network errors which must be caught and handled include failed network connections, corrupted messages, improper interface usage, and message formatting errors.

Message traffic between applications may contain errors that must be caught and handled, such as improper interface use, improper syntax, improper string formats, improper message content, and errors in logic.

Application and middleware errors that must be caught and handled include logic errors, improper data formatting, improper use of language specifications, out-of-bounds cycles, user input errors, and, in extreme cases, violations of input-output behavior.

In summary, the developer must anticipate and plan for errors on multiple levels when developing distributed applications. Thorough quality assurance is critical with distributed applications. Random events, especially network events, can crash applications without warning, and the software must be able to handle these events.
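The following is a minimal sketch, not the project's actual handler, of catching errors at more than one level around the remote search call. The com.google.soap.search package names follow the Google SOAP API client used by this project; the handling logic itself is illustrative.

import com.google.soap.search.GoogleSearch;
import com.google.soap.search.GoogleSearchFault;
import com.google.soap.search.GoogleSearchResult;

public class SearchErrorSketch {
    // Returns null on any failure so the caller can decide how to proceed.
    static GoogleSearchResult trySearch(GoogleSearch s) {
        try {
            return s.doSearch();   // network and service errors surface here
        } catch (GoogleSearchFault f) {
            System.err.println("Google service error: " + f.getMessage());
        } catch (RuntimeException e) {
            System.err.println("local application error: " + e.getMessage());
        }
        return null;
    }
}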

An example of the difficulty in distributed application development can be found in Open Source Web Crawler. Should any portion of the wide area network or the Google server cease operation suddenly after the program begins execution, the program will hang awaiting a response from the remote server. Without changes to the timers in the underlying operating system kernel, this error cannot be caught. Error handlers are present and work successfully for network problems at the start of execution, but should the event occur during remote execution, the application will freeze. Across all sets of test cases this event occurred twice, with both occurrences due to the Google servers going out of service during program execution. The error condition in the local application can be traced to the SOAP protocol, which "blocks" the application to await a response. Without a response, the SOAP remote call will not allow the program to flow into an error state; instead the application hangs.
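One hypothetical way to guard against this hang, sketched below, is to run the blocking doSearch() call on a worker thread and wait on it with a timeout. The GoogleSearch classes follow the Google SOAP client used in this report, while the wrapper itself is our own addition, not part of the project.

import com.google.soap.search.GoogleSearch;
import com.google.soap.search.GoogleSearchFault;
import com.google.soap.search.GoogleSearchResult;

public class TimedSearch implements Runnable {
    private final GoogleSearch search;
    private volatile GoogleSearchResult result;   // null until a reply arrives

    TimedSearch(GoogleSearch search) { this.search = search; }

    public void run() {
        try {
            result = search.doSearch();           // may block indefinitely
        } catch (GoogleSearchFault f) {
            // a fault is reported to the caller as a null result
        }
    }

    static GoogleSearchResult searchWithTimeout(GoogleSearch s, long millis)
            throws InterruptedException {
        TimedSearch worker = new TimedSearch(s);
        Thread t = new Thread(worker);
        t.setDaemon(true);    // do not keep the JVM alive after a timeout
        t.start();
        t.join(millis);       // wait at most `millis` for the reply
        return worker.result; // null signals a timeout or fault
    }
}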

5.1.2.4 Flow and Congestion Control

Flow and congestion control are problems faced both by end-point applications and within the overall network.

Flow control deals with the flow of data within the local application and between the distributed applications. The flow for this project is graphically represented in Figure 1. Within local area and wide area networks, flow control is usually handled by internal routers, using routing tables established by algorithmically calculating the shortest or least-cost path to a destination.

Developers must also deal with congestion points within distributed applications. There are numerous methods to handle application congestion, but the most common method is controlling the timing of the calls to other distributed applications. This can be accomplished by not allowing a new call to be made until the last call has returned, or by initiating multiple threads of execution, allowing multiple remote method calls to occur almost simultaneously. With distributed method invocations across networks, the application designer must choose the methodology that correctly balances speed against accuracy. To control flow, Open Source Web Crawler uses a short loop to control the timing of remote method invocations. To speed the return times, each remote method invocation is launched with a synchronized execution thread. By calling the remote Google API interface using a thread, the application takes advantage of one of the strengths of the Java language constructs.
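The sketch below illustrates this pacing pattern under our assumptions: each cached-page fetch runs on its own thread, and a short loop waits for one call to return before launching the next. SearchGoogleCache is the project's Runnable class, but its constructor signature here is assumed.

public class PacedFetch {
    // Fetches each URL on its own worker thread, one outstanding call at a time.
    static void fetchAll(String[] urls) throws InterruptedException {
        for (int i = 0; i < urls.length; i++) {
            Thread worker = new Thread(new SearchGoogleCache(urls[i]));
            worker.start();   // launch the remote invocation on its own thread
            worker.join();    // short control loop: wait before the next call
        }
    }
}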

Distributed applications are adversely affected by network congestion. Should congestion occur at any point within a LAN or WAN, there is a probability that the packets containing the remote messaging will be discarded. While current network protocols were developed to handle routine network congestion, the synchronization of the local timers within the operating system and the absolute amount of time needed for a response may cause acute problems for distributed applications. Since network congestion is a random event, the application designer must anticipate some network congestion over time should he specify the use of a packet-switching network such as the Internet. Should the application mandate an absolute guarantee of performance, the designer will need to specify a network protocol with a guaranteed service time.

5.1.2.5 Event Demultiplexing

The job of delivering the data in a transport-layer segment to the correct application process is called demultiplexing. The job of gathering data at the source host from different application processes, enveloping the data with header information (which will later be used in demultiplexing) to create segments, and passing the segments to the network layer is called multiplexing [19]. Should the developer have multiple distributed applications executing simultaneously, there will be a need for event management. Several solutions have been proposed, with the most promising being the "reactor" object behavioral pattern utilized by the Boeing Corporation for their Bold Stroke avionics application [18].

Since Open Source Web Crawler is a stand-alone Java application process, with kernel processes being associated only with this application, the difficult issues of demultiplexing are minimized.

5.1.2.6 Distribution

Distribution of the correct version of an open source application is an issue if the developers do not utilize version management control. Numerous open source developer server sites provide open source developers with version control management tools; for example, SourceForge.com provides a complete open source developer's suite which includes version control management.

In the case of Open Source Web Crawler, there can be no open source distribution, as this is prohibited by the Google User Agreement.

5.1.2.7 Concurrency and Synchronization

Historically, concurrency control issues relate to an application process's ability to read or write a database record. This definition has been extended during the last few years to include concurrent control over multiple threads of execution in an application. With most third-generation languages, control over multiple threads of execution is a challenge.

One of the real benefits of the Java programming language is its built-in interfaces for thread control. Within the language constructs, developers are provided with the Serializable and Runnable interfaces. By using the Serializable interface, objects maintain information that defines their appearance and behavior; for example, an object's properties. There may also be internal data that is not exposed as a property but plays a role in defining an object's state. With the Serializable interface, the state information of all the components, as well as the application or applet itself, can be saved on a persistent storage medium so that it can be used to recreate the overall application state at run-time.

An important aspect of the application state is the definition of the components themselves: the persistent state of an application includes a description of the components being used, as well as their collective state. When an object is saved, all of its state is saved. This means that all handles and objects that the saved object refers to are saved as well. Only classes that implement the Serializable or Externalizable interface can be written to or read from an object stream. This is done internally by the Java™ language by attaching a serial number to each object that implements the Serializable interface.

The Runnable interface deals exclusively with threads and provides methods for controlling threads. There are multiple ways to implement threads in Java, but the choice for this project was provided in an example within the Java thread tutorial [20]. The class SearchGoogleCache.java implements the Serializable and Runnable interfaces and implements the run method. In this case, a Runnable object provides the run method to the thread.
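A minimal sketch of this pattern, as we read it from the report, is shown below; the fields and method body are illustrative, not the project's actual SearchGoogleCache source.

import java.io.Serializable;

public class SearchGoogleCache implements Serializable, Runnable {
    private String url;              // the cached page this worker will fetch
    private transient byte[] page;   // fetched bytes, excluded from saved state

    public SearchGoogleCache(String url) { this.url = url; }

    // A Runnable object provides the run method to its thread.
    public void run() {
        // fetch the cached page here (see the doGetCachedPage call in 9.5.1.3)
    }
}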

By using the Serializable and Runnable interfaces for the SearchGoogleCache class, the state of each threaded object is maintained while awaiting completion of the thread of execution. There are alternative methodologies, such as controller design patterns, but in this case the number of threads is minimal, and control can be provided by the Java™ language packages, with state information being held within the local cache memory controlled by the operating system and the microprocessor.

5.1.2.8 Fault Tolerance

Fault tolerance is multi-tiered within a distributed application. Faults can consist of processor crashes, processor commission faults, network faults due to multiple connections, operating system hangs, memory leaks, and software design errors [21] [22]. One of the primary features of a fault-tolerant design is object determinism: the state of an object must be capable of being replicated should a system need to be rolled back in time to restore a previously known safe state.

Studies have concluded that the Java language does not have sufficient support for developing complex, distributed, fault-tolerant applications; they highlight the lack of support for conversations or atomic actions [23] [24].

5.1.2.9 Scheduling and Persistence

Scheduling and persistence of data objects within a distributed application environment is a challenge and depends upon the speed of the data network. The application's ability to efficiently schedule other tasks while waiting on a reply to a query sent across a network will largely determine whether user requirements are met.

In order to execute network-bound, data-intensive distributed applications such as Open Source Web Crawler, the computation, communication, and data transfer components of the application must be mapped to resources. To achieve acceptable performance, some type of adaptive scheduling is necessary. Scheduling models such as the Adaptive Regression Model [25] have been proposed to increase the performance of data-transfer-intensive distributed applications.

Atkinson and Morrison state that persistence abstraction allows the creation and manipulation of data in a manner that is independent of its lifetime, thereby integrating the database view of information with the programming language view [26].

Scheduling and persistence performance in a distributed, random-event environment is totally dependent upon the state of the network. Developers can adjust the application in expectation of a defined delay, but delays that exceed the boundaries set within the system will occur and must be treated as exceptions, shifting the focus to the area of fault tolerance.


6.0 Quality Assurance

Software quality assurance means meeting stakeholder needs and requirements while conforming to organizational standards.

The Open Source Web Crawler project recognized the need to perform multiple types of tests on the software. During development, each Java class was analyzed for correctness using stepwise refinement and was then unit tested in the Extreme Programming style. To assist in quality assurance, the quality metrics plug-in tool for Eclipse was downloaded and linked with the project. The metrics for the project are shown in Figure 5. The chart reflects the use of a nested loop in ParseCache.java, where the contents of one file are compared with the contents of another file. This was a design decision based upon the piped call-and-return architecture of the application. By using this design, the nested depth (2.259) is larger than would have been the case with another design pattern for the class ParseCache.java.

After all modules had been tested and configured, the entire application was tested. The only known remaining bug is caused by a sudden crash, or congestion-related packet loss, at the Google server while a request is in the queue: the local application remains in a BLOCKED state awaiting a reply which will never arrive. This bug was experienced in less than 1% of all test cases.


Figure 5: Software Quality Metrics

7.0 Requirements Audit

The project requirements were:

2.1.1 Using freely available software tools, search the world-wide web for URI's containing user defined subjects and keywords.

2.1.2 The system should allow breadth by allowing user input for any specific subject, while providing specificity by evaluating the file contents against user defined keywords.

2.1.3 The system should return subject URL locations, if available, for extraction and parsing.

2.1.4 The system shall maintain a record of URL's returned during each search for future reference.

2.1.5 The solution should return the keyword-URL pair to the user for further human-centered review.

The following sections reflect the specific graphical user interfaces and output screens for the project and highlight how each GUI or screen satisfies a portion of the project requirements.

7.1 Accept User Input of Subjects Related to the Sponsor Requirements

To facilitate multiple types of user input, the application offers two initial graphical user interfaces chosen from a welcome screen, as shown in Figure 6. The graphical interface allows extensibility by allowing a researcher to follow two separate tracks of execution: a track comprised of a pre-defined set of subjects and keywords, or a track with user-entered subjects and keywords. In either case, the initial user interface in Figure 6 allows the user to navigate to the track which satisfies their specific requirements.


Figure 6: Introductory GUI

After the user chooses direct user input or the default list, the appropriate screen, represented by Figure 7 or Figure 8, will appear. The user input GUI appears in Figure 7, and the default user GUI is shown in Figure 8. The user interface shown in Figure 7 allows the researcher to enter subjects of interest. This functionality satisfies the portion of requirement 2.1.6 specifying that users will define the subject of the search. As a note, the search will be for instances of this keyword within the body of documents held within the Google cache.


Satisfies Requirement 2.1.2; also satisfies Requirement 2.3 (Multiple Domain Usage)

Figure 7: User Provided Subject Input

Partial Satisfaction Requirement: 2.1.2

Figure 8: User Default Choice Table GUI


7.2 Accept User Entered Parameters

Should the user decide to enter a set of parameters instead of using the pre-defined set

supplied by the application, the graphical interface shown in Figure 9 allows this

functionality.

Completes Solution for Requirement: 2.1.2

Figure 9: User Parameter Input GUI

If the user decides to use the pre-defined parameter list, this list is contained in the file

ParseCache.java within the package GoogleAP.

7.3 Provide a Listing of URL’s Related by the Subject and Parameters as Output

After accumulating the results from the second-pass search, the GUI shown in Figure 10 is output to the terminal. Note that the parameter is shown in the left-hand column and the URL is located in the right column.


Satisfies Requirement 2.1.5

Figure 10: Application Output - Parameter/URL Pair

7.4 Application Progress Screen: Quality Control and User Information

To provide the user with the status and progress of the application, output is provided to the terminal console reflecting which files are being examined at a given time. This is shown in Figure 11. The items which contain the parameter and the subject are stored in the local file "subject"KeyWordCache.txt on the local file system.


An example of the persistent file is shown in Figure 12. This file satisfies requirement

2.1.9.

Satisfies Requirement 2.1.3 and 2.1.5

Figure 11: Console Output and Persistent Storage of Results

Satisfies Requirement: 2.1.9

Figure 12: Example of a Parameter-URL Pair Stored in a Local File


7.5 Use of Freely Available Software Tools

Requirement 2.1.1 requires the use of freely available software tools to search the world-wide web for URI's containing user defined subjects and keywords.

7.5.1 Google Open Source API - Search Agent

The Google API provides a methodology for extracting subjects from the 4.28 billion web pages cached on the Google database server. (Satisfies Requirement 2.1.1.)

7.5.2 Java J2EE Open Source Swing GUI Examples

Java provides sample graphical user interfaces for developers to modify to satisfy their specific requirements.

7.5.3 Eclipse Open Source IDE

Eclipse provides methods for linking various software tools together

into a unified application to solve problem requirements.

7.5.4 Java 1.4.2 Open Source JDK Development Kit

Sun Microsystems provides the Java language development kit (JDK) as an open source platform for developing applications. The development kit contains classes and libraries for input and output, as well as methods for data manipulation.


8.0 SOAP and XML

SOAP and XML are forms of middleware connectors, the "glue" which binds the components together into a unified whole [27]. Connectors allow message passing between components and, in some cases, form a complete component which sits between other components.

XML is used to describe documents and data in a standardized, text-based format that can easily be transported via standard Internet protocols. XML is based upon the Standard Generalized Markup Language (SGML). HTML was the first popular and widely used adaptation of SGML; HTML differs from XML in that it describes the layout of a document. XML facilitates data integration by providing a transport with which to receive and send data in a common format [28]. Durant and Benz consider XML the "glue" that holds data integration solutions together [28, pg. 6]. Data cannot be formatted into XML without other tools or programming languages that specifically format data into the XML format.

Applications must parse an XML document prior to use. The W3C Document Object Model (DOM) Recommendation is the only parsing model recommended for XML document parsing [29]. DOM parses the entire XML document into a node tree structure, from which either parts or all of a document may be retrieved.
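A minimal sketch of DOM parsing in Java, under our assumptions, is shown below; the file name and element name are illustrative, not taken from the project.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomSketch {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse("results.xml");      // whole document becomes a node tree
        NodeList urls = doc.getElementsByTagName("URL");  // retrieve part of the tree
        for (int i = 0; i < urls.getLength(); i++) {
            System.out.println(urls.item(i).getFirstChild().getNodeValue());
        }
    }
}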

The Simple Object Access Protocol (SOAP) is designed to allow the invocation of remote applications independent of platform and programming language. Most distributed applications communicate using remote procedure call (RPC) mechanisms between objects, using protocols such as DCOM or CORBA. Because an RPC usually carries a request to do something important, RPC represents a compatibility and security vulnerability that firewalls and proxy servers will usually block. SOAP is the protocol that will carry this important message traffic across the network. SOAP makes it possible to communicate between applications running on different operating systems, with different technologies and programming languages [30]. SOAP messages are themselves XML documents, so they can be processed with DOM-style tools.

Overall, XML is a natural way to format and send data. SOAP adds message passing capabilities, encoding for data representation, and RPC descriptors to an XML object.

9.0 Program Structure and Design Patterns

9.1 C2 Architectural Style

Figure 13: C2 Architectural Style [31]

The C2 architectural style is a component- and message-based style for highly distributed software systems, generalized from the architectures of GUI-intensive systems. C2 architectures generally consist of networks of concurrent components hooked together by connectors. Figure 13 provides a graphical view. Among the style's characteristics are:

• no component-to-component links
• a "one up, one down" rule for components
• connector-to-connector links are allowed
• a "many up, many down" rule for connectors
• all communication by exchanging messages

The C2 style is the basic style used for passing SOAP messages between components in the project.

9.2 Model View Controller

The MVC abstraction can be graphically represented as in Figure 14.

Figure 14: Model View Controller [32]

This style is used extensively in the project: the Model is the Google API and linking source code, the View is the GUIs, and the Controller is the GUI interactions. In addition, this pattern was used inside the GUI classes to represent actions performed when user actions occurred. The controller (GUI) code captures a user action and translates the action into an event.

9.3 Event Style

Event style software is found within the GUI and the Google API. When a user selects a particular course of action, it triggers an event. This is shown graphically in Figure 15, which represents the actionPerformed method internal to Java Swing components; an illustrative sketch follows the figure.

Figure 15: Action Listener Event Launcher [32]
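The sketch below is an illustrative example of this event style, not the project's GUI code: the user's action fires an ActionEvent, and the registered listener's actionPerformed method translates it into application behavior.

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.JButton;
import javax.swing.JFrame;

public class EventSketch {
    public static void main(String[] args) {
        JFrame frame = new JFrame("Open Source Web Crawler");
        JButton search = new JButton("Search");
        search.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                System.out.println("launch the first-pass search here");
            }
        });
        frame.getContentPane().add(search);
        frame.pack();
        frame.setVisible(true);
    }
}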

9.4 Factory Patterns

This pattern is used with user choices from a GUI. The user selects the entry type from a list; the choice may be made with a list box, a check box, an entry box, or any combination. The user choice is then passed into a factory method to instantiate the correct object. Figure 16 below shows the standard UML representation of a factory design pattern; a hypothetical code sketch follows the figure.


Figure 16: Factory Pattern
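The sketch below is a hypothetical example of this factory usage; all names are illustrative and do not appear in the project source. The string the user selects in the GUI is passed to a factory method, which instantiates the matching object.

interface SearchTrack {
    void execute();
}

class DefaultTrack implements SearchTrack {
    public void execute() { /* run the pre-defined subject list */ }
}

class UserTrack implements SearchTrack {
    public void execute() { /* run a subject the user typed in */ }
}

class SearchTrackFactory {
    // The user's GUI choice selects which concrete track to instantiate.
    static SearchTrack create(String choice) {
        if ("default".equals(choice)) {
            return new DefaultTrack();
        }
        return new UserTrack();
    }
}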

9.5 CASE Structure of Open Source Web Crawler

Modern CASE-based tools are particularly useful for abstract visualization of an application. Figure 17 reflects the top-level view of the project. Files which parse returned documents are contained within the package FileParser. The package GoogleAP contains the files which interact with the Google API. The package googleapi contains the jar files which can be used locally to interact with the Google organization server. The GUI package contains the classes which provide the user interfaces for the application. The documentation for each class is provided in the package javadocs. The license package contains the licenses to interact with Google. The final packages are helper files and samples, containing the original jar and zip files which were downloaded from the Google download site.


Figure 17: Case Based View of Open Source Web Crawler


9.5.1 Use of Google API within Open Source Web Crawler

9.5.1.1 GUI.java: The user's subject selection is passed to the SearchGoogle.java class. This call satisfies requirement 2.1.6.

SearchGoogle sg = new SearchGoogle();
sg.Search((String) selection);

9.5.1.2 SearchGoogle.java: GUI.java passes the subject selection to this class, and the following code snippet requests a response from Google. This set of calls satisfies requirements 2.1.7, 2.1.8, and 2.1.9.

GoogleSearch s = new GoogleSearch();
s.setKey(key);
s.setQueryString(AdjustedSubject);
s.setLanguageRestricts("lang_en");
s.setFilter(true);
s.setStartResult(loop * 10);
GoogleSearchResult r = s.doSearch();   // send the query and wait for the result
GoogleSearchResult RA = r;             // set up a local-scope alias
RA.getResultElements();                // place the returned results into an Object array

// Place the results in persistent storage
String[] result = new String[10];
for (int x = 0; x <= 0; ++x) {
    result[x] = RA.toString();
}


9.5.1.3 OpenGoogleCache.java: Prior to using this class, the local application has parsed the URLs from the returned XML provided by the class GoogleSearch.java. The classes which extract the URLs are Parser.java and ParseURL.java. The URLs are passed to this class (OpenGoogleCache), and the contents of the Google database cache are returned as an XML document. The following call is one of the primary components of the application.

byte[] cacheBytes = s.doGetCachedPage(text);

9.5.2 Second Pass Search

Once the URLs are parsed and the contents of the Google database server cache have been obtained, the application must parse the returned XML content against a set of keywords to determine whether they are present at a particular URL. This evaluation is performed by the class ParseCache.java. As the evaluator finds matches to the user defined keywords, the URL and keyword are stored locally to satisfy requirement 2.1.4. The contents of this file are then presented as a user interface (TableBuild.java), as defined by requirement 2.1.5.
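A minimal sketch of this second-pass evaluation is shown below, assuming the cached page has already been decoded to a String; the class, method, and variable names are ours, not ParseCache.java's.

public class SecondPassSketch {
    // Prints one keyword-URL pair per keyword found in the page (cf. requirement 2.1.4).
    static void scan(String url, String page, String[] keywords) {
        for (int i = 0; i < keywords.length; i++) {
            if (page.indexOf(keywords[i]) >= 0) {
                System.out.println(keywords[i] + "\t" + url);
            }
        }
    }
}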


9.6 Graphical View of the Application

Figure 18 represents a graphical view of the interactions within the application. The main GUI interacts with the user, accepting a subject and parameters. The dark boxes represent classes which interact with the Google APIs; the remainder is locally written code.

Figure 18: Graphical View of the Application (light boxes: local code)


9.7 Hierarchy Overview

Figure 19 represents the hierarchy view generated by the Java language's javadoc document generator. The hierarchy reflects the use of three packages for the project and their associated classes. Note the use of the observers, listeners, containers, and the Serializable interface.

Figure 19: Project Tree View


10.0 Correctness

To verify the correctness of the application, the following tests were conducted.

1. Identify a known singleton instance on the web for coverage

A single known reference to a web page was used: subject "Sachitano" with keyword "ADTRAN." This subject-keyword pair was known to exist only on the auburn.edu web server, within a resume. The output of this search is shown in Figure 20 and verifies that the application is able to locate singleton instances.

Figure 20: Correctness of a Singleton Instance


2. Comparison of a Normal Search to the Application

To show the advantage of performing a search using Open Source Web Crawler, an inspection of various subjects was performed. The summary report in Figure 21 reflects the benefit of this methodology. The last two columns compare the number of sites which contained parameters of interest to the research effort. Note that the 48.8% average indicates that almost one-half of the pages extracted from the Google server's cache contained parameters of interest, compared with 12.8% using the regular Google methods.

Figure 21: Comparison of Project with Regular Google Search

The marked difference in effectiveness can be directly attributed to the fact that the first-pass search extracts and returns only those pages containing the subject within the text body. Column one compares the number of web pages returned during a normal Google search with the number returned which contain the subject in the text body; viewed differently, the regular search returned web pages containing content not related to the research effort.


Prior to the development of this project, this researcher manually searched
through the returned pages of a normal Google search for parameter content.
This effort was arduous and time consuming. Not only does this project
return the URL addresses of more useful web sites, it also reduces the time
spent reading through vacuous material whose content is unrelated to the
research effort.

The average time required to perform a 50-page, two-pass search varies
between 2 and 4 minutes, depending upon the state of the networks involved.
A manual review of the returned material must still be performed by the
researcher, but it is undertaken with the knowledge that parameters
(keywords) are present in the material. In the future this project could be
extended to identify phrases using artificial intelligence algorithms.

11.0 Conclusions

The Open Source Web Crawler application provides an automated, two-pass,
user-defined search tool that evaluates the contents of the Google
organization's 4.28-billion-page world-wide web database for subject matter
of interest. The tool then performs a second search of the pages returned by
the Google search to determine whether user-defined keywords are located
within a page. Those pages which contain both the subject of the search and
the user-defined keywords are returned to the user. On average, the
application identified 48.8% of the pages returned during the first-pass
Google search as items of interest. This application reduced the research
time significantly, as the computing application was able to read and
identify records of interest at a much faster rate than a human.


The Open Source Web Crawler is an example of the current software
engineering practice of combining open source applications, middleware, and
locally developed code to provide a solution to a problem. Various
frameworks, interfaces, components, and patterns are used in the design of
the locally developed code to link distributed applications.



Appendix 1: Project Source Code

The source code for the project is presented in the order in which it is
encountered by the user.

1.0 Graphical User Interfaces

1.1 TopGUI.java

/* Created on Jan 18, 2004 */
/**
 * @author Van A. Norris
 */
package GUI;

import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import SubjectGUI;

public class TopGUI extends JComponent implements MouseMotionListener, ActionListener {
    // Coordinates for the message
    int messageX = 125, messageY = 95;
    String theMessage;
    JButton theButton;
    JButton button2;

    // This class sets up a user interface asking for user entry or the default list
    public TopGUI(String message) {
        theMessage = message;
        theButton = new JButton("Press for User Entry");
        setLayout(new FlowLayout());
        theButton.setBackground(Color.RED);
        add(theButton);
        theButton.addActionListener(this);
        addMouseMotionListener(this);
        button2 = new JButton("Default Parameters");
        setLayout(new FlowLayout());
        button2.setBackground(Color.ORANGE);
        add(button2);
        button2.addActionListener(this);
        addMouseMotionListener(this);
    } // end TopGUI constructor

    public void paintComponent(Graphics g) {
        g.drawString(theMessage, messageX, messageY);
    } // end paintComponent

    public void mouseDragged(MouseEvent e) {
        // Save the mouse coordinates and paint the message.
        messageX = e.getX();
        messageY = e.getY();
        repaint();
    } // end mouseDragged event

    public void mouseMoved(MouseEvent e) {}

    public void actionPerformed(ActionEvent e) {
        // Did somebody push the button?
        if (e.getSource() == theButton) {
            loadUserEntry();
        } // end if
        else if (e.getSource() == button2) {
            loadGUI();
        } // end else if
    } // end actionPerformed

    public void loadGUI() {
        // GUI g = new GUI();
        // g.UseListGUI();
        SubjectGUI sg = new SubjectGUI();
        sg.UseList();
    } // end loadGUI

    public void loadUserEntry() {
        UserEntrySubject UES = new UserEntrySubject();
        UES.display();
    } // end loadUserEntry

    public static void main(String[] args) {
        JFrame f = new JFrame("Open Source Search Welcome");
        // Make the application exit when the window is closed.
        f.addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent we) {
                System.gc();
                System.exit(0);
            }
        }); // end windowListener event
        f.setSize(600, 300);
        f.setBackground(Color.BLUE);
        f.getContentPane().add(new TopGUI(
            "Welcome to Open Source Search!... Web Page Content Search..English Version"));
        f.setVisible(true);
    } // end main
} // end class

1.2 UserEntrySubject.java

/**
 * @author Van A. Norris, Jan 2004
 * UserEntrySubject allows a user to manually enter a subject that will be
 * searched. This class links to SearchGoogle.java, which performs a search
 * using the Google API's.
 * @see google.search.soap, or review the Google API user docs available
 * from Google.
 */
package GUI;

import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
// import GoogleAP.SearchGoogle;

public class UserEntrySubject {
    private int WIDTH = 300;
    private int HEIGHT = 275;
    private JFrame frame;
    private JPanel panel;
    private JLabel inputLabel, outputLabel, resultLabel;
    private JTextField fahrenheit;

    //-----------------------------------------------------------------
    // Sets up the GUI.
    //-----------------------------------------------------------------
    public UserEntrySubject() {
        frame = new JFrame("User Entered Search Item");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setLocation(200, 50);
        inputLabel = new JLabel("Enter Search Subject then Press ENTER:");
        outputLabel = new JLabel("Sent: ");
        resultLabel = new JLabel("---");
        fahrenheit = new JTextField(25);
        fahrenheit.addActionListener(new TempListener());
        panel = new JPanel();
        panel.setPreferredSize(new Dimension(WIDTH, HEIGHT));
        panel.setBackground(Color.yellow);
        panel.add(inputLabel);
        panel.add(fahrenheit);
        panel.add(outputLabel);
        panel.add(resultLabel);
        frame.getContentPane().add(panel);
    }

    //-----------------------------------------------------------------
    // Displays the primary application frame.
    //-----------------------------------------------------------------
    public void display() {
        frame.pack();
        frame.show();
    }

    //*****************************************************************
    // Represents an action listener for the subject input field.
    //*****************************************************************
    private class TempListener implements ActionListener {
        //--------------------------------------------------------------
        // Converts the entry to a string and passes the subject to the
        // parameter GUI.
        //--------------------------------------------------------------
        public void actionPerformed(ActionEvent event) {
            String in;
            String text = fahrenheit.getText();
            in = text.toString();
            resultLabel.setText(in);
            String subject = resultLabel.getText();
            // SearchGoogle sg = new SearchGoogle();
            // sg.Search(subject);   // For future use
            UserEntryParameter UMP = new UserEntryParameter(subject);
            UMP.display();
        }
    }
}

1.3 UserEntryParameter.java


1.4 SubjectGUI.java

import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import javax.swing.event.*;
import java.io.*;
import java.util.*;
import GoogleAP.SearchGoogle;
import GUI.ParameterGUI;

public class SubjectGUI extends JFrame implements ListSelectionListener {
    private JList list;
    private DefaultListModel listModel;
    private static final String hireString = "Add";
    private static final String fireString = "Remove";
    private static final String goString = "Launch";
    private JButton fireButton;
    private JButton goButton;
    private JTextField employeeName;
    private static String trimFile = "theGoods";
    private String trim1 = trimFile.trim();
    public static Vector subjectVector = new Vector();
    private static String passed;

    public SubjectGUI() {
        super("Subject List...Select Subject and Press Launch or Enter New Subject");
        listModel = new DefaultListModel();
        // Load the listModel with data from the subject file
        try {
            FileReader fr = new FileReader(trim1);
            BufferedReader in = new BufferedReader(fr);
            String text1 = in.readLine();
            while (text1 != null) {
                listModel.addElement(text1);
                subjectVector.add(text1);
                text1 = in.readLine();
            } // end while
            in.close();
        }
        catch (IOException e) {
            System.out.println("error reading the subject file in SubjectGUI " + e.getMessage());
        }

        // Create the list and put it in a scroll pane
        list = new JList(listModel);
        list.setSelectionMode(ListSelectionModel.SINGLE_SELECTION);
        list.setSelectedIndex(0);
        list.addListSelectionListener(this);
        JScrollPane listScrollPane = new JScrollPane(list);

        JButton hireButton = new JButton(hireString);
        hireButton.addActionListener(new HireListener());
        fireButton = new JButton(fireString);
        fireButton.addActionListener(new FireListener());
        goButton = new JButton(goString);
        goButton.addActionListener(new GoListener());

        employeeName = new JTextField(30);
        employeeName.addActionListener(new HireListener());
        String name = listModel.getElementAt(list.getSelectedIndex()).toString();
        employeeName.setText(name);

        // Create a panel that uses FlowLayout (the default).
        JPanel buttonPane = new JPanel();
        buttonPane.add(employeeName);
        buttonPane.add(hireButton);
        buttonPane.add(fireButton);
        buttonPane.add(goButton);

        Container contentPane = getContentPane();
        contentPane.add(listScrollPane, BorderLayout.CENTER);
        contentPane.add(buttonPane, BorderLayout.SOUTH);
    }

    class FireListener implements ActionListener {
        public void actionPerformed(ActionEvent e) {
            // This method can be called only if there's a valid selection,
            // so go ahead and remove whatever's selected.
            int index = list.getSelectedIndex();
            listModel.remove(index);
            // Remove the subject from the main list
            subjectVector.removeElementAt(index);
            int size = listModel.getSize();
            if (size == 0) {
                // Nobody's left, disable firing.
                fireButton.setEnabled(false);
            }
            else {
                // Adjust the selection.
                if (index == listModel.getSize())
                    index--; // removed item in last position
                list.setSelectedIndex(index); // otherwise select same index
            }
        }
    }

    // ******************** New section for launch
    class GoListener implements ActionListener {
        public void actionPerformed(ActionEvent e) {
            // This method can be called only if there's a valid selection.
            int index = list.getSelectedIndex();
            Object answer = listModel.getElementAt(index);
            String answer1 = answer.toString();
            // The next line is a local setter to be passed from within the GUI code
            passed = answer1;
            // At this point the user has provided the subject. The subject is
            // then passed to the search engine for URL's. First find out what
            // parameters are needed at this point.
            startParamGUI();
            int size = listModel.getSize();
            if (size == 0) {
                goButton.setEnabled(false);
            }
            else {
                // Adjust the selection.
                if (index == listModel.getSize())
                    index--; // removed item in last position
                list.setSelectedIndex(index); // otherwise select same index
            }
        }
    }

    // This listener is shared by the text field and the hire button
    class HireListener implements ActionListener {
        public void actionPerformed(ActionEvent e) {
            // User didn't type in a name...
            if (employeeName.getText().equals("")) {
                Toolkit.getDefaultToolkit().beep();
                return;
            }
            int index = list.getSelectedIndex();
            int size = listModel.getSize();
            // If no selection or if the item in the last position is selected,
            // add the new entry to the end of the list and select it.
            if (index == -1 || (index + 1 == size)) {
                listModel.addElement(employeeName.getText());
                list.setSelectedIndex(size);
                // Add the subject to the permanent list
                subjectVector.add(employeeName.getText());
            }
            // Otherwise insert the new entry after the current selection and select it.
            else {
                listModel.insertElementAt(employeeName.getText(), index + 1);
                list.setSelectedIndex(index + 1);
                // Add the subject to the permanent list
                subjectVector.add(employeeName.getText());
            }
        }
    }

    public void valueChanged(ListSelectionEvent e) {
        if (e.getValueIsAdjusting() == false) {
            if (list.getSelectedIndex() == -1) {
                // No selection, disable fire button.
                fireButton.setEnabled(false);
                employeeName.setText("");
            }
            else {
                // Selection, update text field.
                fireButton.setEnabled(true);
                String name = list.getSelectedValue().toString();
                employeeName.setText(name);
            }
        }
    }

    public static void saveList(Vector v) {
        Vector v2 = v;
        v2.trimToSize();
        String outFile1 = trimFile;
        try {
            File out1 = new File(outFile1);
            FileWriter fw1 = new FileWriter(out1, false);
            PrintWriter pw = new PrintWriter(fw1, false);
            for (int cc = 0; cc < v2.size(); cc++) {
                Object o = v2.elementAt(cc);
                pw.println(o.toString());
            }
            pw.close();
        }
        catch (IOException e) {
            System.out.println("error writing to " + trimFile);
        }
    }

    public void startParamGUI() {
        ParameterGUI pg = ParameterGUI.makeParameterGUI(passed);
    }

    public void startSearchGoogle(String passIn) {
        String passed = passIn;
        System.out.println("This was passed " + passed);
        SearchGoogle sg = new SearchGoogle();
        sg.Search((String) passed);
    }

    public void UseList() {
        JFrame frame = new SubjectGUI();
        frame.addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent e) {
                saveList(subjectVector);
                System.gc();
                System.exit(0);
            }
        });
        frame.setSize(650, 650);
        frame.setLocation(75, 150);
        frame.setVisible(true);
    }
} // end class

1.5 ParameterGUI.java

package GUI;

import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import javax.swing.event.*;
import java.io.*;
import java.util.*;
import GoogleAP.SearchGoogle;

public class ParameterGUI extends JFrame implements ListSelectionListener {
    private JList list;
    private DefaultListModel listModel2;
    private static final String hireString = "Add";
    private static final String fireString = "Remove";
    private static final String goString = "Launch";
    private JButton fireButton;
    private JButton goButton;
    private JTextField employeeName;
    private static String trimFile = "theParams";
    public int testcounter = 0;
    public static String subject;
    public static Vector paramVector = new Vector();

    public ParameterGUI() {
        super("Adjust the Parameters For Your Search, Press Launch");
        String trim1 = trimFile.trim();
        listModel2 = new DefaultListModel();
        // Load the listModel with data from the parameter file
        try {
            FileReader fr = new FileReader(trim1);
            BufferedReader in = new BufferedReader(fr);
            String text1 = in.readLine();
            while (text1 != null) {
                listModel2.addElement(text1);
                paramVector.add(text1);
                testcounter++;
                text1 = in.readLine();
            } // end while
        } catch (IOException e) {
            System.out.println("error reading the parameter file in ParameterGUI " + e.getMessage());
        }

        // Create the list and put it in a scroll pane
        list = new JList(listModel2);
        list.setSelectionMode(ListSelectionModel.SINGLE_SELECTION);
        list.setSelectedIndex(0);
        list.addListSelectionListener(this);
        JScrollPane listScrollPane = new JScrollPane(list);

        JButton hireButton = new JButton(hireString);
        hireButton.setActionCommand(hireString);
        hireButton.addActionListener(new HireListener());
        fireButton = new JButton(fireString);
        fireButton.setActionCommand(fireString);
        fireButton.addActionListener(new FireListener());

        employeeName = new JTextField(30);
        employeeName.addActionListener(new HireListener());
        String name = listModel2.getElementAt(list.getSelectedIndex()).toString();
        employeeName.setText(name);

        goButton = new JButton(goString);
        goButton.setActionCommand(goString);
        goButton.addActionListener(new GoListener());

        // Create a panel that uses FlowLayout (the default).
        JPanel buttonPane = new JPanel();
        buttonPane.add(employeeName);
        buttonPane.add(hireButton);
        buttonPane.add(fireButton);
        buttonPane.add(goButton);

        Container contentPane = getContentPane();
        contentPane.add(listScrollPane, BorderLayout.CENTER);
        contentPane.add(buttonPane, BorderLayout.SOUTH);
    }

    class FireListener implements ActionListener {
        public void actionPerformed(ActionEvent e) {
            // This method can be called only if there's a valid selection,
            // so go ahead and remove whatever's selected.
            int index = list.getSelectedIndex();
            listModel2.remove(index);
            // Remove the parameter from the main list
            paramVector.removeElementAt(index);
            int size = listModel2.getSize();
            if (size == 0) {
                // Nobody's left, disable firing.
                fireButton.setEnabled(false);
            } else {
                // Adjust the selection.
                if (index == listModel2.getSize())
                    index--; // removed item in last position
                list.setSelectedIndex(index); // otherwise select same index
            }
        }
    }

    // ******************** New section for launch
    class GoListener implements ActionListener {
        public void actionPerformed(ActionEvent e) {
            // This section should read the current list (vector) and save
            // the contents.
            saveList(paramVector);
            printList(paramVector);
            SearchGoogle SGG = new SearchGoogle();
            SGG.Search((String) subject);
        }
    }

    // This listener is shared by the text field and the hire button
    class HireListener implements ActionListener {
        public void actionPerformed(ActionEvent e) {
            // User didn't type in a name...
            if (employeeName.getText().equals("")) {
                Toolkit.getDefaultToolkit().beep();
                return;
            }
            int index = list.getSelectedIndex();
            int size = listModel2.getSize();
            // If no selection or if the item in the last position is selected,
            // add the new entry to the end of the list and select it.
            if (index == -1 || (index + 1 == size)) {
                listModel2.addElement(employeeName.getText());
                list.setSelectedIndex(size);
                // Add the parameter to the permanent list
                paramVector.add(employeeName.getText());
            }
            // Otherwise insert the new entry after the current selection and select it.
            else {
                listModel2.insertElementAt(employeeName.getText(), index + 1);
                list.setSelectedIndex(index + 1);
                // Add the parameter to the permanent list
                paramVector.add(employeeName.getText());
            }
        }
    }

    public void valueChanged(ListSelectionEvent e) {
        if (e.getValueIsAdjusting() == false) {
            if (list.getSelectedIndex() == -1) {
                // No selection, disable fire button.
                fireButton.setEnabled(false);
                employeeName.setText("");
            } else {
                // Selection, update text field.
                fireButton.setEnabled(true);
                String name = list.getSelectedValue().toString();
                employeeName.setText(name);
            }
        }
    }

    public static void printList(Vector v) {
        Vector v1 = v;
        for (int y = 0; y < (v1.size()); y++) {
            System.out.println(v1.elementAt(y).toString());
        }
    }

    public static void saveList(Vector v) {
        Vector v2 = v;
        v2.trimToSize();
        String outFile = trimFile;
        try {
            File out3 = new File(outFile);
            PrintWriter pw = new PrintWriter(new FileWriter(out3));
            for (int cc = 0; cc < v2.size(); cc++) {
                Object o = v2.elementAt(cc);
                pw.println(o.toString());
            }
            pw.close();
        } catch (IOException e) {
            System.out.println("error writing to " + trimFile);
        }
    }

    public static ParameterGUI makeParameterGUI(String subject1) {
        subject = subject1;
        System.out.println(subject);
        ParameterGUI frame = new ParameterGUI();
        frame.addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent e) {
                System.gc();
                System.exit(0);
            }
        });
        frame.setSize(650, 650);
        frame.setLocation(150, 75);
        frame.setVisible(true);
        return frame;
    }
} // end class


2.0 Java Package: File Parser

2.1 Parser.java

package FileParser;

import java.util.*;
import java.io.*;
import GoogleAP.OpenGoogleCache;

/*
 * Created on Dec 30, 2003
 *
 * The class Parser accepts a file containing results from a Google search
 * request and parses the URL associated with each of the ten results
 * returned. The URL is located by finding the string "URL" at the beginning
 * of a line. When "URL" is found, this class looks to the third token on
 * that line and extracts the URL. Since the returned format is consistent,
 * this methodology is effective.
 */
/**
 * @author Van A. Norris
 * Class Parser parses the URLs from the results of a Google search.
 */
public class Parser implements Runnable {
    private String fileName;
    private String subject;
    private int counter;

    public void run() {}

    /**
     * parseForUML - parse a file for URL addresses
     * @param fileName - name of the file to parse (physical address)
     * @param subject - string identifying the subject
     * see the GoogleAPI searchResult for details
     */
    public void parseForUML(String fileName, String subject, int counter) {
        this.fileName = fileName;
        this.subject = subject;
        this.counter = counter;
        String OutFile1 = subject + "UML.txt";
        String OutFile = OutFile1.trim();
        try {
            // Create one object of GoogleAP.OpenGoogleCache
            OpenGoogleCache OGC = new OpenGoogleCache();
            // Set the output file
            File out1 = new File(OutFile);
            FileWriter fw1 = new FileWriter(out1);
            PrintWriter pw2 = new PrintWriter(fw1);
            String parseFile1 = fileName; // internalize parameter
            // Establish the file reader
            FileReader fr1 = new FileReader(parseFile1); // reads the file in its entirety
            BufferedReader in1 = new BufferedReader(fr1); // places the file in a storage buffer
            String text1 = in1.readLine(); // read the first line
            while (text1 != null) {
                StringTokenizer st1 = new StringTokenizer(text1); // create a tokenizer object
                while (st1.hasMoreTokens()) {
                    String tok = st1.nextToken(); // places the next token in storage
                    if (tok.equals("URL")) {
                        String tok2 = st1.nextToken();
                        String tok3 = st1.nextToken();
                        pw2.println(tok3);
                    } // end if
                } // end while
                text1 = in1.readLine();
            } // end while
            pw2.close();
            // This is one of the primary sections of the code, since it holds
            // the current values being investigated along with the temporary
            // files holding the data. Pass the file with the URL's to
            // OpenGoogleCache.java.
            OGC.run(OutFile, subject, counter);
        } // end try
        catch (IOException ioe) {
            System.out.println("IOException in Parser " + ioe);
        } // end catch
    } // end method
} // end class


2.2 UserParser.java

/*
 * Created on Mar 4, 2004
 * @author Van A. Norris
 */
package FileParser;

import java.util.*;
import java.io.*;
import GoogleAP.UserOpenGoogleCache;

/*
 * The class UserParser accepts a file containing results from a Google
 * search request and parses the URL associated with each of the ten results
 * returned. The URL is located by finding the string "URL" at the beginning
 * of a line. When "URL" is found, this class looks to the third token on
 * that line and extracts the URL. Since the returned format is consistent,
 * this methodology is effective.
 */
/**
 * @author Van A. Norris
 * UserParser parses the URL from the results of a Google search.
 */
public class UserParser implements Runnable {
    private String fileName;
    private String subject;
    private int counter;

    public void run() {}

    /**
     * parseForUML - parse a file for URL addresses
     * @param fileName - name of the file to parse (physical address)
     * @param subject - string identifying the subject
     */
    public void parseForUML(String fileName, String subject, int counter) {
        this.fileName = fileName;
        this.subject = subject;
        this.counter = counter;
        String OutFile1 = subject + "UML.txt";
        String OutFile = OutFile1.trim();
        try {
            // Create one object of GoogleAP.UserOpenGoogleCache
            UserOpenGoogleCache OGC = new UserOpenGoogleCache();
            // Set the output file
            File out1 = new File(OutFile);
            FileWriter fw1 = new FileWriter(out1);
            PrintWriter pw2 = new PrintWriter(fw1);
            String parseFile1 = fileName; // internalize parameter
            // Establish the file reader
            FileReader fr1 = new FileReader(parseFile1); // reads the file in its entirety
            BufferedReader in1 = new BufferedReader(fr1); // places the file in a storage buffer
            String text1 = in1.readLine(); // read the first line
            while (text1 != null) {
                StringTokenizer st1 = new StringTokenizer(text1); // create a tokenizer object
                while (st1.hasMoreTokens()) {
                    String tok = st1.nextToken(); // places the next token in storage
                    if (tok.equals("URL")) {
                        String tok2 = st1.nextToken();
                        String tok3 = st1.nextToken();
                        pw2.println(tok3);
                    } // end if
                } // end while
                text1 = in1.readLine();
            } // end while
            pw2.close();
            // Pass the file with the URL's to UserOpenGoogleCache.java
            OGC.run(OutFile, subject, counter);
        } // end try
        catch (IOException ioe) {
            System.out.println("IOException in UserParser " + ioe);
        } // end catch
    } // end method
} // end class


2.3 UserParam_ParseCache.java

package FileParser;

import java.util.*;
import java.io.*;

/**
 * @author Van A. Norris
 * January 2004
 * Title: UserParam_ParseCache
 * passToken: the word passed from UserParseCache.java
 * throws java.io exception
 */
public class UserParam_ParseCache {
    private final String file = "Parameters.txt";
    private String passToken;
    private String adjTok;

    public boolean Look(String passToken) {
        this.passToken = passToken;
        boolean check = false;
        try {
            FileReader fr1 = new FileReader(file); // reads the file in its entirety
            BufferedReader in1 = new BufferedReader(fr1); // places the file in a storage buffer
            String text1 = in1.readLine(); // read the first line
            while (text1 != null) {
                StringTokenizer st1 = new StringTokenizer(text1); // create a tokenizer object
                while (st1.hasMoreTokens()) {
                    String tok = st1.nextToken();
                    adjTok = tok.trim();
                    if (adjTok.equalsIgnoreCase(passToken)) {
                        check = true;
                        // st1 = null;
                        break;
                    }
                }
                text1 = in1.readLine();
            }
            in1.close();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
        return check;
    }
}


3.0 Java Package GoogleAP

3.1 ParseCache.java

package GoogleAP;

import java.io.*;
import java.util.*;
import GUI.TableBuild;
import FileParser.Param_ParseCache;

/*
 * Created on Jan 8, 2004
 */
/**
 * @author Van Norris
 * throws IOException for input and output
 * output: subject + KeyWordCache.txt
 * @version 2
 */
public class ParseCache {
    private String fileName;
    private static String localURL;
    private String subject;
    boolean foundParameter;
    private int counter;

    /**
     * startParse accepts a string which represents a file which has already
     * been parsed by OpenGoogleCache. This method evaluates the file for
     * keywords specified in the series of if statements.
     *
     * @param fileName the file being passed in by OpenGoogleCache
     * @param subject the subject of the URL listing
     * @param counter launches the GUI based on the count of the loop in SearchGoogle
     */
    public void startParse(String fileName, String subject, int counter)
            throws IOException {
        try {
            this.fileName = fileName;
            foundParameter = false;
            System.out.println("Extracting " + fileName);
            String trimFileName = fileName.trim();
            String OutFile = subject + "KeyWordCache.txt";
            this.subject = subject;
            this.counter = counter;
            // Set the output file
            File out1 = new File(OutFile);
            FileWriter fw1 = new FileWriter(out1, true);
            PrintWriter pw = new PrintWriter(fw1, true);
            FileReader fr = new FileReader(fileName);
            BufferedReader in = new BufferedReader(fr);
            String text1 = in.readLine();
            while (text1 != null) {
                StringTokenizer st1 = new StringTokenizer(text1);
                while (st1.hasMoreTokens()) {
                    String tok = st1.nextToken();
                    if (tok.compareTo((String) "URLname:") == 0) {
                        localURL = in.readLine();
                        break;
                    } // end if
                    else {
                        Param_ParseCache ppc = new Param_ParseCache();
                        boolean answer = ppc.Look(tok);
                        if (answer == true) {
                            pw.println(tok + " " + localURL);
                        } // end if
                    }
                } // end while
                text1 = in.readLine();
            } // end while
            // Close the output file
            pw.close();
            // The variable counter has been passed from SearchGoogle and
            // represents the number of groups of ten Google results which were
            // parsed. If the loop counter in the search is set to 4, then this
            // check value should be set to five, since the post-incremented
            // loop 0 to 4 in SearchGoogle covers 50 responses. The counter
            // keeps the final presentation GUI from being shown until all
            // processing is completed. counter MUST be set to match the loop
            // variable in SearchGoogle. Example: loop variable 0 reviews
            // documents 1-10 and the counter goes to 1; loop variable 1
            // reviews documents 11-20 and counter equals 2, so the final GUI
            // is presented if counter == 2.
            if (counter == 5) {
                // Launch the GUI containing the results
                TableBuild tb = new TableBuild();
                tb.buildTable(subject + "KeyWordCache.txt");
            }
        } catch (Exception e) {
            System.out.println("Exception on ParseCache " + e.getMessage());
        }
    } // end method
} // end class

3.2 PrintResults.java

package GoogleAP;

import java.io.*;

/**
 * @author Van Norris
 * PrintResults accepts an array of Strings and a subject.
 * @param results an array of Strings
 * @param subject the subject of the data, used to name the output file
 * throws IOException
 * @version 1
 * @see java.io for I/O methods
 */
public class PrintResults {
    private String[] results;
    private String subject;

    /**
     * Print accepts an array of strings and places the contents into a file
     * named subject.txt
     * @param results an array of strings
     * @param subject the subject matter of the array
     */
    public void Print(String[] results, String subject) {
        this.results = results;
        this.subject = subject;
        String outFile = subject + ".txt";
        try {
            File out = new File(outFile);
            FileWriter fw = new FileWriter(out);
            PrintWriter pw = new PrintWriter(fw);
            pw.println("Results from " + subject + " array ");
            if (results.length > 0) {
                for (int loop = 0; loop < results.length; loop++) {
                    pw.println(results[loop]);
                    // results.length was 10. Changed 1/17/04
                }
                pw.close();
            } // end if
            else {
                System.out.println("In PrintResults.java the passed-in array parameter was empty");
            } // end else
        } catch (IOException e) {
            System.out.println(e);
        } // end catch
    } // end method
} // end class

3.3 OpenGoogleCache.java

package GoogleAP;

import com.google.soap.search.GoogleSearch;
import com.google.soap.search.GoogleSearchFault;
import java.io.*;
import java.lang.Thread;

/*
 * Created on Dec 31, 2003
 *
 * OpenGoogleCache accepts the URL's which have been parsed by the class
 * FileParser.Parser.java. These files are located either on disk or in
 * cache at subjectUML.txt. OpenGoogleCache.java takes the URLs one at a
 * time, extracts the information held within the Google organization's
 * cache, and returns this information to the local file subject + "Cache.txt".
 */
/**
 * @author Van A. Norris
 * @param URL1 represents the file containing the list of URL's
 * @param subject represents the overall subject of the URL's
 * return: output is placed into the file subject + "Cache.txt"
 * exception IOException for read and write
 * exception GoogleSearchFault, part of the Google API's
 * @see GoogleSearch.doGetCachedPage
 * @see GoogleSearchFault.html
 * @version 2
 */
public class OpenGoogleCache extends Thread implements Runnable {
    private String fileWithURLs;
    private String URL1;
    private String subject;
    private final String key = "llZebdpQFHK/yxChBPgwJ6O5ezWm4pMs";
    private int counter;

    // public void OpenCache(String URL1, String subject) { // changed from version 1
    public synchronized void run(String URL1, String subject, int counter) {
        this.URL1 = URL1;
        this.subject = subject;
        this.counter = counter;
        try {
            // Output file name
            String OutFile = subject + "Cache.txt";
            // Set the output file
            File out1 = new File(OutFile);
            FileWriter fw1 = new FileWriter(out1);
            PrintWriter pw = new PrintWriter(fw1);
            GoogleSearch s = new GoogleSearch();
            s.setKey(key);
            // Parse through the file with the addresses one at a time and
            // perform a cache search
            FileReader fr = new FileReader(URL1); // URL1 is the file name
            BufferedReader in = new BufferedReader(fr);
            String text1 = in.readLine();
            while (text1 != null) {
                String text2 = text1.replace('"', ' '); // removes the quotes from the string
                String text = text2.trim();
                // At this point the code will launch a Google cache request only.
                System.out.println("Extracting: " + text);
                //****************************************************
                // This is the key call in most of the application. This call to
                // GoogleSearch.doGetCachedPage returns the contents of Google's
                // cache regardless of the type of file. In fact, this returns
                // data even for PDF files.
                //****************************************************
                byte[] cacheBytes = s.doGetCachedPage(text);
                // Note - this conversion to String should be done with reference
                // to the encoding of the cached page, but we don't do that here.
                // Noted in the Google API.
                String cachedString = new String(cacheBytes);
                pw.println("\nURLname:\n" + text1 + "\n\n" + "*******\n" + cachedString);
                text1 = in.readLine();
            }
            pw.close();
            in.close();
            System.gc();
            // This section passes control to GoogleAP.ParseCache, which parses
            // each returned file for select keywords. The name of the output
            // file from above is subjectCache.txt.
            ParseCache pc = new ParseCache();
            pc.startParse(subject + "Cache.txt", subject, counter);
            // The second version passes control to a user GUI asking which
            // parameters need to be parsed.
        } catch (GoogleSearchFault gsf) {
            System.out.println("Error in OpenGoogleCache.java - GoogleSearchFault " + gsf);
        } catch (IOException ioe) {
            System.out.println("IO Exception on OpenGoogleCache " + ioe.getMessage());
        }
    }
} // end class


3.4 SearchGoogle.java

/** @author Van A. Norris */
package GoogleAP;

import com.google.soap.search.GoogleSearch;
import com.google.soap.search.GoogleSearchResult;
import com.google.soap.search.GoogleSearchFault;
import java.io.*;
import FileParser.Parser;

public class SearchGoogle {
    private String subject;
    private final String key = "*$^&^%$$^^%0*^*T^%$$$&%^%"; // key obscured prior to print
    private int counter;

    public void Search(String subject) {
        this.subject = subject;
        //************************************************
        // This section sets the parameters for the search
        //************************************************
        System.out.println("Search Google has started with " + subject);
        // The following parameter "allintext" is a key part of the Google API.
        // This call searches for the subject (entered at the GUI) within the
        // text of a web site. Google searches for an occurrence of this
        // subject within the text of every document within their database.
        // The call is a parameter sent prior to the subject; the actual call
        // appears as
        //     allintext:"search word"
        // From this call an array of sites and a summary is returned in
        // base-64 HTML format.
        // Clean up the input stream
        String AdjustedSubject = "allintext:" + subject;
        String fileOut1 = subject + ".txt";
        String fileOut = fileOut1.trim();
        try {
            // Set up the output file
            File out = new File(fileOut);
            FileWriter fw = new FileWriter(out);
            PrintWriter pw = new PrintWriter(fw);
            // At this point the code will launch a Google search.
            // The setters format the network call correctly.
            for (int loop = 0; loop <= 2; loop++) {
                GoogleSearch s = new GoogleSearch();
                s.setKey(key);
                s.setQueryString((String) AdjustedSubject);
                s.setLanguageRestricts((String) "lang_en");
                s.setFilter(true);
                s.setStartResult(loop * 10);
                counter++;
                // This section performs the search and outputs the results.
                // Using GoogleSearchResult from the API: an array of strings is
                // accepted from the call to GoogleSearch, which allows
                // manipulation of the array.
                GoogleSearchResult r = s.doSearch(); // return result
                GoogleSearchResult RA = r; // set up local scope
                RA.getResultElements(); // place the returned results into an Object array
                // Place the results in persistent storage
                String[] result = new String[10];
                for (int x = 0; x <= 0; ++x) {
                    result[x] = RA.toString();
                } // end for
                // Call the PrintResults class for output
                PrintResults PR = new PrintResults();
                PR.Print(result, subject);
                // Write to the console and a file called out.txt
                pw.println(result);
                // Write to an external file with the subject name
                System.out.println("Google Search Results");
                System.out.println("======================"); // console output of results
                pw.close();
                // Call the Parser class in the FileParser container
                Parser pars = new Parser();
                pars.parseForUML(fileOut, (String) subject, counter);
            } // end for loop; position reflects multiple iterations within
              // SearchGoogle instead of a singleton with URLs 0-9.
        } // end try
        catch (GoogleSearchFault sf) {
            System.out.println("Call to GoogleSearch Failed " + sf);
        } // end catch
        catch (IOException ioe) {
            System.out.println("IOException on SearchGoogle " + ioe);
        } // end catch
    } // end method
} // end class


3.5 UserOpenGoogleCache.java

3.6 UserParseCache.java

package GoogleAP;

import java.io.*;
import java.util.*;
import GUI.TableBuild;
import FileParser.UserParam_ParseCache;

/*
 * Created on Jan 8, 2004
 * UserParseCache evaluates a file word by word for equality.
 */
/**
 * @author Van Norris
 * throws IOException for input and output
 * output: subject + KeyWordCache.txt
 */
public class UserParseCache {
    private String fileName;
    private static String localURL;
    private String subject;
    boolean foundParameter;
    private int counter;

    /**
     * startParse accepts a string which represents a file which has already
     * been parsed by OpenGoogleCache. This method evaluates the file for
     * keywords specified in the series of if statements.
     * fileName represents subjectCache.txt
     */
    public void startParse(String fileName, String subject, int counter)
            throws IOException {
        try {
            this.fileName = fileName;
            foundParameter = false;
            System.out.println("Extracting " + fileName);
            String trimFileName = fileName.trim();
            String OutFile = subject + "KeyWordCache.txt";
            this.subject = subject;
            this.counter = counter;
            String tok;
            // Set the output file
            File out1 = new File(OutFile);
            FileWriter fw1 = new FileWriter(out1, true);
            PrintWriter pw = new PrintWriter(fw1, true);
            FileReader fr = new FileReader(trimFileName);
            BufferedReader in = new BufferedReader(fr);
            String text1 = in.readLine();
            while (text1 != null) {
                StringTokenizer st1 = new StringTokenizer(text1);
                while (st1.hasMoreTokens()) {
                    tok = st1.nextToken();
                    if (tok.compareTo((String) "URLname:") == 0) {
                        localURL = in.readLine();
                        break;
                    } // end if
                    else {
                        UserParam_ParseCache uppc = new UserParam_ParseCache();
                        boolean answer = uppc.Look(tok);
                        if (answer == true) {
                            pw.println(tok + " " + localURL);
                        } // end if
                    }
                } // end while
                text1 = in.readLine();
            } // end while
            // Close the output file
            pw.close();
            in.close();
            // The variable counter has been passed from SearchGoogle and
            // represents the number of groups of ten Google results which were
            // parsed. If the loop counter in the search is set to 4, then this
            // check value should be set to five, since the post-incremented
            // loop 0 to 4 in SearchGoogle covers 50 responses. The counter
            // keeps the final presentation GUI from being shown until all
            // processing is completed. counter MUST be set to match the loop
            // variable in SearchGoogle. Example: loop variable 0 reviews
            // documents 1-10 and the counter goes to 1; loop variable 1
            // reviews documents 11-20 and counter equals 2, so the final GUI
            // is presented if counter == 2.
            if (counter == 5) {
                // Launch the GUI containing the results
                TableBuild tb = new TableBuild();
                tb.buildTable(subject + "KeyWordCache.txt");
                System.out.println("counter test");
            }
        } catch (Exception e) {
            System.out.println("Exception on UserParseCache " + e.getMessage());
        }
    } // end method
} // end class


3.7 UserSearchGoogle.java /** * @author Van A. Norris * * To change the template for this generated type comment go to * Window&gt;Preferences&gt;Java&gt;Code Generation&gt;Code and Comments */ package GoogleAP; import com.google.soap.search.GoogleSearch; import com.google.soap.search.GoogleSearchResult; import com.google.soap.search.GoogleSearchFault; import java.io.*; import FileParser.UserParser; /** * oid return v* @version 1.0 * see google API for details on interface * */ public class UserSearchGoogle { private String subject; private final String key = "987654321098766"; // Key changed for security prior to print private int counter; public void Search(String subject ) { this.subject = subject; //************************************************ // This section sets the parameters for the search //************************************************

System.out.println("Search Google has started with " + subject);

// The following parameter "allinttext" is a key part of the Google API. // This call searches for the subject(entered at the GUI)within the text

// of a website. // Google searches for an occurance of this subject within the text // of every document within their database. The call is a parameter

// sent prior to the subject. The actual call appears as // allinttext:"search word" // From this call an array of sites and a summary is returned in // base 64 html format

101

Page 108: Open Source Web Crawler

        // Clean up the input stream
        String AdjustedSubject = "allintext:" + subject;
        String fileOut1 = subject + ".txt";
        String fileOut = fileOut1.trim();

        try {
            // set up the output file
            //*******************************************
            File out = new File(fileOut);
            FileWriter fw = new FileWriter(out);
            PrintWriter pw = new PrintWriter(fw);

            // At this point the code will launch a Google search.
            // The setters format the network call correctly.
            //*****************************************************
            // NOTE: if you change the loop max you MUST CHANGE the check value
            // in ParseCache.java
            for (int loop = 0; loop <= 4; loop++) {
                GoogleSearch s = new GoogleSearch();
                s.setKey(key);
                s.setQueryString(AdjustedSubject);
                s.setLanguageRestricts("lang_en");
                s.setFilter(true);
                s.setStartResult(loop * 10);
                counter++;

                //***************************************************
                // This section performs the search and outputs the
                // results. Using GoogleSearchResult from the API;
                // in this case I am accepting an array of strings from
                // the call to GoogleSearch. This allows manipulation
                // of the array.
                //***************************************************
                GoogleSearchResult r = s.doSearch();

                // return result
                GoogleSearchResult RA = r;

                // set up local scope
                RA.getResultElements();

                // place the returned results into an Object array
                // and place the results in persistent storage
                //****************************************
                String[] result = new String[10];
                for (int x = 0; x <= 0; ++x) {


                    result[x] = RA.toString();
                } //end for -- only result[0] is populated; RA.toString()
                  // returns the full block of ten results as one string

                // call to the PrintResults class for output
                PrintResults PR = new PrintResults();
                PR.Print(result, subject);

                // write to the console and to the external file named for the subject
                pw.println(result[0]);

                System.out.println("Google Search Results");
                System.out.println("======================");
                // System.out.println(r.toString()); // this will allow console output of results

                // flush so the parser below sees the results written so far
                pw.flush();

                // call to the Parser class in the FileParser container
                //******************************************
                UserParser pars = new UserParser();
                pars.parseForUML(fileOut, subject, counter);
            } // end for. Position reflects multiple iterations within SearchGoogle
              // instead of a singleton search with URLs 0-9.

            // close the output file once all iterations have completed;
            // closing inside the loop would silently drop later results
            pw.close();
        } // end try
        catch (GoogleSearchFault sf) {
            System.out.println("Call to GoogleSearch failed " + sf);
        } // end catch
        catch (IOException ioe) {
            System.out.println("IOException on SearchGoogle " + ioe);
        } // end catch
    } // end method
} // end class
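As a usage illustration, a minimal driver for the class above might look as follows. This is a sketch only: the driver class name is hypothetical, the subject would normally arrive from the input GUI rather than a string literal, and a valid license key must already be set inside UserSearchGoogle for the SOAP call to succeed.

    // Hypothetical driver -- for illustration only.
    package GoogleAP;

    public class SearchDriverSketch {

        public static void main(String[] args) {
            // In the production system the subject is entered at the GUI;
            // it is hard-coded here purely for demonstration.
            UserSearchGoogle usg = new UserSearchGoogle();
            usg.Search("software architecture");

            // Side effects of Search: results are written to
            // "software architecture.txt" and handed to
            // UserParser.parseForUML, as shown in the listing above.
        }
    }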


Final Output: TableBuild.java

package GUI;

import java.awt.event.*;
import javax.swing.*;
import java.io.*;
import java.util.*;

/**
 * @author Van A. Norris
 * @see TableBuild
 * @version 2
 *
 * This class builds a table GUI containing the output pairs of parameter
 * and the URL containing that keyword. The data is transferred from a
 * persistent file into an object array for use inside this class.
 */
public class TableBuild {

    private String fileName;
    private static int count = 0;  // used for object array sizing
    private static int count1 = 0; // used for object array position

    // public static void main(String[] args) throws Exception {
    /** @param fileName contains the name of the file containing the data for inclusion */
    public void buildTable(String fileName) throws Exception {
        try {
            this.fileName = fileName;
            String parseFile1 = fileName;
            String parseFile2 = parseFile1.replace('"', ' ');
            String parseFile3 = parseFile2.trim();

            // Establish the file reader
            //****************************************
            FileReader fr1 = new FileReader(parseFile3);  // reads the file in its entirety
            BufferedReader in1 = new BufferedReader(fr1); // places the file in a storage buffer
            String text1 = in1.readLine();                // read the first line

            // This section establishes the number of rows needed


            // for the output GUI.
            //***************************************************
            while (text1 != null) {
                count++;
                text1 = in1.readLine();
            }
            // System.out.println("Array Row Size " + count); // output the number of lines

            //***************************************************
            // The following section reads the data file and places the data
            // into an Object array for use with the output GUI.
            Object[][] data = new Object[count][2]; // contents of the GUI

            // Establish the file reader
            //****************************************
            FileReader fr2 = new FileReader(parseFile3);  // reads the file in its entirety
            BufferedReader in2 = new BufferedReader(fr2); // places the file in a storage buffer
            String text2 = in2.readLine();                // read the first line

            while (text2 != null) {
                // each line holds one keyword/URL pair, so the inner loop
                // consumes exactly two tokens per line
                StringTokenizer st2 = new StringTokenizer(text2);
                while (st2.hasMoreTokens()) {
                    String param = st2.nextToken();
                    // System.out.println("param on line " + count1 + " " + param);
                    String URL2 = st2.nextToken();
                    // System.out.println("URL on line " + count1 + " " + URL2);
                    data[count1][0] = param;
                    // System.out.println("data array " + count1 + " Param: " + data[count1][0]);
                    data[count1][1] = URL2;
                    // System.out.println("data array " + count1 + " URL: " + data[count1][1]);
                } // end while
                count1++;
                text2 = in2.readLine();
            } //end while

            // create some tabular data
            String[] headings = new String[] {"Parameter", "URL"};


            // create a JFrame to hold the table
            JFrame f = new JFrame("Open Source Search Results Page");
            f.addWindowListener(new WindowAdapter() {
                public void windowClosing(WindowEvent e) {
                    System.exit(0);
                }
            });
            f.setSize(800, 800);
            f.setLocation(200, 50);

            // create the data model and the JTable
            JTable table = new JTable(data, headings);

            // put it all together
            f.getContentPane().add(new JScrollPane(table));
            f.setVisible(true);
        } // end of try
        catch (Exception e) {
            System.out.println("Exception in Table Build: " + e.getMessage());
        } //end catch
    } // end of outer method
} // end class
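For completeness, the class above would typically be driven as sketched below. The file name follows the subject + "KeyWordCache.txt" convention used by the parse stage earlier in this section; the driver class name and the sample subject are illustrative only.

    // Hypothetical driver -- for illustration only.
    package GUI;

    public class TableDriverSketch {

        public static void main(String[] args) throws Exception {
            // The parse stage writes keyword/URL pairs to a file named
            // subject + "KeyWordCache.txt"; the subject shown is a sample value.
            TableBuild tb = new TableBuild();
            tb.buildTable("software architectureKeyWordCache.txt");
        }
    }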