
PROTEUS PROJECT

NEW YORK UNIVERSITY

Ali Argyle, Darren Jahnel, Jon Liebowitz

Sachiko Omatoi, Jeremy Shapiro, Graig Warner, Dan Melamed

A Bitext Harvesting

and Distribution System


PROTEUS PROJECT AT NEW YORK UNIVERSITY

Bitext Harvester Project Guide

NYU Proteus Project

715 Broadway, 7th Floor, New York, NY 10003

Table of Contents

C H A P T E R 1

What is a Bitext Harvester?

C H A P T E R 2

Bitext Harvester Database

Database Requirements

Database Installation

C H A P T E R 3

Bitext Harvester Spider

Spider Introduction

Spider Installation

Spiderman Menu Command Summary

C H A P T E R 4

Bitext Harvester Filters

Filter Setup

Running a Filter

Terminating the Filter

A Sample Scenario

Notes on Filters

C H A P T E R 5

Web Based User Interface

Installation Steps

C H A P T E R 6

Developer Wish List for Future Improvements

A P P E N D I X A

Requirements Document

A P P E N D I X B

Database Spec

What is a Bitext Harvester?

We must acknowledge the "fact" of bilingualism and build upon it.
      -Maurice Beaudin


Chapter

1

The Bitext Harvester Suite of applications was created to exploit the ever expanding resource of online parallel text data. A parallel text or 'bitext' is a pair of documents that are identical in content but have been written in different languages. These bitext resources are extremely useful in the development of natural language processing techniques, and can serve as both training and testing data for the NLP community.

The Bitext Harvester is a system for collecting, processing, and distributing parallel texts retrieved from the Web.

The system works as follows:

Spider - A spider constantly trolls the Web looking for pairs of documents that might be parallel texts, downloads them locally, and places key management information into a database.

Filter - The resulting documents are processed by filtering programs, which help decide whether the pair is indeed a parallel text worth saving. For example, one filter might decide what language each of the documents is written in. The results are recorded in the database.

Web Interface - A web site enables people to investigate the progress of the spider and the filtering; for example, someone could specify two languages and ask for all parallel texts in those particular languages. The value of this resource is significantly increased by the harvester and the tools that have been developed to tag and keep track of the candidate parallel texts that have been gathered.

The Bitext Harvester Application consists of four main components, the spider, the user interface, the filter capability, and the database backend. Each of the first three may be used independently along with the database to perform its specific function. The web spider downloads pages and deposits data into the database, the filters serve to analyze and tag the data in the database, and the user interface allows a zipfile summary of the data to be downloaded via a web interface. The installation instructions pertain to all four components.


Bitext Harvester Database

The Database works in conjunction with all parts of the Bitext Harvester Suite, and is the first thing you will need to install. The installation instructions below will help you set up your MySQL database with the appropriate tables and fields.

Database Requirements

The first step to getting any of the three main components working is to have a working database for them to talk to. The server used for development and testing was MySQL Version 4.1, which can be found at http://www.mysql.com. Note: as of Dec 15, 2003, this was the only version of MySQL with the subquery capabilities needed by the project. Hopefully a future version of the Bitext Harvester will be able to use stored procedures, which will be supported in version 5.

Database Installation

1. Download and Install MySQL >= 4.1 from http://www.mysql.com

2. Run mysql database server daemon on your machine.

3. Create a database named bitextharvester as root: mysqladmin create bitextharvester

4. Create tables: mysql -u root -p[password] bitextharvester < [TableName].tbl

5. Load static data: mysql -u root -p[password] bitextharvester < [TableName].dat (for Languages and Topics).


6. Add rows to the Filters and FilterInstances tables, as suggested in the Filter user guide.

7. MySQL command details can be found in the online MySQL manual.

8. See Table.xls for table details, such as data types and field descriptions.

9. CreateBitextHarvester.sh [database password] will perform steps 3-5.

Developer Notes: All Ids are auto_incremented (MySQL assigns Ids automatically; deleting an entire table will not reset the row count, so re-install the table or use the truncate command). See http://www.mysql.com/doc/en/example-AUTO_INCREMENT.html. At this point, there are no specific users or table access permissions for the different processes (Spider, Filter, and Web file download).

Bitext Harvester Spider

Not all keys hang from one girdle.
      -Anonymous

The spider's job is to trawl the web looking for possible parallel text candidate pairs. Each spider has some basic configuration information like where to start, how to choose the pages that are most likely to be bitext pairs, what to call itself, etc. For your first spider we recommend that you try the default spider, simplespider. Once you have a better idea of how the spider works you will be able to build one that does exactly what you need, and grabs the type of documents that you want.

Spider Introduction

Now that the database is in place you can install the spider and start putting data into it. There are two different executables that you will need: the spider.sh executable is used to instantiate a new spider, and the spiderman.sh executable is used to manage any spiders you may have running. Two different spider package tars are included in the distribution, one for developers with all of the source included and one for run-only purposes. If you are following this demonstration, just use the simpler run-only version called spiderman_bin.tar.

Once the SpiderMan has been started, it will run until manually shut down. When the SpiderMan starts up, it also starts a background thread. This thread monitors the SPIDER database table for spider entries. An entry in the table gives the SpiderMan the spider's name, along with the RMI registry where it has registered. [RMI is a Java tool that allows remote processes to call methods on one another. A more detailed explanation is beyond the scope of this document, but easily found on the web.] For each spider found in the DB, the SpiderMan attempts to look it up and validate that it is alive. If so, the spider is stored in a HashMap; if not, the row is deleted from the DB by the SpiderMan. Once the spider is in the map, the user is able to call various actions on it.
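The registry-refresh logic described above can be sketched as follows. This is a hypothetical Python rendering for illustration only: the real system is Java plus RMI, and the class and field names here (SpiderRegistry, is_alive, the row dictionaries) are invented stand-ins, not the actual API.

```python
class SpiderRegistry:
    def __init__(self, is_alive):
        self.spiders = {}          # name -> spider record, like the Java HashMap
        self.is_alive = is_alive   # liveness probe, stands in for the RMI lookup

    def refresh(self, db_rows):
        """Scan SPIDER table rows: keep live spiders, report dead ones."""
        dead = []
        for row in db_rows:
            if self.is_alive(row["name"]):
                self.spiders[row["name"]] = row   # store in the map
            else:
                dead.append(row)                  # would be DELETEd from the DB
        return dead

live = {"SimpleSpider"}
reg = SpiderRegistry(is_alive=lambda name: name in live)
rows = [{"name": "SimpleSpider", "host": "localhost", "port": 1099},
        {"name": "OldSpider", "host": "localhost", "port": 1099}]
dead = reg.refresh(rows)
# "SimpleSpider" stays in the map; "OldSpider" would be removed from the DB
```

The design point to notice is that the map, not the database, is the source of truth for dispatching commands: a spider must survive the liveness check before any menu command can reach it.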

The SimpleSpider implementation is a webcrawler that will iterate through a list of queries and call a particular search engine with each query. The resulting HTML page is scraped for all links. Each link found is retrieved, and that page is then scraped for the potential link that caused it to be returned by the engine in the first place. If such a link exists, both files are retrieved and stored (with duplicate checking using a hash scheme to avoid unnecessary duplicate files). An example of a query might be 'click for French', in which case the page found, and the page that it points to, may be retrieved as a potential bitext. That being said, this demo spider does a lousy job of finding accurate bitext pairs, but is set up to be greedy to bring in many texts so additional filters may decide to keep or discard them. Why is it lousy? When a page is found by using a search engine to query on 'click for french', the page that may be the French version is very seldom a link with the words 'click for french'. What this 'greedy' implementation does is search for any link that contains any word in the query (and hope for the best). Other problems occur in how these links are formatted (how their paths are structured, etc., so a high percentage produce malformed URLs). More precise scraping methods may be used, but were beyond the scope of this project.
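The greedy matching rule and the hash-based duplicate check can be sketched as below. This is a minimal Python illustration under stated assumptions (the real SimpleSpider is Java; the function names and the choice of SHA-1 here are hypothetical, not taken from the project code).

```python
import hashlib
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect (href, anchor-text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None

def greedy_matches(html, query):
    """Greedy rule from the text: keep any link whose anchor text
    shares at least one word with the query."""
    words = {w.lower() for w in query.split()}
    p = LinkScraper()
    p.feed(html)
    return [href for href, text in p.links
            if words & {w.lower() for w in text.split()}]

def content_hash(data: bytes) -> str:
    """Duplicate check: identical page bodies hash identically,
    so a previously stored page can be skipped."""
    return hashlib.sha1(data).hexdigest()

page = '<a href="/fr/index.html">Click for French</a><a href="/about">About</a>'
hits = greedy_matches(page, "click for french")
# hits contains "/fr/index.html" but not "/about"
```

This also makes the "greedy" weakness concrete: any single shared word ("for", say) is enough to accept a link, which is why downstream filters are needed.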

Spider Installation

1. Make certain that you have J2SE 1.4.1 or greater

2. Follow the database installations above if you haven’t already done so

3. Choose which tar file (run-only or src) you would like to use and unpack it. Unpack the run-only tar file: > tar xzf spiderman_bin.tar.gz

The following structure will be created:

startrmi: start script for RMI
spiderman/
    spiderman.sh: start script
    config  lib  log
spider/
    spider.sh: start script
    config  lib  log

Or unpack the developer tar file: > tar xzf spiderman_dev.tar.gz

The following structure will be created:

dist: distribution of spiderman and simplespider
lib: all needed jars
config: all config files
src: java files
    edu.nyu.bitext.server
    edu.nyu.bitext.server.db
    edu.nyu.bitext.spider
    edu.nyu.bitext.spider.db
build: class files
    edu.nyu.bitext.server
    edu.nyu.bitext.server.db
    edu.nyu.bitext.spider
    edu.nyu.bitext.spider.db

5. Starting the RMIRegistry: The rmiregistry will need to be running for the apps to work correctly. Start the RMIRegistry by typing: > ./startrmi

6. Starting a Spider with spider.sh: The spider.sh executable is responsible for instantiating a spider. Before running the spider manager, let's go over an example of starting a spider so we will have something to manage. By default ./spider.sh will use ./config/simplespider.cfg for its configuration. If you look inside this file you will notice that the name of the spider is specified as 'SimpleSpider'; this is the name you will use when referring to the spider in the spider manager utility, so make sure it is unique and descriptive for each spider that you start.

FIGURE 1 shows steps 5 and 6 in a terminal window.


7. Using the Spider Manager ('SpiderMan'): The spider manager ('SpiderMan') system is a mechanism built to allow for central control of multiple spiders. A spider may be loosely defined as a process that scans the web for possible bitexts (a bitext is two 'identical' texts in different languages). Technically, a spider can be anything, as long as it implements the edu.nyu.bitext.shared.Spider interface. For instance, a spider could potentially scan repository files instead of web crawling, and still be controlled via the SpiderMan.

To run the SpiderMan, go to the install directory: type: >./spiderman.sh

A usage menu will be shown [NOTE: items in brackets are optional parameters]. Implementations of these commands are spider specific, and it is up to the individual spider to implement them as it chooses. The SpiderMan provides a centralized and convenient way to run these commands against any registered spiders. The spider developer should do their best to implement these commands in a reasonable manner. The summary below describes how these commands should act in a generic way.


FIGURE 2 – the spiderman in action

In the SpiderMan menu, 'spider' refers to the name of the spider that you would like to run the command for. The spider name is specified by the user in the spider config file. All spiders must have a unique id. If you look at the spider_french/config/spider.cfg example:

The SpiderID is set in the first line: SPIDERID=SimpleSpider_FRENCH

This would be input into the 'name' field of the SPIDER table in the database. The 'id' field is an autoincrement field in MySQL, so for this example the following row would be generated by MySQL:


id  name          host       port  status  created
4   SimpleSpider  localhost  1099  0       2003-12-22 14:24:56

And yes (as I'm sure you're wondering), unfortunately it is possible for another spider to overwrite an existing spider by using the same name, which causes the SpiderMan to use the most recently registered one. In some ways this makes it easier to redeploy without worrying about running an old version, but it also creates the problem of accidentally (or unknowingly) reusing a name. In future versions this might change.

[Note that the SpiderMan keeps a HashMap keyed by spider name (not id), so if you changed the lookup to use spider id, the map would have to change.]
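The name-collision behavior described above boils down to a map keyed by name, where the most recent registration wins. A toy illustration (hypothetical Python; the real map is a Java HashMap, and the host values here are invented):

```python
registry = {}

def register(name, spider):
    # Keyed by name: a second registration with the same
    # name silently replaces the first.
    registry[name] = spider

register("SimpleSpider", {"host": "hostA", "port": 1099})
register("SimpleSpider", {"host": "hostB", "port": 1099})
# lookups now resolve to the hostB spider only
```

This is exactly why the manual urges unique, descriptive SPIDERID values per spider.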

Spiderman Menu Command Summary

run spider [configfile] [queryfile]
Tells the spider to begin crawling the web (or file systems) in order to locate bitexts. This command may take two parameters:

configFile - This is a property file that can be set for a spider. Individual spiders will most likely need specific property files, which should be documented by the spider developer. A configFile for the included 'SimpleSpider' implementation is given as an example in the spiderman/config folder.

queryFile - This is also spider specific, but should most likely be a list of queries for a spider to send to a search engine. A queries.txt for the included 'SimpleSpider' implementation is given as an example in the spiderman/config folder.

setconfig spider configFile
This command sends a property file to the spider. The spider may be implemented to act on it while running (this is up to the spider implementation).

10

setquery spider queryFile
This command sends a query list to the spider. The spider may be implemented to act on it while running (this is up to the spider implementation).

halt spider
Halts the spider, pausing it from fetching bitexts.

continue spider
Wakes the spider up from the halted state.

throttle spider
This is an optional feature that one could build into a spider: a way to automatically halt the spider if the number of bitexts retrieved is ahead of the filtering, giving the filters a chance to catch up.

status [spider]
Shows the status of the named spider, or of all registered spiders if no name is given. It is up to the spider implementation to provide status information when this command is called.

showquery spider / showconfig spider
Dumps the queries/properties for the spider to the console.

quit
Exit SpiderMan.

help
Display the full menu of options.
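A minimal dispatch loop in the spirit of the menu above can be sketched like this. It is a hypothetical Python sketch, not the real SpiderMan (which is a Java console application); only the halt/continue/status commands are modeled, and the state strings are invented.

```python
def dispatch(line, spiders):
    """Route one menu line to an action on the registered spiders."""
    parts = line.split()
    cmd, args = parts[0], parts[1:]
    if cmd == "status":
        # status [spider]: one spider if named, else all registered
        names = args if args else sorted(spiders)
        return [f"{n}: {spiders[n]}" for n in names]
    if cmd == "halt" and args:
        spiders[args[0]] = "halted"
        return [f"{args[0]} halted"]
    if cmd == "continue" and args:
        spiders[args[0]] = "running"
        return [f"{args[0]} running"]
    return ["unknown command; type 'help' for the menu"]

spiders = {"SimpleSpider": "running"}
out = dispatch("halt SimpleSpider", spiders)
```

In the real system each command is forwarded over RMI to the spider itself, which decides how (or whether) to honor it.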


Bitext Harvester Filters

All roads do not lead to Rome.
      -Slovenian Proverb

Now that you have gathered some bitext candidate documents, the next step is to perform any additional processing that can help you decide which of the text pairs are worth keeping. Some basic filters are included in the distribution and should be used as a guide to help you build filters of your own. The filtering system provides a means for automating the process of filtering the documents and bitexts downloaded by the spider. It accomplishes this by storing information about the filtering results in the database and judging, based on user-defined criteria, whether the filtered entity should remain valid for further testing or be marked as invalid.

Three varieties of filters:

1. Text filters (update TestResults)

2. Bitext filters (update TestResults)

3. Column value filters (update the Texts or Bitexts table)

Filtering runs from the command line. The user needs to provide the following: a configuration file, a properties file, and the executable file. Configuration and executable file locations can be loaded from the database for filters that have already been inserted into it.


Purpose of the various files:

Properties file – environmental settings (database driver / location / username / password, directory where logs are written to)

Executable file – Performs the testing on one or more files. At minimum, read permission should exist for the executable. Execute permission needs to exist if this file is run without prefixing its name with a shell or interpreter language name (e.g. perl, sh, awk, and so on). It's recommended that the minimum permissions set for this file be 755.

Configuration file – settings specific to a filter executable. Authored in XML. This communicates the following to the filter system (details included later):

    How many files are to be tested by the script
    What syntax is used by this script at the command line
    Of the texts / bitexts kept in the database, which ones should be tested by this filter
    What results, when printed to standard output, imply that the test has passed or failed
    Which tables and columns in the database need to be updated

Filter Setup

1. For developers and users: filters.zip

>unzip filters.zip

The following directory structure will be created:

filters
    filter_launcher.sh: the launch script
    properties: the default properties file. Logging is disabled by default.
    edu/nyu/bitext/filters: source and class files
    edu/nyu/bitext/filters/utils: source and class files
lib: all needed jars
logs: default directory where logs will be stored
scripts: sample filters
config: DTD and XML configuration files for the above scripts

2. If you are running on a Linux machine you may need to change the end-of-line characters in filter_launcher.sh to Linux format. You may also need to change some of the paths in filter_launcher.sh to correspond to your install locations.

Running the Filter

Execution from the command line takes one of three forms:

1. To add a new filter to the database and execute it on the bitexts or texts in the database:
>filter_launcher.sh -n FILTER_NAME -c CONFIG_FILE -e EXECUTABLE [-d DESCRIPTION] -o PROPERTIES_FILE
where the description is optional and the other flags mandatory.

2. To launch a filter with information in the Filters table of the database:
>filter_launcher.sh -i FILTER_ID -o PROPERTIES_FILE
where FILTER_ID is the primary key value from the Filters table.

3. To query the database for information about available filters:
>filter_launcher.sh -q [FILTER_ID] -o PROPERTIES_FILE
where "-q FILTER_ID" gets information about one filter, and "-q" without a FILTER_ID gets information about all filters.
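The three invocation forms can be modeled with a single flag parser, as sketched below. This is a hypothetical Python reimplementation for illustration (the actual launcher is filter_launcher.sh, a shell/Java program); the flag letters follow the usage shown above, with "-e" assumed for the executable flag since that character is garbled in the source.

```python
import argparse

parser = argparse.ArgumentParser(prog="filter_launcher")
parser.add_argument("-n", dest="name")                     # form 1: new filter
parser.add_argument("-c", dest="config")
parser.add_argument("-e", dest="executable")
parser.add_argument("-d", dest="description")              # optional
parser.add_argument("-i", dest="filter_id")                # form 2: by id
parser.add_argument("-q", dest="query", nargs="?", const="ALL")  # form 3
parser.add_argument("-o", dest="properties", required=True)

# Form 1: register and run a new filter
args = parser.parse_args(
    ["-n", "file_type", "-c", "cfg.xml", "-e", "/usr/bin/file", "-o", "props"])
```

Note how `nargs="?"` with `const="ALL"` captures the "-q with or without a FILTER_ID" behavior: a bare `-q` means "all filters".
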

After executing the application in the first or second manner, it would be useful to type CTRL-Z and subsequently bg in order to run the application in the background.

Logging may be enabled by adding a setting logs.level=SETTING (where SETTING is one of the following: ALL, DEBUG, ERROR, FATAL, INFO, WARN, or OFF) to the properties file.
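Reading the logs.level setting out of the properties file might look like the sketch below. This is a hypothetical Python analogue (the real filters are Java and use their own logging kit); the mapping of the seven level names onto Python's logging constants is an assumption made for the example.

```python
import logging

LEVELS = {"ALL": logging.DEBUG, "DEBUG": logging.DEBUG,
          "INFO": logging.INFO, "WARN": logging.WARNING,
          "ERROR": logging.ERROR, "FATAL": logging.CRITICAL,
          "OFF": logging.CRITICAL + 1}

def log_level(properties_text):
    """Scan a Java-style properties file for logs.level=SETTING."""
    for line in properties_text.splitlines():
        if line.startswith("logs.level="):
            return LEVELS[line.split("=", 1)[1].strip()]
    return LEVELS["OFF"]   # logging is disabled by default

level = log_level("db.user=root\nlogs.level=WARN\n")
```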

Terminating the Filter

The application may be terminated under one of the following conditions:


1. Retry count exceeded – while the run() method of a thread is executing, failed connections to the database or SQLExceptions may occur. After five retries, the threads spawned by the filter launcher will die and the application will terminate gracefully.

2. Forceful termination – under normal circumstances, this is the means by which a user will wish to end a filtering process. Using the Unix kill command should terminate the process without causing database inconsistencies since transactions and batched queries are utilized. Future developers may wish to make use of the methods provided in the edu.nyu.bitext.filters.Filter interface in order to provide a means to shut down the filters programmatically.

A Sample Scenario

We want to ensure that the files we download consist of text data (e.g. HTML and text pages) rather than binary data (images and sound files). The executable file, in this case, would be /usr/bin/file, a UNIX command that returns a file's type. On non-BSD UNIX systems, the file command will return output that contains the word "text" if the contents of the file being tested are readable by humans. Therefore, we look for the word "text" in order to determine whether the file should pass this test.

The syntax of the command would be /usr/bin/file -b filename. The configuration file for this command represents this as $0 -b $1, where the general rule for specifying command syntax is: $0 [some_parameters] $1 [some_parameters] $2 [some_parameters]... $n

The term $0 represents the command itself, while arguments such as $1, $2, and so on represent the first and second files respectively.
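The $0/$1/... substitution rule can be sketched as below. This is a hypothetical Python illustration; the real expansion happens inside the Java filter system, and the function name is invented.

```python
def expand(syntax, executable, files):
    """Expand a cmdSyntax template: $0 = executable, $n = nth file."""
    out = []
    for tok in syntax.split():
        if tok == "$0":
            out.append(executable)
        elif tok.startswith("$") and tok[1:].isdigit():
            out.append(files[int(tok[1:]) - 1])   # $1 is the first file
        else:
            out.append(tok)                        # literal flags like -b
    return " ".join(out)

cmd = expand("$0 -b $1", "/usr/bin/file", ["candidate.html"])
# -> "/usr/bin/file -b candidate.html"
```

A bitext filter would use both $1 and $2, e.g. expand("$0 $1 $2", ...) for a command that compares the two halves of a pair.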

The configuration file specifies the logic for evaluating the output of a filter command. The units of evaluation are called "rules"; in turn, each rule consists of one or more "tokens", which are regular expressions that the filtering output must match to either pass or fail a rule. If the test passes all of the tokens in a rule, the rule specifies whether this means the rule passes or fails. If the test passes all of the rules specified in the configuration file, then the test is considered successful and the bitext or text remains a valid candidate for further evaluation by other filters.

In our sample scenario, there is only one rule (or group of tokens) that needs to pass, and the rule consists of one token: the regular expression "text". If the output of the file command emits a string that contains the word text, then the test passes.

[graig@csstupc16 graig]$ file -b bitextharvester.txt ASCII text

In the above case, a command line execution of the filter emits "ASCII text", which matches the regular expression supplied in the configuration file. Therefore, the test is successful and the appropriate tables in the database are updated.

Notes on Filters

Java notes - Please see the javadoc included in the code for notes regarding the Java code.

XML notes - The structure of the XML documents parsed by objects of this class is specified by the DTD or XSD supplied to the SAXParserFactory to which the instance of this class is bound.

Relevant elements include:

config - The root node of the document. Must contain the attribute textCount and the element cmdSyntax, and optionally the elements tables, features, and rule.

tables - Information for filters that need to update either the Texts or BitextPairs table. This is optional. Elements enclosed by this element include:
    updateColumn - The column in Texts or BitextPairs that will be updated. This column is a foreign key to lookupColumn.
    tableToLookup - The table that contains possible return values.
    lookupColumn - The primary key of the aforementioned table.
    lookupAuxColumn - The column in tableToLookup to which we compare the test's return value.

cmdSyntax - Specifies the command syntax. $0 = script name, $1 = first file, $2 = second file...

features - Defines a group of features that a text or bitext must possess in order to be processed.

date - Specifies a minimum or maximum date of discovery for a text or bitext. Attributes used by this element include:
    after - The minimum date of discovery
    before - The maximum date of discovery
    exclude - A true or false value that determines whether or not files in this range should be excluded from testing.

languages - Defines a group of languages that the filter should run against or exclude.

length - Specifies a minimum or maximum size of a text to be filtered. Attributes used by this element include:
    minimum - The minimum size
    maximum - The maximum size

pattern - A specific pattern being searched for in the result string.

rule - A rule that may be applied to the text or bitext. Attributes used by this element include:
    ignoreCase - A boolean that determines case sensitivity

topics - Defines a group of topics that the filter should run against or exclude. Attributes used by this element include:
    exclude - A true or false value that determines whether or not files in this range should be excluded from testing.
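The gating that the features element describes (date, length, language constraints) might behave as in the sketch below. This is a hypothetical Python illustration of the semantics, not the project's code; the function signature, field names, and the exclude-group logic are assumptions made for the example.

```python
from datetime import date

def should_test(text, after=None, before=None, minimum=None, maximum=None,
                languages=None, exclude_languages=False):
    """Decide whether a text qualifies for a filter run."""
    if after and text["found"] < after:          # date-of-discovery window
        return False
    if before and text["found"] > before:
        return False
    if minimum is not None and text["size"] < minimum:   # length bounds
        return False
    if maximum is not None and text["size"] > maximum:
        return False
    if languages is not None:                    # run-against vs exclude group
        in_group = text["language"] in languages
        if exclude_languages == in_group:
            return False
    return True

t = {"found": date(2003, 12, 1), "size": 4096, "language": "French"}
ok = should_test(t, after=date(2003, 11, 1), minimum=1024, languages={"French"})
```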

Please note that additions to the configuration file schema should either be performed on the DTD that comes with the filtering system (in filters/config/filterconfig.dtd) or by creating a new DTD that extends the currently existing DTD.

The recommended way to extend parsing of XML files that use tags that are defined in a different DTD is:

1. In the <config> tag, have an attribute xsi:noNamespaceSchemaLocation="your_dtd_name.dtd" that contains the new tags.

2. In the application, create a new subclass of FilterImpl (for filters that update the TestResults table) or ColumnValueFilter (for filters that update only one value per row in the BitextPairs or Texts table). In the subclass, it would be wise to override the startElement(…) and endElement(…) methods to parse tags defined in the new or updated DTD.

3. Update the FilterFactoryImpl class to determine the appropriate Filter implementation to return when the newFilter() methods are called.
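Step 2's override pattern has a direct analogue in any SAX binding; a Python sketch of the same idea is shown below. This is illustrative only: the real classes are the Java FilterImpl / ColumnValueFilter, and the handler class and tag names here are invented.

```python
import xml.sax

class ExtendedConfigHandler(xml.sax.ContentHandler):
    """Override startElement/endElement to pick up tags from an
    extended DTD, mirroring the FilterImpl-subclass approach."""
    def __init__(self):
        super().__init__()
        self.seen = []
    def startElement(self, name, attrs):
        self.seen.append(name)        # dispatch on new tag names here
    def endElement(self, name):
        pass                          # close out any new tags here

handler = ExtendedConfigHandler()
xml.sax.parseString(b'<config textCount="1"><myNewTag/></config>', handler)
```

The factory step (3) then just maps a config's tag vocabulary to the handler subclass that understands it.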

The choice of DTD instead of XSD is due to the fact that DTD is easier to maintain than XSD. However, XSD can be used in place of DTD with no changes to the Java application code.


Web Based User Interface

They are ill discoverers that think there is no land, when they can see nothing but sea.
      -Sir Francis Bacon

Now that you have gathered and filtered some bitext candidate documents, the web based user interface lets researchers investigate the results and download the bitexts they need.

Installation Steps for the Web Based User Interface

Preconditions:
    MySQL is installed
    JWSDP 1.2 is installed and the JAVA_HOME environment variable is defined
    The bitext harvester DB has been set up

Steps:

1. Get the MySQL driver jar (mysql-connector-java-3.0.9-stable-bin.jar or the appropriate jar for the version of MySQL that is installed); this file can be downloaded at http://www.mysql.com/

2. Put that jar in jwsdp-1.2/common/lib/

3. Create a directory named "bitext" under jwsdp-1.2/webapps/

4. Put the 5 JSP files in the bitext directory


5. Put the css, images, and WEB-INF directories in the bitext directory on the same level as the JSPs

6. Let the existing directory structures under WEB-INF exist as they are

7. Start the server by going into the bin subdir of your jwsdp-1.2 installation and typing ./startup.sh

FIGURE 3 – Starting the server

8. That's it! Go to http://MACHINE:PORT/bitext/index.jsp to see if it's working; it should look like the following:

FIGURE 4 – the intro page


When you click "Search and download bitexts" you will get the following choice screen:

FIGURE 5 – SEARCH PAGE

Once you have made your selections, click the submit button and proceed to download the zipfile. Add the file extension .zip to the filename if you are on Windows.

Notes on the web site


1. All of the .java files are included in the WEB-INF directory if you need to make changes.

2. Note that the zip file for download should contain both files and a tab file that notes the relationship between the downloaded files.

3. The filter display will look for filter names inserted into the database and display them dynamically.
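The zip-plus-manifest layout described in the notes above might be assembled as in the sketch below. This is hypothetical Python for illustration (the real site builds the zip in the JSP/servlet code); the manifest filename pairs.tab and the tab format are assumptions, since the notes only say "a tab file that notes the relationship between the downloaded files".

```python
import io
import zipfile

def build_zip(pairs):
    """pairs: list of (name1, bytes1, name2, bytes2) bitext entries.
    Returns zip bytes holding every file plus a tab-separated manifest."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        manifest = []
        for name1, data1, name2, data2 in pairs:
            zf.writestr(name1, data1)
            zf.writestr(name2, data2)
            manifest.append(f"{name1}\t{name2}")   # relate the two halves
        zf.writestr("pairs.tab", "\n".join(manifest))
    return buf.getvalue()

blob = build_zip([("doc1_en.html", b"<html>Hello</html>",
                   "doc1_fr.html", b"<html>Bonjour</html>")])
```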

Developer Wish List for Future Improvements

Image creates desire. You will what you imagine.
      -J. G. Gallimore

As with any other project in development, there are many features that were not implemented in the first version of the Bitext Harvester. Here are a few of the enhancements that might be implemented in future versions.

Enhance the Topics table to support multiple topic layers by adding a ParentTopicId (e.g. News will pick up Domestic News and International News).


Create a new table called TextCategories (TextId, TopicId) to support multiple topics of a text.

Create an Account table to store Bitext Delivery System users.

Archive mechanism.

Below are items that were not in scope, but which the development team has noted as items that could be revisited in later development, listed in order of importance.

The Bitext Delivery System will let the user decide whether to download bitexts or to receive a list of URLs.

File Clean Up Tool - A Web based tool which enables the system administrator to purge text files, and their related database table entries, downloaded before a date of his/her choice. Orphan files (files that for some reason have no information in the database) may be deleted at the same time.

Account Manager - A part of the Bitext Delivery System which creates and manages user accounts to distinguish user affiliations within or outside of New York University, for copyright reasons.

Bitext Link Queries - A part of the Bitext Delivery System which allows users outside of New York University to get a list of URLs for bitexts, instead of downloading the text files themselves, so as not to violate copyright.

Spider Manager Graphic User Interface - A graphical user interface which lets the system administrator start and stop spiders. It also allows him/her to set up configuration and arguments, and to see performance statistics for the spiders.

Filter Manager - A process which will automatically run and check the status of filter processes, and report on their performance.

Filter Manager Graphic User Interface - A graphical user interface which lets the system administrator start and stop filters. It also allows him/her to set up configuration and arguments, and to see performance statistics for the filters.

Multilingual (more than bilingual) filtering - A filter mechanism which allows a filter to test more than two text files at the same time.



If the number of topics grows very large, we may want to create a new page for the Bitext Delivery System.

Bitext Harvester Project

11/2/2003


Appendix A

Transition Meeting Documentation

Table of Contents

Overview and Project Status

1. Requirements Introduction

2. General Description

3. Functional Requirements

4. Database Schema and Dictionary

5. System Architecture Summary

6. Interface Requirements

7. Deployment Requirements

8. Non-Functional Requirements

9. Preliminary Design Documents

10. Team Composition and Assignments

11. Preliminary Schedule

12. Demo Plan for the Demo Show on December 18th, 2003


Overview and Project Status

The project team has gone through a full requirements gathering process with Professor Melamed resulting in a finalized requirements document, and has made it through most of the design phase of the project. It is our goal through the transition meeting to exit the design phase of our project and move fully to the development phase.

Key accomplishments include:

Rebuilding much of the interface and the database used by the PT Miner spider.

A signed-off requirements document by Professor Melamed.

Design of the database to be used for the project.

Creation of the MySQL database to be used for the project.

UML schemas completed for the filter (test) interfaces

User Interface drafts of the researcher web site.

Identification and testing of a compression solution for the researcher web site.

Both development and staging servers are up and running, with preliminary code for the system on the development server.

With some strong progress made, we are now entering a rapid development phase to bring together all of the pieces of this project. As you will note in the final sections, the timeline to delivery is tight; however, our project plan was designed to accomplish enough up-front work to make the development process less labor-intensive.


In this document, we include the completed requirements document, to which we have added design materials and fleshed-out data dictionaries.


1. Requirements Introduction

1.2 Scope of this Document

The scope of this document is to outline the functional requirements for the Melamed database project.

1.3 Overview

In this project, Professor Melamed is looking for the project team to build a database of candidate parallel texts that can be used in current ML research in language translation. This “candidate” database will be populated by web crawlers provided by the professor. The database will house candidate bitexts, display them to users, and serve as a management tool for them.

1.4 Business Context

Professor Melamed's work uses parallel texts--copies of the same document in 2 languages--to help develop and test his translation software. This project involves designing and building a system for collecting and processing parallel texts retrieved from the Web. The system will work as follows:

A spider constantly trolls the Web looking for pairs of documents that might be parallel texts; an existing spider will be used

The resulting documents are processed by some filtering programs (which Prof. Melamed has) which help decide whether the pair is indeed a parallel text worth saving; for example, one filter decides each document's language

A database records the results of the previous steps

A web site enables people to investigate the progress of the spider and the filtering; for example, someone could specify 2 languages and ask for all parallel texts in those particular languages

1.5 Definitions Used in this Document

o Bitext: A bitext is a pair of documents that are identical in content, but written in different languages.

o Candidate Bitext: A suspected bitext pair that needs to be tested.

o ML : Machine Learning

o Web Spider : A tool that downloads web pages

o Test: A test examines candidate bitexts to determine if they are real bitexts.


2. General Description

2.1 Product Functions

The final deliverable will be a database for candidate bitexts, an adequately robust input routine from a web crawler populating the database with candidate bitexts, and a web-based download interface to allow researchers to download candidate bitexts.

2.2 User Problem Statement: The ML project requires a huge quantity of bitexts. Currently, a web crawler exists to spider and download candidate bitexts; however, this spider is not automated. This limits the number of candidate bitexts that can be loaded into machine learning algorithms.

2.3 User Roles and Characteristics

o Professor Melamed. An administrative user with access to the entire system, who will also require some basic tools to manipulate data easily.

o Assistants to Professor Melamed. Will be working with the candidate bitexts, potentially moving data to other databases, etc.

o Outside Researchers. Outside researchers will want to enter a web site, search for candidate bitexts, and download bitexts.

o Future Developers.

2.4 Workflow Summaries of Each User Role


T H E S P I D E R W O R K F L O W

[Diagram: the NYU admin feeds the spiders with seed queries to search for bitexts; the spiders call for inserts into the bitext database; filters are activated to scan the database for bad bitexts; surviving data is exported to researchers.]

[Diagram: spiders search the web and feed results into the bitext database; researchers visit the bitext site, search the database of bitexts, decide whether to download, and receive the downloaded files.]

2.5 General Constraints

Software Cost: All software used in the project must be free (no charge to the final user)

Timing: Resources will be limited to a five-member team, and approximately 100 to 150 man-hours until mid-December, 2003.


3. Functional Requirements

3.1 The Web Spider Manager – ‘Spiderman’

The ‘Spiderman’ is responsible for monitoring and controlling the running web spiders that have registered with the ‘Spiderman’.

NOTE: All web spiders are responsible for implementing the spider interface, and registering with the ‘Spiderman’. They can be run remotely or locally. This project will ship with the ‘SimpleSpider’ web crawler [details on this spider are out of scope for this document].

Additional ‘Spiderman’ requirements:

# | Name | Description | Priority
1 | Duplicate detection | Are 2 different records pointing to the same file on the web? | 2
2 | Use configurable parameter lists | To allow the spider to find different web pages without duplicating effort. | 1
3 | Each text should be fingerprinted | Using MD5, this will help to identify duplicates, even when file names change. | 1
5 | Throttling | Give the admin a throttling capability so downloads don’t get too far ahead of filtering. | 2
6 | File naming | On download of a document, we need to append a unique ID to the name so the file is unique. [possibly using the IP address as a trie data structure] | 1
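Requirement 3 above (MD5 fingerprinting) can be sketched in the project's development language, Java, using the standard `java.security.MessageDigest` API. This is only an illustrative sketch under our own assumptions, not the actual Spiderman implementation; the class and method names are ours.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Fingerprint {

    // Hash the raw bytes of a downloaded text and return the digest as a
    // lowercase hex string, suitable for a fingerprint column in the database.
    static String fingerprint(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(content);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Two copies of the same page fingerprint identically even if
        // they were saved under different file names.
        String a = fingerprint("hello".getBytes());
        String b = fingerprint("hello".getBytes());
        System.out.println(a);  // 5d41402abc4b2a76b9719d911017c592
        System.out.println(a.equals(b));
    }
}
```

Because the fingerprint depends only on the content, comparing fingerprints detects duplicates even when two URLs or file names differ.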

3.2 Test Interface

The filter interface allows administrative researchers to grab candidate bitexts from the database and to update the database after a filter is applied.

# | Name | Description | Priority
1 | Provide a generic API to call for bitext pairs | This will allow tests to be conducted on candidate bitext pairs to determine if they are true bitexts | 1
2 | Record the results of the test | This is also an API; the call into the system will record the results of the test | 2
3 | Plan for future enhancements | Parameters may be enriched in the future. Allow for expansion | 3
4 | Provide a generic API to call for texts | For tests that need to be conducted directly on the text data. | 2
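The generic test API above might look like the following Java interface sketch. The names (`BitextTestApi`, `nextCandidates`, `recordResult`) are our assumptions for illustration, not the project's actual API; a trivial in-memory implementation stands in for the database.

```java
import java.util.List;

public class TestApiSketch {

    // A candidate bitext pair handed to a test (hypothetical shape).
    static class BitextPair {
        final long pairId;
        final String text1Path, text2Path;
        BitextPair(long pairId, String text1Path, String text2Path) {
            this.pairId = pairId;
            this.text1Path = text1Path;
            this.text2Path = text2Path;
        }
    }

    // Requirement 1: fetch candidate pairs; requirement 2: record results.
    interface BitextTestApi {
        List<BitextPair> nextCandidates(int maxPairs);
        void recordResult(long pairId, int filterId, boolean passed, String rawValue);
    }

    // In-memory stand-in for the database-backed implementation.
    static class InMemoryApi implements BitextTestApi {
        int recorded = 0;
        public List<BitextPair> nextCandidates(int maxPairs) {
            return List.of(new BitextPair(1L, "/texts/1.html", "/texts/2.html"));
        }
        public void recordResult(long pairId, int filterId, boolean passed, String rawValue) {
            recorded++;  // the real system would write a TestResults row
        }
    }

    public static void main(String[] args) {
        InMemoryApi api = new InMemoryApi();
        for (BitextPair pair : api.nextCandidates(10)) {
            api.recordResult(pair.pairId, 1, true, "ok");
        }
        System.out.println(api.recorded);  // 1
    }
}
```

Keeping fetch and record as two separate calls matches the table above: any filter can pull candidates generically and report back without knowing the database schema.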

3.3 Bitext Delivery Web Portal

# | Name | Description | Priority
1 | Provide a search of candidate bitexts | By languages, date, and test results (one for each test). | 1
2 | Return a list of search results of candidate texts | Allow users to look at a list of the bitexts | 2
3 | Zip search results | Provide a zipped set of results to download | 1
4 | The package download | Should have the files to download, plus a tab-delimited file of bitext pairs. | 1
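Requirements 3 and 4 above (a zipped download package containing the text files plus a tab-delimited index of pairs) could be built with the standard `java.util.zip` classes. The sketch below is an assumption about how the portal might assemble a package in memory; the class name, entry names, and file layout are ours, not part of the spec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class BitextPackage {

    // Bundle a set of named files into a single in-memory zip archive.
    static byte[] buildZip(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("text_1_en.html", "<html>hello</html>".getBytes());
        files.put("text_2_fr.html", "<html>bonjour</html>".getBytes());
        // Tab-delimited index of bitext pairs, as requirement 4 asks for.
        files.put("pairs.txt", "text_1_en.html\ttext_2_fr.html\n".getBytes());

        byte[] zipped = buildZip(files);

        // Read the archive back and count its entries.
        int entries = 0;
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            while (in.getNextEntry() != null) entries++;
        }
        System.out.println(entries);  // 3
    }
}
```

Building the archive in memory keeps the sketch self-contained; a real portal would more likely stream the zip directly to the HTTP response.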

3.4 Database Requirements

# | Name | Description | Priority
1 | Texts are single records | One text may be translated into many different languages. Keeping the text as a single row will allow this | 1
2 | Bitexts are single records that relate texts in different relationships | Another table will store relationships between texts | 1
3 | Provide table locks for tests | Allow for a row to be unavailable to a test by flagging it as “in-process” | 2
4 | Record test results | For each test on a particular bitext, record the test results. | 2


4. Database Schema and Dictionary

The database schema should follow the rules set in the requirements above and the following relationships.

Data Dictionary for Major Tables

Texts
A row stores a single file downloaded from the web spider. Texts are related to bitexts – and can be associated with many bitext pairs. This will allow us to create many bitext relationships while reusing the same texts.

Text ID | Unique ID
URL | This is the URL of the text being examined.
Fingerprint | The text will be hashed using a technique to be determined that can be used to help determine duplicate data.
Date of discovery | Downloaded date.
TextPath | The named file path to retrieve the document.
Stage of Process | This is a flag to indicate whether a text should be kept as a valid text in a bitext pair.
Length | Length of the file in bytes.
Topic ID | Allows a text to be categorized.
Language ID | Tests will determine what language this document is in.

Bitexts
Bitext rows associate texts in candidate bitext pairs. This is where relationships will exist and where tests will look for documents to examine.

Bitext ID | Unique ID
Text1ID | Relation
Text2ID | Relation
Fail | After a test is run on a bitext, we will want to mark a pair as failed, to remove the bitext relationship.
Creation Time | Time the association was created.

TestResults
Tests will be run on bitext pairs. This table will store the results of those tests.

Test ID | Unique ID
Bitext Pair ID | The test refers to this bitext pair
Filter ID | The test refers to this specific test type
ResultValue | Tests may pass back values
Result | Tests may also pass back pass / fail
Test time | Time test was run

Filters
A list of all filters that run tests against bitext pairs.

FilterID | Unique ID
Filter Name |
Filter File Name | Location of the filter code
Description | Up to admin to use
Hostname | Where the filter is coming from
Port | Expected port filter runs on
NumofInstances | Number of instances of the filter running from the filter manager
Creation Time |


5. System Architecture Summary


6. Interface Requirements

The interface requirements for the application will be as follows:

6.1 User Interfaces

6.1.1 Researcher

This is a simple web interface, requiring no more than HTML 4.0 support from the browser.

6.1.2 Admin

Will be a combination of API and web interfaces.

6.2 Software Interfaces

6.2.1 Spider Interface

This is an API only.

6.2.2 Admin

Will be a combination of API and web interfaces.


7. Deployment Requirements

7.1 Documentation: In the final deliverable, one (1) document is needed to generally describe the functions and API definitions for the tests (filters) and for the spider manager – how to run the script and the config files – also noting the structure of calls. The documentation must include information on:

7.1.1 System requirements for deployment
7.1.2 Test API
7.1.3 Spider manager API

7.2 Packaging: The final deliverable must be packaged and then successfully deployed, by someone not on the development team, to a machine not used in this development.


8. Non-Functional Requirements

Software Tool Choices

Why we made these choices:

1. The software chosen must be free software so that the final user (possibly DARPA) is not locked into any financial commitment to adopt it.
2. The tools represent best of class for open source software. There are certainly other databases available; however, MySQL is widely adopted, and happens to be used in the main spider tool we are using.

Software Selection:

OS – Red Hat Linux 7.3
Development Language – Java
Execution Environment – J2SE 1.4.1
Web Server – Apache HTTP Server 2.0.40
Application Server – Tomcat 5.0.x
Database – MySQL 4.1


9. Preliminary Design Documents

9.1 Researcher Web-based Front End

Site map:

index.jsp (home page)
search_form.jsp (bitext search selection criteria)
information_menu.jsp (links to various sites and articles pertaining to the project)
spider_manager_hub.jsp (might not be implemented)
search_results.jsp (bitext search results)
System Functionality (allows the user to download the zip file)

9.1.1 Bitext Delivery System – Home Page

9.1.2 Bitext Delivery System – Search Page (Under Construction)

9.1.3 Bitext Delivery System – Search Results Page

9.1.4 Bitext Delivery System – Download: after clicking on the download link, the system will ask you if you want to save the file to your hard drive.

Design for Spider Management Interface

SpiderMan console menu:

**********************************
SPIDERMAN MENU
1. run [spider] [configfile]
2. halt [spider]
3. continue [spider]
4. status [spider]
5. showconfig [spider]
6. throttle [0=none]
7. help
**********************************
?

SimpleSpider config sample:

#Search Engine parameters
ENGINE=http://www.google.com/search?
ADDTL_PARAMS=&ie=UTF-8&oe=UTF-8&num=100

#Language Info
SEARCH_LANG=lh=en
TRANS_LANG=fr

#Regular Expressions used for html scraping
# LINK_REGEXP: for finding links in search engine results
# NEXT_REGEXP: for next group of search results
LINK_REGEXP=/<br><font color=#008000>(.*?) -/i
NEXT_REGEXP=/<td nowrap><a href=(.*?)<img src=/nav_next.gif width=100/i

#Location of Query file
QUERYFILE=c:\\eclipse\\workspace\\PTMiner\\config\\queryfile.txt

#DATABASE Settings
DB_URL=jdbc:mysql://localhost/web_mining?user=miner&password=miner
DB_USER=miner
DB_PASS=miner
DB_DRIVER=org.gjt.mm.mysql.Driver

#RMI Settings
RMI_REGISTRY=localhost:1099

Query File sample:

click+for+french alice
click+for+french bob
click+for+french calvin
click+for+french david
etc…
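Since the config sample above uses KEY=value lines with # comments, a spider could load it with the standard `java.util.Properties` class. A minimal sketch, assuming only that the keys match the sample (the class name is ours):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class SpiderConfig {
    public static void main(String[] args) throws IOException {
        // A fragment of the SimpleSpider config shown above.
        String config =
                "#DATABASE Settings\n" +
                "DB_USER=miner\n" +
                "DB_DRIVER=org.gjt.mm.mysql.Driver\n" +
                "#RMI Settings\n" +
                "RMI_REGISTRY=localhost:1099\n";

        Properties props = new Properties();
        props.load(new StringReader(config));  // '#' lines are treated as comments

        System.out.println(props.getProperty("DB_USER"));       // miner
        System.out.println(props.getProperty("RMI_REGISTRY"));  // localhost:1099
    }
}
```

In the real system the Properties would be loaded from the config file named on the `run [spider] [configfile]` menu command rather than from a string.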

9.2 UML Diagram of the Filter (test) Interface


Two Example Algorithms for Filters:

1. Run Tests
   Fill operand queue with bitexts / texts that have not been tested
   While the filter is active (i.e. running or paused)
      If state is paused:
         Wait for a finite period of time
      Else, if state is running:
         Get an operand from the queue
         Launch child process and get result
         Test result according to rules in configuration file
         Update DB accordingly
      End if
   End while
   Stop the filter and remove from server

2. Manage Filters
   Access Filter Manager to query / update filter configuration
   Filter Manager chains commands to the appropriate Filter Server
   Filter Server updates the database
   Filter Server stops the filter and reloads it from the database with updated information

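The "Run Tests" algorithm above can be sketched in Java. This is a simplified, single-threaded sketch under our own assumptions: in the real design the operand would be a bitext row, the "child process" an external filter executable, and the result written back to the database; here a trivial in-process check stands in for the child process.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class FilterLoop {
    enum State { RUNNING, PAUSED, STOPPED }

    // Stand-in for "launch child process and get result": this toy "test"
    // just passes operands shorter than 10 characters.
    static boolean runChildTest(String operand) {
        return operand.length() < 10;
    }

    // Fill the operand queue, then drain it while the filter is running.
    static List<String> runTests(Queue<String> operands) {
        State state = State.RUNNING;
        List<String> passed = new ArrayList<>();
        while (state != State.STOPPED) {
            if (state == State.RUNNING) {
                String operand = operands.poll();
                if (operand == null) {
                    state = State.STOPPED;  // queue drained: stop the filter
                } else if (runChildTest(operand)) {
                    passed.add(operand);    // the real system would update the DB
                }
            }
            // A PAUSED filter would instead wait a finite period here.
        }
        return passed;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(
                List.of("en.html", "a-very-long-candidate.html"));
        System.out.println(runTests(queue));  // [en.html]
    }
}
```

The state enum mirrors the running/paused/stopped states in the pseudocode; a production filter server would flip the state in response to Manage Filters commands rather than only when the queue empties.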


Test Cases:

The following describes only the critical test cases needed to meet the requirements. Details and more test cases will be added later. When performed, each test case will be recorded with its expected and actual results, comments, and the date performed.

Environment/Test Data: Populate some bitext candidates by hand for filtering and bitext delivery system testing. We may create a test web site dedicated to Spider testing.

Spider Manager
Duplicate text detection.
Adjustment of spider execution timings to avoid overflow of unprocessed bitext candidate files.
Appropriate file allocation (avoid having too many files in one directory).
Alert for disk space shortage.
Exception handling, such as a broken link, an overly long URL, or no file found.
Unavailable search engine (due to server failure or similar; this situation is hard to imitate).

Filter Manager
Handling of filtering criteria in the configuration file (including single-text filtering).
Allocation of the right number of filter instances.
Exception handling, such as a broken text path.
Handling of unexpected filter return values.

Bitext Delivery System
Exception handling, such as a broken text path or no bitext available for the query.
Error handling for invalid selection criteria input, such as out-of-range values or invalid data types.
Alert for disk space shortage while making a zip file.

Above all, the main purpose of the project is to automate the Spiders and Filters and run them 24/7. Ideally, we would run them for at least two days without any interruption to see whether they are stable. The same applies to the Bitext Delivery System.

10. Team Composition and Assignments

All members will be involved in each aspect of the project, but the final deliverable for each area will be owned by an individual. During the design and development process, the following will be the deliverable owners:

Name Roles

Jeremy Project Management, QA , system testing, documentation

Sachiko Database Design and Development

Darrin Front-end and bitext delivery system

Jon Spider Management

Graig Filter Management

Our team will continue to meet weekly and hold meetings as needed or determined by our schedule with Professor Melamed. Our development environment is accessible remotely for all users.


11. Preliminary Schedule

Date | Deliverable | Status
Oct 1 | Requirements Gathering Part 2 | Complete
Oct 13 | Begin Design Document & Prototype Portal | Complete
Oct 21 | Requirements Completed and sign off | Complete
Oct 28 | Design document and prototype complete | Complete
Oct 28 | Dev environment must be fully tested and in place | Complete
Nov 3 | Design Sign off | Pending
Nov 4 | Development Begins | Pending
Nov 20 | Rev 1 demo |
Nov 21 | Rev 1 revisions |
Dec 4 | Rev 2 demo & sign off |
Dec 5 | QA / Deploy prepping begins |
Dec 18 | Final Deliverable in hands of owner |


12. Demo Plan for the Demo Show on December 18th, 2003

The schedule and outline of the demo plan for the Demo Show are as follows.

Note: The audience includes other project teams' clients, interested faculty, and other students, most of whom will be hearing about the Bitext Harvester system for the first time.

Preparation Schedule:

12/11 Each person will create a PowerPoint file for their pieces. Jeremy and Sachiko will integrate those files.
12/17 Practice the presentation and demo, and update the PowerPoint file if necessary. Prepare handouts.
12/18 Give the presentation at the Demo Show.

Outline:

1. Introduction (by Jeremy or Sachiko)
   a. Team Member Introduction
   b. Client Introduction (Prof. Melamed & his research interests)
   c. Objective of the project
      i. Explanation of bitexts
      ii. Example of bitexts
      iii. Goals of the Bitext Harvester Project
2. Functional & Non-Functional Requirements (by Jeremy or Sachiko)
3. System Overview (by Jeremy or Sachiko)
   a. Existing Software
   b. Development Plan
   c. System Architecture
4. Detailed Designs and Demo of each component
   a. Spider & Spider Manager (by Jon)
   b. Filters & Filter Manager (by Graig)
   c. Bitext Delivery System (by Darrin)
5. Further Developments (by Jeremy or Sachiko)
6. Q & A (by all)


Appendix B

Bitext Harvester Database Spec

Static Tables (Manually maintained)

Table Name: Filters
Description: Stores Filter Information.

FieldName | DataType | Attribute | NULL | Description
FilterId | int | Primary Key, Auto_increment | N | Unique Filter Id
FilterName | varchar(255) | | N | Filter Name
Description | varchar(255) | | Y | Filter Description
FilterFileName | varchar(255) | | N | Executable File Name
ConfigFileName | varchar(255) | | N | Configuration File Name
NumOfInstances | int | default 1 | N | Number of instances to be run from the filter manager simultaneously
CrTime | timestamp | | N | Creation time of the filter (may be used as "valid from" time)

* Needs maintenance for new filters.

Table Name: FilterInstances
Description: Stores Filter Instance Information.

FieldName | DataType | Attribute | NULL | Description
FilterId | int | Foreign Key (Filters.FilterId) | N |
InstanceId | int | | N | Should be unique for the same FilterId.
Hostname | varchar(30) | | N | Hostname used to run filter
Port | int | | N | Port number used to run filter
CrTime | timestamp | | N | Creation time for this filter instance.

* Needs maintenance for new filters or filter instances.
* The combination of FilterId and InstanceId is the Primary Key.
* InstanceId must be carefully populated (unique and consecutive for the same FilterId).

Table Name: Topics
Description: Stores web page topic information.

FieldName | DataType | Attribute | NULL | Description
TopicId | int | Primary Key, Auto_increment | N | Unique Topic Id
Topic | varchar(255) | | N | Topic Name

Note: TopicId = -1 for undefined topics.
* May want to modify the table, by adding ParentTopicId, to support multiple layered topics (e.g. a News query also picks up both Domestic News and International News).

Table Name: Languages
Description: Stores Language Information.

FieldName | DataType | Attribute | NULL | Description
LanguageId | int | Primary Key, Auto_increment | N | Unique Language Id
Language | varchar(255) | | N | Language (e.g. English)

Non Static Tables (Populated by some processes)

Table Name: Texts
Description: Stores individual text (web page) information. Populated by Spider. Updated by Filters.

FieldName | DataType | Attribute | NULL | Description
TextId | bigint, unsigned | Primary Key, Auto_increment | N | Unique Text Id
URL | varchar(255) | | N | URL of the text
Fingerprint | varchar(255) | | Y | Return value of MD5 (to avoid duplicates)
DateOfDiscovery | datetime | | N | The time the spider put the text into the system
TextPath | varchar(255) | | N | Directory and file name of the saved text (full path)
Length | bigint, unsigned | default 0 | N | File size in bytes
TopicId | int | Foreign Key (Topics.TopicId) | Y | The first topic specified in the HTML
LanguageId | int | Foreign Key (Languages.LanguageId) | Y | The language this text is written in

* TopicId is determined by a special filter.
* LanguageId is determined by language filters, which test texts one by one.
* Only one topic can be stored. May want to create a new relation, TextCategories(TextId, TopicId), to support multiple topics.

Table Name: BitextPairs
Description: Stores bitext candidate pairs. Populated by Spider. Updated by the first filter a bitext fails (Fail: from 0 to 1).

FieldName | DataType | Attribute | NULL | Description
BitextPairId | bigint, unsigned | Primary Key, Auto_increment | N | Unique BitextPair Id
Text1Id | bigint, unsigned | Foreign Key (Texts.TextId) | N |
Text2Id | bigint, unsigned | Foreign Key (Texts.TextId) | N |
Fail | int | default 0 | Y | Failure flag. 1 for failure, determined by the values in the filter config file
CrTime | datetime | | N | The time texts become a bitext candidate (row creation time)

* Was not able to create a foreign key against the same fields in the same table, so no foreign key is set up in the actual database.

Table Name: TestResults
Description: Stores filtering results. Populated and updated (ResultValue, Result, and EndTime) by each Filter.

FieldName | DataType | Attribute | NULL | Description
TestId | bigint, unsigned | Primary Key, Auto_increment | N | Unique Test Id
TextId | bigint, unsigned | Foreign Key (Texts.TextId) | Y |
BitextPairId | bigint, unsigned | Foreign Key (BitextPairs.BitextPairId) | Y |
FilterId | int | Foreign Key (Filters.FilterId) | N |
ResultValue | varchar(255) | | Y | Raw return value from a filter
Result | int | | Y | 1 for pass, 0 for failure
StartTime | datetime | | N | Filtering start time
EndTime | datetime | | N | Filtering end time

* StartTime and EndTime are used to determine the performance of each filter process (to determine the number of filter instances to run).
* Either TextId (for single text filtering) or BitextPairId (for bitext filtering) will be populated, never both.
* When a filtering run starts, a new row is added (with ResultValue, Result, and EndTime null or default).
* ResultValue, Result, and EndTime are updated by the Filter at the end.

Table Name: SPIDER
Description: An independent table used by Spiders and the Spider Manager.

FieldName | DataType | Attribute | NULL | Description
id | bigint, unsigned | Primary Key, Auto_increment | N | Unique Spider Id
name | varchar(25) | | N | Spider Name
host | varchar(25) | | N | Host used to run spider
port | varchar(25) | | N | Port used to run spider
status | varchar(25) | | N | Not in use
created | datetime | | N | Creation Time

* status was in the original design but is not used; it is kept to avoid a last-minute change.
