Weka4WS: a WSRF-enabled Weka Toolkit

http://grid.deis.unical.it/weka4ws/main.html#download 12/11/2012 7:49 AM

Weka4WS

Introduction

Weka4WS is a framework developed at the University of Calabria that extends the widely used Weka toolkit to support distributed data mining in Grid environments.

Weka provides a large collection of machine learning algorithms written in Java for data pre-processing, classification, clustering, association rules, and visualization, which can be invoked through a common graphical user interface. In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally.

The goal of Weka4WS is to extend Weka to support remote execution of the data mining algorithms through WSRF Web Services. In this way, distributed data mining tasks can be executed concurrently on decentralized Grid nodes, exploiting data distribution and improving application performance. In Weka4WS, the data mining algorithms for classification, clustering and association rules can also be executed on remote Grid resources. To enable remote invocation, all the data mining algorithms provided by the Weka library are exposed as a Web Service, which can be easily deployed on the available Grid nodes. Weka4WS also extends the Weka GUI to enable the invocation of the data mining algorithms that are exposed as Web Services on remote Grid nodes.

To achieve integration and interoperability with standard Grid environments, Weka4WS has been designed using the Web Services Resource Framework (WSRF) as its enabling technology. In particular, Weka4WS has been developed using the WSRF Java library provided by Globus Toolkit 4.0.x (GT4).

In the Weka4WS framework all nodes use the GT4 services for standard Grid functionalities, such as security and data management. These nodes fall into two categories:

1. user nodes, which are the local machines of the users providing the Weka4WS client software;

2. computing nodes, which provide the Weka4WS Web Services allowing the execution of remote data mining tasks.

Weka4WS is therefore distributed in two separate packages:

1. Weka4WS-client, which contains the client software (including the extended Weka GUI) to be installed on the user nodes;

2. Weka4WS-service, which contains the WSRF-compliant Web Services to be installed on the computing nodes.

Installation

Software prerequisites

Weka4WS requires Globus Toolkit 4.0.x (full installation) on the computing nodes and only the Java WS Core (a subset of Globus Toolkit) on the user nodes. Note that this is not a minimum requirement but a specific one: Globus Toolkit 4.2.x and later versions contain updates to the web services specifications and to some of their other services that make them incompatible with Weka4WS.

Since the full version of Globus Toolkit 4.0.x runs on Unix platforms (Linux included), Weka4WS-service can currently be installed only on those systems, while Weka4WS-client can be installed on both Unix and Windows platforms.

The following links are useful for installing Globus Toolkit 4.0.x:

Globus Toolkit 4.0.x Download

GT 4.0.x Quickstart Guide

Installing GT 4.0.x (System Administrator's Guide)

Security prerequisites

Weka4WS runs in a security context and uses gridmap authorization: only users listed in the service gridmap may invoke it. For Weka4WS to run properly, the following prerequisites must be satisfied:

1. the Weka4WS user must hold a valid proxy certificate (in the X.509 format) with a given Distinguished Name (DN);


2. the file '/etc/grid-security/grid-mapfile' on the computing nodes must contain an entry that maps the Weka4WS client user to a local user on the computing node. An example entry follows:

"O=KGrid/OU=University of Calabria/CN=John Doe" john

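As a rough illustration of the entry format above, the following Python sketch splits a grid-mapfile line into the quoted DN and the local user name. The function name parse_gridmap_line is hypothetical, and the handling of blank lines and of lines starting with '#' is an assumption made for this example, not something stated in this guide.

```python
import shlex


def parse_gridmap_line(line):
    """Split one grid-mapfile line into (distinguished_name, local_user).

    Returns None for blank lines and for lines starting with '#'
    (treating those as ignorable is an assumption of this sketch).
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # shlex.split honours the double quotes around the DN, so the spaces
    # inside "University of Calabria" do not break the DN apart.
    parts = shlex.split(stripped)
    if len(parts) != 2:
        raise ValueError(f"expected a quoted DN and a local user, got: {line!r}")
    distinguished_name, local_user = parts
    return distinguished_name, local_user


entry = parse_gridmap_line('"O=KGrid/OU=University of Calabria/CN=John Doe" john')
print(entry)
```

A check of this kind can help confirm that the DN of the user's proxy certificate matches the entry exactly, character for character.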
Computing nodes

As 'root' user, perform the following step:

1. add the following line to the file /etc/sudoers:

globus ALL= NOPASSWD: /bin/ls, /bin/cp, /bin/mkdir, /bin/chown, /bin/gzip

As 'globus' user (or as whichever user runs the Globus container), download the Weka4WS-service package into a directory (for example, the home directory), and perform the following steps:

1. extract the Weka4WS-service package:

tar xzvf weka4ws-service-2.1.tgz

2. enter the newly created directory:

cd ./weka4ws-service-2.1

3. generate the Weka4WS GAR file by running the command:

./build.sh

4. deploy the Weka4WS service by running the command:

./deploy.sh

User node

Download the Weka4WS-client package and extract it in a directory of your choice.

Configuration

The only configuration Weka4WS requires is editing the 'machines' file, located in the 'etc' subfolder of the client package. This file contains information about the computing nodes, formatted with the syntax explained below.

The 'etc/machines' file syntax

Every line beginning with the '#' character is ignored. Every line not beginning with '#' must contain the hostname, the Globus container port, the GridFTP port and the logging option of a given computing node. The logging option can only be 1 or 0, meaning "enabled" and "disabled" respectively: when logging is enabled, detailed logging is produced on the screen where the container has been started and is running. An example machines file is shown below:

# ==================== computing node ==========================
# hostname | container port | gridFTP port | logging
pluto.deis.unical.it 8443 2811 1
saturn.deis.unical.it 8443 2811 0
cosmos.cs.icar.cnr.it 8443 2811 1
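To make the syntax concrete, here is a minimal Python sketch of a reader for the 'etc/machines' format described above. The names ComputingNode and parse_machines are hypothetical and are not part of Weka4WS; the sketch also assumes that fields are separated by whitespace and that blank lines may be skipped, neither of which is stated explicitly in this guide.

```python
from typing import List, NamedTuple


class ComputingNode(NamedTuple):
    hostname: str
    container_port: int
    gridftp_port: int
    logging: bool


def parse_machines(text: str) -> List[ComputingNode]:
    """Parse the contents of a machines file into a list of nodes."""
    nodes = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Lines beginning with '#' are ignored; skipping blank lines
        # is an assumption of this sketch.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 fields, got {len(fields)}")
        hostname, container_port, gridftp_port, log_flag = fields
        if log_flag not in ("0", "1"):
            raise ValueError(f"line {lineno}: logging option must be 0 or 1")
        nodes.append(
            ComputingNode(hostname, int(container_port), int(gridftp_port), log_flag == "1")
        )
    return nodes


SAMPLE = """\
# hostname | container port | gridFTP port | logging
pluto.deis.unical.it 8443 2811 1
saturn.deis.unical.it 8443 2811 0
cosmos.cs.icar.cnr.it 8443 2811 1
"""

for node in parse_machines(SAMPLE):
    print(node)
```

Validating the file this way before starting the client can catch a missing field or a logging flag other than 0 or 1.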

Execution

Computing nodes

As 'root' user, perform the following step:

1. start the GridFTP server with:

$GLOBUS_LOCATION/sbin/globus-gridftp-server -p <port>

(where <port> is the desired port; if not specified the default 2811 port will be used)

As 'globus' user (or as whichever user runs the Globus container), perform the following step:


1. start the globus container with:

$GLOBUS_LOCATION/bin/globus-start-container -p <port>

(where <port> is the desired port; if not specified the default 8443 port will be used)
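The two start commands above can be sketched as small Python helpers that build the argument lists, adding '-p' only when a non-default port is wanted. The helper names are hypothetical, and the example path '/usr/local/globus' merely stands in for whatever $GLOBUS_LOCATION is on the node. Presumably the ports used here should match the ones listed for the node in the client's 'etc/machines' file.

```python
import posixpath

DEFAULT_GRIDFTP_PORT = 2811      # used by the GridFTP server when -p is omitted
DEFAULT_CONTAINER_PORT = 8443    # used by the container when -p is omitted


def gridftp_command(globus_location, port=None):
    """Argument list for starting the GridFTP server (run as root)."""
    cmd = [posixpath.join(globus_location, "sbin", "globus-gridftp-server")]
    if port is not None:
        cmd += ["-p", str(port)]
    return cmd


def container_command(globus_location, port=None):
    """Argument list for starting the Globus container (run as 'globus')."""
    cmd = [posixpath.join(globus_location, "bin", "globus-start-container")]
    if port is not None:
        cmd += ["-p", str(port)]
    return cmd


# '/usr/local/globus' is only an example value for $GLOBUS_LOCATION.
print(" ".join(gridftp_command("/usr/local/globus")))
print(" ".join(container_command("/usr/local/globus", 8444)))
```

The resulting lists can be passed to subprocess.run on the computing node; the sketch itself only builds and prints them.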

User node

Enter the directory where you extracted the client package:

cd <path>/weka4ws-client-2.1

and run "weka4ws.sh" (or "weka4ws.bat" if you are running it on a Windows machine).

Run the client on Windows

1. download the Globus "Java WS Core Binary Installer" from here;

2. extract its content to a directory of your choice (e.g. "C:\ws-core-4.0.7");

3. place the user certificate (usercert.pem and userkey.pem) in

C:\Documents and Settings\[your username]\.globus

(Windows Explorer does not allow creating a directory whose name starts with a dot, so you will have to create the .globus directory by running "mkdir .globus" in the command prompt)

4. place the Certification Authority files (a pair of files with names like 'abc123.0' and 'abc123.signing_policy') in

C:\Documents and Settings\[your username]\.globus\certificates

5. set the environment variable:

* go to Control Panel / System / Advanced / Environment variables

* press "New" under "System Variables" and set the following:

GLOBUS_LOCATION=C:\ws-core-4.0.7

* double click on "weka4ws.bat" to run the application.
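Before launching the client, the file layout described in the steps above can be verified with a short Python sketch. The function names are hypothetical; the check only looks for the files named in this section (usercert.pem, userkey.pem, and a '.0' file with a matching '.signing_policy' file in the 'certificates' subdirectory) and does not validate their contents.

```python
import tempfile
from pathlib import Path


def missing_credentials(globus_dir):
    """Return a list describing what is missing from a .globus directory."""
    globus_dir = Path(globus_dir)
    missing = []
    for name in ("usercert.pem", "userkey.pem"):
        if not (globus_dir / name).is_file():
            missing.append(name)
    cert_dir = globus_dir / "certificates"
    ca_names = set()
    policy_names = set()
    if cert_dir.is_dir():
        # 'abc123.0' and 'abc123.signing_policy' share the stem 'abc123'.
        ca_names = {p.stem for p in cert_dir.glob("*.0")}
        policy_names = {p.stem for p in cert_dir.glob("*.signing_policy")}
    if not (ca_names & policy_names):
        missing.append("certificates/<name>.0 with matching <name>.signing_policy")
    return missing


def globus_location_is_set(environ):
    """True if GLOBUS_LOCATION is present and non-empty in the given mapping."""
    return bool(environ.get("GLOBUS_LOCATION"))


# Demonstration on an empty temporary directory: everything is reported missing.
with tempfile.TemporaryDirectory() as tmp:
    print(missing_credentials(tmp))
```

In practice the function would be called with the user's own .globus directory, and globus_location_is_set with os.environ.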

Troubleshooting

* OutOfMemoryError (at client side): most Java virtual machines allocate only a certain maximum amount of memory to run Java programs, usually much less than the amount of RAM in your computer. With Weka4WS you can extend the memory available to the virtual machine by running the 'weka4ws.sh' (or 'weka4ws.bat' on Windows) script and passing as its first argument the amount of RAM (in MB) you wish to use. For example, running:

./weka4ws.sh 2048

will run Weka4WS setting the maximum Java heap size to 2048MB.

* OutOfMemoryError (at server side): it is recommended to increase the maximum heap size of the JVM when running the container. By default, Sun JVMs use a 64MB maximum heap size. The maximum heap size can be set using the -Xmx JVM option. For example, to set 512MB as the maximum heap size, run (csh syntax):

setenv GLOBUS_OPTIONS -Xmx512M
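For setups where the container is launched from a script rather than from an interactive csh session, the same setting can be sketched in Python as follows. The helper names are hypothetical; only the -Xmx option and the GLOBUS_OPTIONS variable come from the text above.

```python
import os


def heap_option(megabytes):
    """Format a maximum heap size in MB as a -Xmx JVM option."""
    megabytes = int(megabytes)
    if megabytes <= 0:
        raise ValueError("heap size must be a positive number of megabytes")
    return f"-Xmx{megabytes}M"


def container_environment(megabytes, base_env=None):
    """Copy an environment mapping and set GLOBUS_OPTIONS to the heap option."""
    env = dict(os.environ if base_env is None else base_env)
    env["GLOBUS_OPTIONS"] = heap_option(megabytes)
    return env


print(heap_option(512))
```

The returned mapping can be passed as the env argument of subprocess.run when starting the container; in bash, the direct equivalent of the csh line is: export GLOBUS_OPTIONS=-Xmx512M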

Screenshots


The GUI Chooser (left side), used to launch Weka's four graphical environments. The hosts list checking window (top right side), automatically loaded at startup to check whether on every host:

* the Globus Container is running and accessible;

* the GridFTP Server is running and accessible;

* the requesting user has an account on the host;

* the Weka4WS service is deployed and accessible;

* the Weka4WS client and service versions are the same.

The Grid Proxy Initialization window (middle right side), automatically loaded at startup if the user credentials are not available or have expired.
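The first two checks performed by the hosts list checking window (container and GridFTP reachability) can be approximated outside the client with a plain TCP connection test. The Python sketch below is not how Weka4WS itself implements the check; the function names are hypothetical, and the remaining checks (user account, service deployment, version match) require Grid credentials and are not covered.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_host(host, container_port=8443, gridftp_port=2811, timeout=3.0):
    """Report whether the container and GridFTP ports of a host accept connections."""
    return {
        "container": port_is_open(host, container_port, timeout),
        "gridftp": port_is_open(host, gridftp_port, timeout),
    }
```

Calling check_host with a hostname and ports taken from the 'etc/machines' file gives a quick indication of whether a firewall or a stopped service is the reason a host is reported as unavailable.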

The Weka4WS Explorer component with the modified parts highlighted. Through a drop-down menu (in blue) it is possible to choose the remote host on which the data mining task will be computed; the Reload hosts button (in red) brings up the hosts list checking window (described above); the Proxy button (in green) brings up the Grid Proxy Initialization window (described above).

The Weka4WS Explorer showing multiple tasks executed concurrently on some remote hosts. The number of running tasks is displayed in the lower-right corner. The name of the host where the task is being computed is displayed at the top of the output panel. At any time it is possible to stop a remote task by selecting it from the 'Result list' (in the lower-left corner) and pressing the 'Stop' button.


With very detailed logging it is possible to follow the remote computations step by step, as well as to know their execution times.

The KnowledgeFlow component with the modified parts highlighted. Three buttons (in the upper right corner) are used, from top to bottom, to start all the tasks, to stop them, and to bring up the hosts list checking window (described earlier). During the computation the label below each algorithm node displays the address of the location where the computation is taking place. The Proxy button (in the lower left corner) brings up the Grid Proxy Initialization window (described earlier).

The location where a given algorithm runs is chosen in the configuration panel of each algorithm, accessible by right-clicking the algorithm icon and choosing Configure: through a drop-down menu it is possible to choose the remote host on which the selected data mining task will be computed.


For complex workflows, the KnowledgeFlow feature for grouping into sub-flows is useful for setting the computing locations of the algorithms easily and quickly, either by setting to Auto all the computing locations of the algorithms belonging to the sub-flow, or by choosing the specific location of each algorithm through the corresponding configuration listed in the menu.

Changelog

Version 2.1:

* added the possibility to enable/disable detailed logging at the remote host;

* improved the performance of concurrent dataset transfers: a dataset to be concurrently transferred to various remote hosts is compressed by one thread only; a dataset to be analyzed by different data mining algorithms on the same remote host is transferred only once, by one thread only;

* updated xstream (the Java library used to serialize objects to XML and back again) to version 1.3;

* bugfix: tasks in the "runningTasks" list at the computing node were added twice for the same data mining task when the dataset wasn't already available at the first service invocation;

* bugfix: tasks in the "runningTasks" list at the computing node side weren't removed from the list after their termination;

* bugfix: resources at the computing node weren't destroyed when exceptions arose at the user or computing node;

* bugfix: compressed datasets appeared corrupted after their transfer at destination;

* bugfix: the "isEmpty" method of the String class (introduced in Java 6), used in the HostCheckThread class of the user node, broke compatibility with Java 1.4.
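The transfer optimization listed in the Version 2.1 changelog (compress a dataset once, transfer it once per remote host) can be illustrated with a small single-threaded Python sketch. The names plan_transfers and CompressionCache are hypothetical, the task tuples and the dataset name are made up for the example, and the real implementation coordinates several threads, which this sketch does not attempt to reproduce.

```python
import gzip


def plan_transfers(tasks):
    """Return the unique (host, dataset) pairs, in first-seen order.

    tasks is an iterable of (host, dataset_name, algorithm) tuples; several
    algorithms on the same host need only one transfer of the dataset.
    """
    seen = set()
    plan = []
    for host, dataset, _algorithm in tasks:
        key = (host, dataset)
        if key not in seen:
            seen.add(key)
            plan.append(key)
    return plan


class CompressionCache:
    """Compress each dataset at most once, however many hosts need it."""

    def __init__(self):
        self._compressed = {}
        self.compressions = 0

    def get(self, dataset_name, raw_bytes):
        if dataset_name not in self._compressed:
            self._compressed[dataset_name] = gzip.compress(raw_bytes)
            self.compressions += 1
        return self._compressed[dataset_name]


# Hypothetical workload: three tasks on two hosts, all using the same dataset.
tasks = [
    ("pluto.deis.unical.it", "example.arff", "J48"),
    ("pluto.deis.unical.it", "example.arff", "NaiveBayes"),
    ("saturn.deis.unical.it", "example.arff", "J48"),
]
cache = CompressionCache()
raw = b"@relation example\n" * 100
for host, dataset in plan_transfers(tasks):
    payload = cache.get(dataset, raw)
    print(host, dataset, len(payload), "bytes compressed")
print("compressions performed:", cache.compressions)
```

Three tasks lead to two transfers and a single compression, which is the saving the changelog entry describes.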

Version 2.0 (visit the web page of Weka4WS 2.0):

* Knowledge Flow front-end also extended to support remote data mining;

* code updated to the 3.4.12 book version of Weka (12th of December 2007);

* the client side of the application can run also on Windows machines;

* proxy credentials may be created inside the client application: a dedicated window may be called at any time both in Explorer and Knowledge Flow and will automatically pop up at startup if the credentials are not available or have expired;

* added a pull-style message delivery mechanism for clients to whom notifications cannot be delivered (e.g. because they are behind a firewall): the client now starts by default in pull-mode (that is, it checks for result availability every 10 seconds) and asks the server to dispatch a notification: if it subsequently receives a notification, the client switches to push-mode (that is, it waits for a result availability notification); otherwise it stays in pull-mode;

* improved hosts checking, which now includes, besides the Globus Container and GridFTP availability checks, a user permissions check, a Weka4WSService deployment check, and a version compatibility check between client and service;

* added the possibility to extend the memory available to the virtual machine by running the 'weka4ws.sh' (or 'weka4ws.bat' on Windows) script and passing as its first argument the amount of RAM (in MB) to be used. For example, running

weka4ws.sh 2048

will run Weka4WS setting the maximum Java heap size to 2048MB.
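The pull/push delivery mechanism described in the Version 2.0 changelog amounts to a two-state machine: start in pull-mode, poll every 10 seconds, and switch to push-mode as soon as a notification gets through. The Python sketch below models only that decision logic; the class and method names are hypothetical and no network communication is involved.

```python
class DeliveryMode:
    """Track whether the client should poll for results or wait for a notification."""

    def __init__(self, poll_interval_seconds=10):
        self.mode = "pull"  # the client starts in pull-mode by default
        self.poll_interval_seconds = poll_interval_seconds

    def notification_received(self):
        # A notification reached the client, so notifications are deliverable
        # and polling is no longer needed.
        self.mode = "push"

    def next_action(self):
        """Return ('poll', seconds_to_wait) in pull-mode, ('wait', None) in push-mode."""
        if self.mode == "pull":
            return ("poll", self.poll_interval_seconds)
        return ("wait", None)


client = DeliveryMode()
print(client.mode, client.next_action())
client.notification_received()
print(client.mode, client.next_action())
```

A client behind a firewall never receives the notification, so it never leaves pull-mode and keeps polling at the fixed interval.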

Version 1.0 (visit the web page of Weka4WS 1.0):

* code updated to the 3.4.11 book version of Weka (1st of June 2007);

* added detailed client and service logging;

* added full support to data preprocessing;

* added full support to data visualization;

* added full support to "classifier evaluation options";

* added possibility to set a supplied test set;

* added dataset compression to improve transfers speed;

* added reporting of server-side exceptions to the client;

* added JDBC support;

* added possibility to concurrently run multiple remote tasks;

* added possibility to stop remote task execution.

Download

The current Weka4WS packages (2.1, dated 2nd of July 2008) can be downloaded here:

Weka4WS client

TAR.GZ (3.6MB)

ZIP (4.1MB)

Weka4WS service

TAR.GZ (2.8MB)

ZIP (2.9MB)

Copyright (C) 2005-2008 University of Calabria - Dept. of Electronics, Computer Science and Systems

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

Blog

Since June 2007 Weka4WS has had a blog of its own. You can find it at the following address:

http://weka4ws.wordpress.com/

With the blog you can:

stay updated on the ongoing work and new version releases of Weka4WS through RSS feeds;

request new features or suggest modifications to existing features;

report suspected bugs;

read the frequently asked questions;

get the printable Weka4WS user guide.


How to cite

Domenico Talia, Paolo Trunfio, Oreste Verta, "The Weka4WS framework for distributed data mining in service-oriented Grids". Concurrency and Computation: Practice and Experience, vol. 20, n. 16, pp. 1933-1951, Wiley InterScience, November 2008. [PDF]

About us

Grid Lab is located at the University of Calabria, Rende (CS), Italy, Cubo 41C 3rd floor.

For comments and suggestions please contact us:

Domenico Talia contact

Paolo Trunfio contact

Marco Lackovic contact

Università della Calabria | Grid Computing Lab
