Distributed Data Mining System in Java
Group Member
王春笙,林俊甫,王慧芬
Overview of Project
• Project participants: 王春笙, 林俊甫, 王慧芬
Project Programming Tasks
• D92725002 林俊甫
– Polling and reply multicast between client and server
– Client/server socket programming
– Client dynamic join and leave mechanism
– Multi-thread programming
– Synchronization mechanism
– Data chunk maintenance and dispatching mechanism
– Client/server communication link control
Project Programming Tasks (cont’d)
– Client failure handling
  • Reassign the backup server if the failed client is the backup
  • Restore the failed client's work (with 王春笙)
– Server failure handling
  • Backup server designation mechanism and logic design
– RMI mechanism (with 王春笙)
– Basic GUI
System Infrastructure
• System diagram: one Server/Coordinator and several Clients connected over a LAN. The server sends mining data chunks to the clients, and the clients send mining results back to the server.
Basic Operation
Message sequence between server and client (time runs downward):
1. Client polls multicast group 230.0.0.1 on port 4444 with "@": who is server?
2. Server replies with its server name: I am the server
3. Client connects to <servername, port 4445>
4. Server instructs: Client do: filechunk#
5. ok
6. Server instructs: Client do: next filechunk#
7. ...
Server side:
• Listens for multicast group queries and replies
• Forks a thread to handle each client connection
• Waits for the client's processed result, then orders the client to get another file chunk
Client side:
• Server found; connects to the server
• Receives the server's instruction and invokes RMI to get the file chunk
Port Assignment
• Port 4444: for multicast
• Port 4445: for TCP/IP socket connection
• Port 4446: for RMI services
Finding A Server
• Once a client starts up, it queries every 3 seconds over multicast group 230.0.0.1, port 4444, sending the 1-byte string “@” to locate the server host.
• Once a server starts up, it forks a thread to deal with these queries.
Client loop:
1. Client query: who is the server now?
2. Listen for the server's response.
3. Connect to the server on port 4445.
4. Use RMI to get a file chunk from the server.
5. Process the data mining and return the result to the server.
6. On server failure detection: if I am the backup, go to the backup server procedure; otherwise go to step 1.
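The query step could be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: the class and method names are hypothetical, and it assumes the server answers with a unicast UDP datagram whose payload is its host name (the slides only say the server replies "I am the server"). The receive timeout doubles as the 3-second retry interval.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side discovery helper (names are not from the slides).
class DiscoverySketch {
    static final String GROUP = "230.0.0.1";  // multicast group from the slides
    static final int MULTICAST_PORT = 4444;   // multicast port from the slides
    static final int RETRY_MILLIS = 3000;     // query every 3 seconds
    static final byte QUERY = (byte) '@';     // 1-byte query string

    // Builds the 1-byte "@" datagram addressed to the multicast group.
    static DatagramPacket buildQuery() {
        try {
            byte[] payload = { QUERY };
            return new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(GROUP), MULTICAST_PORT);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Server side: true if a received datagram is the "@" query.
    static boolean isQuery(byte[] data, int length) {
        return length == 1 && data[0] == QUERY;
    }

    // Sends the query up to maxAttempts times; returns the reply text,
    // or null if no server answered. Assumes a unicast reply (see lead-in).
    static String findServer(int maxAttempts) throws IOException {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(RETRY_MILLIS);
            byte[] buffer = new byte[256];
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                socket.send(buildQuery());
                DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
                try {
                    socket.receive(reply);
                    return new String(reply.getData(), 0, reply.getLength(),
                            StandardCharsets.UTF_8);
                } catch (SocketTimeoutException timeout) {
                    // No answer within 3 seconds: ask again.
                }
            }
            return null;
        }
    }
}
```

A client would call `DiscoverySketch.findServer(...)` and then open a TCP connection to the returned host on port 4445.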
File Dispatching
• The server maintains a file chunk pool.
• The server finds an available file chunk for a client, sets its state to 1, and orders the client to get this file chunk by RMI. The chunk's state is updated to 2 when the client returns its result.
• Recovery: when the server detects that a client's link is broken, it restores the file chunk allocated to that client to 0.
• The file chunk class is declared Serializable so that it can be passed to the backup server by RMI.
• The file chunk class uses synchronization for concurrency control.
FileChunks states: -1: empty, 0: available, 1: using, 2: used
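The state transitions above can be illustrated with a small synchronized pool. This is a sketch, not the project's file chunk class: the class name, method names, and the array representation are assumptions; only the four state codes, the use of Serializable, and the use of synchronization come from the slides.

```java
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical chunk pool; state codes follow the slide: -1, 0, 1, 2.
class FileChunkPool implements Serializable {
    private static final long serialVersionUID = 1L;
    static final int EMPTY = -1;     // slot holds no chunk
    static final int AVAILABLE = 0;  // chunk waiting for a client
    static final int USING = 1;      // chunk handed to a client
    static final int USED = 2;       // client returned its result

    private final int[] state;

    FileChunkPool(int capacity, int chunkCount) {
        state = new int[capacity];
        Arrays.fill(state, EMPTY);
        Arrays.fill(state, 0, Math.min(chunkCount, capacity), AVAILABLE);
    }

    // Finds the first available chunk, marks it as in use, and returns its
    // index; returns -1 when nothing is available.
    synchronized int assign() {
        for (int i = 0; i < state.length; i++) {
            if (state[i] == AVAILABLE) {
                state[i] = USING;
                return i;
            }
        }
        return -1;
    }

    // Called when the client returns the mining result for this chunk.
    synchronized void complete(int chunk) {
        if (state[chunk] == USING) state[chunk] = USED;
    }

    // Recovery: the client's link broke, so the chunk becomes available again.
    synchronized void recover(int chunk) {
        if (state[chunk] == USING) state[chunk] = AVAILABLE;
    }

    synchronized int stateOf(int chunk) {
        return state[chunk];
    }
}
```

Because every method is synchronized on the pool, the per-client handler threads can call `assign()` concurrently without handing the same chunk to two clients.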
Backup Server Selection
• The server maintains and assigns a unique id for each individual client.
• The unique id is incremented as a serial number.
• The client with the smallest id is assigned as the backup server.
• When a client fails, the server checks whether that client was the backup server; if so, it restarts the selection process.
Nodes Maintenance
• The server maintains the records of connected clients in an ArrayList.
• The ArrayList is composed of objects of class Nodes, each recording one client's detailed information.
• Diagram: ArrayList ht of Nodes, shown as key/value rows with the fields Id, Address, Port, Work on, Status.
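A possible shape for the node records, together with the smallest-id backup rule from the Backup Server Selection slide, is sketched below. The class names `ClientNode` and `NodeTable`, the field types, and the default values are assumptions for illustration; the slides only name the class Nodes and its fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record for one connected client (fields named after the slide).
class ClientNode {
    final int id;          // serial id assigned by the server
    final String address;  // client's IP address
    final int port;        // client's port number
    int workOn = -1;       // chunk number being worked on, -1 if none (assumed)
    int status = 1;        // connection status code, 1 = online

    ClientNode(int id, String address, int port) {
        this.id = id;
        this.address = address;
        this.port = port;
    }
}

// Hypothetical server-side table of connected clients.
class NodeTable {
    private final List<ClientNode> nodes = new ArrayList<>();
    private int nextId = 1;     // ids are handed out as serial numbers
    private int backupId = -1;  // -1 means no backup is currently assigned

    // A client joins: it gets the next id; the first client becomes backup.
    synchronized ClientNode join(String address, int port) {
        ClientNode node = new ClientNode(nextId++, address, port);
        nodes.add(node);
        if (backupId == -1) backupId = node.id;
        return node;
    }

    // A client leaves or fails: redo the selection only if it was the backup.
    synchronized void leave(int id) {
        nodes.removeIf(n -> n.id == id);
        if (id == backupId) {
            backupId = nodes.stream().mapToInt(n -> n.id).min().orElse(-1);
        }
    }

    synchronized int getBackupId() {
        return backupId;
    }
}
```

Since ids only grow, the current backup always holds the smallest id until it leaves, so a reselection is needed only in that case.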
RMI Services
• The RMI services are written as an independent program because both the server and the client that acts as backup server use them.
• The RMI services provide:
– Backup of server data to the backup server
– Getting a file chunk from the server
– Returning a mining result to the server
– Receiving nodes information from the server
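One way the four services could be expressed is as a single Java RMI remote interface. The sketch below is an assumption throughout: the interface name, method names, parameter types, and return types are hypothetical, and the in-memory implementation exists only so the example can be exercised locally without a registry.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical remote interface covering the four services on the slide.
interface MiningService extends Remote {
    byte[] getFileChunk(int chunkNo) throws RemoteException;                // get file chunk from server
    void returnResult(int chunkNo, String patterns) throws RemoteException; // return mining result
    String getNodeInfo() throws RemoteException;                            // nodes information
    Map<Integer, String> getBackupData() throws RemoteException;            // server data for the backup
}

// Local, non-exported implementation used only to illustrate the calls.
class InMemoryMiningService implements MiningService {
    private final Map<Integer, byte[]> chunks = new HashMap<>();
    private final Map<Integer, String> results = new HashMap<>();

    synchronized void addChunk(int chunkNo, byte[] data) {
        chunks.put(chunkNo, data);
    }

    public synchronized byte[] getFileChunk(int chunkNo) {
        return chunks.get(chunkNo);
    }

    public synchronized void returnResult(int chunkNo, String patterns) {
        results.put(chunkNo, patterns);
    }

    public synchronized String getNodeInfo() {
        return "chunks=" + chunks.size() + ", results=" + results.size();
    }

    // Returns a copy so the backup gets a snapshot, not the live map.
    public synchronized Map<Integer, String> getBackupData() {
        return new HashMap<>(results);
    }
}
```

In a deployment, an object like this would be exported and bound in an RMI registry on port 4446, the port the slides assign to RMI services.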
Client Failure
• Server's actions:
– Recovery
– Reassignment
– Redo the backup server selection if the failed node is the backup
• Client's action:
– Do nothing, unless the client is told by the server to act as backup
Server Failure
Sequence among Server S, Client A (the backup), and Client B (time runs downward).
Before the crash:
• Server S runs the backup selection and chooses A as backup.
• Client A:
  1. A is told by S that it is the backup; A invokes RMI to get all server data (A: do backup; RMI get file; RMI reply).
  2. A periodically gets the server services and file chunk data, while still handling "Client do #" and "do reply".
• Client B:
  1. B receives instructions as discussed before ("Client do #", "do reply").
After the server crashes:
• Client A:
  3. The broken communication link is detected; A starts the ServerAction class.
  4. A creates a server socket on port 4445 and forks a thread to listen for queries and wait for connections.
• Client B:
  2. The broken communication link is detected; B multicasts the query "@": who is the server now? A replies: I am the server.
  3. B knows A is the backup and reconnects to A (connect to A:4445).
Server/Client Life Cycle
• Diagram: Server and Client life cycles. A Client can evolve into a Server; each ends in normal or abnormal termination.
Project Programming Tasks
• D91725001 王春笙
– Web log file preprocessing and separation
– Parsing of web page traversal sequences
– Page item transfer and mapping
– Mining of web page sequential patterns
– Mining result maintenance
– RMI transfer of mining results
– Mining result lookup and display
Project Programming Tasks (cont’d)
– Backup mechanism
  • A separate thread backs up server files and in-memory data
  • Restore the failed client's work (with 林俊甫)
– RMI mechanism (with 林俊甫)
– GUI global state refresh
– System integration
  • Testing and debugging
Web Log File Format
• User IP
• Date
• Time
• Web pages URL
Web File Preprocessing
• Select *.htm and *.html pages
• First sort by user ID
• Second sort by time
• Page sequences are separated by time
– a gap of more than 30 seconds starts a new sequence
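The preprocessing steps can be sketched as a filter, a two-key sort, and a gap-based split. This is an illustrative sketch: the `LogEntry` and `SessionSplitter` names are hypothetical, timestamps are assumed to be seconds stored in a `long`, the user ID is assumed to be the user IP string, and the 30-second rule is read as "a gap of more than 30 seconds between consecutive pages of the same user starts a new sequence".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical parsed log record: user (IP), time in seconds, page URL.
class LogEntry {
    final String user;
    final long timeSec;
    final String url;

    LogEntry(String user, long timeSec, String url) {
        this.user = user;
        this.timeSec = timeSec;
        this.url = url;
    }
}

class SessionSplitter {
    static final long GAP_SECONDS = 30;

    // Returns the page sequences found in the log, in (user, time) order.
    static List<List<String>> split(List<LogEntry> log) {
        // 1. Keep only *.htm and *.html pages.
        List<LogEntry> pages = new ArrayList<>();
        for (LogEntry e : log) {
            if (e.url.endsWith(".htm") || e.url.endsWith(".html")) pages.add(e);
        }
        // 2. Sort by user first, then by time.
        pages.sort(Comparator.comparing((LogEntry e) -> e.user)
                .thenComparingLong(e -> e.timeSec));
        // 3. Start a new sequence on a user change or a gap over 30 seconds.
        List<List<String>> sequences = new ArrayList<>();
        List<String> current = null;
        LogEntry previous = null;
        for (LogEntry e : pages) {
            boolean startNew = previous == null
                    || !previous.user.equals(e.user)
                    || e.timeSec - previous.timeSec > GAP_SECONDS;
            if (startNew) {
                current = new ArrayList<>();
                sequences.add(current);
            }
            current.add(e.url);
            previous = e;
        }
        return sequences;
    }
}
```

In the project, each URL would then be mapped to its numeric page item ID before the sequences are written to the chunk files.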
Chunk Data Files
• Part*.ppp
6023 2 1 1 2 8
6024 1 1 206
6025 7 1 1 1 1 1 1 1 2 5 17 18 19 20 11
6026 3 1 1 1 144 145 338
6027 2 1 1 2 9
6028 3 1 1 1 2 8 3
• Items.ppp
/~visualdep/htm/p5b.htm 168
/~businessdep/student/picture.html 169
/~comedu/inde.htm 170
/~account/91tuition.htm 171
/~stuaffair/life/procedure-17.htm 172
/~stuaffair/life/procedure-25.htm 173
Apriori algorithm
• 1:find all L1
• 2:generate C2 from L1
• 3:count C2 and find all L2
• 4:k=3
• 5:generate & prune Ck from Lk-1
• 6:count Ck and find all Lk
• 7:if Lk not empty then k++, goto 5
Apriori algorithm (cont’d)
• join phase: s1 joins s2 if s1 (drop first) = s2 (drop last)
– s1 = {a, b}, s2 = {b, a}: s1 join s2 => {a, b, a}
• prune phase: delete a k-candidate if any of its (k-1)-subsequences is not large
• C & L are stored in a hash data structure
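The join and prune phases can be sketched as one candidate-generation step from L(k-1) to C(k). This is a sketch under assumptions: sequences are lists of integer page IDs, the class and method names are hypothetical, and the (k-1)-subsequences checked in the prune phase are taken to be those obtained by deleting any single element of the candidate.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical candidate generator for sequences of page IDs.
class CandidateGen {

    // Builds C(k) from L(k-1); all sequences in 'large' must share one length.
    static List<List<Integer>> generate(Set<List<Integer>> large) {
        List<List<Integer>> candidates = new ArrayList<>();
        for (List<Integer> s1 : large) {
            for (List<Integer> s2 : large) {
                // Join phase: s1 without its first item == s2 without its last.
                List<Integer> tail = s1.subList(1, s1.size());
                List<Integer> head = s2.subList(0, s2.size() - 1);
                if (tail.equals(head)) {
                    List<Integer> candidate = new ArrayList<>(s1);
                    candidate.add(s2.get(s2.size() - 1));
                    // Prune phase: keep only if every (k-1)-subsequence is large.
                    if (allSubsequencesLarge(candidate, large)) {
                        candidates.add(candidate);
                    }
                }
            }
        }
        return candidates;
    }

    // Checks each subsequence made by deleting one element of the candidate.
    static boolean allSubsequencesLarge(List<Integer> candidate,
                                        Set<List<Integer>> large) {
        for (int i = 0; i < candidate.size(); i++) {
            List<Integer> sub = new ArrayList<>(candidate);
            sub.remove(i);  // int argument: removes the element at index i
            if (!large.contains(sub)) return false;
        }
        return true;
    }
}
```

With a = 1 and b = 2, joining {1, 2} and {2, 1} yields {1, 2, 1}; under the stated prune rule this candidate survives only if {1, 1} is also large. Storing L in a `HashSet` keeps each membership check in the prune phase at constant expected time, in line with the hash structure the slide mentions.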
Mining Result Display
• Client frequent patterns
– Web page ID
– Support
– Saved as *.pppl files
• Client frequent patterns
– Web page ID
– Support
– Web page name
Backup Mechanism
• When a backup server is selected, that client starts a backup thread
• The backup thread loops every 0.5 second
• RMI data transfer
– Chunk data files (part*.ppp, items.ppp)
– Client information
– File chunk information
  • determine MaxID and set “in use” to “available”
– Frequent patterns information
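The 0.5-second loop and the "determine MaxID and set in use to available" step might look like the sketch below. Everything here is assumed for illustration: the class and method names, the use of a scheduled executor in place of a hand-written thread loop, and the reading that in-use chunks are reset so they can be dispatched again after the backup takes over.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helpers for the client that was chosen as backup server.
class BackupSketch {

    // Runs the given pull task every 0.5 second. The caller must call
    // shutdown() on the returned executor when the backup role ends.
    static ScheduledExecutorService startBackupLoop(Runnable pullServerState) {
        ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(pullServerState, 0, 500, TimeUnit.MILLISECONDS);
        return timer;
    }

    // Prepares copied state for use by the backup: chunks marked "in use" (1)
    // go back to "available" (0), and the largest client id is returned so
    // that id numbering can continue from it.
    static int takeOver(int[] chunkState, List<Integer> clientIds) {
        for (int i = 0; i < chunkState.length; i++) {
            if (chunkState[i] == 1) chunkState[i] = 0;
        }
        int maxId = 0;
        for (int id : clientIds) {
            maxId = Math.max(maxId, id);
        }
        return maxId;
    }
}
```

The task passed to `startBackupLoop` would wrap the RMI calls that copy the chunk files, client information, file chunk information, and frequent patterns.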
System Integration
• Java class integration
– Server component
– Client component
– Data mining component
– GUI component
• Testing
• Debugging
Project Programming Tasks
• D92725001 王慧芬
– Graphical User Interface
  • Since this system performs its data mining task in a distributed way, its GUI provides four panels:
    – A system console
    – A result window
    – A connection table
    – A graphical network configuration
GUI
• The system console shows how the system proceeds
GUI (cont’d)
• The result window displays the progress and results of data mining
GUI (cont’d)
• A connection table lists the connection information of all on-line clients
GUI (cont’d)
• A connection table consists of 5 fields
– NO: client-server connection id
– IP address: client's IP address
– Port: client's port number
– Status: connection status, one of
  • 0: offline
  • 1: online
  • 2: file transfer from server to client
  • 3: client is doing data mining
  • 4: client returns its value to the server when data mining has finished
  • 5: client is doing the backup and data mining at the same time
– # chunk works on: when the client is data mining or doing backup and mining, the chunk number that the connection works on
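The status codes lend themselves to a Java enum, which keeps the numeric values from the table in one place. This is only an illustrative sketch; the enum and constant names are hypothetical, and the slides do not say how the project represents these codes.

```java
// Hypothetical enum for the Status field of the connection table.
enum ConnectionStatus {
    OFFLINE(0),
    ONLINE(1),
    FILE_TRANSFER(2),      // file transfer from server to client
    MINING(3),             // client is doing data mining
    RETURNING_RESULT(4),   // client returns its value to the server
    BACKUP_AND_MINING(5);  // backup and data mining at the same time

    final int code;

    ConnectionStatus(int code) {
        this.code = code;
    }

    // Maps a numeric code from the connection table back to a constant.
    static ConnectionStatus fromCode(int code) {
        for (ConnectionStatus status : values()) {
            if (status.code == code) return status;
        }
        throw new IllegalArgumentException("Unknown status code: " + code);
    }
}
```

For example, `ConnectionStatus.fromCode(3)` yields `MINING`, which a GUI could use to pick the matching client icon.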
GUI (cont’d)
• A graphical network configuration follows the connection table to depict the dynamic network configuration
GUI (cont’d)
• In the dynamic network configuration, different client GIFs express the status:
– Offline
– On-line
– Data mining
– Backup and mining
GUI interface
• mw.showMsg()
– provided by the GUI for the server/client module to show console messages
• mw.showResultString()
– provided by the GUI for the server/client module to show the results of data mining
• Connection table
– modified by the server/client module with connection information
– read by the GUI every 0.01 second to depict the dynamic network configuration
GUI design
• Java Swing is used to generate labels, text, scrollbars, tables, etc.
• Java AWT 2D painting is used to animate the connection lines in the dynamic configuration panel
• ‘Photo Impact’ and ‘GIF animator’ are used to generate the node icons
• EasyRGB is used to tune the color harmonies
GUI design (cont’d)
• A new thread is forked from the GUI task to animate the connection lines in the dynamic configuration panel: it reads the table every 0.03 second and shows the connection status with a moving ball.
• Diagram: the GUI task generates the connection table, the result panel, the system console, and the connection table animation.
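Installation
• Example: running one server and two clients
– Create three folders: Ser (Server), Cli (Client1), and Cli2 (Client2)
– Extract the attached archive into the Ser folder; also download weblog10.zip into this folder and extract it
– Extract the attached archive into the empty Cli and Cli2 folders
– Open two DOS windows (windows 1 and 2) and change to the Ser folder
– Open three DOS windows (windows 3, 4, and 5); windows 3 and 4 change to the Cli folder, window 5 changes to the Cli2 folder
– Window 1: run the compile.bat batch file, then run rmi.bat
– Window 2: run the server.bat batch file
– Window 3: run the compile.bat batch file, then run rmi.bat
– Window 4: run the client.bat batch file
– Window 5: run the compile.bat batch file, then run the client.bat batch file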