Distributed Computing Overviews. Agenda What is distributed computing Why distributed computing...


Distributed Computing Overviews


Agenda

• What is distributed computing
• Why distributed computing
• Common Architecture
• Best Practice
• Case study
  – Condor
  – Hadoop: HDFS and MapReduce


What is Distributed Computing/System?

• Distributed computing
  – A field of computer science that studies distributed systems.
  – The use of distributed systems to solve computational problems.
• Distributed system
  – Wikipedia
    • There are several autonomous computational entities, each of which has its own local memory.
    • The entities communicate with each other by message passing.
  – Operating System Concepts
    • The processors communicate with one another through various communication lines, such as high-speed buses or telephone lines.
    • Each processor has its own local memory.


What is Distributed Computing/System?

• Distributed program
  – A computer program that runs in a distributed system
• Distributed programming
  – The process of writing distributed programs


What is Distributed Computing/System?

• Common properties
  – Fault tolerance
    • When one or more nodes fail, the whole system keeps working, although performance may degrade.
    • The status of each node needs to be checked.
  – Each node plays a partial role
    • Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.
  – Resource sharing
    • Each user can share the computing power and storage resources in the system with other users.
  – Load sharing
    • Dispatching tasks across the nodes spreads the load over the whole system.
  – Easy to expand
    • Adding nodes should take little time, ideally none.


Why Distributed Computing?

• The nature of the application
• Performance
  – Computing intensive
    • The task spends a lot of time on computation. For example, computing π.
  – Data intensive
    • The task deals with a large number of files or with very large files. For example, Facebook, LHC (Large Hadron Collider).
• Robustness
  – No SPOF (Single Point Of Failure)
  – Other nodes can execute the task that was running on a failed node.
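
As a rough illustration of a computing-intensive task such as π, the sketch below splits the Leibniz series into ranges and sums each range in a separate worker. The function names, the term count and the use of local threads as stand-ins for nodes are all assumptions made for illustration; in a real distributed system each range would be dispatched to a different machine.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def leibniz_partial(start, stop):
    """Sum the Leibniz terms 4 * (-1)^k / (2k + 1) for k in [start, stop)."""
    return sum((4.0 if k % 2 == 0 else -4.0) / (2 * k + 1) for k in range(start, stop))


def estimate_pi(total_terms=200_000, workers=4):
    """Split the series into one contiguous range per worker and add the partial sums."""
    chunk = total_terms // workers
    ranges = [
        (i * chunk, (i + 1) * chunk if i < workers - 1 else total_terms)
        for i in range(workers)
    ]
    # Threads only stand in for nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda r: leibniz_partial(*r), ranges))
    return sum(partials)


if __name__ == "__main__":
    print(abs(estimate_pi() - math.pi))
```

Each worker only sees its own range, which matches the earlier point that a node may know only one part of the input.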


Common Architectures

• Communicate and coordinate work among concurrent processes
  – Processes communicate by sending/receiving messages
  – Synchronous/Asynchronous
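
The difference between synchronous and asynchronous message passing can be sketched with an in-process queue. This is a minimal sketch, assuming threads in one process stand in for separate nodes; `Mailbox`, `send_async` and `send_sync` are hypothetical names, not part of any library.

```python
import queue
import threading


class Mailbox:
    """A hypothetical in-process mailbox standing in for a network channel."""

    def __init__(self):
        self._q = queue.Queue()

    def send_async(self, msg):
        """Asynchronous send: enqueue and return immediately."""
        self._q.put((msg, None))

    def send_sync(self, msg, timeout=1.0):
        """Synchronous send: block until the receiver has handled the message."""
        done = threading.Event()
        self._q.put((msg, done))
        return done.wait(timeout)  # True if acknowledged before the timeout

    def receive(self):
        return self._q.get()


def receiver(mailbox, handled, count):
    """Handle `count` messages and acknowledge the synchronous ones."""
    for _ in range(count):
        msg, done = mailbox.receive()
        handled.append(msg)
        if done is not None:
            done.set()


def demo():
    mailbox, handled = Mailbox(), []
    worker = threading.Thread(target=receiver, args=(mailbox, handled, 2))
    worker.start()
    mailbox.send_async("a")          # sender continues without waiting
    acked = mailbox.send_sync("b")   # sender waits for the acknowledgement
    worker.join()
    return handled, acked
```

The asynchronous sender never learns whether the message was processed, so the choice is a trade-off between sender latency and delivery feedback.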


Common Architectures

• Master/Slave architecture
  – Master/slave is a model of communication where one device or process has unidirectional control over one or more other devices
• Database replication
  – The source database can be treated as a master and the destination database can be treated as a slave.
• Client-server
  – Web browsers and web servers


Common Architectures

• Data-centric architecture
  – Using a standard, general-purpose relational database management system rather than customized in-memory or file-based data structures and access methods
  – Using dynamic, table-driven logic rather than logic embodied in previously compiled programs
  – Using stored procedures rather than logic running in middle-tier application servers
  – Using shared databases as the basis for communicating between parallel processes rather than direct inter-process communication via message passing functions


Best Practice

• Data intensive or computing intensive
  – Data size and the amount of data
    • The attributes of the data you consume
  – Computing intensive
    • We can move data to the nodes where we execute jobs
  – Data intensive
    • We can split/replicate data across different nodes, then execute our tasks on those nodes
    • Reduce data replication when executing tasks
    • Master nodes need to know the data location
• No data loss when incidents happen
  – SAN (Storage Area Network)
  – Data replication on different nodes
• Synchronization
  – When splitting tasks across different nodes, how can we make sure these tasks are synchronized?
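
The point that master nodes need to know the data location can be illustrated with a small placement function. This is only a sketch under assumed data structures: `block_locations` and `node_load` are hypothetical tables that a master might keep, and the "least loaded holder first" rule is an illustrative policy, not one taken from any particular system.

```python
def pick_node(block_id, block_locations, node_load):
    """Prefer the least-loaded live node that already stores the block.

    block_locations: {block_id: [node, ...]}  (hypothetical master-side table)
    node_load:       {node: running_task_count} for nodes that are alive
    Returns (node, is_local); is_local is False when data must be copied.
    """
    holders = [n for n in block_locations.get(block_id, []) if n in node_load]
    candidates = holders or list(node_load)
    if not candidates:
        raise ValueError("no nodes available")
    chosen = min(candidates, key=lambda n: (node_load[n], n))
    return chosen, bool(holders)


if __name__ == "__main__":
    locations = {"b1": ["n2", "n3"]}
    load = {"n1": 0, "n2": 5, "n3": 2}
    print(pick_node("b1", locations, load))
    print(pick_node("b9", locations, load))
```

When no live node holds the block, the function falls back to the least-loaded node, which is the case where network bandwidth for copying the data starts to matter.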


Best Practice

• Robustness
  – Still safe when one or some nodes fail
  – Need to recover when failed nodes come back online. Little or no further action should be needed
    • Condor – restart daemon
  – Failure detection
    • When any node fails, master nodes can detect the situation.
      – E.g. heartbeat detection
  – Apps/users don’t need to know whether a partial failure has happened.
    • Restart tasks on other nodes for users
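
Heartbeat detection can be sketched as a table of last-seen timestamps kept by the master. The class name, the 3-second timeout and the injectable clock are assumptions chosen for illustration; real systems pick their own intervals and usually require several missed heartbeats before declaring a node dead.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went silent."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def record(self, node):
        """Called whenever a heartbeat message arrives from `node`."""
        self._last_seen[node] = self._clock()

    def failed_nodes(self):
        """Nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    fake_now = [0.0]
    monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
    monitor.record("node-a")
    monitor.record("node-b")
    fake_now[0] = 2.0
    monitor.record("node-a")      # node-b stays silent
    fake_now[0] = 4.0
    print(monitor.failed_nodes())
```

The master would then restart the tasks of the reported nodes elsewhere, so that users do not have to notice the failure.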


Best Practice

• Network issue
  – Bandwidth
    • Consider bandwidth when a task has to run on nodes that do not hold its data, because the files must first be copied there from other nodes.
• Scalability
  – Easy to expand
    • Hadoop – configuration modification and start daemon
• Optimization
  – What can we do if the performance of some nodes is not good?
    • Monitor the performance of each node
      – Based on exchanged information such as heartbeats or logs
    • Resume the same task on another node


Best Practice

• App/User
  – Shouldn’t need to know how nodes communicate with each other
  – User mobility – users can access the system from some point or from anywhere
    • Grid – UI (User interface)
    • Condor – submit machine


Case study - Condor

• Condor
  – Computing intensive jobs
  – Queuing policy
    • Match tasks and computing nodes
  – Resource classification
    • Each resource can advertise its attributes, and the master can classify resources accordingly
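
The idea of matching tasks to resources that advertise their attributes can be sketched as follows. This is not Condor's ClassAd language or its negotiation algorithm; the dictionary-based "ads", the predicate-style requirements and the greedy first-fit policy are all assumptions made to keep the example small.

```python
def match_jobs(jobs, machine_ads):
    """Greedy first-fit matching of jobs to advertised machines.

    jobs:        list of (job_name, requirement) where requirement is a
                 predicate over a machine's advertised attributes
    machine_ads: {machine_name: {attribute: value}}
    Returns {job_name: machine_name or None}; None means the job stays queued.
    """
    free = dict(machine_ads)
    matches = {}
    for job_name, requirement in jobs:
        chosen = next((name for name, ad in free.items() if requirement(ad)), None)
        if chosen is not None:
            del free[chosen]      # a machine runs one job at a time in this sketch
        matches[job_name] = chosen
    return matches


if __name__ == "__main__":
    ads = {
        "m1": {"os": "LINUX", "memory_mb": 1024},
        "m2": {"os": "LINUX", "memory_mb": 4096},
    }
    queue = [
        ("j1", lambda ad: ad["memory_mb"] >= 2048),
        ("j2", lambda ad: ad["os"] == "LINUX"),
        ("j3", lambda ad: ad["os"] == "LINUX"),
    ]
    print(match_jobs(queue, ads))
```

Jobs are considered in queue order, so a queuing policy could be expressed simply by reordering the `jobs` list before matching.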


Case study - Condor

(Figure from http://www.cs.wisc.edu/condor/)


Case study - Condor

• Role
  – Central manager
    • The collector of information, and the negotiator between resources and resource requests
  – Execution machine
    • Responsible for executing Condor tasks
  – Submit machine
    • Responsible for submitting Condor tasks
  – Checkpoint servers
    • Responsible for storing all checkpoint files for the tasks


Case study - Condor

• Robustness
  – One execution machine fails
    • We can execute the same task on other nodes.
  – Recovery
    • Only need to restart the daemon when the failed nodes are back online


Case study - Condor

• Resource sharing
  – Each Condor user can share computing power with other Condor users.
• Synchronization
  – Users need to take care of it themselves
    • Users can execute MPI jobs in a Condor pool but need to think about synchronization and deadlock issues.
• Failure detection
  – The central manager can know when nodes fail
    • Based on update notifications sent by the nodes
• Scalability
  – Only a few commands need to be executed when new nodes come online.


Case study - Hadoop

• HDFS
  – NameNode
    • Manages the file system namespace and regulates access to files by clients.
    • Determines the mapping of blocks to DataNodes.
  – DataNode
    • Manages storage attached to the node that it runs on
    • Saves CRC codes
    • Sends heartbeats to the NameNode.
    • Each file is split into chunks and each chunk is stored on some DataNodes.
  – Secondary NameNode
    • Responsible for merging fsImage and EditLog
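
The split into chunks, the block-to-DataNode mapping and the stored CRC codes can be sketched in a few lines. This is not HDFS code: the tiny block size, the round-robin placement and the function names are assumptions for illustration, and real HDFS placement takes racks and node state into account.

```python
import zlib


def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, datanodes, replication=3):
    """Build a NameNode-style table: block index -> checksum and replica nodes.

    Round-robin placement is an assumption of this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    table = {}
    for i, block in enumerate(blocks):
        nodes = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        table[i] = {"crc": zlib.crc32(block), "nodes": nodes}
    return table


def verify(block: bytes, entry) -> bool:
    """Recompute the CRC of a block read back from a DataNode and compare."""
    return zlib.crc32(block) == entry["crc"]


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    table = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
    print(table[1]["nodes"], verify(blocks[0], table[0]))
```

Because every block lives on several nodes, a reader that finds a checksum mismatch or a dead node can fetch the same block from another replica.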


Case study - Hadoop

• MapReduce framework
  – JobTracker
    • Responsible for dispatching jobs to each TaskTracker
    • Job management such as removing and scheduling.
  – TaskTracker
    • Responsible for executing jobs. Usually the TaskTracker launches another JVM to execute the job.
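
The programming model that the JobTracker and TaskTrackers execute can be sketched with a single-process word count. This does not use the Hadoop API; the function names are hypothetical, and the map, shuffle and reduce phases run one after another in one process instead of on separate nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "B c"]))
```

Each call to `map_phase` depends only on its own split, which is what lets a framework run the map tasks on whichever nodes hold the data.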


Case study - Hadoop

(Figure from Hadoop - The Definitive Guide)


Case study - Hadoop

• Data replication
  – Data are replicated to different nodes
    • Reduces the possibility of data loss
    • Data locality: a job is sent to the node where its data are.
• Robustness
  – One DataNode fails
    • We can get data from other nodes.
  – One TaskTracker fails
    • We can start the same task on a different node
  – Recovery
    • Only need to restart the daemon when the failed nodes are back online


Case study - Hadoop

• Resource sharing
  – Each Hadoop user can share computing power and storage space with other Hadoop users.
• Synchronization
  – No synchronization
• Failure detection
  – The NameNode/JobTracker can know when a DataNode/TaskTracker fails
    • Based on heartbeat


Case study - Hadoop

• Scalability
  – Only a few commands need to be executed when new nodes come online.
• Optimization
  – A speculative task is launched only when a task takes too much time on one node.
    • The slower task will be killed when the other one has finished
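
Speculative execution can be sketched with two attempts racing each other. This is a simplified illustration, not Hadoop's scheduler: the fixed `patience_s` threshold, the node names and the simulated delays are assumptions, and because Python threads cannot be killed the losing attempt is merely ignored here instead of being terminated.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_with_speculation(task, nodes, patience_s, pool):
    """Run `task` on nodes[0]; if it is still running after `patience_s`
    seconds, launch a second attempt on nodes[1] and keep the first result."""
    attempts = {pool.submit(task, nodes[0]): nodes[0]}
    done, _ = wait(list(attempts), timeout=patience_s)
    if not done:
        attempts[pool.submit(task, nodes[1])] = nodes[1]   # speculative attempt
        done, _ = wait(list(attempts), return_when=FIRST_COMPLETED)
    winner = next(iter(done))
    for attempt in attempts:
        if attempt is not winner:
            attempt.cancel()   # no effect once running; a real system kills the task
    return attempts[winner], winner.result()


def simulated_task(node):
    """Stand-in for real work; the delays per node are made up."""
    delays = {"slow-node": 0.5, "fast-node": 0.05}
    time.sleep(delays[node])
    return f"result from {node}"


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(run_with_speculation(simulated_task, ["slow-node", "fast-node"], 0.1, pool))
```

The second attempt costs extra resources, which is why it is launched only after the first one has exceeded the patience threshold.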


Reference

• http://en.wikipedia.org/wiki/Message_passing
• http://en.wikipedia.org/wiki/Distributed_computing
• http://www.mcs.anl.gov/research/projects/mpi/tutorial/mpiintro/ppframe.htm
• http://en.wikipedia.org/wiki/SIMD
• http://en.wikipedia.org/wiki/Database-centric_architecture
• http://www.cloudera.com/videos/thinking_at_scale
• http://www.cs.wisc.edu/condor
• http://hadoop.apache.org/
• http://www.ece.arizona.edu/~ece568/MPI1.ppt
• Tom White - Hadoop - The Definitive Guide
• Silberschatz Galvin - Operating System Concepts


Backup slides


Message passing - Synchronous Vs. Asynchronous


Case study – Condor (All related daemons)

• condor_master: keeps all the rest of the Condor daemons running on each machine
• condor_startd: represents a given resource and enforces the policy that resource owners configure, which determines under what conditions remote jobs will be started, suspended, resumed, vacated, or killed. When the startd is ready to execute a Condor job, it spawns the condor_starter
• condor_starter: spawns the remote Condor job on a given machine
• condor_schedd: represents resource requests to the Condor pool
• condor_shadow
• condor_collector: collects all the information about the status of a Condor pool
• condor_negotiator: executes all the match-making within the Condor system
• condor_kbdd: notify condor_startd when machine owner
• condor_ckpt_server: stores and retrieves checkpoint files
• condor_quill: builds and manages a database that represents a copy of the Condor job queue


Case study – Condor (All related daemons)

• condor_had: implements high availability of a pool's central manager by monitoring the communication of necessary daemons
• condor_replication: assists the condor_had daemon by keeping an updated copy of the pool's state
• condor_transferer: accomplishes the task of transferring a file
• condor_lease_manager: manages leases in a persistent manner. Leases are represented by ClassAds
• condor_rooster: wakes hibernating machines based upon configuration details
• condor_shared_port: listens for incoming TCP packets