Scheduling scheme for Hadoop clusters

A RESEARCH ON SCHEDULING SCHEME FOR HADOOP CLUSTERS

Guided by: Neetha K N, Dept of CSE

Presented by: Amjith B, S7 CSE

AREAS OF SEMINAR

Hadoop

MapReduce and HDFS

[Figure: Hadoop cluster made up of Rack 1, Rack 2, ..., Rack n, each containing Node 1, Node 2, ..., Node n]

TERMINOLOGY REVIEW

INTRODUCTION

• Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers

• 2 components
  • MapReduce engine
  • Distributed file system

COMPONENTS

• MapReduce engine
  • Programming model developed by Google
  • Computation component of Hadoop
  • Consists of Map and Reduce functions

• HDFS
  • Storage component of Hadoop
  • Splits the data into blocks and distributes them
  • Fault tolerant and self-healing
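The Map and Reduce functions can be illustrated with a word count, written here in plain Python rather than against the Hadoop API. This is only a sketch of the programming model: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the grouping step that the framework performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    # Shuffle: group intermediate values by key, as the framework would
    # do between the map phase and the reduce phase.
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)
    return dict(reduce_fn(word, counts) for word, counts in groups.items())

result = run_job(["hadoop runs map tasks", "hadoop runs reduce tasks"])
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the reduce calls run after the intermediate pairs have been shuffled across the network.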

• MapReduce node: JobTracker, TaskTracker

• HDFS node: NameNode, DataNode

• HDFS node
  • NameNode – Maintains metadata information about files (1 per cluster).
  • DataNode – Stores the data blocks and handles their replication; installed on each slave node (1 to many per cluster).

• MapReduce node
  • JobTracker – Schedules job execution and keeps track of cluster-wide job status (1 per cluster).
  • TaskTracker – Receives tasks from the JobTracker. Runs on compute nodes in conjunction with the DataNode (1 to many per cluster).
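The split between NameNode metadata and DataNode storage can be sketched as follows. This is a simplified illustration, not HDFS code: the block size is tiny so the example stays readable, the replication factor of 3 is an assumption, and replicas are placed round-robin, whereas real HDFS placement also takes racks into account.

```python
BLOCK_SIZE = 64   # bytes, for illustration only; real HDFS blocks are far larger
REPLICATION = 3   # assumed replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Build the metadata a NameNode would keep:
    block index -> list of DataNodes holding a replica (round-robin)."""
    n = len(datanodes)
    copies = min(replication, n)
    return {
        b: [datanodes[(b + r) % n] for r in range(copies)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(b"x" * 200)
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on several DataNodes, the loss of one node leaves at least one other replica from which the block can be copied again.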

LITERATURE SURVEY

SYSTEM | FEATURES | DISADVANTAGES | REFERENCE

Hadoop FIFO scheduling | Implements the FIFO principle | Cannot assign priority to jobs | REF [6]

Facebook’s Fair scheduler | Even allocation of resources | No preemption support for large tasks | REF [4]

Yahoo’s Capacity scheduler | FIFO scheduler based on priority | Problem in assigning priorities | REF [6]
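The difference between FIFO and fair allocation can be shown with a toy slot allocator. This is a sketch under simplifying assumptions (a fixed number of task slots, jobs described only by a name and a slot demand), and it does not reproduce the actual Hadoop scheduler implementations.

```python
def fifo_allocate(jobs, slots):
    """Give slots to jobs strictly in arrival order; later jobs wait."""
    allocation = {}
    for name, demand in jobs:
        granted = min(demand, slots)
        allocation[name] = granted
        slots -= granted
    return allocation

def fair_allocate(jobs, slots):
    """Hand out slots one at a time, round-robin, until the slots
    or the outstanding demand run out."""
    allocation = {name: 0 for name, _ in jobs}
    remaining = dict(jobs)
    while slots > 0 and any(remaining.values()):
        for name in allocation:
            if slots > 0 and remaining[name] > 0:
                allocation[name] += 1
                remaining[name] -= 1
                slots -= 1
    return allocation

jobs = [("big", 10), ("small", 2)]  # (job name, slots requested), in arrival order
fifo = fifo_allocate(jobs, 8)
fair = fair_allocate(jobs, 8)
```

With 8 slots, FIFO gives all of them to the job that arrived first, so the small job starves until the big one finishes, while the fair allocator serves the small job in full and leaves the rest to the big one.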

EXISTING SYSTEM

EXISTING SYSTEM (disadvantage)

• Underutilization of the CPU
• Not flexible
• Interaction between the master node and the slave nodes

PROPOSED SYSTEM

• Analyze the system for CPU and I/O underutilization
• Use a predictive scheduler to predict the appropriate TaskTracker
• Couple the scheduler with a prefetching mechanism to improve the system performance

PREDICTIVE SCHEDULER

• Flexible task scheduler
• Predicts the most appropriate TaskTrackers to assign future tasks to
• Allows DataNodes to explore underutilization of disk bandwidth
• Seeks stragglers and predicts candidate data blocks
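The slides do not say how the prediction is made, so the following is only one possible sketch. It assumes each TaskTracker reports the progress and elapsed time of its running task, estimates the remaining time from a constant progress rate, and then looks for a pending task whose input block is already local to the predicted tracker. All names and data structures here are hypothetical.

```python
def predict_next_tracker(trackers):
    """Return the tracker expected to free a slot soonest.

    trackers maps a tracker name to (progress, elapsed_seconds) of its
    running task, with 0 < progress < 1. Remaining time is estimated by
    assuming the task keeps its average progress rate.
    """
    def remaining(item):
        progress, elapsed = item[1]
        return elapsed * (1 - progress) / progress
    return min(trackers.items(), key=remaining)[0]

def pick_candidate_block(tracker, pending_tasks, block_locations):
    """Prefer a pending task whose input block is stored on the tracker's node;
    otherwise fall back to the first pending task."""
    for task, block in pending_tasks:
        if tracker in block_locations[block]:
            return task, block
    return pending_tasks[0] if pending_tasks else None

trackers = {"tt1": (0.5, 40), "tt2": (0.9, 45), "tt3": (0.2, 10)}
pending = [("t7", "b1"), ("t8", "b2")]
block_locations = {"b1": {"tt1"}, "b2": {"tt2", "tt3"}}

next_tracker = predict_next_tracker(trackers)
candidate = pick_candidate_block(next_tracker, pending, block_locations)
```

The candidate block returned here is what a prefetching mechanism would load ahead of time, so that the task finds its input in memory when it is launched.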

PREFETCHING MODULE

• Integrates with the predictive scheduler
• Multiple worker threads
• Monitors the status of the worker threads and coordinates the prefetching process
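A prefetching module built from worker threads plus a monitor could look roughly like the sketch below. It is an assumption-laden illustration rather than the design from the slides: `load_block` is a hypothetical stand-in for reading a block from disk, and Python's `ThreadPoolExecutor` plays the role of the worker pool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_block(block_id):
    """Hypothetical stand-in for reading one block from disk into memory."""
    return block_id, b"\0" * 16

def prefetch(block_ids, workers=2):
    """Load blocks with several worker threads and record each one's status."""
    cache, status = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(load_block, b): b for b in block_ids}
        # Monitor: collect results as the workers finish, in any order.
        for future in as_completed(futures):
            block = futures[future]
            try:
                _, data = future.result()
                cache[block] = data
                status[block] = "done"
            except Exception:
                status[block] = "failed"
    return cache, status

cache, status = prefetch(["b1", "b2", "b3"])
```

The status dictionary is what a coordinator would consult to decide whether a task can start immediately or has to fall back to a normal disk read.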

STEPS FOR LAUNCHING TASKS

Copying the job from HDFS to TaskTracker

Creation of local working directory for task

Creation of TaskTracker instance

ISSUES IN PREFETCHING MODULE

• When to prefetch
• What to prefetch
• How much to prefetch
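The "how much" question can be made concrete with a back-of-the-envelope bound. The slides give no formula, so this is an assumed one: prefetch no more blocks than the disk can deliver before the running task finishes, and no more than fit in the free cache memory. All parameter names are hypothetical.

```python
def blocks_to_prefetch(remaining_s, bandwidth_mb_s, block_mb, free_cache_mb):
    """Upper bound on the number of blocks worth prefetching.

    remaining_s    -- estimated seconds until the running task finishes
    bandwidth_mb_s -- spare disk bandwidth in MB/s
    block_mb       -- block size in MB
    free_cache_mb  -- memory available for prefetched blocks in MB
    """
    by_time = int(remaining_s * bandwidth_mb_s // block_mb)
    by_memory = int(free_cache_mb // block_mb)
    return max(0, min(by_time, by_memory))

time_limited = blocks_to_prefetch(4, 50, 64, 256)
memory_limited = blocks_to_prefetch(10, 50, 64, 128)
```

In the first call the time window is the limit (200 MB can be read, so 3 blocks); in the second the cache is the limit (128 MB free, so 2 blocks).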

ADVANTAGES

• Avoidance of I/O stalls
• Maximising CPU utilisation
• Helps the smooth functioning of Hadoop
• Flexible

COMPARISON

EXISTING SYSTEM | PROPOSED SYSTEM

Low I/O performance | High I/O performance

CPU underutilised | Proper utilisation

Less flexible | Additional overhead of prefetching to master

FUTURE SCOPE

• Hadoop on demand (HOD)
• A scheduler in a heterogeneous environment

REFERENCES

1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI ’04, pages 137–150, 2004.

2. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI ’08: 8th USENIX Symposium on Operating Systems Design and Implementation, October 2008.

3. R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. SIGOPS Oper. Syst. Rev., 29:79–95, December 1995.

4. Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, et al. HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment. In Proceedings of 11th IEEE International Conference on Cluster Computing, pages 16–20. ACM, 2009.

5. Tom White. Hadoop: The Definitive Guide. O’Reilly, 2009.

6. Mark Yong, Nitin Garegrat, and Shiwali Mohan. Towards a Resource Aware Scheduler in Hadoop.

THANK YOU!!!!!!

QUESTIONS??
