Scheduling scheme for hadoop clusters

A RESEARCH ON SCHEDULING SCHEME FOR HADOOP CLUSTERS Guided by Presented by Neetha K N Amjith B Dept of CSE S7 CSE

description

a prefetching mechanism into MapReduce model while retaining compatibility with the native Hadoop.


Page 1: Scheduling scheme for hadoop clusters

A RESEARCH ON SCHEDULING SCHEME FOR HADOOP CLUSTERS

Guided by: Neetha K N, Dept of CSE

Presented by: Amjith B, S7 CSE

Page 2: Scheduling scheme for hadoop clusters

AREAS OF SEMINAR

Hadoop

MapReduce and HDFS

Page 3: Scheduling scheme for hadoop clusters

TERMINOLOGY REVIEW

[Figure: Hadoop cluster, consisting of Rack 1 to Rack n, each containing Node 1 to Node n]

Page 4: Scheduling scheme for hadoop clusters

INTRODUCTION

• Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers

• Two components: the MapReduce engine and the distributed file system (HDFS)

Page 5: Scheduling scheme for hadoop clusters

COMPONENTS

• MapReduce engine: programming model developed by Google; the computation component of Hadoop; consists of Map and Reduce functions

• HDFS: the storage component of Hadoop; splits the data into blocks and distributes them; fault tolerant and self-healing
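As an illustration of the Map and Reduce functions mentioned above, here is a minimal single-process word-count sketch. It is not Hadoop code: the function names `map_words`, `shuffle`, `reduce_counts` and `word_count` are made up for this example, and the grouping step that Hadoop performs across the cluster is simulated with a dictionary.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
```

Calling `word_count` on a few lines of text returns a dictionary from each word to its number of occurrences; on a real cluster the map and reduce calls would run as separate tasks on different nodes.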

Page 6: Scheduling scheme for hadoop clusters

MapReduce node
• JobTracker
• TaskTracker

HDFS node
• NameNode
• DataNode

Page 7: Scheduling scheme for hadoop clusters

HDFS node
• NameNode – Maintains metadata information about files (1 per cluster).
• DataNode – Stores the data blocks and performs block creation and replication as instructed by the NameNode; installed on each slave node (1 to many per cluster).

MapReduce node
• JobTracker – Schedules job execution and keeps track of cluster-wide job status (1 per cluster).
• TaskTracker – Receives tasks from the JobTracker. Runs on compute nodes in conjunction with the DataNode (1 to many per cluster).

Page 9: Scheduling scheme for hadoop clusters

LITERATURE SURVEY

SYSTEM | FEATURES | DISADVANTAGES | REFERENCE
Hadoop FIFO scheduling | Schedules jobs on the FIFO principle | Cannot assign priority to jobs | REF [6]
Facebook’s Fair scheduler | Even allocation of resources | No preemption support for large tasks | REF [4]
Yahoo’s Capacity scheduler | FIFO scheduler based on priority | Problem in assigning priorities | REF [6]
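The difference between FIFO ordering and even allocation of resources can be sketched in a few lines. This is a toy model only, not the logic of the actual Hadoop schedulers: the functions `fifo_order` and `fair_share`, the job tuples and the slot counts are all assumptions made for illustration.

```python
def fifo_order(jobs):
    """FIFO: run jobs strictly in order of submission time.

    `jobs` is a list of (name, submit_time) tuples.
    """
    return [name for name, _submitted in sorted(jobs, key=lambda job: job[1])]


def fair_share(job_names, total_slots):
    """Even allocation: split the slots equally among the running jobs.

    Slots that do not divide evenly go to the jobs listed first.
    Assumes `job_names` is not empty.
    """
    base, extra = divmod(total_slots, len(job_names))
    return {
        name: base + (1 if index < extra else 0)
        for index, name in enumerate(job_names)
    }
```

Under FIFO a long job submitted first holds up every job behind it, whereas under the even split each job makes progress at the same time with a smaller share of the cluster.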

Page 10: Scheduling scheme for hadoop clusters

EXISTING SYSTEM

Page 14: Scheduling scheme for hadoop clusters

EXISTING SYSTEM (disadvantages)

• Underutilization of the CPU
• Not flexible
• Interaction between the master node and the slave nodes

Page 15: Scheduling scheme for hadoop clusters

PROPOSED SYSTEM

• Analyze the system for CPU and I/O underutilization
• Use a predictive scheduler to predict the appropriate TaskTracker
• Couple the scheduler with a prefetching mechanism to improve system performance

Page 17: Scheduling scheme for hadoop clusters

PREDICTIVE SCHEDULER

• Flexible task scheduler
• Predicts the most appropriate TaskTrackers to which future tasks should be assigned
• Allows DataNodes to explore underutilization of disk bandwidth
• Seeks stragglers and predicts candidate data blocks
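One possible way to picture the prediction step is shown below. The slides do not state how the prediction is made, so this sketch assumes a simple rule: each running task continues at its average progress rate, the TaskTracker whose task should finish first is the next candidate, and tasks far slower than the median are treated as stragglers. The class `RunningTask`, the function names and the `slowdown` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    tracker: str      # hypothetical TaskTracker name
    progress: float   # fraction completed, between 0.0 and 1.0
    elapsed_s: float  # seconds since the task started


def remaining_seconds(task):
    """Estimate the time left, assuming a constant average progress rate."""
    if task.progress <= 0.0 or task.elapsed_s <= 0.0:
        return float("inf")  # no progress information yet, so no estimate
    rate = task.progress / task.elapsed_s
    return (1.0 - task.progress) / rate


def predict_next_tracker(tasks):
    """Pick the tracker whose running task is expected to finish first."""
    return min(tasks, key=remaining_seconds).tracker


def find_stragglers(tasks, slowdown=2.0):
    """Flag trackers whose task needs `slowdown` times the median time left."""
    estimates = sorted(remaining_seconds(task) for task in tasks)
    median = estimates[len(estimates) // 2]
    return [t.tracker for t in tasks if remaining_seconds(t) > slowdown * median]
```

The tracker returned by `predict_next_tracker` is the one for which data blocks could be prepared ahead of time, which is where the prefetching module comes in.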

Page 18: Scheduling scheme for hadoop clusters

PREFETCHING MODULE

• Integrates with the predictive scheduler
• Uses multiple worker threads
• Monitors the status of the worker threads and coordinates the prefetching process
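The worker-thread arrangement described above can be sketched with Python's standard thread pool. This is an assumption-laden illustration, not the module's real design: `fetch_block` is a stand-in that fabricates bytes instead of reading from disk, and the block ids, `prefetch` function and status strings are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_block(block_id):
    """Stand-in for reading one data block from disk into memory."""
    return block_id, f"data-for-{block_id}".encode()


def prefetch(block_ids, workers=4):
    """Load the predicted blocks in parallel and record each block's status."""
    cache, status = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_block, b): b for b in block_ids}
        # The coordinator watches the workers and collects results as they finish.
        for future in as_completed(futures):
            block_id = futures[future]
            try:
                _, data = future.result()
                cache[block_id] = data
                status[block_id] = "done"
            except Exception:
                # One failed worker does not stop the remaining prefetches.
                status[block_id] = "failed"
    return cache, status
```

The returned `status` dictionary plays the role of the monitoring step: the coordinator can see which blocks are already in memory before a task asks for them.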

Page 19: Scheduling scheme for hadoop clusters

STEPS FOR LAUNCHING TASKS

Copying the job from HDFS to TaskTracker

Creation of local working directory for task

Creation of a TaskRunner instance
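The three launch steps can be mimicked on a local file system as follows. This is only a sketch under stated assumptions: an ordinary file copy stands in for the copy from HDFS, and the class `TaskInstance`, the function `launch_task` and the directory layout are hypothetical names, not Hadoop's own.

```python
import shutil
import tempfile
from pathlib import Path


class TaskInstance:
    """Hypothetical stand-in for the object that runs one task."""

    def __init__(self, job_file, work_dir):
        self.job_file = job_file
        self.work_dir = work_dir


def launch_task(shared_job_file, task_id, local_root=None):
    """Prepare one task locally, following the three steps in order."""
    # Use a fresh temporary directory when no local root is given.
    local_root = Path(local_root or tempfile.mkdtemp())

    # Step 1: copy the job from the shared file system to the local disk.
    local_job = local_root / Path(shared_job_file).name
    shutil.copy(shared_job_file, local_job)

    # Step 2: create a local working directory for the task.
    work_dir = local_root / task_id / "work"
    work_dir.mkdir(parents=True, exist_ok=True)

    # Step 3: create the instance that will execute the task.
    return TaskInstance(local_job, work_dir)
```

Each step touches the local disk before any computation starts, which is part of why launch time and I/O stalls matter for overall job performance.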

Page 20: Scheduling scheme for hadoop clusters

ISSUES IN PREFETCHING MODULE

• When to prefetch
• What to prefetch
• How much to prefetch

Page 21: Scheduling scheme for hadoop clusters

ADVANTAGES

• Avoidance of I/O stalls
• Maximising CPU utilisation
• Helps the smooth functioning of Hadoop
• Flexible

Page 22: Scheduling scheme for hadoop clusters

COMPARISON

EXISTING SYSTEM | PROPOSED SYSTEM
Low I/O performance | High I/O performance
CPU underutilised | CPU properly utilised
Less flexible | Additional overhead of prefetching to master

Page 23: Scheduling scheme for hadoop clusters

FUTURE SCOPE

• Hadoop On Demand (HOD)
• A scheduler for heterogeneous environments

Page 24: Scheduling scheme for hadoop clusters

REFERENCES

• 1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI ’04, pages 137–150, 2004.

• 2. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI ’08: 8th USENIX Symposium on Operating Systems Design and Implementation, October 2008.

• 3. R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. SIGOPS Oper. Syst. Rev., 29:79–95, December 1995.

• 4. Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, et al. HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment. In Proceedings of 11th IEEE International Conference on Cluster Computing, pages 16–20. ACM, 2009.

• 5. Tom White. Hadoop: The Definitive Guide. O’Reilly, 2009.

• 6. Mark Yong, Nitin Garegrat, and Shiwali Mohan. Towards a Resource Aware Scheduler in Hadoop.

Page 25: Scheduling scheme for hadoop clusters

THANK YOU!!!!!!

Page 26: Scheduling scheme for hadoop clusters

QUESTIONS??