Hadoop Summit 2010 Keynote
Hadoop
Trends, Opportunities, Challenges
Hemanth Yamijala
Committer, Hadoop
Technical Lead, Map/Reduce, Yahoo!
What is Hadoop?
• Distributed computing framework
– Offers storage and batch processing for petabytes of data
– Very suitable for ad-hoc textual processing applications
• Components
– Hadoop Distributed File System
– Map/Reduce programming framework
• Apache Software Foundation project
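The Map/Reduce programming model named above can be sketched in a few lines of plain Python (no Hadoop dependency; the function names are illustrative, not the Hadoop API). The framework runs a map phase over the input, groups intermediate pairs by key in a shuffle, then runs a reduce phase per key:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))
```

On a real cluster the same three steps run in parallel across machines, with HDFS supplying the input blocks; this single-process version only shows the shape of the model.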
Hadoop on your Yahoo! page …
Hadoop Adoption Trends - Yahoo!
• Runs the Yahoo! Distribution of Hadoop
• http://github.com/yahoo/hadoop
• 230 jobs/hour on average
• 4.38 TB/hour of input, 936 GB/hour of output
Hadoop on your FB, Twitter pages
• Facebook
– Reporting, analytics, machine learning
• Amazon
– Hosted Hadoop on top of EC2 and S3
– Product search index
– Analytics, social network graphs
• AOL, Microsoft (PowerSet), IBM, …
• http://wiki.apache.org/hadoop/PoweredBy
Support of a vibrant community
Hadoop contributions:
Core: HDFS, Map/Reduce; Non-core: sub-projects
Hadoop mailing list traffic
Cloudera Distribution of Hadoop – paid, supported service offering
from Cloudera
Support from Academia, Research
• PSG Tech, Coimbatore
– Semantic search, information retrieval, scheduling, applications in molecular biology – Deep dive on this later
• IIIT, Hyderabad
– Applications in Indian language content processing, scheduling
• IISc, Bangalore
– Modeling a simulator for Hadoop
• Many more – M45, OpenCirrus, …
Hadoop – a RAD tool?
• Without Hadoop
– Build-out and maintenance of hardware
– Transfer, storage of data – Deep dive on this later
– Handling failures, efficiency
• Enables rapid experimentation, iteration,
repeatability, low cost of failure
• Great ecosystem: Streaming, Pig, Hive, HBase,
Oozie, Avro…
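One reason this ecosystem keeps the cost of experimentation low: Hadoop Streaming lets any executable that reads stdin and writes stdout act as a mapper or reducer. A minimal word-count pair in that style, written here as functions over line iterables for clarity (a real Streaming job would read `sys.stdin`; the function names are illustrative):

```python
def mapper(lines):
    """Streaming mapper: emit one 'word<TAB>1' line per word,
    following Streaming's key<TAB>value convention."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: the framework delivers input sorted by key,
    so all lines for one word are adjacent and can be summed in a
    single pass."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

Packaged as scripts, the pair would typically be launched with the `hadoop-streaming` jar, passing them as the `-mapper` and `-reducer` executables.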
Technical focus areas at Yahoo!
• Security
– Kerberos based authentication
• Backwards Compatibility – 1.0
– APIs cannot be broken between major releases
– A new API in Map/Reduce that enables this
• Robustness
– Multiple bug fixes
– Map/Reduce framework refactoring for better
concurrency, simplifying control flow logic
Technical focus areas at Yahoo!
• Append / Sync / Flush
– Until Hadoop 0.20, files were write-once
– Append will open Hadoop up to more applications
• Efficiency in scheduling, data processing
– Task scheduling for better utilization, better
sharing policies
– Zero data copy – usage of direct I/O buffers
• Quality engineering
– Automated distributed system testing,
performance benchmarks (deep dive coming)
Agenda for Hadoop Summit
• Lightning Talk by Hari Vasudev (VP Platform
Tech Group, Yahoo!)
• Data Management on Grid by Srikanth
Sundarrajan (Yahoo!)
• Machine Learning using Hadoop- Real Case
Study by Krishna Prasad Chitrapura (Yahoo!)
• Multiple Sequence Alignment using Hadoop
by Dr. Sudha Sadhasivam (PSG Tech,
Coimbatore)
Agenda for Hadoop Summit
• Benchmarking and Optimizing Hadoop
deployments (benchmarking with HiBench) by Mukesh
Gangadhar (Intel)
• Challenges and Uniqueness of QE and RE processes in Hadoop
by Jayant Mahajan (Yahoo!)
• Tuning Hadoop to deliver performance to your application by
Srigurunath Chakravarthi (Yahoo!)
• Panel Discussion: Moderator: Basant Verma (Yahoo!);
Panelists: T. S. Mohan (Infosys), Sudha Sadhasivam (PSG Tech),
Chidambaran Kollengode (Yahoo!) & Jothi Padmanabhan (Yahoo!)
• Yahoo booth throughout the day: win cool prizes ☺
Backup Slides
Challenges for Yahoo!
• No longer just a wildly successful cool project!
– People are demanding we deliver!
• Production usage, availability, SLAs
– Jobs that MUST finish in 15 minutes, or revenue is
lost, and the time limits are going down
• Usability, Operability
• Scale, Performance
– Ever increasing demands mean we need larger
clusters, faster throughput
Design considerations
• Cost Effectiveness
– Runs on commodity hardware, Linux
• Linear Scale
• Fault Tolerance
– Block replication, checksums
– Transparent monitoring and re-execution of tasks
• Efficiency
– Data locality
– Efficient resource usage
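The checksum item above can be made concrete: HDFS stores a CRC32 over each fixed-size chunk of a block (historically 512 bytes) and re-verifies on read, falling back to another replica when a chunk fails. A small sketch of that chunked-checksum idea; the chunk size and function names here are illustrative, not the HDFS implementation:

```python
import zlib

CHUNK = 512  # HDFS checksums fixed-size chunks; 512 bytes is the classic default

def chunk_checksums(data, chunk=CHUNK):
    """Compute one CRC32 per chunk, as stored alongside each block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def verify(data, checksums, chunk=CHUNK):
    """Re-checksum on read. A False entry marks a corrupt chunk, where a
    reader would switch to another replica of the block."""
    return [zlib.crc32(data[i:i + chunk]) == checksums[i // chunk]
            for i in range(0, len(data), chunk)]
```

Because each chunk is checked independently, a single flipped bit pinpoints one 512-byte chunk rather than invalidating the whole block.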