
DISTRIBUTED MACHINE LEARNING

STANLEY WANG, SOLUTION ARCHITECT, TECH LEAD @SWANG68 http://www.linkedin.com/in/stanley-wang-a2b143b

What is Machine Learning?

Mathematics 101 for Machine Learning

Types of Machine Learning

Types of ML Algorithms

• Clustering

• Association learning

• Parameter estimation

• Recommendation engines

• Classification

• Similarity matching

• Neural networks

• Bayesian networks

• Genetic algorithms

Top Machine Learning Algorithms

Machine Learning Library

Typical Machine Learning Cases

Machine Learning Customer Examples

Machine Learning in Big Data Infrastructure

Big Data Machine Learning Pipeline

Benefits of Big Data Machine Learning

Distributed ML Framework

• Data Centric: Train over large data. The data is split over multiple machines; model replicas train over different parts of the data and communicate model information periodically;

• Model Centric: Train over large models. The model is split over multiple machines; a single training iteration spans multiple machines;

• Graph Centric: Train over large graphs. The data is partitioned as a graph, with state associated with every vertex/edge; update functions are applied in parallel as operations on a vertex, transforming data within the scope of that vertex;

Data Parallel ML - MapReduce
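
As an illustration of the data-centric pattern, here is a minimal single-process sketch of MapReduce-style data-parallel learning (the toy linear-regression data and the 4-shard split are invented for the example; this is not any particular framework's API): each map step computes a gradient on one data shard, and the reduce step averages the shard gradients into a single model update.

import numpy as np

# Toy regression data (hypothetical): y = 3*x + noise, split into 4 shards.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=400)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

def map_gradient(w, X_shard, y_shard):
    # Map step: one worker's squared-loss gradient on its own shard.
    err = X_shard @ w - y_shard
    return X_shard.T @ err / len(y_shard)

w = np.zeros(1)
for step in range(100):
    # Reduce step: average the per-shard gradients into one global update.
    grads = [map_gradient(w, Xs, ys) for Xs, ys in shards]
    w -= 0.1 * np.mean(grads, axis=0)

print(w)  # converges toward [3.0]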

Model Parallel ML – Parameter Server
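
Below is a minimal in-process sketch of the parameter-server idea (the ParameterServer class and its pull/push methods are invented for illustration, not the CMU Parameter Server API): workers pull the current weights, compute a gradient on their own data partition, and push it back to the server, which applies the update.

import numpy as np

class ParameterServer:
    # Invented stand-in for a server node holding the global model.
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()        # workers fetch the current model

    def push(self, grad):
        self.w -= self.lr * grad    # server applies a worker's gradient

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = X @ np.array([2.0, -1.0])
partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

server = ParameterServer(dim=2)
for step in range(200):
    for X_part, y_part in partitions:   # each "worker" owns one partition
        w = server.pull()
        grad = X_part.T @ (X_part @ w - y_part) / len(y_part)
        server.push(grad)

print(server.w)  # approaches [2.0, -1.0]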

Graph Parallel ML – BSP, Pregel, GAS
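
The sketch below shows the graph-centric style as a Pregel-like BSP loop on a toy 4-vertex graph (invented for the example; not Pregel's or GraphLab's actual API): in each superstep every vertex scatters a message along its out-edges, then applies an update function to the messages it gathered; the update rule here is PageRank.

# Toy directed graph as adjacency lists (vertex ids 0..3).
out_edges = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = {v: 1.0 / len(out_edges) for v in out_edges}

for superstep in range(20):                # BSP: all vertices advance in lockstep
    msgs = {v: 0.0 for v in out_edges}
    for v, nbrs in out_edges.items():      # scatter: send rank along out-edges
        for u in nbrs:
            msgs[u] += ranks[v] / len(nbrs)
    for v in out_edges:                    # gather/apply: per-vertex PageRank update
        ranks[v] = 0.15 / len(out_edges) + 0.85 * msgs[v]

print(ranks)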

Graph Parallel vs Data Parallel

Graph parallel is a newer technique for partitioning and distributing data that can execute machine learning algorithms orders of magnitude faster than the data parallel approach!

Efficient Scaling Up

• Businesses Need to Compute Hundreds of Distinct Tasks on the Same Graph

  o Example: personalized recommendations;

[Diagram: two scaling strategies. "Parallelize each task": complex and expensive to scale. "Parallelize across tasks": simple; 2x machines = 2x throughput.]

Another Approach Task Parallelism: Simple, But Practical

• What about scalability? Use a cluster of single-machine systems to solve many tasks in parallel: homogeneous graph data, but heterogeneous algorithms;

• What about learning ability? Use a hybrid data fusion approach;

• What about memory? The Parallel Sliding Windows (PSW) algorithm enables computation on very large graphs stored on disk;
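
A minimal sketch of this task-parallel pattern, using Python's multiprocessing on a toy shared graph (the graph and the personalized-PageRank task are invented for illustration): each process runs one complete single-machine task end to end, so adding machines scales the number of tasks rather than splitting any one of them.

from multiprocessing import Pool

# Toy shared graph: every task reads the same (homogeneous) graph data.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}

def personalized_rank(seed):
    # One complete single-machine task: personalized PageRank from `seed`.
    ranks = {v: 0.0 for v in graph}
    ranks[seed] = 1.0
    for _ in range(20):
        msgs = {v: 0.0 for v in graph}
        for v, nbrs in graph.items():
            for u in nbrs:
                msgs[u] += ranks[v] / len(nbrs)
        ranks = {v: (0.15 if v == seed else 0.0) + 0.85 * msgs[v] for v in graph}
    return seed, ranks

if __name__ == "__main__":
    with Pool(4) as pool:            # parallelize across tasks, not within one
        results = dict(pool.map(personalized_rank, graph.keys()))
    print(results[0])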

Parallel Sliding Windows

• PSW processes the graph one sub-graph at a time;

• In one iteration, the whole graph is processed.

  – And typically, the next iteration is then started.
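
A minimal in-memory sketch of the PSW idea (the intervals, edge list, and update rule are all invented for illustration; the real algorithm streams edge shards from disk): vertices are split into intervals, and each iteration slides over the intervals, loading only one sub-graph's edges at a time and applying the update function to that interval's vertices.

# Toy edge list standing in for on-disk edge shards: (src, dst) pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
intervals = [range(0, 2), range(2, 4)]       # vertex intervals
values = {v: 1.0 for v in range(4)}

def update(v, in_values):
    # User-defined vertex update: sees only data in the vertex's scope.
    return 0.5 * values[v] + 0.5 * sum(in_values)

for iteration in range(3):     # one iteration = one sliding pass over all intervals
    for interval in intervals:
        # "Load" only this interval's sub-graph (its in-edges) into memory.
        sub = [(s, d) for s, d in edges if d in interval]
        for v in interval:
            values[v] = update(v, [values[s] for s, d in sub if d == v])

print(values)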

Scalable Distributed ML Frameworks

• Yahoo Vowpal Wabbit - Fast and scalable library of out-of-core online ML algorithms; Hadoop-compatible AllReduce;

• Hadoop Mahout - Scalable Java ML library using the map/reduce paradigm; supports the "3 Cs" (collaborative filtering, clustering, classification) plus extra use cases;

• Spark MLlib - Memory-based distributed ML framework; roughly 10 times as fast as Mahout and scales even better;

• Dato GraphLab - Graph-based parallel ML framework;

• Apache Giraph - Bulk Synchronous Parallel (BSP) based large-scale graph processing framework;

• CMU Parameter Server - Distributed ML framework;

• CMU Petuum - Iterative-convergent Big ML framework;

• 0xdata H2O - Scalable, memory-efficient deep learning system;
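
For a concrete taste of one of these libraries, here is a minimal Spark MLlib example (the toy two-row dataset is invented for illustration; running it requires a PySpark installation): MLlib exposes models as DataFrame-based estimators, and the same code scales from a laptop to a cluster.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy labeled dataset; in practice this would come from distributed storage.
df = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(df)  # training is distributed by Spark
print(model.coefficients)
spark.stop()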

ML and Big Data: A Breakthrough