Parallel Tuning of Machine Learning Algorithms, Thesis Proposal
Parallel auto-tuning of machine learning algorithms
Gianmario Spacagna, [email protected]
16 October 2012
AgilOne, Inc., 1091 N Shoreline Blvd. #250, Mountain View, CA 94043
(877) 769-3047, (408) 404-0152 fax, [email protected]
Motivation
• Increase the revenue of cloud service providers → keep the cost curve linear w.r.t. the expected exponential income growth.
• Technically achievable through scalability:
  • Scalability in terms of resources → distributed parallel computing (Hadoop).
  • Scalability in terms of multi-tenancy → the same system running for several customers.
  • Scalability in terms of auto-configuration → avoiding manual tuning operations.
[Chart: income vs. cost curves over time]
Good Work Flow
[Diagram: good data → ML algorithm → good results, with a tuning loop (adjusting the configuration) around the algorithm]
General Tuning diagram
[Flowchart: test data → run the algorithm with configuration X → are the results good? → yes: tuned; no: change configuration X and repeat]
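The loop in this diagram can be sketched in a few lines. This is an illustrative Python sketch (the project itself is planned in Scala); the callables `run_algorithm`, `is_good` and `next_conf` are placeholders for the components described later:

```python
def tune(run_algorithm, is_good, initial_conf, next_conf, max_iters=100):
    """Generic tuning loop: run the algorithm with configuration X,
    check the results, and change the configuration until they are good."""
    conf = initial_conf
    for _ in range(max_iters):
        results = run_algorithm(conf)    # run algorithm with conf. X
        if is_good(results):             # are results good?
            return conf                  # tuned
        conf = next_conf(conf, results)  # change configuration
    return conf                          # give up after the iteration budget
```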
Tuning of Machine Learning Algorithms
• We need tuning when:
  • A new algorithm or version is released.
  • We want to improve accuracy and/or performance.
  • A new customer comes and the system must be customized for the new dataset and requirements.
We need to make it smart, automatic and scalable!
Vision
Magic Box
Request:
• Data set
• Application (prediction, clustering, classification…)
• Algorithm (ANN, LR, K-means…)
• Fitness metrics (std. dev., prob. of false true, clustering coeff., randomness…)
• Goal constraints (x > 0.9 & 0.3 < y < 0.5)
Response:
• Best algorithm
• Optimal configuration
• Metrics evaluation
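A hypothetical request/response exchange with the Magic Box, assuming JSON as the data format (the field names and values below are illustrative, not a fixed schema):

```json
{
  "request": {
    "dataset": "customers.csv",
    "application": "clustering",
    "algorithms": ["K-means"],
    "fitness_metrics": ["std_dev", "clustering_coeff"],
    "goal_constraints": "x > 0.9 & 0.3 < y < 0.5"
  },
  "response": {
    "best_algorithm": "K-means",
    "optimal_configuration": {"k": 5},
    "metrics_evaluation": {"std_dev": 0.12, "clustering_coeff": 0.91}
  }
}
```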
Architecture Design
[Component diagram: the Upper Applications API sits on top, feeding the Initializer and the Controller; the Controller drives the Scheduler, which dispatches work to Executors (ANN, LR, K-Means) running either locally or on a Hadoop cloud service; each Executor carries an Evaluator and a Data Sampler sub-component]
Upper Applications API
Tasks:
• Interfaces the communication between the system and the upper applications layer.
• Parses requests and results and generates the related output domain objects.
Possible data formats:
• JSON
• STDIN/OUT
Initializer
Tasks:
• Generates the initial set of configurations.
Possible implementations:
• Random points
• Latin Hypercube
• Dataset similarity
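As a sketch of the Latin Hypercube option (Python, illustrative only): each parameter's range is split into n equal slices, and every slice is used exactly once per dimension, giving a more even spread of initial configurations than purely random points.

```python
import random

def latin_hypercube(n_points, bounds):
    """Latin Hypercube sampling over a box given as [(low, high), ...]:
    each dimension is cut into n_points equal slices, one sample is drawn
    per slice, and the slices are shuffled independently per dimension."""
    columns = []
    for low, high in bounds:
        step = (high - low) / n_points
        column = [low + (i + random.random()) * step for i in range(n_points)]
        random.shuffle(column)
        columns.append(column)
    return list(zip(*columns))  # n_points initial configurations
```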
Controller
Tasks:
• Compares and generates configurations.
• Decides the convergence of the tuning.
• Adapts the data sampling request.
Possible implementations:
• Random search
• Grid search
• Stochastic Kriging
• Genetic Algorithms
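For the simplest of these options, random search, a controller can be sketched as follows (Python, illustrative; a GA or Stochastic Kriging variant would replace the proposal step with a smarter one). Here `objective` stands in for the Evaluator's fitness score, and the fixed `budget` is a crude stand-in for the convergence decision:

```python
import random

def random_search(objective, bounds, budget=50, seed=None):
    """Random-search controller: propose uniformly random configurations
    within the given bounds and keep the best one seen in the budget."""
    rng = random.Random(seed)
    best_conf, best_score = None, float("-inf")
    for _ in range(budget):
        conf = tuple(rng.uniform(low, high) for low, high in bounds)
        score = objective(conf)  # fitness as reported by the Evaluator
        if score > best_score:
            best_conf, best_score = conf, score
    return best_conf, best_score
```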
Scheduler
Tasks:
• Checks whether the requests are covered by the available services.
• Schedules and parallelizes request executions.
• Optimizes resources.
• Collects evaluated results.
Possible implementations:
• First available
• Oldest idle
• Load balanced
• Serialized (single node)
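A minimal sketch of the first-available policy (Python, illustrative; the executor records and field names are assumptions, not the system's actual data model):

```python
def first_available(executors, request):
    """First-available scheduling: dispatch the request to the first idle
    executor whose services cover the requested algorithm."""
    for executor in executors:
        if executor["idle"] and request["algorithm"] in executor["algorithms"]:
            executor["idle"] = False  # mark busy until results are collected
            return executor["name"]
    return None  # request not covered, or all capable executors are busy
```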
Executor
Tasks:
• Executes the provided algorithm with the specified configuration.
Possible implementations:
• Local execution
• Hadoop cluster
• Cloud service
Sub-components:
• Evaluator: evaluates results according to the specified fitness metrics.
• Data Sampler: down- and up-sampling of data.
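These sub-components can be sketched as follows (Python, illustrative): a standard-deviation fitness metric for the Evaluator, and a random down-sampler for the Data Sampler.

```python
import random

def std_dev(values):
    """Evaluator metric: population standard deviation of the results."""
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

def down_sample(data, fraction, seed=None):
    """Data Sampler: keep a random fraction of the data set (at least one row)."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)
```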
Tuning diagram
[Flowchart as before: test data → run the algorithm with configuration X → are the results good? → yes: tuned; no: change configuration X. Test execution is handled by the Scheduler and Executor; test control by the Initializer and Controller.]
SUNS: Simple, Unclever and Not Scalable
• Initializer: Random points
• API: STDIN/OUT
• Controller: Random search / grid search
• Scheduler: Serialized
• Executor: K-Means, local, with Evaluator
SNS: Smart but Not Scalable
• Initializer: Latin Hypercube
• API: STDIN/OUT or JSON
• Controller: Genetic Algorithm / Stochastic Kriging
• Scheduler: Serialized
• Executor: K-Means, local, with Evaluator
VSNS: Very Smart but Not Scalable
• Initializer: Dataset similarity
• API: STDIN/OUT or JSON
• Controller: Genetic Algorithm / Stochastic Kriging
• Scheduler: Serialized
• Executor: K-Means, local, with Evaluator
VSS: Very Smart and Scalable
• Initializer: Dataset similarity
• API: STDIN/OUT or JSON
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: First available
• Executor: K-Means on Hadoop, with Evaluator
VSVSO: Very Smart, Very Scalable and Optimized
• Initializer: Dataset similarity
• API: STDIN/OUT or JSON
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: Load balanced
• Executor: K-Means on Hadoop, with Evaluator and Data Sampler
Thesis
It is possible to build an intelligent system based on Genetic Algorithms/Stochastic Kriging that automatically selects and tunes machine learning algorithms, such as K-Means and LR, parallelizing the work on a Hadoop cluster to scale in a cost-efficient manner.
Project Plan (steps listed in order of priority)
1. Design the entire application in Scala in a testable and expandable way.
2. Implement the Genetic Algorithm or the Stochastic Kriging controller.
3. Implement the Latin Hypercube initializer.
4. Test with local-instance algorithms (K-Means and/or LR).
5. Develop and test at least one algorithm in MapReduce fashion using Hadoop.
6. Test with the real AgilOne cluster of servers.
7. Implement the Dataset Similarity initializer.
8. Implement the Data Sampler.
Questions, feedback, suggestions?

Thank you!