Using Distributed Computing for MLaaS
Michael Salvador Svanholm, Consultant
We have used Apache to distribute our machine learning tools.
So far, we have created tools for anomaly detection and classification.
Distributed computing is a way to deliver results fast when facing a growing amount of data.
Ideally, clients can use these tools without help, if they “know” their own data.
Anomaly detection using K-means clustering can be used to clean data.
On the other hand, anomalies can also be “data of interest”, which means that a lot of value can potentially come from examining them.
Figure annotation: these data points are anomalies/outliers.
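The slides do not say how outliers are picked out once the clusters are found. One common approach, shown here as a minimal sketch and not as the speaker's implementation, is to flag points that lie unusually far from their own centroid. The function names, the `factor` threshold and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=100, n_init=5, seed=0):
    """Lloyd's algorithm with a few random restarts; keeps the lowest-inertia run."""
    rng = np.random.default_rng(seed)
    best = None
    for _restart in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(n_iter):
            # Assign every point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its points (keep it if the cluster is empty).
            updated = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(updated, centroids):
                break
            centroids = updated
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        inertia = float((dists.min(axis=1) ** 2).sum())
        if best is None or inertia < best[0]:
            best = (inertia, centroids, labels)
    return best[1], best[2]


def flag_anomalies(X, k, factor=5.0, seed=0):
    """Flag points whose distance to their centroid exceeds factor x the cluster median."""
    centroids, labels = kmeans(X, k, seed=seed)
    dist = np.linalg.norm(X - centroids[labels], axis=1)
    flags = np.zeros(len(X), dtype=bool)
    for j in range(k):
        in_cluster = labels == j
        if in_cluster.any():
            threshold = factor * np.median(dist[in_cluster])
            flags[in_cluster] = dist[in_cluster] > threshold
    return flags, labels


# Synthetic illustration: two tight groups plus three planted outliers (the last three rows).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=1.0, size=(100, 2))
planted = np.array([[-9.0, -9.0], [19.0, 19.0], [-10.0, 8.0]])
X = np.vstack([blob_a, blob_b, planted])
flags, labels = flag_anomalies(X, k=2)
```

Whether the flagged rows are dropped (data cleaning) or inspected (data of interest) is the judgment call the slide describes; the code only produces the candidates.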
Detecting anomalies in the Danish Business Registry Data (CVR-data)
We found that some companies are anomalies, compared to others, on a subset of features in the CVR-data from the Danish Business Authority.
Figure labels: prototypes that define this cluster; outliers in this particular cluster.
Bankruptcy prediction using classification on the Danish Business Registry Data (CVR-data)
Our analysis shows that the latest number of “årsværk” (full-time equivalents) and the number of “closed production units” are significant with respect to a company staying out of bankruptcy.
On the other hand, the number of “open production units” and the second-latest number of “årsværk” are significant with respect to a company having gone bankrupt.
What’s next?
Semi-supervised learning: we can use a few labeled points together with unlabeled data.
Black/white data points: labeled data.
Grey data points: unlabeled data.
Created by: Techerin
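To make the idea of combining a few labeled points with many unlabeled ones concrete, here is a minimal label-propagation sketch. Label propagation is one standard semi-supervised technique; the slides do not state which method is planned, so the choice of algorithm, the function name, the kernel width `sigma` and the toy data are all assumptions.

```python
import numpy as np


def propagate_labels(X, y, sigma=1.0, n_iter=200):
    """Spread labels over a similarity graph; y uses -1 for unlabeled points."""
    labeled = y >= 0
    classes = np.unique(y[labeled])
    # RBF affinities between all pairs, row-normalised into transition probabilities.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)
    # One-hot scores for labeled points, zeros for unlabeled ones.
    F = np.zeros((len(X), len(classes)))
    F[labeled] = (y[labeled][:, None] == classes[None, :]).astype(float)
    clamp = F[labeled].copy()
    for _ in range(n_iter):
        F = P @ F            # every point averages its neighbours' scores
        F[labeled] = clamp   # the known labels stay fixed
    return classes[F.argmax(axis=1)]


# Toy data: two groups of 30 points, with a single labeled point in each group.
rng = np.random.default_rng(0)
left = rng.normal(loc=(-3.0, 0.0), scale=0.5, size=(30, 2))
right = rng.normal(loc=(3.0, 0.0), scale=0.5, size=(30, 2))
X = np.vstack([left, right])
y = np.full(60, -1)
y[0], y[30] = 0, 1
pred = propagate_labels(X, y)
```

In the terms of the figure, rows 0 and 30 are the black and white points and the other 58 rows are the grey ones.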
Thank you for your attention
Big Data in the Food Supply Chain
Methods for handling missing data
Niels Bruun Ipsen
29/03/2017, Methods for missing data, DTU Compute, Technical University of Denmark
Setting
• Increased use of Big Data methods within the food supply chain [1][2]
• Reasons for missing data: corrupted, expensive, unknown
• Missing data limits performance [3]
• How can missing data be handled in a formal way in a Big Data context?
Missing data methods:
• PPCA (probabilistic PCA)
• FA (factor analysis)
• Mixtures of PPCA or FA
• ARD (automatic relevance determination)
• Missing data process simulation
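For orientation, the PPCA model named in the list is usually written as below. The notation ($W$, $\mu$, $\sigma^2$, and the index set $o$ of observed entries) is the standard textbook one and is an assumption here, since the slides give no formulas. The last line holds when the missingness mechanism can be ignored: the observed entries of a data point are then still Gaussian, which is what makes the model convenient for incomplete data.

$$
\begin{aligned}
z &\sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(W z + \mu,\ \sigma^2 I) \\
x &\sim \mathcal{N}(\mu,\ W W^{\top} + \sigma^2 I) \\
x_o &\sim \mathcal{N}(\mu_o,\ W_o W_o^{\top} + \sigma^2 I)
\end{aligned}
$$

Factor analysis has the same form with $\sigma^2 I$ replaced by a diagonal noise covariance $\Psi$.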
Project Outline
• Probabilistic PCA
– Subspace estimation
– Posterior probability distribution
– Robustness
• Generalization
– Factor Analysis, mixtures
• Automation
– Automatic Relevance Determination, MLaaS
• Process estimation
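As an illustration of the "posterior probability distribution" item, the sketch below computes the distribution of the missing entries of one data point given its observed entries, assuming a probabilistic PCA model $x \sim \mathcal{N}(\mu, W W^{\top} + \sigma^2 I)$ whose parameters are already known. The function name and the toy parameters are hypothetical; estimating $W$, $\mu$ and $\sigma^2$ from incomplete data is the harder part and is not shown.

```python
import numpy as np


def conditional_gaussian(x, mu, W, sigma2):
    """Mean and covariance of the missing entries of x (marked NaN) given the rest,
    under x ~ N(mu, W W^T + sigma2 * I)."""
    observed = ~np.isnan(x)
    missing = ~observed
    C = W @ W.T + sigma2 * np.eye(len(mu))
    C_oo = C[np.ix_(observed, observed)]
    C_mo = C[np.ix_(missing, observed)]
    C_mm = C[np.ix_(missing, missing)]
    # gain = C_mo C_oo^{-1}, computed with a linear solve (C_oo is symmetric).
    gain = np.linalg.solve(C_oo, C_mo.T).T
    mean = mu[missing] + gain @ (x[observed] - mu[observed])
    cov = C_mm - gain @ C_mo.T
    return mean, cov


# Toy model: two features driven by one shared latent factor.
mu = np.zeros(2)
W = np.array([[1.0], [1.0]])
sigma2 = 0.25
x = np.array([2.0, np.nan])  # second feature is missing
mean, cov = conditional_gaussian(x, mu, W, sigma2)
# mean is [1.6] and cov is [[0.45]]: the estimate follows the observed
# feature, shrunk toward the prior mean, with the remaining uncertainty attached.
```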
Thank you
[1] Lokers, Rob, et al. "Analysis of Big Data technologies for use in agro-environmental science."
[2] Marvin, Hans JP, et al. "A holistic approach to food safety risks: Food fraud as an example."
[3] Anagnostopoulos, Christos, and Peter Triantafillou. "Scaling out big data missing value imputations: Pythia vs. Godzilla."
Integrating Big Data in Food
Philip Johan Havemann Jørgensen, PhD student
Measurements for mass spectrum × retention time
Measurements for mass spectrum × retention time × samples
Tensor Factorization (Parafac2):
$X_k = A D_k F_k^{\top}$
Key challenge: determining the correct number of components (trying to use a probabilistic formulation to solve it)
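The structure of the Parafac2 model can be sketched by generating data that follows it. The reading of the factors below is an assumption based on the preceding slides: $A$ holds one mass-spectrum profile per component, $D_k$ is a diagonal matrix of component strengths in sample $k$, and $F_k$ holds the retention-time profiles of sample $k$, which may differ in length between samples. The constraint that $F_k^{\top} F_k$ is the same for every $k$ is the standard Parafac2 constraint and is not stated on the slide. Function and variable names are hypothetical.

```python
import numpy as np


def random_orthonormal(rows, cols, rng):
    """Matrix with orthonormal columns (requires rows >= cols)."""
    q, _ = np.linalg.qr(rng.normal(size=(rows, cols)))
    return q


def simulate_parafac2(n_mass, n_times, n_components, seed=0):
    """Generate slices X_k = A D_k F_k^T with F_k^T F_k identical across k."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_mass, n_components))              # shared mass-spectrum profiles
    H = rng.normal(size=(n_components, n_components))   # shared part of every F_k
    slices, factors = [], []
    for n_time in n_times:
        P_k = random_orthonormal(n_time, n_components, rng)
        F_k = P_k @ H                                    # F_k^T F_k = H^T H for every k
        D_k = np.diag(rng.random(n_components) + 0.1)    # per-sample component strengths
        slices.append(A @ D_k @ F_k.T)                   # shape (n_mass, n_time)
        factors.append(F_k)
    return slices, factors


# Three samples with different numbers of retention-time points.
slices, factors = simulate_parafac2(n_mass=6, n_times=[8, 10, 7], n_components=2)
```

Each simulated slice has rank at most `n_components`; choosing that number for real data is the key challenge named on the slide.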
• Capturing relations in multimodal data
– Data Fusion
• Improving Predictive Analysis
– Transfer Learning/Domain Adaptation
Thank you!
Knowing Nothing
Jeppe Nørregaard
PhD Student with Lars Kai Hansen as supervisor
- Computers and Semantics in Text
Knowing Nothing, DTU Compute, Technical University of Denmark, 29-03-2017
People interact with computers
Where do you want to go on holiday?
Doesn’t know what it’s selling
… and other people
Fake News
~3,500 personnel == 3,600 tanks?
Motivations
Imagine a computer that…
• “knew” Wikipedia
• could fact check news
• perhaps a little Turing test?
We are currently working on giving computers their own memory
Exam time!
All knowledge in the universe
All knowledge you need
Differentiable Neural Computers [0]
Write
Read
Memory
We don’t need to touch this
[0] Graves, Alex, et al. "Hybrid computing using a neural network with dynamic external memory." Nature 538.7626 (2016): 471-476.
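The read and write operations on the memory can be sketched as follows. This is a simplified illustration of content-based addressing with an erase-and-add write, in the spirit of [0]; the full architecture also learns where to write and tracks the order of writes, and in practice a neural network controller produces the keys and vectors. Function names and the toy values are assumptions.

```python
import numpy as np


def content_weights(memory, key, beta):
    """Softmax over the cosine similarity between the key and each memory row."""
    eps = 1e-8
    similarity = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    scores = beta * similarity
    scores = scores - scores.max()  # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def write(memory, weights, erase, add):
    """Erase then add: M <- M * (1 - w e^T) + w a^T."""
    return memory * (1.0 - np.outer(weights, erase)) + np.outer(weights, add)


def read(memory, weights):
    """Weighted sum of memory rows."""
    return weights @ memory


# Store one vector in slot 2 of an empty memory, then retrieve it by content.
memory = np.zeros((4, 3))
value = np.array([1.0, 0.0, 1.0])
slot = np.array([0.0, 0.0, 1.0, 0.0])
memory = write(memory, slot, erase=np.ones(3), add=value)
w_read = content_weights(memory, key=value, beta=50.0)
recalled = read(memory, w_read)
```

Every operation here is differentiable, which is why the memory itself never has to be edited by hand: the network learns what to write and what to look up.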
Thank You
Jeppe Nørregaard
Automating unsupervised learning
DABAI
Frans Zdyb
Data
Insight
Domain knowledge
• Preprocessing
– Load into memory
– Online stream
– Cluster
– Sanitize input
– Vector embedding
– Outlier detection
– Choose a loss function
– Specify labels
• Modeling
– Formulate priors
– Transfer learning
– Meta learning
– Engineer features
– Learn model parameters
– Tune hyperparameters
– Build an ensemble
• Evaluation
– Measure model fit
– Measure generalization performance
– Measure robustness
– Measure scalability
• Explanation
– Visualisations
– Case-based explanations
– Report generation
Informed decisions
Machine Learning as a Service
Supervised learning finds predictive relations between variables.
There are systems that do this automatically.
Auto-sklearn [1], a wrapper around scikit-learn, uses meta-learning, Bayesian optimization and ensemble building to outperform the state of the art on the ChaLearn AutoML Challenge.
Classification works really well. Regression is coming along nicely.
[1] “Efficient and Robust Automated Machine Learning”, Feurer et al., 2015
Unsupervised learning finds generalizable dependencies between variables.
Automating it is largely unexplored territory.
Hypothesis: we can use Bayesian Optimization to tune unsupervised models
● Generalize to unseen data
● Robust to different training sets
● Detect outliers
● Aid in supervised learning
Bayesian Optimization with Gaussian Process
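The mechanics of Bayesian optimization with a Gaussian process can be sketched in a few lines of NumPy. This is a hand-rolled illustration and does not use the GPyOpt API: a GP with an RBF kernel models the objective from the evaluations made so far, and the next hyperparameter value is the one that maximises expected improvement. The objective `stand_in_loss` is a hypothetical placeholder for an expensive score such as held-out likelihood of an unsupervised model; the kernel length scale, noise level and candidate grid are also assumptions.

```python
import math
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement over the best value so far, for minimisation."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def bayes_opt(objective, low, high, n_init=3, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    xs = rng.uniform(low, high, size=n_init)
    ys = np.array([objective(x) for x in xs])
    grid = np.linspace(low, high, 501)  # candidate hyperparameter values
    for _ in range(n_iter):
        # Standardise the observed losses so the unit-variance GP prior is reasonable.
        scaled = (ys - ys.mean()) / (ys.std() + 1e-12)
        mean, std = gp_posterior(xs, scaled, grid)
        x_next = grid[expected_improvement(mean, std, scaled.min()).argmax()]
        xs = np.append(xs, x_next)
        ys = np.append(ys, objective(x_next))
    return xs[ys.argmin()], ys.min(), xs, ys


def stand_in_loss(x):
    """Hypothetical validation loss with its minimum at x = 2."""
    return (x - 2.0) ** 2 + 1.0


best_x, best_y, xs, ys = bayes_opt(stand_in_loss, low=0.0, high=5.0)
```

The appeal for model tuning is that each objective evaluation means training a model, so spending a little computation on choosing the next trial is cheap by comparison.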
Python + NumPy + SciPy
TensorFlow for distributed numerical computing and automatic differentiation
Edward [2] for probabilistic modeling, built on top of TensorFlow
● Graphical models
● Neural networks
● Bayesian non-parametrics
● Variational Inference
● MCMC
GPyOpt [3] for Bayesian Optimization
● Easy to use
● Parallel
● Up to date
[2] Edward: A library for probabilistic modeling, inference, and criticism, 2016, edwardlib.org
[3] GPyOpt: A Bayesian Optimization framework in python, 2016, sheffieldml.github.io/GPyOpt/
Thank you!