Machine Learning for (JVM) Developers
Machine learning for (JVM) developers
Mateusz Dymczyk Software Engineer
H2O.ai
11th May 2016
Say who?
• Software Engineer @ H2O.ai
• Ph.D. drop-out (AGH in Krakow)
• ex Fujitsu Laboratories research trainee
Say what?
• Status quo of data
• Why Machine Learning?
• Intro to Machine Learning
• Machine Learning and the JVM
• Machine Learning Demo
The state of data
Exponential growth
Data source
Data collection Data storage
Simple analytics
Data processing
Ideas
Healthcare
• Alerting from real-time data
• Similarity search
Retail
• Recommendations
• Store layout
• Ad targeting
Insurance/banking
• Stock price predictions
• Anomaly/fraud detection
• Automatic investments
https://www.kaggle.com/wiki/DataScienceUseCases
Machine Learning
Definition
“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”
Simply speaking…
• Subfield of Artificial Intelligence which…
• Tries to find patterns in data using…
• Math, statistics, probability, optimisation theory etc. to create…
• Model which can be used to predict values or cluster
• Theoretical concept with many implementations
Basic terminology
Observations
Observations are objects which are used for learning and evaluation: anything that can be described using quantitative features.
{"title":"Emailschema","type":"object","properties":{"age":{"type":"float"},"rooms":{"type":"int"},"size":{"type":"float"},"location":{"type":"string"}}}
Feature
Feature is a quantitative trait that (partially) represents an observation.
Feature vector is an n-dimensional vector of features that represents an observation.
Feature extraction vs. feature selection
{"title":"Emailschema","type":"object","properties":{"age":{"type":"float"},"rooms":{"type":"int"},"size":{"type":"float"},"location":{"type":"string"}}}
[5,3,60.5]
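The mapping from an observation to the feature vector [5,3,60.5] can be sketched in a few lines of Scala. This is only an illustration: the `House` case class and the `FeatureExtraction` object are hypothetical names that mirror the schema above, and dropping the `location` string is an assumption about how the numeric vector was obtained (a simple form of feature selection).

```scala
// Hypothetical observation type mirroring the schema above.
final case class House(age: Double, rooms: Int, size: Double, location: String)

object FeatureExtraction {
  // Keep only the numeric traits, in a fixed order: age, rooms, size.
  // The `location` string is left out (an assumed feature selection step).
  def toFeatureVector(h: House): Vector[Double] =
    Vector(h.age, h.rooms.toDouble, h.size)
}
```

For example, `FeatureExtraction.toFeatureVector(House(5.0, 3, 60.5, "Krakow"))` yields a vector with the values 5, 3 and 60.5; the location value is made up for the example.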
System
• System is a set of related objects forming a complex whole (e.g. the set of all possible distinct observations)
• In our case, the set of all possible houses
Model
• Model is the description of a system using mathematical concepts/language.
• Result of a machine learning technique
• Can be used for predictions/clustering
• Online or offline
Supervised Learning
• User needs to know:
  o the structure of the data
  o possible outputs
• Sample data has to be labeled for training
Classification
• Required:
  o all possible labels
  o already labeled samples
• Output: predicted label for new inputs
• Examples:
  o spam classification based on email content
  o gender classification based on physical features
Regression
• Required:
  o samples with actual values associated
• Output: predicted values for new inputs
• Examples:
  o price prediction based on historical prices
Unsupervised Learning
• Doesn’t require the user to know what the output should be
• No labelling necessary by the user
• Useful for finding structure in data
• Examples:
  o grouping users (clustering)
Clustering
• Required:
  o data, no labelling necessary
• Output: data grouped into clusters
• Examples:
  o grouping users with similar tastes
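As a rough illustration of "data in, clusters out", here is a minimal one-dimensional k-means sketch in plain Scala. The slides do not name a clustering algorithm, so k-means is a choice made for this example; `KMeans1D`, its methods and the fixed number of iterations are all hypothetical, and each point could stand for a single numeric trait of a user.

```scala
object KMeans1D {
  // Assign each point to the index of its nearest centroid.
  def assign(points: Seq[Double], centroids: Seq[Double]): Seq[Int] =
    points.map(p => centroids.indices.minBy(i => math.abs(p - centroids(i))))

  // Repeat two steps a fixed number of times: assign points to centroids,
  // then move every centroid to the mean of the points assigned to it.
  def fit(points: Seq[Double], initial: Seq[Double], iterations: Int): Seq[Double] =
    (1 to iterations).foldLeft(initial) { (centroids, _) =>
      val labels = assign(points, centroids)
      centroids.indices.map { i =>
        val members = points.zip(labels).collect { case (p, l) if l == i => p }
        // An empty cluster keeps its previous centroid.
        if (members.isEmpty) centroids(i) else members.sum / members.size
      }
    }
}
```

No labels are passed in anywhere: the grouping comes only from the distances between the points, which is what makes this unsupervised.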
Types of machine learning
e.g. regression, when you want to predict a real number
e.g. clustering, when you want to cluster or have too much data
e.g. classification, when you want to assign to a category
e.g. association analysis, when you want to find relations between data
Generic flow
TRAINING: Raw data → Feature extraction → Machine learning magic → Model
PREDICTING: Incoming new data → Feature extraction → Model → Predictions/clusters
Validation
• How do we know the model is good?
• Cross validation:
  o divide the data into training and testing subsets (sometimes a third one is necessary)
  o train using the training set, validate using the testing set
  o do those splits multiple times and take the average!
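The split, train, validate and average loop can be sketched as k-fold cross validation in plain Scala. This is a minimal sketch under assumptions: `CrossValidation`, `kFold` and `crossValidate` are hypothetical names, and `trainAndScore` is a placeholder callback standing in for whatever model and metric are actually used. In practice the data would usually be shuffled before splitting.

```scala
object CrossValidation {
  // Split `data` into k folds; each fold serves once as the test set
  // while the remaining folds together form the training set.
  def kFold[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k > 1 && k <= data.size, "k must be between 2 and the number of samples")
    val folds = data.zipWithIndex.groupBy { case (_, i) => i % k }
    (0 until k).map { f =>
      val test = folds(f).map(_._1)
      val train = (0 until k).filter(_ != f).flatMap(i => folds(i).map(_._1))
      (train, test)
    }
  }

  // Average a score over all train/test splits.
  // `trainAndScore` is a hypothetical callback: it trains on the first
  // argument and returns a quality score measured on the second.
  def crossValidate[A](data: Seq[A], k: Int)(trainAndScore: (Seq[A], Seq[A]) => Double): Double = {
    val scores = kFold(data, k).map { case (train, test) => trainAndScore(train, test) }
    scores.sum / scores.size
  }
}
```

Every sample ends up in the test set exactly once, so the averaged score reflects how the model behaves on data it was not trained on.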
Common pitfalls
• Overfitting
• Underfitting
ML and the JVM
The tools…
• SMILE
• Weka
• Mahout
• Deeplearning4j/s
• TridentML (Storm)
• MLlib (Spark)
• FlinkML (Flink)
• H2O
Spark?
• Distributed, fast, in-memory computational framework
• Based on RDDs (Resilient Distributed Dataset: abstract, immutable, distributed, easily rebuilt data format)
• Support for Scala, Java, Python and R
• Focuses on well-known methods (map(), flatMap(), filter(), reduce() …)
Spark?
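import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("Spark App")
val sc = new SparkContext(conf)

// Read the input as an RDD of lines
val textFile: RDD[String] = sc.textFile("hdfs://...")

// Split each line into words, then sum the occurrences of every word
val counts: RDD[(String, Int)] = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// counts holds one entry per distinct word
println(s"Found ${counts.count()} distinct words")
counts.saveAsTextFile("hdfs://...")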
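To see what the operators in the Spark word count do without a cluster, the same pipeline can be mimicked on ordinary Scala collections. This is only a local analogy, not Spark code: `LocalWordCount` is a hypothetical name, and `groupBy` followed by a per-key sum stands in for `reduceByKey`, which plain collections do not have.

```scala
object LocalWordCount {
  // Same shape as the Spark job: flatMap lines to words,
  // map each word to (word, 1), then sum the ones per word.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

The difference in Spark is that each step runs in parallel over partitions of the RDD, while the code the developer writes keeps this familiar collection style.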
Why Spark/MLlib
PROS
• extensive community, part of Spark (Databricks support)
• Java, Scala, Python, R APIs
• solid implementation of most popular algorithms
• easy to use, well documented, multitude of examples
• fast and robust
CONS
• only Spark
• very young
• mainly simple algorithms
• still pretty “low level”
Demos
Price prediction
TRAINING: Raw house data → Feature extraction → Logistic regression modelling → Model
PREDICTING: Incoming new data → Feature extraction → Model → Predicted price
Date Open
26 708.58
25 700.01
24 688.92
23 701.45
22 707.45
19 695.03
18 710
17 699
16 692.98
12 690.26
11 675
10 686.86
9 672.32
8 667.85
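As a rough idea of what predicting a real number from such a table involves, the sketch below fits a straight line to the (Date, Open) pairs with ordinary least squares in plain Scala. This is a swapped-in illustration and not necessarily the model used in the demo; `OpenPriceTrend` and its methods are hypothetical names, and treating the Date column as a plain day number is an assumption.

```scala
object OpenPriceTrend {
  // (day, opening price) pairs copied from the table above.
  val samples: Seq[(Double, Double)] = Seq(
    26.0 -> 708.58, 25.0 -> 700.01, 24.0 -> 688.92, 23.0 -> 701.45,
    22.0 -> 707.45, 19.0 -> 695.03, 18.0 -> 710.0, 17.0 -> 699.0,
    16.0 -> 692.98, 12.0 -> 690.26, 11.0 -> 675.0, 10.0 -> 686.86,
    9.0 -> 672.32, 8.0 -> 667.85
  )

  // Ordinary least squares for y = intercept + slope * x.
  // Returns the pair (intercept, slope).
  def fit(data: Seq[(Double, Double)]): (Double, Double) = {
    val n = data.size
    val meanX = data.map(_._1).sum / n
    val meanY = data.map(_._2).sum / n
    val covXY = data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val slope = covXY / varX
    (meanY - slope * meanX, slope)
  }

  // Evaluate the fitted line at x.
  def predict(model: (Double, Double), x: Double): Double =
    model._1 + model._2 * x
}
```

A call such as `OpenPriceTrend.predict(OpenPriceTrend.fit(OpenPriceTrend.samples), 27.0)` would extrapolate the line one day past the table; a real model would use more features than the day number alone.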
Spam classification
TRAINING: Raw spam emails / Raw ok emails → Feature extraction (each) → Logistic regression modelling → Model
PREDICTING: Incoming emails → Feature extraction → Model → Spam/not spam
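To make the "logistic regression modelling" step less of a black box, here is a minimal sketch of logistic regression trained with batch gradient descent in plain Scala. It is not MLlib's implementation; `TinyLogisticRegression`, the learning rate and the number of epochs are hypothetical choices, and the input is assumed to be numeric feature vectors already extracted from the emails, labeled 1.0 for spam and 0.0 for ok.

```scala
object TinyLogisticRegression {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability that feature vector x belongs to the positive class (spam).
  def predict(weights: Vector[Double], bias: Double, x: Vector[Double]): Double =
    sigmoid(weights.zip(x).map { case (w, xi) => w * xi }.sum + bias)

  // Batch gradient descent on the log-loss.
  // Returns the learned (weights, bias).
  def train(data: Seq[(Vector[Double], Double)],
            learningRate: Double,
            epochs: Int): (Vector[Double], Double) = {
    val dim = data.head._1.size
    val start = (Vector.fill(dim)(0.0), 0.0)
    (1 to epochs).foldLeft(start) { case ((w, b), _) =>
      // Prediction error of every sample under the current parameters.
      val errors = data.map { case (x, y) => (x, predict(w, b, x) - y) }
      val gradW = Vector.tabulate(dim)(j => errors.map { case (x, e) => e * x(j) }.sum / data.size)
      val gradB = errors.map(_._2).sum / data.size
      (w.zip(gradW).map { case (wj, g) => wj - learningRate * g }, b - learningRate * gradB)
    }
  }
}
```

Classifying a new email then means extracting its features the same way as during training and checking whether `predict` returns more than 0.5.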
Word representation
• Some algorithms are ok with strings
• Stopword extraction, form normalisation
• Many approaches to transform into numerical values:
  o set of words
  o bag of words (TF)
  o TF-IDF
  o ...
Term frequency
All terms   i   love   like   cake   pie   cookies
Document1   1   0      1      1      0     0
Document2   1   1      0      0      1     0
Document3   1   1      0      0      0     1
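The rows of the term frequency table can be reproduced with a few lines of plain Scala. This is a minimal sketch: `TermFrequency`, `tokenize` and `vectorize` are hypothetical names, the whitespace tokenizer is deliberately naive, and the example sentence is inferred from the counts in the table rather than given in the slides.

```scala
object TermFrequency {
  // Lower-case and split on whitespace; a deliberately naive tokenizer.
  def tokenize(doc: String): Seq[String] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Count how often each vocabulary term occurs in the document,
  // in the order given by the vocabulary.
  def vectorize(vocabulary: Seq[String], doc: String): Vector[Int] = {
    val tokens = tokenize(doc)
    vocabulary.map(term => tokens.count(_ == term)).toVector
  }
}
```

With the vocabulary `Seq("i", "love", "like", "cake", "pie", "cookies")`, the call `TermFrequency.vectorize(vocabulary, "I like cake")` gives the counts 1, 0, 1, 1, 0, 0, matching the Document1 row.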
What next?
• Get ideas:
  o https://www.kaggle.com/wiki/DataScienceUseCases
• Learn the basics:
  o https://www.coursera.org/learn/machine-learning
  o https://work.caltech.edu/telecourse.html
• Get started with MLlib:
  o http://spark.apache.org/docs/latest/mllib-guide.html
  o https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• Try out other frameworks and courses:
  o https://github.com/h2oai/sparkling-water
  o https://www.coursera.org/course/mmds
• Practical books:
  o “Advanced Analytics with Spark”, Sandy Ryza et al., O’Reilly Media
  o “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark”, Mahmoud Parsian, O’Reilly Media
Q&A