Machine Learning for (JVM) Developers

Machine learning for (JVM) developers Mateusz Dymczyk Software Engineer H2O.ai 11th May 2016

Transcript of Machine Learning for (JVM) Developers

Page 1: Machine Learning for (JVM) Developers

Machine learning for (JVM) developers

Mateusz Dymczyk Software Engineer

H2O.ai

11th May 2016

Page 2: Machine Learning for (JVM) Developers

Say who?

• Software Engineer @ H2O.ai
• Ph.D. drop-out (AGH in Krakow)
• ex Fujitsu Laboratories research trainee

Page 3: Machine Learning for (JVM) Developers

Say what?

• Status quo of data
• Why Machine Learning?
• Intro to Machine Learning
• Machine Learning and the JVM
• Machine Learning Demo

Page 4: Machine Learning for (JVM) Developers

The state of data

Page 5: Machine Learning for (JVM) Developers

Exponential growth

Page 6: Machine Learning for (JVM) Developers

Data source

Data collection

Data storage

Simple analytics

Data processing

Page 7: Machine Learning for (JVM) Developers

Ideas

Retail
• Recommendations
• Store layout
• Ad targeting

Healthcare
• Alerting from real-time data
• Similarity search

Insurance/banking
• Stock price predictions
• Anomaly/fraud detection
• Automatic investments

https://www.kaggle.com/wiki/DataScienceUseCases

Page 8: Machine Learning for (JVM) Developers

Machine Learning

Page 9: Machine Learning for (JVM) Developers

Definition

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”

Page 10: Machine Learning for (JVM) Developers

Simply speaking…

• Subfield of Artificial Intelligence which…
• Tries to find patterns in data using…
• Math, statistics, probability, optimisation theory etc. to create…
• Model which can be used to predict values or cluster
• Theoretical concept with many implementations

Page 11: Machine Learning for (JVM) Developers

Basic terminology

Page 12: Machine Learning for (JVM) Developers

Observations

Observations are objects which are used for learning and evaluation: anything that can be described using quantitative features.

{"title":"Emailschema","type":"object","properties":{"age":{"type":"float"},"rooms":{"type":"int"},"size":{"type":"float"},"location":{"type":"string"}}}

Page 13: Machine Learning for (JVM) Developers

Feature

A feature is a quantitative trait that (partially) represents an observation.

A feature vector is an n-dimensional vector of features that represents an observation.

Feature extraction vs. feature selection

{"title":"Emailschema","type":"object","properties":{"age":{"type":"float"},"rooms":{"type":"int"},"size":{"type":"float"},"location":{"type":"string"}}}

[5,3,60.5]
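
As a rough illustration of how an observation becomes a feature vector, the sketch below maps a house with the fields from the schema above to the vector [5, 3, 60.5]. The names `House` and `toFeatureVector`, the choice to drop the `location` string, and the sample location are assumptions made for this example, not something the slides specify.

```scala
object HouseFeatures {
  // One observation: a house described by the fields from the schema above.
  final case class House(age: Double, rooms: Int, size: Double, location: String)

  // Keep only the numeric fields and place them in a fixed order,
  // so every house maps to a vector with the same layout.
  // Assumption: the string field `location` is left out of the vector.
  def toFeatureVector(house: House): Vector[Double] =
    Vector(house.age, house.rooms.toDouble, house.size)
}
```

Calling `HouseFeatures.toFeatureVector(HouseFeatures.House(5.0, 3, 60.5, "Krakow"))` yields a three-element vector; a real pipeline would usually also encode `location` numerically instead of discarding it.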

Page 14: Machine Learning for (JVM) Developers

System

• A system is a set of related objects forming a complex whole (e.g. the set of all possible distinct observations)
• In our case, the set of all possible houses

Page 15: Machine Learning for (JVM) Developers

Model

• A model is the description of a system using mathematical concepts/language
• Result of a machine learning technique
• Can be used for predictions/clustering
• Online or offline

Page 16: Machine Learning for (JVM) Developers

Supervised Learning

• User needs to know:
  • the structure of the data
  • possible outputs
• Sample data has to be labeled for training

Page 17: Machine Learning for (JVM) Developers

Classification

• Required:
  • all possible labels
  • already labeled samples
• Output: predicted label for new inputs
• Examples:
  • spam classification based on email content
  • gender classification based on physical features

Page 18: Machine Learning for (JVM) Developers

Regression

• Required:
  • samples with actual values associated
• Output: predicted values for new inputs
• Examples:
  • price prediction based on historical prices

Page 19: Machine Learning for (JVM) Developers

Unsupervised Learning

• Doesn’t require the user to know what the output should be
• No labelling necessary by the user
• Useful for finding structure in data
• Examples:
  • grouping users (clustering)

Page 20: Machine Learning for (JVM) Developers

Clustering

• Required:
  • data, no labelling necessary
• Output: data grouped into clusters
• Examples:
  • grouping users with similar tastes
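
To make "data grouped into clusters" concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Scala. The slides do not name a clustering algorithm, so k-means is an assumption picked for illustration, as are the object and method names and the idea of passing the initial centroids in by hand.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def distanceSq(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the given point.
  def nearest(point: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => distanceSq(point, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Vector[Point]): Point =
    points.transpose.map(column => column.sum / column.size)

  // Alternate between assigning points to the nearest centroid and
  // moving each centroid to the mean of the points assigned to it.
  // Returns the cluster index of every input point.
  def cluster(points: Vector[Point], initial: Vector[Point], iterations: Int): Vector[Int] = {
    var centroids = initial
    for (_ <- 1 to iterations) {
      val groups = points.groupBy(p => nearest(p, centroids))
      // A centroid that attracted no points stays where it was.
      centroids = centroids.indices.toVector.map { i =>
        groups.get(i).map(mean).getOrElse(centroids(i))
      }
    }
    points.map(p => nearest(p, centroids))
  }
}
```

No labels are supplied anywhere: the grouping comes only from distances between feature vectors, which is what makes this unsupervised.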

Page 21: Machine Learning for (JVM) Developers

Types of machine learning

• e.g. regression, when you want to predict a real number
• e.g. clustering, when you want to cluster or have too much data
• e.g. classification, when you want to assign to a category
• e.g. association analysis, when you want to find relations between data

Page 22: Machine Learning for (JVM) Developers

Generic flow

TRAINING: Raw data → Feature extraction → Machine learning magic → Model

PREDICTING: Incoming new data → Feature extraction → Model → Predictions/clusters

Page 23: Machine Learning for (JVM) Developers

Validation

• How do we know the model is good?
• Cross validation:
  • divide the data into training and testing subsets (sometimes a third one is necessary)
  • train using the training set, validate using the testing set
  • do those splits multiple times and take the average!
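
The cross-validation steps above can be sketched in a few lines of plain Scala. This is an illustration under assumptions: the k-fold scheme, the names `kFoldSplits` and `crossValidate`, and the `trainAndScore` callback are hypothetical and stand in for whatever model and metric is being evaluated.

```scala
object CrossValidation {
  // Split the data into k folds; each fold serves once as the testing set
  // while the remaining folds together form the training set.
  // Returns one (training, testing) pair per fold.
  def kFoldSplits[A](data: Vector[A], k: Int): Vector[(Vector[A], Vector[A])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the number of samples")
    val indexed = data.zipWithIndex
    (0 until k).toVector.map { fold =>
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }

  // Train and score once per split, then average the scores.
  // `trainAndScore` is a hypothetical callback: it receives the training
  // and testing sets and returns a quality score for that split.
  def crossValidate[A](data: Vector[A], k: Int)(
      trainAndScore: (Vector[A], Vector[A]) => Double): Double = {
    val scores = kFoldSplits(data, k).map { case (train, test) => trainAndScore(train, test) }
    scores.sum / scores.size
  }
}
```

In practice the data would normally be shuffled before splitting so that each fold is representative; that step is left out here to keep the sketch deterministic.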

Page 24: Machine Learning for (JVM) Developers

Common pitfalls

• Overfitting

• Underfitting

Page 25: Machine Learning for (JVM) Developers

ML and the JVM

Page 26: Machine Learning for (JVM) Developers

The tools…

• SMILE
• Weka
• Mahout
• Deeplearning4j/s
• TridentML (Storm)
• MLlib (Spark)
• FlinkML (Flink)
• H2O

Page 27: Machine Learning for (JVM) Developers

Spark?

• Distributed, fast, in-memory computational framework
• Based on RDDs (Resilient Distributed Dataset: abstract, immutable, distributed, easily rebuilt data format)
• Support for Scala, Java, Python and R
• Focuses on well-known methods (map(), flatMap(), filter(), reduce() …)

Page 28: Machine Learning for (JVM) Developers

Spark?

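import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("Spark App")
val sc = new SparkContext(conf)

// Read the input as an RDD of lines
val textFile: RDD[String] = sc.textFile("hdfs://...")

// Split lines into words, pair each word with 1, then sum the counts per word
val counts: RDD[(String, Int)] = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// count() returns the number of (word, count) pairs, i.e. distinct words
println(s"Found ${counts.count()}")
counts.saveAsTextFile("hdfs://...")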

Page 29: Machine Learning for (JVM) Developers

Why Spark/MLlib

PROS
• extensive community, part of Spark (Databricks support)
• Java, Scala, Python, R APIs
• solid implementation of most popular algorithms
• easy to use, well documented, multitude of examples
• fast and robust

CONS
• only Spark
• very young
• mainly simple algorithms
• still pretty “low level”

Page 30: Machine Learning for (JVM) Developers

Demos

Page 31: Machine Learning for (JVM) Developers

Price prediction

TRAINING: Raw house data → Feature extraction → Logistic regression modelling → Model

PREDICTING: Incoming new data → Feature extraction → Model → Predicted price

Page 32: Machine Learning for (JVM) Developers

Date  Open
26    708.58
25    700.01
24    688.92
23    701.45
22    707.45
19    695.03
18    710
17    699
16    692.98
12    690.26
11    675
10    686.86
9     672.32
8     667.85
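
One simple way to turn the table above into a predictive model is to fit a straight line through the (Date, Open) pairs with ordinary least squares. The sketch below is an assumption-laden illustration: it treats Date as a plain number, uses linear regression because the target is a real number, and the names `PriceTrend`, `fitLine` and `predict` are made up for this example. It is not the code behind the demo.

```scala
object PriceTrend {
  // (Date, Open) pairs copied from the table above.
  val observations: Vector[(Double, Double)] = Vector(
    26.0 -> 708.58, 25.0 -> 700.01, 24.0 -> 688.92, 23.0 -> 701.45, 22.0 -> 707.45,
    19.0 -> 695.03, 18.0 -> 710.0, 17.0 -> 699.0, 16.0 -> 692.98, 12.0 -> 690.26,
    11.0 -> 675.0, 10.0 -> 686.86, 9.0 -> 672.32, 8.0 -> 667.85)

  // Ordinary least squares for a single feature: returns (slope, intercept).
  def fitLine(data: Vector[(Double, Double)]): (Double, Double) = {
    val n = data.size
    val meanX = data.map(_._1).sum / n
    val meanY = data.map(_._2).sum / n
    val covariance = data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val variance = data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val slope = covariance / variance
    (slope, meanY - slope * meanX)
  }

  // Predict the opening price for a date using the fitted line.
  def predict(model: (Double, Double), date: Double): Double =
    model._1 * date + model._2
}
```

`PriceTrend.predict(PriceTrend.fitLine(PriceTrend.observations), 27.0)` extrapolates one step past the last row. With only 14 points and a single feature, such a line is easy to underfit or overfit, which is where the validation step described earlier comes in.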

Page 33: Machine Learning for (JVM) Developers

Chart: Open (660 to 710) plotted against Date (0 to 26)

Page 34: Machine Learning for (JVM) Developers

Chart: Open (600 to 800) plotted against Date (0 to 26)

Page 35: Machine Learning for (JVM) Developers

Spam classification

TRAINING: Raw spam emails, raw ok emails → Feature extraction → Logistic regression modelling → Model

PREDICTING: Incoming emails → Feature extraction → Model → Spam/not spam
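
The training and predicting halves of this flow can be sketched with a tiny logistic regression trained by batch gradient descent. This is a from-scratch illustration, not MLlib's implementation: the object name, the learning rate and epoch parameters, and the idea that emails arrive already converted to numeric feature vectors are all assumptions.

```scala
object SpamClassifierSketch {
  // Logistic (sigmoid) function: squashes any score into the range (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability that a feature vector is spam.
  // `weights` holds one weight per feature plus a trailing bias term.
  def probability(weights: Vector[Double], features: Vector[Double]): Double =
    sigmoid(features.zip(weights).map { case (x, w) => x * w }.sum + weights.last)

  // Batch gradient descent on the log-loss.
  // Each sample is (features, label) with label 1.0 for spam and 0.0 for ok.
  def train(samples: Vector[(Vector[Double], Double)],
            learningRate: Double,
            epochs: Int): Vector[Double] = {
    val dims = samples.head._1.size
    var weights = Vector.fill(dims + 1)(0.0)
    for (_ <- 1 to epochs) {
      // Average gradient over all samples; the appended 1.0 feeds the bias.
      val gradient = samples.map { case (x, label) =>
        val error = probability(weights, x) - label
        (x :+ 1.0).map(_ * error)
      }.transpose.map(_.sum / samples.size)
      weights = weights.zip(gradient).map { case (w, g) => w - learningRate * g }
    }
    weights
  }

  // Classify as spam when the predicted probability reaches 0.5.
  def isSpam(weights: Vector[Double], features: Vector[Double]): Boolean =
    probability(weights, features) >= 0.5
}
```

For example, with two hypothetical features (counts of the words "free" and "meeting"), training on a handful of labelled vectors and then calling `isSpam` on a new vector mirrors the TRAINING and PREDICTING rows of the diagram.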

Page 36: Machine Learning for (JVM) Developers

Word representation

• Some algorithms are ok with strings
• Stopword extraction, form normalisation
• Many approaches to transform into numerical values:
  • set of words
  • bag of words (TF)
  • TF-IDF
  • ...

Page 37: Machine Learning for (JVM) Developers

Term frequency

All terms   i  love  like  cake  pie  cookies
Document1   1  0     1     1     0    0
Document2   1  1     0     0     1    0
Document3   1  1     0     0     0    1
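
The rows of the table above can be reproduced with a few lines of plain Scala. The sentences used below ("I like cake", "I love pie", "I love cookies") are reconstructed from the counts and are an assumption, as are the object and method names; splitting on whitespace is a deliberately naive tokenizer.

```scala
object TermFrequency {
  // Vocabulary in the column order of the table above.
  val vocabulary: Vector[String] = Vector("i", "love", "like", "cake", "pie", "cookies")

  // Count how often each vocabulary term occurs in a document.
  // Tokenization is naive: lowercase, then split on whitespace.
  def termFrequencies(document: String): Vector[Int] = {
    val words = document.toLowerCase.split("\\s+").filter(_.nonEmpty)
    vocabulary.map(term => words.count(_ == term))
  }
}
```

Each resulting vector has one slot per vocabulary term, so documents of any length map to feature vectors of the same size, which is what the learning algorithms expect.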

Page 38: Machine Learning for (JVM) Developers

What next?

• Get ideas:
  o https://www.kaggle.com/wiki/DataScienceUseCases
• Learn the basics:
  o https://www.coursera.org/learn/machine-learning
  o https://work.caltech.edu/telecourse.html
• Get started with MLlib:
  o http://spark.apache.org/docs/latest/mllib-guide.html
  o https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• Try out other frameworks and courses:
  o https://github.com/h2oai/sparkling-water
  o https://www.coursera.org/course/mmds
• Practical books:
  o “Advanced Analytics with Spark”, Sandy Ryza et al., O’Reilly Media
  o “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark”, Mahmoud Parsian, O’Reilly Media

Page 39: Machine Learning for (JVM) Developers

Thank you!

@mdymczyk

Mateusz Dymczyk


Page 40: Machine Learning for (JVM) Developers

Q&A