Machine Learning for (JVM) Developers

Mateusz Dymczyk Software Engineer

H2O.ai

11th May 2016

Say who?

• Software Engineer @ H2O.ai • Ph.D. drop-out (AGH in Krakow) • ex Fujitsu Laboratories research trainee

Say what?

• Status quo of data • Why Machine Learning? • Intro to Machine Learning • Machine Learning and the JVM • Machine Learning Demo

The state of data

Exponential growth

Data source

Data collection Data storage

Simple analytics

Data processing

Ideas

Healthcare • Alerting from real time data • Similarity search

Retail • Recommendations • Store layout • Ad targeting

Insurance/banking • Stock price predictions • Anomaly/fraud detection • Automatic investments

https://www.kaggle.com/wiki/DataScienceUseCases

Machine Learning

Definition

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”

Simply speaking…

• Subfield of Artificial Intelligence which… • Tries to find patterns in data using… • Math, statistics, probability, optimisation theory etc. to create… • Model which can be used to predict values or cluster • Theoretical concept with many implementations

Basic terminology

Observations

Observations are objects which are used for learning and evaluation. Anything that can be described using quantitative features.

{"title":"House schema","type":"object","properties":{"age":{"type":"float"},"rooms":{"type":"int"},"size":{"type":"float"},"location":{"type":"string"}}}

Feature

A feature is a quantitative trait that (partially) represents an observation.

A feature vector is an n-dimensional vector of features that represents an observation.

Feature extraction vs. feature selection

{"title":"House schema","type":"object","properties":{"age":{"type":"float"},"rooms":{"type":"int"},"size":{"type":"float"},"location":{"type":"string"}}}

[5,3,60.5]
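The step from an observation to its feature vector can be sketched in a few lines of Scala. This is only an illustration: the `House` case class and the `toFeatureVector` helper are hypothetical names that mirror the schema fields, and dropping `location` is an assumption made to keep the vector purely numeric.

```scala
// Hypothetical observation type mirroring the schema fields above.
final case class House(age: Double, rooms: Int, size: Double, location: String)

object FeatureExtraction {
  // Keep only the numeric traits, in a fixed order: age, rooms, size.
  // The location string is left out here; encoding it would need an extra step.
  def toFeatureVector(h: House): Vector[Double] =
    Vector(h.age, h.rooms.toDouble, h.size)
}
```

For example, `FeatureExtraction.toFeatureVector(House(5.0, 3, 60.5, "Krakow"))` yields the vector shown above, where "Krakow" is just a placeholder value.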

System

• A system is a set of related objects forming a complex whole (e.g. the set of all possible distinct observations) • In our case, the set of all possible houses

Model

• A model is the description of a system using mathematical concepts/language • Result of a machine learning technique • Can be used for predictions/clustering • Online or offline

Supervised Learning

• User needs to know: • the structure of the data • possible outputs

• Sample data has to be labeled for training

Classification

• Required: • all possible labels • already labeled samples

• Output: predicted label for new inputs • Examples: • spam classification based on email content • gender classification based on physical features

Regression

• Required: • samples with actual values associated

• Output: predicted values for new inputs • Examples: • price prediction based on historical prices
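A minimal regression sketch in plain Scala, assuming a single numeric feature and ordinary least squares as the fitting method (the slides do not name a specific algorithm here). The names `SimpleLinearRegression`, `Line` and `fit` are made up for this illustration.

```scala
object SimpleLinearRegression {
  // Fitted line: prediction = intercept + slope * x
  final case class Line(slope: Double, intercept: Double) {
    def predict(x: Double): Double = intercept + slope * x
  }

  // Ordinary least squares for one feature:
  // slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
  def fit(xs: Seq[Double], ys: Seq[Double]): Line = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired samples")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val covXY = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX  = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(varX > 0, "x values must not all be equal")
    val slope = covXY / varX
    Line(slope, meanY - slope * meanX)
  }
}
```

Training corresponds to calling `fit` on samples with known values; predicting corresponds to calling `predict` on a new input.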

Unsupervised Learning

• Doesn’t require the user to know what the output should be

• No labelling necessary by the user • Useful for finding structure in data • Examples: • grouping users (clustering)

Clustering

• Required: • data, no labelling necessary

• Output: data grouped into clusters • Examples: • grouping users with similar tastes
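As one possible illustration of clustering, here is a small sketch of k-means (Lloyd's algorithm) on one-dimensional data. The choice of k-means is an assumption, since the slides do not name an algorithm, and `KMeans1D`, `nearest` and `cluster` are hypothetical names.

```scala
object KMeans1D {
  // Index of the centroid closest to x.
  def nearest(centroids: Vector[Double], x: Double): Int =
    centroids.indices.minBy(i => math.abs(centroids(i) - x))

  // Lloyd's algorithm: assign each point to its nearest centroid,
  // then move every centroid to the mean of its assigned points.
  def cluster(points: Seq[Double], initial: Vector[Double], iterations: Int = 10): Vector[Double] =
    (1 to iterations).foldLeft(initial) { (centroids, _) =>
      val groups = points.groupBy(p => nearest(centroids, p))
      centroids.indices.map { i =>
        // An empty cluster keeps its previous centroid.
        groups.get(i).map(ps => ps.sum / ps.size).getOrElse(centroids(i))
      }.toVector
    }
}
```

No labels are passed in: the only inputs are the data and the starting centroids, and the output is one centroid per discovered group.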

Types of machine learning

• e.g. regression, when you want to predict a real number
• e.g. clustering, when you want to cluster or have too much data
• e.g. classification, when you want to assign to a category
• e.g. association analysis, when you want to find relations between data

Generic flow

TRAINING: Raw data → Feature extraction → Machine learning magic → Model

PREDICTING: Incoming new data → Feature extraction → Model → Predictions/clusters

Validation

• How do we know the model is good? • Cross validation: • divide the data into training and testing subsets (sometimes a third one is necessary) • train using the training set, validate using the testing set • do those splits multiple times and take the average!
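The splitting and averaging described above can be sketched as k-fold cross validation in plain Scala. This is a sketch under assumptions: `CrossValidation`, `kFold` and `averageScore` are hypothetical names, the fixed seed is only there for reproducibility, and the `trainAndScore` function (train on the first set, return a score on the second) has to be supplied by the caller.

```scala
import scala.util.Random

object CrossValidation {
  // Shuffle the data, cut it into k folds, and let each fold be the test set once.
  def kFold[A](data: Seq[A], k: Int, seed: Long = 42L): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the number of samples")
    val shuffled = new Random(seed).shuffle(data)
    val folds = shuffled.zipWithIndex.groupBy { case (_, i) => i % k }
    (0 until k).map { f =>
      val test  = folds(f).map(_._1)
      val train = (0 until k).filter(_ != f).flatMap(j => folds(j).map(_._1))
      (train, test)
    }
  }

  // Average a caller-supplied score over all k train/test splits.
  def averageScore[A](data: Seq[A], k: Int)(trainAndScore: (Seq[A], Seq[A]) => Double): Double = {
    val scores = kFold(data, k).map { case (train, test) => trainAndScore(train, test) }
    scores.sum / scores.size
  }
}
```

Every sample ends up in a test set exactly once, so the averaged score is not tied to one lucky or unlucky split.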

Common pitfalls

• Overfitting

• Underfitting

ML and the JVM

The tools…

• SMILE • Weka • Mahout • Deeplearning4j/s • TridentML (Storm) • MLlib (Spark) • FlinkML (Flink) • H2O

Spark?

• Distributed, fast, in-memory computational framework

• Based on RDDs (Resilient Distributed Dataset: abstract, immutable, distributed, easily rebuilt data format)

• Support for Scala, Java, Python and R • Focuses on well known methods (map(), flatMap(), filter(), reduce() …)

Spark?

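import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("Spark App")
val sc = new SparkContext(conf)

// Read the input as an RDD of lines
val textFile: RDD[String] = sc.textFile("hdfs://...")

// Split every line into words and count the occurrences of each word
val counts: RDD[(String, Int)] = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(s"Found ${counts.count()} distinct words")
counts.saveAsTextFile("hdfs://...")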

Why Spark/MLlib

PROS • extensive community, part of Spark (Databricks support) • Java, Scala, Python, R APIs • solid implementation of most popular algorithms • easy to use, well documented, multitude of examples • fast and robust

CONS • only Spark • very young • mainly simple algorithms • still pretty “low level”

Price prediction

TRAINING: Raw house data → Feature extraction → Linear regression modelling → Model

PREDICTING: Incoming new data → Feature extraction → Model → Predicted price

Date  Open
26    708.58
25    700.01
24    688.92
23    701.45
22    707.45
19    695.03
18    710
17    699
16    692.98
12    690.26
11    675
10    686.86
9     672.32
8     667.85

[Two charts of the table above: Open (y-axis) against Date (x-axis, 0 to 26), once with the y-axis from 660 to 710 and once from 600 to 800]

Spam classification

TRAINING: Raw spam emails + Raw ok emails → Feature extraction → Logistic regression modelling → Model

PREDICTING: Incoming emails → Feature extraction → Model → Spam/not spam
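To make the "logistic regression modelling" box a little more concrete, here is a rough sketch in plain Scala, assuming the emails have already been turned into numeric feature vectors (for example word counts). It uses plain batch gradient descent, which is not necessarily what MLlib does internally, and `LogisticRegression`, `Model`, `train` and `isSpam` are hypothetical names chosen for this example.

```scala
object LogisticRegression {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  final case class Model(weights: Vector[Double], bias: Double) {
    // Estimated probability that the observation belongs to class 1 (spam).
    def probability(x: Vector[Double]): Double =
      sigmoid(weights.zip(x).map { case (w, xi) => w * xi }.sum + bias)

    def isSpam(x: Vector[Double]): Boolean = probability(x) > 0.5
  }

  // Batch gradient descent on the log loss; labels are 1.0 (spam) or 0.0 (ok).
  def train(xs: Seq[Vector[Double]], ys: Seq[Double],
            learningRate: Double = 0.5, iterations: Int = 1000): Model = {
    require(xs.nonEmpty && xs.size == ys.size, "need one label per sample")
    val n = xs.size
    val dims = xs.head.size
    (1 to iterations).foldLeft(Model(Vector.fill(dims)(0.0), 0.0)) { (m, _) =>
      // Prediction error per sample: predicted probability minus label.
      val errors = xs.zip(ys).map { case (x, y) => m.probability(x) - y }
      val gradW = (0 until dims).map(j => xs.zip(errors).map { case (x, e) => e * x(j) }.sum / n)
      val gradB = errors.sum / n
      Model(
        m.weights.zip(gradW).map { case (w, g) => w - learningRate * g },
        m.bias - learningRate * gradB
      )
    }
  }
}
```

With made-up two-dimensional features such as (count of "free", count of "meeting"), training on a few labeled vectors and then calling `isSpam` on a new vector mirrors the TRAINING and PREDICTING paths of the diagram.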

Word representation

• Some algorithms are ok with strings • Stopword extraction, form normalisation • Many approaches to transform into numerical values: • set of words • bag of words (TF) • TF-IDF • ...

Term frequency

All terms   i   love   like   cake   pie   cookies
Document1   1   0      1      1      0     0
Document2   1   1      0      0      1     0
Document3   1   1      0      0      0     1
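A term-frequency table like the one above can be computed with a few lines of Scala. This is only a sketch: `TermFrequency`, `tokenize` and `vector` are hypothetical names, the tokenizer is deliberately naive (lower-casing and splitting on whitespace), and the example sentences in the usage note are guesses that happen to reproduce the rows of the table.

```scala
object TermFrequency {
  // Lower-case and split on whitespace; no stopword handling or normalisation.
  def tokenize(doc: String): Seq[String] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Count how often each vocabulary term occurs in the document,
  // in the order given by the vocabulary.
  def vector(vocabulary: Seq[String], doc: String): Vector[Int] = {
    val counts = tokenize(doc).groupBy(identity).map { case (t, ts) => t -> ts.size }
    vocabulary.map(t => counts.getOrElse(t, 0)).toVector
  }
}
```

With the vocabulary `Seq("i", "love", "like", "cake", "pie", "cookies")`, the assumed documents "I like cake", "I love pie" and "I love cookies" map to the three rows of the table.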

What next?

• Get ideas: o https://www.kaggle.com/wiki/DataScienceUseCases

• Learn the basics: o https://www.coursera.org/learn/machine-learning o https://work.caltech.edu/telecourse.html

• Get started with MLlib: o http://spark.apache.org/docs/latest/mllib-guide.html o https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

• Try out other frameworks and courses: o https://github.com/h2oai/sparkling-water o https://www.coursera.org/course/mmds

• Practical books: o “Advanced Analytics with Spark” — Sandy Ryza et al. O’Reilly Media o “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark” — Mahmoud Parsian, O’Reilly Media

Thank you!

@mdymczyk

Mateusz Dymczyk

