Fortune Teller API - Doing Data Science with Apache Spark

Page 1: Fortune Teller API - Doing Data Science with Apache Spark

THE FORTUNE TELLER API

Bas Geerdink

Doing data science with Apache Spark

Page 2: Fortune Teller API - Doing Data Science with Apache Spark

ABOUT ME

@bgeerdink

Page 3: Fortune Teller API - Doing Data Science with Apache Spark

TODAY’S MISSION: TO PREDICT THE FUTURE…

• Data Science
• Spark and MLlib
• API

Page 4: Fortune Teller API - Doing Data Science with Apache Spark

DATA SCIENCE

Process:
1. Formulate a question
2. Gather data
3. Model data
4. Create data product

Source: Drew Conway, The Data Science Venn Diagram, 2013

Page 5: Fortune Teller API - Doing Data Science with Apache Spark

DATA SCIENCE METHOD

Source: Foundational Methodology for Data Science, IBM, 2015

1. Formulate a question
2. Gather data
3. Analyze data
4. Product

Page 6: Fortune Teller API - Doing Data Science with Apache Spark

DATA SCIENCE METHOD

1. Formulate a question

Page 7: Fortune Teller API - Doing Data Science with Apache Spark

BUSINESS PROBLEM

Fortune Teller at the circus

Input:
• Glass ball
• Lines on hand
• Star sign
• Astrology
• Tarot cards

Output:
• Vague prediction about future

Product Owner: “We should be able to do better than this!”

Page 8: Fortune Teller API - Doing Data Science with Apache Spark
Page 9: Fortune Teller API - Doing Data Science with Apache Spark

HOW TO CALCULATE HAPPINESS

Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education (yes / no)

Output: (the happiness score)
• Health
  • Life expectancy
  • Disease
• Wealth
  • Poverty yes or no
  • Income
• “Psychological well-being”
  • Enjoyment
  • Stress
  • Anger
  • Worry
  • Sadness

Page 10: Fortune Teller API - Doing Data Science with Apache Spark

DATA SCIENCE METHOD

1. Formulate a question
2. Gather data

Page 11: Fortune Teller API - Doing Data Science with Apache Spark

DATA SOURCES

• Gallup-Healthways Well-Being Index
• The World Bank
• Google Scholar
• www.data.gov
• Global Health Data Exchange
• World Health Organization
• Simple Online Data Archive for Population Studies (Sodapop)
• The World Factbook
• UCI Machine Learning Repository

Page 12: Fortune Teller API - Doing Data Science with Apache Spark

WINNING DATASET

National Health Interview Survey 2012

• 43345 surveys
• 133 questions
• Well documented
• Free to download and use

Page 13: Fortune Teller API - Doing Data Science with Apache Spark

HOW TO CALCULATE HAPPINESS

Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education

Output: (the happiness score)
• Health
  • Life expectancy
  • Disease
• Wealth
  • Poverty yes or no
  • Income
• “Psychological well-being”
  • Enjoyment
  • Stress
  • Anger
  • Worry
  • Sadness

Page 14: Fortune Teller API - Doing Data Science with Apache Spark

DATA SCIENCE METHOD

1. Formulate a question
2. Gather data
3. Analyze data

Page 15: Fortune Teller API - Doing Data Science with Apache Spark

• General purpose computing engine
• In-memory processing
• Support of streaming data, machine learning, graphs
• (much) faster than Hadoop MapReduce

Page 16: Fortune Teller API - Doing Data Science with Apache Spark

• Small player in the (OS) world of Machine Learning: Python and R are leading, followed by SAS, Weka, RapidMiner, …
• It’s just a tool… no solution or holy grail
• “I predict that mean cluster size will remain very close to one until the end of humanity. The vast majority of problems are small. Honestly, the combined utility of PyData and Spark pales in comparison to the utility of Excel.”

Page 17: Fortune Teller API - Doing Data Science with Apache Spark

SPARK OVERVIEW

• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Engine: Spark Core
• Cluster managers: Standalone, YARN, Mesos
• Languages: Scala, Python, R, Java
• Storage: File system, HDFS, HBase, Cassandra

Page 18: Fortune Teller API - Doing Data Science with Apache Spark

SPARK CLUSTER MODE

• Standalone
• Mesos
• YARN

Page 19: Fortune Teller API - Doing Data Science with Apache Spark

DEMO

Page 20: Fortune Teller API - Doing Data Science with Apache Spark

CORRELATION <> CAUSATION
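
A correlation coefficient only measures how strongly two variables move together; it says nothing about which one, if either, causes the other. As a minimal sketch in plain Python (this is not the Spark code from the demo, and the function name `pearson` and the sample values are made up for illustration), a Pearson correlation could be computed like this:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT taken from the survey data.
education_years = [8, 10, 12, 14, 16, 18]
health_score = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]  # perfectly linear in education_years

r = pearson(education_years, health_score)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 in real survey data would still only show that the two answers vary together, not that changing one would change the other.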

Page 21: Fortune Teller API - Doing Data Science with Apache Spark

BIG DATA IS OUT, ML IS IN

Source: Gartner, Hype Cycle for Emerging Technologies, 2015

Page 22: Fortune Teller API - Doing Data Science with Apache Spark

MACHINE LEARNING

The field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)

• Actually, this is… algorithms maximizing scores using a statistical approach to problem solving
• Producing… systems that can learn from and make decisions and predictions based on data

Page 23: Fortune Teller API - Doing Data Science with Apache Spark

MACHINE LEARNING TASKS

Recommendation Using Association Rules (Similarity Matching)

• Predict items that have a high similarity to others within a given set of items.
• Example: Predicting movies or books based on someone’s historic purchase behavior.

Classification

• Predict to which class/category a certain item belongs. These categories are predefined. A classification task can be binary or multi-class.

• Example: Determining whether a message is spam or non-spam (binary); determining characters from a handwriting sample (multi-class).

Regression

• Focus on predicting numeric values.
• Example: Predicting the number of ice cream cones to be sold on a certain day based on weather data.

Clustering

• Divide items into groups, but unlike in classification tasks, these groups are not previously defined.

• Example: Grouping customers based on certain properties to discover customer segments.
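
To make the regression task above concrete, here is a minimal sketch of an ordinary least squares fit in plain Python, using the ice cream example. It is an illustration only, not MLlib and not the code from the demo; the function name `fit_line` and all numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of products of deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: temperature in degrees Celsius vs. cones sold.
temperatures_c = [15, 20, 25, 30]
cones_sold = [50, 100, 150, 200]

slope, intercept = fit_line(temperatures_c, cones_sold)

# Predict sales for a day that was not in the training data.
predicted = slope * 35 + intercept
print(f"cones = {slope:.1f} * temperature + {intercept:.1f}; at 35 C: {predicted:.0f}")
```

With a single input variable this is the simplest form of regression; a real model would use several features and be evaluated on data it was not trained on.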

Page 24: Fortune Teller API - Doing Data Science with Apache Spark

PICK AN ALGORITHM…

Page 25: Fortune Teller API - Doing Data Science with Apache Spark

DEMO

Page 26: Fortune Teller API - Doing Data Science with Apache Spark

DATA SCIENCE METHOD

1. Formulate a question
2. Gather data
3. Analyze data
4. Product

Page 27: Fortune Teller API - Doing Data Science with Apache Spark

API DESIGN

• Start Spark server: GET http://fortuneteller/start
• Stop Spark server: GET http://fortuneteller/stop
• Add survey records: POST http://fortuneteller/survey
• Train model: GET http://fortuneteller/train
• Correlations: GET http://fortuneteller/correlations
• Predict Health: GET http://fortuneteller/prediction/health
• Predict Wealth: GET http://fortuneteller/prediction/wealth
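
The endpoints above could be captured in a small client-side lookup table. The sketch below only builds request objects and sends nothing over the network. The methods and paths are copied from the slide; everything else is an assumption: the slide does not say how personal details are passed to the prediction endpoints or what a survey record looks like, so the query parameters `age` and `children`, the JSON body, and the helper name `build_request` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://fortuneteller"  # host name as shown on the slide

# Endpoint name -> (HTTP method, path), copied from the slide.
ENDPOINTS = {
    "start": ("GET", "/start"),
    "stop": ("GET", "/stop"),
    "add_survey": ("POST", "/survey"),
    "train": ("GET", "/train"),
    "correlations": ("GET", "/correlations"),
    "predict_health": ("GET", "/prediction/health"),
    "predict_wealth": ("GET", "/prediction/wealth"),
}


def build_request(name, params=None, body=None):
    """Build (but do not send) a urllib Request for a named endpoint."""
    method, path = ENDPOINTS[name]
    url = BASE_URL + path
    if params:
        # Hypothetical: the slide does not specify query parameters.
        url += "?" + urlencode(params)
    return Request(url, data=body, method=method)


# Hypothetical parameter names, for illustration only.
req = build_request("predict_health", params={"age": 35, "children": 2})
print(req.get_method(), req.full_url)

# Hypothetical request body; the real survey record format is not shown.
post = build_request("add_survey", body=b'{"age": 35}')
print(post.get_method(), post.full_url)
```

Keeping method and path together in one table makes it easy to see that starting, stopping, and training are all exposed as GET requests in this design, even though they change server state.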

Page 28: Fortune Teller API - Doing Data Science with Apache Spark

DEMO

Page 29: Fortune Teller API - Doing Data Science with Apache Spark

Next steps…

Web app?
Deploy to cloud?
Streaming linear regression?

Page 30: Fortune Teller API - Doing Data Science with Apache Spark

Questions?

Page 31: Fortune Teller API - Doing Data Science with Apache Spark

https://github.com/geerdink/FortuneTellerApi