Introduction to Apache Spark and Machine Learning

Introduction to Apache Spark and Machine Learning. Ezekiel Awoyemi, Data Engineer, Andela

Transcript of Introduction to Apache Spark and Machine Learning

  1. Introduction to Apache Spark and Machine Learning. Ezekiel Awoyemi, Data Engineer, Andela
  2. What is Apache Spark? It is an open-source cluster computing framework built around speed, ease of use, and sophisticated analytics, offered as an alternative to other big data technologies such as MapReduce and Storm.
  3. Spark stack
  4. What is Big Data and where does it come from? Sources include: ad impressions; fast-forward, pause and rewind of videos; transactions; social networks; telecommunication networks.
  5. Data Science. Data science aims to derive knowledge from big data, efficiently and intelligently. Nowcasting: for example, Google Flu Trends in February 2010. Forecasting: for example, Princeton University's "Epidemiological modelling of online social network dynamics".
  6. Database/Data Science

     | Elements    | Database                                                 | Data Science                          |
     |-------------|----------------------------------------------------------|---------------------------------------|
     | Priorities  | Consistency, error recovery, auditability                | Speed, availability, query richness   |
     | Data value  | Precious                                                 | Cheap                                 |
     | Data volume | Modest                                                   | Massive                               |
     | Structure   | Strong (schema)                                          | Weak or none (text)                   |
     | Examples    | Bank records, medical records, census, personal records  | Online clicks, GPS logs, tweets, etc. |
     |             | Querying the past                                        | Querying the future                   |

  7. Spark Program Lifecycle: (1) create RDDs from external data, or parallelize a collection in your driver program; (2) lazily transform them into new RDDs; (3) cache() some RDDs for reuse; (4) perform actions to execute the parallel computation and produce results.
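
The lifecycle on this slide depends on lazy evaluation: transformations only record work, and nothing runs until an action is called. Below is a minimal plain-Python sketch of that idea. It is a single-machine toy, not Spark's implementation; `MiniRDD` and the `square` helper are hypothetical names chosen to echo the RDD operations named on the slide.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that yields the records
        self._use_cache = False    # set by cache()
        self._cached = None        # filled in the first time an action runs

    @classmethod
    def parallelize(cls, items):
        # Step 1: create a dataset from a collection in the driver program.
        data = list(items)
        return cls(lambda: iter(data))

    def _iterate(self):
        # Reuse cached records if present; otherwise compute (and maybe cache) them.
        if self._cached is not None:
            return iter(self._cached)
        if self._use_cache:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def map(self, fn):
        # Step 2: a transformation returns a new MiniRDD and does no work yet.
        return MiniRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._iterate() if predicate(x)))

    def cache(self):
        # Step 3: mark this dataset for reuse; it is materialized by the first action.
        self._use_cache = True
        return self

    def collect(self):
        # Step 4: an action triggers the computation and returns results.
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []

def square(x):
    calls.append(x)    # record every evaluation so the laziness is observable
    return x * x

numbers = MiniRDD.parallelize(range(1, 6))
squares = numbers.map(square).cache()
calls_before_action = len(calls)              # nothing has been evaluated so far
even_squares = squares.filter(lambda v: v % 2 == 0)
result = even_squares.collect()               # first action: runs square() on each item
total = squares.count()                       # second action: served from the cache
print(calls_before_action, result, total, len(calls))
```

Because `squares` is cached, the second action does not call `square` again, which is the reason the slide recommends caching RDDs that are reused.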
  8. Machine Learning. Machine learning can be used to solve supervised classification problems: give machines labelled examples and they learn from them. Techniques include collaborative filtering, which is commonly used for recommender systems, and Naive Bayes algorithms, among others.
  9. Examples of machine learning: classification of email as spam; self-driving cars; recommending new songs, movies, etc.
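
Spam classification with Naive Bayes, both mentioned in the slides above, can be sketched in a few lines. The following is a plain-Python multinomial Naive Bayes with add-one smoothing, written only for illustration. Spark's MLlib ships its own Naive Bayes implementation, which this does not reproduce. The training messages, the `train` and `predict` names, and the "spam"/"ham" labels are all assumptions made up for this sketch.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the given text."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue    # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy training data (assumption; not from the talk).
training_data = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(training_data)
spam_guess = predict(model, "claim your free money")
ham_guess = predict(model, "agenda for the team meeting")
print(spam_guess, ham_guess)
```

The classifier learns only from labelled examples, which is what makes this a supervised problem.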
  10. Coding Example 1. The text file is the complete works of William Shakespeare. Steps: count the number of lines in the file; print the first line (the first item in the RDD); count how many lines contain the word "come"; count the number of words in the file; count how many times the word "come" now appears; print the first item in the RDD; print (word, count) pairs.
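
The steps of this example can be sketched in plain Python on a tiny stand-in for the file. In PySpark the same steps would typically map onto RDD operations such as `textFile`, `count`, `first`, `filter`, `flatMap`, `map` and `reduceByKey`. The version below avoids a Spark dependency so it runs anywhere, and the three sample lines are an assumption standing in for the Shakespeare file, which is not included here.

```python
from collections import Counter

# Stand-in for the Shakespeare text file (assumption: lowercased, no punctuation).
lines = [
    "come what come may",
    "time and the hour runs through the roughest day",
    "come let us go",
]

# Count the number of lines and look at the first one.
line_count = len(lines)
first_line = lines[0]

# Keep only lines containing "come". This is a substring match, so a word
# such as "welcome" would also match; the sample text has no such word.
lines_with_come = [line for line in lines if "come" in line]

# Split every line into words and flatten into one list (the flatMap step).
words = [word for line in lines for word in line.split()]
word_count = len(words)

# Count exact occurrences of the word "come" and look at the first word.
come_count = sum(1 for word in words if word == "come")
first_word = words[0]

# Build (word, count) pairs, the equivalent of mapping to (word, 1) and reducing by key.
pairs = Counter(words)
top_pairs = pairs.most_common(2)

print(line_count, first_line)
print(len(lines_with_come), word_count, come_count, first_word)
print(top_pairs)
```

Counting lines that contain "come" and counting occurrences of "come" give different answers here (2 lines, 3 occurrences), which is why the example splits the lines into words before counting again.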
  11. Coding Example 2. We have 1,000,209 ratings from 6,040 users on 3,706 movies, collected by MovieLens. Steps: take a small set of the movies that have received the most ratings from users in the MovieLens dataset; get a fellow to rate those movies from 1 (poor) to 5 (best), or 0 if not seen; make movie recommendations.
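
To make the recommendation step concrete, here is a small plain-Python sketch of user-based collaborative filtering with cosine similarity. It is a neighbourhood method chosen for brevity, not the ALS matrix factorisation that Spark's MLlib provides, and not necessarily what the talk used. The ratings, the user and movie names, and the `similarity` and `recommend` functions are all made up for illustration; only the 1 to 5 scale with 0 meaning "not seen" comes from the slide.

```python
import math

# Made-up ratings (assumption): user -> {movie: rating on the 1-5 scale}.
ratings = {
    "user_a": {"Movie 1": 5, "Movie 2": 4, "Movie 3": 1, "Movie 4": 5},
    "user_b": {"Movie 1": 1, "Movie 2": 2, "Movie 3": 5, "Movie 4": 1},
    "user_c": {"Movie 1": 4, "Movie 2": 5, "Movie 3": 2, "Movie 5": 4},
}

# Ratings collected from the new user; 0 means "not seen".
new_user = {"Movie 1": 5, "Movie 2": 5, "Movie 3": 1, "Movie 4": 0, "Movie 5": 0}

def similarity(a, b):
    """Cosine similarity over the movies both users have actually rated."""
    shared = [m for m in a if a[m] > 0 and b.get(m, 0) > 0]
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(target, others, top_n=2):
    """Predict ratings for unseen movies as a similarity-weighted average."""
    weighted_sum, weight_total = {}, {}
    for other in others.values():
        sim = similarity(target, other)
        if sim <= 0:
            continue
        for movie, rating in other.items():
            if target.get(movie, 0) == 0 and rating > 0:
                weighted_sum[movie] = weighted_sum.get(movie, 0.0) + sim * rating
                weight_total[movie] = weight_total.get(movie, 0.0) + sim
    predicted = {m: weighted_sum[m] / weight_total[m] for m in weighted_sum}
    # Highest predicted rating first.
    return sorted(predicted.items(), key=lambda item: item[1], reverse=True)[:top_n]

recommendations = recommend(new_user, ratings)
for movie, score in recommendations:
    print(f"{movie}: predicted rating {score:.2f}")
```

The new user's tastes are closest to `user_a` and `user_c`, so their ratings dominate the predictions for the two movies the new user has not seen.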
  12. Thank You