Streaming API, Spark and Ruby


Transcript of Streaming API, Spark and Ruby

Page 1: Streaming API, Spark and Ruby

Streaming API, Spark & Ruby

Page 2: Streaming API, Spark and Ruby

AGENDA

Page 3: Streaming API, Spark and Ruby

BIG DATA

Page 4: Streaming API, Spark and Ruby

Ruby in BIG DATA

Page 5: Streaming API, Spark and Ruby
Page 6: Streaming API, Spark and Ruby

Why should we know?

Page 7: Streaming API, Spark and Ruby

Insights from SAS

Page 8: Streaming API, Spark and Ruby
Page 9: Streaming API, Spark and Ruby
Page 10: Streaming API, Spark and Ruby
Page 11: Streaming API, Spark and Ruby

Data

Page 12: Streaming API, Spark and Ruby

Information

Page 13: Streaming API, Spark and Ruby

Knowledge

Page 14: Streaming API, Spark and Ruby
Page 15: Streaming API, Spark and Ruby

Data Science Field

Page 16: Streaming API, Spark and Ruby

Computer Science

Maths & Statistics

Subject Matter Expertise

Page 17: Streaming API, Spark and Ruby

Where did it all begin?

Page 18: Streaming API, Spark and Ruby

PROBLEM

Page 19: Streaming API, Spark and Ruby

How to store? How to process?

Page 20: Streaming API, Spark and Ruby

Solution

Page 21: Streaming API, Spark and Ruby

HADOOP

Page 22: Streaming API, Spark and Ruby
Page 23: Streaming API, Spark and Ruby

HADOOP Ecosystem

Page 24: Streaming API, Spark and Ruby
Page 25: Streaming API, Spark and Ruby

STORAGE & PROCESS

Page 26: Streaming API, Spark and Ruby

• Distributed Storage: HDFS — a distributed file system in which clusters of commodity hardware store huge data in a distributed fashion.

• Distributed Processing: the MapReduce paradigm — it scales easily to many nodes (1,500-2,000 nodes in a cluster) with just a configuration change.

Page 27: Streaming API, Spark and Ruby

Case Study

Page 28: Streaming API, Spark and Ruby

Daywise Analysis of RubyConfIndia Tweets

Page 29: Streaming API, Spark and Ruby
Page 30: Streaming API, Spark and Ruby

Twitter API

Page 31: Streaming API, Spark and Ruby

Sample Data

Page 32: Streaming API, Spark and Ruby

--- !ruby/object:Twitter::Tweet
attrs:
  :created_at: Tue Mar 08 11:00:57 +0000 2016
  :id: 707159160945811457
  :id_str: '707159160945811457'
  :text: 'Once in a life time to meet Matz at the awesome #kochi https://t.co/6oCIagsHCg #ruby #india https://t.co/YRlpABApkP'

Page 33: Streaming API, Spark and Ruby

Tweet Count

Page 34: Streaming API, Spark and Ruby
Page 35: Streaming API, Spark and Ruby

CLOUDERA QUICKSTART

Page 36: Streaming API, Spark and Ruby

STORAGE LAYER

Page 37: Streaming API, Spark and Ruby

HDFS

Page 38: Streaming API, Spark and Ruby

• It is a distributed file system
• Streaming data access: write once, read many times
• Able to run on commodity hardware
• Fault tolerance
• Replication: 3 nodes by default, configurable
• Block based: 64-256 MB, configurable
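The replication and block-size settings above decide how much cluster capacity a file really consumes. A minimal back-of-the-envelope sketch in Ruby, with assumed illustrative figures (a 1 GB file, 128 MB blocks, the default replication factor of 3):

```ruby
# Assumed figures for illustration only -- not from the slides.
file_size_mb  = 1024   # 1 GB file
block_size_mb = 128    # configured block size
replication   = 3      # default replication factor

# HDFS splits the file into fixed-size blocks (last one may be partial)
blocks   = (file_size_mb.to_f / block_size_mb).ceil
# ...and stores each block on `replication` different DataNodes
replicas = blocks * replication

puts "#{blocks} blocks, #{replicas} block replicas across the cluster"
```

So a 1 GB file actually occupies about 3 GB of raw disk across the cluster, which is why replication is configurable per file.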

Page 39: Streaming API, Spark and Ruby

Replication Factor 2

Page 40: Streaming API, Spark and Ruby

Name Node: stores metadata

Meta Data:
/data/pristine/catalina.log -> blocks 1, 2, 4
/data/pristine/myfile -> blocks 3, 5

Data Node 1: blocks 1, 2, 4, 5
Data Node 2: blocks 2, 3, 4, 5
Data Node 3: blocks 1, 3
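The NameNode's view can be pictured as two small maps: file to block IDs, and block ID to the DataNodes holding a replica. A hypothetical Ruby sketch (the block-to-node assignment is reconstructed from the diagram and may differ from the original figure):

```ruby
# Hypothetical model of NameNode metadata, replication factor 2.
files = {
  "/data/pristine/catalina.log" => [1, 2, 4],
  "/data/pristine/myfile"       => [3, 5]
}

# Which DataNodes hold a replica of each block
block_locations = {
  1 => ["datanode1", "datanode3"],
  2 => ["datanode1", "datanode2"],
  3 => ["datanode2", "datanode3"],
  4 => ["datanode1", "datanode2"],
  5 => ["datanode1", "datanode2"]
}

# With replication factor 2, every block lives on exactly two nodes
puts block_locations.values.all? { |nodes| nodes.size == 2 }
```

If a DataNode dies, the NameNode still knows where the second copy of each of its blocks lives and can re-replicate from there.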

Page 41: Streaming API, Spark and Ruby

HDFS Command Line

Page 42: Streaming API, Spark and Ruby
Page 43: Streaming API, Spark and Ruby

Copying File to HDFS

Page 44: Streaming API, Spark and Ruby
Page 45: Streaming API, Spark and Ruby

HDFS NameNode UI

Page 46: Streaming API, Spark and Ruby
Page 47: Streaming API, Spark and Ruby

PROCESS LAYER

Page 48: Streaming API, Spark and Ruby

YARN / MapReduce 2.0

Page 49: Streaming API, Spark and Ruby

• YARN: A framework for job scheduling and cluster resource management.

• MapReduce: Distributed processing paradigm

Page 50: Streaming API, Spark and Ruby

Map Function

Input: (input_key, value)

Output: a set of (intermediate_key, value) pairs

The system applies the map function in parallel to all inputs.

Reduce Function

Input: (intermediate_key, list of values)

Output: a set of values

The system groups all pairs with the same intermediate key and applies the reduce function to each group.

[Diagram: file chunks -> map -> shuffle stage -> reduce -> output result]
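The three stages can be mimicked with plain Enumerable calls. A toy word-count sketch in Ruby (sample data invented for illustration):

```ruby
# Toy single-process imitation of the MapReduce stages.
input = ["ruby spark", "ruby hadoop", "spark"]

# Map: each input record emits (intermediate_key, value) pairs
mapped = input.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle: the system groups pairs by intermediate key
shuffled = mapped.group_by { |key, _| key }

# Reduce: each group's values are folded into one result per key
result = shuffled.map { |key, pairs| [key, pairs.map { |_, v| v }.sum] }.to_h
# result => {"ruby"=>2, "spark"=>2, "hadoop"=>1}
```

On a cluster the same three steps happen across machines: map tasks run on the file chunks in parallel, the shuffle moves each key's pairs to one reducer, and the reducers write the output.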

Page 51: Streaming API, Spark and Ruby

Leveraging Ruby

Page 52: Streaming API, Spark and Ruby

Mapper.rb

Page 53: Streaming API, Spark and Ruby

#!/usr/bin/env ruby

STDIN.read.split("--- !ruby/object:Twitter::Tweet").each do |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  puts "#{date[1]}" if date[1]
end

Page 54: Streaming API, Spark and Ruby

Reducer.rb

Page 55: Streaming API, Spark and Ruby

#!/usr/bin/env ruby

STDIN.readlines.group_by { |i| i.strip }
     .map { |day, lines| "#{day} #{lines.count}" }
     .each { |line| puts line }
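Before submitting the job to the cluster, a handy sanity check is to run the same mapper and reducer logic over a small in-memory sample. A sketch of that local smoke test (the sample tweets below are invented):

```ruby
# Local smoke test of the streaming pipeline -- no Hadoop needed.
sample = <<~DATA
  --- !ruby/object:Twitter::Tweet
  :created_at: Tue Mar 08 11:00:57 +0000 2016
  --- !ruby/object:Twitter::Tweet
  :created_at: Wed Mar 09 09:12:33 +0000 2016
  --- !ruby/object:Twitter::Tweet
  :created_at: Tue Mar 08 15:40:10 +0000 2016
DATA

# Mapper stage: extract the weekday token from each tweet's :created_at
days = sample.split("--- !ruby/object:Twitter::Tweet").map { |t|
  date = t.match(/:created_at: .{30}/).to_s.split
  date[1] if date[1]
}.compact

# Reducer stage: group identical days and count them
counts = days.group_by(&:itself).map { |day, list| [day, list.count] }.to_h
p counts  # Tue appears twice, Wed once
```

Note that `date[1]` is the weekday name ("Tue", "Wed", ...), which works as a day key here because each conference day falls on a distinct weekday.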

Page 56: Streaming API, Spark and Ruby

HADOOP STREAMING API

Page 57: Streaming API, Spark and Ruby

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/rubyconf/tweets.txt \
  -output /user/rubyconf/daywise \
  -mapper mapper.rb \
  -reducer reducer.rb \
  -file mapper.rb \
  -file reducer.rb

Page 58: Streaming API, Spark and Ruby
Page 59: Streaming API, Spark and Ruby

YARN Resource Manager

Page 60: Streaming API, Spark and Ruby
Page 61: Streaming API, Spark and Ruby

Job Details

Page 62: Streaming API, Spark and Ruby
Page 63: Streaming API, Spark and Ruby

YARN Resource Manager

Page 64: Streaming API, Spark and Ruby
Page 65: Streaming API, Spark and Ruby

DAYWISE TWEETS

Page 66: Streaming API, Spark and Ruby
Page 67: Streaming API, Spark and Ruby

TREND

Page 68: Streaming API, Spark and Ruby
Page 69: Streaming API, Spark and Ruby

SPARK

Page 70: Streaming API, Spark and Ruby

SPARK Benefits
• Speed
• Ease of use
• Runs everywhere

• Source: http://spark.apache.org/

Page 71: Streaming API, Spark and Ruby

• Source: http://spark.apache.org/

Page 72: Streaming API, Spark and Ruby

gem ruby-spark
• gem install ruby-spark
• ruby-spark build
• ruby-spark shell

Page 73: Streaming API, Spark and Ruby

Ruby with SPARK

Page 74: Streaming API, Spark and Ruby

require 'ruby-spark'

# Configuration
Spark.config do
  set_app_name "RubySpark"
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 100
end

# Start Apache Spark
Spark.start

Page 75: Streaming API, Spark and Ruby

# Context reference
sc = Spark.sc
rdd = sc.text_file("hdfs://user/rubyconf/tweets.txt")

# Collect all created days from dates
days = rdd.map(lambda { |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  date[1] if date[1]
})

# Create key-value pairs
pairrdd = days.map(lambda { |x| [x, 1] })

# Final output by using reduce
daywise = pairrdd.reduce_by_key(lambda { |x, y| x + y }).collect_as_hash
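Semantically, the lazy distributed pipeline above computes the same result as an eager, single-process Enumerable chain. A sketch of that equivalence (the sample records below are invented, standing in for the HDFS file):

```ruby
# Invented sample records in place of hdfs://user/rubyconf/tweets.txt
tweets = [
  ":created_at: Tue Mar 08 11:00:57 +0000 2016",
  ":created_at: Wed Mar 09 09:12:33 +0000 2016",
  ":created_at: Tue Mar 08 15:40:10 +0000 2016"
]

daywise = tweets
  .map { |t| t.match(/:created_at: .{30}/).to_s.split[1] }  # extract day
  .compact
  .map { |day| [day, 1] }                                   # key-value pairs
  .group_by { |day, _| day }                                # shuffle by key
  .map { |day, pairs| [day, pairs.sum { |_, v| v }] }       # reduce_by_key
  .to_h
# daywise => {"Tue"=>2, "Wed"=>1}
```

The difference is that Spark partitions the RDD across workers, so the map and reduce steps run in parallel on each partition instead of in one Ruby process.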

Page 76: Streaming API, Spark and Ruby

Expertise & Learnings

Page 77: Streaming API, Spark and Ruby
Page 78: Streaming API, Spark and Ruby

Remember

Page 79: Streaming API, Spark and Ruby

We can use Ruby with HADOOP via the Streaming API & SPARK

Page 80: Streaming API, Spark and Ruby

SPARK is a more generalized distributed computing model

Page 81: Streaming API, Spark and Ruby

Thank You

@_manoharaa