Processing Big Data in Real-Time - Yanai Franchi, Tikal
Codemotion Tel Aviv
Transcript of Processing Big Data in Real-Time - Yanai Franchi, Tikal
1
Processing “BIG-DATA” In Real Time
Yanai Franchi, Tikal
11
Key Notes
● Collector Service - Collects check-ins as text addresses
– We need to use a GeoLocation Service
● Upon each elapsed interval, the last locations list is displayed as a Heat-Map in the GUI.
● Web-scale service – 10Ks of check-ins/second all over the world (imaginary, but let's assume it for the exercise).
● Accuracy – Sample data, NOT critical data.
– Proportionately representative
– Data volume is large enough to compensate for data loss.
12
Heat-Map Context (diagram)
Gogobot Micro Services send Checkins (Text-Address) to the Heat-Map Service.
The Heat-Map Service calls Get-GeoCode(Address) on the Geo Location Service.
The Heat-Map Service produces the Heat-Map of the Last Interval Locations.
13
Plan A (diagram)
Simulate Checkins with a File: Sample Checkins File (Check-in #1, Check-in #2, Check-in #3, ...)
Read Text Address → Processing Checkins → Persist Checkin Intervals → Database
Processing Checkins performs GET Geo Location against the Geo Location Service.
19
Problems ?
● Tedious: Spend time configuring where to send messages, deploying workers, and deploying intermediate queues.
● Brittle: There's little fault-tolerance.
● Painful to scale: Repartitioning running workers is complicated.
20
What We Want ?
● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers!
● Higher level abstraction than message passing
● “Just works”
● Guaranteed data processing (not in this case)
21
Apache Storm
✔Horizontal scalability
✔Fault-tolerance
✔No intermediate message brokers!
✔Higher level abstraction than message passing
✔“Just works”
✔Guaranteed data processing
23
What is Storm ?
● CEP - Open source, distributed realtime computation system.
– Makes it easy to reliably process unbounded streams of tuples
– Does for realtime processing what Hadoop did for batch processing.
● Fast - 1M tuples/sec per node.
– It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
26
Bolts
Process input streams of tuples and produce new streams.
27
Storm Topology
A network of spouts and bolts, connected by streams of tuples.
28
Guarantee for Processing
● Storm guarantees the full processing of a tuple by tracking its state.
● In case of failure, Storm can re-process it.
● Source tuples with fully “acked” trees are removed from the system.
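Storm's acker can track an entire tuple tree in a single 64-bit value: each tuple id is XORed in when the tuple is emitted and XORed again when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked. A minimal sketch of that bookkeeping (the class and the fixed ids are illustrative, not Storm's actual implementation):

```java
// Illustrative sketch of Storm's acker bookkeeping: XOR every tuple id
// once on emit and once on ack; the tree is fully processed when the
// accumulated value returns to 0.
public class AckerSketch {
    private long ackVal = 0;

    public void emitted(long tupleId) { ackVal ^= tupleId; }
    public void acked(long tupleId)   { ackVal ^= tupleId; }
    public boolean fullyProcessed()   { return ackVal == 0; }

    public static void main(String[] args) {
        AckerSketch tree = new AckerSketch();
        long root = 0x5a5a5a5aL;      // hypothetical spout tuple id
        long child = 0x1234L;         // hypothetical anchored child tuple id
        tree.emitted(root);
        tree.emitted(child);          // bolt emits an anchored child tuple
        tree.acked(root);             // spout tuple acked
        System.out.println(tree.fullyProcessed()); // false: child pending
        tree.acked(child);
        System.out.println(tree.fullyProcessed()); // true: whole tree acked
    }
}
```

Only when the whole tree is acked is the source tuple removed from the system; on timeout or failure, the spout can re-emit it.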
31
Stream Grouping
● Shuffle grouping: pick a random task
● Fields grouping: consistent hashing on a subset of tuple fields
● All grouping: send to all tasks
● Global grouping: pick task with lowest id
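The point of fields grouping is that tuples with the same field value always land on the same task. A minimal sketch of the idea (hashing the grouped field modulo the task count; Storm's real implementation differs in details):

```java
import java.util.List;

// Sketch of fields grouping: same field value -> same task index.
public class FieldsGroupingSketch {
    // Pick a task for a tuple based on the grouped field's hash.
    static int chooseTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String city : List.of("New York", "Paris", "New York")) {
            System.out.println(city + " -> task " + chooseTask(city, numTasks));
        }
        // "New York" maps to the same task both times.
    }
}
```

This is why the HeatmapBuilder bolt later groups by "city": all check-ins for one city are accumulated by one task.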
32
Tasks , Executors , Workers
● Task = an instance of a Spout/Bolt
● Executor = a thread, running one or more Tasks
● Worker Process = a JVM, running one or more Executors
33
(diagram) Each Node runs a Supervisor; the Supervisor launches Worker Processes; each Worker Process runs Executors for Spout A, Bolt B, and Bolt C. (Shown for two Nodes.)
34
Storm Architecture (diagram)
Nimbus → Upload/Rebalance Heat-Map Topology → ZooKeeper Nodes → Supervisors
Nimbus: Master Node (similar to the Hadoop JobTracker); NOT critical for a running topology.
35
Storm Architecture (diagram)
Nimbus → Upload/Rebalance Heat-Map Topology → ZooKeeper → Supervisors
ZooKeeper: used for cluster coordination; a few nodes.
36
Storm Architecture (diagram)
Nimbus → Upload/Rebalance Heat-Map Topology → ZooKeeper → Supervisors
Supervisors: run the Worker Processes.
38
HeatMap Input/Output Tuples
● Input Tuples: Timestamp and Text Address:
– (9:00:07 PM, “287 Hudson St New York NY 10013”)
● Output Tuple: Time interval, and a list of points for it:
– (9:00:00 PM to 9:00:15 PM, List((40.719,-73.987),(40.726,-74.001),(40.719,-73.987)))
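The output interval is just the check-in timestamp rounded down to an interval boundary; a sketch of that bucketing (the 15-second length is an assumption read off the 9:00:00–9:00:15 example above):

```java
// Sketch: bucket a check-in timestamp into its interval, matching the
// (9:00:00 PM to 9:00:15 PM, ...) output tuples above.
public class IntervalBucket {
    static final long INTERVAL_MS = 15_000L; // assumed from the example

    static long intervalStart(long timestampMs) {
        return (timestampMs / INTERVAL_MS) * INTERVAL_MS;
    }

    static long intervalEnd(long timestampMs) {
        return intervalStart(timestampMs) + INTERVAL_MS;
    }

    public static void main(String[] args) {
        long t = 37_000L; // 37s after the epoch, for illustration
        System.out.println(intervalStart(t)); // 30000
        System.out.println(intervalEnd(t));   // 45000
    }
}
```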
39
Heat Map Storm Topology (diagram)
Checkins Spout → GeocodeLookup Bolt → HeatmapBuilder Bolt → Persistor Bolt
Checkins Spout emits: (9:01 PM @ 287 Hudson st)
GeocodeLookup Bolt emits: (9:01 PM, (40.736, -74.354))
Upon elapsed interval, HeatmapBuilder Bolt emits: (9:00 PM – 9:15 PM, List((40.73, -74.34), (51.36, -83.33), (69.73, -34.24)))
40
Checkins Spout

public class CheckinsSpout extends BaseRichSpout {
  private List<String> sampleLocations;
  private int nextEmitIndex;
  private SpoutOutputCollector outputCollector;

  @Override
  public void open(Map map, TopologyContext topologyContext,
                   SpoutOutputCollector spoutOutputCollector) {
    this.outputCollector = spoutOutputCollector;
    this.nextEmitIndex = 0;
    try {
      sampleLocations = IOUtils.readLines(
          ClassLoader.getSystemResourceAsStream("sample-locations.txt"));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void nextTuple() {
    String address = sampleLocations.get(nextEmitIndex);
    String checkin = new Date().getTime() + "@ADDRESS:" + address;
    outputCollector.emit(new Values(checkin));
    nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("str"));
  }
}

We hold state – no need for thread safety.
declareOutputFields declares the output fields.
nextTuple() is called iteratively by Storm.
41
Geocode Lookup Bolt

public class GeocodeLookupBolt extends BaseBasicBolt {
  private LocatorService locatorService;

  @Override
  public void prepare(Map stormConf, TopologyContext context) {
    locatorService = new GoogleLocatorService();
  }

  @Override
  public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
    String str = tuple.getStringByField("str");
    // Matches the spout's "<time>@ADDRESS:<address>" format
    String[] parts = str.split("@ADDRESS:");
    Long time = Long.valueOf(parts[0]);
    String address = parts[1];
    LocationDTO locationDTO = locatorService.getLocation(address);
    String city = locationDTO.getCity();
    outputCollector.emit(new Values(city, time, locationDTO));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
    fieldsDeclarer.declare(new Fields("city", "time", "location"));
  }
}

Get Geocode, Create DTO.
43
Two Streams to Heat-Map Builder
The HeatMap-Builder Bolt receives two streams: the check-in tuples (Checkin 1, Checkin 4, Checkin 5, Checkin 6, ...) and the tick-tuple stream.
On a tick tuple, we flush our Heat-Map.
44
Tick Tuple in Action

public class HeatMapBuilderBolt extends BaseBasicBolt {
  private Map<String, List<LocationDTO>> heatmaps; // Hold latest intervals

  @Override
  public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60); // Tick interval
    return conf;
  }

  @Override
  public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
    if (isTickTuple(tuple)) {
      // Emit accumulated intervals
    } else {
      // Add check-in info to the current interval in the Map
    }
  }

  private boolean isTickTuple(Tuple tuple) {
    return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
        && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("time-interval", "city", "locationsList"));
  }
}
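The two branches left as comments above can be sketched without any Storm dependencies: check-ins accumulate in a map keyed by interval and city, and the tick flushes (returns and clears) everything accumulated so far. Class and method names here are illustrative, not from the talk:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the HeatMapBuilder's accumulate/flush logic, free of Storm types.
// Keys are "<intervalStart>@<city>"; values are the locations seen in it.
public class HeatmapAccumulator {
    private final Map<String, List<double[]>> heatmaps = new HashMap<>();

    // Called for each check-in tuple.
    public void add(long intervalStart, String city, double lat, double lon) {
        heatmaps.computeIfAbsent(intervalStart + "@" + city, k -> new ArrayList<>())
                .add(new double[]{lat, lon});
    }

    // Called on a tick tuple: hand back all intervals and start fresh.
    public Map<String, List<double[]>> flush() {
        Map<String, List<double[]>> snapshot = new HashMap<>(heatmaps);
        heatmaps.clear();
        return snapshot;
    }
}
```

On a tick, the bolt would emit one ("time-interval", "city", "locationsList") tuple per entry of flush()'s result.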
45
Persister Bolt

public class PersistorBolt extends BaseBasicBolt {
  private Jedis jedis;

  @Override
  public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
    Long timeInterval = tuple.getLongByField("time-interval");
    String city = tuple.getStringByField("city");
    String locationsList = objectMapper.writeValueAsString(
        tuple.getValueByField("locationsList"));

    String dbKey = "checkins-" + timeInterval + "@" + city;

    jedis.setex(dbKey, 3600 * 24, locationsList); // Persist in Redis for 24h

    jedis.publish("location-key", dbKey); // Publish in a Redis channel for debugging
  }
}
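The key format is the only contract between the Persistor bolt and whoever reads the heat-map back: the GUI side rebuilds the same key to fetch an interval's locations. A tiny sketch of building and parsing that key (pure Java, no Redis needed; helper names are illustrative):

```java
// Sketch: build/parse the Redis key contract used by PersistorBolt,
// "checkins-<intervalStart>@<city>".
public class CheckinKey {
    static String build(long timeInterval, String city) {
        return "checkins-" + timeInterval + "@" + city;
    }

    static String cityOf(String dbKey) {
        return dbKey.substring(dbKey.indexOf('@') + 1);
    }

    public static void main(String[] args) {
        String key = build(1700000000000L, "New York");
        System.out.println(key);          // checkins-1700000000000@New York
        System.out.println(cityOf(key));  // New York
    }
}
```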
46
Transforming the Tuples (diagram)
Sample Checkins File (Check-in #1, Check-in #2, ...) → Read Text Addresses → Checkins Spout
Checkins Spout → (Shuffle Grouping) → GeocodeLookup Bolt
GeocodeLookup Bolt → (Fields Grouping by city) → HeatmapBuilder Bolt
HeatmapBuilder Bolt → (Shuffle Grouping) → Database Persistor Bolt
GeocodeLookup Bolt performs GET Geo Location against the Geo Location Service.
47
Heat Map Topology

public class LocalTopologyRunner {
  public static void main(String[] args) {
    TopologyBuilder builder = buildTopology();
    StormSubmitter.submitTopology(
        "local-heatmap", new Config(), builder.createTopology());
  }

  private static TopologyBuilder buildTopology() {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("checkins", new CheckinsSpout());
    builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
        .shuffleGrouping("checkins");
    builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
        .fieldsGrouping("geocode-lookup", new Fields("city"));
    builder.setBolt("persistor", new PersistorBolt())
        .shuffleGrouping("heatmap-builder");
    return builder;
  }
}
50
Scaling the Topology

public class LocalTopologyRunner {
  public static void main(String[] args) {
    TopologyBuilder builder = buildTopology();
    Config conf = new Config();
    conf.setNumWorkers(2); // Set no. of workers
    StormSubmitter.submitTopology(
        "local-heatmap", conf, builder.createTopology());
  }

  private static TopologyBuilder buildTopology() {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("checkins", new CheckinsSpout(), 4); // Parallelism hint
    builder.setBolt("geocode-lookup", new GeocodeLookupBolt(), 8)
        .shuffleGrouping("checkins")
        .setNumTasks(64); // Increase tasks for future rebalancing
    builder.setBolt("heatmap-builder", new HeatMapBuilderBolt(), 4)
        .fieldsGrouping("geocode-lookup", new Fields("city"));
    builder.setBolt("persistor", new PersistorBolt(), 2)
        .shuffleGrouping("heatmap-builder")
        .setNumTasks(4);
    return builder;
  }
}
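With the settings above, geocode-lookup starts with 8 executors sharing 64 tasks, so each executor runs 8 tasks; rebalancing can later spread the same 64 tasks over more executors without changing the task count. A sketch of that arithmetic (numbers taken from the code above):

```java
// Sketch: how tasks distribute over executors for the geocode-lookup bolt.
public class ParallelismMath {
    static int tasksPerExecutor(int numTasks, int numExecutors) {
        return numTasks / numExecutors;
    }

    public static void main(String[] args) {
        int numTasks = 64;    // setNumTasks(64)
        int executors = 8;    // parallelism hint
        System.out.println(tasksPerExecutor(numTasks, executors)); // 8
        // After rebalancing to 32 executors, each runs 2 of the same 64 tasks.
        System.out.println(tasksPerExecutor(numTasks, 32));        // 2
    }
}
```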
51
Recap – Plan A (diagram)
Sample Checkins File (Check-in #1, Check-in #2, ...) → Read Text Address → Storm Heat-Map Topology → Persist Checkin Intervals → Database
The topology performs GET Geo Location against the Geo Location Service.
54
Plan B - Kafka Spout & Bolt to HeatMap (diagram)
Publish Checkins → Checkin Kafka Topic → (Read Text Addresses) → Kafka Checkins Spout
Kafka Checkins Spout → GeocodeLookup Bolt → HeatmapBuilder Bolt → Persistor Bolt → Database
Persistor Bolt → Kafka Locations Bolt → Locations Topic
GeocodeLookup Bolt uses the Geo Location Service.
70
Topics
● Logical collections of partitions (the physical files).
● A broker contains some of the partitions for a topic.

• “Up to 2 million writes/sec on 3 cheap machines”
• Using 3 producers on 3 different machines, 3x async replication
• Only 1 producer/machine because the NIC is already saturated
• End-to-end latency is about 10ms for 99.9% of messages
• Sustained throughput as stored data grows
87
88
Add Kafka to our Topology

public class LocalTopologyRunner {
  ...
  private static TopologyBuilder buildTopology() {
    ...
    builder.setSpout("checkins", new KafkaSpout(kafkaConfig), 4); // Kafka Spout
    ...
    builder.setBolt("kafkaProducer", new KafkaOutputBolt(
            "localhost:9092",
            "kafka.serializer.StringEncoder",
            "locations-topic"))
        .shuffleGrouping("persistor"); // Kafka Bolt
    return builder;
  }
}
89
(diagram)
Checkin HTTP Reactor → Publish Checkins (Text-Address) → Checkin Kafka Topic
Checkin Kafka Topic → Consume Checkins → Storm Heat-Map Topology
Storm Heat-Map Topology → GET Geo Location → Geo Location Service
Storm Heat-Map Topology → Persist Checkin Intervals → Database
Storm Heat-Map Topology → Publish Interval Key → Locations Kafka Topic
92
More Conclusions..
● BigData – Also refers to Velocity of data (not only Volume of data)
● Storm – Great for real-time BigData processing. Complementary to Hadoop batch jobs.
● Kafka – Great messaging for logs/events data; serves as a good “source” for a Storm spout