Storm Gluecon

61
Storm Dan Lynn [email protected] @danklynn As deep into real-time data processing as you can get* *in 30 minutes.

description

real-time

Transcript of Storm Gluecon

  • Storm

    Dan [email protected]

    @danklynn

    As deep into real-time data processing as you can get**in 30 minutes.

  • Keeps Contact Information Current and Complete

    Based in Denver, Colorado

    CTO & [email protected]

    @danklynn

  • Turn Partial Contacts Into Full Contacts

  • Storm

  • StormDistributed and fault-tolerant real-3me computa3on

  • StormDistributed and fault-tolerant real-3me computa3on

  • StormDistributed and fault-tolerant real-3me computa3on

  • StormDistributed and fault-tolerant real-3me computa3on

  • THE HARD WAY

    Queues

    Workers

  • THE HARD WAY

  • Key Concepts

  • TuplesOrdered list of elements

  • TuplesOrdered list of elements

    ("search-01384", "e:[email protected]")

  • StreamsUnbounded sequence of tuples

  • StreamsUnbounded sequence of tuples

    Tuple Tuple Tuple Tuple Tuple Tuple

  • SpoutsSource of streams

  • SpoutsSource of streams

  • SpoutsSource of streams

    Tuple Tuple Tuple Tuple Tuple Tuple

  • Spouts can talk with

    some images from h,p://commons.wikimedia.org

    Queues

    Web logs

    API calls

    Event data

  • BoltsProcess tuples and create new streams

  • Bolts

    some images from h,p://commons.wikimedia.org

    Apply funcBons / transformsFilterAggregaBonStreaming joinsAccess DBs, APIs, etc...

  • Bolts

    Tuple Tuple Tuple Tuple Tuple Tuple

    some images from h,p://commons.wikimedia.org

    TupleTuple

    TupleTuple

    TupleTuple

    TupleTuple

    TupleTuple

    TupleTuple

  • TopologiesA directed graph of Spouts and Bolts

  • This is a Topology

    some images from h,p://commons.wikimedia.org

  • This is also a topology

    some images from h,p://commons.wikimedia.org

  • TasksExecute Streams or Bolts

  • Running a Topology

    $ storm jar my-code.jar com.example.MyTopology arg1 arg2

  • Storm Cluster

    Nathan Marz

  • Storm Cluster

    Nathan Marz

    If this wereHadoop...

  • Storm Cluster

    Nathan Marz

    Job Tracker

    If this wereHadoop...

  • Storm Cluster

    Nathan MarzTask Trackers

    If this wereHadoop...

  • Storm Cluster

    Nathan Marz

    Coordinates everything

    But its not Hadoop

  • Example:Streaming Word Count

  • Streaming Word Count

    TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

  • Streaming Word Count

    TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

  • Streaming Word Count

    public static class SplitSentence extends ShellBolt implements IRichBolt {public SplitSentence() {super("python", "splitsentence.py");}

    @Overridepublic void declareOutputFields(OutputFieldsDeclarer declarer) {declarer.declare(new Fields("word"));}

    @Overridepublic Map getComponentConfiguration() {return null;}}

    SplitSentence.java

  • Streaming Word Count

    public static class SplitSentence extends ShellBolt implements IRichBolt {public SplitSentence() {super("python", "splitsentence.py");}

    @Overridepublic void declareOutputFields(OutputFieldsDeclarer declarer) {declarer.declare(new Fields("word"));}

    @Overridepublic Map getComponentConfiguration() {return null;}}

    SplitSentence.java

    splitsentence.py

  • Streaming Word Count

    public static class SplitSentence extends ShellBolt implements IRichBolt {public SplitSentence() {super("python", "splitsentence.py");}

    @Overridepublic void declareOutputFields(OutputFieldsDeclarer declarer) {declarer.declare(new Fields("word"));}

    @Overridepublic Map getComponentConfiguration() {return null;}}

    SplitSentence.java

  • Streaming Word Count

    TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

    java

  • Streaming Word Count

    public static class WordCount extends BaseBasicBolt {Map counts = new HashMap();

    @Overridepublic void execute(Tuple tuple, BasicOutputCollector collector) {String word = tuple.getString(0);Integer count = counts.get(word);if(count==null) count = 0;count++;counts.put(word, count);collector.emit(new Values(word, count));}

    @Overridepublic void declareOutputFields(OutputFieldsDeclarer declarer) {declarer.declare(new Fields("word", "count"));}}

    WordCount.java

  • Streaming Word Count

    TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

    java

    Groupings control how tuples are routed

  • Shuffle groupingTuples are randomly distributed across all of the

    tasks running the bolt

  • Fields groupingGroups tuples by specific named fields and routes

    them to the same task

  • Fields groupingGroups tuples by specific named fields and routes

    them to the same task

    Analogous to Ha

    doops

    partitioning beh

    avior

  • Trending Topics

  • Twitter Trending Topics

    TwitterStreamingTopicSpoutparallelism = 1 (unless you use GNip)

    (word)

    RollingCountsBoltparallelism = n

    (word, coun

    t)

    IntermediateRankingsBoltparallelism = n

    (rankings)

    (tweets)

    (JSON rankings)

    RankingsReportBoltparallelism = 1

    TotalRankingsBoltparallelism = 1

    (rank

    ings)

  • Live Coding!

  • Twitter Trending Topics

    TwitterStreamingTopicSpoutparallelism = 1 (unless you use GNip)

    (word)

    RollingCountsBoltparallelism = n

    (word, coun

    t)

    IntermediateRankingsBoltparallelism = n

    (rankings)

    (tweets)

    (JSON rankings)

    RankingsReportBoltparallelism = 1

    TotalRankingsBoltparallelism = 1

    (rank

    ings)

  • Tips

  • loggly.com

    Graylog2logstash

    Use a log aggregator

  • "$topologyName-$buildNumber"

    Rolling Deploys

  • 1. Launch new topology

    2. Wait for it to be healthy

    3. Kill the old one

    Rolling Deploys

  • These are under active development

    Rolling Deploys

  • TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences");builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));

    java

    see:https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology

    Tune your parallelism

  • Tune your parallelismSupervisor

    Worker Process (JVM)

    Executor (thread)

    Task

    Task

    Executor (thread)

    Task

    Task

    Worker Process (JVM)

    Executor (thread)

    Task

    Task

    Executor (thread)

    Task

    Task

    Parallelism hints control the number of Executors

  • collector.emit(new Values(word, count));

    see:https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology

    Anchor your tuples (or not)

    collector.emit(tuple, new Values(word, count));

  • But Dan, you left out Trident!

  • if (storm == hadoop) { trident = pig / cascading}

  • A little taste of Trident TridentState urlToTweeters = topology.newStaticState(getUrlToTweetersState());TridentState tweetersToFollowers = topology.newStaticState(getTweeterToFollowersState());

    topology.newDRPCStream("reach") .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters")) .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter")) .shuffle() .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers")) .parallelismHint(200) .each(new Fields("followers"), new ExpandList(), new Fields("follower")) .groupBy(new Fields("follower")) .aggregate(new One(), new Fields("one")) .parallelismHint(20) .aggregate(new Count(), new Fields("reach"));

    h,ps://github.com/nathanmarz/storm/wiki/Trident-tutorial

  • Thanks:

    @stormprocessorhttp://github.com/nathanmarz/storm

    Nathan Marz - @nathanmarz

    http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

    Michael Knoll - @miguno

    Michael Rose - @xorlevhttp://github.com/xorlev

  • [email protected]

    https://github.com/danklynn/storm-starter/tree/gluecon2013