Goto amsterdam-2013-skinned

Click here to load reader

download Goto amsterdam-2013-skinned

of 53

  • date post

    10-May-2015
  • Category

    Technology

  • view

    540
  • download

    0

Embed Size (px)

Transcript of Goto amsterdam-2013-skinned

  • 1.1MapR Technologies - ConfidentialOld and New Building BlocksCome Together For Big Data

2. 2MapR Technologies - Confidential Contact:tdunning@maprtech.com@ted_dunning Slides and suchhttp://slideshare.net/tdunning Hash tags: #mapr #goto #d3 #node 3. 3MapR Technologies - ConfidentialEmbarrassment of Riches d3.js allows really pretty pictures node.js allows simple (not just web) servers Storm does real-time Hadoop does big data d3 allows very cool visualizations 4. 4MapR Technologies - ConfidentialD3 demo 5. 5MapR Technologies - Confidentialnode demo 6. 6MapR Technologies - ConfidentialHadoopdemo 7. 7MapR Technologies - ConfidentialBut Web camp everything is a service with a URL or a DOM Big data camp non-traditional file systems Everybody else files and databases They dont like to talk to each other 8. 8MapR Technologies - ConfidentialWhy Not Tiered Architectures? Tiered architectures translations between services and cultures standard corporate answer Feels like molasses 9. 9MapR Technologies - ConfidentialThe Vision Integrate multiple computing paradigms many computing communities How? common storage, queuing and data platforms 10. 10MapR Technologies - ConfidentialFor Example, Incoming documents with text store in file-based queues index in real-time using Storm and Solr add initial engagement class, dont-know Search for documents using original text add random noise, small for well understood docs, large for dont-knowdocs Record engagement 11. 11MapR Technologies - ConfidentialAdd Analysis Process engagement logs item-item cooccurrence user-item histories Update search index indicator items decrease uncertainty on well understood docs Update user profile item history 12. 12MapR Technologies - ConfidentialSearch Again Now searches use recent views + text recent views query indicator fields text queries normal text data add noise as appropriate 13. 13MapR Technologies - ConfidentialAnd Draw a Picture Searches and clicks can be logged real-time metrics real-time trending topics Whats hot, whats not Popular searches Document clusters Word clouds 14. 14MapR Technologies - ConfidentialIn Pictures 15. 15MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsources 16. 16MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine 17. 17MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine LogsRecommendationanalysis 18. 18MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine LogsRecommendationanalysisUsageanalysisRenderingAdminqueries 19. 19MapR Technologies - ConfidentialWhich Technology?DocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine LogsRecommendationanalysisUsageanalysisAdminqueriesRenderingStorm/nodeSolrMapRD3/nodeOther 20. 20MapR Technologies - ConfidentialYeah, But This isnt as easy as it looks Take the real-time / long-time part 21. 21MapR Technologies - ConfidentialtnowHadoop is Not Very Real-timeUnprocessedDataFullyprocessedLatest fullperiodHadoop jobtakes thislong for thisdata 22. 22MapR Technologies - ConfidentialtnowHadoop worksgreat back hereStormworkshereReal-time and Long-time togetherBlendedviewBlendedviewBlendedView 23. 23MapR Technologies - ConfidentialSolRIndexerSolRIndexerSolrindexingCooccurrence(Mahout)Item meta-dataIndexshardsCompletehistory 24. 24MapR Technologies - ConfidentialSolRIndexerSolRIndexerSolrsearchWeb tierItem meta-dataIndexshardsUserhistory 25. 25MapR Technologies - ConfidentialUsersCatcher StormTopicQueueWeb-serverhttpWebDataMapR 26. 26MapR Technologies - ConfidentialCloser Look Catcher ProtocolDataSourcesCatcherClusterCatcherClusterDataSourcesThe data sources and catcherscommunicate with a very simpleprotocol.Hello() => list of catchersLog(topic,message) =>(OK|FAIL, redirect-to-catcher) 27. 27MapR Technologies - ConfidentialCloser Look Catcher QueuesCatcherClusterCatcherClusterThe catchers forward log requeststo the correct catcher and returnthat host in the reply to allow theclient to avoid the extra hop.Each topic file is appended byexactly one catcher.Topic files are kept in shared filestorage.TopicFileTopicFile 28. 28MapR Technologies - ConfidentialCloser Look ProtoSpoutThe ProtoSpout tails the topic files,parses log records into tuples andinjects them into the Stormtopology.Last fully acked position stored inshared file system.TopicFileTopicFileProtoSpout 29. 29MapR Technologies - ConfidentialYeah, But What was that about adding noise in scoring? Why would I do that?? Is there a simple answer? 30. 30MapR Technologies - ConfidentialThompson Sampling Select each shell according to the probability that it is the best Probability that it is the best can be computed using posterior But I promised a simple answerP(i is best) = I E[ri |q]= maxjE[rj |q] P(q | D) dq 31. 31MapR Technologies - ConfidentialThompson Sampling Take 2 Sample Pick i to maximize reward Record result from using iq ~P(q | D)i = argmaxjE[r |q] 32. 32MapR Technologies - ConfidentialNearly Forgotten until Recently Citations for Thompson sampling 33. 33MapR Technologies - ConfidentialBayesian Bandit for the Search Compute distributions based on data so far Sample scores s1, s2 based on actual score plus per doc noise from these distributions Rank docs by si Lemma 1: The probability of showing doc i at first position willmatch the probability it is the best Lemma 2: This is as good as it gets 34. 34MapR Technologies - ConfidentialAnd it works!11000 100 200 300 400 500 600 700 800 900 10000.1200.010.020.030.040.050.060.070.080.090.10.11nregret- greedy, = 0.05Bayesian Bandit with Gamma- Normal 35. 35MapR Technologies - ConfidentialYeah, But Isnt recommendations complicated? How can I implement this? 36. 36MapR Technologies - ConfidentialRecommendation Basics History:User Thing1 32 43 42 33 21 12 1 37. 37MapR Technologies - ConfidentialRecommendation Basics History as matrix: (t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) oncet1 t2 t3 t4u1 1 0 1 0u2 1 0 1 1u3 0 1 0 1 38. 38MapR Technologies - ConfidentialA Quick Simplification Users who do h Also do rAhATAh( )ATA( )hUser-centric recommendationsItem-centric recommendations 39. 39MapR Technologies - ConfidentialRecommendation Basics Coocurrencet1 t2 t3 t4t1 2 0 2 1t2 0 1 0 1t3 2 0 1 1t4 1 1 1 2 40. 40MapR Technologies - ConfidentialProblems with Raw Cooccurrence Very popular items co-occur with everything Welcome document Elevator music That isnt interesting We want anomalous cooccurrence 41. 41MapR Technologies - ConfidentialRecommendation Basics Coocurrencet1 t2 t3 t4t1 2 0 2 1t2 0 1 0 1t3 2 0 1 1t4 1 1 1 2t3 not t3t1 2 1not t1 1 1 42. 42MapR Technologies - ConfidentialSpot the Anomaly Root LLR is roughly like standard deviationsA not AB 13 1000not B 1000 100,000A not AB 1 0not B 0 2A not AB 1 0not B 0 10,000A not AB 10 0not B 0 100,0000.44 0.982.26 7.15 43. 43MapR Technologies - ConfidentialRoot LLR Details In Rentropy = function(k) {-sum(k*log((k==0)+(k/sum(k))))}rootLLr = function(k) {sign = sign * sqrt((entropy(rowSums(k))+entropy(colSums(k))- entropy(k))/2)} Like sqrt(mutual information * N/2)See http://bit.ly/16DvLVK 44. 44MapR Technologies - ConfidentialThreshold by Score Coocurrencet1 t2 t3 t4t1 2 0 2 1t2 0 1 0 1t3 2 0 1 1t4 1 1 1 2 45. 45MapR Technologies - ConfidentialThreshold by Score Significant cooccurrence => Indicatorst1 t2 t3 t4t1 1 0 0 1t2 0 1 0 1t3 0 0 1 1t4 1 0 0 1 46. 46MapR Technologies - ConfidentialYeah, But Why go to all this trouble? Does it really help? 47. 47MapR Technologies - ConfidentialReal-life example 48. 48MapR Technologies - ConfidentialThe Real Life Issues Exploration Diversity Speed Not the last percent 49. 49MapR Technologies - ConfidentialThe Second Page 50. 50MapR Technologies - ConfidentialMake it Worse to Make It Better Add noise to rank1 2 8 7 6 3 5 4 10 13 21 18 12 9 14 24 34 28 32 1711 27 40 30 41 49 16 15 35 23 19 22 26 31 20 43 25 29 33 6238 60 74 53 36 37 39 70 45 44 46 71 42 69 47 63 52 57 51 48 Results are worse today But better tomorrow 51. 51MapR Technologies - ConfidentialAnti-Flood 200 of the same result is no better than 2 The recommender list is a portfolio of results If probability of success is highly correlated, then probability of at least onesuccess is much lower Suppressing items similar to higher ranking items helps 52. 52MapR Technologies - ConfidentialThe Punchline Hybrid systems really can work today Middle tiers arent as interesting as they used to be No need for Flume queue directly in big data system No need for external queues, tail the data directly with Storm No need for query systems for presentation data read it directly withnode Absolutely require common frameworks and standard interfaces You can do this today! 53. 53MapR Technologies - Confidential Contact:tdunning@maprtech.com@ted_dunning Slides and suchhttp://slideshare.net/tdunning Hash tags: #mapr #goto #d3 #node