Dancing with publish/subscribe

Dancing with publish/subscribe (Distributed event-based systems). Lightning talk on Top-k publish/subscribe. By Y.S. Horawalavithana, BSc (Hons.) Computer Science. MSc. Distributed System

Transcript of Dancing with publish/subscribe

  1. Dancing with publish/subscribe (Distributed event-based systems). Lightning talk on Top-k publish/subscribe. By Y.S. Horawalavithana, BSc (Hons.) Computer Science. MSc. Distributed System
  2. Today: For whom? Outline. Discussion.
  3. Communication paradigms. Point-to-point communication: participants need to exist at the same time; direct coupling; strict identity management; not good for volatile environments; not a good way to communicate with several participants. Indirect communication: communication through an intermediary between sender(s) and receiver(s); no direct coupling. Space uncoupling: anonymity. Time uncoupling: independent lifetimes, through a persistent communication channel.
  4. Indirect communication. Scenarios where users connect and disconnect very often: mobile environments, messaging services, forums. Event dissemination where receivers may be unknown and change often: RSS, event feeds in financial services. Scenarios with a very large number of participants: Google Ads system, Spotify. Commonly used in cases when change is anticipated. Need to provide dependable services.
  5. Taxonomy of indirect communication. Communication based: group communication, message queues, publish/subscribe. State based: tuple spaces, distributed shared memory.
  6. Publish/Subscribe. Example subscription: "Notify me of all stock quotes of Google from NYSE if the price is greater than 150."
  7. Introduction: pub/sub systems. Information consumers express their interests with subscriptions, which identify the items of interest. Information producers publish information by submitting publications (a.k.a. publication events or event notifications). A pub/sub system performs: Subscription processing: indexing and storing subscriptions. Event processing: upon event arrival, accessing the subscription indices and identifying all matched subscriptions. Event delivery: delivering the event to clients with matched subscriptions.
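The three responsibilities listed on this slide can be sketched as a toy in-memory broker. This is only an illustration under simplifying assumptions: the `Broker` class, its method names and the linear scan over predicates are hypothetical, and a real system would index subscriptions rather than test each one.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]
Predicate = Callable[[Event], bool]


class Broker:
    """Toy in-memory pub/sub broker (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Subscription processing: store each predicate under its subscriber.
        self.subscriptions: Dict[str, List[Predicate]] = defaultdict(list)
        # Event delivery: one inbox per subscriber.
        self.inboxes: Dict[str, List[Event]] = defaultdict(list)

    def subscribe(self, subscriber: str, predicate: Predicate) -> None:
        self.subscriptions[subscriber].append(predicate)

    def publish(self, event: Event) -> None:
        # Event processing: find every subscriber with a matching predicate.
        for subscriber, predicates in self.subscriptions.items():
            if any(predicate(event) for predicate in predicates):
                self.inboxes[subscriber].append(event)


broker = Broker()
broker.subscribe(
    "dealer",
    lambda e: e.get("name") == "Google" and e.get("price", 0) > 150,
)
broker.publish({"name": "Google", "price": 200})
broker.publish({"name": "Google", "price": 120})
```

Note that the publisher never names the subscriber: the broker is the intermediary that gives the space uncoupling described earlier in the talk.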
  8. Programming model. Figure adapted from the Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design, Edn. 5, Pearson Education 2012.
  9. Introduction: DB view of pub/sub. Events correspond to data (data-carrying events). Subscriptions correspond to continuous queries that define predicates on attributes. Fundamentally different model: instead of storing/indexing data and issuing queries to access it, queries (subscriptions) are stored/indexed and incoming data (events) is matched against the stored queries.
  10. Introduction: Communications view of pub/sub. Akin to multicasting (group IPC, 1-N communication): each publisher (through its events) communicates with a large number of subscribers. However, communication is: Anonymous: subscribers do not know publishers and vice versa. Asynchronous: publishers and subscribers do not block when publishing/subscribing. Mutually out-of-sync: no rendezvous in time. Heterogeneous: can be used to connect heterogeneous components.
  11. Example: Real-world implementation.
  12. Pub/sub: System space. Figure adapted from K. Pripužić, I. Podnar Žarko, and K. Aberer, Top-k/w publish/subscribe, 2012.
  13. Pub/sub: Subscription models. Content based, type based, topic based. Figure labels: context type, object types, independent channels, hierarchical topics, (un)structured queries, complex event processing.
  14. Pub/sub: Real-world applications. Too numerous; some representative application classes: news alerts, online stock quotes, Internet games, sensor networks, location-based services, network management, Internet auctions.
  15. Case study: Dealing Room.
  16. Case study: Spotify.
  17. Spotify at first glance. End-to-end architecture to support social interaction. Topic-based subscriptions: Friends (Spotify + Facebook): Facebook friends who are Spotify users and share music. Playlists (URI): other users' playlists (updates), either collaborative or modifiable only by the creator. Artist pages (follow artist): new albums or news related to the artist.
  18. Spotify at first glance. Hybrid engine: relays events to online users in real time; stores and forwards selected events to offline users. DHT-based overlay. 3 sites: Stockholm (Sweden), London (UK), Ashburn (USA). Designed to scale: stores approximately 600 million subscriptions at any given time; matches billions of publication events every day.
  19. Large scale publish/subscribe systems.
  20. Boolean matching in pub/sub. Assume the dealing room system is implemented on top of the pub/sub paradigm. A dealer submits a subscription [Name = Google, price > 150, volume < 5000]. The stock exchange publishes a stock quote (publication) [Name = Google and price = 200 and volume = 3000].
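The dealing room example can be checked with a few lines of code. The sketch below assumes a subscription is a list of (attribute, operator, value) triples and a publication is a dictionary; both representations and the `matches` helper are illustrative choices, not part of any particular pub/sub system.

```python
import operator

# Supported comparison operators (an illustrative subset).
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def matches(subscription, publication):
    """Return True only if every predicate holds (all-or-nothing matching)."""
    for attribute, op, value in subscription:
        if attribute not in publication:
            return False
        if not OPS[op](publication[attribute], value):
            return False
    return True


subscription = [("Name", "=", "Google"), ("price", ">", 150), ("volume", "<", 5000)]
quote = {"Name": "Google", "price": 200, "volume": 3000}
is_match = matches(subscription, quote)
```

Because the result is a plain Boolean, two matching quotes cannot be ranked against each other, which is the limitation the next slide raises.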
  21. Drawbacks of Boolean pub/sub. A subscriber may either be overloaded with publications or receive too few publications. It is impossible to compare different matching publications, as ranking functions are not defined. Partial matching between subscriptions and publications is not supported.
  22. Real-world requirements: Sensor Web. Real-time environmental monitoring: environmental scientists would like to identify and monitor up to 10 sites with the largest pollution readings over the course of a single day (NSF's Ocean Observatories Initiative, OOI); identify the 10 sensors closest to a particular location measuring the largest pollution levels over time, e.g. top-10 readings provided on an hourly basis (SNSF's SensorScope project). Power grid monitoring: operators would like to monitor over time the 100 sites with the largest or the lowest power production, using solar panel current and voltage readings, so that they can identify power grid hot-spots.
  23. Real-world requirements: Forest fire rescue.
  24. Real-world requirements: Social media. Personalized newspaper: a Facebook user is exposed to approximately 1500 or more stories per day, but an average user engages with only 100 stories from the current news feed. What if users had a personalized newspaper at the end of the day? Social annotation of news stories: serving Yahoo! News page views with a fresh set of Top-k tweets, treating each news story as a subscription and tweets as incoming publications.
  25. Top-k publish/subscribe. Example subscription: "Notify me of the Top-10 stock quotes of Google hourly from NYSE if the price is greater than 150."
  26. Top-k publish/subscribe. How many matching publications will be delivered to a subscriber during a period of time? In state-of-the-art pub/sub systems we do not know. Top-k pub/sub models are powered by: expressive stateful query processing engines; a user-defined parameter k that restricts the delivered publications; time-(in)dependent Top-k computing methods; a sliding window model for handling streaming publications; methods to deliver Top-k notifications (pro-active, on-demand).
  27. Abstract Top-k/w matching. Limit the number of matching and delivered publications to the k best within a sliding window of size w. Figure: Top-2 selection over a matching publication stream, shown for jumping steps h = 1 and h = 3.
  28. [Pripužić 2012] Top-k/w model: DaZaLaPS. The subscriber controls the number of publications it receives per subscription (top-k) within a sliding window. A subscription is defined by: a totally ordered and time-independent scoring function; a parameter k ∈ ℕ; a parameter w ∈ ℝ+* (time-based sliding window) or n ∈ ℕ (count-based sliding window). Publications are ranked according to their degree of relevance (score) to a subscription. Each publication competes with the other publications in the sliding window for a position among the top-k publications.
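A minimal sketch of the count-based variant of this model follows. It assumes publications arrive already scored, and it simply re-sorts the window on every arrival; the function names are hypothetical, and this naive recomputation is not the candidate-maintenance strategy of the Top-k/w paper.

```python
from collections import deque


def top_k_in_window(window, k):
    """Return the k highest-scoring (score, publication) pairs in the window."""
    return sorted(window, key=lambda pair: pair[0], reverse=True)[:k]


def stream_top_k(scored_stream, k, n):
    """Yield the current top-k after each arrival, over the last n publications."""
    window = deque(maxlen=n)  # count-based window: the oldest item drops out
    for score, publication in scored_stream:
        window.append((score, publication))
        yield top_k_in_window(window, k)


stream = [(0.5, "p1"), (0.9, "p2"), (0.1, "p3"), (0.7, "p4"), (0.3, "p5")]
results = list(stream_top_k(stream, k=2, n=3))
```

In this run p5 has a low score but still enters the top-2 of the last window, because better publications have expired. A publication can therefore become a top-k object some time after it was published, which is why candidates must be kept in memory.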
  29. [Pripužić 2012] Top-k/w model: DaZaLaPS. When can a publication become a Top-k object in the subscription window? Immediately upon publication, or later on, when it becomes a Top-k object in the subscription window. Maintain a set of candidate (potential Top-k) publications in memory!
  30. [Pripužić 2012] Distributed Top-k/w model. A network of processing nodes, where each node is responsible for computing Top-k/w publications. Publication flooding. Figure: nodes A to F with subscribe(s), change( ) and publish(p); copies of p flood the network.
  31. [Pripužić 2012] Distributed Top-k/w model. Subscription flooding. Proxy subscriptions: replicas of the original subscriptions, which are advertised over the network. Figure: nodes A to F with subscribe(s), change( ) and publish(p).
  32. [Pripužić 2012] Distributed Top-k/w model. Rendezvous routing. Often implemented on top of a structured peer-to-peer network. The rendezvous node is responsible for: matching mapped publications and subscriptions; delivering matching publications to subscribers directly. Figure: nodes A to F; subscribe(s) and publish(p) are both routed to the rendezvous node, followed by change( ).
  33. [Pripužić 2012] Distributed Top-k/w model. Basic gossiping. Similar to publication flooding, but publications spread randomly through an overlay network as gossip. Cannot provide any guarantee regarding publication delivery. Purely probabilistic. Figure: nodes A to F with subscribe(s), change( ) and publish(p); p reaches only some nodes.
  34. [Pripužić 2012] Distributed Top-k/w model. Informed gossiping. Each node additionally stores the subscriptions of its close neighbors and also processes those subscriptions. Partially probabilistic and partially deterministic. Figure: nodes A to F with subscribe(s), change( ) and publish(p).
  35. [Shraer 2014] Google Top-k pub/sub. Figure: a news story as a subscription, tweets as publications.
  36. [Shraer 2014] Google Top-k pub/sub. Annotating news stories with social updates (tweets) at a news website serving a high volume of page views: billions of page views at Yahoo! News per day; more than 100 million related tweets per day. Top-k pub/sub approach: stories are standing subscriptions on tweets. The Story Index is queried frequently but updated infrequently, based on DAAT and TAAT algorithms. The Tweet Index is updated frequently but queried only for new stories.
  37. [Drosou 2009] PrefSIENA. Say Addison is more interested in horror movies than comedies. Addison would like to receive notifications about (various) comedies only if there are no (or just a few) notifications about horror movies. Published events: [title = The Godfather, genre = drama, showing time = 21:10]; [title = Ratatouille, genre = comedy, showing time = 21:15]; [title = Fight Club, genre = drama, showing time = 23:00]; [title = Casablanca, genre = drama, showing time = 23:10]; [title = Vertigo, genre = drama, showing time = 23:20]. User subscriptions: genre = drama; genre = horror.
  38. [Drosou 2009] PrefSIENA. To express some form of ranking among subscriptions, PrefSIENA allows users to define priorities among them. To do this, it introduces preferential subscriptions. Based on preferential subscriptions, only the k most interesting events are delivered to users. Covering/matching relation. Figure text, subscriptions: string director = Peter Jackson, time release date > 1 Jan 2003; string director = Steven Spielberg; string genre = fantasy, string release date > 1 Jan 2003. Figure text, event: string title = LOTR: The Return of the King, string director = Peter Jackson, time release date = 1 Dec 2003, string genre = fantasy, integer oscars = 11.
  39. [Drosou 2009] PrefSIENA. Ordering subscriptions: to order user subscriptions according to the preference relation, they use the winnow operator, applying it on various levels. Step 01: Construct a DAG. Figure: user preferences among genre = drama, horror, comedy, romance and action, and the resulting preference graph.
  40. [Drosou 2009] PrefSIENA. Step 02: Perform a topological sort to compute winnow levels. The subscriptions of level l are associated with a preference rank given by a monotonically decreasing function of the level with values in [0, 1], e.g. (D + 1 - (l - 1)) / (D + 1). Figure: preference graph over genre = drama, comedy, horror, romance and action, with preference ranks 1, 2/3 and 1/3.
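Steps 01 and 02 can be sketched as repeated peeling of the undominated subscriptions, followed by the example rank function (D + 1 - (l - 1)) / (D + 1), which reproduces the ranks 1, 2/3 and 1/3 shown on the slide when there are three levels. The preference pairs below are hypothetical (the edges of the slide's graph are not recoverable from the transcript), and taking D as the number of levels minus one is an assumption chosen to match those ranks.

```python
from fractions import Fraction


def winnow_levels(nodes, prefers):
    """Peel the preference DAG layer by layer.

    `prefers` holds (better, worse) pairs. Level 1 contains the nodes that
    no other node is preferred over; each later level is found the same way
    among the nodes that remain.
    """
    remaining = set(nodes)
    levels = {}
    level = 1
    while remaining:
        dominated = {
            worse
            for better, worse in prefers
            if better in remaining and worse in remaining
        }
        current = remaining - dominated
        if not current:
            raise ValueError("preference relation contains a cycle")
        for node in current:
            levels[node] = level
        remaining -= current
        level += 1
    return levels


def preference_rank(level, depth):
    """Example rank function: (D + 1 - (l - 1)) / (D + 1)."""
    return Fraction(depth + 1 - (level - 1), depth + 1)


nodes = ["drama", "horror", "comedy", "romance", "action"]
# Hypothetical preference pairs (better, worse), for illustration only.
prefers = [
    ("drama", "horror"),
    ("comedy", "romance"),
    ("romance", "action"),
    ("horror", "action"),
]
levels = winnow_levels(nodes, prefers)
depth = max(levels.values()) - 1  # assumption: D = number of levels - 1
ranks = {node: preference_rank(level, depth) for node, level in levels.items()}
```

With these pairs, drama and comedy land on level 1, horror and romance on level 2, and action on level 3.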
  41. [Drosou 2009] PrefSIENA. Step 03: Compute event ranks. Step 04: Based on the ranks, deliver to users only the k most interesting events. Delivery modes: continuous, periodic and sliding window. Figure: user subscriptions genre = adventure (rank 0.9) and director = Peter Jackson (rank 0.7); the event [string title = King Kong, string director = Peter Jackson, time release date = 14 Dec 2005, string genre = adventure] matches both and receives rank 0.9 = max.
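The King Kong example on this slide takes the maximum rank over the matching subscriptions. A small sketch, assuming each subscription is a single attribute = value equality predicate (the `event_rank` helper and the dictionary layout are illustrative):

```python
def event_rank(event, ranked_subscriptions):
    """Return the highest rank among the subscriptions the event matches.

    Returns None when no subscription matches.
    """
    matching = [
        rank
        for (attribute, value), rank in ranked_subscriptions.items()
        if event.get(attribute) == value
    ]
    return max(matching) if matching else None


ranked_subscriptions = {
    ("genre", "adventure"): 0.9,
    ("director", "Peter Jackson"): 0.7,
}
king_kong = {
    "title": "King Kong",
    "director": "Peter Jackson",
    "release date": "14 Dec 2005",
    "genre": "adventure",
}
rank = event_rank(king_kong, ranked_subscriptions)
```

Here `rank` is 0.9, the larger of the two matching subscription ranks.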
  42. [Drosou 2009] PrefSIENA: Sliding window delivery. User subscriptions (with ranks): genre = comedy 0.9; genre = romance 0.9; genre = drama 0.8; genre = horror 0.6. k = 2, w = 4. Time axis labels: 20:00, 20:15, 20:22, 20:25, 20:50, 20:40, 20:45, 20:55. Matching events: [title = The Big Parade, genre = romance, showing time = 21:00]; [title = The Apartment, genre = comedy, showing time = 21:10]; [title = The Godfather, genre = drama, showing time = 21:25]; [title = Forrest Gump, genre = romance, showing time = 21:10]; [title = Jaws, genre = horror, showing time = 20:55]; [title = Vertigo, genre = horror, showing time = 21:45]; [title = Psycho, genre = horror, showing time = 21:50]; [title = Pulp Fiction, genre = drama, showing time = 21:25]. Delivered events: [title = The Big Parade, genre = romance, showing time = 21:00]; [title = The Apartment, genre = comedy, showing time = 21:10]; [title = Forrest Gump, genre = romance, showing time = 21:10]; [title = The Godfather, genre = drama, showing time = 21:25]; [title = Psycho, genre = horror, showing time = 21:50]; [title = Pulp Fiction, genre = drama, showing time = 21:25].
  43. [Drosou 2009] PrefSIENA. But wait: the most highly ranked events may be very similar to each other. We wish to retrieve results on a broader variety of user interests. Two different perspectives on achieving diversity: Avoid overlap: choose notifications that are dissimilar to each other. Increase coverage: choose notifications that cover as many user interests as possible. How to measure diversity? There are many alternative ways. Common ground: measure similarity/distance among the selected items.
  44. Diversity: Top-k representative set. A method to retrieve Top-k publications from the matching publications. Figure: representative Top-k, contrasting the drawback (without diversity) with what we want (with diversity).
  45. MAX* k-diversity problem. Find: $S^* = \operatorname{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$, where 1. $P = \{p_1, \dots, p_n\}$; 2. $k \le n$; 3. $d$: a distance metric; 4. $f$: a diversity function.
  46. Proposed: MAXDIVREL k-diversity problem. Find: $S^* = \operatorname{argmax}_{S \subseteq P} f(S, d, r) = \operatorname{argmax}_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$, where $g(S, d, r)$: maximize the dissimilarity and relevancy in $S$; $h(S, d, r)$: minimize the dissimilarity and relevancy in $P - S$; and 1. $P = \{p_1, \dots, p_n\}$; 2. $d$: a distance metric; 3. $r$: a relevance metric; 4. $f$: a diversity function.
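  47. Formal definition: MAXDIVREL k-diversity. $g(S, d, r)$ is maximized subject to the independence condition; it is an average, scaled by $1/|S|$, over pairs $p_i, p_j \in S$ of a term built from $r(p_i)$, $r(p_j)$ and $d(p_i, p_j)$. $h(S, d, r)$ is minimized subject to the dominance condition; it is the analogous average, scaled by $1/|P - S|$, over pairs with $p_i \in S$ and $p_j \in P - S$. Here 1. $P = \{p_1, \dots, p_n\}$; 2. $d$: a distance metric; 3. $r$: a relevance metric; 4. a distance threshold greater than 0. Independence condition: any two publications in $S$ are farther apart than the threshold. Dominance condition: every publication in $P - S$ lies within the threshold of some publication in $S$.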
  48. NP-hardness: minimum independent dominating set. Neighborhood($p_i$): the publications $p_j \ne p_i$ whose distance $d(p_i, p_j)$ is within the threshold. Figure: a publication space and its graph model over publications 1 to 5, with three example sets labelled "independent, dominating" and one labelled "dominating, not independent".
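To make the two conditions concrete, the sketch below builds one independent dominating set greedily: it visits publications in descending relevance and keeps a publication only if it is farther than the threshold from everything already kept. Every skipped publication is within the threshold of a kept one, so the result is dominating as well as independent. This is a generic heuristic written for illustration; it does not fix the size at k, it is not guaranteed to be minimum, and it is not claimed to be the greedy rule used in the talk. The 1-D points, relevance values and threshold are made up.

```python
def greedy_independent_dominating(points, relevance, distance, threshold):
    """Select points by descending relevance, skipping near-duplicates.

    A point is skipped when it lies within `threshold` of a selected point,
    so selected points are pairwise independent and every skipped point is
    dominated by some selected point.
    """
    selected = []
    for p in sorted(points, key=relevance, reverse=True):
        if all(distance(p, q) > threshold for q in selected):
            selected.append(p)
    return selected


# Hypothetical 1-D publication space with made-up relevance scores.
points = [0.0, 1.0, 2.0, 5.0, 6.0]
scores = {0.0: 0.4, 1.0: 0.9, 2.0: 0.5, 5.0: 0.3, 6.0: 0.8}
selected = greedy_independent_dominating(
    points,
    relevance=scores.get,
    distance=lambda a, b: abs(a - b),
    threshold=1.5,
)
```

In this run the two most relevant, mutually distant points (1.0 and 6.0) are kept, and each remaining point sits within 1.5 of one of them.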
  49. Naïve greedy. Equation: an argmax selection rule (its terms are not preserved in this transcript).
  50. Handling streaming publications. Continuity requirements: 1. Durability: an item selected as diversified in the ith window may still be selected in the (i+1)th window if it has not expired and the other valid items in the (i+1)th window fail to compete with it. 2. Order: the publication stream follows chronological order; we avoid selecting an item j as diverse later when we have already selected an item i that is not older than j. Figure: a window over publications 1 to 5 and the arrival of publication 6.
  51. MAXDIVREL continuous k-diversity. Straightforward solution: apply the naïve greedy method at each instance. Proposal: an incremental index mechanism that avoids the curse of re-calculating the neighborhood. Figure: matching publication stream with the ith and (i+1)th windows, MAXDIVREL k-diversity applied in each; labels: independence, dominance, durability, order.
  52. Locality Sensitive Hashing (LSH). Simple idea: if two points are close together, then after a projection operation these two points will remain close together.
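  53. LSH analysis. For any given points $p, q$: if $d(p, q) \le d_1$ then $\Pr[h(p) = h(q)] \ge p_1$; if $d(p, q) \ge d_2$ then $\Pr[h(p) = h(q)] \le p_2$. Such a hash function $h$ is $(d_1, d_2, p_1, p_2)$-sensitive. Ideally we need the gap between $p_1$ and $p_2$ to be large and the gap between $d_1$ and $d_2$ to be small.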
  54. LSH in MAXDIVREL: Publications as categorical data.
  55. LSH in MAXDIVREL: Characteristic matrix.
  56. LSH in MAXDIVREL: Minhashing. No publications any more: a signature represents each one. Technique: randomly permute the rows of the characteristic matrix m times; for each publication's column, take the number of the first row, in the permuted order, in which the column has a 1. Figure: first permutation of the rows of the characteristic matrix. Advantage: reduces the dimensions to a small minhash signature.
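The permutation technique can be sketched directly. The code assumes each publication's column is given as the non-empty set of row indices that hold a 1, and it records the position in the permuted order at which the first such row appears; the example columns and the choice of m = 4 are hypothetical.

```python
import random


def minhash_signature(column, permutations):
    """Compute one minhash value per permutation.

    `column` is the non-empty set of row indices holding a 1 in the
    characteristic matrix. perm[i] is the original row placed at position i
    of the permuted order; the minhash is the first position whose row is
    in the column.
    """
    signature = []
    for perm in permutations:
        for position, row in enumerate(perm):
            if row in column:
                signature.append(position)
                break
    return signature


rng = random.Random(7)  # fixed seed so the sketch is repeatable
num_rows = 6
permutations = [rng.sample(range(num_rows), num_rows) for _ in range(4)]  # m = 4

pub_a = {0, 2, 5}  # hypothetical publications as sets of rows
pub_b = {0, 2, 4}
sig_a = minhash_signature(pub_a, permutations)
sig_b = minhash_signature(pub_b, permutations)
```

The fraction of positions where `sig_a` and `sig_b` agree estimates the Jaccard similarity of the two columns, and the estimate improves as m grows.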
  57. LSH in MAXDIVREL: Signature matrix. Fast minhashing: select m random hash functions to model the effect of m random permutations. Mathematically proved only when the number of rows is a prime.
  58. LSH in MAXDIVREL: LSH buckets. Take r-sized signature vectors from the m-sized minhash signature. Map them into L hash tables, each with an arbitrary number b of buckets.
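One common way to realize this step is banding: cut the m-sized signature into consecutive pieces of r values, treat each piece as the key for one hash table, and reduce it to one of b buckets. The sketch below assumes m = L * r and uses Python's built-in `hash` on tuples of integers as the bucket function; both are illustrative choices, and the function names are hypothetical.

```python
def bucket_ids(signature, r, b):
    """Hash each band of r signature values into one of b buckets.

    Returns one bucket id per hash table, so L = len(signature) // r.
    """
    if len(signature) % r != 0:
        raise ValueError("signature length must be a multiple of r")
    bands = [tuple(signature[i:i + r]) for i in range(0, len(signature), r)]
    return [hash(band) % b for band in bands]


def share_a_bucket(sig_x, sig_y, r, b):
    """True if the two signatures land in the same bucket of any table."""
    ids_x = bucket_ids(sig_x, r, b)
    ids_y = bucket_ids(sig_y, r, b)
    return any(x == y for x, y in zip(ids_x, ids_y))


sig_x = [1, 4, 2, 7, 3, 3]  # hypothetical minhash signatures, m = 6
sig_y = [1, 4, 9, 9, 3, 3]
neighbors = share_a_bucket(sig_x, sig_y, r=2, b=101)
```

The two signatures agree on their first and third bands, so they collide in two of the three tables and are treated as neighbors.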
  59. LSH in MAXDIVREL: How to select L, r? Equations for two vectors x, y (not preserved in this transcript).
  60. LSH in MAXDIVREL: Analysis. For two publications x and y, let $s$ be the probability that they agree on a single minhash value (their similarity). At a particular hash table, x and y map into the same bucket with probability $s^r$ and do not map into the same bucket with probability $1 - s^r$. Over L hash tables, x and y never map into the same bucket with probability $(1 - s^r)^L$, so they share at least one bucket with probability $1 - (1 - s^r)^L$. True near neighbors will be unlikely to be unlucky in all the projections.
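The standard banding analysis for LSH gives the probability that two items with similarity s share a bucket in at least one of L tables, each keyed by r signature values, as 1 - (1 - s^r)^L. The helper below evaluates that expression so that candidate values of L and r can be compared; the function name and the sample parameters are illustrative.

```python
def collision_probability(s, r, L):
    """Probability of sharing a bucket in at least one of L hash tables.

    s: probability that two items agree on one minhash value (similarity).
    r: number of signature values per table key.
    L: number of hash tables.
    """
    return 1 - (1 - s ** r) ** L


# Compare a similar pair and a dissimilar pair under the same settings.
similar = collision_probability(0.8, r=5, L=20)
dissimilar = collision_probability(0.3, r=5, L=20)
```

Raising r makes each table stricter and pushes both probabilities down, while raising L pushes them up. Tuning the two together steepens the S-shaped curve between dissimilar and similar pairs.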
  61. LSH in MAXDIVREL: Batch-wise Top-k computation. Bucket winner: the publication with the highest relevancy score in a bucket; the winner is dominant and represents its bucket neighborhood. Top-k: the k winners that have a majority of votes; the k winners are independent. Figure: the ith window.
  62. LSH in MAXDIVREL: Incremental Top-k computation. Steps: characteristic matrix; signature matrix; map the signature into L hash tables; update the winner at each bucket the signature maps into; vote.
  63. LSH in MAXDIVREL: When a new publication F arrives. Only buckets 13, 23, 32, 43 will vote. Follow the continuity requirements: durability and order. Figure: the ith window and the (i+1)th window.
  64. Implementation.
  65. Cloud service modules. Sources: Amazon Kinesis, Amazon ElastiCache.
  66. Top-k pub/sub: DEMO.
  67. P2P Pub/Sub. Scribe: topic-based, built on top of Pastry, stateful, rendezvous. Hermes: topic- and content-based, built on top of a Pastry(-like) net, stateful, rendezvous and flooding-like. Meghdoot: content-based, built on top of CAN, stateful, rendezvous. Tera: topic-based, built on an unstructured P2P net, stateful, random-walk-based flooding. Sub2Sub: content-based, built on an unstructured P2P net, stateful, flooding-like. DHTStrings: content-based, DHT-independent, string support, stateless, rendezvous. OP-DHT Pub/Sub: content-based, (can be) built on top of Chord/Pastry/Bamboo.
  68. DHT based pub/sub: Scribe. Topic based. Based on a DHT (Pastry). Rendezvous event routing. A random identifier is assigned to each topic. The Pastry node with the identifier closest to that of the topic becomes responsible for the topic.
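The rendezvous rule on this slide can be sketched as a lookup over a known set of node identifiers. This is a simplification under stated assumptions: the 32-bit identifier space, deriving the topic identifier from a SHA-1 hash of the topic name, and the direct search over all nodes are illustrative choices, whereas a real Pastry overlay reaches the closest node through prefix-based routing without any global view.

```python
import hashlib

ID_BITS = 32  # small identifier space for readability (an assumption)
ID_SPACE = 2 ** ID_BITS


def make_id(name):
    """Derive an identifier from a name (illustrative hashing scheme)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % ID_SPACE


def ring_distance(a, b):
    """Distance between two identifiers on a circular identifier space."""
    diff = abs(a - b)
    return min(diff, ID_SPACE - diff)


def rendezvous_node(topic, node_ids):
    """Return the node identifier closest to the topic's identifier."""
    topic_id = make_id(topic)
    return min(node_ids, key=lambda node_id: ring_distance(node_id, topic_id))


node_ids = [make_id(f"node-{i}") for i in range(8)]  # hypothetical overlay
responsible = rendezvous_node("stock-quotes/Google", node_ids)
```

Subscribers and publishers of the same topic both compute the same topic identifier, so their messages meet at the same node without either side knowing the other.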
  69. DHT based pub/sub: Meghdoot. Content based. Based on the structured overlay CAN. Maps the subscription language and the event space to the CAN space. Subscription and event routing exploit the CAN routing algorithms.
  70. Top-k publish/subscribe at P2P. Stateful approaches introduce some kind of state at (intermediate) nodes. State can refer to: State needed to support specialized structures built on top of the network structure, e.g. trees (parent and children pointers). Routing state for content-based routing: subscription paths to be followed by matching publications. Subscription (meta)data: not just forward pointers to be followed and subscription content (its predicates), but also possible info as to ... What about query-inherent diversification? The controlled parameters (k and w) can change. Updates and the need to maintain state consistency may stress the system and revoke any benefits, so we'll be left with the complexity.
  71. Future work. Apply Top-k diversification modules at (un)structured P2P, exploiting the overlap among diversified results of users who have similar interests. Develop an LSH-based index for a multi-threaded distributed environment. Develop large-scale Top-k pub/sub applications by exploring other suitable use cases, e.g. a personalized newspaper for every Facebook user, a diverse set of personalized Twitter trends, social annotation of news stories.
  72. Thank you! [email protected] @SamTube405 http://geektube405.wordpress.com