State of IP Multicast
Copyright © 1999 Sun Microsystems, Inc.
Radia Perlman 1
State of IP Multicast
Radia Perlman
Outline
• Addresses
• IGMP
• Various Routing Protocols
– review of DVMRP, MOSPF, CBT, PIM-DM, PIM-SM, MSDP, BGMP/MASC
– problems (scaling, etc.)
– potential solutions: Simple Multicast/Express
Addresses
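• IP Address is 4 bytes
– “Class A”: top bit is 0
– “Class B”: top bits are 10
– “Class C”: top bits are 110
• IP Multicast address is “class D”: top bits are 1110
• Mapping to layer 2: copy the bottom 23 bits of the IP address into the MAC address. The top 24 bits are the OUI (01-00-5E), and one more bit is fixed at 0, so ISOC has some of the block left over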
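The layer 2 mapping can be sketched in a few lines. This is a minimal illustration, assuming IPv4 over Ethernet with the 01-00-5E OUI; the function names `is_class_d` and `multicast_mac` are hypothetical, chosen for this example only.

```python
import ipaddress


def is_class_d(addr):
    """Return True if the top four bits of the IPv4 address are 1110."""
    value = int(ipaddress.IPv4Address(addr))
    return (value >> 28) == 0b1110


def multicast_mac(addr):
    """Derive the Ethernet multicast address for a class D IP address."""
    if not is_class_d(addr):
        raise ValueError(f"{addr} is not a class D address")
    # Keep only the bottom 23 bits of the IP address.
    low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
    # OUI 01-00-5E fills the top 24 bits; the 25th bit stays 0.
    mac = (0x01005E << 24) | low23
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("224.0.0.1"))
print(multicast_mac("224.128.0.1"))  # same MAC: only 23 of the 28 group bits survive
```

Because 28 group bits are squeezed into 23, 32 different IP multicast addresses share each layer 2 address, so a receiver still has to filter on the IP destination.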
IGMP (Internet Group Management Protocol)
• Purpose: router on a LAN discovers which multicast addresses have receivers on LAN
• Rtr sends query. Members respond
– V1: IGMP response sent to derived layer 2 multicast address after random delay. Rtr listens promiscuously
– V2: Resign. Rtr queries again
– V3: join ({S’s},G) sent to rtr layer 2 address
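The random delay in the V1 response exists so that one report per group is enough: a member that hears another member's report for the same group cancels its own. The sketch below simulates one query round under simplifying assumptions (every report is heard instantly by all members on the LAN, no packet loss). The function name `run_query` and its parameters are hypothetical.

```python
import random


def run_query(members, max_delay=10.0, seed=None):
    """Simulate one query round.

    members: dict mapping host name -> set of groups it belongs to.
    Returns (reports, groups_heard), where reports lists the responses
    that were actually transmitted, in time order.
    """
    rng = random.Random(seed)
    # Each member starts an independent random timer per group.
    timers = [(rng.uniform(0, max_delay), host, group)
              for host, groups in members.items()
              for group in groups]
    reports = []
    groups_heard = set()
    for delay, host, group in sorted(timers):
        if group in groups_heard:
            continue  # someone already answered for this group: suppress
        groups_heard.add(group)
        reports.append((round(delay, 2), host, group))
    return reports, groups_heard


lan = {"h1": {"G1", "G2"}, "h2": {"G1"}, "h3": {"G2", "G3"}}
reports, groups = run_query(lan, seed=1)
print(len(reports), sorted(groups))
```

The router learns which groups have at least one receiver, not who the receivers are, which is all it needs to decide whether to forward onto the LAN.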
There are two ways of constructing a design. One way is to make it so simple there are obviously no deficiencies. The other way is to make it so complicated that there are no obvious deficiencies. --- Tony Hoare
DVMRP
• Flood and prune
– send data everywhere (optimization: reverse path forwarding)
– send prune (S,G)
– remember who you sent prunes to (in case join happens, so you can de-prune)
– remember prunes you received (so you can filter)
Flooding/RPF
• Forward received packet onto all links except the one it was received on
– exponential overhead
• RPF: Only accept pkt with source S on link L if you’d send to S via L
– n² overhead: each pkt goes on each link
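The RPF check can be illustrated on a small topology. This is a sketch under stated assumptions: links are undirected, unicast routing is approximated by hop-count shortest paths with alphabetical tie-breaking, and the helper names `next_hop_toward` and `rpf_flood` are hypothetical.

```python
from collections import deque


def next_hop_toward(graph, start, dest):
    """Neighbor of `start` on a shortest hop-count path to `dest`."""
    if start == dest:
        return None
    parent = {dest: None}
    queue = deque([dest])
    while queue:  # breadth-first search outward from dest
        node = queue.popleft()
        for nbr in sorted(graph[node]):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return parent.get(start)


def rpf_flood(graph, source):
    """Flood one packet from `source`; return (transmissions, accepting nodes)."""
    transmissions = 0
    accepted = {source}
    queue = deque([(source, None)])
    while queue:
        node, came_from = queue.popleft()
        for nbr in sorted(graph[node]):
            if nbr == came_from:
                continue  # never send back onto the arrival link
            transmissions += 1
            # RPF: nbr accepts only if this link is its own route to the source.
            if next_hop_toward(graph, nbr, source) == node:
                accepted.add(nbr)
                queue.append((nbr, node))
    return transmissions, accepted


square = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
print(rpf_flood(square, "A"))
```

Each router accepts the packet exactly once, so the packet crosses each link a bounded number of times instead of circulating, but it still reaches links that lead to no receivers.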
Why DVMRP Doesn’t Scale
• Leaking even a few packets for each of millions of sessions periodically
• Prune state: routers store (S,G) pairs per neighbor for groups they DON’T want (most of the millions)
MOSPF
• Pass information about all members for all groups in routing protocol
• Calculate spanning tree from source when packet arrives from (S,G) (and cache result)
• Scaling issues:
– routing control overhead (all group members)
– CPU for multiple Dijkstra calculations
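The per-(S,G) calculation can be sketched as a Dijkstra run from the source, pruned to the branches that lead to members, with the result cached. This is an illustration only: the class name `MospfRouter`, its fields, and the data layout are hypothetical and do not follow the MOSPF message formats.

```python
import heapq


def shortest_path_tree(graph, source):
    """Dijkstra: return parent pointers of the shortest-path tree from source."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nbr not in dist or nd < dist[nbr]:
                dist[nbr] = nd
                parent[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    return parent


class MospfRouter:
    def __init__(self, graph, membership):
        self.graph = graph            # link-state database: node -> {nbr: cost}
        self.membership = membership  # group -> routers with members attached
        self.cache = {}               # (S, G) -> set of (parent, child) edges
        self.dijkstra_runs = 0

    def tree_for(self, source, group):
        """Compute the pruned tree when the first (S,G) packet arrives; then reuse it."""
        key = (source, group)
        if key not in self.cache:
            self.dijkstra_runs += 1
            parent = shortest_path_tree(self.graph, source)
            edges = set()
            for member in self.membership.get(group, ()):
                node = member
                while parent.get(node) is not None:  # walk back toward the source
                    edges.add((parent[node], node))
                    node = parent[node]
            self.cache[key] = edges
        return self.cache[key]
```

Every router needs the full membership table to do this, and every new (S,G) pair costs another Dijkstra run, which is where the two scaling issues come from.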
CBT
[Figure: tree rooted at core C; nodes A, B, D, F; routers R]
CBT
• Build bidirectional tree rooted at Core
• Only routers on tree need to know about tree
• Only problem: Who is the core?
• Two mechanisms specified
– configure the routers with (C,G) mappings
– do PIM-SM bootstrap protocol (see next)
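A core-based bidirectional tree can be sketched as joins that travel hop by hop toward the core and stop at the first router already on the tree. The code below is a simplified model, assuming each router already knows its unicast next hop toward the core; the names `join`, `forward`, and the `tree` dictionary layout are hypothetical.

```python
def join(tree, next_hop, router, core):
    """Graft `router` onto the tree for one group.

    tree: dict mapping on-tree router -> set of on-tree neighbors.
    next_hop: dict mapping router -> its unicast next hop toward the core.
    """
    tree.setdefault(core, set())  # the core is always on the tree
    path = []
    node = router
    while node not in tree:       # travel toward the core until the tree is hit
        path.append(node)
        node = next_hop[node]
    for child in reversed(path):  # install state back along the path
        tree[child] = {node}
        tree[node].add(child)
        node = child


def forward(tree, node, arrived_from):
    """Bidirectional forwarding: every tree link except the arrival one."""
    return sorted(tree[node] - {arrived_from})


next_hop = {"A": "R1", "R1": "R2", "R2": "C", "B": "R3", "R3": "R2", "R9": "C"}
tree = {}
join(tree, next_hop, "A", "C")
join(tree, next_hop, "B", "C")
print(forward(tree, "R2", "R1"))  # data from A's side goes on toward C and R3
```

R9 never joined, so it holds no state for the group. Data from A reaches B via R2 without detouring through the core first, which is the practical difference from a unidirectional shared tree.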
PIM-SM
• Unidirectional Shared Tree (tunnel packet to core)
• Plus dynamically formed per-source trees when (enough) traffic occurs
Unidirectional Tree
[Figure: unidirectional tree; core C; nodes A, B, D, F; routers R]
Bidirectional Tree
[Figure: bidirectional tree; core C; nodes A, B, D, F, G; routers R]
Dynamically Formed Per-Source Trees
• If enough traffic from S
• join a tree rooted at S
• prune off from shared tree for (S,G)
• Routers keep more trees and more prune state
• State timed out. Bursty source problem
(Simplified) PIM core mapping
• PIM: “bootstrap” routers flood advertisements throughout domain
• Core capable routers register with elected BSR
• BSR announces list of cores
PIM core mapping (cont’d)
• Hash alg to map M to one of the set of currently alive core capable routers
• Core not necessarily near group, so shared tree can be really bad
• Advertisements don’t scale, so this is intra-domain only
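The idea of hashing a group onto the set of currently alive cores can be sketched as follows. This is NOT the hash function specified by PIM; it is a generic highest-score scheme built on SHA-256, chosen only to show the property that matters: every router that sees the same candidate list computes the same core, with no per-group configuration. The function name `core_for_group` is hypothetical.

```python
import hashlib


def core_for_group(group, alive_cores):
    """Pick the alive core with the highest hash score for this group."""
    if not alive_cores:
        raise ValueError("no core-capable routers alive")

    def score(core):
        digest = hashlib.sha256(f"{group}|{core}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(alive_cores, key=score)


cores = ["C1", "C2", "C3", "C4"]
chosen = core_for_group("224.1.2.3", cores)
survivors = [c for c in cores if c != chosen]
print(chosen, "->", core_for_group("224.1.2.3", survivors))
```

Nothing in the hash knows where the group's members are, which is exactly why the chosen core may be far from all of them.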
Interdomain
• Use protocols that don’t scale within domains
• Find some way of gluing domains together
– BGMP/MASC
– MSDP
BGMP/MASC
• For interdomain: have each domain dynamically choose and defend a block of multicast addresses
• Have interdomain routing protocol pass around “reachability” of multicast address blocks
• Join is in direction of multicast address prefix
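Sending a join "in the direction of the multicast address prefix" amounts to a longest-prefix match on the group address, just as unicast forwarding does on the destination. A minimal sketch, assuming a table of (prefix, next hop) pairs learned from the interdomain routing protocol; the function name, the prefixes, and the next-hop labels are all hypothetical.

```python
import ipaddress


def next_hop_for_group(group, group_routes):
    """Return the next hop whose multicast prefix most specifically covers `group`."""
    addr = ipaddress.IPv4Address(group)
    best = None
    for prefix, hop in group_routes:
        net = ipaddress.IPv4Network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


routes = [
    ("224.0.0.0/4", "upstream"),
    ("226.1.0.0/16", "domainA"),
    ("226.1.128.0/17", "domainB"),
]
print(next_hop_for_group("226.1.200.5", routes))
```

The join is forwarded toward the domain that claimed the block containing the group, so that domain becomes the root of the tree.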
Scaling Problems
• MASC
– Harder than asking entire Internet to automatically number itself with IP addresses.
– Too much bandwidth used
– Too hard to debug
– Too much of a burden on BGP
– Will run out of addresses
MSDP
• Multicast Source Discovery Protocol
• “Interim solution” until BGMP/MASC done
• Configure tunnels between core capable routers in various domains, enough so hopefully Internet is connected
• Flood (S,G) for all active (S,G)’s, throughout Internet
MSDP
[Figure: MSDP topology; nodes marked x]
Why MSDP Won’t Scale
• Too much information to pass around (all active (S,G) pairs)
• Too many tunnels to configure
“Current approach”
• Use protocols that don’t scale within a domain
• Find some way of hooking domains together for groups with members in different domains
• MSDP or MASC
Simple Multicast
• What causes the greatest complexity, scalability problems in the design?
• Remove the need for those
• Result: one scalable mechanism that will work both inside and between domains
• Doesn’t need to be called “new protocol”. Can be modification of something else
Solve 90% of the problem as simply as possible. Then remove the remaining 10% from the problem requirements --- Marshall Rose
First Simplification
• Don’t bother dynamically creating per-source trees
• Instead use a single shared, good bidirectional tree
• Less state
• Better shared tree (bidirectional)
Bidirectional Suboptimal?
• Cost to network to deliver data NOT MORE
• Core is NOT a bottleneck
• Core can be an endnode, does not need to forward data
• With single exit point from “domain”, delay difference from source tree is negligible
• Don’t need “optimal”. Need “good enough”
Bidirectional Trees Best
• Per-Source Trees
– Do NOT make network overhead lower (unless core is poorly chosen)
– More state for net (n trees rather than one)
– Only metric under which per-source tree is better is delay from source to each receiver
• Bidirectional tree, with slight care, can ensure short paths to nearby members from any source
Choosing good bidirectional tree
• From each domain (or region separated by expensive links), have routers agree on one exit point per IP address prefix
• Choose core to be a member of the group, or close to a member of the group
• No “bandwidth bottleneck” around core--it’s just a node in the tree
• C can be endnode (only fwd tunneled pkts)
Good Bidirectional Tree
[Figure: good bidirectional tree; routers R1, R2, R3, R4]
Next simplification
• Forcing all routers in Internet to figure out C from M is too expensive and complicated
• Instead, make group ID 8 bytes
• Only extra work for endnode: look up 8 byte group ID rather than 4 bytes.
• Eliminate need for multicast address allocation, domain-wide core advertisements, etc.
Simple Multicast
• Bidirectional Tree
• Group ID is (C,M)
• To create group: choose C, ask C for M
• Member discovers 8 byte (C,M) – via email, web page, SDR, directory, etc.
• Include C and M in join or IGMP reply
• Include C and M in data messages
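The 8-byte group ID can be sketched as two 4-byte addresses packed back to back, with the core handing out M values on request. This is an illustration only: the `Core` class, the `create_group` method, and the starting value for M are hypothetical, and the slides do not say what range M is drawn from.

```python
import ipaddress


def pack_group_id(core, m):
    """Concatenate two 4-byte addresses into the 8-byte group ID (C, M)."""
    return ipaddress.IPv4Address(core).packed + ipaddress.IPv4Address(m).packed


def unpack_group_id(data):
    """Split an 8-byte group ID back into (C, M) as dotted-quad strings."""
    if len(data) != 8:
        raise ValueError("group ID must be exactly 8 bytes")
    return (str(ipaddress.IPv4Address(data[:4])),
            str(ipaddress.IPv4Address(data[4:])))


class Core:
    """A core that allocates M values; uniqueness is only needed per core."""

    def __init__(self, address, first_m="224.0.1.0"):  # first_m is an assumption
        self.address = address
        self._next_m = int(ipaddress.IPv4Address(first_m))

    def create_group(self):
        m = str(ipaddress.IPv4Address(self._next_m))
        self._next_m += 1
        return pack_group_id(self.address, m)


g1 = Core("192.0.2.1").create_group()
g2 = Core("198.51.100.7").create_group()
print(unpack_group_id(g1), unpack_group_id(g2), g1 != g2)
```

Two cores can hand out the same M without conflict because the pair (C,M) differs, which is why no Internet-wide address allocation is needed. A router that sees the ID in a join or data packet reads C directly instead of mapping M to a core.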
Simple Multicast Variants
• (C,G) in join, not in data messages
– requires unique G’s
– what if disagreement about C for G?
• (C,G) in both join and data
– explicitly (e.g., IP option)
– MPLS
– use link-local destination address
Link Local Destination Address
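[Figure: message exchange along the path A, R1, R2, C]
• Each hop toward C: Join C,G
• Each hop answers: Ack C,G, use X1 (first hop); Ack C,G, use X2 (second hop); Ack C,G, use X3 (third hop)
• Data on each hop: dest=X1, dest=X2, dest=X3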
Simple Multicast Variants, Cont’d
• Express
– 8-byte group ID (S,G)
– Unidirectional Tree
– If multiple senders
  • create multiple trees
  • tunnel to S
Issues (with good answers)
• Access Control: controlling who sends by configuring “one” node
• Reliability if core goes down
• Backward compatibility (migrating nodes one at a time)
“Access Control”
• Suppose want to restrict senders?
• Express: S can choose not to forward from others
• PIM: RP can be configured with authorized senders. Refuse to forward. (but members below 1st hop router will receive pkts)
• SM: Core can be configured, and tell others in heartbeat
Access Control, Cont’d
• What if list doesn’t fit in the heartbeat msg?
– Only say no S “if needed” (after bad S sends)
– Only say yes S if needed (S tunnels to core or asks permission of core)
– Can have list of yes’s, no’s, or both
Multiple Groups for Availability
• Rather than “backup core”, just create multiple groups (C1,M1), (C2,M2) and members join both
• Transmit on one (one where you’re getting heartbeat). Receive on both.
• Or if application requires absolute timeliness, transmit on both
• Also, create multiple for load sharing
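The sender-side choice between groups can be sketched as a check on which cores have been heard from recently. This is a minimal illustration, assuming the sender records the arrival time of the last heartbeat per group; the function name, the timeout value, and the `transmit_on_all` flag are hypothetical.

```python
def choose_transmit_groups(last_heartbeat, now, timeout, transmit_on_all=False):
    """Decide which (C, M) groups to transmit on.

    last_heartbeat: dict mapping (C, M) -> time the last heartbeat was heard.
    Members receive on every group in the dict regardless of this choice.
    """
    if transmit_on_all:
        # Timeliness matters more than bandwidth: send on every group.
        return list(last_heartbeat)
    alive = [g for g, seen in last_heartbeat.items() if now - seen <= timeout]
    return alive[:1]  # one live group is enough


heard = {("C1", "M1"): 100.0, ("C2", "M2"): 118.0}
print(choose_transmit_groups(heard, now=120.0, timeout=5.0))
```

In this run C1's heartbeat is stale, so the sender switches to (C2, M2) with no failover protocol between the cores.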
Multiple Groups
• Interdomain policy might require a tree per source domain.
– Create a single tree for each domain rather than one per source in that domain.
– Can use shared tree like RP: If create extra auxiliary tree, have it advertised via heartbeat
Distributed Cores
• If really want failover to another core
• Have protocol among core capable routers
• They advertise among themselves
• Winner injects host route
• Will be less overhead than PIM BSR protocol advertising throughout domain
Backward Compatibility
• Simplest: look different so other multicast protocols won’t forward the packet
• Assume incremental deployment
• Join sent to Core. Unicast by non-SM rtrs
• Data destination=core or M or tunnel endpoint
Automatically discovering Tunnel
• R1 sends “join”. Destination=core
• Forwarded until it reaches R2
• R2 notes pkt rcv’d from non-neighbor R1
• Adds “tunnel port” to R1 to state for (C,M)
• Sends join-ack to R1
• R1 creates “tunnel port” to R2 as parent port for (C,M)
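The discovery steps above can be sketched as two routers exchanging a join and a join-ack. This is a simplified model: the non-SM routers in between are assumed to unicast-forward the packets unchanged and are not modelled, and the class `SmRouter`, its methods, and the packet dictionaries are hypothetical.

```python
class SmRouter:
    def __init__(self, name, neighbors=()):
        self.name = name
        self.neighbors = set(neighbors)  # directly attached SM routers
        self.state = {}                  # (C, M) -> {"parent": port, "children": set}

    def _port_for(self, peer):
        """A direct neighbor gets a link port; anyone else gets a tunnel port."""
        kind = "link" if peer in self.neighbors else "tunnel"
        return (kind, peer)

    def _entry(self, group):
        return self.state.setdefault(group, {"parent": None, "children": set()})

    def send_join(self, group):
        core, _m = group
        return {"type": "join", "from": self.name, "dest": core, "group": group}

    def receive_join(self, pkt):
        """Note where the join came from, add a child port, and acknowledge."""
        self._entry(pkt["group"])["children"].add(self._port_for(pkt["from"]))
        return {"type": "join-ack", "from": self.name,
                "dest": pkt["from"], "group": pkt["group"]}

    def receive_ack(self, pkt):
        """The acknowledging router becomes the parent for this group."""
        self._entry(pkt["group"])["parent"] = self._port_for(pkt["from"])


group = ("C", "M")
r1 = SmRouter("R1")
r2 = SmRouter("R2", neighbors=["R5"])
ack = r2.receive_join(r1.send_join(group))
r1.receive_ack(ack)
print(r2.state[group]["children"], r1.state[group]["parent"])
```

No tunnel is configured by hand: it appears as a side effect of the join passing through routers that do not understand it.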
Tunnel needed
[Figure: routers in a line: R1, r, r, R2, r, C; other labels: A, R3, B, D]
• R1 -- R2 and R2 -- C are “tunnels”
• IP option contains both C and M
• IP destination address has C or tunnel endpoint or M
New Protocol or New version of existing protocol?
• No reason to do “totally new thing”
• Two suggestions: bidirectional shared trees, and group ID=(C,G)
• Suggestions orthogonal
• CBT and BGMP already do bidirectional trees. PIM could be modified to do it
• Easy to modify any of them to get core from pkt
Summary
• Shared bidirectional trees
– fewer trees to keep track of and maintain
– more efficient than tunneling to core
• Group ID C+M
– trivial address allocation
– no extra info for BGP to pass around
– no “core capable router advertisements”
– controlled selection of core for group
Summary
• This stuff doesn’t have to be so complicated
• It would be good for Internet if multicast really could allow millions of groups, easily formed by anyone