
Evolvable multi-party systems & sophisticated diagnosis tools for the cloud ecosystem

Raja R. Sambasivan

[Figure 1: The cloud ecosystem and my work within it. The figure shows a cloud datacenter's stack layers (distributed applications, guest OS, virtualization, host OS, network, hardware), the party with control over each layer (tenants or provider, provider, provider or ISPs), and the wide area connecting the datacenter to other datacenters and eyeball ISPs. Parentheses (A)-(D*) mark areas in the cloud ecosystem in which I have built systems. Angle brackets <E, F, G, H, I*, J*, K*> indicate my contributions building diagnosis tools (workflow-centric tracing). Asterisks mark current projects, which I will continue to work on as faculty.]

The cloud ecosystem is critical to modern society. We use it when watching movies, buying goods, trading stocks, navigating to destinations, collaborating at work, and even when playing games. This ecosystem is incredibly complex (see Figure 1 for a simplified view). It is comprised of cloud datacenters, eyeball ISPs, and the wide-area network. Cloud datacenters are built and managed by various cloud providers (one provider controls one or more entire datacenters). They run many distributed applications (e.g., shopping applications or distributed-storage applications), which may belong to external tenants (e.g., Target) or internal ones (i.e., cloud providers themselves). Distributed applications may be comprised of 1000s of nodes and any subset of them could be involved in the workflow (i.e., processing) of an application request.

In addition to application nodes, many datacenter stack layers are involved in the processing of application requests. The guest OS layer is controlled by tenants and runs their preferred operating systems. Lower layers are controlled by cloud providers. They include the virtualization layer to multiplex physical resources among tenants, the host OS layer to manage physical computers, the network layer to manage physical switches, and the physical hardware itself. Eyeball ISPs (e.g., Verizon) are Internet service providers that contain tenants' customers (e.g., online shoppers). The wide-area network (e.g., the Internet) connects cloud datacenters to each other and to eyeball ISPs. It is used to transfer data among applications with nodes in multiple datacenters and to reach customers.

Today's cloud ecosystem limits innovation because parts of it are not evolvable (i.e., changeable), because evolvable portions are controlled by a single party, and because the difficulty of diagnosing problems in this ecosystem disincentivizes change. As a systems and networking researcher, I am broadly interested in building the multi-party evolvable systems necessary to support innovation within the cloud ecosystem and the diagnosis tools necessary to guarantee their health. Below, I characterize my past and current work in building evolvable systems [1, 3, 8, 11] (marked with parentheses in Figure 1) and in building diagnosis tools [2, 4-7, 9, 10] (marked with angle brackets in Figure 1).

Building evolvable systems: As a PhD student, (A) I helped build Ursa Minor [1, 3, 11], an object-based distributed-storage application whose goal was to support a diverse range of possible current and future datacenter workloads. As a postdoctoral researcher, (B) I implemented a version of BGP that can facilitate the deployment of the wide range of critical fixes and sophisticated replacements that have been proposed for it. A key result of this research was that small changes to BGP allow for a rich, evolvable public Internet that supports many diverse inter-domain routing protocols.

As a research scientist, (C) I am helping build a public cloud datacenter (called the Mass Open Cloud) that enables innovation by supporting multiple providers as well as tenants. Providers can install a range of infrastructure and services and compete with each other to provide tenants with services. It also aims to make datasets that are typically hidden by cloud providers (e.g., problem logs) available to cloud researchers (e.g., to better support diagnosis research). As part of this effort, (D) I am collaborating with researchers to create an interface that both enables multiple datacenter networks (e.g., ones from different providers) to be installed within datacenters and enables tenants to choose among them.

Building tools for diagnosing systems: My experiences building systems have informed my work on diagnosis. During my PhD, <E> I built one of the earliest workflow-centric tracing infrastructures (also called end-to-end tracing) that strongly supports diagnosis tasks [9]. Such tracing is uniquely suited to informing operational tasks within distributed applications (e.g., diagnosis and resource attribution). With low overhead, it preserves the workflow of individual requests as they are processed within and among the nodes of a distributed application. Workflow-centric tracing works by propagating context (e.g., unique IDs and logical clocks) along with the flow of requests' execution. This context is stored within records of trace points executed by requests. Workflow traces are created by stitching together trace-point records with related context. Traces are graphs in which vertices represent trace points. Edges represent causal relationships between trace points and are labeled with inter-trace-point latencies. Today, basic versions of workflow-centric tracing are being adopted by industry for use in their production systems.
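To make the mechanism concrete, the following minimal sketch shows context propagation and trace-point records; the names (TraceContext, trace_point) and the record format are illustrative, not the actual infrastructure's interface.

    import time
    import uuid

    # Illustrative sketch: each request carries a trace ID plus a pointer to
    # its causal predecessor; every trace point emits a record holding both.
    _records = []

    class TraceContext:
        def __init__(self, trace_id=None, parent_event=None):
            self.trace_id = trace_id or uuid.uuid4().hex
            self.parent_event = parent_event  # causal predecessor, if any

    def trace_point(ctx, name):
        event_id = uuid.uuid4().hex
        _records.append({
            "trace_id": ctx.trace_id,    # stitching key
            "event": event_id,
            "parent": ctx.parent_event,  # graph edge: parent -> event
            "name": name,
            "ts": time.time(),           # yields inter-trace-point latencies
        })
        return TraceContext(ctx.trace_id, event_id)

    # One request's workflow; the context travels with the request across nodes.
    ctx = trace_point(TraceContext(), "frontend:receive")
    ctx = trace_point(ctx, "storage:read")
    trace_point(ctx, "frontend:reply")

Grouping records by trace ID and following parent pointers reconstructs each request's graph; timestamp differences along each edge supply the latency labels.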


<F> I designed and built Spectroscope [9], one of the earliest automated diagnosis tools that uses workflow-centric traces. Spectroscope helps engineers with their diagnosis efforts by automatically localizing the source of a newly-observed problem from the many distributed-application nodes that may contain the root cause to a few likely candidates. My work on Spectroscope is highly cited (over 160 citations) and was shown to be effective within certain Google infrastructure applications. <G> I created visualizations [7] to effectively present Spectroscope's results to engineers. My visualizations have influenced recent visualizations in Jaeger, an open-source workflow-centric tracing infrastructure.

<H> I used my experiences with workflow-centric tracing to systematize its design axes [6]. I found that different design choices for these axes dictate a tracing infrastructure's utility for different operational tasks. I found that design choices that appear innocuous at first can lead to high overheads due to request batching within distributed applications.

As a research scientist, I am advising students on three projects that build on workflow-centric tracing. <I> First, we are extending workflow-centric tracing to capture elements of requests' workflows within the guest OS layer. This will empower both existing diagnosis tools (e.g., Spectroscope) and the new ones we are creating to diagnose cross-layer problems (e.g., a slow request caused by a bug within the OS). <J> Second, we are building an instrumentation framework that explores the search space of possible instrumentation choices. It will automatically enable the trace points needed to provide insight into a newly-observed problem in running applications. My proposed research on this framework has just been awarded an NSF CSR Small Grant [2], on which I am principal investigator.

<K> Third, we are creating a novel diagnosis abstraction that will mitigate the amount of complexity that engineers (or diagnosis tools) must deal with at any time during diagnosis. We expect that this abstraction will both improve existing diagnosis tools (e.g., Spectroscope) and enable a broad range of new diagnosis-related use cases.

In the following sections, I describe some of my research on building evolvable systems and building diagnosis tools in detail. These efforts were (or are) collaborations with many systems, networking, machine learning, and visualization researchers as well as industrial practitioners from Google, LightStep, and Red Hat.

Enabling evolvability for inter-domain routing

[Figure 2: A D-BGP multi-protocol advertisement. The advertisement carries a baseline address (e.g., 128.6.0.0/32) and path vector alongside path descriptors for several protocols from the literature (e.g., a Wiser path cost, a BGPSec attestation) and island descriptors (e.g., Wiser portal addresses, SCION within-island paths). Island IDs name groups of contiguous ISPs that have deployed the same protocol. Path descriptors describe per-protocol properties of entire paths. Island descriptors describe advanced routing options available within islands (e.g., extra within-island paths that sources can choose, but cannot be expressed outside of an island because of BGP's single-path limitation).]

The Internet's inter-domain routing infrastructure is a critical piece of its architecture. The routing paths it computes in the control plane connect all of the entities on the Internet (e.g., different datacenters, different ISPs). Today, this infrastructure is provided by a single protocol, BGP, which is extremely limited. It does not provide any quality of service, it advertises only one (potentially poorly performing) best-effort routing path to sources of traffic, it is susceptible to security attacks, and it does not provide any support to help diagnose routing problems. Worst of all, BGP is architecturally rigid (it requires directly neighboring ISPs to use the same inter-domain routing protocol) and thus cannot facilitate the introduction of more advanced inter-domain routing protocols.

BGP's rigidity disincentivizes ISPs from deploying new routing protocols. ISPs that are interested in deploying a specific new protocol must do so in contiguous groups of isolated islands. These islands cannot discover one another, disseminate new protocols' information to one another, or use the new protocol to exchange traffic amongst one another. These limitations reduce the benefits of deploying the protocol. Islands can use overlays to forcibly (i.e., without support from BGP) route traffic to islands that have deployed the new protocol. But, the data-plane tunnels that overlays use to force traffic on certain paths can increase the costs of ISPs that are (currently) not interested in deploying the new protocol. This incentivizes them to push back against the protocol's deployment (e.g., when it is first being standardized).

In this research [8], I identified what features are needed in any inter-domain routing protocol to bootstrap evolution to new inter-domain routing protocols, i.e., to facilitate their deployment across non-contiguous ISPs and, if desired, gradually deprecate itself in favor of one of them. By incorporating these evolvability features, these new protocols would themselves be able to bootstrap evolvability to further new protocols. To identify the features, I systematized what support fourteen recently-proposed inter-domain routing protocols would need to be deployed across (non-contiguous) islands.

The features I identified, pass-through support within routers and multi-protocol support within advertisements, reduce rigidity for two reasons. First, combined, they cleanly separate the information contained in protocols' connectivity advertisements from that used by their path-selection algorithms. This transforms advertisements into containers that can carry multiple inter-domain routing protocols' information. Second, multi-protocol advertisements include extra support to allow islands that express their within-island paths in different ways (e.g., path vector and link state) to establish routing paths that include each other. Figure 2 shows a multi-protocol advertisement for a routing path that includes several new protocols from the literature.
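A minimal sketch of these two features follows, assuming a simplified advertisement format loosely modeled on Figure 2; the field names and the select_path placeholder are illustrative, not D-BGP's wire format.

    # Hypothetical multi-protocol advertisement: a BGP-compatible baseline plus
    # per-protocol descriptors (modeled loosely on Figure 2).
    advertisement = {
        "baseline": {"address": "128.6.0.0/32",
                     "path_vector": ["br70", "br50", "br10", "br1"]},
        "path_descriptors": [
            {"protocols": ["Wiser"], "field": "path cost", "value": 100},
            {"protocols": ["BGPSec"], "field": "attestation", "value": "<signatures>"},
        ],
    }

    def select_path(descriptor):
        # Placeholder for a deployed protocol's own path-selection logic.
        return descriptor

    def readvertise(ad, understood):
        """Pass-through support: a router processes descriptors for protocols
        it has deployed and forwards the rest unmodified."""
        out = {"baseline": dict(ad["baseline"]), "path_descriptors": []}
        for d in ad["path_descriptors"]:
            if set(d["protocols"]) & understood:
                out["path_descriptors"].append(select_path(d))  # understood
            else:
                out["path_descriptors"].append(d)  # opaque pass-through
        return out

    # A BGP-only router still forwards Wiser's and BGPSec's information intact.
    print(readvertise(advertisement, understood={"BGP"}))

Because descriptors for unknown protocols survive traversal of non-deploying ISPs, non-contiguous islands can discover one another and exchange the new protocol's information without data-plane tunnels.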

To understand the benefits the evolvability features can provide and the difficulty of incorporating them into an existing inter-domain routing protocol, I created a version of BGP (called D-BGP) that includes them. I implemented D-BGP within Quagga, an open-source router. I found that D-BGP is not that far from BGP. It only required ~900 lines of extra code, and many (but not all) of D-BGP's features can be provided by carefully systematizing BGP's community attributes. I found that D-BGP can support an Internet comprised of many diverse protocols, including critical fixes to D-BGP itself, value-added services, and various other advanced protocols. I also found that D-BGP significantly accelerates the rate at which early adopters see the benefits of adopting specific new protocols at low adoption rates (e.g., at 10-40% adoption) compared to using BGP without data-plane tunnels.

Next steps: The key next step for this research is to explore how D-BGP itself could be deployed. Also, research is needed to contrast the benefits of D-BGP's approach for enabling evolvability to that of software-defined exchanges (SDXs) and to explore whether a combination of both approaches could yield even greater evolvability benefits.

Automatically localizing performance degradations

A critical step of an application engineer's diagnosis efforts is problem localization. This challenging task involves identifying which subset of the myriad number of entities (application nodes, code in lower stack layers, ISPs) involved in requests' workflows are the likely sources of the problem. Once these potential culprit(s) have been identified, engineers can proceed to identify the root cause. To help engineers with this step, my dissertation research involved building and evaluating a tool called Spectroscope that uses workflow-centric traces to automatically localize the sources of performance degradations [7, 9]. Spectroscope's focus was on the application layer (partially because the workflow-centric traces I used with it only captured application-level slices of requests' workflows).

[Figure 3: Spectroscope's output. Two mocked-up mutations are shown, in which shapes represent nodes accessed: a rank-1 structural mutation (7,000 requests; reads whose workflows shift from a cache hit to a cache miss) and a rank-2 timing mutation (5,000 requests; writes whose latencies grow). The legend distinguishes mutations from their precursors; edges are labeled with latencies in microseconds.]

Spectroscope's key insight is that performance degradations often manifest as mutations in requests' workflows. Two types of mutations are possible. Structural mutations involve changes in concurrency, synchronization, or nodes and functions visited by requests. Timing mutations only involve changes in latencies. For a given degradation, exposing the resulting mutations, identifying their previous behavior (called precursors), and showing how mutations and precursors differ localizes the source of the problem. Additional guidance can be provided by ranking mutations and their corresponding precursors by their contribution to the performance change. Spectroscope's key insight allows it to work without labels indicating whether individual requests are problematic.

Spectroscope takes as input workflow-centric traces from a degraded period of a distributed application's execution and a non-degraded period. It then exploits a variety of domain-specific heuristics, machine-learning algorithms, statistical algorithms, and visualization techniques to achieve its goals (see Figure 3 for a view of Spectroscope's output). To evaluate Spectroscope, I used it to diagnose eight performance degradations in Ursa Minor and certain Google infrastructure services. Six of the eight problems were real and previously unsolved. Spectroscope's results successfully localized the sources of all of the problems.
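The following simplified sketch conveys the comparison at Spectroscope's core, assuming traces have already been reduced to (workflow structure, latency) pairs; the real tool operates on full trace graphs and uses statistical tests rather than these toy heuristics.

    from collections import Counter
    from statistics import mean

    # Toy localization: compare a non-degraded ("before") period against a
    # degraded ("after") period; each sample is (workflow_structure, latency_us).
    def localize(before, after):
        b_counts = Counter(s for s, _ in before)
        a_counts = Counter(s for s, _ in after)
        candidates = []
        for s in a_counts:
            a_lat = mean(l for st, l in after if st == s)
            grew = a_counts[s] - b_counts.get(s, 0)
            if grew > 0:
                # Structural mutation: a workflow seen (far) more often after.
                candidates.append(("structural", s, grew * a_lat))
            if s in b_counts:
                b_lat = mean(l for st, l in before if st == s)
                if a_lat > b_lat:
                    # Timing mutation: same workflow, higher latency.
                    candidates.append(("timing", s, (a_lat - b_lat) * a_counts[s]))
        # Rank by estimated contribution to the overall performance change.
        return sorted(candidates, key=lambda c: -c[2])

    # Example: reads shift from a fast cache-hit path to a slow cache-miss path.
    before = [("read:hit", 20)] * 5 + [("read:miss", 5000)] * 1
    after = [("read:hit", 20)] * 1 + [("read:miss", 5000)] * 5
    print(localize(before, after)[0])  # -> ('structural', 'read:miss', 20000)

The ranking step is what spares engineers from inspecting every workflow: the few mutations that dominate the performance change surface first.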

Next steps: There are four promising next steps, the first three of which I am currently pursuing. They include: 1) Adding support for workflow-centric tracing to the guest OS layer to enable Spectroscope to diagnose cross-layer problems, 2) Creating a just-in-time instrumentation framework to guarantee that the instrumentation needed to sufficiently localize a new problem is present in the system when it is first observed (see next section), 3) Creating a more granular abstraction that Spectroscope can use as input (instead of entire traces) that will allow it to both more directly identify mutations and precursors and scale to larger systems, and 4) Creating a new version of Spectroscope whose algorithms are suited for localizing cross-layer problems, using just-in-time instrumentation, and using the more granular abstraction.


Ensuring instrumentation presence via just-in-time instrumentation

When using Spectroscope, I often found that both Ursa Minor and Google lacked instrumentation to provide sufficient visibility into problems. As a result, with Ursa Minor, I often used Spectroscope in time-consuming iterative cycles of localizing the problem using existing instrumentation, adding additional trace points in areas it identified, then re-running Spectroscope. I could not add additional instrumentation to Google applications.

My experiences with Spectroscope reflect a broad truth. It is difficult to know where (e.g., in which nodes), at what granularity (e.g., entire nodes or individual functions), and within which stack layer (e.g., app or guest OS) instrumentation must be added (or enabled) a priori to provide visibility into problems that may occur far in the future. It is also difficult to know what data must be exposed within instrumentation (e.g., queue lengths or function parameters) to provide the needed visibility. Of course, enabling all possible instrumentation would result in unacceptable overheads.

We are currently creating an instrumentation framework (called Pythia) that will automatically enable the instrumentation needed to diagnose a new problem as soon as (or very soon after) it is first observed in a deployed application [2]. Our initial focus is on the application and guest OS layers. Pythia's key insight is that high performance variation among requests that are expected to perform similarly indicates that there is something important that is unknown about their workflows. This unknown behavior may represent problems (e.g., resource contention) or generally interesting heterogeneity (e.g., uninstrumented code paths that have very different performance profiles) [5].

Localizing the source(s) of the high variation provides insight into where additional instrumentation is needed. Search strategies based on operator knowledge, statistics, or machine learning can then be used to identify what instrumentation is needed to explain it. To diagnose problems that manifest as consistently slow performance (e.g., long queues), a similar procedure that focuses on identifying dominant contributors to request response times can be used.

Using this insight, our current approach for Pythia operates in a continuous cycle of obtaining workflow-centric traces, localizing variation and high response times, and using search strategies to enable or disable instrumentation. Pythia may also choose to do nothing because there are no more unexplained sources of variation or high response times. We are currently building a prototype and exploring causes of high variation within OpenStack. We also plan to apply our Pythia prototype to Ceph. Our prototype controls trace points that have already been added to the application, but it could also use programming-language support to dynamically inject instrumentation into a running system.
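One iteration of that cycle might look like the sketch below, assuming traces reduced to per-workflow latency samples; the names (high_variation_groups, candidates_for) and the coefficient-of-variation threshold are placeholders for Pythia's actual localization and search strategies.

    import statistics

    # Flag workflow groups whose latency variation is unexpectedly high; such
    # groups are hiding something important about their workflows.
    def high_variation_groups(traces, threshold=0.5):
        groups = {}
        for t in traces:  # t: {"workflow": ..., "latency_us": ...}
            groups.setdefault(t["workflow"], []).append(t["latency_us"])
        flagged = []
        for wf, lats in groups.items():
            if len(lats) > 1 and statistics.mean(lats) > 0:
                cv = statistics.stdev(lats) / statistics.mean(lats)
                if cv > threshold:  # coefficient of variation too high
                    flagged.append(wf)
        return flagged

    # One cycle: localize unexplained variation, then let a search strategy
    # (candidates_for) pick which dormant trace points to enable.
    def pythia_cycle(traces, enabled, candidates_for):
        for wf in high_variation_groups(traces):
            enabled.update(candidates_for(wf))
        return enabled  # unchanged if nothing is left unexplained

    traces = [{"workflow": "read", "latency_us": l} for l in (10, 12, 5000, 11)]
    print(pythia_cycle(traces, set(), lambda wf: {wf + ":os-syscalls"}))

A symmetric pass over dominant contributors to response time would handle consistently slow behavior (e.g., long queues), per the procedure described above.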

Longer-term research plans

My ultimate research goal is to create a rich multi-party cloud ecosystem that supports innovation by being easily evolvable and easy to diagnose. In addition to the next steps identified in previous sections, I plan to pursue the longer-term research listed below to achieve this ambitious goal.

Create mechanisms to correlate and share data relevant to diagnosis among providers, tenants, and ISPs: In the data plane, workflow-centric tracing enables data from multiple parties to be correlated. Therefore, I plan to instrument additional stack layers using it and explore its applicability to the network (both datacenter and wide area). As part of this effort, I will conduct research to determine if tracing is sufficient for all of the complex behaviors systems can exhibit (e.g., timer interrupts) and propose other primitives if it is not. For the control plane, I plan to explore whether existing provenance mechanisms are sufficient to provide visibility into why decisions were made (e.g., why certain routing or scheduler decisions were made) and propose alternate mechanisms if not. For both the control and data planes, I will conduct research to 1) determine if information from different parties should be materialized all at once or on-demand, 2) create protocols and frameworks that support these all-at-once and on-demand approaches, and 3) explore how these approaches could be deployed (e.g., via D-BGP).

Identify ways to incentivize (or avoid disincentives for) multiple parties to share information: Despite having mechanisms for sharing data, parties may not wish to share information due to business concerns. To make progress on this important research problem, I plan to identify problem scenarios in which multiple parties must cooperate to avoid poor outcomes, e.g., cases where different parties reacting to a problem independently will exacerbate the problem. The information that must be shared to avoid exacerbating the problem could form a "bare minimum" of data that is always shared. I also plan to explore how other approaches, such as multi-party computation or differential privacy, could help.

Build mechanisms for informed evolution between clouds and the Internet: Such mechanisms would allow clouds and ISPs to fluidly evolve to support one another's needs. This would spur innovation similar to that only seen within extremely large cloud providers' private WANs, thus leveling the playing field between them and smaller clouds. These mechanisms would also enable ISPs to better support startups operating in new domains (e.g., IoT). As a first step, I plan to explore how our interfaces for allowing datacenter tenants to choose among different datacenter networks (and the routing paths offered by them) could be combined with D-BGP's and IXPs' approaches for enabling routing evolvability.


References

[1] M. Abd-El-Malek, W. V. Courtright II, C. Cranor, G. R. Ganger, J. Hendricks, A. J. Klosterman, M. Mesnier, M. Prasad, B. Salmon, R. R. Sambasivan, S. Sinnamohideen, J. Strunk, E. Thereska, M. Wachs, and J. Wylie. Ursa Minor: versatile cluster-based storage. In FAST '05: Proceedings of the 4th USENIX Conference on File and Storage Technologies, December 2005.

[2] CSR: Small: A just-in-time, cross-layer instrumentation framework for diagnosing performance problems in distributed applications, October 2018. https://www.nsf.gov/awardsearch/showAward?AWD_ID=1815323&HistoricalAwards=false.

[3] J. Hendricks, R. R. Sambasivan, S. Sinnamohideen, and G. R. Ganger. Improving small file performance in object-based storage. Technical Report CMU-PDL-06-104, Carnegie Mellon University, May 2006.

[4] M. Mesnier, M. Wachs, R. R. Sambasivan, J. Lopez, J. Hendricks, G. R. Ganger, and D. O'Hallaron. //Trace: Parallel trace replay with approximate causal events. In FAST '07: Proceedings of the 5th USENIX Conference on File and Storage Technologies, February 2007.

[5] R. R. Sambasivan and G. R. Ganger. Automated diagnosis without predictability is a recipe for failure. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, June 2012.

[6] R. R. Sambasivan, I. Shafer, J. Mace, B. H. Sigelman, R. Fonseca, and G. R. Ganger. Principled workflow-centric tracing of distributed systems. In SoCC '16: Proceedings of the Seventh Symposium on Cloud Computing, October 2016.

[7] R. R. Sambasivan, I. Shafer, M. L. Mazurek, and G. R. Ganger. Visualizing request-flow comparison to aid performance diagnosis in distributed systems. IEEE Transactions on Visualization and Computer Graphics (Proceedings Information Visualization 2013), 19(12), December 2013.

[8] R. R. Sambasivan, D. Tran-Lam, A. Akella, and P. Steenkiste. Bootstrapping evolvability for inter-domain routing with D-BGP. In SIGCOMM '17: Proceedings of the ACM SIGCOMM 2017 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, August 2017.

[9] R. R. Sambasivan, A. X. Zheng, M. De Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, and G. R. Ganger. Diagnosing performance changes by comparing request flows. In NSDI '11: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, March 2011.

[10] R. R. Sambasivan, A. X. Zheng, E. Thereska, and G. R. Ganger. Categorizing and differencing system behaviours. In HotAC II: Proceedings of the 2nd Workshop on Hot Topics in Autonomic Computing, June 2007.

[11] S. Sinnamohideen, R. R. Sambasivan, J. Hendricks, L. Liu, and G. R. Ganger. A transparently-scalable metadata service for the Ursa Minor storage system. In ATC '10: Proceedings of the 2010 USENIX Annual Technical Conference, June 2010.
