Quals


Transcript of Quals

Page 1: Quals

Monitoring, Diagnosing, and Repairing

Eric Anderson

U.C. Berkeley

Page 2: Quals

2Apr 10, 2023

Overview

What is System Administration?
– What is the problem?
– Goals of Dissertation Research
– Goals of System Administration
Monitoring, diagnosing, and repairing
Dissertation Timeline
Conclusion

Page 3: Quals

What is the problem?

Problems occur in systems and result in loss of productivity
– Server failures → denial of service
– System overload → lower productivity
Cost is too high
– Cost of ownership estimated at $5,000-$15,000/year/machine
– Median salary (~$50k) / (median # machines/admin) ≈ $700
Our goal: Reduce cost by
– Repairing problems faster (possibly automatically)
– Handling more problems
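The administrator-cost figure on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the salary, the $700 result and the ownership range come from the slide, while the machines-per-administrator ratio is not stated there and is back-derived here, so it is an implied value rather than reported data. The function names are illustrative.

```python
# Figures taken from the slide; everything derived below is implied, not reported.
MEDIAN_SALARY = 50_000                  # dollars per year (~50k on the slide)
ADMIN_COST_PER_MACHINE = 700            # dollars per machine per year (slide's result)
OWNERSHIP_COST_RANGE = (5_000, 15_000)  # dollars per machine per year


def implied_machines_per_admin(salary: float, cost_per_machine: float) -> float:
    """Invert salary / machines = cost_per_machine to recover the ratio."""
    return salary / cost_per_machine


def admin_share_of_ownership(cost_per_machine: float, ownership_cost: float) -> float:
    """Fraction of the yearly cost of ownership that is administrator salary."""
    return cost_per_machine / ownership_cost


if __name__ == "__main__":
    machines = implied_machines_per_admin(MEDIAN_SALARY, ADMIN_COST_PER_MACHINE)
    print(f"implied machines per administrator: {machines:.0f}")
    for total in OWNERSHIP_COST_RANGE:
        share = admin_share_of_ownership(ADMIN_COST_PER_MACHINE, total)
        print(f"salary share of ${total}/year ownership cost: {share:.1%}")
```

Under these numbers, salary is a minority of the per-machine cost of ownership, which is consistent with the slide's emphasis on lost productivity rather than headcount alone.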

Page 4: Quals

Goals of Dissertation Research

Describe field of System Administration
Monitoring, Diagnosing, and Repairing:
– Approach: Synthesize solutions from other fields of research
1) Detect previously ignored problems
2) Automatic repair of some problems
3) Reduce number of administrators needed
4) Support users’ understanding of system
Apply here & distribute software
Thesis: Through our approach, we can achieve goals 1-4.

Page 5: Quals

Goals of System Administration

Goal: Support cost-effective use of the computer environment
More specifically (some non-technical):
– Environment: uniform, customizable, high performance, and available
– Faults & errors: recovery from benign errors, protection from malicious attacks
– Users: training, accounting & planning, legal

Page 6: Quals

Monitoring, Diagnosing, and Repairing (MDR)

Introductory examples
Fundamental requirements
Environmental constraints
Previous work
Six key innovations
Architecture
Details on innovations
Evaluation methodology

Page 7: Quals

MDR: Examples - Intro

Four examples
1) Broken component
2) Resource overload - transient
3) Resource contention - user program
4) Resource exhaustion - long term
Previous Solutions
– Pay someone to watch
– Ignore or wait for someone to complain
– Specialized scripts (not general → vast repeated work)

Page 8: Quals

MDR: Example 1

Web server has crashed/hung
Gather information: process existence, service uptime, restart times
Analyze data: process not responding, and hasn’t been recently restarted.
Automatic repair: restart daemon.
Notify administrator: had to restart daemon.
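The gather, analyze, repair, notify steps of Example 1 can be sketched as a small decision function. This is an illustration only: the field names, the 300-second gap, and the "escalate" branch (what to do when the daemon was restarted recently and is still broken) are assumptions not stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class DaemonStatus:
    process_exists: bool          # gathered: is the process in the process table?
    responding: bool              # gathered: did a service probe succeed?
    seconds_since_restart: float  # gathered: time since the last restart


def decide_action(status: DaemonStatus, min_restart_gap: float = 300.0) -> str:
    """Return 'ok', 'restart', or 'escalate' for one daemon.

    min_restart_gap is a hypothetical tuning constant.
    """
    if status.process_exists and status.responding:
        return "ok"
    if status.seconds_since_restart >= min_restart_gap:
        # Not responding and not recently restarted: automatic repair.
        # The caller would also notify the administrator of the restart.
        return "restart"
    # Assumed policy: a recent restart did not help, so hand off to a human.
    return "escalate"


if __name__ == "__main__":
    hung = DaemonStatus(process_exists=True, responding=False,
                        seconds_since_restart=3600.0)
    print(decide_action(hung))
```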

Page 9: Quals

MDR: Example 2

The NOW is “slow.”
Gather data: load, process info, CPU info
Analyze data: bounds on expected values
Notify administrator: fileserver overloaded.
Visualize data: nfsd’s are overloaded.
Repair: admin moves data, adds disks, or starts more nfsd’s
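"Bounds on expected values" in Example 2 could be checked as follows. A minimal sketch, assuming bounds are set at the historical mean plus or minus k standard deviations; the slide does not say how the bounds are chosen, and the statistic name `nfsd_load` and the value k = 3 are hypothetical.

```python
from statistics import mean, stdev


def learn_bounds(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Assumed rule: expected range is mean +/- k sample standard deviations."""
    m = mean(history)
    s = stdev(history)
    return (m - k * s, m + k * s)


def out_of_bounds(sample: dict[str, float],
                  bounds: dict[str, tuple[float, float]]) -> list[str]:
    """Names of gathered statistics whose current value is outside its bounds."""
    return sorted(
        name
        for name, value in sample.items()
        if name in bounds and not (bounds[name][0] <= value <= bounds[name][1])
    )


if __name__ == "__main__":
    bounds = {"nfsd_load": learn_bounds([1.0, 1.2, 0.8, 1.1, 0.9])}
    print(out_of_bounds({"nfsd_load": 9.0}, bounds))
```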

Page 10: Quals

MDR: Example 3

User running program
Gather: user statistics, CPU, disk
Visualize: spending too much time waiting on remote accesses
(User fixes program; gathering and visualization repeated)
Analyze: some nodes have less throughput
Visualize: those have other jobs running on them
Repair: user is benchmarking, so kills all extraneous processes

Page 11: Quals

MDR: Example 4

Web server load increasing beyond capacity
Gather: CPU, request rate, reply latency
Analyze: Burst lengths getting longer, latency increasing
Visualize: Graph of burst lengths & CPU usage over time
Repair: Order more machines, install load balancer
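The "burst lengths getting longer" analysis in Example 4 might look like the sketch below. It assumes a burst is a run of consecutive samples above a threshold, and that "getting longer" means the later half of the bursts is longer on average than the earlier half; both definitions are assumptions, since the slide does not define them.

```python
def burst_lengths(samples, threshold):
    """Lengths of each run of consecutive samples strictly above threshold."""
    lengths = []
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:  # a burst still in progress at the end of the series
        lengths.append(run)
    return lengths


def bursts_are_growing(lengths):
    """Crude trend check: later bursts are longer on average than earlier ones."""
    if len(lengths) < 2:
        return False
    half = len(lengths) // 2
    early, late = lengths[:half], lengths[half:]
    return sum(late) / len(late) > sum(early) / len(early)


if __name__ == "__main__":
    request_rate = [1, 5, 6, 1, 7, 8, 9, 1]
    lengths = burst_lengths(request_rate, threshold=4)
    print(lengths, bursts_are_growing(lengths))
```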

Page 12: Quals

MDR: Fundamental Requirements

Gathering: Flexible data gathering, self-describing storage
Analyzing: Calculate statistical measures, identify relevant statistics
Notifying: Flexible, infrequent messages to administrators or users
Visualizing: Maximize information/pixel, support multiple interfaces
Repairing: Automate simple repairs, support group operations

Page 13: Quals

MDR: Environmental Constraints

Change is inherent
– Lack of Web/Mbone 5 years ago, now most/many have these.
Problems on many time-scales
– Second-minute transients vs. week-month capacity problems
Must operate under very adverse conditions
– Often used when system is broken
– Would like at least post-mortem analysis
Need to handle hundreds – thousands of nodes
– Scalability: All sites are getting larger, possibly wide area
– Our system has 200 (NOW) – 2000 (Soda) nodes

Page 14: Quals

MDR: Previous Systems

Many previous systems: I’ve looked at about 16.
Not comprehensive, not extensible.
Look at a few that did a nice job of a piece:
[Fink97] - Run test, notify display engine
+ Easy to add tests
+ Selectivity of notification good
– Tests are just programs (redo gathering)
– Central, non-fault tolerant solution
– Many hard coded constants

Page 15: Quals

MDR: Previous Systems, cont.

[Hard92] - buzzerd: Pager notification system
+ Flexible rules for notification
+ External interface for adding notify requests
– Simplistic gathering
– Poor fault tolerance
[Pier96] - Igor group fixes
+ Flexible operations
+ Nice reporting of success/failure
– Weak security, runs as root
– No delegation of responsibility

Page 16: Quals

MDR: Six Key Innovations (1-3)

Replicated, semi-hierarchical, data storage nodes
– Rendezvous point for programs
– Handles scaling and fault-tolerance
Self describing structures
– Functions (visualize, summarize) + data go in database (OO)
– DB has machine and human readable descriptions of data
End to end notification
– Detect problems in MDR system
– Guarantee important messages get to users

Page 17: Quals

MDR: Six Key Innovations (4-6)

Aggregation and High Resolution Color Displays
– Reduce information to manageable amounts
– Maximize information per unit area
Partially self-configuring
– Learn averages, deviations, burst sizes
– Learn which values are relevant to problems
Secure, user-specified group repairs
– Don’t enable malicious attacks
– Automate repairs of many machines

Page 18: Quals

MDR: Architecture

[Architecture diagram. Components: Gather Agent (vmstat thread, ping thread, tcpdump thread); SQL-based Data Repository; Diagnostic Console; E-mail or Phone Notifier; Long-term graphing; Tolerance, Relevance Learner; Daemon Restarter; Aggregation Engine]

Page 19: Quals

MDR-Arch: Derivations

[Architecture diagram. Components: SQL-based Data Repository; Diagnostic Console; E-mail or Phone Notifier; Tolerance, Relevance Learner; Daemon Restarter]

Page 20: Quals

Key: Semi-Hier. DBs.

Fault tolerance
Scalability:
– Caches don’t need to commit to disk; authoritative copy is elsewhere.
– Batching updates over wide area links.
[Diagram: two top level caches, above three mid level caches, above five per-node databases]
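The two scalability points (memory-only caches and batched updates) can be sketched with a toy cache level. This is an assumption-laden illustration: the class name, the count-based batching rule, and the method names are invented here and are not taken from the MDR prototype.

```python
class BatchingCache:
    """In-memory cache level that forwards updates to its parent in batches.

    Nothing is committed to disk; the authoritative copy is assumed to live
    in the per-node database below this level.
    """

    def __init__(self, parent=None, batch_size=3):
        self.parent = parent          # next level up, or None at the top
        self.batch_size = batch_size  # hypothetical: flush after this many updates
        self.latest = {}              # (node, stat) -> most recent value
        self.pending = []             # updates not yet forwarded upward
        self.batches_sent = 0

    def update(self, node, stat, value):
        self.latest[(node, stat)] = value
        self.pending.append((node, stat, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def receive_batch(self, batch):
        for node, stat, value in batch:
            self.update(node, stat, value)

    def flush(self):
        """Send one message upward instead of one per update."""
        if self.pending and self.parent is not None:
            self.parent.receive_batch(self.pending)
            self.batches_sent += 1
        self.pending = []


if __name__ == "__main__":
    top = BatchingCache()
    mid = BatchingCache(parent=top, batch_size=2)
    mid.update("node1", "load", 1.0)
    mid.update("node2", "load", 2.0)   # triggers a single batched send
    print(top.latest, mid.batches_sent)
```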

Page 21: Quals

Key: Self-Describing

De-couple data gathering, data storage, and data use
Self-Describing for Humans
– Descriptions of meanings of values stored with tables
– Description of methods of gathering stored with tables
– Column names help with self
Self-Describing for Computers
– Functions for visualizing or summarizing data
– Indication of resource selection from resource statistics
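One way to picture self-describing storage is a metadata table that sits next to the data table and names both the human-readable meaning and the summarizing function for each column. The sketch below uses Python's standard `sqlite3` module as a stand-in for the SQL-based repository; the schema, table names, and the `SUMMARIZERS` registry are all hypothetical.

```python
import sqlite3

# Hypothetical registry: summarizer names stored in the database map to code.
SUMMARIZERS = {"mean": lambda xs: sum(xs) / len(xs), "max": max}


def build_repository():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cpu_load (host TEXT, ts REAL, load1 REAL)")
    # Descriptions for humans and for programs are stored with the data.
    conn.execute(
        "CREATE TABLE column_info "
        "(tbl TEXT, col TEXT, meaning TEXT, gathered_by TEXT, summarizer TEXT)"
    )
    conn.execute(
        "INSERT INTO column_info VALUES (?, ?, ?, ?, ?)",
        ("cpu_load", "load1", "1-minute load average", "vmstat thread", "mean"),
    )
    return conn


def summarize(conn, tbl, col):
    """Summarize a column using only what the repository says about it."""
    row = conn.execute(
        "SELECT summarizer FROM column_info WHERE tbl = ? AND col = ?", (tbl, col)
    ).fetchone()
    if row is None:
        raise KeyError(f"no description stored for {tbl}.{col}")
    # tbl and col were just matched against column_info, so they are known names.
    values = [r[0] for r in conn.execute(f'SELECT "{col}" FROM "{tbl}"')]
    return SUMMARIZERS[row[0]](values)


if __name__ == "__main__":
    conn = build_repository()
    conn.executemany(
        "INSERT INTO cpu_load VALUES (?, ?, ?)",
        [("node1", 0.0, 1.0), ("node2", 0.0, 3.0)],
    )
    print(summarize(conn, "cpu_load", "load1"))
```

A consumer written this way needs no hard-coded knowledge of `cpu_load`, which is the decoupling of gathering, storage, and use that the slide asks for.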

Page 22: Quals

Key: End-to-End Notification

Recall: System must operate under extreme conditions
Humans must validate that system is still working
– Standalone display can indicate timestamps, mark out of date data
– Wireless machine could intermittently contact notification system
– Pager could be automatically paged every so often
Problems should be propagated to end users.
– Flexible notification: connected systems, e-mail, pager.
– Limit over-notification
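Two of the points above, marking out-of-date data and limiting over-notification, reduce to simple timestamp checks. A minimal sketch, assuming a fixed maximum age for displayed data and a fixed minimum gap between repeated messages about the same problem; the class, the parameter names, and the default values are illustrative.

```python
class RateLimitedNotifier:
    """Suppress repeats of the same problem within min_gap_seconds."""

    def __init__(self, min_gap_seconds=600.0):
        self.min_gap = min_gap_seconds
        self.last_sent = {}  # problem key -> time of the last delivered message
        self.sent = []       # (time, key, text) for messages actually delivered

    def notify(self, key, text, now):
        """Return True if the message was delivered, False if suppressed."""
        last = self.last_sent.get(key)
        if last is not None and now - last < self.min_gap:
            return False
        self.last_sent[key] = now
        self.sent.append((now, key, text))
        return True


def is_stale(last_update, now, max_age_seconds=120.0):
    """True if a displayed value should be marked out of date."""
    return now - last_update > max_age_seconds


if __name__ == "__main__":
    n = RateLimitedNotifier(min_gap_seconds=600.0)
    print(n.notify("fileserver-overload", "nfsd overloaded", now=0.0))
    print(n.notify("fileserver-overload", "nfsd overloaded", now=30.0))
    print(is_stale(last_update=0.0, now=500.0))
```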

Page 23: Quals

Key: Aggregation & HiRes

System target has hundreds – thousands of nodes
Aggregate by showing out of bounds, relevant values (via automatic tuning)
Also want overview of system
– Aggregate across similar statistics; show value (fill) & dispersion (shade)
– Use color to highlight important values.
– Aggregate across values (machine utilization = CPU + disk + memory)
– Maximize data/pixel [Tufte]
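The two kinds of aggregation on this slide can be written down directly. A sketch under assumptions: "value" is taken to be the mean and "dispersion" the population standard deviation across nodes, which the slide does not specify; the utilization formula is the one given on the slide.

```python
from statistics import mean, pstdev


def aggregate(values):
    """Collapse one statistic across many nodes into two display quantities.

    'fill' drives how full the glyph is, 'shade' how dark it is (assumed mapping).
    """
    return {"fill": mean(values), "shade": pstdev(values)}


def machine_utilization(cpu, disk, memory):
    """Aggregate across values for one machine, as written on the slide."""
    return cpu + disk + memory


if __name__ == "__main__":
    cpu_by_node = [0.2, 0.4, 0.6]
    print(aggregate(cpu_by_node))
    print(machine_utilization(cpu=0.5, disk=0.2, memory=0.1))
```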

Page 24: Quals

Key: Agg & HiRes: Snapshot

Page 25: Quals

Key: Self-Configuring

Single statistics
– Phase 1: Calculate averages, standard deviations, burst sizes
– Worked in other systems [Jaco88, Karn91]
Identify relevant statistics
– Give system Boolean examples (variables out of bounds, and system working/not working) → get function.
– Works for Boolean disjunctions in some cases:
• With lots of irrelevant variables [Litt89]
• With random bad examples [Sloa89]
• In some cases, with malicious bad examples [Ande94]
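Learning which statistics are relevant can be illustrated with Littlestone's Winnow algorithm, a well-known learner for Boolean disjunctions when most attributes are irrelevant. Whether this is the exact method behind [Litt89] or the one planned for MDR is an assumption. In the sketch, each bit means "statistic i is out of bounds" and the label means "system not working"; the variant shown is Winnow1 (promote by doubling, demote by zeroing), and the parameters are conventional choices, not values from the slides.

```python
from itertools import product


def winnow_predict(weights, theta, bits):
    """Predict 'not working' if the active statistics' weights reach theta."""
    return sum(w for w, b in zip(weights, bits) if b) >= theta


def winnow_train(examples, n, alpha=2.0, max_passes=100):
    """Winnow1 for monotone disjunctions over n Boolean attributes.

    examples: iterable of (bits, label) with bits a 0/1 tuple and label a bool.
    """
    weights = [1.0] * n
    theta = n / 2
    for _ in range(max_passes):
        mistakes = 0
        for bits, label in examples:
            if winnow_predict(weights, theta, bits) == label:
                continue
            mistakes += 1
            for i, b in enumerate(bits):
                if b:
                    # Missed failure: promote active weights.
                    # False alarm: eliminate active weights.
                    weights[i] = weights[i] * alpha if label else 0.0
        if mistakes == 0:  # a full pass with no errors: consistent with the data
            break
    return weights, theta


if __name__ == "__main__":
    n = 8
    # Hypothetical ground truth: the system fails when statistic 0 or 2 is out of bounds.
    examples = [(bits, bool(bits[0] or bits[2])) for bits in product((0, 1), repeat=n)]
    weights, theta = winnow_train(examples, n)
    relevant = [i for i, w in enumerate(weights) if w >= theta]
    print("statistics judged relevant:", relevant)
```

The number of mistakes Winnow makes grows only logarithmically with the number of attributes, which is why it suits a setting with many gathered statistics and few relevant ones.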

Page 26: Quals

Key: Secure Remote Actions

Security because of malicious attacks, benign errors
Delegation to remove SA from the loop
Independence from particular algorithms
– Building a library
– Program with principals (hosts, users), and properties (signed, sealed, verifiable)
Use secure, run-time extensible languages
Actions report through gathering system
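Programming with principals and properties, rather than with keys and algorithms, suggests an interface like the one below. This is only a sketch: it uses a shared-key HMAC from Python's standard library as one possible way to provide the "signed" and "verifiable" properties, and the function names and message layout are invented for illustration. The slides do not say which cryptographic mechanism the library would use.

```python
import hashlib
import hmac
import json


def sign_action(principal: str, action: str, key: bytes) -> dict:
    """Bind an action request to a principal (property: signed)."""
    payload = json.dumps({"principal": principal, "action": action}, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_action(message: dict, key: bytes) -> bool:
    """Check the request before acting on it (property: verifiable)."""
    expected = hmac.new(key, message["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["tag"])


if __name__ == "__main__":
    key = b"demo-key-for-illustration-only"
    msg = sign_action("admin@node1", "restart httpd", key)
    print(verify_action(msg, key))
    tampered = dict(msg, payload=msg["payload"].replace("restart", "kill"))
    print(verify_action(tampered, key))
```

Callers see only principals and properties; swapping the HMAC for public-key signatures or Kerberos would change the two function bodies but not their callers, which is the independence from particular algorithms the slide describes.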

Page 27: Quals

MDR: Testing Methodology

Fault injection
– Deliberately make the system slow
– Break hardware/software components
Feature comparison
– Paper comparison with other systems
Usage in practice
– Experience important to show system works
– We have need of administrative tools
Testimonials
– Experience at other sites lends credibility

Page 28: Quals

MDR: Demo

Hierarchical structure working (1 level right now)
Alternative Interface
Fault Injection
Need for Aggregation
Crufty right now
Demo

Page 29: Quals

Timeline: Key Pieces

1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End to end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs

Page 30: Quals

Timeline

[Timeline chart. Time axis: June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999]
Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, LISA 6/98, Graduation 12/98, SOSP 3/99
Work items: Prototype 1,2,3 (DBs, SelfD, Vis); Prototype 4,5 (Notify, Restart); Prototype 6,7 (AConfig, Repair); Experience with 1-7; Architecture of Complete System; Writing

Page 31: Quals

Conclusion

Description of field shows breadth
Monitoring, diagnosing, and repairing shows depth
– Examples show importance of problem
– Fundamental goals & environmental constraints show understanding of problem
– Key innovations show differences from previous systems.
– Architecture and initial prototype show approach to problem
– Testing methods show ways to validate solution.
Timeline shows plan & milestones to graduation

Page 32: Quals

Old Slides

Page 33: Quals

Solutions

Managing stable storage
Supporting users
Simplifying security
Monitoring, diagnosing, and repairing

Page 34: Quals

Managing Stable Storage

Consistency vs. availability
Fault tolerance
Scalability
Recoverability
Customization

Page 35: Quals

Supporting Users

Automated help desk
– Searchable collection of questions
– Easy method for addition
Remote device access
Site-wide training

Page 36: Quals

Goals: Environment

Uniform
– Supports user mobility by eliminating arbitrary changes
– Increases effectiveness by avoiding need for users to learn multiple interfaces
Customizable
– Handles special systems and special needs [firewalls, servers]
– Obviously reduces uniformity

Page 37: Quals

Goals: Environment, cont.

High Performance
– Increases effectiveness of users [HCI/psych]
– Limited by cost-effectiveness
Available
– Effectiveness is 0 if system isn’t working
– Balanced against expense

Page 38: Quals

Goals: Faults & Errors

Benign errors:
– Accidentally deleted files
– Unnoticed runaway processes
Malicious attacks:
– TCP SYN attack
– Sendmail bugs
– Data stealing
– False data injection

Page 39: Quals

Goals: Users

Training
– Troubleshooting = one-on-one training
– Larger sessions = classes
Accounting
– Supports management, helps billing
Capacity Planning
– Expanding systems takes time
Legal
– Sensitive information needs protection

Page 40: Quals

Simplifying Security

USENIX talk says “If cryptography is so great, why isn’t it used more?”
SAs worry about security to protect data.
Goal: Ease development of secure applications
Write programs using principals & properties rather than keys and algorithms
Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
My use: protected, transferable rights to allow various actions
– Modify system configurations (add filesystems, printers)
– Kill/restart processes (runaway, after configurations modified)
– Access data (private logs, for backups, etc.)

Page 41: Quals

Conclusion

System administration as area of research
– Description of field
– Areas for future research
• Managing stable storage
• Supporting users
Initial investigation of research area
– Monitoring, diagnosing, and repairing
• Broad, draws from many fields