CS514: Intermediate Course in Operating Systems

CS514: Intermediate Course in Operating

SystemsProfessor Ken Birman

Ben Atkin: TALecture 2: August 29

Overview of Lecture

• Fundamentals: terminology and components of a reliable distributed computing system

• Communication technologies and their properties

• Basic communication services• Internet protocols• End-to-end argument

Some terminology

• A program is the code you type in• A process is what you get when you run it• A message is used to communicate between

processes. Arbitrary size.• A packet is a fragment of a message that

might travel on the wire. Variable size but limited, usually to 1400 bytes or less.

• A protocol is an algorithm by which processes cooperate to do something using message exchanges.

More terminology

• A network is the infrastructure that links the computers, workstations, terminals, servers, etc.– It consists of routers – They are connected by communication links

• A network application is one that fetches needed data from servers over the network

• A distributed system is a more complex application designed to run on a network. Such a system has multiple processes that cooperate to do something.

A network is like a “mostly reliable” post office

Why isn’t it totally reliable?

• Links can corrupt messages– Rare in the high quality ones on the Internet

“backbone”– More common with wireless connections,

cable modems, ADSL

• Routers can get overloaded– When this happens they drop messages– As we’ll see, this is very common

• But protocols that retransmit lost packets can increase reliability

How do distributed systems differ from network applications?

• Distributed systems may have many components but are often designed to mimic a single, non-distributed process running at a single place.

• “State” is spread around in a distributed system

• Networked application is free-standing and centered around the user or computer where it runs. (E.g. “web browser.) Distributed system is spread out, decen-tralized. (E.g. “air traffic control system”)

What about the Web?

• Browser is independent: fetches data you request when you ask for it.

• Web servers don’t keep track of who is using them. Each request is self-contained and treated independently of all others.– Cookies don’t count: they sit on your machine– And the database of account info doesn’t count either…

this is “ancient” history, nothing recent

• ... So the web has two network applications that talk to each other– The browser on your machine– The web server it happens to connect with… which has

a database “behind” it

You and the Web

DatabaseWeb browser with

stashed cookies

Cookie identifies this user, encodes past

preferences

HTTP request

Web servers are kept current by the database but usually don’t talk to it

when your request comes in

You and the Web

Reply updates cookie

Web servers immediately forget the interaction

You and the Web

Purchase is a “transaction” on the database

Web servers have no memory of the interaction

Examples of Distributed Systems

• Air traffic control system with workstations for the controllers

• Banking/brokerage trading system that coord-inates trading (risk management) at multiple locations

• Factory floor control system that monitors devices and replans work as they go on/offline

This course is about reliability

• We want to build distributed systems that can be relied upon to do the correct thing and to provide services according to the user’s expectations

• Not all systems need reliability– If a web site doesn’t respond, you just try again later– If you end up with two wheels of brie, well, throw a party!

• Reliability is a growing requirement in “critical” settings but these remain a small percentage of the overall market for networked computers

Reliability is a broad term

• Fault-Tolerance: remains correct despite failures• High or continuous availability: resumes service after

failures, doesn’t wait for repairs• Performance: provides desired responsiveness• Recoverability: can restart failed components• Consistency: coordinates actions by multiple

components, so they mimic a single one• Security: authenticates access to data, services• Privacy: protects identity, locations of users

“Failure” also has many meanings

• Halting failures: component simply stops• Fail-stop: halting failures with notifications• Omission failures: failure to send/recv.

message• Network failures: network link breaks• Network partition: network fragments into two

or more disjoint subnetworks• Timing failures: action early/late; clock fails,

etc.• Byzantine failures: arbitrary malicious behavior

Examples of failures

• My PC suddenly freezes up while running a text processing program. No damage is done. This is a halting failure

• A network file server tells its clients that it is about to shut down, then goes offline. This is a failstop failure. (The notification can be trusted)

• An intruder hacks the network and replaces some parts with fakes. This is a Byzantine failure.

More terminology

• A real-world network is what we work on. It has computers, links that can fail, and some problems synchronizing time. But this is hard to model in a formal way.

• An asynchronous distributed system is a theoretical model of a network with no notion of time

• A synchronous distributed system, in contrast, has perfect clocks and bounds all all events, like message passing.

Model we’ll use?

• Our focus is on real-world networks, halting failures, and extremely practical techniques

• The closest model is the asynchronous one; we use it to reason about protocols– Most often, employ asynchronous model to illustrate

techniques we can actually implement in real-world settings

– And usually employ the synchronous model to obtain impossibility results

– Question: why not prove impossibility results in an asynchronous model, or use the synchronous one to illustrate techniques that we might really use?

ISO protocol layers: Oft-cited Standard

Application The program using a communication connection

Presentation Software to encode data into messages, and decode on reception

Session Logic associated with guaranteeing end-to-end reliability andflow control, if desired

Transport Software for fragmenting big messages into small packets

Network Routing functionality, limited to small packets

Data-link The protocol that represents packets on the wire

• ISO is tied to a TCP-style of connection• Match with modern protocols is poor•We are mostly at “layer 4” – session

Internet protocol suite

• Can be understood in terms of ISO• Defines “addressing” standard, basic

network layer (IP packets, limited to 1400 bytes), and session protocols (TCP, UDP, UDP-multicast)

• Includes standard “domain name service” that maps host names to IP addresses

• DNS itself is tree-structured and caches data

Major internet protocols

• TCP, UDP, FTP, Telnet• Email: Simple Mail Transfer Protocol (SMTP)• News: Network News Transfer Protocol (NNTP)• DNS: Domain name service protocol• NIS: Network information service (a.k.a. “YP”)• LDAP: Protocol for talking to the management

information database (MIB) on a computer• NFS: Network file system protocol for UNIX• X11: X-server display protocol• Web: HyperText Transfer Protocol (HTTP), and SSL (one

of the widely used security protocols)

Typical hardware options

• Ethernet: 10Mbit CSMA technology, limited to 1400 byte packets. Uses single coax cable.

• FDDI: twisted pair, self-repairing if cable breaks

• Bridged Ethernet: common in big LAN’s, ring with multiple ethernet segments

• Fast Ethernet: 100Mbit version of ethernet • ATM: switching technology for fiber optic

paths. Can run at 155Mbits/second or more. Very reliable, but mostly used in telephone systems.

Implications for reliability?

• Protocol designers have problems predicting the properties of local-area networks

• Latencies and throughput may vary widely even in a single installation

• Hardware properties differ widely; often, must assume the least-common-denominator

• Packet loss a minor problem in hardware itself

Technology trends

Source: Scientific American, Sept. 1995

0

100

200

300

400

500

600

700

1985-1990

1990-1995

1995-2000

2000-2005

CPU MIPS

Memory MB

LAN Mbits

WAN Mbits

O/S overhead

Note tremendous growth in WAN speeds

Typical latencies (milliseconds)

0.01

0.1

1

10

100

1000

1985

-199

0

1990

-199

5

1995

-200

0

2000

-200

5

Disk I/O

EthernetRPC

ATMroundtrip

WANroundtrip

Note dramatic drop inLAN latencies over ATM

WAN, disk latencies arefairly constant due to physical limitations

O/S latency: the most expensive overhead on LAN communication!

05

10

1520253035

40

1985-

1990

1995-

2000

O/Soverhead aspercentage

Broad observations

• A discontinuity is currently occuring in WAN communication speeds!

• Other performance curves are all similar• Disks have “maxed out” and hence are looking

slower and slower• Memory of remote computers looks “closer

and closer”• O/S imposed communication latencies has

risen in relative terms over past decade!

Implications?

• The revolution in WAN communication we are now seeing is not surprising and will continue

• Look for a shift from disk storage towards more use of access to remote objects “over the network”

• O/S overhead is already by far the main obstacle to low latency and this problem will seem worse and worse unless O/S communication architectures evolve in major ways.

More Implications

• Look for full motion video to the workstation by around 2005 or 2010

• Low LAN latencies: an unexploited “niche”• One puzzle: what to do with extremely high

data throughput but relatively high WAN latencies

• O/S architecture and whole concept of O/S must change to better exploit the “pool of memory” of a cluster of machines; otherwise, disk latencies will loom higher and higher

Reliability and performance

• Some think that more reliable means “slower”– Indeed, it usually costs time to overcome failure– For example, if a packet is lost probably need to resend it,

and may need to solicit the retransmission

• But for many applications, performance is a big part of the application itself: too slow means “not reliable” for these!

• Reliable systems thus must look for highest possible performance

• ... but unlike unreliable systems, they can’t cut corners in ways that make them flakey but faster

Back to the internet: IP layer

• Addresses have a machine address and a “port” number. The port selects the application when a packet arrives. (If none is bound to that port, packet is dropped)

• Fixed maximum size of 1400 bytes• Each machine can have multiple addresses• Special “broadcast” address delivered to all• “Class D” addresses used for multicast groups• Running out of addresses, so tricks with

addresses are increasingly common

IP gotcha’s

• IP messages are not authenticated in any way. The sender can lie about who it is, and can send to any host or port on the network

• A system can lie about its own machine address

• IP messages are not reliable. They can be (and are) dropped at many stages

• IP routing: basically static these days

IP multicast

sends to: “123.45.87.51”

both blue machines acceptIP address “123.45.87.51”

• Key insight: IP address is just an abstraction.• Any machine can potentially “spoof” any other!

UDP protocol

• Lives above IP and looks much like IP• Permits larger packets (fragments them

into a burst of IP packets; if any is lost UDP packet will be dropped on receive side)

• Also can run in a multicast mode• Most applications use UDP; IP layer is

typically “reserved” for kernel-level applications

UDP loss rates can be very high!

• Hunt experimented with this (book reproduces some of his data)

• UDP is normally very reliable• If sender overruns receiver, even

briefly, 100% of data may be lost!• Easy to provoke this problem even with

source, dest on same machine!!!• O/S makes no effort to detect or avoid

loss!

TCP protocol

• Implemented over IP, considered “reliable”• Supports a byte-stream model, like a pipe.• Implemented using sliding-window protocol• Many variations on protocol to optimize

performance, window size, reduce header size, etc. We’ll focus on the “most basic” TCP protocol here and won’t look at the optimizations

TCP sliding window

mi+k mi+k-1 .... mi

- - mi+k-2 - mi+k-3 ... mi

sender provides data

receiver consumes data

IP packets carry segments

window has k “segments”

receiver replies with acks and nacks. sender resends missing data

Observations

• Trick is to have a big enough window so that data flow is continuous. This permits sender, receiver to “match” their data rates

• Must retransmit as soon as possible but not so soon that duplicates get through (undesirable overhead)

• Channel “breaks” after many retries, i.e. after a crash or if the network gets “lossy” for a while

TCP failures: not “failstop”

• Many applications treat a broken TCP channel as a sign of remote crash

• But in fact, a TCP channel can easily break under network overload or due to other transient conditions!

• Applications would then believe destination to have failed when it is actually still running.

A basic insight

• In fact, there is no way to know if a message has reached its destination unless it is explicitly acknowledged

• And if it is, the sender of the ack. has no way to know if it was received!

• Distributed systems are always slightly out of sync! This will be a very big issue for us later

Most distributed systems use UDP or TCP

• Applications typically lack permission to use IP (otherwise could “break” TCP protocol)

• UDP multicast is hard to use because it isn’t always available and information on hardware layout of a LAN is not available in any standard format or from any standard service

• Under heavy load, both UDP and TCP begin to misbehave (high loss rates, broken channels, etc)

End-to-End argument

• Suppose an IP packet will take n hops to its destination, and can be lost with probability p on each hop

• Now, say that we want to transfer a file of k records that each fit in one IP (or UDP) packet

• Should we use a retransmission protocol running “end-to-end” or n TCP protocols in a chain?

End-to-End argument

source

dest

Probability of successful transit: (1-p)n,Expected packets lost: k-k*(1-p)n

Saltzer et. al. analysis

• If p is very small, the overhead of the n TCP protocols (excess bytes sent, lost CPU time, etc) will be higher than if we just send the whole file, then retransmit the missing records

• Generalization: low-level transport systems should focus on speed, not reliability; the application layer should worry about “properties” needed by the application

Example

• Suppose that 2% of file will be resent. Wouldn’t want to impose a 10% overhead on the net and slow the transfer down by 7% for this purpose!

• But the end-to-end argument would not apply if:– p or n is large, hence (1-p)n approaches 0– cost of recovery when a problem occurs is very high– reliability property is hard for users to implement

• Justify complex mechanisms against these points.

For next time

• Read chapters 1-3• If you were in charge of extending the

communication layer, what would you change or add?

• If computers can set their own addresses and fill out packets in any way they like, how can a system like UNIX possibly support user-id’s over a network? Or are user-id’s a big fake?

CS514: Intermediate Course in Operating Systems

Documents

Transcript of CS514: Intermediate Course in Operating Systems