Sunfire Design and Configuation
8/3/2019 Sunfire Design and Configuation
http://slidepdf.com/reader/full/sunfire-design-and-configuation 1/36
Send comments about this document to: [email protected]
Sun Fire™ Systems Design and Configuration Guide
Nathan Wiger
Roger Blythe
Part No. 816-7882-10
September 2002, Revision 04
Sun Microsystems, Inc.
4150 Network Circle
Santa Clara, CA 95054 U.S.A.
650-960-1300
Please Recycle
Copyright 2002 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303-4900 U.S.A. All rights reserved.
This product or document is distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.
Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd.
Sun, Sun Microsystems, the Sun logo, AnswerBook, AnswerBook2, docs.sun.com, Solaris, Sun Management Center, Sun BluePrints, Sun Quad FastEthernet, Sun StorEdge, OpenBoot, Sun Enterprise, Sun Fireplane, and Sun Fire are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
ORACLE is a registered trademark of Oracle Corporation. Netscape is a trademark or registered trademark of Netscape Communications Corporation in the United States and other countries. Legato NetWorker is a registered trademark of Legato Systems, Inc. Adobe is a registered trademark of Adobe Systems, Incorporated.
The OPEN LOOK and Sun Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.
RESTRICTED RIGHTS: Use, duplication, or disclosure by the U.S. Government is subject to restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).
DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
CHAPTER 1
Designing Your System
Now that you have completed your statement of requirements, you can work on the first half of designing a Sun Fire system: designing the logical server. By the end of this chapter, you will have completed a logical design containing a list of how many of each component you need and a listing of your RAS requirements. Then, you can apply this configuration in Chapter 4 when you choose the physical system in which to place your design.
Systems design is done in this somewhat “backwards” manner for two important reasons:
• To make sure your requirements are clearly stated and met.
• To account for domains: because Sun Fire systems support domains, multiple logical servers can be located inside one physical chassis.
Following this process will also help ensure that you purchase a system with enough room for future expansion.
This chapter covers the following topics, which describe the logical design process:
• Understanding a Running System
• Design Rules of Thumb
• Analyzing an Existing System
• Designing for RAS
• A Logical Design Specification
Understanding a Running System
This section reviews the basics of a computer system. Although this is likely all "refresher" material, many people somewhat misunderstand the real roles of computer components. For this reason, an analogy to a reception desk is used to help
better illustrate the different role each major component plays. In this analogy, we follow a receptionist answering various types of incoming calls to show how a computer manages the requests it receives.
Every computer system has three main components that can be configured:
1. I/O devices
2. CPUs
3. Memory
Of course, a Sun Fire system has many other components too, including repeater boards, the Fireplane, and so on. However, in the Sun Fire system (as with most computer systems), these are part of the fundamental architecture of the machine and cannot be configured by the customer. This means that to design your system, you should pay close attention to the decisions you make regarding CPUs, memory, and I/O, because these decisions will directly affect the effectiveness of your design.
Notice the use of the term CPUs. Because the Sun Fire system board is sold with a minimum of two processors, it is not possible to buy a single-CPU Sun Fire system. All Sun Fire systems are multiprocessor systems.
I/O Devices
The Sun Fire system uses the PCI bus for all I/O. I/O is what allows you to do anything productive with the system. Without I/O, you would have no keyboard, no network connection, no disks, and so forth.
Understanding the impact I/O has on the system is important. Whenever something has to be done with I/O, an interrupt is generated, and the CPUs must handle this interrupt. Frequently, I/O is the single biggest resource sink on a system. This is especially true when you have multiple types of I/O running heavy loads concurrently, which generates a large number of interrupt requests.
For example, consider a back-end database server that is front-ended by a dozen or more concurrent web servers. When a web server needs some dynamic data, it has to make a request via the network to the database server, which then must do the appropriate database selects and retrieve the data from its local disk, finally shuffling the reply back across the network to the web server that requested it. This can result in a number of I/O interrupts, as the system must handle all of the network packets as well as all of the disk seeks needed to get the database information off disk.
When you multiply one request by a dozen or more web servers, each serving a dozen or more clients, you can see that the database server could easily become swamped with I/O interrupts, which consume the computing power needed to run the operating system, manage memory, and run the database itself.
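To make the multiplication concrete, the following POSIX sh sketch counts the interrupts one round of client activity could generate on the database server. The function name interrupt_load and the per-request interrupt figure are hypothetical illustration values, not measurements of a Sun Fire system.

```sh
# interrupt_load WEB_SERVERS CLIENTS_PER_SERVER INTERRUPTS_PER_REQUEST
# Hypothetical helper: if every client of every web server triggers one
# database request, and each request costs the database server a fixed
# number of interrupts (network packets plus disk seeks), the total
# interrupt count is the product of the three terms.
interrupt_load() {
    if [ $# -ne 3 ]; then
        echo "usage: interrupt_load web_servers clients interrupts_per_request" >&2
        return 1
    fi
    echo $(( $1 * $2 * $3 ))
}
```

For example, interrupt_load 12 12 20 prints 2880: a dozen web servers, each with a dozen clients, at an assumed 20 interrupts per request. Because the terms multiply, doubling any one of them doubles the interrupt load.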
To tie everything together, think of I/O as each individual phone call received by a receptionist. Each phone call generates an interrupt that the receptionist must handle. Depending on the request, it may result in a lot of data transfer (talking) back to the caller. More calls generate more interrupts. Eventually, the phone system (server) hits a limit in the number of concurrent requests that it can handle (memory), the speed with which the requests can be fulfilled (by the CPUs), or how fast the caller and the CPUs can communicate (I/O speed).
CPUs
The CPU is actually responsible for much more than computation. Anything that puts a load on the system, including databases, web servers, email, NFS, NTP, and general network and user traffic, requires a lot of CPU power. The CPU does not do as much thinking as it does handling. Any time the system must do anything, it must ask the CPU, which has to prioritize the task, schedule it, and allocate resources for it, and do so in a way that allows the multitude of other things going on to continue running too.
In this way, the CPU can be thought of as a busy receptionist. The receptionist has a number of standard routines. These may include forwarding calls to employees, taking messages, setting up appointments, and even providing direct responses to simple requests such as “What is your address?” When an incoming phone call is received, the receptionist executes the proper routine and completes the request if possible. If the request cannot be fulfilled in a reasonable amount of time, the receptionist may have to place the caller on hold temporarily to handle some other tasks and free up some time.
In some cases, the receptionist may receive a request that is too complex to be handled by standard routines. For example, the receptionist may receive a call that the boss is running late and that several meetings need to be rescheduled. Here, the receptionist must do some thinking to determine which meetings can be moved to when. At the same time, the receptionist must still pay attention to other incoming calls to ensure an important request is not missed.
If things get too busy for one receptionist to handle, you may need two or more receptionists. Otherwise, some callers may get frustrated and hang up, and even for those that do get through, there will likely not be enough time to properly answer their queries.
So, it is important to consider not only the difficulty of each request, but the volume too. In our analogy, each incoming request requires a certain baseline of time to handle properly. Typically, the receptionist will have to press a button to pick up the appropriate line, answer the call with a greeting, listen to and analyze the request, then prioritize it and complete it appropriately. Even if a request consists of nothing more than “Is Mr. Johnson in?”, it still takes a certain amount of time to fulfill the request.
Memory
In the Sun Fire system, the system memory is dynamic random access memory (DRAM). The system uses memory to store the things it is actively using, such as the operating system, programs, and their data.
When asked to execute a program, the system must allocate space in memory to hold an image of the program and its associated data. This space can grow or shrink as the program runs, since its resource requirements may change. In reality, most applications grow over time because they do a poor job of cleaning up after themselves.
When a system is under a very heavy load, it may run out of room in memory to hold all the information it needs. In this case, it uses predetermined disk space, known as swap space, to temporarily store lesser-used things from memory to make room for other things. This is known as paging, since it involves selectively moving specific data out of memory in sections known as pages. When those pages are needed again, the system incurs a page fault, and the data is moved from disk back into memory.
In extreme situations, the system may undergo swapping. In this case, the memory images of entire programs are moved from memory out to disk. This is a significant performance hit, and if the system starts swapping, serious problems may occur. Unfortunately, the terms paging and swapping are often used interchangeably, perhaps because the disk storage is called “swap space,” but they are really very different.
Do not undervalue how important memory is to a running system. Not having enough memory is perhaps the single greatest cause of performance problems.
In the receptionist analogy, you can think of memory as the number of incoming phone lines available. Even if you have five receptionists (CPUs), it will not help the situation if you only have four phone lines (memory). The phone system will still be slow, since you have a bottleneck in the number of requests you can handle concurrently. To accept another call, the receptionist has to place the current caller on hold (page-out), and must later return to that caller (page-in).
If the load gets too heavy for the phone system, and no more lines can be put on hold, calls will have to be disconnected (swap-out) to make room for others. The receptionists will then have to call the person back (swap-in), a much more time-intensive process.
Design Rules of Thumb
You can use a number of rules of thumb to design a system. Properly using these rules requires a firm grasp of your needs, which means you should have completed a statement of your requirements using the information and tables in Chapter 1 and Chapter 2.
This section describes the following design rules of thumb:
• Spread your I/O devices across as many PCI buses as possible.
• Decide how many CPUs the system needs.
• Decide how much memory the system needs.
• A well-designed system should seldom page, and never swap.
• The system should always have some idle time.
• Whenever you add additional CPUs, you should also add memory.
I/O Devices
You should always determine your I/O design first, because it, along with your application needs, determines your computing requirements. To get the best performance and reliability from your Sun Fire server, you should lay out the I/O carefully. An easy rule of thumb is:
Spread your I/O devices across as many PCI buses as possible.
Doing so distributes your I/O load across as many different controllers as possible, thus improving performance. In addition, you reduce the number of single points of failure that could cause your data to go offline. Unfortunately, this rule of thumb has many caveats. Unlike CPUs and memory, the layout of your I/O intimately affects the reliability of your machine, and whether or not you can use features such as dynamic reconfiguration (DR). Chapter 4 discusses the issue of I/O design in detail, taking all of these factors into consideration.
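A minimal POSIX sh sketch of the basic rule follows. It assumes all buses are interchangeable and ignores the reliability and DR caveats that Chapter 4 covers; the function name spread_io and the pci0, pci1, ... labels are hypothetical placeholders rather than real Sun Fire device paths.

```sh
# spread_io BUS_COUNT DEVICE...
# Hypothetical helper: assign devices to PCI buses round-robin, so no bus
# receives a second device until every bus has one. Bus labels are
# placeholders only.
spread_io() {
    buses=$1
    shift
    if [ "$buses" -lt 1 ]; then
        echo "spread_io: need at least one bus" >&2
        return 1
    fi
    i=0
    for dev in "$@"; do
        # device number i goes to bus (i mod buses)
        echo "$dev pci$(( i % buses ))"
        i=$(( i + 1 ))
    done
}
```

For instance, spread_io 2 gige0 t3a t3b puts the network card and the first array on separate buses before any bus receives a second device.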
CPUs
Regardless of what your tasks are (NFS service, CAD simulations, or compiling software builds), handling each request requires a certain baseline of time, as the receptionist example shows. Not only is the type of request important, but the quantity of requests is important too. In fact, it is often harder for a system to handle 100 small requests than 10 large ones, due to the inherent overhead of handling each request.
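The effect of per-request overhead can be shown with simple arithmetic. In this POSIX sh sketch, request_cost is a hypothetical helper, and the 1 ms overhead and work sizes are made-up numbers chosen only to illustrate the shape of the effect.

```sh
# request_cost COUNT OVERHEAD_US WORK_US
# Hypothetical helper: total time in microseconds to serve COUNT requests
# when each one pays a fixed handling overhead plus its own work.
request_cost() {
    echo $(( $1 * ($2 + $3) ))
}
```

With a 1000-microsecond overhead, request_cost 100 1000 1000 gives 200000 (100 small requests of 1 ms work each), while request_cost 10 1000 10000 gives 110000 (10 large requests of 10 ms work each). Both workloads contain the same 100 ms of real work, but the small-request workload takes nearly twice as long because it pays the overhead ten times as often.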
How many CPUs are enough? The rules of thumb you can use to help you determine how many CPUs you need are:
• One-half CPU per network card
• One-eighth CPU per I/O device (disk or tape)
• Two CPUs per application for mostly I/O-based applications (NFS, web servers, and so forth)
• Four or more CPUs per application for mostly CPU-based applications (simulations, databases, and so forth)
These figures assume a moderate load on your system. If you are expecting a high load on certain aspects, you should double the corresponding numbers. For example, if a system is going to have a lot of network traffic, you should have one CPU for every network card to handle the interrupts. Conversely, if you are designing a system you expect to have a very light load, cut the numbers in half, or consider whether the tasks that system is going to be performing could be combined with another server to lower overhead.
To get an idea of how many CPUs you need, add up each of the criteria that affect you, then round up to the nearest multiple of two. We recommend that you buy only four-CPU boards for your Sun Fire system. Purchasing a two-CPU board limits your future expansion room, since it takes up the same amount of space as a four-CPU board. However, there are merits to the two-CPU board if you do not need expansion room, and the examples later in the book demonstrate a good use for it.
So, if you are designing an NFS server with a gigabit network card and six Sun StorEdge T3 arrays (counting each array as one I/O device), you have the CPU requirements shown in TABLE 1-1. Rounding up, you should buy a four-CPU board to run this system.
Note – These rules work well for average systems. However, for high-intensity applications such as online transaction processing (OLTP), data mining, and so forth, you should research your needs more carefully. For details, see “Analyzing an Existing System” on page 8.
TABLE 1-1 CPU Requirements
Description                  CPUs Required
Gigabit network card (1)     1/2
Disk arrays (6)              3/4
NFS server                   2
Total                        3 1/4
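The rules of thumb above reduce to a weighted sum followed by rounding. The POSIX sh sketch below is one way to mechanize them; the function name cpu_estimate is hypothetical, and its high and light options scale every term uniformly, which simplifies the text's advice to adjust only the aspects under heavy or light load.

```sh
# cpu_estimate NICS IO_DEVICES IO_APPS CPU_APPS [LOAD]
# Hypothetical helper applying the CPU rules of thumb. Weights are kept in
# sixteenths of a CPU so integer shell arithmetic stays exact:
#   network card            1/2 CPU  ->  8
#   disk or tape device     1/8 CPU  ->  2
#   mostly I/O-based app    2 CPUs   -> 32
#   mostly CPU-based app    4 CPUs   -> 64
# LOAD is moderate (default), high (double the total), or light (halve it).
cpu_estimate() {
    nics=$1 iodev=$2 ioapps=$3 cpuapps=$4 load=${5:-moderate}
    total=$(( nics * 8 + iodev * 2 + ioapps * 32 + cpuapps * 64 ))
    case $load in
        moderate) ;;
        high)  total=$(( total * 2 )) ;;
        light) total=$(( total / 2 )) ;;
        *) echo "cpu_estimate: unknown load '$load'" >&2; return 1 ;;
    esac
    # Round up to the next multiple of two CPUs (two CPUs = 32 sixteenths).
    echo $(( (total + 31) / 32 * 2 ))
}
```

For the NFS server in TABLE 1-1, cpu_estimate 1 6 1 0 prints 4, matching the four-CPU board recommended above. To double only one aspect, double that argument instead; for example, counting one busy network card as 2 follows the one-CPU-per-card advice for heavy network traffic.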
When buying CPUs, you should make sure you have enough memory to accommodate them, or else you run the risk of thrashing. Thrashing means that the system spends all its time moving things between memory and disk, and never does any real work. This is like the receptionist who spends time picking up phone lines and saying “Please hold,” without actually fulfilling any requests. The “Memory” section that follows discusses this in detail.
Finally, in terms of speed, getting the fastest processor you can buy is always an advantage. In addition to the speed of the processor, you should also consider the size of its cache. Generally this is decided for you based on the processor model, but you want to make sure to get as large a cache as possible. The cache determines how much data and how many instructions the processor can work with without having to make a trip back out to system memory. Processor cache is much faster than main memory, often by an order of magnitude or more, so a large cache is beneficial.
Memory
Memory is perhaps the single most important part of a computer system, and it has the most direct impact on performance. The more memory you have, the more things you can do, and the faster you can do them, since less disk access is needed. There is usually a greater correlation between perceived performance and memory than between perceived performance and processors. With relational databases, for instance, being able to fit as much of the database in memory as possible can yield a big improvement in performance.
If your system is running slowly, you should probably buy more memory, not more processors. It is more likely that your system is running out of memory, not processor cycles, and is having to use swap space to run your applications.
It is possible to waste money and overbuy memory as well, though, so here are some specific rules. On the Sun Fire system, memory is tied to a processor (see TABLE 1-5 in Chapter 1), so you cannot buy a board with just memory and no CPUs. This actually simplifies the design process considerably, because there are only two decisions to make:
• Whether to half-populate or fully populate each CPU board
• Whether to buy larger or smaller DIMMs
Fully populating a CPU board allows you to put more memory on it. In addition, you get better interleaving, which increases performance. Thus, the rules of thumb for memory are:
• For I/O-based applications, half-populate the CPU/Memory board.
• For CPU-based applications, fully populate the CPU/Memory board.
Then, choose the appropriate DIMM size to provide enough memory for your application. Following these rules will naturally lead to smaller memory sizes in NFS servers (where memory is used almost solely for the file buffer cache), and
larger, faster memory configurations in database and compute servers. Most systems tend to be in one category or the other, but if there is a mix, fully populate all boards.
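Because memory is tied to CPU boards, total capacity follows from three choices: how many boards, half or full population, and DIMM size. The POSIX sh sketch below shows that arithmetic. The function name board_memory_mb is hypothetical, and the DIMM slot count per board is left as a parameter because the real figure must come from the board specification (TABLE 1-5 in Chapter 1); the 32 slots used in the usage example are an assumption for illustration only.

```sh
# board_memory_mb BOARDS SLOTS_PER_BOARD DIMM_MB POPULATION
# Hypothetical helper. POPULATION is "full" (every DIMM slot used) or
# "half" (half the slots used). SLOTS_PER_BOARD is an input, not a
# Sun Fire specification.
board_memory_mb() {
    boards=$1 slots=$2 dimm=$3 pop=$4
    case $pop in
        full) used=$slots ;;
        half) used=$(( slots / 2 )) ;;
        *) echo "board_memory_mb: population must be full or half" >&2; return 1 ;;
    esac
    echo $(( boards * used * dimm ))
}
```

Under that assumption, board_memory_mb 2 32 512 half prints 16384 (16 GB of 512-MB DIMMs), and the same boards fully populated give 32768.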
Remember that, as discussed previously, paging is undesirable. So, another good rule of thumb is:
A well-designed system should seldom page, and never swap.
It is possible, in fact, to run a large-memory system with very little (if any) swap space. This advice differs somewhat from other commonly available information. One commonly used phrase is “Your swap space should be double the size of your physical memory.” Consider this for a moment. You can easily design a Sun Fire system that has 64 gigabytes of memory. If you were to follow this advice, you would have to have 128 gigabytes of swap space. While a few vendors may require you to have a large swap space, you should not rely on swap for real-world memory usage, as it is too slow. When designing a system, make sure that you purchase enough memory so that your system does not swap. If it does, you need more memory.
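One way to check a running Solaris system against this rule is to scan vmstat output for page-scanner activity (the sr column) and swapped-out threads (the w column). The sketch below is a hypothetical POSIX sh and awk helper, check_vmstat; it assumes the Solaris vmstat column order (r b w swap free re mf pi po fr de sr ...), so other platforms would need different field numbers. It flags any nonzero value for a closer look, which is a deliberately strict reading of “seldom page,” not a published threshold.

```sh
# check_vmstat: hypothetical helper. Reads Solaris vmstat output on stdin
# and reports the largest w (swapped-out threads) and sr (page scan rate)
# values seen. Assumes the Solaris layout r b w swap free re mf pi po fr de
# sr ..., so w is field 3 and sr is field 12. Returns 1 if either is nonzero.
check_vmstat() {
    awk '
        BEGIN { maxw = 0; maxsr = 0 }
        $1 !~ /^[0-9]+$/ { next }   # header lines start with text
        !seen++          { next }   # first data line is the since-boot summary
        {
            if ($3 + 0 > maxw)   maxw = $3 + 0
            if ($12 + 0 > maxsr) maxsr = $12 + 0
        }
        END {
            printf "max_w=%d max_sr=%d\n", maxw, maxsr
            status = (maxw > 0 || maxsr > 0)
            exit status
        }'
}
```

On a Solaris host, vmstat 5 12 | check_vmstat examines about one minute of samples. The helper skips only the first data line, so it suits a single vmstat run rather than a combined log in which every run of vmstat begins with its own since-boot summary.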
Analyzing an Existing System
Often, the purpose of designing a new system is to replace an existing system in your infrastructure. If so, you can benefit from analyzing your existing system, because this analysis will give you a better idea of what problems you are facing. This analysis is also useful if you are trying to upgrade a Sun Fire server. A proper analysis will ensure that you are upgrading the right parts of the system to address the issues.
Before you go any further, you should revisit the design goals you developed in Chapter 2. Doing so will help you properly formulate your statement of the performance problems you are encountering. A good problem statement is:
When many users are logged in, NFS performance is very slow.
A bad problem statement is:
There is a large number in the w column of the vmstat output.
Always start with the perceived problems and requirements. An improvement in these areas is the only way you can tell whether your design is a success. You can only make use of statistics if you know what you are looking for.
The easiest way to analyze a system is by using the stat commands that ship with the Solaris OE, which can be used to monitor the performance of a running system. You can get a full list of the available commands by typing the following command at a shell prompt:
# ls /usr/bin/*stat
This command will display a series of commands, such as vmstat, iostat, netstat, and so on.
You should never use the uptime command to analyze a system. You can use it to show how long your system has been up, but its notion of load is very outdated and fairly useless in the Solaris OE. Most notably, the meaning of a given load varies widely from system to system; a load of 10 may indicate a lack of activity on one machine, but extreme activity on another. We recommend you get in the habit of using vmstat 5 instead of uptime when a machine seems sluggish.
Some stat commands are more useful than others, so the following sections focus on the useful commands (TABLE 1-2).
Collecting and understanding the output from these commands should give you a good idea of what problems your current system is having, and how to improve upon these problem areas in the design of your new server.
The following sections review each command in turn, along with how to properly use each one, so you can gather the best statistics possible. It is important to note that not all options of a given stat command produce useful, or even trustworthy, output in all situations. The focus is on the specific parts of the output of each command that are the most important.
TABLE 1-2 Useful Stat Commands
Command             Description
/usr/bin/vmstat     Virtual memory/paging statistics with CPU/process summaries
/usr/bin/mpstat     Extensive per-processor statistics
/usr/bin/iostat     I/O and NFS statistics
/usr/bin/netstat    Network statistics
/usr/bin/prstat     Summary of active processes, very similar to the top utility
How and When
How and when you monitor a system is just as important as which commands you use to collect statistics and why you use them. You should make sure that you are monitoring the system when it is doing what you want it to do.
In some situations, this is relatively straightforward, such as on a multiuser interactive system. In this case, you want to run your stats during the day, when everyone is doing their normal work. Conversely, if you have a system that serves mainly as a database server, and the load gets very heavy at night when batch jobs are running, you should gather your stats overnight.
When collecting stats unattended (such as overnight), use a simple shell script that writes to a log file in /var/tmp with periodic timestamps. You can use something like the script in CODE EXAMPLE 1-1 to run the stat commands mentioned previously:
CODE EXAMPLE 1-1 nightstats: Script for Unattended Stat Collection
#!/bin/sh
# nightstats - Script for unattended stat collection
if [ $# -lt 2 ]; then
    echo "Usage: $0 stat args ..." >&2
    exit 1
fi
# some basic vars holding the stat name, date, and log file
stat=$1
shift
date=`date +%Y%m%d`
logfile=/var/tmp/$stat.$LOGNAME.$date
# run the stat, writing output to our logfile
exec 1>"$logfile"
echo "Running '$stat $*' as '$LOGNAME'"
# print a timestamp, run the stat to completion, and repeat until killed
while true
do
    date
    $stat "$@"
done
Because the script prints a timestamp and then runs the stat command to completion before looping, you get one timestamp each time the stat command finishes its count. So, if you run:
# nightstats vmstat 5 12
you get a timestamp every 12 repetitions, or roughly once a minute (12 samples at five-second intervals). Note that the nightstats script loops indefinitely, so you must manually kill it when you want it to stop. You can use this script to kick off stats in the background, either using cron or before you go home. For example:
# nohup nightstats vmstat 5 300 2>/dev/null &
# nohup nightstats iostat -xcnz 5 300 2>/dev/null &
Then, when you come to work the next day, you will have a log file in /var/tmp for each stat, with a timestamp roughly every 25 minutes (300 samples at five-second intervals). Each file is named with the name of the stat command, your user name, and the date (the login process sets $LOGNAME to your user name). This allows you to collect stats during the times when your system is under the type of load you care about.
Note – You also want to collect some stats when your system is not busy, which you can then use as a baseline for comparison. Otherwise, you will not be able to tell which stats change when the load increases.
Simulating Loads
Trying to simulate loads is not very useful. In general, a simulated load gives you a poor, if not misleading, picture of what the system is trying to do. For example, the common practice of using dd to write to disks is usually a misrepresentative measure of I/O load. While dd reads and writes sequentially, most real-world disk access is random and is an unpredictable combination of reads and writes. Thus, your configuration could look good on paper and work well when running dd, but work poorly in a real-world application.
To get an accurate picture of your requirements, you should monitor a system that is running what you want it to be running. If you need to simulate this, the best way is to create a test environment that mirrors what you want to design as closely as possible. If you cannot do so, then we recommend that you use the design rules of thumb, and avoid analyzing a dissimilar system, as this can cause you to make poor decisions.
What and Why
Now that you understand how and when to measure your system, the following sections examine each of the different stat commands and what they tell you.
prstat Command
When looking at your stats, the first thing you should know is what your system is doing. Seeing a large amount of disk activity by itself does not tell you anything other than that the system is undergoing a large amount of disk activity. This is where the prstat command comes in. It shows you which processes are active on the system, along with how much CPU time they are using, which processor they are running on, their size in memory, their priority, and more. If you have used the freeware tool top, the output should look very familiar.
Unlike all of the other stat commands, to use prstat you just type the command
with no arguments:
# prstat
The display fills the terminal window and refreshes every 5 seconds. You should
launch prstat in a separate window and keep it going as you use each of the
following stat commands. That way, you can correlate the performance of your
system with what is actively running on it.
vmstat Command
Memory is always the first place to start. If you have a memory bottleneck, then all
of your other stats are going to be unreliable, since the system will be introducing
extra delays trying to manage memory. Often memory problems are misdiagnosed
as I/O or CPU problems, since disk access or applications seem slow to the user. In
reality, these operations are slow because the system is paging or even swapping.
So remember: Always start by looking at memory. Repeat that over and over as a kind
of mantra whenever you are analyzing or designing a system.
The virtual memory management algorithms in the Solaris OE are complex.
Basically everything is seen as a page of memory, including files. While this is a
benefit as far as the system is concerned, it makes analysis more difficult. Therefore,
properly analyzing memory takes several steps.
The simplest way to look at memory is by specifying a time interval to the vmstat
command, and letting it run until you press Ctrl-C to interrupt it. The following
vmstat command monitors the system in five-second intervals:
CODE EXAMPLE 1-2 How to Use the vmstat Command
# vmstat 5
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 -- -- -- in sy cs us sy id
0 0 20 1461688 510080 37 185 30 1 2 0 0 1 0 0 0 667 650 292 4 2 94
0 0 64 1468888 197976 8 43 0 0 0 0 0 0 0 0 0 638 571 269 1 1 98
0 0 64 1469320 198528 0 0 1 0 0 0 0 0 0 0 0 642 467 256 0 1 98
The first line of the vmstat output is a summary.
Note – Always ignore the first line of any stat command. It does not provide any
useful information because it is a summary for as long as the system has been up.
Summaries span too long a period of time, and they give you no indication as to the
use of the system during that time.
When looking at the output from vmstat (CODE EXAMPLE 1-2), you will notice a lot
of columns. You should ignore all the fields about disks and device interrupts, as
there are better tools for monitoring these stats, which we will describe in
subsequent sections. In fact, only some of these columns (TABLE 1-3) are really useful.
TABLE 1-3 Important vmstat Command Output Columns
Column Heading  Meaning
r               Number of runnable processes (waiting for CPU time)
b               Number of blocked processes (waiting for I/O, paging, and so on)
w               Number of runnable but swapped-out processes (normally 0)
re              Page reclaims (pages reclaimed from the free list)
mf              Minor page faults
pi              Kilobytes paged in (including process startup and file access)
po              Kilobytes paged out (should be close to 0)
sr              Pages scanned by the page-out scanner (also close to 0)
us              Percentage of CPU time spent in user mode
sy              Percentage of CPU time spent in system mode
id              Percentage of CPU time spent idle
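Because vmstat output mixes header lines, a since-boot summary line, and interval data, it helps to turn it into named columns before applying the rules in TABLE 1-3. The following is a minimal Python sketch, not a Sun tool; the function name parse_vmstat is hypothetical, and it assumes the default vmstat interval layout shown in CODE EXAMPLE 1-2.

```python
def parse_vmstat(text):
    """Turn default 'vmstat <interval>' output into one dict per interval.

    Header lines are recognized by their first field and may be reprinted
    periodically. The first data line (the since-boot summary) is dropped.
    Column names repeat in vmstat output ('sy' for system calls and for CPU
    system time, '--' for unused disk slots); dict() keeps the last
    occurrence, so 'sy' in the result means CPU system time.
    """
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("#", "procs"):
            continue  # blank line, shell prompt, or group-name header
        if fields[0] == "r":
            columns = fields  # column-name header
            continue
        if columns is None or len(fields) != len(columns):
            continue  # anything that is not a complete data line
        rows.append(dict(zip(columns, map(int, fields))))
    return rows[1:]  # ignore the since-boot summary line
```

Fed the output in CODE EXAMPLE 1-2, this returns two intervals; the 20 in the w column of the summary line never reaches your analysis.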
First, look at the procs headings. Normally, the r, b, and w columns are fairly low
numbers, if not 0. This is because, generally, these columns only become nonzero if a
process is waiting for something, either a CPU (r), I/O (b), or enough memory (w).
Large numbers in these columns are usually bad.
One caveat is that you may occasionally see a steady, unchanging number in the w
column. This means that the Solaris software has decided these processes have been
idle so long they should be swapped out to make room for other things. Do not be
concerned about this.
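The procs rules above can be expressed as a quick check over parsed vmstat intervals. This Python sketch assumes each interval is a dict keyed by vmstat column name, with the since-boot summary line already removed; the function name is hypothetical, and the limit of 2 is a placeholder, since what counts as a large number depends on your own baseline.

```python
def review_procs(rows, limit=2):
    """Apply the procs-column rules of thumb to a list of vmstat intervals.

    rows: dicts with integer 'r', 'b', and 'w' entries, one per interval.
    limit: hypothetical threshold for r and b; calibrate it to your baseline.
    """
    findings = []
    busy = [i for i, row in enumerate(rows) if row["r"] > limit or row["b"] > limit]
    if busy:
        findings.append(f"r or b above {limit} in intervals {busy}")
    w_values = {row["w"] for row in rows}
    if len(w_values) == 1 and 0 not in w_values:
        # A steady, unchanging w: idle processes were swapped out. Not a concern.
        findings.append("steady nonzero w (idle processes swapped out)")
    elif len(w_values) > 1:
        # w is moving, so processes may be waiting for memory; check sr and po.
        findings.append("w changing between intervals; check sr and po")
    return findings
```

An empty result means nothing in the procs columns stands out for the chosen limit.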
The cpu columns give you a good system-at-a-glance snapshot of what the system is
doing, averaged across all processors. In general, non-idle time should be spent in
roughly a 2-to-1 ratio between user and system mode (us to sy). Also, if idle time (id)
is close to zero consistently, you probably need some additional CPUs, especially if the r column is
a large number. Beyond this, to get a good view of your CPUs you should use the
mpstat command, as explained in “mpstat Command” on page 18.
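Here is a minimal Python sketch of the cpu-column rules of thumb. It assumes per-interval dicts keyed by vmstat column name (us, sy, id, r) with the summary line removed; the function name and the idle_floor and run_queue_limit defaults are hypothetical, and a us:sy ratio below 2:1 is only a prompt to look closer, not a diagnosis.

```python
def review_cpu(rows, idle_floor=5, run_queue_limit=2):
    """Check the vmstat cpu columns against the rules of thumb above.

    rows: per-interval dicts with integer 'us', 'sy', 'id', and 'r' entries.
    idle_floor and run_queue_limit are hypothetical thresholds.
    """
    notes = []
    user = sum(row["us"] for row in rows)
    system = sum(row["sy"] for row in rows)
    if system and user / system < 2:
        notes.append(f"us:sy is about {user / system:.1f}:1, below roughly 2:1")
    if rows and all(row["id"] <= idle_floor for row in rows):
        if all(row["r"] > run_queue_limit for row in rows):
            notes.append("idle consistently near zero with a long run queue: consider more CPUs")
        else:
            notes.append("idle consistently near zero: examine CPUs with mpstat")
    return notes
```

For example, two of the heavy-paging intervals in CODE EXAMPLE 1-3 (us of 24 and 16, sy of 30 and 26) give a us:sy ratio of about 0.7:1, which points back at memory rather than at the CPUs.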
On to memory. First, note that the free column should be completely ignored, as it
does not in any way correspond to what is thought of as free memory. Because of the
way the Solaris software manages memory, the free list does not properly count
multiple processes sharing the same pages, or unused pages that have yet to be
reclaimed. In addition, the file cache grows to consume most of free memory to
improve performance.
Consequently, the free list tends to decrease steadily over the uptime of a system,
when in fact the system is efficiently reclaiming and reusing memory.
If you want a better picture of available virtual memory, you can use the swap
command:
# swap -l
swapfile dev swaplo blocks free
/dev/dsk/c0t1d0s0 227,6 16 4093712 4093712
# swap -s
total: 494360k bytes allocated + 35568k reserved = 529928k used,
25137440k available
If both the free column from the first command and the available column from
the second command are nonzero, the system is all right. Beyond that, you can
ignore the concept of free memory.
Instead, the most important column of vmstat is the scan rate (sr). This column
shows the number of pages scanned in an attempt to free unused memory. The
pageout scanner starts running only when free memory goes below the kernel
parameter lotsfree, which is a small percentage of physical memory. When you
see an increase in the scan rate, you should also see a jump in the page-outs (po),
indicating that pages are being moved from physical memory to swap space. If you
see this consistently, it is evidence of a memory shortage: the system needs more
memory. If this only happens occasionally, then you should explore whether better
job scheduling or /etc/system tuning could help. If not, you need more memory.
Note – A high number in the page-ins (pi) column is not necessarily significant.
This is because when a new process starts, its executable image and data must be
read into memory. In addition, file system access appears in the pi column. A large
number in the pi column is only relevant if the po column is large too.
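The scan-rate rule can be checked mechanically: look for stretches where sr and po are both nonzero, and ignore pi on its own. This Python sketch assumes per-interval dicts with the summary line removed; the function name is hypothetical, and min_intervals=3 is a placeholder for "consistently" that you should tune against your baseline.

```python
def sustained_paging(rows, min_intervals=3):
    """Find runs of consecutive intervals in which sr and po are both nonzero.

    rows: per-interval dicts with integer 'sr' and 'po' entries, with the
    since-boot summary line already removed. pi is deliberately ignored:
    page-ins alone come from process startup and file reads.
    min_intervals: hypothetical cutoff separating brief paging from
    sustained paging. Returns (first, last) index pairs.
    """
    runs, start = [], None
    for i, row in enumerate(rows + [{"sr": 0, "po": 0}]):  # sentinel closes a final run
        paging = row["sr"] > 0 and row["po"] > 0
        if paging and start is None:
            start = i
        elif not paging and start is not None:
            if i - start >= min_intervals:
                runs.append((start, i - 1))
            start = None
    return runs
```

A brief burst shorter than min_intervals is not reported, matching the advice that brief periods of paging are not important.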
Here is an example of a system that is undergoing heavy paging because it is
reading in a large file.
CODE EXAMPLE 1-3 vmstat 5 Command Output Reading a Large File
# vmstat 5
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s6 s7 in sy cs us sy id
0 0 0 2406032 431280 8 72 2 0 0 0 0 0 1 0 1 121 87 202 1 5 94
0 0 24 2489472 643792 0 0 1 0 0 0 0 0 0 0 0 328 86 108 0 2 98
0 0 24 2489472 643784 61 252 483 0 0 0 0 0 5 0 9 466 718 260 3 8 90
0 0 24 2452936 605616 1396 1753 10950 0 0 0 0 0 9 0 77 1266 2363 801 50 45 5
0 0 24 2383216 531176 1484 1860 11822 0 0 0 0 0 53 0 40 790 1897 357 55 33 12
0 0 24 2309576 458256 1435 1773 11475 0 0 0 0 0 69 0 23 697 1791 247 51 30 19
0 0 24 2236608 391168 1374 1761 11008 0 0 0 0 0 52 0 40 775 1613 235 49 35 17
0 0 24 2165824 324224 1411 1700 11291 0 0 0 0 0 75 0 16 751 1652 239 47 32 21
0 0 24 2097680 253816 1378 1720 11012 0 0 0 0 0 0 0 87 746 1687 246 47 33 20
0 0 24 2028800 184168 1330 2020 10614 0 0 0 0 0 73 0 11 719 1608 239 52 33 16
0 0 24 1948016 110880 1350 1649 10790 0 0 0 0 0 56 0 37 764 1605 246 49 32 19
0 0 24 1886176 48208 1282 1666 10187 8 8 0 13 0 1 0 89 793 1934 312 44 37 19
0 0 24 1835416 7280 688 836 5598 5328 5529 0 6586 0 94 0 47 1088 889 238 24 30 46
0 0 24 1803768 6680 353 675 3052 6657 6808 0 6749 0 80 0 80 1287 478 435 16 26 58
0 1 24 1790704 15856 236 393 832 4470 4481 0 1665 0 68 0 107 2579 792 1416 11 41 48
0 1 24 1784160 18152 35 388 812 3136 3144 0 839 0 60 0 92 1473 724 800 10 29 62
0 1 24 1777192 18488 29 317 988 2536 2540 0 634 0 66 0 97 988 446 422 7 15 78
0 1 24 1770768 18664 20 326 942 2334 2345 0 616 0 77 0 77 953 518 409 7 18 75
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s6 s7 in sy cs us sy id
0 0 24 1764704 18528 37 339 820 2636 2648 0 699 0 105 0 48 961 509 343 8 20 71
0 1 24 1757544 18640 30 264 1051 2206 2214 0 602 0 124 0 43 963 331 398 5 15 80
0 1 24 1753544 18248 19 255 1081 2048 2056 0 880 0 97 0 70 960 323 412 5 13 81
0 1 24 1749440 18664 20 258 1046 2048 2062 0 632 0 99 0 63 974 491 443 8 14 77
0 1 24 1744720 18720 17 255 1009 2152 2153 0 552 0 102 0 58 1012 344 449 6 15 79
0 1 24 1739992 18920 16 256 1008 1974 1982 0 529 0 101 0 55 929 324 379 6 16 78
0 1 24 1735416 18800 16 261 998 2048 2052 0 536 0 107 0 55 966 315 379 5 15 80
0 0 24 1729704 18768 54 268 833 2177 2179 0 546 0 83 0 55 862 352 338 18 13 69
0 1 24 1728480 18816 105 403 1140 1971 1974 0 552 0 110 0 62 1027 569 492 7 11 83
0 1 24 1728600 18888 48 196 1118 1484 1489 0 470 0 110 0 53 1014 261 484 5 10 85
0 1 24 1728496 18832 42 191 1304 1536 1544 0 525 0 123 0 51 1000 160 455 2 8 89
1 0 24 1728344 37712 372 143 946 1176 1178 0 318 0 103 0 43 789 84 335 3 21 76
0 0 24 2489048 652144 3 78 32 0 0 0 0 0 6 0 5 427 310 168 1 4 95
0 0 24 2488840 651872 0 1 11 0 0 0 0 0 1 0 1 351 134 138 0 2 98
0 0 24 2488792 651776 0 0 3 0 0 0 0 0 0 0 0 349 114 128 0 2 98
Notice that for the first half of the output, the pi column shows large values, but po
stays at zero, due to the file system activity of reading the file. As it progresses, though, notice the
abrupt jump in po as well as in sr. Also notice how much pi and user time
(us) drop. The system is spending an inordinate amount of time managing memory,
slowing down how quickly it can read in the file.
As with all stats, brief periods of paging are not important. The purpose of having
virtual memory is to allow you to temporarily exceed your available physical
memory. You just want to make sure the system is not paging continuously for
extended periods of time.
By now you should have a rough idea of what your system is doing. To really
understand what is going on, though, you must be able to differentiate between file
system pages, executable pages, and so on. To do this you can use the vmstat
-p option.
vmstat Command -p Option
Using the vmstat -p option fundamentally changes the type of data reported by the
command. The -p option replaces the columns on processes, CPUs, disks, and
interrupts with extended statistics on memory and paging, and displays pi, po, and
pf separately for executable, anonymous, and file system pages.
Examine each of the three types of pages shown by the vmstat -p option:
TABLE 1-4 Page Types Shown By vmstat -p Option
Page type   Meaning
executable  Images of executable programs and their data
anonymous   Used for process heap space, stack, and private pages
filesystem  Files mapped into address space through the mmap system call
Under each page type heading are the following fields, where ? is replaced with the
first letter of the page type:
TABLE 1-5 Page Stats Shown By vmstat -p Option
Column Heading  Meaning
?pi             Kilobytes paged in
?po             Kilobytes paged out
?pf             Page faults
As with the vmstat output, the key field is still sr, showing the scan rate. The
benefit you get with -p is that you can now see what types of pages need the space,
allowing you to better understand what the system is doing.
Look again at the system that is reading in a large file, only this time with the
vmstat -p option.
CODE EXAMPLE 1-4 vmstat -p 5 Command Output Reading a Large File
# vmstat -p 5
memory page executable anonymous filesystem
swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf
2406040 431296 8 72 0 0 0 0 0 0 0 0 0 2 0 0
2489992 630792 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2480080 620344 785 1021 1 0 0 6 0 0 67 0 0 6174 1 1
2417296 557472 1514 2830 0 0 0 0 0 0 0 0 0 10777 0 0
2349520 493576 1330 2515 0 0 0 0 0 0 0 0 0 9523 0 0
2293456 459296 1295 2684 0 0 0 0 0 0 0 0 0 9088 0 0
2230072 399424 1256 1751 0 0 0 0 0 0 0 0 0 9881 0 0
2164832 334864 1403 1700 0 0 0 0 0 0 0 0 0 11212 0 0
2097432 267288 1415 1716 0 0 0 0 0 0 0 0 0 11212 0 0
2021736 192344 1330 2024 0 0 0 0 0 0 0 0 0 10638 0 0
1947168 122688 1330 1604 0 0 0 0 0 0 0 0 0 10558 0 0
1883288 59216 1324 1658 0 0 0 0 0 0 0 0 0 10568 0 0
1832784 12056 836 863 3936 0 5059 1 0 76 1 3548 3846 6808 4 12
1798648 8656 207 654 6531 0 6519 0 0 72 353 6374 6446 1502 1 12
1787016 17864 49 461 4094 0 927 6 0 6 646 4076 4084 12 3 3
memory page executable anonymous filesystem
swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf
1776040 18448 38 530 3036 0 678 8 0 6 774 3020 3028 3 0 1
1766800 18488 32 319 2592 0 625 4 0 3 952 2585 2585 1 1 3
1761080 18696 32 309 2465 0 549 0 0 1 963 2457 2460 1 0 3
1754696 18600 31 302 2420 0 534 0 0 8 937 2406 2412 1 0 0
1748608 18640 30 308 2488 0 534 3 0 3 945 2483 2484 1 0 0
1742504 18784 23 285 2318 0 508 3 0 6 968 2304 2307 3 1 4
1736960 18784 21 291 2268 0 491 3 0 8 979 2252 2259 3 1 1
1731008 18584 94 291 2369 0 535 0 0 9 811 2355 2358 3 0 1
1729800 18744 75 214 1697 0 497 0 0 4 1112 1689 1692 1 0 0
1729840 18664 57 202 1601 0 538 0 0 4 1156 1587 1595 0 1 1
1881984 149608 470 122 984 0 366 30 0 6 728 972 976 0 0 1
2490440 672488 0 0 0 0 0 0 0 0 0 0 0 4 0 0
2490744 672672 10 168 0 0 0 8 0 0 6 0 0 16 0 0
2490768 672512 0 2 0 0 0 0 0 0 16 0 0 0 0 0
As you can see, this makes what is happening to the system much clearer. The
system starts by paging in the file very effectively, until it hits the lotsfree limit and
the page-out scanner starts. At this point, there is a big jump in the sr column. Also
notice the abrupt shift from file system page-ins (fpi) to anonymous page activity
(api, apo, and apf). This means that pages are being taken from other processes to make room for the
file in memory. Thus, if you see a lot of activity in the apo and sr columns, you
need more memory.
Memory analysis can be complicated, but if you pay attention to the sr and
po columns, you should be able to tell whether your system needs additional memory.
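To apply the apo-plus-sr rule across a long capture, you can label each vmstat -p interval by the page type doing most of the paging out. This Python sketch assumes the column order shown in CODE EXAMPLE 1-4 and data lines only (headers removed); the function names are hypothetical, and picking the largest of epo, apo, and fpo is a simplification chosen for illustration.

```python
from collections import Counter

# Column order of 'vmstat -p' as in CODE EXAMPLE 1-4.
VMSTAT_P_COLUMNS = "swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf".split()

def row_from_line(line):
    """Turn one vmstat -p data line into a dict keyed by column name."""
    return dict(zip(VMSTAT_P_COLUMNS, map(int, line.split())))

def paging_source(row):
    """Name the page type doing most of the paging out in one interval."""
    if row["sr"] == 0:
        return "no scanning"
    outs = {"executable": row["epo"], "anonymous": row["apo"], "filesystem": row["fpo"]}
    kind = max(outs, key=outs.get)
    return kind if outs[kind] > 0 else "scanning without page-outs"

def summarize(lines):
    """Count intervals by paging source; many 'anonymous' ones suggest a memory shortage."""
    return Counter(paging_source(row_from_line(line)) for line in lines)
```

On the sample above, the early intervals come back as "no scanning" while the file is read in, and every interval after the scanner starts is labeled "anonymous".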
mpstat Command
The Sun Fire system is designed to be a multiprocessor system, as evidenced by the
fact that you cannot even buy a system with only one CPU. Although CPUs are the
second thing you examine, being processor-bound is the least likely cause of
poor performance. If anything, you examine CPUs second so that you can
double-check this assumption and rule it out as a possible factor. CPUs usually only
become a factor in heavily loaded systems that are doing lots of interactive or
transactional processing. In most other cases, if you buy enough system boards to
hold all your memory, the CPUs that are included are usually sufficient.
As mentioned previously, the cpu columns of the vmstat output are a good place to
start. Generally, a large percentage of idle time indicates that your processing power
is sufficient. However, measuring idle time across a lot of processors can mask
situations such as one processor getting swamped with interrupts while the rest do
nothing. So, it is important to look at your CPUs in detail to make sure you are not
missing anything.
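One way to catch a single swamped processor is to compare each CPU against the others in the same mpstat interval rather than against a system-wide average. This Python sketch assumes the mpstat column order shown in CODE EXAMPLE 1-5 and takes only the per-CPU data lines of one interval (not the first, since-boot interval); the function names are hypothetical, and the factor of 2 is an illustrative cutoff.

```python
MPSTAT_COLUMNS = ("CPU minf mjf xcal intr ithr csw icsw migr smtx srw "
                  "syscl usr sys wt idl").split()

def parse_interval(lines):
    """Turn the per-CPU data lines of one mpstat interval into dicts."""
    return [dict(zip(MPSTAT_COLUMNS, map(int, line.split()))) for line in lines]

def idle_spread(cpus):
    """Return (average idl, lowest idl): a healthy average can hide one busy CPU."""
    idle = [cpu["idl"] for cpu in cpus]
    return sum(idle) / len(idle), min(idle)

def interrupt_hotspots(cpus, factor=2.0):
    """CPU ids whose intr count exceeds 'factor' times the mean of the other CPUs.

    'factor' is a hypothetical cutoff chosen for illustration.
    """
    hot = []
    for cpu in cpus:
        others = [c["intr"] for c in cpus if c is not cpu]
        if others and cpu["intr"] > factor * sum(others) / len(others):
            hot.append(cpu["CPU"])
    return hot
```

Applied to the second interval of CODE EXAMPLE 1-5, interrupt_hotspots flags CPU 1 (585 interrupts against an average of about 200 on the other CPUs), while idle_spread reports an average of 78 percent idle but a minimum of 67 percent.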
Like vmstat, just launch mpstat with a time interval and let it run:
CODE EXAMPLE 1-5 How to Use the mpstat Command
# mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 372 2 836 447 300 393 47 25 26 1 918 23 10 0 67
1 370 2 622 543 523 301 40 23 35 0 932 24 11 0 65
2 376 2 527 151 100 396 48 25 26 0 926 24 10 0 66
3 372 2 531 151 100 397 48 25 26 0 921 23 10 0 67
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 229 0 546 400 300 458 0 12 13 1 563 2 9 0 89
1 132 0 2018 585 585 111 0 9 16 0 621 4 8 1 88
2 265 0 199 100 100 354 1 9 15 0 770 21 9 1 68
3 363 0 491 101 100 671 1 14 18 0 1339 22 11 0 67
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 155 0 445 400 300 495 0 12 7 0 398 1 6 0 92
1 99 0 145 348 347 134 1 10 10 0 487 13 4 0 83
2 154 0 401 101 100 255 1 8 4 0 723 21 5 0 73
3 307 0 227 100 100 178 0 11 9 0 989 23 8 1 69
This command produces a lot of columns, only some of which you care about:
TABLE 1-6 Important mpstat Command Output Columns
Column Heading  Meaning
xcal            Interprocessor cross-calls
intr            Interrupts
csw             Context switches
icsw            Involuntary context switches
A cross-call (xcal) is a call used by a processor to tell other processors to do
something. Cross-calls are used for a variety of things, such as delivering a signal to
another processor or ensuring virtual memory consistency. This latter use is very
common, as it happens during file system activity. Heavy file system activity (such
as NFS) can result in a lot of cross-calls. Also, it is not unusual for the boot processor to
show thousands of xcals, as it maintains lots of information about the others.
An interrupt (intr) is the mechanism that a device uses to signal to the kernel that
it needs attention, and some immediate processing is required on its behalf. I/O is
the major contributor to interrupts, although there are also “special” interrupts such
as the system-wide clock thread that occurs regularly. Interrupts, unlike everything
else, are not distributed across all CPUs. Instead, the Solaris OE binds each source of
interrupts to a specific CPU.
The term context switch (csw) refers to the process of moving a thread on and off a
CPU. Context switches are a normal but somewhat expensive occurrence because
switching context involves certain overhead, such as populating the stack. Normally,
a context switch occurs when a process is done with the CPU and another process is
given a chance to run. Thus, a steady number of context switches is insignificant.
Involuntary context switches (icsw), on the other hand, are much less favorable.
When a process is given access to the CPU, it has a limited time window in which
to run, depending on how many other processes are running, what their priority is,
and so on. This is the nature of scheduling. An involuntary context switch means
that the process was forcibly stopped by the scheduler before it was finished; the
time allotted was too short for the process to finish in, or a higher-priority thread
preempted it. A few of these are nothing to be concerned about, but getting a large
number of these regularly indicates that the system does not have enough
processing power to handle all of the things that need to run. You need additional
CPUs.
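Because a few involuntary context switches are normal, the useful signal is a high icsw count that recurs interval after interval. The Python sketch below counts, per CPU, how often icsw exceeds a limit; the function name is hypothetical, both the limit of 50 and the 80 percent fraction are placeholder values to replace with figures from your own baseline, and it expects the first, since-boot mpstat interval to have been dropped already.

```python
from collections import Counter

def persistent_icsw(intervals, limit=50, min_fraction=0.8):
    """Return CPU ids with high involuntary context switches in most intervals.

    intervals: one list per mpstat interval (since-boot interval removed),
    each holding per-CPU dicts with integer 'CPU' and 'icsw' entries.
    limit and min_fraction are hypothetical: a few icsw are normal, so
    only counts above 'limit' in at least 'min_fraction' of the intervals
    are reported as regular.
    """
    hits = Counter()
    for interval in intervals:
        for cpu in interval:
            if cpu["icsw"] > limit:
                hits[cpu["CPU"]] += 1
    needed = min_fraction * len(intervals)
    return sorted(cpu for cpu, count in hits.items() if count >= needed)
```

A CPU that spikes once is not reported; one that stays above the limit in most intervals is, which is the pattern that calls for additional CPUs.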
Finally, a spin on a mutex lock (smtx) happens when a thread cannot access a
section of the kernel that it needs on the first try. The term mutex is short for
mutual exclusion lock, and is used in multithreaded operating systems like the Solaris
OE to allow multiple threads to run concurrently in system mode. When a thread
enters system mode, it locks the part of the kernel it is using by acquiring the mutex
TABLE 1-6 Important mpstat Command Output Columns (Continued)
Column Heading  Meaning
smtx            Spins on mutex locks
usr             Percent user time
sys             Percent system time
wt              Percent wait time
idl             Percent idle time
TABLE 1-7 Analyzing the mpstat Command Output (Continued)
If you see this...                      It probably means...
High smtx; high sys or idl              Contention for kernel system resources exists.
High icsw                               Contention for basic CPU resources exists.
High csw or xcal; high sys; low usr     If this happens consistently, you may require more CPUs, depending on your applications. If you are not noticing any slowness in applications or system problems, however, ignore it.
High sys; low usr; all other stats low  Your system is spending too much time managing resources. Check vmstat first.
One nice thing is that the solution to all of these problems is the same. The system
needs more and/or faster CPUs. Once again, though, the importance of having
enough memory is emphasized here. When you add a CPU, you incur additional
overhead in the form of more kernel space needed to manage that CPU, and space
for that CPU to do its own work. Therefore, the rule of thumb is:
Whenever you add additional CPUs, you should also add memory.
Doing so will help prevent accidental memory shortages, which can actually make
your system run slower as you add more CPUs.
iostat Command
Proper I/O layout is complicated; it is almost never done right the first time. Part of
the reason for this is that usage patterns and requirements change over time. Also,
where you add memory and CPUs is somewhat predetermined. Where you add disk
devices and controller cards, though, has a big impact on the system. Therefore, it is
important to make sure that the I/O layout is flexible enough to handle future
changes and expansion.
Fortunately, I/O analysis is very straightforward. There is only one version of the
iostat command to run, iostat -zxcn.
For this version of the iostat command, the output shows CPU utilization (-c) and
extended statistics (-x) for only those disk devices with nonzero activity (-z), labeled by
descriptive device name (-n) instead of kernel instance name (that is, c0t0d0 instead of sd0). If you are using
individual disk partitions, you may also want to use the -p option. However, most
production environments manage their disks with some type of volume manager
package, so in practice this option is not that useful.
CODE EXAMPLE 1-6 How to Use the iostat Command
# iostat -zxcn 5
<summary omitted>
cpu
us sy wt id
0 1 5 93
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.2 0.0 1.6 0.0 0.0 0.0 8.2 0 0 c0t0d0
0.0 0.2 0.0 1.6 0.0 0.0 0.0 10.1 0 0 c5t0d0
0.0 34.6 0.0 2201.0 0.0 1.0 0.0 27.9 0 97 c12t1d0
0.0 0.2 0.0 1.6 0.0 0.0 0.0 12.2 0 0 c20t122d0
0.0 0.2 0.0 1.6 0.0 0.0 0.0 14.1 0 0 c20t98d0
0.0 14.2 0.0 113.6 0.0 0.1 0.0 5.5 0 8 c20t101d0
0.0 0.2 0.0 0.5 0.0 0.0 0.0 30.1 0 1 c10t1d0
0.0 58.4 0.0 135.4 0.0 0.3 0.0 5.6 0 30 c2t17d0
1.0 12.8 8.0 135.0 0.0 0.2 0.0 11.6 0 13 c2t16d0
0.0 3.4 0.0 19.2 0.0 0.1 0.0 17.9 0 4 c2t9d0
0.0 0.4 0.0 0.8 0.0 0.0 0.0 5.3 0 0 c2t21d0
0.0 1.8 0.0 1.4 0.0 0.0 0.0 4.2 0 1 c27t42d0
0.4 9.2 3.2 155.9 0.0 0.1 0.0 10.9 0 8 c28t69d0
0.0 9.0 0.0 157.5 0.0 0.1 0.0 9.0 0 6 c28t68d0
0.0 1.8 0.0 1.4 0.0 0.0 0.0 4.8 0 1 c29t1d0
0.0 9.0 0.0 157.5 0.0 0.1 0.0 9.5 0 7 c30t35d0
0.0 0.4 0.0 0.8 0.0 0.0 0.0 5.1 0 0 c30t52d0
0.0 9.2 0.0 155.9 0.0 0.1 0.0 10.3 0 7 c30t36d0
0.0 58.4 0.0 135.4 0.0 0.4 0.0 6.5 0 35 c31t66d0
0.4 12.8 3.2 135.0 0.0 0.2 0.0 12.1 0 12 c31t64d0
0.0 3.4 0.0 19.2 0.0 0.1 0.0 17.7 0 4 c31t90d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 tomax:/export/mirrors/pkg.eng/export/pkg
0.0 0.2 0.0 0.4 0.0 0.0 0.1 1.0 0 0 twinsun-n1:/export/workspace/d0/nwiger
As with the other stat commands, there are only a few columns you care about
(TABLE 1-8).
You can ignore two commonly used columns, %w and %b, which are supposedly the
percentage of time spent waiting and busy, respectively. Because of the complexity of
modern disks and controllers, these calculations are very inaccurate. Often the two
will total more than 100 percent, which should be impossible. Besides, these columns
do not tell you anything that you cannot find out by looking at wsvc_t or asvc_t.
Analogous to the mpstat command, when looking at iostat you should always
watch the first two columns listed (kr/s and kw/s) to see how much activity the
disks are undergoing. Then, basically, the last three columns listed (wait, wsvc_t, and
asvc_t) should be as close to zero as possible. This indicates that the system has very fast disks, and that the I/O
is laid out correctly to avoid controller bottlenecks.1
In practice, asvc_t will be nonzero for any disks undergoing activity, since it
always takes some amount of time for a disk to fulfill a request. As with any stat,
you will only be able to tell if the system is particularly busy after establishing a
baseline. However, several facts are true:
1. Service times across equally active disks should be fairly even.
2. You should not see huge peaks and valleys under normal conditions.
3. You should rarely, if ever, see a nonzero number in wait or wsvc_t.
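The three facts above translate into simple checks on iostat -xn device lines. This Python sketch assumes the column order shown in CODE EXAMPLE 1-6, with each NFS mount name on a single line, and leaves it to you to say which disks should be equally active (for example, the two halves of a mirror); the function names are hypothetical.

```python
IOSTAT_COLUMNS = "r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b".split()

def parse_iostat_xn(text):
    """Map device name to its statistics for 'iostat -xn' device lines.

    Lines that are not a complete device row (the cpu section, the
    'extended device statistics' banner, the column headings) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(IOSTAT_COLUMNS) + 1:
            continue
        try:
            values = [float(f) for f in fields[:-1]]
        except ValueError:
            continue  # the column-heading line
        stats[fields[-1]] = dict(zip(IOSTAT_COLUMNS, values))
    return stats

def queued_devices(stats):
    """Devices with a nonzero wait or wsvc_t, which should be rare."""
    return sorted(d for d, s in stats.items() if s["wait"] > 0 or s["wsvc_t"] > 0)

def service_time_spread(stats, devices):
    """Slowest-to-fastest asvc_t ratio across disks expected to be equally
    active; a value close to 1.0 means the service times are even."""
    times = [stats[d]["asvc_t"] for d in devices]
    return max(times) / min(times) if min(times) > 0 else float("inf")
```

In CODE EXAMPLE 1-6, c2t17d0 and c31t66d0 carry identical write loads (58.4 w/s) with asvc_t of 5.6 and 6.5 ms, a spread of about 1.16, and the only device with a nonzero wsvc_t is the NFS mount twinsun-n1:/export/workspace/d0/nwiger.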
You may, occasionally, see a temporary jump in service times (asvc_t) even though
there is nothing apparently going on (that is, kr/s and kw/s are almost 0). This is
due to a somewhat strange behavior of fsflush, the daemon responsible for
flushing disk buffers. Periodically, it will generate a long, random series of writes in
a short time period. This results in a queue forming, which bumps up the service
time, even though there is no real apparent activity on the disk. If you see this,
ignore it.
TABLE 1-8 Important iostat Command Columns
Column Heading  Meaning
kr/s            Kilobytes read per second
kw/s            Kilobytes written per second
wait            Number of transactions waiting for service
wsvc_t          Average service time in wait queue, in milliseconds
asvc_t          Average service time for active transactions, in milliseconds
1. Without the -n option, wsvc_t and asvc_t are combined into a single svc_t column.
Despite its limitations, you can tell several things from the netstat command
output. Unlike the other stats, you must run the netstat command separately for
each interface you have configured by specifying the -I option along with the
interface name.
CODE EXAMPLE 1-8 How to Use the netstat Command
# netstat -I ge0 5
input hme0 output input (Total) output
packets errs packets errs colls packets errs packets errs colls
909076714 0 837319344 0 0 918674892 0 846917522 0 0
667 0 681 0 0 673 0 687 0 0
426 0 402 0 0 428 0 404 0 0
1886 0 3684 0 0 1886 0 3684 0 0
1878 0 3117 0 0 1882 0 3121 0 0
411 0 391 0 0 411 0 391 0 0
You can tell two things from this display:
1. Total number of packets received (input) and transmitted (output) during that
interval, both for that interface (left set of columns) and for all interfaces (right set
of columns). This is not an average per second, but a total count.
2. Number of errors and collisions, which should always be low or zero.
Network capacity is very difficult to gauge with this limited information. Without
the sizes of each packet, it is impossible to know if you are anywhere near the
throughput limits for the interface you are analyzing. Given these limits, if the
network seems slow, and you are seeing thousands and thousands of packets each
second, try adding another network interface card to see if it helps. If not, you
should examine your network as a whole to see if you have more widespread issues.
Many available freeware tools, such as the SE Toolkit and Multi Router Traffic
Grapher (MRTG), provide better network analysis than netstat. You can use tools
such as these to more properly gauge the bandwidth being used by each interface.
MRTG is especially useful, as it graphs utilization over time so you can easily see
when your network interfaces are getting busy, as well as how much bandwidth
they are pushing.
Analysis Reveals...
By this point, you should have a good idea about where the system is weak. Make
sure you have good notes, as you need this information in the next chapter when
you design your new system.
Giving performance tuning a full treatment is beyond the scope of this book. True
performance tuning gets exponentially harder; it is much more difficult to get the
last 10 percent out of a system than the first 90 percent. If you are interested in high-end
performance tuning, read Sun Performance and Tuning: Java and the Internet, 2nd
Edition by Adrian Cockcroft and Richard Pettit (ISBN 0-13-095249-4) and
“Application Performance Optimization” by Börje Lindh, Sun Microsystems AB,
Sweden (Sun BluePrints™ OnLine, March 2002).
Designing for RAS
This is the final step in the design process. By now, you should have a fairly clear
understanding of what your requirements are, as well as any possible problems with
your existing system. Up until now, this book focused mainly on performance
because you should make sure any solution you develop can meet your fundamental
application requirements. However, properly designing for RAS is just as important,
and requires some thought.
Always keep three principles in mind when designing for RAS:
• The more RAS you want, the more hardware you must add to the system.
• RAS is not just a function of the Sun Fire server, but of your entire site.
• Maximizing RAS can decrease performance.
The first point is almost always overlooked. As an example, to effectively use DR,
you should add boards in your design beyond those required for your applications.
Why? Because otherwise, when the system dynamically reconfigures a board out of
the system, it will not have enough resources to run your applications. The system
could start paging, or the CPUs could get too busy handling I/O interrupts to do
any real work. The requirements you have formed up to this point are the minimum you
need for your system.
As for the second point, purchasing redundant power supplies does not benefit you
if your site has only a single power grid with no UPS system. RAS is a function of
your entire site, not just one server in isolation. As with performance, getting that
final 10 percent of reliability out of a site gets exponentially more difficult, and
costly. Therefore, you should be realistic about both your requirements and
expectations, and about your ability to fund them.
Third, taking advantage of certain RAS features and methodologies can decrease the
performance of your system. For example, if you mirror file systems, for each write
the system must now perform two writes, one to each half of the mirror. Some of
these effects can be mitigated, for instance by placing the two halves of the mirror on
different I/O controllers.1 However, such performance hits can add up, so it is
important to realize it is impossible to maximize both RAS and performance.
1. In fact, many volume managers will "round robin" between the two halves of a mirror on reads, actually increasing your read performance over a single disk.
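The write penalty and the read benefit of mirroring described above come down to simple arithmetic. This Python sketch estimates the physical load on each half of a two-way mirror from logical read and write rates; the function name is hypothetical, and the round-robin read policy is an assumption taken from the footnote, so check how your volume manager actually schedules reads.

```python
def mirror_disk_load(reads_per_s, writes_per_s, round_robin_reads=True):
    """Estimate physical I/Os per second for a two-way mirror.

    Each logical write is issued to both halves. Reads are served by one
    half: split evenly when the volume manager round-robins them (an
    assumption; check your volume manager), otherwise all from the first
    half. Returns (first half, second half, total physical I/Os).
    """
    total = reads_per_s + 2 * writes_per_s
    if round_robin_reads:
        per_half = reads_per_s / 2 + writes_per_s
        return per_half, per_half, total
    return reads_per_s + writes_per_s, writes_per_s, total
```

With 100 reads and 40 writes per second, an unmirrored disk sees 140 I/Os per second. Mirrored with round-robin reads, each half sees 90, but the pair performs 180 in total; that extra write work is the cost that can add up across a system.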
Uptime Requirements
You were first asked to consider your uptime requirements in Chapter 2, "What are
the uptime requirements of the system?" To help answer this question, you can
consider the following:
• How much time do you have available for planned maintenance?
• How long can you afford to be offline during unplanned downtime?
There are two types of downtime: planned and unplanned. Planned downtime
includes hardware and software upgrades, whereas unplanned downtime includes
system crashes and emergency reboots. All computer systems have some amount of
downtime; the goal of a good server design is to minimize the impact this downtime has on your organization.
For some organizations, scheduled maintenance is not an issue; employees use the
systems heavily during the day, so taking the machines down after hours is a viable
solution. Other organizations, however, serve a worldwide audience and can afford
little scheduled maintenance because of time zone differences.
It is also not uncommon to have a mix of different requirements for different
systems at a single site. One thing that every organization has in common, though, is
the desire to minimize unplanned downtime as much as possible.
There is no reason to differentiate between the two types of downtime, other than to
help you reach a conclusion about your overall requirements. When you have
a good idea of the uptime required for this system, use TABLE 1-9 to determine
what your design should include to ensure that its RAS properties meet your
requirements.
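If your uptime requirement is stated as an availability percentage, it can help to convert it into a concrete downtime budget before choosing a category from TABLE 1-9. The Python sketch below does that arithmetic. The function names, the 365-day year, and the idea of subtracting planned maintenance from the total budget are illustrative assumptions, not anything specific to Sun Fire systems.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year, leap days ignored


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Total downtime allowed in the period for a given availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * period_minutes


def unplanned_budget_minutes(total_budget: float, planned_minutes: float) -> float:
    """What remains for unplanned outages after scheduled maintenance windows."""
    remaining = total_budget - planned_minutes
    if remaining < 0:
        raise ValueError("planned maintenance alone exceeds the downtime budget")
    return remaining


for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% availability allows about {budget:.0f} minutes of downtime per year")
```

For example, a 99.9 percent target leaves roughly 526 minutes a year; if four hours of that are reserved for planned upgrades, about 286 minutes remain for unplanned outages.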
Note – You should always purchase redundant System Controller (SC) boards for a system to ensure
availability in the event of an SC board failure. Without a functioning
System Controller board, none of the domains in a system will work.
Note – Even though you can use DR to replace failed components, a critical
component failure on a running system (such as a failed CPU) will still cause the
system to crash. If you cannot afford this type of downtime, you fall into the "Almost none"
category, and should use a clustering product to guard against system failures.
For most organizations, the "Little" downtime category is a good cost/benefit tradeoff.
You will have a system that is resilient to failures and, if properly configured,
relatively easy to service. You can use DR to add more CPU/Memory boards for
increased capacity, or to replace failed components.
Make a note of what category your system fits into, as well as the additional
components you will need. You are going to use this in the next chapter to design
your system. You will also use it later in the book during the discussion on
configuring the system to integrate with your site.
TABLE 1-9 RAS Design Decision Table

Allowable downtime   Your design should include...
Some                 Redundant fan trays
                     Redundant power supplies and transfer switches¹
Little               Redundant CPU/Memory boards
                     DR for CPU/Memory boards
                     Volume management software (such as Solaris™ Volume Manager (SVM)
                     or VERITAS Volume Manager (VxVM))
Very little          Redundant paths to I/O devices
                     Multipathing software for I/O (such as Multipath I/O (MPxIO) or
                     VERITAS Dynamic Multipathing (VxDMP))
                     Redundant network connections
                     Multipathing software for networks (such as Internet Protocol
                     multipathing (IPMP))
                     DR for I/O devices and networks
Almost none          Multiple instances of fully redundant systems
                     Clustering software (such as Sun™ Cluster 3.0)

1. Remember, redundant power helps only if your site is equipped to supply it.
Finally, some closing words on RAS. It is very important that you do not sacrifice
parts of your required configuration for additional RAS features. For example, do
not decide to buy less memory so that you can afford additional fan trays. You
should ensure that your base requirements are met first; otherwise, you will not benefit from
additional RAS, because your system will have fundamental shortcomings.

Disk Redundancy and RAID Basics
To ensure the integrity of the data, some type of disk redundancy should be used on
any system with important local data storage. The different schemes for achieving
such redundancy are often denoted by their RAID level. The term RAID comes from
Redundant Array of Inexpensive Disks, and levels numbered from 0 all the way up
through 53 denote different ways of laying out sets of disks.
For most applications, however, only three RAID levels are useful: 0, 1, and 5, along
with the 0+1 and 1+0 combinations of the first two. Each of these allows you to combine
multiple physical disks into a single logical volume. The operating system then sees this
volume just like a normal disk, and it can be mounted and used in the regular manner.
RAID 0
RAID 0, commonly called striping, provides no additional data safety. Instead, it is designed to increase the speed of file system access. With striping, data is interleaved
across the disks in a volume in fixed-size chunks; the size of each chunk is called the stripe unit size. This means that
when reading or writing data, multiple disks are accessed in parallel, decreasing the
amount of time it takes to access the data. Because no data is stored twice, however, the failure of any
single disk makes the whole volume unusable. Striping is very common on any system
that needs fast data access, such as database servers.
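The interleaving described above can be illustrated with a small address-mapping function. This is a minimal Python sketch, not how any particular volume manager is implemented; it assumes a simple layout in which stripe units are assigned to disks in rotation, and the function name and block-based units are hypothetical.

```python
def stripe_location(logical_block: int, num_disks: int,
                    stripe_unit_blocks: int) -> tuple[int, int]:
    """Map a logical block of a RAID 0 volume to (disk index, block on that disk).

    Assumes stripe units are laid out in rotation: unit 0 on disk 0,
    unit 1 on disk 1, and so on, wrapping back to disk 0.
    """
    if logical_block < 0 or num_disks < 1 or stripe_unit_blocks < 1:
        raise ValueError("invalid block number or volume geometry")
    unit = logical_block // stripe_unit_blocks        # which stripe unit overall
    offset_in_unit = logical_block % stripe_unit_blocks
    disk = unit % num_disks                           # units rotate across disks
    row = unit // num_disks                           # full stripes before this unit
    return disk, row * stripe_unit_blocks + offset_in_unit


# A 12-block read on a 3-disk volume with 4-block stripe units touches every disk,
# which is why striped reads and writes can proceed in parallel.
disks_touched = {stripe_location(b, num_disks=3, stripe_unit_blocks=4)[0] for b in range(12)}
print(sorted(disks_touched))
```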
RAID 1
RAID 1, also referred to as mirroring, is just the reverse. It provides full data
redundancy, but with some performance costs. In mirroring, twice as many
disks are used as the data alone would need. These disks are then arranged in
pairs, and identical data is stored on both disks of each pair. On a file system write, two physical
writes must be performed, one to each disk of the pair. The advantage is that you now
have two complete copies of your data.
This means you can lose up to half of your disks and still continue running without data
loss, as long as no two failed disks belong to the same pair. In a large volume, this is obviously an advantage.
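The write doubling and single-failure tolerance of a mirror can be seen in a toy model. The Python class below is a hypothetical in-memory illustration under simplifying assumptions (two halves, byte-addressed, reads served from the first healthy half); real volume managers also resynchronize a replaced disk and may alternate reads between halves.

```python
class ToyMirror:
    """In-memory RAID 1 model: two halves holding identical data."""

    def __init__(self, size: int) -> None:
        self.halves = [bytearray(size), bytearray(size)]
        self.failed = [False, False]
        self.physical_writes = 0  # counts writes that actually reach a half

    def write(self, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > len(self.halves[0]):
            raise ValueError("write falls outside the volume")
        healthy = [i for i in range(2) if not self.failed[i]]
        if not healthy:
            raise OSError("both halves of the mirror have failed")
        for i in healthy:
            self.halves[i][offset:offset + len(data)] = data
            self.physical_writes += 1  # two per logical write while both halves are healthy

    def read(self, offset: int, length: int) -> bytes:
        for i in range(2):
            if not self.failed[i]:
                return bytes(self.halves[i][offset:offset + length])
        raise OSError("both halves of the mirror have failed")

    def fail(self, index: int) -> None:
        self.failed[index] = True


mirror = ToyMirror(size=16)
mirror.write(0, b"data")
mirror.fail(0)                 # lose one half of the mirror
print(mirror.read(0, 4))       # still served from the surviving half
```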
RAID 0+1
RAID 0+1, usually called striping and mirroring, is a combination of these two
techniques. In a striped/mirrored volume, a set of disks is striped together to form
each half. Then, these two halves are mirrored to one another. It is possible to design
a striped/mirrored volume so that its performance is better than that of the individual
disks (due to striping), and so that fully half the disks can fail without impacting the
volume, provided the failures all fall in the same half (due to mirroring). This technique is widely used in production systems.
RAID 1+0
RAID 1+0 is very similar to RAID 0+1, except that the volume is assembled in the
reverse order. Here, pairs of disks are mirrored to one another, and then these
mirrored pairs are striped together. Volumes created in this manner are slightly more
complicated to manage, but are slightly more reliable because of the ways in which disks typically fail: in a typical RAID 0+1 implementation, a single disk failure takes its entire striped half out of service, whereas in RAID 1+0 it degrades only its own mirrored pair. Generally, vendors decide to implement either RAID 0+1 or
RAID 1+0, but not both, so the choice of which to use is often made for you.
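The reliability difference between the two layouts can be quantified with a simple model. The Python sketch below assumes that disk failures are independent and equally likely on every disk, and that a RAID 0+1 implementation takes an entire striped half out of service after one of its disks fails (behavior varies by product). Under those assumptions it computes the chance that a second failure takes the whole volume offline; the function name is hypothetical.

```python
from fractions import Fraction


def second_failure_loss_probability(total_disks: int, layout: str) -> Fraction:
    """Chance that a second, randomly placed disk failure takes the volume
    offline, given that one disk has already failed."""
    if total_disks < 4 or total_disks % 2:
        raise ValueError("need an even number of disks, at least 4")
    survivors = total_disks - 1
    half = total_disks // 2
    if layout == "0+1":
        # The first failure disables its whole striped half, so every disk
        # in the other half is now a single point of failure.
        return Fraction(half, survivors)
    if layout == "1+0":
        # Only the failed disk's mirror partner is now critical.
        return Fraction(1, survivors)
    raise ValueError("layout must be '0+1' or '1+0'")


for layout in ("0+1", "1+0"):
    print(layout, second_failure_loss_probability(8, layout))
```

With eight disks, a second failure brings down a RAID 0+1 volume with probability 4/7 under this model, but a RAID 1+0 volume with probability only 1/7.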
RAID 5
Finally, RAID 5 is one of the most economical forms of redundancy. In this scheme, a
portion of each disk in a volume is used to hold parity. On a write, data is distributed across all the disks in the volume except one, with the parity being
written to the remaining disk. This process is repeated in a "round robin" fashion, so
that the parity for successive stripes lands on different disks. In the event of a
single disk failure, the parity is used to recreate the data that was on the failed disk. This
allows you to lose a single disk (the most common type of failure) and continue
running without interruption. RAID 5 is somewhat slow on writes, though, since it must
perform all those additional writes for the parity.
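RAID 5 parity is conventionally computed with a bitwise exclusive OR (XOR), which is what makes reconstruction possible: XORing the surviving blocks of a stripe reproduces the missing one. The Python sketch below shows that arithmetic, a simple rotating-parity placement, and the parity update a small write requires. The function names and the particular rotation are assumptions for illustration; actual on-disk layouts differ between implementations.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks (the parity of a stripe)."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks in a stripe must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk that holds parity for a stripe; rotates so no single disk holds it all."""
    return (num_disks - 1 - stripe) % num_disks


def rebuild_block(surviving_blocks: list[bytes]) -> bytes:
    """Recreate a failed disk's block from the other data and parity blocks in its stripe."""
    return xor_blocks(surviving_blocks)


def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Parity after overwriting one block. It needs the old data and old parity,
    which is the extra I/O that slows RAID 5 writes."""
    return xor_blocks([old_parity, old_data, new_data])


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # data blocks of one stripe, 4-disk volume
parity = xor_blocks(data)
restored = rebuild_block([data[0], data[2], parity])  # the disk holding data[1] was lost
print(restored == data[1])
```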
While RAID 5 is not as reliable as RAID 0+1 (striping and mirroring), it can still be a good solution, especially for NFS servers. Although a RAID 5 volume can survive the loss of only one disk, it is
uncommon to lose several disks at once, such as a whole enclosure, barring human error or a power failure, both
of which will probably affect much more than your disks. To make use of RAID 5,
you should consider only those enclosures that support hardware RAID, since
software RAID 5 is too slow for many applications.
Once you have selected what type of RAID you wish to use for each of your
different volumes, you should adjust your storage purchase accordingly. For
example, if you want to mirror a set of data, you must purchase double the amount
of disk you calculated above. Make sure to increase the number of controller
cards as well.
With RAID 5, check the enclosure you are considering purchasing to verify that it
supports hardware RAID.
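The purchase adjustment can be estimated per volume. The Python sketch below is a rough rule-of-thumb calculator; it assumes all disks are the same size and ignores hot spares, volume-manager metadata, and formatting overhead. The function name and the 36 GB disk size in the example are purely illustrative.

```python
import math


def disks_needed(usable_gb: float, disk_gb: float, raid_level: str) -> int:
    """Rough count of same-size disks needed for a usable capacity.

    Ignores hot spares, metadata, and formatting overhead (assumptions).
    """
    if usable_gb <= 0 or disk_gb <= 0:
        raise ValueError("capacities must be positive")
    data_disks = math.ceil(usable_gb / disk_gb)
    if raid_level == "0":
        return data_disks                  # striping adds speed, no redundancy
    if raid_level in ("1", "0+1", "1+0"):
        return 2 * data_disks              # every data disk needs a mirror partner
    if raid_level == "5":
        return max(data_disks + 1, 3)      # one disk's worth of parity; 3-disk minimum
    raise ValueError(f"unsupported RAID level: {raid_level}")


for level in ("0", "0+1", "5"):
    print(level, disks_needed(300, 36, level))
```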
A Logical Design Specification
By now you should have all of the information you need to create a logical
design specification:
• Design rules of thumb
• Your existing system analysis (if applicable)
• Your RAS design requirements
As you did in the Statement of Requirements Worksheet in Chapter 2, formalize this
information into a design for your logical system. This design should give an accurate
picture of your needs, so that you can use it in the next chapter to choose the appropriate
physical system. List your specifications in TABLE 1-10.
TABLE 1-10 Logical Design Specification Worksheet
Item                 Description