    The self-managing system

Contents

Introduction
    The role of clustering
    NonStop systems
A fundamental problem
Background information
Our approach
    Self-configuration
    Self-optimization
    Self-diagnosis
    Self-healing
    Self-protection
Recent advances
    Lessons learned
    Looking ahead
Conclusion
For more information


Abstract: According to Gartner, 40 percent of all system downtime is caused by operator error. Imagine a computer system that has virtually zero planned or unplanned downtime and can expand its capabilities dynamically in response to an increasing workload. A system that ensures no single failure will cause denials of service or data corruption. This may sound impossible, but in fact these benefits are derived from an industrial-strength implementation of one fundamental concept across the computer industry: self-management.

This paper describes a variety of self-management technologies that have been implemented on HP's NonStop system. These functions help lower the total cost of ownership of the NonStop server while continuing to improve user application availability.

Learn about a new approach to self-management that encompasses five distinct areas:

Self-configuration

Self-optimization

Self-diagnosis

Self-healing

Self-protection


    Introduction

Imagine a computer system that has virtually zero planned or unplanned downtime: a system that can run for decades without failing once. That's not to say that the individual components in such a system don't fail. Rather, it means that the system (and the applications) continues to operate reliably, with no loss of data, through the steps of component failure, problem diagnosis, and component replacement.

Further imagine a system that can expand its capabilities dynamically in response to an increasing workload. As more processors are connected to the system, the workload is distributed to these additional processors, transparently and automatically, without requiring programming or even configuration changes. And as processors are removed, the load managers immediately restrict the workload to the available processors, again transparently and automatically, without any manual intervention.

And while all this is going on, the system sees to it that no single failure will cause requests for service to be denied, and no existing data will become corrupted.

Sound impossible? Actually, all of these benefits and others are derived from an industrial-strength implementation of one fundamental concept that is sweeping the computer industry: clustering.

The role of clustering

In simple terms, clustering is about connecting a group of computers together so that they can share the workload (scalability) and back each other up to hide failures from users (fault tolerance). In reality, it's fairly straightforward to connect computers together. Where the process gets difficult is in meeting competing goals, such as:

    Maintaining the ability to scale to hundreds or thousands of CPUs

    Verifying that the performance of each individual subsystem remains acceptable

    Promoting the integrity of data across the system

Maintaining the manageability of the overall system, so that the system detects changes and handles them automatically (self-healing), and replicates simple management tasks across different subsystems

From a high-level perspective, in order to provide the capabilities listed above, there are several things that clusters need to do extremely well:

Rock-solid messaging, to enable information to be propagated between application instances without fail

    A message-based operating system

Heartbeat mechanisms, to enable various parts of the system to tell other parts of the system about their operational state. The absence of these heartbeats implies that the originator has encountered a problem that needs to be corrected (a minimal sketch of this idea follows this list)

Application containers, which facilitate the existence of multiple application instances across the cluster, distribute work to the various instances, provide various services to the instances (for example, atomicity, consistency, isolation, and durability, or ACID, transaction services) to enable the integrity of the database, detect when the workload has expanded sufficiently to warrant the creation of additional instances, manage the life cycle of all application instances, and so on

A flexible, cluster-aware database
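The heartbeat idea in the list above can be illustrated with a minimal sketch in Python. The names and intervals here are hypothetical, not the NonStop message-system interface: each processor periodically reports that it is alive, and a monitor treats a run of missed reports as a sign that the originator has failed and corrective action is needed.

    # Minimal heartbeat-monitor sketch (illustrative only; names and intervals
    # are assumptions, not the NonStop implementation).
    import time

    HEARTBEAT_INTERVAL = 1.0   # seconds between "I'm alive" messages
    MISS_LIMIT = 3             # missed intervals before a peer is presumed down

    class HeartbeatMonitor:
        def __init__(self):
            self.last_seen = {}    # processor id -> time of last heartbeat

        def record_heartbeat(self, cpu_id):
            self.last_seen[cpu_id] = time.monotonic()

        def check_peers(self):
            """Return the processors whose heartbeats have stopped arriving."""
            deadline = MISS_LIMIT * HEARTBEAT_INTERVAL
            now = time.monotonic()
            return [cpu for cpu, seen in self.last_seen.items()
                    if now - seen > deadline]

A supervisor would call record_heartbeat() for every heartbeat message it receives and poll check_peers() on a timer; any processor returned is presumed failed, and its workload is taken over by the survivors.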


    NonStop systems

There are as many different clustering implementations as there are computer and software vendors. Only one stands out in terms of quality of implementation regarding functionality, performance, scalability, fault tolerance, manageability, and data integrity: the HP Integrity NonStop and NonStop server platforms (subsequently referred to as NonStop systems or NonStop servers).

Figure 1 illustrates the way in which hardware and software fault tolerance and linear scalability are designed into NonStop systems.

This paper describes a variety of self-management technologies that have been implemented on the NonStop system. Some of the functionality was designed into the system from the beginning, and much has been added over the years. This combination of self-management functions helps lower the total cost of ownership (TCO) of the NonStop server while continuing to improve user application availability. In this paper, you will find examples of specific self-management techniques that we've implemented, lessons that we've learned along the way, and discussions about future opportunities to improve system self-management.

Figure 1. Both hardware and software fault tolerance and linear scalability are designed into the NonStop system together.


    A fundamental problem

As systems become more interconnected and diverse, architects are less able to anticipate and design interactions among components, leaving such issues to be dealt with at runtime. Soon systems will become too massive and complex for even the most skilled system integrators to install, configure, optimize, maintain, and merge. And there will be no way to make timely, decisive responses to the rapid stream of changing and conflicting demands.

    IBM manifesto (2001)

Today, the effect of system complexity is most easily measured in the TCO of a system, especially when measuring all the costs associated with purchasing and operating a customer environment, including the cost of system downtime. System downtime has a huge effect on TCO, and much of that cost is directly associated with operator errors. According to Gartner, for example, 40 percent of all system downtime is caused by operator error. The amount of time it takes for an operator to perform a task correctly also affects TCO: the simpler the system is to operate, the fewer tasks are required and the less time is needed to perform them.

As noted earlier, complexity is directly related to system downtime and the cost of the overall IT environment. Given that there are no signs of the IT environment becoming simpler, we have designed the system to hide as much of the complexity as possible by automating as many operational tasks as possible. Devices and systems manage themselves in order to reduce operator time and operator errors.

    Background information

IBM's gloomy manifesto about the IT industry warned that software complexity was causing a looming crisis. Specifically, IBM said, the IT industry will collapse under its own weight if it continues to rely on applications and environments, often with millions of lines of code, that require skilled professionals to get them running and keep them running.

Managing a system is a difficult business, and IBM predicted that even professionals will soon be unable to keep up with system complexity. Interconnectivity, integration, making different systems work together as one, and sheer scale all introduce new levels of complexity. In addition, extending systems beyond company boundaries to the Internet introduces even more complexity. The only option remaining, IBM suggested, is autonomic computing: systems that, given high-level objectives from administrators, can manage themselves.

Autonomic computing suggests hierarchies, autonomy, interactivity, and cascading levels of smaller and smaller systems, each of which can govern itself: in other words, self-management. The idea of self-management is to free system administrators from the details of system operation and maintenance and to provide users with a machine that runs at peak performance 24 x 7. Autonomic systems are expected to maintain and adjust their operation in the face of changing components, workloads, demands, and external conditions, as well as hardware or software failures, which may be innocent or malicious.

    Figure 2 shows some of the many forms of self-management built into NonStop systems.


Figure 2. Some of the many forms of self-management within NonStop systems.

    Our approach

HP's approach to self-management encompasses five distinct areas:

Self-configuration: Automated configuration of components and systems that follow high-level policies. The rest of the system adjusts automatically and seamlessly.

Self-optimization: Components and systems continually seek to improve their own performance and efficiency.

Self-diagnosis: The system can detect and diagnose its own problems.

Self-healing: The system automatically repairs localized software and hardware problems, sometimes also reintegrating the repaired resource back into itself.

Self-protection: The system automatically defends against malicious attacks and cascading failures. It anticipates and prevents system-wide failures.

The original design goal of the NonStop server was to create a system that could survive single faults while hiding hardware and software errors from the application to the greatest possible extent. To achieve this, three basic interdependent techniques were, and continue to be, developed:

Clustering of relatively autonomous processors: A NonStop system consists of two to 16 processors, configured in a shared-nothing cluster that, in turn, can be aggregated into a two- to 255-way group of clusters.

System self-management: The system is capable of automated configuration changes, optimization, diagnosis, repair, and protection. Such capabilities are natural for a system that is designed from the ground up to be tolerant of single faults.


Resource virtualization: From an application perspective, every resource in the NonStop cluster is virtualized; no aspects of software or hardware redundancy are made visible to the application.

Virtualizing resources enables us to provide transparent system self-management that is hidden from the application.

In other words, to deliver the highest possible system availability, a clustered solution is needed. Furthermore, to deliver such a solution at the lowest possible TCO, self-management techniques are needed. And to deliver transparent system self-management, the cluster's resources need to be virtualized from an application perspective, thereby allowing automatic changes to the computing environment without forcing the application to implement its own self-management techniques, too.

We have developed many self-management technologies over more than 25 years and continue to make advances in all aspects of this self-management approach: self-configuration, self-optimization, self-diagnosis, self-healing, and self-protection.

Figure 3 shows one aspect of the self-healing capability of NonStop systems.

    Self-configuration

Automated system reconfiguration on expansion or reduction (system resizing): Processors and enclosures can be added to and removed from the system online, with the system adjusting its configuration automatically. Also, switches and clusters can be added to the group of clusters.

Automated configuration of controllers and disk drives: A controller or disk drive added to the system is automatically configured and started. In the case of a host-based mirror (NonStop systems use host-based mirroring for disks), the online data copy is started automatically when a mirrored drive is configured.

Figure 3. Backed-up data is restored automatically upon corruption.

In figure 3, step 1, scanning software detects bad data. In step 2, auto-repair deletes the bad data. In step 3, good data is copied to the location of the bad data. In step 4, the data is back in sync and in a good state. The application does not need to do anything for this process to occur.
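The auto-repair flow of figure 3 can be sketched as follows. This is a simplified illustration, not the Data Access Manager interface: the checksum scheme, block layout, and read_block/write_block calls are assumptions. A scrubber reads each block from both halves of a host-based mirror, identifies the corrupted copy by its checksum, and rewrites it from the good copy, with no involvement from the application.

    # Illustrative mirrored-disk scrubbing and auto-repair (hypothetical API).
    import zlib

    def block_is_valid(block):
        """Assume each block carries a CRC32 of its payload in a 4-byte trailer."""
        payload, stored = block[:-4], int.from_bytes(block[-4:], "big")
        return zlib.crc32(payload) == stored

    def scrub(primary, mirror, num_blocks):
        """Scan every block; repair whichever copy fails its checksum."""
        for blk in range(num_blocks):
            p, m = primary.read_block(blk), mirror.read_block(blk)
            if block_is_valid(p) and not block_is_valid(m):
                mirror.write_block(blk, p)     # steps 2-3: replace bad data with good
            elif block_is_valid(m) and not block_is_valid(p):
                primary.write_block(blk, m)
            # step 4: both copies are in sync; nothing to do if both are valid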


Figure 4. There is no overhead for synchronizing cache data.

    Self-optimization

Mixed workload environment: The NonStop system allows the user to establish workload priorities and automatically responds to priority contention to help ensure that low-priority workloads, such as decision support, do not impact higher-priority transaction response time. This design means that lower-priority workloads can utilize free resources without impacting response time.

Automated workload distribution: The application environment (the middleware) automatically and dynamically distributes work to server processes, depending on workload and resource availability (a sketch of this kind of dispatching follows below).

Data on disk is often distributed, and with the Data Access Manager shown in figure 4, the system automatically self-optimizes, avoiding the overhead of cache synchronization.
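The automated workload distribution described above can be approximated with a small dispatcher sketch. This is an illustration of the general technique, not the NonStop middleware: requests go to the least-loaded server process, and a new instance is created when every process in the pool is at capacity.

    # Hypothetical work dispatcher: route each request to the least-loaded
    # server process, growing the pool when all instances are busy.
    class ServerProcess:
        def __init__(self, name, capacity=10):
            self.name, self.capacity, self.active = name, capacity, 0

    class Dispatcher:
        def __init__(self, spawn):
            self.spawn = spawn            # callback that creates a new instance
            self.pool = [spawn()]

        def dispatch(self, request):
            target = min(self.pool, key=lambda s: s.active)
            if target.active >= target.capacity:
                target = self.spawn()     # workload has outgrown the pool
                self.pool.append(target)
            target.active += 1
            return target.name            # the instance that handles this request

Removing a processor would simply shrink the pool; the same dispatch loop then confines new work to the remaining instances.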

    Self-diagnosis

Detection of latent failures: Alternate paths to devices and processors are either used in a ping-pong fashion (for example, the system switches between available paths on a predetermined time interval; a sketch follows this list) or checked periodically. Data in memory and on disks is checked periodically (this is known as data scrubbing).

Incident analysis with automated data collection: Based on a highly structured common-event system, incident analysis software is able to automatically diagnose 91 percent of all hardware failures with a 94.3 percent level of accuracy. Data needed for problem analysis is collected automatically and sent to the service organization.
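Ping-pong use of alternate paths, mentioned in the first item above, can be sketched as follows. The path objects and the probe call are hypothetical; the point is that routine alternation exposes a latent fault in either path long before a failover depends on it.

    # Alternate between redundant I/O paths on a timer so that a latent
    # failure in either path is detected early (illustrative only).
    class PathPair:
        def __init__(self, path_a, path_b):
            self.paths = [path_a, path_b]
            self.current = 0

        def switch(self):
            """Called on a fixed interval: move I/O to the other path."""
            self.current ^= 1
            healthy = self.paths[self.current].probe()   # e.g. a test I/O
            if not healthy:
                self.current ^= 1   # stay on the known-good path, report the fault
            return healthy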

    Self-healing

Process pairs and per-processor processes: Based on resource virtualization, many of the system services are implemented as process pairs (two collaborating processes running in different processors that checkpoint both state and state data) or as per-processor processes. If one process or its processor fails, the request is automatically rerouted to the remaining process, thereby fully hiding the processor failure from the application (see figure 5 and the sketch at the end of this list).

Automated data repair: If the disk data-scrubbing software detects a data error, the incorrect data is repaired from the second data disk (in the case of host-based mirroring; see figure 3).


    Figure 5. Process pairs take over automatically and retain access to data.

Automated reinstatement of repaired hardware with sanity checks: Repaired hardware inserted into the system is detected automatically. For some types of hardware, a sanity check is performed (for example, by sending test packets) before the hardware is fully reinstated.
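The process-pair takeover shown in figure 5 can be reduced to a minimal sketch. This is illustrative Python, not the NonStop checkpointing facility: the primary checkpoints its state to the backup after each request, so the backup can continue from the last checkpoint if the primary's processor fails.

    # Illustrative process pair: the primary checkpoints state to its backup,
    # which takes over transparently if the primary's processor fails.
    class ProcessPair:
        def __init__(self):
            self.primary_state = {}
            self.backup_state = {}
            self.primary_alive = True

        def handle_request(self, key, value):
            if self.primary_alive:
                self.primary_state[key] = value
                self.backup_state = dict(self.primary_state)   # checkpoint
            else:
                self.backup_state[key] = value   # backup has already taken over
            return "ok"

        def primary_failed(self):
            # Requests are rerouted to the backup; the application simply sees
            # its next request answered as usual.
            self.primary_alive = False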

    Self-protection

End-to-end data checksums: All messages in the system are checked from beginning to end (see figure 6). All system data buffers are protected with buffer tags.

Fail-fast technology: If a checksum error is encountered, the action is retried. If an overwritten buffer tag is detected when an operating system kernel buffer is deallocated, the processor is halted to maintain data integrity (a sketch of this idea follows below).
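The buffer-tag half of this fail-fast behavior can be sketched as follows. Python stands in for kernel code, and the tag layout is an assumption: every buffer carries a known tag, and a tag found overwritten at deallocation is treated as evidence of corruption, so the sketch halts rather than risk propagating bad data.

    # Fail-fast buffer tags (illustrative): an overwritten tag at deallocation
    # means something scribbled past its buffer, so stop immediately.
    BUFFER_TAG = 0xDEADBEEF

    class KernelBuffer:
        def __init__(self, size):
            self.tag = BUFFER_TAG
            self.data = bytearray(size)

    def deallocate(buf):
        if buf.tag != BUFFER_TAG:
            # The real system halts the processor; raising here models the same
            # choice of data integrity over the availability of one CPU.
            raise SystemExit("buffer tag overwritten: halting to preserve integrity")
        buf.data = None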

    Recent advances

    Recent self-management improvements include

A new file-type-dependent checksum technology for data stored on industry-standard 512-byte-per-sector disks (see figure 6).

The NonStop Advanced Architecture, which introduces the next-generation self-checking processor technology implementing an optional processor triplex. (In this system, each logical processor can consist of one to three processor elements.)

Disk path probing, that is, periodic checking of alternate paths to a disk drive. This technology, combined with other implementations of latent-failure technology, allows us to always know that a component can be replaced or upgraded safely.


Figure 6. As data flows through the system, checksums are persistently validated, providing end-to-end data integrity.

Delta-based mirrored-disk copy, which allows us to, for example, gracefully handle the failure of an enterprise storage box that hosts a whole set of backup logical unit numbers (LUNs): The data-access manager keeps track of changes that occur to the remaining LUNs and can therefore copy only the delta once the failed enterprise storage box is restored. Without this technology, it could take days before the storage subsystem was restored to full fault tolerance with all of the backup LUNs brought back up to date. With this technology, the process takes just minutes (a rough sketch of the idea follows this list).

Enhanced background quality scans of data stored on disks allow us to detect and repair latent failures.
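The delta-based revive described in the list above can be approximated with a dirty-extent map. The extent size and the read/write interface are assumptions: while one side of the mirror is unavailable, writes mark the extents they touch, and the revive copies only those extents instead of the whole volume.

    # Sketch of delta-based mirrored-disk revive: track which extents changed
    # while the backup LUN was down and copy only those when it returns.
    EXTENT_SIZE = 1 << 20          # 1 MiB extents (assumed granularity)

    class DeltaTracker:
        def __init__(self):
            self.dirty = set()

        def record_write(self, offset, length):
            first = offset // EXTENT_SIZE
            last = (offset + length - 1) // EXTENT_SIZE
            self.dirty.update(range(first, last + 1))

        def revive(self, primary, backup):
            """Copy only the extents written since the backup went away."""
            for extent in sorted(self.dirty):
                data = primary.read(extent * EXTENT_SIZE, EXTENT_SIZE)
                backup.write(extent * EXTENT_SIZE, data)
            self.dirty.clear()     # the mirror is back to full fault tolerance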

    Lessons learned

Self-management is not easy, and it takes time to identify which tasks can be automated. Furthermore, our experience shows that not all tasks should be automated, and that there are different ways to provide system self-management. System self-management is a matter of continuous improvement, but if it is done correctly, much of the complexity of system management can be removed, as is illustrated by the following examples.

The dependency chain for a disk drive (HP ServerNet adapter, disk controller, disk paths, and disk drives, as well as decisions such as which processors to place the data-access manager in, and so on) is quite complex. Therefore, we decided to create a configuration manager that configures adapters and controllers automatically, that knows the rules for the dependencies, and that can check them when the disk configuration is created or changed (all configuration changes may be done online). Because of this, disk-configuration errors are very rare to nonexistent.
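The kind of rule checking such a configuration manager performs can be sketched as a validation pass over the dependency chain. The names and rules below are hypothetical and far simpler than the real ones: every disk needs a known controller, every controller needs a ServerNet adapter, and a mirrored pair should not share a single point of failure.

    # Hypothetical dependency checks for a disk configuration.
    def validate_disk_config(disk, controllers, adapters):
        errors = []
        ctrl = controllers.get(disk["controller"])
        if ctrl is None:
            errors.append(f"{disk['name']}: unknown controller {disk['controller']}")
        elif ctrl["adapter"] not in adapters:
            errors.append(f"{disk['name']}: controller has no ServerNet adapter")
        mirror = disk.get("mirror")
        if mirror and mirror["controller"] == disk["controller"]:
            errors.append(f"{disk['name']}: mirror shares a controller with its primary")
        return errors

Running such checks whenever the configuration is created or changed is what keeps configuration errors rare.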

    The following points describe some of the ongoing work that we are doing to improve our self-managing capability, including some areas we are still learning about:

Wide area network (WAN) subnetworks: In comparison, we initially did not implement the same level of self-configuration for the WAN subsystem, which forced us to rely on our support resources to sort out configuration errors. Due to the many configuration problems in the WAN subsystem, we were eventually forced to add some self-configuring capabilities to it. In addition, we created a WAN configuration wizard. Once implemented, these features helped alleviate the situation, reducing configuration-related problems significantly.


Automation not needed: When a disk is inserted, it can be automatically configured (using predefined templates), labeled, and started, and, if applicable, an online disk copy can be launched. This has turned out to be interesting in demonstrations, but we have no evidence that the feature is used by customers. Furthermore, it has proved impossible to carry the feature forward when moving to disks that are external to the system (JBOD and enterprise storage).

Self-diagnosis: The self-diagnosing software can detect hardware failures (using events), but currently cannot always call out the exact failing component in the case of storage. Our objective is to improve our ability to pinpoint the failing component, whether it is inside a NonStop system enclosure or in an external subsystem such as a storage array.

    Looking ahead

As mentioned earlier, system self-management is a technology that must be improved continually. As can be seen in figure 7, where a single application or multiple applications may be distributed across processors and across nodes, managing such complex distributed applications calls for continuous improvement in the techniques described above. Furthermore, the architecture of the NonStop server continues to evolve, which is especially obvious where specialized devices with their own management architectures are used for core system functions, such as data storage and networking. The system itself is both viewed and implemented as a heterogeneous architecture. Thus, we are working on ways to continue to provide self-management technology without owning all components in the system.

    Today, the system already has to handle the management of four different operating environments,which may increase in the future. Examples include

    Enterprise storage server

    Database and transaction server

    System management console

    NonStop operating system and POSIX personalities

    HP and third-party developed value-added middleware

In such heterogeneous system designs, a large problem that needs to be solved is how to aggregate information from a number of different components to pinpoint the source of a problem at the same level as we are capable of doing today (or at even higher levels). Achieving this goal will require increased levels of cooperation with other divisions of HP to share information and technologies in the interest of a common goal for both HP and our customers: simplified system operation using a combination of self-management and adaptive-management technologies.


    For more information

    www.hp.com/go/nonstop

© 2005 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

    Linux is a U.S. registered trademark of Linus Torvalds.

    06/2005
