Autonomous Recovery in Componentized Internet Applications

Candea et al.

Vikram Negi

Introduction

• Autonomic Problem

• Approach

• Results

• Discussion

The Autonomic Problem

• Allow the application to recover automatically from transient and intermittent software failures.

The Approach

• Introduce three ideas:

– Macroanalysis (fault detection)

– Microrebooting (rapid recovery)

– External management (recovery action)

• Integrate and test with JBoss

Design Overview

• Autonomous process

– Monitoring

• Java probes

– Fault detection

• Generates anomaly report

– Recovery

• Takes action

• Total time to recovery

J2EE Review

• J2EE enterprise apps = collection of reusable Java modules

• JSPs / servlets invoke EJBs, which invoke other EJBs, ...

• EJB = Java component that conforms to a specific interface and provides a service

• Deployment descriptor (per-bean XML file) conveys run-time characteristics and dependencies; used in deploying the application

JBoss Design

• Open-source J2EE app server

• Written entirely in Java

• Microkernel with components held together by JMX (management support)

JAGR = ROC-ified JBoss with Application-Generic Recovery

• 3 Tier Architecture

• Key Components

– Macroanalysis Engine

– Microrebooting Hook

– Recovery Manager

Pinpoint : Detection and Localization

• Stored observations

– IP address of machine, timestamp

– Globally unique request ID

– Number of calls/returns to EJBs

– Association between sender and receiver

– SQL queries collected (updates, reads)
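
One way to picture the observations listed above is as a flat record per component call or return, keyed by request ID. The sketch below is illustrative only: the `Observation` class, its field names, the `call_counts` helper, and the component names are assumptions made for these notes, not Pinpoint's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One logged event; field names are hypothetical, not Pinpoint's schema."""
    host_ip: str      # IP address of the machine that logged the event
    timestamp: float  # time the event was observed
    request_id: str   # globally unique ID of the end-user request
    sender: str       # calling component
    receiver: str     # called component (an EJB, or the database)
    kind: str         # "call" or "return"


def call_counts(observations, request_id):
    """Count calls per (sender, receiver) pair within a single request."""
    return Counter(
        (o.sender, o.receiver)
        for o in observations
        if o.request_id == request_id and o.kind == "call"
    )


log = [
    Observation("10.0.0.1", 1.00, "req-1", "BrowseServlet", "ItemBean", "call"),
    Observation("10.0.0.1", 1.02, "req-1", "ItemBean", "BrowseServlet", "return"),
    Observation("10.0.0.1", 1.03, "req-1", "BrowseServlet", "ItemBean", "call"),
    Observation("10.0.0.2", 1.04, "req-2", "BidServlet", "BidBean", "call"),
]
print(call_counts(log, "req-1"))
```

Grouping by request ID is what lets per-request call paths and per-component interaction counts be rebuilt later by the analysis engine.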

Pinpoint : Analysis

• Analysis engine

– Centralized engine

– Plugin-based architecture

• Modeling components

– Assume that present component behavior and historical (normal) behavior have the same probability distribution

– Chi-square test determines whether the two distributions differ
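
The chi-square comparison can be sketched as a goodness-of-fit test of a component's current call counts against its historical proportions. This is a minimal sketch under stated assumptions, not Pinpoint's implementation: the function names, the component names, the counts, and the choice to normalize by the critical value (so that scores above 1.0 are flagged) are all illustrative.

```python
def chi_square_statistic(observed, historical):
    """Pearson chi-square statistic of observed counts against historical proportions.

    Both arguments map a peer component name to a call count. Peers that never
    appear in the historical data are skipped, since their expected count is 0.
    """
    total_observed = sum(observed.values())
    total_historical = sum(historical.values())
    statistic = 0.0
    for peer, past_count in historical.items():
        expected = total_observed * past_count / total_historical
        if expected > 0:
            statistic += (observed.get(peer, 0) - expected) ** 2 / expected
    return statistic


def anomaly_score(observed, historical, critical_value):
    """Statistic divided by the critical value; above 1.0 rejects 'same distribution'."""
    return chi_square_statistic(observed, historical) / critical_value


# Critical value of the chi-square distribution for 2 degrees of freedom, alpha = 0.05.
CRITICAL_DF2 = 5.991

# Hypothetical call counts from one component to three peers.
historical = {"ItemBean": 600, "UserBean": 300, "BidBean": 100}
healthy = {"ItemBean": 62, "UserBean": 28, "BidBean": 10}
faulty = {"ItemBean": 90, "UserBean": 5, "BidBean": 5}

print(anomaly_score(healthy, historical, CRITICAL_DF2))
print(anomaly_score(faulty, historical, CRITICAL_DF2))
```

With three peers there are 2 degrees of freedom; the healthy sample stays well below 1.0, while the skewed sample scores far above it.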

Recovery : micro-reboot is not expensive

• State segregation

– Store important state outside the application, in a database

– Persistent state

• CMP (container-managed persistence, J2EE) is a requirement for the prototype

– Session state

• Store in a modified SSM (external session state store)

• Containment and reintegration

– Microreboot the transitive closure of all inter-EJB references

– XML deployment descriptors determine the grouping for the closure

– Complete reboot or microreboot
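
The transitive-closure step can be sketched as a graph traversal over inter-EJB references. The sketch assumes the references have already been extracted from the deployment descriptors into a plain dictionary, and it treats a reference as binding both ends into the same recovery group; both points, along with the bean names, are assumptions for illustration rather than details from the slides.

```python
from collections import deque


def recovery_group(references, failed):
    """Return every bean that must be microrebooted together with `failed`.

    `references` maps a bean name to the beans it references. Links are
    followed in both directions, so the result is the connected group.
    """
    neighbors = {}
    for source, targets in references.items():
        neighbors.setdefault(source, set())
        for target in targets:
            neighbors[source].add(target)
            neighbors.setdefault(target, set()).add(source)

    group = {failed}
    queue = deque([failed])
    while queue:
        bean = queue.popleft()
        for other in neighbors.get(bean, ()):
            if other not in group:
                group.add(other)
                queue.append(other)
    return group


# Hypothetical reference graph.
refs = {
    "CommitBid": ["Bid", "Item"],
    "Bid": ["User"],
    "BrowseCategories": ["Category"],
}
print(sorted(recovery_group(refs, "Item")))
```

A bean with no references forms a group of its own, which is the cheapest case for a microreboot.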

Recovery

• Enabling microreboot

– Method in the JBoss EJB container

– Preserve the class loader

Manage Recovery

• Recovery Policy

– Read the failure report; consider components scoring > 1.0

– Microreboot the top n components, or all scoring > 1.0

– Allow a delay (~30 sec)

– If the error is still present, retry a few times or reboot completely

– Finally, report it to the system administrator
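
The policy steps above can be sketched as a small escalation loop. This is a hedged illustration, not the JAGR recovery manager: the `recover` function, its callback parameters, the default of three attempts, and the component names are all hypothetical, and `settle` stands in for the ~30 sec delay.

```python
def recover(report, microreboot, full_reboot, still_failing, notify_admin,
            top_n=None, max_attempts=3, settle=lambda: None):
    """Escalate from microreboots to a full reboot to a human operator."""
    # Keep only components whose score exceeds 1.0, highest score first.
    suspects = sorted(
        (name for name, score in report.items() if score > 1.0),
        key=lambda name: report[name],
        reverse=True,
    )
    if top_n is not None:
        suspects = suspects[:top_n]

    for _ in range(max_attempts):
        for name in suspects:
            microreboot(name)
        settle()  # give the system time to stabilize before re-checking
        if not still_failing():
            return "microreboot"

    full_reboot()
    settle()
    if not still_failing():
        return "full reboot"

    notify_admin()
    return "escalated"


actions = []
outcomes = iter([True, False])  # first check still fails, second succeeds
result = recover(
    {"CommitBid": 2.5, "ViewItem": 0.4, "BrowseCategories": 1.2},
    microreboot=lambda name: actions.append("microreboot " + name),
    full_reboot=lambda: actions.append("full reboot"),
    still_failing=lambda: next(outcomes),
    notify_admin=lambda: actions.append("notify admin"),
    top_n=1,
)
print(result, actions)
```

Passing the actions in as callbacks keeps the policy separate from the mechanism, in the spirit of the external-management idea.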

Evaluation Test Framework

• Applications

– Petstore 1.1 (12 components, 233 Java files, 11K LOC)

– Petstore 1.3.1 (47 components, 310 Java files, 10K LOC)

– RUBiS (21 components, 500 Java files, 25K LOC)

• Workload

– Client simulators implemented with a transition table

– 350 clients (max utilization principle)

• Faultload

– Based on industry experience

– No low-level hardware or OS faults
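
A transition-table client simulator, as mentioned under Workload, can be sketched as a random walk over page states. The table below is invented for illustration; the state names and probabilities are assumptions, not the values used in the evaluation.

```python
import random


def simulate_session(transitions, start, steps, rng):
    """Walk the transition table for `steps` page visits, starting at `start`."""
    state = start
    visited = [state]
    for _ in range(steps - 1):
        next_states, weights = zip(*transitions[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        visited.append(state)
    return visited


# Hypothetical transition table: state -> list of (next state, probability).
TRANSITIONS = {
    "Home":     [("Browse", 0.7), ("Search", 0.3)],
    "Browse":   [("ViewItem", 0.8), ("Home", 0.2)],
    "Search":   [("ViewItem", 0.6), ("Home", 0.4)],
    "ViewItem": [("PlaceBid", 0.3), ("Browse", 0.5), ("Home", 0.2)],
    "PlaceBid": [("Home", 1.0)],
}

session = simulate_session(TRANSITIONS, "Home", 10, random.Random(42))
print(session)
```

Running many such sessions concurrently, one per simulated client, would give a load shaped like the 350-client setup.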

Evaluation Detection

• Results similar to other detectors

• No discussion of absolute numbers?

• Injected faults: forced Java runtime/declared exceptions, call omission, and source code bugs

• Two measures: (1) how well the fault was detected, (2) how well a major outage was detected

Evaluation : Localization

• Localization % for each algorithm, per fault type

• CIA > 85%

• No absolute data again?

Evaluation : Recovery

• Introduce faults in SSM-RUBiS.

• Restart SSM-RUBiS or micro reboot component.

• Observations from 10 trials, each with 350 concurrent clients.

Full v/s Micro reboot

• Injected a null reference fault in SB CommitBid, then a corrupt User-Item, SB BrowseCategories and SB CommitUserFeedback.

• Microreboot maintains steady response.

• 425 vs. 3916 failed requests

• 61527 vs. 56028 successful requests

• What error conditions did the other trials have?

Total Recovery Time

• Corrupt SB_ViewItem by setting it to NULL

• 19.4 sec TRT (total recovery time)

• 18.5 sec spent in analysis

• Pinpoint is the bottleneck in microreboot-based recovery

Is Pinpoint app-generic?

• Upgrade to Petstore v1.3.2

– Works for the confidence interval

• How different was the updated version?

Performance Overhead

• Results for a 30-min fault-free run with 350 clients

• In-memory vs. external (SSM) session state

• Marshalling costs

Assumption

• Well-defined interfaces for components (.NET, J2EE)

• Deterministic call paths between components

• No critical service request

• Training data for statistical model

• Guidelines (Crash-Only Software)

Discussion

• Overall one of the good papers; maybe a bit verbose in the introduction

• Integrating framework for earlier work by Candea

• Limitation of the present statistical model

• Shared EJB state

– Modify JIT, disable microreboots (references, static variables)

• Application

– Global data not scrubbed

• Cost-benefit: microreboot vs. total reboot

Supplementary

• Application server = operating system for Internet applications (instantiates app components in containers, provides runtime system services, integrates with the web server to make the app web-accessible)

• http://people.epfl.ch/george.candea