Autonomous Recovery in Componentized Internet Applications
Candea et al.
Vikram Negi
Introduction
• Autonomic Problem
• Approach
• Results
• Discussion
The Autonomic Problem
• Allow the application to recover automatically from transient and intermittent software failures.
The Approach
• Introduce the ideas:
– Macroanalysis (fault detection)
– Microrebooting (rapid recovery)
– External management (recovery action)
• Integrate and test with JBoss
Design Overview
• Autonomous Process
– Monitoring: Java probes
– Fault detection: generates an anomaly report
– Recovery: takes action
• Total time to recovery
J2EE Review
• J2EE enterprise apps = collection of reusable Java modules
• JSPs / servlets invoke EJBs, which invoke other EJBs, ...
• EJB = Java component that complies with a certain interface and provides a service
• Deployment descriptor (per-bean XML file) conveys run-time characteristics and dependencies; used in deploying the application
JBoss Design
• Open-source J2EE app server
• Written entirely in Java
• Microkernel with components held together by JMX (management support)
JAGR = ROC-ified JBoss with Application-Generic Recovery
• 3 Tier Architecture
• Key Components
– Macroanalysis engine
– Microrebooting hook
– Recovery manager
Pinpoint : Detection and Localization
• Stored observations
– IP address of machine, timestamp
– Globally unique request ID
– Number of calls/returns to EJBs
– Association between sender and receiver
– SQL queries collected (updates, reads)
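The observation fields listed above can be pictured as a small record type. This is only an illustrative sketch: the class name `Observation`, its field names, and the helper `calls_per_component` are hypothetical and are not taken from Pinpoint.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Observation:
    """One monitoring record (hypothetical layout, not Pinpoint's schema)."""
    host_ip: str      # IP address of the machine that logged the event
    timestamp: float  # time of the event
    request_id: str   # globally unique request ID
    sender: str       # calling component
    receiver: str     # called component
    event: str        # "call" or "return"


def calls_per_component(observations: Iterable[Observation],
                        request_id: str) -> Dict[str, int]:
    """Count how many calls each component received within one request."""
    counts: Dict[str, int] = {}
    for obs in observations:
        if obs.request_id == request_id and obs.event == "call":
            counts[obs.receiver] = counts.get(obs.receiver, 0) + 1
    return counts
```

Grouping by request ID is what lets the sender/receiver associations be reassembled into a per-request picture of component interactions.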
Pinpoint : Analysis
• Analysis Engine
– Centralized engine
– Plugin-based architecture
• Modeling Components
– Assume that present component behavior and historical (normal) behavior have the same probability distribution.
– Chi-square test to determine whether the two distributions differ.
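The chi-square comparison can be sketched as a goodness-of-fit test of currently observed interaction counts against historical proportions. This is a minimal sketch under assumptions: the function names are hypothetical, the caller supplies the critical value for the chosen significance level and degrees of freedom, and the slides do not say how Pinpoint itself computes or thresholds the statistic.

```python
from typing import Dict


def chi_square_statistic(observed: Dict[str, int],
                         expected_probs: Dict[str, float]) -> float:
    """Pearson chi-square statistic of observed counts vs. historical proportions.

    Categories absent from expected_probs (or with probability 0) are ignored.
    """
    total = sum(observed.values())
    stat = 0.0
    for category, prob in expected_probs.items():
        expected = total * prob
        if expected > 0:
            diff = observed.get(category, 0) - expected
            stat += diff * diff / expected
    return stat


def is_anomalous(observed: Dict[str, int],
                 expected_probs: Dict[str, float],
                 critical_value: float) -> bool:
    """Flag the component when the statistic exceeds the caller's critical value."""
    return chi_square_statistic(observed, expected_probs) > critical_value
```

For example, with three categories (two degrees of freedom) a caller might pass 5.991, the standard critical value at the 0.05 significance level.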
Recovery: microreboot is not expensive
• State Segregation
– Store important state outside the application, in a database.
– Persistent state: CMP (container-managed persistence, J2EE) is a requirement for the prototype.
– Session state: stored in a modified SSM (external session state store).
• Containment and Reintegration
– Microreboot the transitive closure of all inter-EJB references
– XML deployment descriptors determine the grouping for the closure
– Complete reboot or microreboot
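The "transitive closure of inter-EJB references" can be sketched as graph reachability. The sketch assumes the references have already been extracted (for example from deployment descriptors) into a plain dictionary, and it follows references in the forward direction only; the function name and data layout are hypothetical, and the slides do not specify which direction JAGR follows.

```python
from collections import deque
from typing import Dict, Iterable, List


def microreboot_group(start: str,
                      references: Dict[str, Iterable[str]]) -> List[str]:
    """Return every component reachable from `start` via references, sorted.

    `references` maps a component name to the components it refers to.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        component = queue.popleft()
        for target in references.get(component, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)
```

If components holding references to the faulty one must also restart, the caller can add the reverse edges to `references` before computing the group.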
Recovery
• Enabling microreboot
– Method in the JBoss EJB container
– Preserve the class loader
Manage Recovery
• Recovery Policy
– Read the failure report; consider only components scored above 1.0
– Microreboot the top n, or all components above 1.0
– Allow a delay (~30 sec)
– If the error persists, retry a few times, then reboot completely
– Finally, report to the system administrator
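The escalation policy above can be sketched as a small control loop. Everything here is an assumption made for illustration: the function name, the callback parameters, and the default of three attempts are hypothetical, and the real recovery manager's interfaces are not described in the slides.

```python
from typing import Callable, Dict, Optional


def recover(report: Dict[str, float],
            microreboot: Callable[[str], None],
            full_reboot: Callable[[], None],
            still_failing: Callable[[], bool],
            notify_admin: Callable[[], None],
            threshold: float = 1.0,
            top_n: Optional[int] = None,
            max_attempts: int = 3,
            wait: Callable[[], None] = lambda: None) -> str:
    """Escalate from microreboots to a full reboot to a human operator."""
    # Keep only components scored above the threshold, highest score first.
    suspects = sorted((c for c, s in report.items() if s > threshold),
                      key=lambda c: -report[c])
    if top_n is not None:
        suspects = suspects[:top_n]

    for _ in range(max_attempts):
        for component in suspects:
            microreboot(component)
        wait()  # e.g. sleep about 30 seconds before re-checking
        if not still_failing():
            return "microreboot"

    full_reboot()
    wait()
    if not still_failing():
        return "full_reboot"

    notify_admin()
    return "escalated"
```

Passing the actions as callbacks keeps the policy separate from how a reboot is actually triggered, which mirrors the idea of external management of recovery.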
Evaluation Test Framework
• Applications
– Petstore 1.1 (12 components, 233 Java files, 11K LOC)
– Petstore 1.3.1 (47 components, 310 Java files, 10K LOC)
– RUBiS (21 components, 500 Java files, 25K LOC)
• Workload
– Client simulators driven by a transition table
– 350 clients (max utilization principle)
• Faultload
– Based on industry experience
– No low-level hardware or OS faults
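A client simulator driven by a transition table can be sketched as a random walk over page states. The table format, the state names, and the function name below are hypothetical; the slides do not give the actual table used in the evaluation.

```python
import random
from typing import Dict, List


def simulate_session(transitions: Dict[str, Dict[str, float]],
                     start: str,
                     steps: int,
                     rng: random.Random) -> List[str]:
    """Walk the transition table for `steps` moves and return visited states.

    transitions[state] maps each possible next state to its probability
    (used as a relative weight).
    """
    state = start
    visited = [state]
    for _ in range(steps):
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights, k=1)[0]
        visited.append(state)
    return visited
```

Running 350 such sessions concurrently, each with its own seeded `random.Random`, would give a repeatable emulated workload.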
Evaluation Detection
• Results similar to other detectors
• No discussion of absolute numbers?
• Faults: forced Java runtime/declared exceptions, call omission, and source code bugs
• (1) How well was each fault detected? (2) How well were major outages detected?
Evaluation : Localization
• Localization % per algorithm per fault type: CIA > 85%
• No absolute data again?
Evaluation : Recovery
• Introduce faults in SSM-RUBiS.
• Restart SSM-RUBiS or micro reboot component.
• Observations from 10 trials with 350 concurrent clients.
Full Reboot vs. Microreboot
• Injected a null reference fault in SB CommitBid, then a corrupt User-Item, SB BrowseCategories and SB CommitUserFeedback.
• Microreboot maintains steady response.
• 425 vs. 3916 failed requests
• 61527 vs. 56028 successful requests
• What error conditions did the other trials have?
Total Recovery Time
• Corrupt SB_ViewItem: set it to NULL.
• 19.4 sec TRT
• 18.5 sec in analysis
• Pinpoint is the bottleneck in microreboot recovery.
Is Pinpoint app-generic?
• Upgrade to Petstore v.1.3.2
– Works for the confidence interval
• How different was the updated version?
Performance Overhead
• Results for 30min fault free run w/ 350 clients
• In-memory vs. external (SSM) session state
• Marshalling costs
Assumptions
• Well-defined interfaces for components (.NET, J2EE)
• Deterministic call paths between components
• No critical service requests
• Training data for the statistical model
• Guidelines (crash-only software)
Discussion
• Overall a good paper, though perhaps a bit verbose in the introduction.
• Integrating framework for earlier work by Candea.
• Limitations of the present statistical model.
• Shared EJB state
– Modify JIT, disable microreboots (references, static variables)
• Application: global data not scrubbed.
• Cost-benefit: microreboot vs. total reboot
Supplementary
• Application server = operating system for Internet applications (instantiates app components in containers, provides runtime system services, integrates with a web server to make the app web-accessible)
• http://people.epfl.ch/george.candea