OPNFV Summit 2015 Doctor - Fault Management

OPNFV Summit 2015

Doctor - Fault Management

Gerald Kunzmann, DOCOMO

Carlos Goncalves, NEC

Ryota Mibu, NEC

Doctor Overview

• Goal

– Build fault management and maintenance framework

• Approach

– Identify requirements

– Gap Analysis

– Implementation work in Upstream (OpenStack)

– Integration and testing

• Status

– Initial requirement study, architecture design, gap analysis: Done

– Collaborative Development: On-going (3 merged Blueprints in OpenStack Liberty)

– Standardization Sync: On-going (by NFV member efforts, joint meeting)

Doctor Members

• At project creation (Dec 2014)

– NTT DOCOMO, Sprint

– NEC, Nokia, Ericsson, Huawei, ClearPath Network, Cisco

• Now (Oct 2015)

– NTT DOCOMO, Sprint, AT&T, Telecom Italia, KDDI

– NEC, Nokia, Ericsson, Huawei, ClearPath Network, Cisco, Cloudbase Solutions, Spirent, Intel, ZTE

• Membership has roughly doubled (2x)

Assumption of VNF (NFV Application)

• Telco applications are typically deployed in active-standby or active-active fashion

– App (Active) and App (Standby) each run in their own VM, and each VM runs on a separate machine

– App and App Manager (VNFM) cannot detect HW failures directly

– App state is switched when a failure occurs

Use Case 1: Fault management

• Actors

– Consumer C1, Consumer C2, Consumer C3

– Virtualized Infrastructure Manager (VIM), e.g. OpenStack, reached through the OpenStack Northbound Interface

– Resource Pool: Hardware Servers S1, S2 and S3, each running a Hypervisor

• Resource Map: Server – VM mapping

– Server S1: VM-1, VM-2

– Server S2: VM-7

– Server S3: VM-4

• Resource Map: Ownership information

– VM-1, VM-7: Consumer C1

– VM-2: Consumer C2

– VM-4: Consumer C3

• Sequence (X marks the fault in the diagram)

1. Fault Monitoring: hardware fault, hypervisor fault, host OS fault

2. Inform the Consumer? If YES, find owner of affected VMs from database

3. FaultNotification (VM ID, Fault ID)

4. Switch to SBY configuration

5. Instruction (VM ID)

6. Execute Instruction, e.g. migrate VM
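
The lookup in step 2 of Use Case 1 can be sketched in a few lines. This is only an illustration built from the Resource Map values on the slide; the dictionary layout, the function name `fault_notifications` and the fault ID string are assumptions, not part of OpenStack or of the Doctor project.

```python
# Illustrative Resource Map, copied from the slide. The data layout is hypothetical.
SERVER_TO_VMS = {
    "S1": ["VM-1", "VM-2"],
    "S2": ["VM-7"],
    "S3": ["VM-4"],
}
VM_OWNER = {"VM-1": "C1", "VM-7": "C1", "VM-2": "C2", "VM-4": "C3"}


def fault_notifications(failed_server, fault_id):
    """Return one (consumer, vm_id, fault_id) tuple per VM on the failed server."""
    return [
        (VM_OWNER[vm], vm, fault_id)
        for vm in SERVER_TO_VMS.get(failed_server, [])
    ]


# A hardware fault on Server S1 affects VM-1 (owned by C1) and VM-2 (owned by C2).
notifications = fault_notifications("S1", "HW_FAULT")
print(notifications)
```

Each tuple corresponds to one FaultNotification (VM ID, Fault ID) sent to the owning Consumer in step 3.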

Use Case 2: Maintenance

• Actors

– Administrator

– Consumer C1, Consumer C2, Consumer C3

– Virtualized Infrastructure Manager (VIM), e.g. OpenStack, reached through the OpenStack Northbound Interface

– Resource Pool: Hardware Servers S1, S2 and S3, each running a Hypervisor

• Resource Map: Server – VM mapping

– Server S1: VM-1, VM-2

– Server S2: VM-7

– Server S3: VM-4

• Resource Map: Ownership information

– VM-1, VM-7: Consumer C1

– VM-2: Consumer C2

– VM-4: Consumer C3

• Sequence

1. Maintenance Request (Server S3), issued by the Administrator

2. Which VMs are affected? Find Consumer owning the VM(s) from the database.

3. Maintenance Notification (VM ID)

4. Switch to SBY configuration

5. Instruction (VM ID)

6. Execute Instruction, e.g. migrate VM

Fault Management Sequence

• Applications: App, App, App

• Virtualized Infrastructure

– Virtual Compute, Virtual Storage, Virtual Network

– Virtualization Layer

– Hardware Resources

• Virtualized Infrastructure Manager (VIM) = OpenStack

• VIM User and Administrator

• Doctor Scope: Detection, Reaction

Key Requirements as VIM

• Immediate Notification

• Consistent Resource State Awareness

• Extensible Monitoring

• Fault Correlation

Doctor Architecture and Typical Scenario

• Components

– Monitor (multiple instances)

– Inspector

– Controller (multiple instances)

– Notifier

– Manager

– Application

– Virtualized Infrastructure (Resource Pool)

– Configuration and data: Alarm Conf., Failure Policy, Resource Map

• Typical scenario

0. Set Alarm

1. Raw Failure

2. Find Affected

3. Update State

4. Notify all

5. Notify Error

6-. Action
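
The typical scenario (steps 0 to 6) can be sketched as a chain of small objects. The slide only labels the steps, so which component performs each step is an assumption here, as are all class names, method names and the `vm.error` event type; this is not the Doctor or OpenStack implementation.

```python
from collections import defaultdict


class Notifier:
    """Keeps the alarm configuration and forwards matching events."""

    def __init__(self):
        self._alarms = defaultdict(list)  # event type -> list of callbacks

    def set_alarm(self, event_type, callback):
        # Step 0: a Manager registers interest in an event type.
        self._alarms[event_type].append(callback)

    def notify_all(self, event_type, payload):
        # Step 5: every registered alarm callback receives the error.
        for callback in self._alarms[event_type]:
            callback(payload)


class Controller:
    """Holds resource state (assumed role of a Controller in this sketch)."""

    def __init__(self, notifier):
        self.state = {}
        self._notifier = notifier

    def update_state(self, vm, new_state):
        # Step 3: record the corrected state, then step 4: notify all.
        self.state[vm] = new_state
        self._notifier.notify_all("vm.error", {"vm": vm, "state": new_state})


class Inspector:
    """Maps a raw failure to the affected virtual resources."""

    def __init__(self, resource_map, controller):
        self._resource_map = resource_map
        self._controller = controller

    def on_raw_failure(self, host):
        # Step 1: raw failure arrives from a Monitor. Step 2: find affected VMs.
        for vm in self._resource_map.get(host, []):
            self._controller.update_state(vm, "error")


# Wiring: the Manager sets an alarm and records the action it would take (step 6).
actions = []
notifier = Notifier()
notifier.set_alarm("vm.error", lambda p: actions.append(("switch_to_standby", p["vm"])))
controller = Controller(notifier)
inspector = Inspector({"S1": ["VM-1", "VM-2"]}, controller)

inspector.on_raw_failure("S1")
print(actions)
```

The point of the sketch is the direction of the flow: detection is pushed from the infrastructure side, and the Manager only reacts to notifications it asked for in step 0.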

Doctor OSS Map

• Mapping of architecture components to open source software

– Notifier: Ceilometer

– Controller: Nova, Neutron, Cinder

– Monitor: e.g. Zabbix

– Inspector: e.g. Monasca

• Other elements in the diagram: Manager, Application, Virtualized Infrastructure (Resource Pool), Alarm Conf., Failure Policy, Resource Map

• Scenario steps: 0. Set Alarm, 1. Raw Failure, 2. Find Affected, 3. Update State, 4. Notify all, 5. Notify Error, 6-. Action

Doctor OSS Development

• Development items

– Ceilometer (Notifier): Event Alarm

– Nova (Controller): State Correction

• Other software in the diagram

– Controller: Cinder, Neutron

– Monitor: e.g. Zabbix

– Inspector: e.g. Monasca

• Other elements in the diagram: Manager, Application, Virtualized Infrastructure (Resource Pool), Alarm Conf., Failure Policy, Resource Map

• Scenario steps: 0. Set Alarm, 1. Raw Failure, 2. Find Affected, 3. Update State, 4. Notify all, 5. Notify Error, 6-. Action

Doctor Blueprints in Liberty Cycle

Project | Blueprint | Spec Drafter | Developer | Status

Ceilometer | Event Alarm Evaluator | Ryota Mibu (NEC) | Ryota Mibu (NEC) | Completed (Liberty)
Nova | New nova API call to mark nova-compute down | Tomi Juvonen (Nokia) | Roman Dobosz (Intel) | Completed (Liberty)
Nova | Support forcing service down | Tomi Juvonen (Nokia) | Carlos Goncalves (NEC) | Completed (Liberty)
Nova | Get valid server state | Tomi Juvonen (Nokia) | (not listed) | Spec approved (Mitaka)
Nova | Add notification for service status change | Balazs Gibizer (Ericsson) | Balazs Gibizer (Ericsson) | Waiting for spec approval (Mitaka)

Doctor BP Detail: Nova – Mark Nova-Compute Down

• Host / Machine: Hypervisor, VM, vSwitch, BMC, nova compute

• Nova services: nova api, nova conductor, nova scheduler, nova DB, queue

• External Monitoring Service and Monitoring Client

• EXISTING (periodic update): service state

• NEW: Force-down API, an API to update nova-compute service state
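
As a rough illustration of how an external monitoring service might call the Force-down API, the sketch below builds the HTTP request without sending it. To my understanding the call was exposed in the Nova compute API at microversion 2.11 as `PUT /os-services/force-down`; the endpoint URL, token and host name are placeholders, and the helper name `build_force_down_request` is hypothetical.

```python
import json
import urllib.request


def build_force_down_request(nova_endpoint, token, host, forced_down=True):
    """Build (but do not send) a request that marks nova-compute on `host` as down."""
    body = json.dumps(
        {"host": host, "binary": "nova-compute", "forced_down": forced_down}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{nova_endpoint}/os-services/force-down",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # placeholder credential
            # Assumed microversion at which forced_down became available.
            "X-OpenStack-Nova-API-Version": "2.11",
        },
    )


# Placeholder endpoint and host; sending would need urllib.request.urlopen(req).
req = build_force_down_request("http://controller:8774/v2.1", "TOKEN", "compute-1")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Compared with waiting for the existing periodic update to time out, an explicit call like this lets the monitor report a host failure as soon as it is detected.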

Doctor BP Detail: Ceilometer - Event Alarm

• Notification sources: Cinder, Neutron, Nova

• EXISTING (polling-based): notification, sample, stats

• NEW Shortcut (notification-based): notification, event, Notification-driven alarm evaluator

• Other components in the diagram: Manager, Audit Service
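
The difference between the polling-based path and the notification-based shortcut is that the evaluator runs once per incoming event instead of periodically querying statistics. A minimal sketch of such an evaluator follows. The rule layout (`event_type` plus a `query` list over event traits) mirrors my understanding of the event alarm format but should be read as an assumption; only the `eq` operator is handled, and the function names are hypothetical.

```python
def matches(event_rule, event):
    """Return True if an event satisfies an alarm's event rule (eq conditions only)."""
    if event_rule["event_type"] != event["event_type"]:
        return False
    for cond in event_rule.get("query", []):
        field = cond["field"]
        # Fields such as "traits.state" address entries in the event's traits.
        trait = field[len("traits."):] if field.startswith("traits.") else field
        if event["traits"].get(trait) != cond["value"]:
            return False
    return True


def evaluate(alarms, event):
    """Notification-driven evaluation: called once per incoming event, no polling."""
    return [alarm["name"] for alarm in alarms if matches(alarm["event_rule"], event)]


# Hypothetical alarm: fire when an instance update reports the "error" state.
alarms = [
    {
        "name": "vm-error",
        "event_rule": {
            "event_type": "compute.instance.update",
            "query": [{"field": "traits.state", "op": "eq", "value": "error"}],
        },
    }
]
event = {
    "event_type": "compute.instance.update",
    "traits": {"instance_id": "VM-1", "state": "error"},
}
print(evaluate(alarms, event))
```

Because evaluation is triggered by the notification itself, the delay between a state change and the alarm no longer depends on a polling interval.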

Doctor Southbound API

• Actors: User, Admin, NFVI

• Components: Controller, Inspector, Notifier, Monitor (multiple instances)

• Interfaces: Configuration, Fault Messaging, Unified Event API

• Settings shown: Conf., Policy, Threshold, Enable

Doctor Status

• Components: Notifier, Monitor, Controller, Inspector

• Software shown: Ceilometer, Zabbix, Nova, Monasca?, DPDK, Neutron, Cinder

• Legend: Done, Next Step

• Phases: To-Be Arch. Design, Gap Analysis, Blueprint, Coding, Integration, OPNFV Release

• Timeline: Dec 2014, Mar 2015, Sep 2015, Feb 2016

Don’t miss out...

• “Doctor – Fault Management” Project Theater, Wednesday, 3:55 pm – 4:15 pm

• “Doctor: Failure Detection and Notification for NFV” DOCOMO booth, PoC Demo Zone