Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Under the Hood of Oracle Clusterware Miracle OpenWorld 2010 15-Apr-2010 Alex Gorbachev, The Pythian Group

Transcript of Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Page 1: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Under the Hood of Oracle Clusterware

Miracle OpenWorld 2010

15-Apr-2010

Alex Gorbachev, The Pythian Group

Page 2: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

© 2009/2010 Pythian

Alex Gorbachev

• CTO, The Pythian Group
• Blogger
• OakTable Network member
• Oracle ACE Director
• BattleAgainstAnyGuess.com
• Vice-president, Oracle RAC SIG

Page 3: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Why Companies Trust Pythian

• Recognized Leader:
  • Global industry leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and SQL Server
  • Work with over 150 multinational companies such as Forbes.com, Fox Interactive Media, and MDS Inc. to help manage their complex IT deployments
• Expertise:
  • One of the world’s largest concentrations of dedicated, full-time DBA expertise
• Global Reach & Scalability:
  • 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response

Page 4: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Agenda

• Place of Clusterware in Oracle RAC
• Node membership and evictions
• Clusterware startup sequence
• Oracle Cluster Registry
• Resource management and troubleshooting
• 11gR2 Grid Infrastructure

Page 5: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Chart: need to memorize (low to high) plotted against understanding (shallow to in-depth)]

The more you understand, the less you need to memorize.

Page 6: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Architecture

[Diagram: three nodes, each running OS, Clusterware, ASM, Instance, Listener, VIP and Service; the nodes are linked by the interconnect and have storage access to shared storage holding the OCR and the voting disk]

Page 13: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Diagram: Clusterware components running on top of the OS]

• CSSD: Cluster Synchronization Services
• CRSD: Cluster Ready Services
• RACG, VIP: HA Framework scripts
• EVMD: Event Manager
• OPROCD: Oracle Process Monitor

Page 16: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Diagram: two nodes, each running the Clusterware stack (CSSD, CRSD, EVMD, OPROCD, RACG, VIP) on the OS, connected by the interconnect]

Page 21: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Diagram: two nodes running the Clusterware stack, connected by the interconnect and sharing a voting disk]

ShootTheOtherNodeInTheHead

Page 27: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Diagram: cluster node running the Clusterware stack, with interconnect and voting disk]

AskTheOtherNodeToRebootItself (c) known quote

Page 29: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Diagram: two nodes with interconnect and voting disk; OCLSOMON appears in the diagram]

Page 48: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Evictions

• Network heartbeat lost
• Voting disk access lost
• CSSD is not healthy
• OS is not healthy
  • OPROCD - Unix, Windows, 11g Linux
  • hangcheck-timer - 10g Linux

Page 49: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: NHB failure

• Simulate with “ifconfig eth1 down”
• Both nodes notice the loss
• Racing to evict each other
  • from voting disk => 2 equal sub-clusters
  • the sub-cluster with the lowest leader # survives
  • leader is the node with the lowest # in its sub-cluster
• Winner evicts the other node
  • Setting kill-block in voting disk
  • CSSD and OCLSOMON race to suicide
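The survival rule on this slide can be sketched as a small shell function. This only illustrates the rule as stated above and is not Oracle's CSSD implementation; the names `leader_of` and `surviving_subcluster`, and the space-separated lists of node numbers, are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the split-brain rule from the slide.

# leader_of "3 1 2" prints the lowest node number in a sub-cluster.
leader_of() {
    min=""
    for n in $1; do
        if [ -z "$min" ] || [ "$n" -lt "$min" ]; then
            min=$n
        fi
    done
    echo "$min"
}

# surviving_subcluster "A nodes" "B nodes" prints the sub-cluster that
# survives when both are of equal size: the one with the lowest leader #.
surviving_subcluster() {
    leader_a=$(leader_of "$1")
    leader_b=$(leader_of "$2")
    if [ "$leader_a" -lt "$leader_b" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Two-node cluster after "ifconfig eth1 down": nodes 1 and 2 are split.
surviving_subcluster "1" "2"   # prints 1
```

In a two-node cluster this means node 1 always wins the race, whichever side actually lost its network link.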

Page 50: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

NHB failure symptoms

• NHB failure on several nodes
  • ocssd.log
• Evicted node can contain other traces
  • maybe - syslog (Linux - /var/log/messages)
  • maybe - oclsomon.log
  • almost always - console
• Network is only a *possible* root cause
  • check syslog, ifconfig, netstat
  • Network engineering - switch logs

Page 51: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: CSSD is not healthy

• Simulate using kill -STOP <cssd.bin pid>
• Another node observes NHB loss
  • After misscount seconds => attempt eviction
  • but CSSD is frozen and can’t commit suicide
• OCLSOMON detects CSSD timeout
  • Commit suicide

Page 52: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

OCSSD sick - symptoms

• Error in OCLSOMON.log
• OCSSD log might be clean on evicted node
• syslog might contain OCLSOMON diag. err.
• Console often contains diag. err.
  • Depending on syslogd settings
• Set diagwait to more than 3 for better diagnosability
  • 3 seconds is reboottime
  • Increases risk of corruption

Page 53: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: host sick - CPU stalled

• Simulate by pausing OPROCD
  • kill -STOP <oprocd pid>
  • sleep 1 or 2
  • kill -CONT <oprocd pid>
• oprocd.log
  • Usually nothing if node is reset
• Immediate reboot
  • Console might contain diag msg
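The three demo steps above (stop, sleep, continue) can be wrapped in one helper. This is a minimal sketch using only `kill` and `sleep`; `pause_process` is a hypothetical name, and on a real cluster node pointing it at OPROCD is expected to reset the machine, so it belongs on a test system only.

```shell
#!/bin/sh
# pause_process PID SECONDS: freeze a process with SIGSTOP, wait,
# then resume it with SIGCONT (the sequence shown on the slide).
pause_process() {
    pid=$1
    seconds=$2
    kill -STOP "$pid" || return 1
    sleep "$seconds"
    kill -CONT "$pid"
}

# Demo use on a test node (not executed here):
#   pause_process <oprocd pid> 2
```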

Page 54: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Killed by OPROCD - symptoms

• Hard to confirm (nothing in oprocd.log)
• Console output often helps
  • “SysRq: resetting” could be in syslog as well
• Root cause
  • Faulty hardware, drivers, caused by IO/network
  • Kernel bugs, NTP bugs
  • Investigate syslog messages
• Margin can be tuned
  • diagwait and reboottime CSSD parameters

Page 55: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

10g on Linux - hangcheck-timer

• Replaced by OPROCD in 11g and 10.2.0.4+
• Most of the time useless and inactive!
• Metalink Note 726833.1
  • Updated 21-JUL-08!
• Oracle suggests keeping both
  • I would only leave OPROCD
• Metalink Note 567730.1
  • OPROCD in 10.2.0.4

Page 56: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Killed by hangcheck-timer

• Rarely can be confirmed
  • “Hangcheck: hangcheck is restarting the machine”
• Can set hangcheck_dump_tasks to dump state
  • See source code...

Page 57: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Clusterware startup

• Linux & UNIX inittab
  • init.cssd
  • init.evmd
  • init.crsd
• Linux & UNIX init.d
  • init.crs
• Windows Services

Page 58: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Daemons startup sequence

[Diagram: CSSD, EVMD, CRSD, third-party clusterware]

• Triggered
  • by init.crs from init.d sequence
  • manually

Page 59: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Startup in Linux & Unix

[gorby@dime ~]$ ps -fe | grep 'init\.' | grep -v grep
root 6352 1 0 10:24 ... /bin/sh /etc/init.d/init.evmd run
root 6353 1 0 10:24 ... /bin/sh /etc/init.d/init.cssd fatal
root 6354 1 0 10:24 ... /bin/sh /etc/init.d/init.crsd run
root 7356 6353 0 10:25 ... /bin/sh /etc/init.d/init.cssd oprocd
root 7364 6353 0 10:25 ... /bin/sh /etc/init.d/init.cssd oclsomon
root 7383 6353 0 10:25 ... /bin/sh /etc/init.d/init.cssd daemon

[gorby@dime ~]$ tail -3 /etc/inittab
h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null

[gorby@dime ~]$ ls -l /etc/rc3.d/S96init.crs
lrwxrwxrwx 1 root root 20 Aug 1 23:51 /etc/rc3.d/S96init.crs -> /etc/init.d/init.crs

Page 67: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Startup flow

[Diagram: startup timeline (t) with the following elements]

• init.crsd run, init.evmd run, init.cssd fatal
• init.crs start, init.cssd autostart
• /etc/oracle/scls_scr/{host}/root/cssrun
• /etc/oracle/scls_scr/{host}/root/crsstart
  • enable
  • disable
• init.cssd oprocd, init.cssd oclsomon, init.cssd daemon, init.cssd oclsvmon
• oprocd, oclsomon.bin, ocssd.bin, oclsvmon.bin
• evmd.bin, crsd.bin
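The crsstart file on this slide takes the values enable and disable. A minimal sketch of reading that flag follows; it assumes the file holds a single word, and `crs_autostart_state` is a hypothetical helper name, not an Oracle tool.

```shell
#!/bin/sh
# crs_autostart_state FILE prints "enable", "disable" or "unknown"
# depending on the first word stored in FILE (assumed format).
crs_autostart_state() {
    if [ ! -r "$1" ]; then
        echo "unknown"
        return 1
    fi
    state=""
    # read returns non-zero when the last line has no newline, so ignore it
    read -r state _rest < "$1" || true
    case "$state" in
        enable|disable) echo "$state" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Example (path pattern from the slide; substitute the real host name):
#   crs_autostart_state "/etc/oracle/scls_scr/{host}/root/crsstart"
```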

Page 68: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: Startup troubleshooting

• Check processes using “ps -fe | grep init”
• Check syslog (/var/log/messages)
  • Can point to /tmp/crsctl.#####
• Remember boot sequence
• Clusterware log files
  • if *.bin processes are running already
• crsctl
  • crsctl check crs/cssd/crsd/evmd
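The first check on this slide can be made a little more systematic. The sketch below reads `ps -fe` output on stdin and reports which of the three inittab wrappers named in this deck are absent; `missing_init_wrappers` is a hypothetical helper, and the wrapper list is taken from the slides rather than from Oracle documentation.

```shell
#!/bin/sh
# missing_init_wrappers reads "ps -fe" output on stdin and prints every
# expected init.* wrapper (names taken from the slides) that is absent.
missing_init_wrappers() {
    ps_output=$(cat)
    for wrapper in "init.evmd run" "init.cssd fatal" "init.crsd run"; do
        if ! printf '%s\n' "$ps_output" | grep -q -F "/etc/init.d/$wrapper"; then
            echo "$wrapper"
        fi
    done
}

# Typical use on a live node (not executed here):
#   ps -fe | missing_init_wrappers
```

An empty result means all three wrappers are running, in which case the next stop is the Clusterware log files.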

Page 69: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Log files

• log/{host}/cssd/ocssd.log
• log/{host}/cssd/oclsomon/oclsomon.log
  • oclsomon.ba1, oclsomon.ba2,...
• /etc/oracle/oprocd/{host}.oprocd.log
  • {host}.oprocd.log.{timestamp}
• syslog
  • Linux /var/log/messages
  • Solaris /var/adm/messages
• Console logs

Page 70: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Windows world

• OPROCD = OraFenceService
• EVMD = OracleEVMService
• CRSD = OracleCRService
• CSSD = OracleCSService
• OPMD
  • Oracle Process Manager Daemon
  • Start trigger like init.crs in *nix
  • registered with Windows Service Control Manager (WSCM) and delays start by 60 seconds

Page 72: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Diagram: Clusterware stack on the OS (CRSD, EVMD, RACG, VIP, CSSD, OPROCD); this slide covers EVMD]

• Passing clusterware events
• Usually not a problem
• Verify
  • evmwatch -A
  • evmpost -u "my message"

Page 74: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

[Diagram: Clusterware stack on the OS (CRSD, EVMD, RACG, VIP, CSSD, OPROCD)]

• CRSD manages cluster resources
  • Stop / Start
  • Failover
  • VIP management
  • New resources, etc.
• RACG helper scripts

Page 75: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

CRSD startup

• After CSSD and EVMD
• Re-spawned on failure
  • No eviction
• Runs as root
  • VIP control
  • OCR management
  • root ulimits are in place!
• Can run resources owned by any user
  • owner is the property of a resource

Page 76: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Oracle Cluster Registry

• Repository for all configuration data
  • Except OCR location itself
• OCR is accessed mostly read-only
  • Every component reads OCR
• OCR is written only by CRS
  • only from a single OCR master node

### crsd.log ###
2008-08-02 22:23:50.958: [ OCRMAS] [3065154448]th_master:13:I AM THE NEW OCR MASTER at incar 12. Node Number 1
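To see when the OCR master role moved and to which node, the crsd.log message shown above can be filtered out. The sketch assumes every such line has the same layout as the single sample on this slide; `ocr_master_events` is a hypothetical helper name.

```shell
#!/bin/sh
# ocr_master_events reads crsd.log lines on stdin and prints, for every
# "I AM THE NEW OCR MASTER" message, the timestamp and the node number.
ocr_master_events() {
    sed -n 's/^\([0-9-]* [0-9:.]*\): .*I AM THE NEW OCR MASTER.*Node Number \([0-9][0-9]*\).*$/\1 node \2/p'
}

# Typical use (log location as given later in this deck):
#   ocr_master_events < "$ORA_CRS_HOME/log/{host}/crs/crsd.log"
```

Run against each node's crsd.log, this gives a rough timeline of OCR master changes.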

Page 77: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

CRS resources

• Standard Oracle resources
  • ASM
  • Listener
  • VIP
  • Database and Instance
  • etc.
  • srvctl => manages Oracle resources
• Custom user resources
  • crs_% => manages any resources

Page 78: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

CRS resource internals

• Unique name
• Associated action script
  • stop / start / check functions
• Other attributes
  • check frequency
  • pre-requisites
  • restart retries
  • etc...
• All info stored in OCR

Page 79: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: Resource profiles

• Use crs_stat [-t] to check status
• Use crs_stat -p to check attributes
• crs_* vs srvctl (like srvctl config ... -a)
• Standard action scripts
  • racgimon
  • racgwrap / racgmain
  • racgvip
  • racgons
  • usrvip

Page 80: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: OCR internals

• ocrcheck
• ocrconfig
  • used during install/upgrade
  • backup OCR
  • recover OCR
• ocrdump
  • txt or xml

Page 81: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: racgvip case study

• Check the script
• Set env. vars and simulate the call
• Use _USR_ORA_DEBUG=1 in the script

Page 82: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Resources hierarchy

• 10.2.0.2 (?)
  • released dependency of ASM and Instance on VIP
• If DB registered manually with srvctl
  • ASM dependency missing

[Diagram: resource dependency hierarchy with DB, Service, CS (Collective Service), Instance, ASM, Listener, and Nodeapps (GSD, ONS, VIP); one dependency is marked “Only 10.1 and 10.2.0.1”]

Page 83: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Resources and Oracle homes

[Diagram: the resource hierarchy (DB, Service, CS, Instance, ASM, Listener, Nodeapps with GSD, ONS, VIP) grouped into DB Home, ASM Home and CRS Home]

Listener can be in ASM home
ASM home can be Oracle home
Logs are in appropriate home

Page 84: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

DEMO: troubleshooting resources

• {home}/log/{host}/racg/{resource_name}.log
• Old way - edit racgwrap
  • Uncomment _USR_ORA_DEBUG=1
• crsctl debug log res ‘{res_name}:{0|1}’
  • crs_stat -p | grep DEBUG
• Run “srvctl start ...” manually
  • SRVM_TRACE=TRUE

Page 85: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Troubleshooting summary

• crsctl check crs | crsd | cssd | evmd
• crs_stat [-t]
• crs_stat -p [{res_name}]
• crsctl debug log css | crs | evm | res
• crsctl lsmodules css | crs | evm
• crs_stop {res_name} [-f] (force stop of a resource)
• ocrdump
• See scripts

Page 86: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Troubleshooting flow

• Is Clusterware up?
• Are Oracle resources up?
  • Listener & VIP
  • Database & ASM instance
  • Services
• Did any nodes get rebooted?
• Did any resources restart?
  • $ORA_CRS_HOME/log/{host}/crs/crsd.log
  • $ORA_CRS_HOME/log/{host}/alert{host}.log
• MOS Note 265769.1 “Troubleshooting 10g and 11.1 Clusterware Reboots”

Page 87: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

Enter the 11gR2 World - Grid Infrastructure

[Slide images: Oracle Clusterware Administration and Deployment Guide; My Oracle Support Note 1053147.1]

Page 90: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

11g Grid Infrastructure Documentation

• Oracle Clusterware Administration and Deployment Guide
• MOS Note 1053147.1
  • 11gR2 Clusterware and Grid Home - What You Need to Know
• MOS Note 1050908.1
  • How to Troubleshoot Grid Infrastructure Startup Issues
• MOS Note 1053970.1
  • Troubleshooting 11.2 Grid Infrastructure Installation Root.sh Issues
• MOS Note 1050693.1
  • Troubleshooting 11.2 Clusterware Node Evictions (Reboots)

Page 91: Mow10 uthoc-alex-gorbachev-public-100422164413-phpapp02

11gR2 Node Evictions

• Same as in 10g + member kill escalation
  • LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out it could escalate to a node kill.
• Processes evicting
  • CSSD
  • CSSDAGENT
  • CSSDMONITOR