Firewall Tips & Tricks and All About Network_ How to Troubleshoot Failovers in ClusterXL

Firewall-Network tips and tricksThis blog will provide more info on Checkpoint, Cisco, Bricks, Netscaler, F5 loadbalancer

home

Labels

Cheat Sheets (7)

Checkpoint (159)

Cisco (24)

Commands (5)

Fortigate (2)

Frame-Relay (9)

Linux (3)

Netscaler (29)

Netscreen (2)

Nokia (7)

UNIX (2)

Live Traffic

how to troubleshoot failovers in clusterxl

Basic Information About the Failover ProcessA failover occurs when a cluster member is no longer able to perform its designated functions, and

another member in the cluster assumes the failed member's responsibilities.

In a Load Sharing configuration, if one cluster member goes down, its connections are distributed

among the remaining cluster members. All cluster members in a Load Sharing configuration are

synchronized, thus avoiding interrupted connections.

In a High Availability configuration, if currently Active cluster member in a synchronized cluster goes

down, another cluster member becomes new Active member and "takes over" the connections of the

failed cluster member. If state synchronization is not enabled, then existing connections are closed

when failover occurs, although new connections can be opened.

In order to exchange the control information between cluster members (state of each member, state

of interfaces, etc), the ClusterXL Cluster Control Protocol is used between cluster members (UDP port

8116). If a certain predetermined time has elapsed and no CCP messages are received from a given

cluster member, it is assumed by other cluster members that the given cluster member has failed, it

will be declared as Down and a failover might take place. At this point other cluster members

automatically assume the responsibilities of the failed one.

Notes

A cluster member may still be operational, but if any of the above checks fail in the cluster, the

faulty member initiates the failover because it has determined that it can no longer function as

a cluster member.

More than one cluster member may encounter a problem that will result in a failover event. In

cases where all cluster members encounter such problems, ClusterXL will try to choose a single

member to continue operating. The state of the chosen member will be reported as Active

Attention. This situation lasts until another member fully recovers. For example, if a cross cable

connecting the cluster members malfunctions, both members will detect an interface problem.

One of them will change its state to theDown, and the other will change its state to Active

Attention.

A failover takes place when one of the following occurs on the active cluster member:

Any critical device (pnote) reports a problem (refer to the output of 'cphaprob -ia list'

command).

The machine freezes / crashes (in this case refer to sk31511).

When a failing cluster member recovers:

In a Load Sharing configuration, all connections are redistributed among all Active members.

In a High Availability configuration, the recovery method depends on the configured cluster

setting.

The options are:

Maintain Current Active Gateway

If one member passes on a control to a lower priority machine, then control will be

returned to the higher priority member only if the lower priority member fails. This

mode is recommended if all members are equally capable of processing traffic, in order

to minimize the number of failover events.

Switch to Higher Priority Gateway

If the lower priority member has control and the higher priority member is restored,

then control will be returned to the higher priority member. This mode is recommended

if one member is better equipped for handling connections (e.g., has more CPU power,

more memory).

When a failed cluster member recovers, it will first try to download a policy from one of the other

Active cluster members. The assumption is that the other cluster members have a more up-to-date

policy. If this does not succeed, the recovering cluster member compares its own local policy to the

policy on the Security Management server. If the policy on the Security Management server is more

up-to-date than the one on the cluster member, the policy on the Security Management server will be

retrieved. If the cluster member does not have a local policy, it retrieves one from the Security

Management server. This ensures that all cluster members use the same policy at any given moment.

Troubleshooting the ClusterXL FailoversTo identify the root cause of the ClusterXL failovers:

In this step, we use several methods to understand what is causing the ClusterXL failovers.

SmartView Tracker - to filter all ClusterXL messages.

ClusterXL CLI - to understand the current status of the cluster members and the reason for

that.

Analyze the Syslog messages which were generated at the time of the failovers.

Step 1: Using SmartView Tracker:

To facilitate analysis of what happens at the time of the failover, the relevant messages should be

Search This Blog

Blog Archive

► 2011 (107)

▼ 2010 (146)

▼ December (13)

What's New in R75

Error: "Packet is droppedbecause there is no vali...

How to configure automaticbackups in SecurePlatfo...

How to Set Up a Site-to-SiteVPN with a 3rd-party ...

How to Configure ManagementHA

How To Troubleshoot MemoryLeaks on IPSO

How to troubleshoot failovers inClusterXL

Backup procedures forCheckpoint

Best practices CheckpointProvider-1

Check Point logging issueswhen the Management Ser...

Checkpoint : Nokia Hardware -Model - Serial Numbe...

TCP DUMP - Deep Inside

R75 was released!

► September (32)

► August (2)

► June (9)

► May (90)

Total Pageviews

73,321

Live Traffic Feed

See your visitors inRealTime! Get the Free LiveTraffic Feed Get FeedjitNow!

A visitor from Casablancaviewed "Firewall Tips &Tricks and all aboutNetwork: How totroubleshoot failovers inClusterXL" 0 secs ago

A visitor from Mumbai,Maharashtra viewed"Firewall Tips & Tricks andall about Network:Checkpoint Commands" 5mins ago

A visitor from India viewed"Firewall Tips & Tricks andall about Network" 12 minsago

A visitor from New Delhi,Delhi viewed "Firewall Tips& Tricks and all aboutNetwork" 14 mins ago

A visitor from New Delhi,Delhi viewed "Firewall Tips& Tricks and all aboutNetwork: Checkpoint" 14mins ago

A visitor from Moscow,Moscow City viewed"Firewall Tips & Tricks andall about Network: DisablingAnti-Spoofing" 21 mins ago

A visitor from Doha, AdDawhah viewed "FirewallTips & Tricks and all aboutNetwork: CheckpointTroubleshooting -Debugging" 25 mins ago

Firewall Tips & Tricks and all about Network: How to troubleshoot failo... http://firewalltipss.blogspot.com/2010/12/how-to-troubleshoot-failovers-...

1 of 5 3/29/2012 9:10 AM

exported to a Text file, after modifying the filter:

To export the cluster messages in SmartView Tracker:

Go to the right-most column "Information"

Right-click on the name of the column

Click on "Edit filter"

Under "Specific" choose "Contains"

In "Text" type the word "cluster" (do not check any boxes)

Click on "OK"

Go to all the empty columns - "Source", "Destination", "Rule", "Curr Rule Number", "Rule

Name", "Source Port", "User"

Right-click on the name of the column

Click on "Hide Column" (these columns will re-appear after closing and re-opening SmartView

Tracker)

Save all the Cluster messages - go to menu "File" - click on "Export..."

Troubleshooting Method:

Open the exported file you have just saved, look for the ClusterXL messages during the time of the

failover and see if there are messages indicating a problem of one or more critical devices (pnotes).

Examples - if you see problems on:

"Filter" critical device, most likely the policy was unloaded form that member, which caused the

failover.

Investigate and understand why the policy was unloaded form that member.

"Interface Active Check" critical device, it means that on one or more interfaces, CCP traffic

was not heard on the specified default time configured. This can be a result of networking

issues, high latency, physical interface problems, drivers, etc.

Try to eliminate networking issues, drivers, etc. before you change something in the ClusterXL

configurations.

Use the ClusterXL Admin Guide for further help on this topic.

Step 2: Using ClusterXL CLI:

There are several commands that can help us to see the ClusterXL status. The most useful commands

while a failover took place, are the following:

# cphaprob state

# cphaprob -ia list

# cphaprob -a if

# fw ctl pstat

# cphaprob state:

The following is an example of the output of cphaprob state from Load Sharing Multicast cluster:

Cluster mode can be:

Load Sharing (Multicast)

Load Sharing (Unicast, a.k.a Pivot)

High Availability New Mode (Primary Up

or Active Up)

High Availability Legacy Mode (Primary

Up or Active Up)

For 3rd-party clustering products: "Service"

The number of the member indicates the Member ID in Load Sharing mode, and the Priority in High

Availability mode.

In Load Sharing configuration, all members in a fully functioning cluster should be in 'Active' state.

In High Availability configurations, only one member in a properly functioning cluster must be in

'Active' state, and the others must be in the 'Standby' state.

3d-party clustering products show 'Active'. This is because this command only reports the status of the

Full Synchronization process. For Nokia VRRP, this command shows the exact state of the Firewall, but

not the cluster member (for example, the member may not be working properly, but the state of the

Firewall is active).

Explanations on the ClusterXL members' status:

Active - everything is OK.

Active Attention - problem has been detected, but the cluster member still forwarding

packets, since it is the only machine in the cluster, or there are no active machines in the

cluster.

Down - one of the critical devices is having problems.

Ready -

When cluster members have different versions of Check Point Security Gateway, the

members with a new version have the ready state and the members with the previous

version have the activestate.

Before a cluster member becomes active, it sends a message to the rest of the cluster,

and then expects to receive confirmations from the other cluster members agreeing that

it will becomeactive. In the period of time before it receives the confirmations, the

machine is in the ready state.

When cluster members in versions R70 and higher have different number of CPU cores

and/or different number of CoreXL instances, the member with higher number of CPU

cores and/or higher number of CoreXL instances will stay in Ready state, until the

configuration is set identical on all members.

Standby - the member is waiting for an active machine to fail in order to start packet


2 of 5 3/29/2012 9:10 AM

forwarding. Applies only in high availability mode.

Initializing - the cluster member is booting up, and ClusterXL product is already running, but

the Security Gateway is not yet ready.

ClusterXL inactive or machine is down - Local machine cannot hear anything coming from

this cluster member.

# cphaprob -ia list

When a critical device (pnote) reports a problem, the cluster member is considered to have failed. Use

this command to see the status of the critical devices (pnotes).

There are a number of built-in critical devices, and the administrator can define additional critical

devices. The default critical devices are:

Interface Active Check - monitors the cluster interfaces on the cluster member

Synchronization - monitors if Full Synchronization completed successfully.

Filter - monitors if the Security Policy is loaded.

cphad - monitors the ClusterXL process called 'cphamcset'.

fwd - monitors the FireWall process called 'fwd'.

For Nokia IP Clustering, the output is the same as for ClusterXL Load Sharing.

For other 3rd-party products, this command produces no output.

Example output shows that the fwd process is down:

In these cases, check if 'fwd' process is up,

using the following commands:

# ps auxw

# pidof fwd

# cpwd_admin list

If it is not up and running,you need to

investigate why.

Check $FWDIR/log/fwd.elg file, try to locate

the reason why the fwd process crashes.

Open a case with Check Point support to keep

the investigation.

# cphaprob -a if

Use this command to see state of the cluster

interfaces.

The output of this command must be identical

to the configuration in the cluster object

Topology page.

For example:

The state of interfaces is critical to operation of ClusterXL. ClusterXL checks the number of working

interfaces at boot and sets a value of 'Required interfaces' to the maximum number of 'good'

interfaces seen since the last reboot. If the number of working interfaces changes and becomes less,

than the 'Required interfaces', ClusterXL initiates failover from this member to other members. The

same applies to 'Required secured interfaces', where only the good synchronization interfaces are

counted.

For 3rd-party clustering products, except in the case of Nokia IP Clustering, cphaprob -a if should

always show cluster Virtual IP addresses.

When a state of an interface is DOWN, it means that the ClusterXL on the local member is not able to

receive and/or to transmit CCP packets withing pre-defined timeouts. This may happen when an

interface is malfunctioning, is connected to an incorrect subnet, is unable to pick up Multicast Ethernet

packets, CCP packets arrive with delay, and so on. The interface may also be able to receive, but not

transmit CCP packets, in which case the status field should be checked. The displayed time is the

number of seconds that have elapsed since the interface was last able to receive/transmit a CCP

packet.

See: "Defining Disconnected Interfaces" section - in the ClusterXL_Admin_Guide.


3 of 5 3/29/2012 9:10 AM

0 Responses to "How to troubleshoot failovers in ClusterXL"

Post a Comment

Comment as:

NewerPost

OlderPost

# fw ctl pstat

Use this command in order to get FW-1 statistics.

When troubleshooting ClusterXL issues, use this command in order to see the Sync network status and

statistics.

At the bottom of this command output, are the Sync network statistics.

If that status is in "off", you will see the 'Synchronization' critical device reporting a problem, when

running the# cphaprob -i list command.

That means that the Full Sync in the cluster (when a member is booting up or the cluster is

configured) did not succeed.

In these cases, refer to the following SK articles:

Forcing Full Sync in Cluster

Debugging Full Sync in Cluster

Step 3: Analyzing Syslog Messages:

In Linux environment, like SPLAT, go to /var/log/messages files, and search for the time of the

failover.

In Solaris environments, go to /var/adm/messages and search for the time of the failover.

Look for possible issues that can cause the problem - Link on interfaces going Down, lack of resources

like CPU/memory, etc.

When opening Service Request with Check Point support, please attach all these messages to the case,

and specify the exact time and date this issue happened, so we can correlate the failover with the

logs.

Completing the ProcedureIf after following the steps in this guide the failovers are not resolved, open a Service Request with

Check Point and provide the following information:

CPinfo files from all cluster members (make sure to use the latest CPinfo utility installed per

sk30567)

CPinfo file from the MGMT server (make sure to use the latest CPinfo utility installed per

sk30567)

/var/log/messages* from ALL the cluster members --- please supply all the messages files in

this directory

$FWDIR/log/fwd.elg* from ALL the cluster members --- please send all the fwd.elg files in this

directory

$CPDIR/log/cpwd.elg* from ALL the cluster members --- please send all the cpwd.elg files in

this directory

An export of all cluster messages from SmartView Tracker (see Step 1 above)

12/30/2010 04:11:00 AM

Posted by kishore | Filed Under Checkpoint | 0 Comments

Comments

Home

CREDITS


4 of 5 3/29/2012 9:10 AM

Copyright 2008 Firewall Tips & Tricks and all about

Network - WP Theme by Brian Gardner - Blogger

Template by ThemeLib.com


5 of 5 3/29/2012 9:10 AM

Firewall Tips & Tricks and All About Network_ How to Troubleshoot Failovers in ClusterXL

Documents

Transcript of Firewall Tips & Tricks and All About Network_ How to Troubleshoot Failovers in ClusterXL