SUSE High Availability for SAP HANA: Tales from the real world, tips, tricks & troubleshooting (BP-1351). Thomas Korber, Lars Pinne

  • 1

    SUSE High Availability for SAP HANA

    Tales from the real world, tips, tricks & troubleshooting

    BP-1351

    Thomas Korber, Lars Pinne

  • 2

    Abstract

    SAP HANA is the in-memory database for transactional and analytical workloads. The SUSE HA cluster is an industry-leading open source high availability clustering system designed to virtually eliminate unplanned downtime. Combined, these two technologies are core IT for the world's most relevant enterprises. Learn about best practices from current projects for setting up and troubleshooting SUSE HA and SAP HANA in scale-up and scale-out scenarios, including related Linux components.

  • 3

    Agenda

    1. Best Practices SUSE HA for SAP HANA

    2. Where to look for what

    3. SAPHanaSR Scale-Up Demo

    4. Conclusion

  • 4

    Best Practices SUSE HA for SAP HANA

  • 5

    SAPHanaSR Scenarios

    [Figure: four pacemaker-managed SAP HANA system replication scenarios, each protected by fencing]

    ● Performance Optimized - Scale Up

    ● Cost Optimized - Scale Up

    ● Multi Target - Scale Up

    ● Performance Optimized - Scale Out

    Elements shown in the figure: SAP HANA primary and secondary (PRD) linked by System Replication, active / active; SAPHana resource (Promoted / Demoted); SAPHanaTopology; vIP; SAP HANA QAS with SAPInstance and its own vIP; for scale-out, nodes NodeA1 ... NodeAN and NodeB1 ... NodeBN, SR sync, majority maker, SAPHanaController (Promoted / Demoted).

  • 6

    Fencing Phenomenology

    [Figure: classification of fencing methods]

    ● Node Fencing * (reboot, shutdown)

      - Remote Mgmt. (external): iLO, DRAC, IPMI, ...; vCenter, libvirt, HMC, EC2, ...

      - SBD + Watchdog (in node; disk-based, diskless): hpwdt, iTCO_wdt, ipmi_wdt, softdog, ...

    ● IO Fencing (locking, reservation)

      - Pure Locking (external): SFEX, SCSI2 Reservation, SCSI3 Reservation

      - Built-in Locking (in cluster): cLVM+DLM, LVM+lvmlockd+DLM, Cluster MD (TODO); cluster-based, MD-RAID, cluster-handled

    * mandatory

  • 7

    Cluster and Fencing Topology

    Two Sites Scale-Up

    2 nodes, 2 FC SBDs

    2 nodes, 2 iSCSI SBDs

    Three Sites Scale-Up

    2 nodes, 3 FC SBDs

    2 nodes, 2 FC SBDs, 1 iSCSI SBD

    2 nodes, 3 iSCSI SBDs

    ( 2 nodes, 1 iSCSI SBD )

    2+1 nodes, diskless SBD

    Three Sites Scale-Out

    2xN+1 nodes, 3 FC SBDs

    2xN+1 nodes, 2 FC SBDs, 1 iSCSI SBD

    2xN+1 nodes, 3 iSCSI SBDs

    2xN+1 nodes, diskless SBD

  • 8

    It's good NOT to do:

    - directly re-use concepts from other cluster solutions

    - set cluster resource, STONITH, and SBD timings shorter than SAN timings

    - use OCFS2 if no concurrent access is needed

    - run without STONITH at all

    - manually change the status of cluster-controlled resources

    - let other software use the watchdog in parallel to SBD

    - issue commands to the cluster while it is in transition

    - go live without tests planned and done
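    The timing warning above can be turned into a small sanity check. This is a minimal sketch, not part of any SUSE tool: it assumes the commonly cited SBD rules of thumb (msgwait at least twice the watchdog timeout, stonith-timeout at least msgwait plus 20%) and a hypothetical `san_timeout` value standing for the longest expected SAN or MPIO path failover. All function and parameter names are illustrative.

```python
def check_sbd_timings(san_timeout, watchdog, msgwait, stonith_timeout):
    """Return a list of problems; an empty list means the ordering looks sane.

    All values are in seconds. The rules below are rules of thumb
    (assumptions), not the output of any cluster tool.
    """
    problems = []
    # The watchdog must not fire before the SAN had a chance to recover.
    if watchdog <= san_timeout:
        problems.append("watchdog timeout is not longer than the SAN timeout")
    # msgwait should be at least twice the watchdog timeout.
    if msgwait < 2 * watchdog:
        problems.append("msgwait is shorter than 2 x watchdog timeout")
    # stonith-timeout should be at least msgwait + 20% (integer arithmetic).
    if stonith_timeout * 10 < msgwait * 12:
        problems.append("stonith-timeout is shorter than msgwait + 20%")
    return problems


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(check_sbd_timings(san_timeout=30, watchdog=60, msgwait=120, stonith_timeout=150))
    print(check_sbd_timings(san_timeout=90, watchdog=60, msgwait=100, stonith_timeout=110))
```

    The concrete numbers have to come from the infrastructure at hand; the sketch only checks that the values are ordered consistently.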

  • 9

    It's good to do:

    + two independent LAN links for cluster communication

    + two or three SBD devices, or diskless SBD

    + adapt resource time-outs to the infrastructure, e.g. SAN MPIO or vMotion

    + make the CIB simple, e.g. few groups instead of many constraints

    + resource naming schema, e.g. prefixes rsc_, msl_, grp_, ord_, loc_, col_

    + set up the cluster step-by-step

    + use crm

    + always issue crm unmigrate after migration has completed

    + check the cluster for clean idle state before triggering actions

    + be patient, respect cluster timings

    + define and perform tests for all failure scenarios
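    A naming schema is easiest to keep when it can be checked mechanically. The following sketch classifies object IDs by the prefixes named on this slide. The meaning attached to each prefix is an assumption based on common usage, and the example IDs are hypothetical.

```python
# Prefixes as listed on the slide; the descriptions are assumptions.
PREFIXES = {
    "rsc_": "primitive resource",
    "msl_": "multi-state resource",
    "grp_": "group",
    "ord_": "order constraint",
    "loc_": "location constraint",
    "col_": "colocation constraint",
}


def classify(object_id):
    """Return the kind implied by the ID prefix, or None if the ID breaks the schema."""
    for prefix, kind in PREFIXES.items():
        # A bare prefix without a name after it does not count.
        if object_id.startswith(prefix) and len(object_id) > len(prefix):
            return kind
    return None


if __name__ == "__main__":
    # Hypothetical IDs for illustration only.
    for object_id in ("rsc_SAPHana_PRD_HDB00", "grp_ip", "cli-prefer-rsc_ip"):
        print(object_id, "->", classify(object_id))
```

    IDs that return None, such as constraints created automatically by a migration, stand out immediately.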

  • 10

    Where to look for what

  • 11

    Config and Log Files

    ● Config files
      /etc/hosts
      /etc/ntp.conf
      /etc/multipath.conf
      /etc/modules-load.d/
      /etc/sysconfig/sbd
      /etc/corosync/corosync.conf
      /etc/sudoers
      /usr/sap/sapservices
      /usr/sap/$SID/SYS/profile/$SID_HDB$nr_$host
      /usr/sap/$SID/SYS/global/hdb/custom/config/global.ini
      /hana/shared/myHooks/SAPHanaSR.py

    ● Log files
      /var/log/messages
      /var/lib/pacemaker/pengine/pe-input-*.bz2
      /usr/sap/$SID/HDB$nr/$host/trace/

    See man sbd, stonith_sbd, crm_no_quorum_policy, sudoers, multipath.conf, corosync.conf, SAPHanaSR-ScaleOut_basic_cluster
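    As an illustration of what to look for in /etc/sysconfig/sbd, the sketch below extracts the SBD device list from the file content. It assumes the usual shell-style `SBD_DEVICE="dev1;dev2"` syntax with semicolon-separated devices and treats an empty or missing entry as diskless SBD. The device paths in the example are hypothetical.

```python
import shlex


def sbd_devices(sysconfig_text):
    """Return the devices from the SBD_DEVICE= line, or [] if none are set."""
    for line in sysconfig_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "SBD_DEVICE":
            tokens = shlex.split(value)  # removes shell-style quoting
            joined = tokens[0] if tokens else ""
            return [dev for dev in joined.split(";") if dev]
    return []


if __name__ == "__main__":
    # Hypothetical file content for illustration only.
    sample = (
        "# example\n"
        'SBD_DEVICE="/dev/disk/by-id/example-a;/dev/disk/by-id/example-b"\n'
        "SBD_WATCHDOG_DEV=/dev/watchdog\n"
    )
    devices = sbd_devices(sample)
    print(len(devices), "SBD device(s):", devices)
```

    The resulting count can be compared with the recommendation of two or three SBD devices, or none for diskless SBD.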

  • 12

    Useful Commands online - plain

    # sg_persist --read-reservation --device=/dev/...
    # cs_show_hana_autofailover --all
    # cs_show_error_patterns -c | grep -v "=.0"
    # cs_show_hana_info --info $SID $nr
    # cs_show_memory
    # cs_sum_base_config
    # rear --help

    ~> sapcontrol -nr $nr -function StartSystem
    ~> sapcontrol -nr $nr -function StopSystem ALL
    ~> sapcontrol -nr $nr -function GetSystemInstanceList
    ~> hdbnsutil -sr_state
    ~> HDBSettings.sh systemOverview.py
    ~> HDBSettings.sh systemReplicationStatus.py
    ~> HDBSettings.sh landscapeHostConfiguration.py

  • 13

    Example master+worker+standby

    ~> HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?
    | Host  | Host | Host   | Failov | Remove | Stor   | Stor   | Failov | Failov | NameSrv  | NameSrv | IndexSrv | IndexSrv | Host    | Host    | ...
    |       | Actv | Status | Status | Status | Config | Actual | Config | Actual | Config   | Actual  | Config   | Actual   | Config  | Actual  |
    |       |      |        |        |        | Part   | Part   | Group  | Group  | Role     | Role    | Role     | Role     | Roles   | Roles   |
    | ----- | ---- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | -------- | ------- | -------- | -------- | ------- | ------- |
    | db101 | yes  | ok     |        |        | 1      | 1      | deflt  | deflt  | master 1 | master  | worker   | master   | worker  | worker  |
    | db102 | yes  | ok     |        |        | 2      | 2      | deflt  | deflt  | master 2 | slave   | worker   | slave    | worker  | worker  |
    | db103 | yes  | ignore |        |        | 0      | 0      | deflt  | deflt  | master 3 | slave   | standby  | standby  | standby | standby | ...
    overall host status: ok
    RC:4

    ~> HDBSettings.sh systemReplicationStatus.py; echo RC:$?
    | Database | Host  | Port  | ServiceName | VolumeID | SiteID | SiteName | Secondary | Sec   | Sec    | Sec      | Sec    | Repl | Repl   | Repl    |
    |          |       |       |             |          |        |          | Host      | Port  | SiteID | SiteName | Active | Mode | Status | Details |
    | -------- | ----- | ----- | ----------- | -------- | ------ | -------- | --------- | ----- | ------ | -------- | ------ | ---- | ------ | ------- |
    | SYSTEMDB | db101 | 31001 | nameserver  | 1        | 1      | DC1      | db401     | 31001 | 2      | DC2      | YES    | SYNC | ACTIVE |         |
    | P04      | db101 | 31007 | xsengine    | 2        | 1      | DC1      | db401     | 31007 | 2      | DC2      | YES    | SYNC | ACTIVE |         |
    | P04      | db101 | 31003 | indexserver | 3        | 1      | DC1      | db401     | 31003 | 2      | DC2      | YES    | SYNC | ACTIVE |         |
    | P04      | db102 | 31003 | indexserver | 4        | 1      | DC1      | db402     | 31003 | 2      | DC2      | YES    | SYNC | ACTIVE |         |

    status system replication site "2": ACTIVE
    overall system replication status: ACTIVE

    Local System Replication State
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    mode: PRIMARY
    site id: 1
    site name: DC1
    RC:15

    Note: output beautified.
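    The return codes printed after each command (RC:4 and RC:15 in this example) carry the overall status. The sketch below maps them to text, assuming the code tables commonly given in SUSE's SAPHanaSR setup guides (0 to 4 for landscapeHostConfiguration.py, 10 to 15 for systemReplicationStatus.py). The function name and the strict "healthy" rule are illustrative choices, not part of the SAP scripts.

```python
# Return codes as commonly documented; treat these tables as assumptions.
LANDSCAPE_RC = {0: "fatal", 1: "error", 2: "warning", 3: "info", 4: "ok"}
SR_STATUS_RC = {
    10: "no system replication",
    11: "error",
    12: "unknown",
    13: "initializing",
    14: "syncing",
    15: "active",
}


def describe(landscape_rc, sr_rc):
    """Translate both return codes and flag the fully healthy combination."""
    landscape = LANDSCAPE_RC.get(landscape_rc, "unexpected")
    replication = SR_STATUS_RC.get(sr_rc, "unexpected")
    healthy = landscape_rc == 4 and sr_rc == 15
    return landscape, replication, healthy


if __name__ == "__main__":
    # The two return codes shown in the example output on this slide.
    print(describe(4, 15))
```

    A wrapper script could collect both codes with `echo RC:$?` as on the slide and pass them to such a function before any cluster action is triggered.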

  • 14

    Useful Commands online - cluster

    # crm_mon -1Ar
    # crm configure show | grep cli-
    # cs_clusterstate -i
    # SAPHanaSR-showAttr
    # SAPHanaSR-monitor
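    The `grep cli-` above looks for location constraints that a migration or ban left behind (Pacemaker names them `cli-prefer-...` and `cli-ban-...`). The same check can be sketched on captured `crm configure show` output. The configuration text in the example is hypothetical and heavily shortened.

```python
def leftover_cli_constraints(crm_config_text):
    """Return the IDs of location constraints whose ID starts with 'cli-'."""
    found = []
    for line in crm_config_text.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "location" and words[1].startswith("cli-"):
            found.append(words[1])
    return found


if __name__ == "__main__":
    # Hypothetical, shortened configuration for illustration only.
    sample = """\
primitive rsc_ip_PRD_HDB00 IPaddr2 params ip=192.0.2.10
location cli-prefer-rsc_ip_PRD_HDB00 rsc_ip_PRD_HDB00 role=Started inf: node1
location loc_example rsc_ip_PRD_HDB00 100: node2
"""
    for constraint_id in leftover_cli_constraints(sample):
        print("leftover constraint:", constraint_id)
```

    Any hit means a migration was not cleaned up afterwards and the cluster is not free to place the resource.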

  • 15

    Example master+worker+standby

    # SAPHanaSR-showAttr
    Global cib-time                 prim sec srHook sync_state
    -----------------------------------------------------------
    global Thu Aug 1 09:39:01 2019  DC1  -   SOK    SOK

    Sit lpt        lss mns   srr
    -------------------------------------
    DC1 1564645080 4   db102 P
    DC2 30         4   db403 S

    Hosts clone_state node_state roles                         score
    --------------------------------------------------------------------------
    db101             online     master1:master:worker:master  -33333
    db102 PROMOTED    online     master3:slave:worker:slave    110
    db103             online     master2:slave:standby:standby
    db401             online     master2:master:worker:master
    db402             online     master3:slave:worker:slave
    db403             online     master1:slave:standby:standby
    mm888             online     :shtdown:shtdown:shtdown

    Note: output beautified, does not match other examples.
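    The colon-separated roles value is easier to read when split into named fields. The sketch below assumes the four fields follow the column order of the landscapeHostConfiguration.py table (name server configured role, name server actual role, index server configured role, index server actual role). That mapping is inferred from the examples in this deck and should be treated as an assumption.

```python
from typing import NamedTuple


class HostRoles(NamedTuple):
    # Field meanings are an assumption inferred from the example outputs.
    nameserver_config: str
    nameserver_actual: str
    indexserver_config: str
    indexserver_actual: str


def parse_roles(roles):
    """Split a roles string such as 'master1:master:worker:master' into fields."""
    parts = roles.split(":")
    if len(parts) != 4:
        raise ValueError("expected four colon-separated fields: %r" % roles)
    return HostRoles(*parts)


if __name__ == "__main__":
    # Values taken from the example output on this slide.
    for value in ("master1:master:worker:master", ":shtdown:shtdown:shtdown"):
        print(parse_roles(value))
```

    The majority maker line shows that fields can be empty, so the parser keeps empty strings rather than rejecting them.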

  • 16

    Useful Commands offline

    # cs_show_supportconfig -g $supportconfig_directory
    # cs_show_supportconfig -f $supportconfig_directory chk_saphana chk_sleha
    # cs_show_hana_autofailover_patterns --all $date message-\*
    # crm_simulate -S --xml-file $pengine-input
    # crm_mon --xml-file $pengine-input
    # SAPHanaSR-showAttr --sid=$SID:$nr --cib=$path/cib.xml
    # SAPHanaSR-replay-archive --format script $crm_report | \
        SAPHana-filter --host='Hosts/$host/role' --filterDouble

  • 17

    SAPHanaSR Scale-Up Demo

  • 18

    SAPHanaSR Scale-Up Demo

  • 19

    Conclusion

  • 20

    Related Sessions

    TUT-1092 Bootstrapping SLES for SAP HANA & NetWeaver clusters with Terraform & Salt on public clouds

    TUT-1212 Running SAP Data Hub on Kubernetes with SUSE CaaS Platform

    BP-1209 Planning, deployment, maintenance, & operations of SAP S/4HANA

    FUT-1439 SUSE Linux Enterprise Server for SAP applications: The road ahead

    TUT-1226 SAP HA on SUSE – All you need to know

    HOL-1225 High Availability for SAP application servers using ENSA2 enqueue replication

    BP-1351 SUSE High Availability for SAP HANA: Tales from the real world, tips & tricks, & troubleshooting

  • 21

    More Information

    https://www.suse.com/products/sles-for-sap
    https://documentation.suse.com/sbp/all
    https://www.suse.com/c/tag/towardszerodowntime/
    https://www.suse.com/service/training
    https://www.suse.com/c/saphanasr-scaleout-automating-sap-hana-system-replication-scale-installations-sles-sap-applications/
    https://www.suse.com/c/tag/supportconfig-analysis-sca-tools/
    https://software.opensuse.org/package/python-cluster-preflight-check
    https://github.com/Thr3d/supportutils-plugin-suse-sap
    https://github.com/SUSE/node_exporter
    https://github.com/SUSE/hanadb_exporter
    https://github.com/ClusterLabs/ha_cluster_exporter
    https://blogs.sap.com/2017/11/19/be-prepared-for-using-pacemaker-cluster-for-sap-hana-part-1-basics/
    https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.03/en-US/a165e192ba374c2a8b17566f89fe8419.html
    https://www.sap.com/documents/2016/06/84ea994f-767c-0010-82c7-eda71af511fa.html
    https://www.b1-systems.de/

  • 22

    Conclusion – what to take away

    ● HANA System Replication is complex, particularly scale-out.

    ● Topology of cluster and fencing defines how well failures are handled.

    ● Understanding of resources is needed for planning, building, running and troubleshooting these clusters.

    ● Tools and documented procedures can help, e.g. SAPHanaSR-showAttr, SAPHanaSR-replay-archive.

    ● Cluster and maintenance procedures have to be tested carefully before going live.

    ● Professional services are available, e.g. review before going live.

  • 23

    Q&A

  • 24

    General Disclaimer

    This document is not to be construed as a promise by any participating company to develop, deliver, or market a product.  It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions.  SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose.  The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE.  Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE, LLC, Inc. in the United States and other countries.  All third-party trademarks are the property of their respective owners.