Do-it-yourself Disaster Recovery for GCP


TABLE OF CONTENTS

Introduction
Product/Service/Methodology
Key Findings
    Key Findings #1
    Key Findings #2
    Key Findings #3
Visual Data
Conclusion
    Key Takeaways


INTRODUCTION

One day you may attempt to access applications hosted in GCP only to find that the region hosting your mission-critical instances is inaccessible, perhaps due to a natural disaster, a power outage, or a failed upgrade. To maintain business continuity, you must have a duplicate environment of the same applications in a different region, ready to be accessed at a moment's notice, with a reasonably up-to-date copy of production data. Given the large number of GCP customers, it is unrealistic to count on Google to help you promptly restart your business in this scenario. Instead, you must protect yourself.

While it might sound difficult, in reality all of the components needed to put together your own cost-effective disaster recovery infrastructure are present within GCP. This white paper describes a methodology for creating a robust standby copy of your mission-critical applications in a different GCP region. We hope it serves either as a template for do-it-yourself disaster recovery, or as a reference point so that, if you choose a commercial tool such as our Marketplace offering, Thunder for GCP, you do not overpay for the solution.


PRODUCT/SERVICE/METHODOLOGY

You must assume that any outage may be total and without warning: you cannot count on being able to access any configuration information, such as which applications you have running or what deployment they are based on, nor can you access any data on GCP disks. Any solution must be prepared in advance.

The first step is to pick a disaster recovery region as a failover target for the instances in your primary region. All GCP regions are physically far enough apart that whatever outage affects the primary region should not affect the DR region. For example, if your production is based in us-west2 (Los Angeles), you might choose to replicate to the nearest region, us-west1 (Oregon): far enough away to withstand any issue in Los Angeles, but the shortest distance for replicated data to travel.

As the figure below summarizes, the overall approach is to provision duplicate instances in the same type of subnet at the DR region; snapshot the volumes of the primary instances; copy those snapshots to the DR region; convert the replicated snapshots to volumes; and attach those new volumes to the DR instances. In case of an outage, you merely power on the DR instances, all of which will have a reasonably up-to-date copy of the production data. Google Cloud DNS can be used to fail over DNS names between regions as well.

The following describes each step in detail: how to provision, replicate, test, and fail over mission-critical instances between regions using the GCP console, with examples, screenshots, and references to the diagram above.


• For each primary virtual machine, the first action is a one-time provisioning of a duplicate instance at the DR region using the same configuration as the primary. This can be done by creating a replicated snapshot of the primary VM's disk(s) and then creating an instance from that snapshot in the remote region, as described in detail below. When configuring the attributes of the new instance, make sure to do the following:

o Make sure to tag the DR instance with the same network tag as the primary VM so that client traffic is allowed through an identical firewall.

o Place the DR instance in a VPC similar to the primary's, such as the default VPC or a custom VPC with similar subnets.

o Give it a recognizable name, such as the primary instance's name with a suffix like "DR" appended.

o Power it off when the launch completes so that you are not charged for usage until a failover test or an actual failover.

• The following shows an example of provisioning a failover instance:

o Take a snapshot of the primary virtual machine's disk(s): locate the list of disks underlying the VM (starting with the boot disk), select a disk by clicking the hyperlink for its name, and click Create Snapshot in the list of actions at the top of the page. Choose a Regional snapshot location and select the failover region, in this case Oregon.

o Once the snapshot completes, select the Create Instance action from the top of the snapshot details screen. Select the DR region and assign a useful name to the instance.


o Select network parameters that are identical to the primary virtual machine so that the DR instance is network-addressable on a failover. For generic VMs on the default public network, you can generally accept the defaults.

o Assign a network tag with the same name as the primary's network tag, which will apply the same firewall rules to the DR instance; for example, a MySQL instance may need port 3306 opened.

o Create the instance and wait for it to start. Then power it off once it is complete. This is the initial provision of a duplicate DR VM at the failover region.
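The same one-time provisioning can also be scripted rather than clicked through. Below is a minimal Python sketch that drives the gcloud CLI via subprocess. It assumes the primary boot disk shares the VM's name (the GCP default) and uses the us-west2/us-west1 regions from the example above; the network tag name is a hypothetical placeholder, and the exact flags should be verified against your installed gcloud version.

# provision_dr.py -- one-time provisioning of a DR copy of a primary VM.
# A minimal single-disk sketch using the gcloud CLI; names and zones are examples only.
import subprocess

PRIMARY_DISK = "mysql-w2b-vm"      # primary boot disk (by default named after the VM)
PRIMARY_ZONE = "us-west2-b"        # primary zone (Los Angeles)
DR_REGION    = "us-west1"          # DR region (Oregon)
DR_ZONE      = "us-west1-b"        # DR zone
NETWORK_TAG  = "mysql-server"      # hypothetical tag; must match the primary's tag

def run(*args):
    """Run a gcloud command, echoing it first, and return its stdout."""
    print("+", " ".join(args))
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

# 1. Snapshot the primary boot disk, storing the snapshot in the DR region.
snapshot = f"{PRIMARY_DISK}-dr-base"
run("gcloud", "compute", "disks", "snapshot", PRIMARY_DISK,
    "--zone", PRIMARY_ZONE,
    "--snapshot-names", snapshot,
    "--storage-location", DR_REGION)

# 2. Create a disk in the DR zone from the replicated snapshot.
dr_disk = f"{PRIMARY_DISK}-dr-disk"
run("gcloud", "compute", "disks", "create", dr_disk,
    "--zone", DR_ZONE,
    "--source-snapshot", snapshot)

# 3. Create the DR instance from that disk with the same network tag, then
#    power it off so it accrues no compute charges until a test or failover.
dr_vm = f"{PRIMARY_DISK}-dr"
run("gcloud", "compute", "instances", "create", dr_vm,
    "--zone", DR_ZONE,
    "--tags", NETWORK_TAG,
    "--disk", f"name={dr_disk},boot=yes")
run("gcloud", "compute", "instances", "stop", dr_vm, "--zone", DR_ZONE)

For a multi-disk VM you would repeat the snapshot and disk-creation steps for each additional disk and pass one --disk flag per disk to the create command.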

• The duplicate instance, of course, has only a single point-in-time copy of the primary's database. The next step is to periodically copy the data from the primary to the DR instance. This is accomplished by once again creating a replicated snapshot of the primary instance's disk(s). However, once the snapshots complete, instead of provisioning a new instance, create a new disk from the replicated snapshot, detach the existing disk from the DR instance, and attach the new one. Power on the DR instance as a test to make sure it will boot.


o Follow the steps above to create a replicated snapshot of the primary virtual machine's disk(s). When the snapshot completes, navigate to the Disks section under Compute Engine in the GCP console and select Create Disk. In the dialog, select the DR region and the DR zone where the DR instance was provisioned above, and select the just-completed replicated snapshot as the source snapshot. Perform this operation for each disk in the VM. For example, for a single-disk virtual machine:

o Detach the disk(s) from the DR instance by editing the DR virtual machine, navigating to the list of disks, and clicking the X at the end of the line defining the disk:

o Then attach the newly created disk by editing the VM again, choosing Add Item in the disks section, and selecting the newly created disk:
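This periodic refresh can also be scripted. The following Python sketch, again driving the gcloud CLI, wraps the snapshot, disk creation, detach/attach swap, boot test, and cleanup described in the bullets that follow into a single function. The single-disk assumption, the instance and disk names, and the --boot flag on attach-disk are simplifications to verify against your environment.

# refresh_dr.py -- periodic data refresh: snapshot the primary disk, build a
# new disk in the DR zone, and swap it into the (stopped) DR instance.
# A minimal single-disk sketch; names, zones, and flags are examples to verify.
import subprocess, time

def run(*args):
    print("+", " ".join(args))
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

def refresh(primary_disk, primary_zone, dr_vm, old_dr_disk, dr_zone, dr_region):
    stamp = time.strftime("%Y%m%d-%H%M")
    snapshot = f"{primary_disk}-{stamp}"
    new_disk = f"{dr_vm}-disk-{stamp}"

    # 1. Replicated snapshot of the primary disk, stored in the DR region.
    run("gcloud", "compute", "disks", "snapshot", primary_disk,
        "--zone", primary_zone,
        "--snapshot-names", snapshot,
        "--storage-location", dr_region)

    # 2. New disk in the DR zone built from that snapshot.
    run("gcloud", "compute", "disks", "create", new_disk,
        "--zone", dr_zone, "--source-snapshot", snapshot)

    # 3. Swap disks on the stopped DR instance: detach the stale copy,
    #    attach the fresh one as the boot disk.
    run("gcloud", "compute", "instances", "detach-disk", dr_vm,
        "--disk", old_dr_disk, "--zone", dr_zone)
    run("gcloud", "compute", "instances", "attach-disk", dr_vm,
        "--disk", new_disk, "--zone", dr_zone, "--boot")

    # 4. Boot test, then power off again to avoid usage charges.
    run("gcloud", "compute", "instances", "start", dr_vm, "--zone", dr_zone)
    run("gcloud", "compute", "instances", "stop", dr_vm, "--zone", dr_zone)

    # 5. Clean up: the snapshot and the old disk are no longer needed.
    run("gcloud", "compute", "snapshots", "delete", snapshot, "--quiet")
    run("gcloud", "compute", "disks", "delete", old_dr_disk,
        "--zone", dr_zone, "--quiet")
    return new_disk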


• Confirm that the instance can start by powering it on and verifying that it transitions to Running. Ideally, also confirm that the application inside the instance can recover. A snapshot is a point-in-time copy of the running application, equivalent to the state the primary would be in if it crashed and rebooted: the application simply recovers from the data as of that moment. Make sure to power the instance off after the test in order to minimize usage charges.

• Delete the snapshots at the primary and secondary regions, along with the old volume detached from the DR instance; they are no longer necessary.

• If you have any static infrastructure (load balancers, etc.), set it up in advance one time as well.

• Repeat the replication steps on a regular basis to keep the DR copy up to date. The interval could be hourly or daily depending on your recovery point objective. Snapshots are differential, so you pay only for the data that changes.

• Now, in an emergency, merely power on the DR instances. They will recover their applications to the point in time of the last snapshot and be ready to serve users.

• If your application is network-addressable through a DNS-resolvable FQDN, make a note of that FQDN. On a failover, update the DNS entry to the public IP address of the DR instance. Otherwise, clients will not be redirected to it. It is not possible to use the same public IP across regions.

o For example, if your FQDN is mysql-w2b-vm.thunderforgcp.net, navigate to the Network Services section of the GCP console, then the Cloud DNS subsection, and update that A record to the public IP of the DR instance, which is available after it boots. Clients will resolve the FQDN to the new IP address within, at most, the TTL (time-to-live) of the record.
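A DNS failover of this kind can also be scripted so that it is one less thing to do under pressure. The sketch below updates the A record through a Cloud DNS record-set transaction; the managed zone name, the current record value, the TTL, and the DR instance name are illustrative assumptions.

# dns_failover.py -- point the application FQDN at the DR instance's public IP.
# A minimal sketch using a Cloud DNS record-set transaction; the managed zone
# name, record TTL, old IP, and instance/zone names are assumptions.
import subprocess

def run(*args):
    print("+", " ".join(args))
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

MANAGED_ZONE = "thunderforgcp-net"              # hypothetical Cloud DNS zone name
FQDN         = "mysql-w2b-vm.thunderforgcp.net."
DR_VM, DR_ZONE = "mysql-w2b-vm-dr", "us-west1-b"
TTL          = "300"
OLD_IP       = "203.0.113.10"                   # current A record value (example)

# 1. Look up the DR instance's external IP once it has booted.
new_ip = run("gcloud", "compute", "instances", "describe", DR_VM,
             "--zone", DR_ZONE,
             "--format=get(networkInterfaces[0].accessConfigs[0].natIP)").strip()

# 2. Replace the A record inside a Cloud DNS transaction.
run("gcloud", "dns", "record-sets", "transaction", "start", "--zone", MANAGED_ZONE)
run("gcloud", "dns", "record-sets", "transaction", "remove", OLD_IP,
    "--name", FQDN, "--type", "A", "--ttl", TTL, "--zone", MANAGED_ZONE)
run("gcloud", "dns", "record-sets", "transaction", "add", new_ip,
    "--name", FQDN, "--type", "A", "--ttl", TTL, "--zone", MANAGED_ZONE)
run("gcloud", "dns", "record-sets", "transaction", "execute", "--zone", MANAGED_ZONE)
# Clients pick up the new address once the old record's TTL expires.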


o If your production instances are solely on a private IP network, for example because they are connected to a corporate VPN or reached through a bastion server, you can do away with the need for address failover by creating an identical private VPC at the DR region and then assigning the DR instance the same private IP as the production instance. That way, when you start the DR instance, it will be visible to clients on the same subnet in exactly the same way it was at the primary. For example, suppose you have a VPC at the primary with a subnet of CIDR 192.168.1.0/24 and the primary VM has been assigned the private IP 192.168.1.3:

When provisioning the DR instance, first create an identical VPC with an identical subnet at the DR region, attach the DR instance to that VPC, and assign the DR instance the same primary internal IP as is assigned to the primary. You will see this option when configuring the network interface and choosing to reserve a static internal IP address:
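If you prefer to script this variant, the following sketch recreates the private network at the DR region and pins the DR instance to the same internal IP. The VPC, subnet, and disk names are hypothetical; the CIDR and IP come from the example above, and the flags should be checked against your gcloud version.

# private_dr_network.py -- recreate the primary's private network at the DR
# region so the DR instance can be given the same internal IP (192.168.1.3).
# Sketch only: the VPC, subnet, and disk names are hypothetical.
import subprocess

def run(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

DR_REGION, DR_ZONE = "us-west1", "us-west1-b"

# 1. Custom-mode VPC and a subnet with the same CIDR as the primary's subnet.
run("gcloud", "compute", "networks", "create", "dr-vpc", "--subnet-mode", "custom")
run("gcloud", "compute", "networks", "subnets", "create", "dr-subnet",
    "--network", "dr-vpc", "--region", DR_REGION, "--range", "192.168.1.0/24")

# 2. DR instance attached to that subnet with the primary's internal IP and no
#    external address, built from the previously created DR disk, then stopped.
run("gcloud", "compute", "instances", "create", "mysql-w2b-vm-dr",
    "--zone", DR_ZONE,
    "--network", "dr-vpc", "--subnet", "dr-subnet",
    "--private-network-ip", "192.168.1.3",
    "--no-address",
    "--disk", "name=mysql-w2b-vm-dr-disk,boot=yes")
run("gcloud", "compute", "instances", "stop", "mysql-w2b-vm-dr", "--zone", DR_ZONE)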


While all of these steps may seem cumbersome, some, such as provisioning, need to be done only once, and the snapshot copy can be run at a set time each day. Alternatively, you can use one of the many scripting languages that work with the GCP API to write a program that does it for you, as sketched below.
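As one possible shape for such a program, the sketch below reuses the refresh() helper from the earlier replication sketch (saved as refresh_dr.py) and runs it for every protected VM listed in a small JSON state file, so the whole job can be triggered hourly or daily from cron or Cloud Scheduler. The file format, field names, and paths are illustrative assumptions.

# dr_schedule.py -- run the replication job for every protected VM.
# Intended to be invoked on a schedule; reuses refresh() from refresh_dr.py.
import json
from refresh_dr import refresh

# One entry per protected instance: primary disk/zone, DR instance/zone/region,
# and the name of the DR disk currently attached (rewritten after each run).
STATE_FILE = "dr_state.json"

def main():
    with open(STATE_FILE) as f:
        vms = json.load(f)

    for vm in vms:
        vm["old_dr_disk"] = refresh(
            primary_disk=vm["primary_disk"],
            primary_zone=vm["primary_zone"],
            dr_vm=vm["dr_vm"],
            old_dr_disk=vm["old_dr_disk"],
            dr_zone=vm["dr_zone"],
            dr_region=vm["dr_region"],
        )

    # Remember which DR disk is now attached, for the next run's detach step.
    with open(STATE_FILE, "w") as f:
        json.dump(vms, f, indent=2)

if __name__ == "__main__":
    main()

A crontab entry such as "0 2 * * * python3 /opt/dr/dr_schedule.py" (a hypothetical path) would run the job every night at 02:00; adjust the interval to match your recovery point objective.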


KEY FINDINGS

Key Findings #1

Business continuity planning for GCP must be done under the assumption that both the data and the metadata (which applications you run, which networks they are attached to, which firewall rules apply to them, etc.) may be permanently unavailable in a disaster. For this reason, you must plan ahead with a duplicate infrastructure while minimizing the time and resources spent on it.

Key Findings #2

GCP provides the infrastructure to deploy a duplicate of your mission-critical assets in a region other than the one hosting your production applications, and to copy point-in-time snapshots of data volumes to that remote region on a regular basis. All of it can be done through the GCP console.

Key Findings #3

However, to keep this going with minimal intervention, ideally you would deploy a solution that automates the provisioning, replication, testing, and failover of GCP virtual machines across regions. Because it can be done manually as described in this document, you might be suspicious of solutions that charge high fees, such as thousands of dollars per year in subscriptions. Instead, you might consider our solution, Thunder for GCP, which automates the simple steps described in this document for just $20/month, which we think is the right price for a solution of this nature.


VISUAL DATA

Thunder for GCP automates all of the steps described in this document, all from a single pane of glass, and all for a flat $20/month subscription fee, regardless of the number of DR instances. Competing solutions charge five to ten times as much per month, or charge more per month than we charge per year.

Our user interface shows on one screen the relationship between the primary and DR sites, the list of primary instances that can be provisioned in the DR site, and the DR instances that have already been provisioned, as well as when the last snapshot copy was completed.

Provisioning is accomplished by selecting the instance to be duplicated and choosing the replication schedule. Snapshot creation, security group duplication, and VPC discovery are automatic.


Snapshot creation, copy, volume attachment and power-on testing are all automated based on a predefined schedule. Each step is logged for transparency and for debugging.

Deep testing is available for certain applications. During the power-on test, Thunder for GCP can connect to the instance's application and perform a sanity check to confirm that the application can start. For example, MySQL can be configured with a test table from which timestamp data can be extracted to prove the database is up to date.


In case of a true disaster at the primary region, one-button failover is available to power on the DR instances for seamless business continuity. Because the DR instances have been tested after each replication job, you can be confident that your business will resume without surprises.


CONCLUSION

Prior to the advent of cloud computing, essentially all applications ran in an on-premises data center using proprietary storage hardware. Disaster recovery was tied to enterprise storage systems such as EMC or NetApp, which required buying a custom solution either from those vendors or from a third party, who themselves had to outlay significant capital to develop it. All of this infrastructure was justifiably expensive.

In the cloud, all of the components are there for anyone to develop a solution using the GCP console or the published APIs, without any capital outlay for expensive on-premises storage hardware. If you have the time and inclination, you can do it yourself. Or you can pay a small monthly fee to us to automate the solution. Because it is so straightforward, what you should hesitate to do is pay a lot of money to legacy enterprise vendors charging a high subscription fee. They might be more interested in protecting their margins than protecting your business.

Key Takeaways

• Disaster recovery for GCP must be prepared in advance; in a true failure, information on how to recover your mission-critical instances may not be available.

• All of the steps required to provision, replicate, test, and fail over instances are available in the GCP console, as well as through GCP command-line tools and APIs.

• These steps can be somewhat cumbersome but are neither proprietary nor secret; instead of investing your time or paying a lot of money to enterprise vendors, you might consider Thunder for GCP, DR automation for GCP without the high cost. For more information please visit www.thundertech.io or check out a brief demonstration at https://youtu.be/kTC8dAKBWcw.