Your business is important, and information is important to your business. Consider how long you can go without each application used in your business. Days? Hours? Minutes? Whether you are an executive or a network administrator, knowing the answer to these questions can impact how effective you are at your job. Business Continuity Planning (“BCP”) and Disaster Recovery (“DR”) have become key requirements for many middle-market businesses. The expectation of always-available applications and data necessitates a robust, always-available infrastructure.
Replication, or copying data to another location on a regular schedule, is a key component of any BCP/DR strategy. The purpose of replication is to limit the scope and effect of an outage or other disaster. Building truly redundant, always-available systems necessitates having your data in multiple geographically dispersed locations to limit the effect of localized disasters. Whether your data centers are in your own buildings, co-location facilities, or the cloud, the replication network determines whether or not replication as a whole will be successful. In order to properly design your replication network, you need to gather business and technical requirements, along with industry-leading practices. After these steps, you will be prepared to recover from future disasters in a timely manner.
Calculating Recovery Point Objective and Recovery Time Objective
Now that you have determined that replication is a key component of your BCP/DR strategy, you need to understand what needs to be replicated and how often. To do this, you need to know every application in use in the business, the relative criticality of each, and the associated recovery objectives. Two key measures are used by most businesses to articulate recovery objectives for applications and data. These are Recovery Point Objective (“RPO”) and Recovery Time Objective (“RTO”). RPO is the point in time to which you want to be able to restore systems, which determines how much recent data you can afford to lose. RTO is the amount of time you can wait for systems to be back online.
Let’s look at an example. You have an application for which you have defined an RPO of one hour and an RTO of two hours. That means that when you recover this application after an outage, you have accepted that you may lose all of the transactions posted in the hour leading up to the outage and that the application may be offline for up to two hours while it is being recovered. The calculation for RPOs and RTOs can be different for each application or data set, and how you get to these numbers will be different for every organization. This is an in-depth topic that is outside of the scope of this article. However, this information is critical in determining your replication strategy. Once you have all of this information gathered, it is important to document all of the business requirements and make them part of your BCP/DR documentation.
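To make the example concrete, here is a minimal Python sketch of how a replication schedule could be checked against these objectives. The class, function names, and the simple timing model (a replication cycle starts at a fixed interval and takes a fixed time to finish copying) are illustrative assumptions, not something prescribed by any particular replication tool.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    """Recovery objectives for one application, in minutes (hypothetical type)."""
    rpo_minutes: float
    rto_minutes: float


def worst_case_data_loss(interval_minutes: float, transfer_minutes: float) -> float:
    """Worst-case age of the newest complete copy at the recovery site.

    Assumed model: a cycle starts every `interval_minutes` and needs
    `transfer_minutes` to finish. Just before a cycle completes, the newest
    usable copy is one full interval plus one transfer time old.
    """
    return interval_minutes + transfer_minutes


def meets_objectives(objectives: RecoveryObjectives,
                     interval_minutes: float,
                     transfer_minutes: float,
                     recovery_minutes: float) -> bool:
    """True if the schedule and estimated recovery time satisfy RPO and RTO."""
    loss = worst_case_data_loss(interval_minutes, transfer_minutes)
    return loss <= objectives.rpo_minutes and recovery_minutes <= objectives.rto_minutes


# The application from the example: RPO of one hour, RTO of two hours.
app = RecoveryObjectives(rpo_minutes=60, rto_minutes=120)

# Replicating every 45 minutes with a 10-minute transfer fits the RPO;
# replicating every 60 minutes with the same transfer time does not.
fits = meets_objectives(app, interval_minutes=45, transfer_minutes=10, recovery_minutes=90)
too_slow = meets_objectives(app, interval_minutes=60, transfer_minutes=10, recovery_minutes=90)
```

Under this model, a replication interval equal to the RPO is not enough on its own, because the time needed to move the data across the replication network also counts against the objective.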
Determining the Right Tools and Methodologies
You have a list of the applications and their recovery objectives; now it is time to dig a little deeper. For each of these applications, you will need to find out which servers and systems are involved, where they are, how they interact, and how much data they hold. Next, determine what methodologies and tools you will be using. The methodology will depend on what your recovery plan is, what type of data you are looking at (databases, files, etc.), and what your recovery objectives are. Some network-related considerations around methodology might include: IP addressing (utilizing the same IP addressing or separate IP addressing in multiple data centers), extending layer 2 between the data centers, optimizing IP routing, and moving IP addresses in a recovery scenario.
When you are deciding on tools, you can consider tools that you may already have in place, but they may not be ideal for the type of replication required. Will the tool work with the methodology you have chosen? Once you know these things, you will understand what needs to be replicated, where it is coming from and going to, and what the requirements are specific to how you are replicating and recovering. Now you are ready to consider these factors in designing the replication network to support it.
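One way to capture what needs to be replicated, and where it is coming from and going to, is a simple inventory. The sketch below is only an illustration: the record fields, site names, server names, and data sizes are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicatedSystem:
    """One server or system that takes part in replication (hypothetical record)."""
    application: str
    server: str
    source_site: str
    target_site: str
    data_gb: float


def data_per_site_pair(systems):
    """Total data (GB) to replicate for each (source site, target site) pair."""
    totals = defaultdict(float)
    for system in systems:
        totals[(system.source_site, system.target_site)] += system.data_gb
    return dict(totals)


# Hypothetical inventory for two applications.
inventory = [
    ReplicatedSystem("ERP", "erp-db01", "HQ", "Colo", 800.0),
    ReplicatedSystem("ERP", "erp-app01", "HQ", "Colo", 50.0),
    ReplicatedSystem("File shares", "fs01", "HQ", "Cloud", 1200.0),
]

totals = data_per_site_pair(inventory)
```

Grouping by site pair shows how much data each replication link has to carry, which feeds directly into the network design.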
Designing a Replication Network
Finally, after you have laid out the technical requirements, you need to consider vendor standards and leading practices. Each application, network, or replication tool vendor will have specific configuration recommendations, such as a minimum MTU size along the path of the replication network. There are also generally accepted leading practices. For example, replication typically occurs over an isolated network dedicated to replication. Often, this will be a flat network with a single IP subnet for all of the systems participating in replication. Whatever network components, tools, or methodologies you choose, you will want to ensure that you have the appropriate redundancies in place and that replication does not hinder the performance of your production network.
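A vendor's minimum MTU recommendation only holds if every hop along the replication path supports it, because the usable MTU of a path is the smallest MTU of any hop. The following sketch illustrates that check. The hop names and MTU values are made up for the example; in practice you would gather them from device configurations or by testing the path.

```python
def path_mtu(hop_mtus: dict[str, int]) -> int:
    """Usable MTU of a path: the smallest MTU configured on any hop."""
    if not hop_mtus:
        raise ValueError("at least one hop is required")
    return min(hop_mtus.values())


def undersized_hops(hop_mtus: dict[str, int], required_mtu: int) -> list[str]:
    """Names of hops whose MTU is below the vendor's required minimum."""
    return [name for name, mtu in hop_mtus.items() if mtu < required_mtu]


# Hypothetical replication path between two data centers.
replication_path = {
    "dc1-core-switch": 9216,
    "dc1-wan-router": 9000,
    "carrier-circuit": 1500,
    "dc2-wan-router": 9000,
    "dc2-core-switch": 9216,
}

usable = path_mtu(replication_path)                  # limited by the carrier circuit
problems = undersized_hops(replication_path, 9000)   # hops that block jumbo frames
```

In this made-up path, a single standard-MTU circuit prevents jumbo frames end to end, even though every device inside both data centers is configured for them.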
To summarize, some key replication network design considerations to keep in mind are:
- Sufficient dedicated connection bandwidth
- Latency that meets the requirements of your particular replication tools
- Layer 2 extension between data centers (e.g. Cisco OTV, VMware NSX)
- Front end/production side IP addressing and IP migration (e.g. Cisco LISP, automation)
- Replication traffic on a single segregated IP network
- Isolation of replication traffic from production
- Jumbo frames / MTU size requirements
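For the bandwidth item in particular, a rough estimate can be worked out from how much data changes and how quickly it must reach the other site. The sketch below is a back-of-the-envelope calculation only. The function name, the use of decimal gigabytes, and the 80% link efficiency figure are assumptions for illustration; real sizing should use measured change rates and your replication tool vendor's guidance.

```python
def required_bandwidth_mbps(changed_gb: float,
                            window_seconds: float,
                            efficiency: float = 0.8) -> float:
    """Estimate the link speed (Mbps) needed to copy `changed_gb` in the window.

    `efficiency` is an assumed fraction of the link usable for replication
    payload after protocol overhead; 0.8 is an illustrative default.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    megabits = changed_gb * 8000  # 1 decimal GB = 8,000 megabits
    return megabits / window_seconds / efficiency


# Hypothetical case: 50 GB of changes must be copied within one hour.
estimate = required_bandwidth_mbps(changed_gb=50, window_seconds=3600)
```

With these example inputs the estimate comes out a little under 140 Mbps. If the change rate is bursty, size for the peak period rather than the daily average.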
No matter how your business operates and whether your applications are hosted in the cloud or in your own facilities, a well-designed and well-implemented network is critical to providing you with the best-performing and most highly available systems. Think of it this way: you don’t build a house top-down, you build it from the ground up. The network provides the underpinnings for a solid infrastructure, which provides a solid base for high-performing applications.
Taking the time to research, plan, and design the appropriate infrastructure can help ensure you are prepared for a disaster. If this feels like a daunting task, you are not alone. There are people who have done this before! Reach out to friends and colleagues, search Google, or contact RSM for guidance.