VMware Disaster Recovery Planning: Essential Checklist

Author: Melinda Stanley

Drop the baggage and simplify your Disaster Recovery strategy

Sean Clark VCP 2, 3, 4 and VMware vExpert 2009

Executive summary

Planning and implementing a VMware disaster recovery (DR) plan is not a task to be taken lightly. If you jump in without some minimum planning, you'll end up with surprises that you, your boss or your budget won't welcome. Worse yet, if you skip key steps and neglect to collect key information before designing, you can end up with an unwieldy DR solution that doesn't meet the needs of the business.

This eBook serves as a checklist to guide you in creating a top-notch VMware disaster recovery plan. The checklist follows industry best practices for implementing complex technology solutions by taking a phased approach to instituting a DR plan. This approach includes the following phases:

• Assessment: Gather key requirements for the DR solution
• Design: Create a DR plan that meets business and technical requirements
• Deploy: Stand up the necessary infrastructure; install, configure and test the solution
• Manage: Test your DR plan as frequently as possible

This approach should be reapplied as your business requirements change, or to take advantage of technology advancements that can reduce costs and enhance DR capabilities. The result is a DR plan that is flexible enough to adapt with the times and your business.

Since we're talking about recovery of VMware environments, we will focus on leveraging their unique capabilities to the maximum degree. Those capabilities allow us to create the ultimate test-driven DR plan and can be a catalyst for moving to 100% virtualized environments. The focus of this eBook is the planning side of DR rather than the execution and operation of the DR plan. As such, we'll focus on the Assess and Design phases.

Assess
• Business impact analysis
• Determine RPO (recovery point objective) and RTO (recovery time objective)
• Understand your budget
• Understand application dependencies
• Automate VMware environment data collection

Design
• Virtualize stragglers
• Analyze resource requirements
• Design for easiest restore
• Decide on infrastructure configuration
• Test-drive DR plan

Business Impact Analysis

Let's be frank. As cool as server virtualization technology is, it doesn't provide your business with revenue, unless you are a company like VMware or Veeam. As an IT professional, your job is to design, develop and manage the technology systems that support the business' ability to turn a profit. Equally important is the ability to restore these systems as rapidly as possible in a disaster.

Before you begin spending money on your disaster recovery software, hardware and facilities, you need to know where to start. You should first assess your business and gain an intimate understanding of which processes, systems and data are most critical to the success and future survival of your company. This process is referred to as a business impact analysis (BIA). Without one, you open yourself to the risk of wasting resources or overprotecting assets of little value to your business. Worse yet, without a BIA, you may end up neglecting to plan recovery for key IT systems that your mission-critical systems depend on.

What's in a BIA?

A BIA can be as complex and time consuming as you want it to be. Complexity and time are also a direct function of the size and complexity of your business. Either way, all business impact assessments aim to accomplish the same fundamental tasks. For example, the following list was gathered from a great free BIA template published by the U.S. Centers for Disease Control and Prevention for the purpose of helping guide the development of a DR plan. In this template, the following main areas of assessment are addressed:

Key BIA activities
1. Identify critical business systems
2. Identify system resource dependencies
3. Identify key support personnel or teams
4. Estimate disruption impact
5. Determine resource recovery priority

Having this granular knowledge of the business impact of your critical systems will not only help you in a real disaster scenario, it will also help you test your preparation for the disaster. Knowing where to focus your DR planning efforts will help you streamline and prioritize your disaster recovery exercises, or tests.

The BIA is not just a technical system inventory. You'll also need to work with the business side to define RPO and RTO for key business services and the IT services they depend on. Setting the objectives gives DR planners and managers something to design and manage to. Without defining these key requirements for your disaster recovery planning, you risk spending too much on a higher level of data protection than is required. The other risk is that you don't provide enough protection for your critical resources and end up costing the business time and money in the event of a disaster. In the following section, we'll walk through why RPO and RTO planning matters.
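
As a rough illustration of how the five BIA activities might be captured in one record per system, here is a minimal Python sketch. The field names, the example systems and every figure in it are hypothetical; they do not come from the CDC template, and ranking by hourly impact is only one possible way to set recovery priority.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BusinessSystem:
    """One BIA entry (all field names are illustrative assumptions)."""
    name: str                 # 1. critical business system
    depends_on: List[str]     # 2. system resource dependencies
    support_team: str         # 3. key support personnel or team
    hourly_impact_usd: float  # 4. estimated disruption impact per hour
    rto_hours: float          # agreed recovery time objective
    rpo_hours: float          # agreed recovery point objective


def recovery_priority(systems):
    """5. Rank by highest hourly impact first; ties go to the tighter RTO."""
    return sorted(systems, key=lambda s: (-s.hourly_impact_usd, s.rto_hours))


# Hypothetical inventory, for illustration only.
systems = [
    BusinessSystem("file-share", ["active-directory"], "infrastructure", 200.0, 48, 24),
    BusinessSystem("order-entry", ["sql-db", "active-directory"], "apps", 5000.0, 4, 0.25),
    BusinessSystem("intranet-wiki", [], "infrastructure", 50.0, 72, 24),
]

for rank, system in enumerate(recovery_priority(systems), start=1):
    print(rank, system.name, f"RTO={system.rto_hours}h", f"RPO={system.rpo_hours}h")
```

Keeping RPO and RTO in the same record as the impact estimate makes it easier to see when a low-impact system has been given an expensive objective.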

Determine RPO and RTO

During your BIA, you should spend a good deal of time developing an intimate understanding of how your business makes money, and catalog the key resources and processes necessary to enable revenue creation. In some companies, this may be a single business process that translates to clear guidance on RTO/RPO requirements. In larger companies, with diversified products and services, you will likely catalog multiple use cases to address with different DR strategies. Regardless of the business size, the process of DR planning is similar, and we apply the same process to categorize the use cases we discover in our DR assessments.

The following are some example DR use cases. By reviewing these common use cases and identifying their RPO/RTO, you should be able to apply similar logic to your own business and gather these guiding requirements.

Educational Institution: Keeping costs low

On one end of the spectrum is a grade school with a fixed budget based on enrollment and no revenue tied to system uptime. The core product and service of the school is the quality of the children's education. In an environment like this, it's important to keep costs as low as possible while still providing a system that reduces the cost of recovery, which will likely call for third-party IT staff to help recover systems. Most systems can withstand some loss of data, and recovery needs to be timely but certainly not instant. Since schools can still operate effectively without IT, it is likely that an RTO measured in days to a week will result from this assessment. The RPO would likely be around 24 to 48 hours for most systems.

Accounting Firm: CDP during tax season

For most of the year, accounting firms do a steady amount of business and can tolerate a 24 to 48 hour RPO and RTO. But in the United States, February through April is the busy tax season. Overtime and packed daytime schedules are the reality. Data loss and system downtime are NOT an option. Near-zero RTOs and RPOs are required for this use case. For small and midsized firms, expensive active-active accounting system designs are not an option, and neither is expensive array-level replication. This is a perfect use case for Veeam's SmartCDP. SmartCDP is a near-continuous data protection (near-CDP) solution in which virtual machines (VMs) are continuously replicated to a safe site. This brings RPO down to around 5 minutes, and RTO can be as short as the time it takes the accounting firm's IT professionals and executives to declare a disaster and restart the affected VMs.

Healthcare: Ensuring quality patient care

In a healthcare setting there exists the perfect storm for DR planning. Today's healthcare IT systems can actually be a profit center for many organizations. These systems need to be functioning to allow profitable procedures like MRIs, ultrasounds or any of the cardiac-related procedures. The profit side of the equation is important, but it is not the only concern. In healthcare, patient care is king. If technology is not available to serve patient needs, then the profits are meaningless. Healthcare operates around the clock and pressures IT to deliver services at a very high level. These highly critical systems demand a low RTO to guarantee that patient care goals are met and that healthcare organizations can remain profitable through systems disasters. Data loss isn't necessarily a financial risk, unless lawsuits are considered, but a high quality of patient care dictates that the best patient health information be available to ensure continuously good care through shift changes and doctor rotations.

So, in most healthcare organizations, budget becomes the determining factor for what you'll end up choosing for RTO and RPO. Working within your fixed budget, you can establish RTOs and RPOs that are as low as possible during the times you require. Veeam Backup & Replication, with its ability to back up, replicate and provide near-CDP in one product, provides the flexibility to address all healthcare needs.

Understand Your Budget

Virtualized DR is much less expensive than traditional physical DR, but it is still an additional cost on top of your already sizable investment in virtualization software, supported server hardware and new storage systems. Some companies will have the budget freedom to design the "Cadillac" of DR solutions. The rest of us have to be very aware of our budget limitations and the factors contributing to them. Knowing your budget limit and some key strategies will allow you to optimize your DR plan to get the most capability out of your limited resources.

Budget purchases wisely

One recommendation for DR budget planning is to consolidate all disaster recovery options inside a single product so that you're not paying twice for two software products, two backup infrastructures and the associated operational costs that can sink you. Instead, plan to make a strategic investment in virtualized DR, go "all-in" with Veeam Backup & Replication, and drop strategies that leverage legacy components. This can reduce ongoing license and support costs as well as simplify operations, creating substantial cost savings.

Phased Budget

Ideally, you should propose a phased DR budget to the business or to your customer. In this method, you are allowed a fixed budget to conduct your assessment and initial planning, with the expectation that the outcome will be a more accurate budget estimate for the final solution. This method builds trust with the business or customer and assures them that you are not making these recommendations on a whim and that the DR plan is the product of due diligence. Once the assessment is complete and a preliminary design has been produced, you'll have a decision to make in determining how much you will ask for.

Communicate the Vision

Just as companies realized during their initial server virtualization efforts, you may have to spend money to save money with virtualized DR. Those first pioneering IT organizations realized that the sooner they virtualized their entire environment, the sooner they could start reaping the rewards of reduced power consumption, reduced server hardware cost, and greater flexibility in operations. Delaying the ROI of virtualizing your physical servers was not recommended, and the same is true of a virtualized DR plan. The longer you have to maintain a legacy DR solution alongside a virtualized one, the more risk your business accepts, the more costly DR exercises are, and the more operational costs IT departments take on to maintain two or more separate DR solutions.

At this point in the budget process, it's important to lay out the long-term vision for 100% virtualized DR and communicate the benefits of legacy-free disaster recovery. If business leaders understand the true value of virtualized DR, you should have success funding the project properly to realize the full benefits. A good rule of thumb is to shoot for the plan that creates the lowest TCO (total cost of ownership) over the next 3 to 5 years.
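
To make the TCO rule of thumb concrete, the comparison can be sketched in a few lines of Python. The two options and all dollar figures below are invented for illustration, and the model (upfront cost plus a flat annual operating cost) is a deliberate simplification that ignores discounting, growth and refresh cycles.

```python
def total_cost_of_ownership(upfront, annual_operating, years):
    """Simplified TCO: one-time cost plus flat yearly operating cost."""
    return upfront + annual_operating * years


# Hypothetical options; replace with figures from your own assessment.
options = {
    "legacy and virtualized DR side by side": {"upfront": 20_000, "annual": 45_000},
    "fully virtualized DR": {"upfront": 60_000, "annual": 25_000},
}

for years in (3, 5):
    for name, cost in options.items():
        tco = total_cost_of_ownership(cost["upfront"], cost["annual"], years)
        print(f"{years}-year TCO, {name}: ${tco:,}")
```

With these made-up numbers the option with the higher upfront cost comes out ahead by year three, which is the "spend money to save money" argument in spreadsheet form.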

Executive Champion

Virtualization can be a complex topic for business leaders to understand. Throwing DR planning on top of that can sometimes put non-technical leadership into a corner they are uncomfortable with. To make sure you get the DR plan your company needs, you'll need to understand the concerns and realities of the business, and be able to clearly communicate the solution's benefits to the company. If you are dreading these conversations, consider identifying an executive champion to get involved. This champion is usually an executive who is familiar with IT but accustomed to planning with other executives and speaking at a technical level they can understand. For companies without formal technology executive positions like a chief information officer (CIO), an executive champion dedicated to the DR planning project can be a critical component.

Understand Application Dependencies

It's crucial to understand the dependencies of core applications when creating your DR plan. Does an application depend on an external database on another VM? What restore order is needed to test the application? And even today, we still need to be concerned about which servers are still physical. These dependencies are critical to catalog and plan for. No matter how well you protect the critical VMs, if you forget the weakest link in the dependency chain, you might as well not have protected any systems.
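
One way to answer the restore-order question is to record the dependencies as a graph and let a topological sort produce an order in which every system comes after the things it depends on. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the system names and the dependency map are hypothetical examples, not a recommended layout.

```python
from graphlib import TopologicalSorter

# Each key depends on the systems in its set (hypothetical environment).
dependencies = {
    "dhcp": set(),
    "dns": set(),
    "active-directory": {"dns"},
    "sql-db": {"active-directory"},
    "app-server": {"sql-db", "active-directory"},
    "web-frontend": {"app-server"},
}

# static_order() yields every system after all of its dependencies.
# It raises graphlib.CycleError if the map contains a circular dependency.
restore_order = list(TopologicalSorter(dependencies).static_order())

for step, system in enumerate(restore_order, start=1):
    print(step, system)
```

A circular dependency surfacing as an error here is useful in its own right: it points to a pair of systems that cannot be restored cleanly one after the other.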

Active Directory, DNS and DHCP

Everyone has some critical applications that are core to the organization's success, and many people look to protect these applications first. This is a mistake that can be avoided by continuing to follow the chain of dependencies down to the lowest level. Assuming you have basic physical facilities and network access accounted for in the DR plan, you need to make sure that core infrastructure services are next on the plan. Like buying a car without wheels, restoring your critical applications without these core infrastructure services will have you going nowhere fast.

Core Infrastructure Services Needed for DR
• IP address assignment (DHCP, etc.): Local to each site and required for any communication on the network
• DNS: Ensures that servers and PCs understand how to reach each other as well as Internet resources
• Active Directory: Provides directory service to secure access to recovered systems

These services can be provided in a number of ways, but in the majority of small and mid-sized companies, we are talking about Microsoft Windows Server VMs running Active Directory. Before Active Directory servers providing DHCP and DNS can be recovered, you need a strategy in place to guarantee their recovery. If Active Directory servers are restored incorrectly, you will waste precious hours conducting manual recovery steps to restore function to this critical service. A trouble-free restore actually starts with a VM backup and replication tool that fully supports VSS (Volume Shadow Copy Service) for both backup and restore. Veeam Backup & Replication has provided this functionality since its first release. This VSS-aware technology is the key layer of defense for situations in which you are unable to replicate Active Directory to your disaster recovery facility. Example scenarios include single-site businesses without the budget for a DR site, or businesses that choose to maintain DR contracts only with cloud providers and don't want to maintain active VMs that incur monthly fees.

Mission Critical Applications

Mission critical applications are the lifeblood of the company and consist of stateless application server VMs as well as VMs containing persistent data like databases or file system objects. With traditional legacy backups, it can be a terrible chore to document and stay current on all the specific files and folders inside a VM that need to be backed up, especially when application upgrades may create new files and folders not included in the original backup configuration. Focusing on entire VMs for a virtualized DR plan shifts attention away from micro-managing data protection and toward identifying the higher-level service components that need to be protected and restored.

Similar to the Active Directory backups mentioned earlier, if your mission critical database is running SQL Server, you need to leverage VSS-aware products to ensure application consistency is maintained during backup and replication. Without this key technology, you will be left with crash-consistent databases, which will slow your recovery time after a disaster or during disaster exercises. Microsoft SQL Server provides what's called a VSS writer with its products, which is one of the key pieces that Veeam can leverage with its own advanced VSS requestor technology to ensure application-consistent backups.

However, if you are not running a database or OS that supports VSS, you will have to take other precautions. In the non-Windows world, there are various ways to ensure database and application consistency during backup and replication windows. You will need to identify these non-VSS-enabled databases and work with good database administrators (DBAs) who understand the application requirements, so you can create a backup and replication method that guarantees application consistency. This may require pre- and post-scripts to be developed, purchased or borrowed from the community or a vendor. These scripts can help guarantee recovery consistency by shutting down the database service and freezing I/O on the mission critical VM until Veeam can initiate the VM snapshot.
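
As a hedged sketch of what such a pre- and post-script pair might look like on a Linux VM, the Python below stops a database service, flushes and freezes its filesystem before the snapshot, then reverses those steps afterwards. The service name `exampledb` and the mount point are hypothetical, the use of `systemctl` and `fsfreeze` is an assumption about the guest OS, and how the scripts get invoked around the snapshot depends on your backup product. The sketch defaults to a dry run so nothing is executed by accident.

```python
import subprocess

DB_SERVICE = "exampledb"           # hypothetical systemd service name
DATA_MOUNT = "/var/lib/exampledb"  # hypothetical dedicated data mount point


def pre_freeze_commands(service, mount):
    """Commands to run before the VM snapshot is taken."""
    return [
        ["systemctl", "stop", service],   # stop the database cleanly
        ["sync"],                         # flush dirty pages to disk
        ["fsfreeze", "--freeze", mount],  # block new writes to the data volume
    ]


def post_thaw_commands(service, mount):
    """Commands to run once the snapshot has been initiated."""
    return [
        ["fsfreeze", "--unfreeze", mount],
        ["systemctl", "start", service],
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them (as root) when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raise if any step fails


run(pre_freeze_commands(DB_SERVICE, DATA_MOUNT))
run(post_thaw_commands(DB_SERVICE, DATA_MOUNT))
```

Freezing only a dedicated data mount, rather than the root filesystem, keeps the script itself able to run the thaw step. Work with your DBA to confirm whether a full service stop is needed or whether the database offers its own quiesce mode.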

Stateless Application Servers

Not all servers contain persistent data, but all servers have a rebuild time. A rebuild can be as simple as an automated deployment or as large as an effort involving significant time and labor costs. These stateless application servers can be important servers to protect, but they will not be protected in the same way as busier servers with critical business data being created daily. These servers can be protected less frequently, with backups triggered only during critical periods of change. For instance, prior to application upgrades or major software changes, you could initiate a whole-VM backup and take small, quick incremental backups for a period of time to ensure you have a known good state to restore to. Then, shortly after the software change has been determined to be successful, you would take another series of full and incremental backups to provide a known good state of the VM to restore to if disaster occurs after the change. Once you have proven you have good backups of these stateless VMs, the schedule can be relaxed to save bandwidth and I/O. As long as you continue to test restores of the entire application stack, along with its dependent data tier, this can be an option for certain DR requirements.

Protect Your Deployment Systems

Some businesses choose to invest heavily in their ability to quickly re-install application stacks on stateless VMs rather than protecting more data than is technically needed. This can be a great strategy for scale-out workloads like Terminal Server-based application servers, Java application servers or web servers. In these situations there is a large ROI (return on investment) from automating the provisioning of complex server configurations and using it often for upgrades, server refreshes or scale-out operations. This doesn't mean you're off the hook, though. You'll need to follow the dependencies and identify the systems responsible for re-deploying these stateless VMs to make sure that they are protected and available in the DR site. This exercise will raise some interesting questions about what should and should not be re-deployed as part of a disaster, ultimately calling into question the value of end-to-end provisioning systems. In short, end-to-end provisioning strategies should also include redeploying tested whole-VM images as well as whole application stacks. Your ultimate decision will come down to numbers: how many VMs are based on a common configuration and how quickly they need to be restored in the event of a disaster.

Automate VMware Environment Data Collection

VM inventory automation can help quickly and accurately collect information on the virtual environment that's invaluable to designing your DR plan. Whether you have 25 VMs or 2,500, automating the collection of VM information can jumpstart you on the way to accurately designing your DR plan. There are a multitude of tools available to assist with this data collection task. Since we're planning for a virtualized DR plan, I'm a big fan of leveraging virtualization management software from Veeam to help ease this task.

Veeam Reporter

Veeam Reporter is a great tool that gives you insight into your environment very quickly. There is both a free version and a full version of Veeam Reporter. Both can be invaluable in quickly collecting inventory information from existing VMware environments. This information helps document your primary virtual datacenter, and can be a guide for setting up the disaster recovery site. Output from this tool includes spreadsheets, Word documents and even Visio diagrams of your VMs, ESX(i) hosts, VMware datastores and networks. Veeam Reporter can grab a lot of data in just a few minutes.

Veeam Monitor

Whether you are looking to build out the bare minimum DR infrastructure or trying to determine at what point your DR solution becomes excessive, you'll need good statistics on current resource utilization to properly size your DR infrastructure. Veeam Monitor can be used to assist you with this resource utilization collection. Again, there is a free version, but the full version is the way to go if you would like to continue charting virtual infrastructure utilization after the planning phase. When measuring virtual infrastructure utilization, we are trying to capture a few key areas that will determine the ultimate cost of the solution. CPU and memory utilization are important for sizing the servers required for the recovery site. Here we'll focus on average CPU GHz used and GB of memory consumed.

Storage Consumption with Veeam Backup & Replication

Storage requirements can be complex to discover. Although Veeam Monitor can report on how much storage you are using today, that figure doesn't translate directly into the needs of a virtualized DR plan: it doesn't give you an easy way to estimate your storage requirements for backup and replication of whole VMs. To accurately assess storage needs, consider conducting an actual proof-of-concept (POC) with Veeam Backup & Replication. This will let you learn how the product works in your environment and also give you real-world values for backup storage needs. Anyone can determine the cost to store full backups, since it is just a multiple of the original disk, but incremental replication passes can be tricky. By running Veeam Backup & Replication for a few days on actual workloads, you'll record real-world values for the daily change rate and gain the statistics required to size your DR storage systems. This daily change rate of the VM data will be critical in determining future replication bandwidth requirements as well. So, with a simple POC of your possible DR solution, you'll be able to gather the statistics needed to properly size two of your most expensive DR resources: storage and network bandwidth.
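
The daily change rate described above can be estimated from POC results with simple arithmetic. The following Python sketch assumes you have noted the size of each daily incremental pass and the total size of the protected VMs; the function name and the sample figures are hypothetical, and incremental sizes taken after compression or deduplication will understate the raw change rate.

```python
def daily_change_rate(protected_gb, incremental_sizes_gb):
    """Average daily incremental size as a fraction of protected data."""
    if not incremental_sizes_gb:
        raise ValueError("need at least one day of incremental data")
    average_incremental = sum(incremental_sizes_gb) / len(incremental_sizes_gb)
    return average_incremental / protected_gb


# Hypothetical POC: 2,000 GB of VMs and five days of incremental passes.
rate = daily_change_rate(2000, [80, 120, 100, 90, 110])
print(f"average daily change rate: {rate:.1%}")
```

A few days of data gives only a rough average. Month-end or seasonal peaks, such as the tax season example earlier, may need to be measured separately.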

Virtualize the Stragglers

If you still have physical servers, it's time to make the switch. The benefits of virtualized DR are well known and have been written about and practiced for more than 5 years. No matter how good your DR solution for physical servers is, it can't come close to the capabilities and cost of virtualized DR. But many workloads have avoided migrating to virtual for one reason or another, which throws a wrench into the DR planning machinery.

VMware Is Best DR Platform for Business Critical Workloads

You probably have a large portion of your environment virtualized with VMware, but it's likely that you have some business critical systems that remain on physical systems because of their importance. Unfortunately, placing your business critical systems on physical servers because they're important is a mistake. If they are that critical to your business, don't you think that providing the ultimate DR solution for them should be your organization's starting point? These critical systems should be virtualized to experience the full benefits of virtualized DR, but not every application owner and DBA fully understands our virtualized DR zealotry. You'll need to communicate the benefits inherent to today's virtualized DR to these application owners to win them over.

Virtualized DR Benefits
• Snapshots: Be able to test critical patches and roll back if failure occurs
• Restore entire server image (operating system [OS], application, data) to any hardware without messing with drivers
• Refresh hardware with zero downtime
• Restores can be automated and tested often
• Files and other application items can be restored from VM backup images
• VSS integration can ensure application consistency

VM Performance Is No Longer a Barrier

Stubborn DBAs and critical-application owners may love to take you up on DR capabilities alone, but they still have one trump card in their hand: VM performance concerns. Five years ago, they might have been justified in playing that trump card. But with advances in virtualization software and hardware purposely built for virtualization, these concerns are no longer warranted.

• VMs can now scale up to 8 processors with 255 GB of RAM
• VMs are capable of providing north of 300,000 IOPS when configured with a powerful enough backend storage system
• 10 Gb networking erases concerns of VMs being starved of bandwidth by other VMs
• The next vSphere version, due out in 2011, will likely increase these limits further

Yes, VMs can scale larger than ever before, but without proper VMware capacity planning practices, the benefits are negated. Planners still need to ensure they have the right consolidation ratios given the virtualized workloads and underlying hardware resources. Having the tools to properly plan these consolidation ratios, monitor utilization and give application owners relevant virtual hardware performance statistics is critical. Tools like Veeam Monitor can provide the performance view at the hypervisor level and provide OS-level statistics in a single view. Having the right plan and the right visibility into performance will help gain trust and drive more use of virtualization where you need it most.

The Singular Option: 1 VM Per Server

All the management tools and virtualization know-how may still not be enough to gain permission to virtualize critical servers. Many times these mission critical workloads have seen duty on non-x86 server platforms in the past, such as an AS/400 or a proprietary RISC-based platform. These platforms were often many times more expensive than a physical x86 server, so their owners are not easily swayed by server consolidation benefits. With today's powerful virtualization-enabled hardware and VMware software providing the DR platform, you have a singular option available to meet the performance and DR requirements of the critical workload: dedicating a single host server to a critical VM. Who says that just because you are running VMware on your servers you have to maintain hero consolidation ratios? Putting performance (and egos) first, and not being afraid to consider limited use of 1:1 consolidation ratios, will help you win over the stodgiest application owner. Over time, you may be able to win their trust to allow higher consolidation ratios as long as performance requirements are met. The singular option is a compromise, but when compared to the lack of DR features for physical servers or the cost of the proprietary past, it can make good sense.

On-Demand Sandbox Testing

Once virtualized, another unique benefit you can offer owners of business critical applications is the ability to troubleshoot on mirror-image systems in real time without disrupting the production server image. How is this possible? Veeam Backup & Replication offers an On-Demand Sandbox functionality. This sandbox functionality allows you to boot VMs directly from an NFS server presented by the Veeam backup host, without requiring a time-consuming restore or provisioning extra storage to the vSphere environment. This capability allows problems to be resolved quickly, since administrators will have exact copies of production systems to test with, restart services on or try fixes against, without the fear of disrupting operations on the live systems. This enables the system administrator's version of the Hippocratic Oath, so administrators "do no harm" to production systems in the process of fixing them. This is no replacement for good system deployment capabilities, but there are plenty of instances where, due to the number of systems deployed, it doesn't make sense to spend time automating the deployment. In this case, having the entire VM backed up for redeployment is a big benefit for developers.

Analyze Resource Requirements

It should go without saying, but DR planning is about preparing for the loss of your primary datacenter. This means you need to size and budget for a recovery site capable of meeting your requirements in a disaster scenario. During the assessment phase of our DR planning, we recommended automating the collection of key virtual infrastructure inventory as well as resource utilization statistics. It's now time to use that information to properly size your DR solution.

Compute

It's best to start with compute statistics, that is, how much CPU and memory are utilized. If you're planning to restore every single server and maintain identical capacity, this exercise is easy: you'll duplicate your production environment at the DR facility. In more budget-minded organizations, you're going to analyze the statistics of production workloads and identify only the critical workloads that need to be running, in order to estimate the DR infrastructure required in a disaster. These utilization statistics translate into CPU sockets required.
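
As a rough sketch of how utilization statistics might translate into CPU sockets and hosts, the Python below divides measured demand by usable capacity. The core counts, clock speed, host memory size and the target utilization ceilings are all hypothetical planning assumptions, not vendor guidance.

```python
import math


def sockets_required(avg_ghz_used, cores_per_socket, ghz_per_core,
                     target_utilization=0.7):
    """Sockets needed to carry the measured CPU demand below a utilization ceiling."""
    usable_ghz_per_socket = cores_per_socket * ghz_per_core * target_utilization
    return math.ceil(avg_ghz_used / usable_ghz_per_socket)


def hosts_for_memory(gb_consumed, gb_per_host, target_utilization=0.8):
    """Hosts needed to hold the measured memory footprint with headroom."""
    return math.ceil(gb_consumed / (gb_per_host * target_utilization))


# Hypothetical critical workloads: 60 GHz average CPU and 300 GB of memory.
print("sockets:", sockets_required(60, cores_per_socket=6, ghz_per_core=2.5))
print("hosts for memory:", hosts_for_memory(300, gb_per_host=96))
```

Whichever of the two results implies more hardware is the binding constraint. Averages also hide peaks, so it is worth repeating the calculation with peak-period figures.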

Storage

When analyzing the storage requirements, we want to make sure that we have enough raw storage space to store what is necessary. But we also want to know what kind of performance characteristics are required to drive your primary workloads.

For storage space requirements, start with the total gigabytes of all the VMs that you plan to recover to the DR site. Add to that the number of full backups or replicas you'll want to keep and the volume of daily incremental backups. Generally speaking, DR storage is based on the VM's configured memory and storage allocation, so calculations can be derived from the statistics gathered with Veeam Reporter.

In addition to the raw space, it's important to understand the performance required of your storage systems. This is usually measured as IOPS and storage bandwidth. These two statistics describe how active your VM storage is, and whether you can get by with SATA drives, SAS drives, or whether you'd be a good candidate for an auto-tiered storage system with enterprise flash or SSDs for tier 0, SAS for tier 1 and SATA for tier 2. Many people make the mistake of buying large-capacity SATA for DR because they can save on storage purchase costs. However, when it comes time to rely on that storage in a disaster, the availability of their systems is in jeopardy due to performance. It's understandable to want to save money on your DR, but for an investment of this size, you need to make sure you're not shooting yourself in the foot by taking on too much risk. Analyze the statistics and budget accordingly.
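
The space calculation described above (recovered VMs, plus retained full copies, plus daily incrementals) can be sketched as follows. The retention period, number of full copies, change rate and headroom factor are hypothetical inputs, and the model ignores compression and deduplication, which would reduce the real figure.

```python
def dr_storage_gb(vm_total_gb, full_copies, daily_change_rate,
                  retention_days, headroom=0.2):
    """Raw DR capacity: recovered VMs + full copies + incrementals, plus headroom."""
    fulls = vm_total_gb * full_copies
    incrementals = vm_total_gb * daily_change_rate * retention_days
    return (vm_total_gb + fulls + incrementals) * (1 + headroom)


# Hypothetical: 2,000 GB of VMs, 2 full copies, 5% daily change, 14 days kept.
needed = dr_storage_gb(2000, full_copies=2, daily_change_rate=0.05,
                       retention_days=14)
print(f"estimated DR storage: {needed:,.0f} GB")
```

This covers capacity only. The IOPS and bandwidth figures discussed above still decide which drive tier that capacity should sit on.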

Network We talked about calculating daily change rate of data earlier. If you decide to use replication over a secured VPN connection across the Internet or a leased WAN circuit, you'll want to know how much data will need to be moved across the network in a single day or replication window. This will help you forecast how much Internet bandwidth you need to be successful and whether you need to upgrade your WAN circuits. If your options for Internet bandwidth are limited, this analysis will be crucial to understanding whether you are a good candidate for replication or whether whole-VM backups to tape-backed disk archives are a better option for you.
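
As a rough illustration of that forecast, the following Python sketch converts a daily change rate and a replication window into the average link speed required. The function names, the 70% link-efficiency figure and the sample numbers are assumptions for illustration. Real throughput also depends on latency, packet loss and any compression applied.

```python
def required_mbps(daily_change_gb, window_hours, efficiency=1.0):
    """Average link speed (megabits per second) needed to move the
    changed data within the replication window."""
    if window_hours <= 0 or not 0 < efficiency <= 1:
        raise ValueError("window_hours must be > 0 and 0 < efficiency <= 1")
    megabits = daily_change_gb * 8000   # 1 GB = 8,000 megabits (decimal units)
    seconds = window_hours * 3600
    return megabits / seconds / efficiency


def can_replicate(daily_change_gb, window_hours, link_mbps, efficiency=0.7):
    """True if the link can carry the change data in the window, assuming
    only `efficiency` of the nominal bandwidth is usable."""
    return required_mbps(daily_change_gb, window_hours, efficiency) <= link_mbps


# Example: 50 GB of daily change, 8-hour window, 20 Mbps circuit.
print(round(required_mbps(50, 8), 1), "Mbps at 100% efficiency")
print("Replication feasible:", can_replicate(50, 8, link_mbps=20))
```

If `can_replicate` comes back false for every circuit you can afford, that is a signal to look at WAN acceleration or at shipping whole-VM backups off-site instead.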

WAN Acceleration Need In analyzing your network connectivity, it's critical that you understand your bandwidth, the latency between your primary site and DR site, and the reliability of the connection. If your connections suffer from high latency and packet loss, you may not be able to meet your backup windows and, consequently, your RPOs. Products like HyperIP from Netex offer WAN acceleration technology that is purpose-built for accelerating large data transfers over network links with high latency and packet loss. If you are a Veeam customer, they even offer a 1-year free trial version of HyperIP to allow you to thoroughly kick the tires before purchasing.

Design for Easiest Restore It is imperative that a good DR plan design for the easiest restore possible. In a disaster situation, there can be a lot of confusion, different environments, and possibly a different or missing workforce. A disaster is not the time to have many complex, manual steps to follow while under the stress of knowing that your business' ability to pay your next paycheck may hang in the balance. Simplicity is king in a disaster situation. You can't count on having your best DR expert available to coordinate the recovery. You should plan for other personnel to be coordinating the recovery while you or your DR expert is "stuck" on a Caribbean island with the cell phone turned off. Being successful in this situation will require your skeleton DR staff to have plenty of DR exercises under their belt and the simplest restore procedures possible.

VM Recovery Steps Comparison

Legacy Recovery Method
1. Provision empty VM
2. Reinstall a fresh system at the DR site
3. Patch the OS
4. Install application binaries and other dependencies
5. Install backup agent and then proceed with restore of unverified application data
6. Configure application to work with recovered data
7. VM is ready to be restarted and application verified for first time

OR

Whole-VM Recovery with Veeam
1. Restore whole-VM
2. Power-on VM that was preverified with SureBackup

Note: Replication would be a single step to power on the VM.

Don’t Reinvent the Wheel, Restore the Whole VM In a VMware DR plan, you want to utilize the most powerful features of VMs: encapsulation and hardware independence. VMs are just a group of files that can be copied to other ESX(i) servers (regardless of hardware vendor), powered on and returned to service, sometimes without any further modification. For this reason, whole-VM protection methods are superior to legacy methods that rely on multiple manual steps to create a workload from start to finish. From the VM recovery steps comparison chart, it's pretty easy to see why whole-VM recovery is preferred.

Using legacy protection measures on virtualized workloads introduces complexity, cost and risk into your DR plan that your company can’t afford. By restoring the whole VM you drastically cut the number of steps required and you open the door to testing and verifying the recovery of the VM before you need it.

Start with Replication There are two general methods to accomplish whole-VM recovery: VM backup and VM replication. The first requires you to back up the entire VM to some external storage media or a replicated file system. Then, when recovery is required, you restore the VM to the recovery site ESX(i) servers through a simple file copy. Veeam Backup can make this process slightly quicker in that it offers instant-on features allowing you to present an NFS export to the ESX(i) server and boot a VM directly from its backup file. Eventually you would need to perform a Storage vMotion of the VM to primary storage or do a full cold restore of the VM from backup media or a file system to the recovery VMware environment. The other recovery method is to replicate the entire VM to the recovery site ESX(i) servers so that the replica VM is already pre-staged and ready to boot. To recover the VM, you power it on. As simple as VM backup is, one-step recovery with VM replication is very hard to beat, and it is almost impossible to provide a better RTO unless you call active-active geo clustering a "recovery technique." When designing for easiest restore, VM replication has to be at the top of your list of tools to consider.

Decide on Infrastructure Configuration After all the interviews, data collection and analysis, you'll eventually have to chart a direction and make some decisions. You'll have to decide on a final configuration for hardware, software, off-site data transport method, and recovery processes. There is no one right way to do this, but armed with business requirements for recovery, application dependencies, and your budget guidelines, you should have enough information to start making decisions.

Servers Any x86 server hardware will do here, but the question is more about what kind of capacity do you require in a disaster and what's the most cost effective way to meet that need at the DR site? Those 5-year-old 2U rackmount servers with 4 total processor cores and 16GB of RAM might do okay in a pinch for a small portion of your DR environment, but only if you don't recover the whole environment. Although those servers might be free, they don't look so good when you're paying for rack space and power for dozens of servers at a co-location facility. It may be cost advantageous to purchase new servers that have 10 times the capacity which can reduce DR licensing costs for Veeam and VMware while cutting your physical space and power requirements by a factor of 10. Whatever your decision, make sure you provide for adequate capacity based on real world measurements from your production VMware environment and guided by the Business Impact Assessment (BIA).

Storage If you need vMotion and High Availability at the DR site, you'll need to invest in shared storage to go with the VMware ESX(i) servers that you'll be replicating to or restoring VM backups to. NAS, iSCSI and Fibre Channel are all valid options. If you are a small business or a small remote office, shared storage for VMware may not always be possible and you may be required to use local storage contained within the recovery servers. Although these setups aren't as efficient to manage in a production environment, they can be good enough in a disaster to allow your business to provide revolutionary DR capability at a bargain price. In configurations with locally attached storage, you'll be happy to know that Veeam Backup & Replication can support that option as well since it can write to any VMware datastore visible to the ESX(i) server.

Network We talked earlier about network considerations. Basically, there is enough network bandwidth or there is not. There either is high latency or there is not. If you have the budget, make the investment in high bandwidth links between your recovery site and your primary datacenter. This allows the greatest reliability and lowest operational cost for your backups and replication since no error-prone manual or physical methods are required to move data to the recovery site. Whether it is due to budget-related or geography-related limitations, not everyone has the network bandwidth available to replicate critical assets. That's why the old adage remains true, "Never underestimate the bandwidth of a van full of tapes driving down the highway." Your network realities may dictate a whole VM backup to disk or tape that is then trucked off-site for safe keeping or for test restoration at the DR site. In these situations you will be sacrificing the ultimate RTO from the start, so it's not as important to have all servers racked, stacked, powered and ready to go. You might even consider alternate means for provisioning server resources in these scenarios.

DR in the Cloud? In the case where replication is not an option and you have VM backup images that can be restored to any ESX(i) servers in the world, why not restore to the cloud? There are countless VMware hosting providers available today that can rent you resource pools or whole VMware environments. Rather than investing in expensive, duplicate datacenter locations that will only be used in the unlikely event that an actual disaster occurs, you can instead bank that money and only pay a small portion in the event a disaster happens or in the event you'd like to test your recovery. If you do decide to move forward with restoring to a VMware hosting provider, you may need to do some advance planning on the contract side to help speed your recovery if needed. Although we're getting closer to the dream world that allows you to whip out the company credit card and spin up a DR site in minutes, it's more likely that you'll want to sign a contract in advance in order to get some guarantee that the resources you'll require will be available should you declare a disaster. Of course this insurance will cost you, but it will be much less than if you purchased the resources full time or if you stood up your own DR site.

Test-Driven DR Plan In the software development world, a popular software development process is test-driven development (TDD). In TDD, developers first create automated unit tests that will only pass successfully if the new piece of code under development fulfills all criteria. By writing the test first and then developing the code, quality is improved since code can’t be released until the unit tests pass successfully. This process has proven very successful for software developers, and a derivation of the process can now be applied to virtualized DR. This derivation can be called test-driven DR and it turns traditional DR planning on its head by planning for the restore and verification of the restore before you plan to do your first backup.

Design with Testing in Mind It takes more than petabytes of shiny deduplicated, compressed backups to save the day in a disaster situation. If you can't restore successfully and quickly from those fancy backups, then what was the point? With today's VMware virtualized systems, there is no longer an excuse to not test your recovery as often as possible. If replica VMs are ready to boot in a test recovery environment, you are just a PowerCLI script or web service call away from creating custom workflows to orchestrate the recovery of your protected VMs. By building upon what you learned with application dependencies and target RTOs, you can create custom workflows that will bring up your protected VMs in the order required to test your ability to restore. And since we're talking about virtualized environments, you can easily adjust the VMware networking with a simple web service call to ensure that recovered VMs are placed in an isolated network that you prepared in advance. Your functionality is only limited by your time and scripting ability. However, not everyone is a programmer or has the time to create these workflows and verification scripts. Veeam SureBackup is the solution to help programmers and non-programmers create a foundation for their test-driven DR plan.
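
The ordering part of such a workflow can be sketched in a few lines. The example below is Python rather than PowerCLI and only computes the power-on order from a dependency map. The VM names and dependencies are hypothetical, and actually powering on the VMs and moving them to an isolated network is left to whatever tooling you use.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def recovery_waves(depends_on):
    """Group VMs into power-on 'waves' so that every VM starts only after
    everything it depends on has started.

    depends_on maps each VM name to the names it requires.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # VMs whose dependencies are up
        waves.append(ready)
        sorter.done(*ready)                 # mark this wave as started
    return waves


# Hypothetical dependency map gathered during the assessment phase.
dependencies = {
    "dns": [],
    "ad": ["dns"],
    "sql": ["ad"],
    "file": ["ad"],
    "app": ["sql", "ad"],
    "web": ["app"],
}

for number, wave in enumerate(recovery_waves(dependencies), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

VMs within the same wave have no dependencies on each other, so they can be started in parallel to shorten the recovery time.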

SureBackup Recovery Verification Is Test-Driven DR With the SureBackup functionality in Veeam Backup & Replication, you can verify the recoverability of every VM backup every time. This recovery verification of individual VMs is the essence of test-driven DR planning since you are designing tests that will fail until you properly design and execute your automated backup and restore system. As in TDD, if a SureBackup verification test fails, you will reconfigure your recovery or backup until the verification passes all tests. Since it’s all virtual and able to automatically run within an isolated network, you can run these DR tests repeatedly until you find the hidden dependency, incorrect IP addressing, or out-of-order recovery step. This cycle of DR plan refinement can verify your entire DR plan within a few short days for small DR plans to weeks for more complex DR plans.
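
To illustrate the test-first idea in general terms, here is a small Python sketch of a verification harness. It is not how SureBackup works internally; the function names, the host addresses and the port numbers are hypothetical. The point is only that the checks are written before the recovery is known to work, and the DR plan is refined until none of them fail.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def verify_recovery(checks):
    """Run each named check and return the names of those that failed.

    checks maps a description to a zero-argument callable that returns
    True on success. A check that raises is counted as a failure.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical checks against VMs recovered into an isolated test network.
dr_checks = {
    "DNS answers on port 53": lambda: port_open("10.99.0.10", 53),
    "SQL Server listens on 1433": lambda: port_open("10.99.0.20", 1433),
    "Web front end serves HTTP": lambda: port_open("10.99.0.30", 80),
}

if __name__ == "__main__":
    failures = verify_recovery(dr_checks)
    print("DR test passed" if not failures else f"DR test failed: {failures}")
```

Each failure points to something to fix in the plan, such as a hidden dependency or an incorrect IP address, after which the same checks are run again.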

Test-Driven DR Enables Continuous Testing With traditional DR exercises leveraging legacy DR solutions, businesses are lucky if they have the time and budget to test their DR plan annually, let alone resolve all outstanding issues with the recovery. The technology is available to make test-driven DR planning a continuous process for your virtualized environment. You should work toward a goal of near-daily automated recovery verifications to bulletproof your DR plan. Limiting yourself to annual or quarterly DR tests is a relic of DR planning past that no longer applies. When creating your VMware DR plan, be sure to not only eliminate legacy technologies of the past, but to scrap the legacy processes as well. Freeing your mind of yesterday’s DR baggage will help you embrace the possibilities of virtualized DR and fully experience its benefits for your business.

About the Author

Sean Clark VMware vExpert

Sean is a ten-year IT veteran with a background in software development, database administration, security coordination, and IT management. The last five years, he has focused on developing his expertise in VMware virtualization and surrounding technologies. He has kept current VMware Certified Professional (VCP) status on VI 2.5, VI 3.5 and vSphere 4. In 2009 Sean was awarded VMware vExpert status, one of 300 globally to receive the award recognizing their contribution to the virtualization community. Since then, Sean has been an active member of the virtualization community as a notorious Twitter contributor with the handle of @vSeanClark, as co-instigator of the popular vmunderground.com community party at VMworld, and as a random blogger at http://seanclark.us. He has provided guidance on virtualization strategy to businesses of all sizes and from all industries, and is currently a virtualization consultant with TEKsystems working on a long-term cloud computing project for a Fortune 500 company.

About Veeam Software Veeam Software, an Elite VMware Technology Alliance Partner, develops innovative software to manage VMware vSphere®. Veeam vPower™ provides advanced Virtualization-Powered Data Protection™ and is the underlying technology in Veeam Backup & Replication™, the #1 virtualization backup solution. Veeam nworks extends enterprise monitoring to VMware and includes the nworks Management Pack™ for VMware management in Microsoft System Center and the nworks Smart Plug-in™ for VMware management in HP Operations Manager. Veeam ONE™ provides a single solution to optimize the performance, configuration and utilization of VMware environments and includes: Veeam Monitor™ for easy-to-deploy VMware monitoring; Veeam Reporter™ for VMware capacity planning, change management, and reporting and chargeback; and Veeam Business View™ for VMware business service management and categorization. Learn more about Veeam Software by visiting www.veeam.com.

NEW Veeam Backup & Replication™

vPower enables these game-changing capabilities in Veeam Backup & Replication v5:

• Instant VM Recovery: restore an entire virtual machine IN MINUTES by running it directly from a backup file
• U-AIR™ (Universal Application-Item Recovery): recover individual objects from ANY application, on ANY OS
• SureBackup™ Recovery Verification: automatically verify the recoverability of EVERY backup, of EVERY virtual machine, EVERY time

To learn more, visit www.veeam.com/vPower

VMware Disaster Recovery Planning Essential Checklist DR planning in a VMware environment requires old-fashioned DR planning fundamentals but also requires fully leveraging virtualization’s unique characteristics. This checklist of 10 proven DR planning activities provides you a jumpstart towards an award-winning VMware DR plan.

Conduct a Business Impact Analysis (BIA).

A BIA helps you identify critical business systems, their IT and human dependencies, and an estimated disruption impact to your business. You can then determine which applications are most important.

Know your recovery point objective (RPO) and your recovery time objective (RTO).

Develop an intimate understanding of how your business runs and catalog the key resources and processes necessary to enable revenue creation. Translating the findings from the BIA into RTO/RPO requirements for each application helps you focus your resources where they are needed most.

Understand your budget.

Virtualized DR is much less expensive than traditional physical DR, but it is still an additional cost on top of your already sizable investment in virtualization software, supported server hardware and new storage systems. Avoid the expense of maintaining both legacy and virtualization-aware backup systems by migrating to all virtual DR.

Understand application dependencies.

Most applications have dependencies external to the virtual machine (VM) that they run on. In a disaster, it’s critical to have cataloged these dependencies because you will have to recover each one to restore end-to-end functioning for that application. Start at the base infrastructure services like DHCP, DNS and Active Directory. But don’t forget to account for file shares, databases or other non-virtualized servers recovered through legacy means.

Automate VMware environment data collection.

VMware inventory automation can quickly and accurately collect information on the virtual environment that’s invaluable to designing your DR plan. Tools from Veeam can help ease the task. Veeam Reporter can catalog the configuration of the VMware environment, even providing Visio diagrams to reference. Veeam Monitor can provide the performance statistics you need to size your DR infrastructure. Plus, a Veeam Backup & Replication proof of concept (POC) is a good way to learn what your daily data change rate is so you can appropriately size your network connections to the recovery location.

Virtualize stragglers.

If you still have physical servers, it’s time to make the switch. The benefits of virtualized DR are well known and have been written about and practiced for more than 5 years. No matter how good your DR solution for physical servers is, it can’t come close to the capabilities and efficiencies of virtualized DR. Virtualize your remaining physical servers to achieve the most benefit from your DR Plan.

Analyze resource requirements.

The data collected in the assessment phase of your DR plan is invaluable in sizing the DR site and creating your DR budget for servers and storage. The largest limiting factor to the ideal DR plan for VMware is the bandwidth required to replicate all necessary VMs. Consider products, such as HyperIP from Netex, that offer WAN acceleration technology purpose-built for accelerating large data transfers over network links with high packet loss and high latency. This can allow for better use of available bandwidth without breaking your budget.

Design for easiest restore.

Simplicity is king in a DR situation. Rather than reinvent the wheel by reinstalling operating systems, applications and restoring individual files, you should restore the entire application as a VM to minimize restore time. Replicating the VM with Veeam Backup & Replication can provide the lowest RTO possible since VMs only need to be powered on to restore service.

Decide on infrastructure configuration.

In today’s cost-conscious IT environment, it’s good to know there are options for your recovery site configuration. Although you can choose to self-host DR options in your own facilities, you can also take advantage of VMware service providers that could provide your DR infrastructure as an on-demand cloud service.

Test-drive your DR plan.

Setting up a DR plan for VMware environments is not a one-time activity. You need to ensure that you test, test and test. Manual tests are good, but since you’re working with VMware technology, there’s no reason testing can’t be automated and run as often as daily if needed. Using Veeam Backup & Replication’s SureBackup automated backup verification feature is a great way to do this.