Tag Archives: continuous availability

Mission Critical Center: A community for continuous availability

This blog post is about an internal effort we have started within EMC. We have talked about this at EMC World. Based on the initial response, the interest level behind this effort seems to be quite high. Here is some more information about the effort.

The challenge and the concept behind the solution

Over the past few years, customers have increasingly adopted, or come to expect, continuous availability in their data centers. While it may be obvious, it still deserves saying that continuous availability is an end-to-end paradigm: it starts with the application and runs through multi-pathing, SAN configuration, IP configuration and capabilities like VPLEX Metro down to, last but not least, the physical storage.

We have always recognized that this has an impact on how customers view and purchase solutions. In other words, when a customer thinks about continuous availability, they think about continuous availability for their SAP Environment running on VMware in a SAN with multiple data centers etc. This has major implications for how we think about testing and validating what customers are deploying.

If you think of the normal testing paradigm for any product team, their responsibility is testing the product's capabilities and its handling of failure conditions, as well as performance, scale and other system testing needs. There is a second envelope of testing that is a superset of all of this – interoperability testing. EMC has built a core capability around interoperability testing with the world-class ELab within the EMC family. ELab is responsible for interoperability and protocol testing and for certifying products to work with EMC products. This work generates the Support Matrices. Customers and the field treat these support matrices as their bibles for how to configure and deploy products for interoperability. One more envelope around this testing is solution testing: taking the end-to-end pieces that are supported, deploying them and testing them for functionality and performance.

One critical piece is still missing – especially with the focus that customers are putting on continuous availability. With the paradigm rapidly moving to 6 9s and 7 9s availability, it is not sufficient to test the individual pieces and trust that interop and solution testing will result in customers reaching those hallowed availability levels. Instead, what is needed is proactive stress and failure testing of these end-to-end deployments. It is also important that we understand the operational approach a customer is likely to take in such a deployment.
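To put numbers on those availability levels, here is a small back-of-the-envelope sketch. It is only arithmetic, not an EMC tool: the function names are made up for illustration, and the series calculation assumes that every layer in the end-to-end chain must be up and that layers fail independently, which is a simplification.

```python
# Back-of-the-envelope availability arithmetic (illustrative names only).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, leap days included


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


def series_availability(component_availabilities) -> float:
    """Availability of a chain where every component must be up.

    Assumes independent failures, which real deployments only approximate.
    """
    total = 1.0
    for availability in component_availabilities:
        total *= availability
    return total


if __name__ == "__main__":
    for nines in (5, 6, 7):
        print(f"{nines} nines: {allowed_downtime_seconds(nines):.2f} s of downtime per year")
    # Six layers that each reach six nines do not add up to six nines end to end.
    print(f"six layers in series: {series_availability([0.999999] * 6):.9f}")
```

Six 9s leaves a budget of roughly half a minute per year, and chaining several six-9s layers in series already drops the whole stack below that level. That is why testing each piece on its own is not enough.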

How are we solving this challenge?

As you can imagine, in a multi-business unit company such as EMC, this is a herculean effort. You need different business units to buy into the concept of solution level failure and stress testing and then align on what is needed to validate and test this capability. Ultimately, our vision as EMC was to deliver to customers a continuous availability experience at the data center level. Talk about setting ambitious goals. But then, our goal was to deliver value to our customers. And setting goals only because they are achievable is not the way to get there.

Similar to when we built ELab, the decision was to invest in a new competency center – Mission Critical Center (MCC).

The mission of the MCC is to build a platform to test and demonstrate greater than 6 9s availability in production for products in the EMC portfolio.

And when we say production, we mean it. For our internal purposes, we treat the MCC exactly as we would treat a customer. They file an SR, and escalations to engineering go through exactly the support route that a customer would follow. Upgrades to systems are done the same way customers would do them. For all practical purposes, they get exactly the same handling and care that EMC would provide in a customer environment. This teaches us not only how the product behaves but also what impact our support processes have from a customer perspective. Finally, this helps us start to look at the problem holistically – i.e. we do not approach debugging the problem from a product perspective but rather from the perspective of the complete solution that the customer deploys.

Mission Critical Center: What is in place and where are we going?

Now that we have talked through the concept, let’s look at what the MCC team has done so far. The MCC team was started as a ground-up effort looking for like-minded and interested stakeholders across different business units (translation: it has largely been built through a lot of conviction and convincing). The team is essentially a shared collaboration between several business units (VMAX, VNX, RecoverPoint and VPLEX). Here is the configuration they have put together.

Mission Critical Center Architecture

Readers of this blog should be very familiar with this topology – it represents the cascaded VPLEX and RecoverPoint topology discussed here and the specific topology captured here. The team has built use-cases around stretched Oracle RAC across DC1 and DC2, stretched VMware HA and other applications, all running production-level workloads across DC1 and DC2 and protected in DC3. Once this mission critical platform was built, their focus was to run I/Os and then start accelerated failure testing (i.e. simulate data center type failure scenarios to understand what failures happen across the entire solution set). The goal of this is _NOT_ to test interoperability of VPLEX with VMAX or VPLEX with VNX or to test the performance of any one component. The goal is to take real world customer workloads, deploy them across infrastructure the way a customer would, and learn the operational challenges as well as how the infrastructure handles and recovers from failures. So, the MCC team will often fail WAN links, fail entire arrays, do tech refreshes, introduce a fabric-wide zoning change, simulate the disaster of a data center, … you get the idea. Needless to say, I am a big fan!

The team has some very concrete plans on how to take this forward. This configuration is now being morphed into the MetroPoint configuration. That way, they can implement this new and exciting capability in much the same way as a customer would, and with it comes a whole new set of failure modes to test and simulate. We will continue to add more applications (SQL, SAP HANA, Hadoop), more infrastructure variances (data center moves, network outages, rolling outages and the like) and more of EMC’s product families (DataDomain, Networker, Avamar, ViPR).

Mission Critical Center: The call to action

As the team is building their capabilities, we have a very real need for active guides / participants to build a strong community around the mission critical center. So, here are the concrete asks:

  1. If you are a customer or field person with solutions or design experience and would like to participate, do reach out to me and I can put you in touch with the team. You can contribute as much or as little as you like. Your role will be to guide the team on what they should look for, to help them understand operational processes on your end, and to help us follow how your data center is evolving so that our products keep providing the same world class capabilities as they do in your environments today.
  2. If there are specific scenarios or applications that you think would be worthy additions to this environment, please reach out to me and we can work to get those on our TODO list for the Mission Critical Center.

In the end, this is a community of some very talented engineers within EMC volunteering a big chunk of their time (in addition to doing their day jobs) to enable EMC products to deliver a 6 9s experience in customer data centers. Your help is going to help us get there sooner and make this process more effective. Do consider contributing to this effort!

2014 Launch Post 2: MetroPoint: Extending the Availability and Protection Continuum

On April 4th, 2014, as part of the Data Protection and Availability Division (DPAD) launch, three VPLEX and RecoverPoint items were launched or made generally available (GA):

  • VPLEX Virtual Edition – Availability late Q2
  • MetroPoint Topology – Joint capability of VPLEX and RecoverPoint – Availability late Q2
  • VPLEX Integrated Array Services – Available now

This is the second in a series of posts to walk through what was launched / delivered.

VPLEX and RecoverPoint

It has been two years since we introduced the RecoverPoint splitter within VPLEX. The awesomeness of VPLEX was joined with the coolness of RecoverPoint. With this combination, we delivered operational and disaster recovery to VPLEX customers, adding to the continuous availability that they already had access to. These were extremely complementary use-cases. While there were a lot of skeptics outside of EMC about this combination, we were quietly confident in our belief that customers wanted an extended continuum between disaster recovery and continuous availability. Suffice it to say that this combination has exceeded our revenue expectations. Since the launch in May 2012, the organizations have come even closer together within a single business unit, further solidifying the bonds between the two teams.

A quick recap of the current integration points between VPLEX and RecoverPoint.

RecoverPoint delivers continuous data protection, enabling local and/or remote protection. This is enabled by a RecoverPoint splitter which resides within the VPLEX platform. RecoverPoint has a similar splitter in the VMAX and VNX platforms as well. The RP splitter enables WRITES to be sent to a RecoverPoint Appliance (RPA). From there, you can enable local protection (where the writes are journaled locally) or remote protection (where the writes are journaled remotely) or both. The beauty of RecoverPoint is that it can store every single write, which gives recovery a DVR-like capability: you can go back to any earlier point in time. The other benefit of RecoverPoint is that the protection is heterogeneous, i.e. it can protect between every combination of VPLEX, VMAX and VNX.
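To make the "every write is journaled" idea concrete, here is a deliberately tiny sketch of a write splitter feeding a journal that can rebuild the volume as of any earlier moment. This is a toy model for intuition only and not how RecoverPoint is implemented: `WriteJournal`, `split_write` and the integer timestamps are all invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WriteJournal:
    """Toy journal (hypothetical): keeps every write so earlier states can be rebuilt."""
    entries: list = field(default_factory=list)  # (timestamp, block, data) tuples

    def record(self, timestamp: int, block: int, data: bytes) -> None:
        self.entries.append((timestamp, block, data))

    def image_at(self, timestamp: int) -> dict:
        """Replay, in time order, all writes up to and including `timestamp`."""
        image = {}
        for ts, block, data in sorted(self.entries, key=lambda entry: entry[0]):
            if ts <= timestamp:
                image[block] = data
        return image


def split_write(volume: dict, journal: WriteJournal,
                timestamp: int, block: int, data: bytes) -> None:
    """Apply the write to the production volume and send a copy to the journal."""
    volume[block] = data
    journal.record(timestamp, block, data)


if __name__ == "__main__":
    volume, journal = {}, WriteJournal()
    split_write(volume, journal, 1, 0, b"a")
    split_write(volume, journal, 2, 0, b"b")  # overwrites block 0
    split_write(volume, journal, 3, 1, b"c")
    print(journal.image_at(1))  # the volume as it looked before the overwrite
    print(journal.image_at(3))  # matches the current production volume
```

Because no write is ever discarded, "rewinding" is just a replay that stops earlier. A real product has to bound journal size and preserve write ordering across volumes, which this sketch ignores.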

The combination of VPLEX and RecoverPoint supports the following topologies:

  1. VPLEX Local with RecoverPoint Local Protection
  2. VPLEX Local with RecoverPoint Remote Protection
  3. VPLEX Metro with RecoverPoint Local Protection
  4. VPLEX Metro with RecoverPoint Remote Protection

The slide below shows the currently supported topologies.

    Currently supported VPLEX and RecoverPoint topologies

    Customer topologies are all over the map – we see a lot of traction with VPLEX Local and RecoverPoint Remote Protection (as we expected). However, the second most common topology is the three-site cascaded topology. And that was a surprise. Upon digging further, we found that a lot of customers have business requirements that call for an out-of-region disaster recovery site. Yet other customers are deploying VPLEX Metro within one site. So, the usage of RecoverPoint in this case is to provide DR to a Metro deployed within the site. This is the cascaded topology.

    As you can imagine, the downside of the cascaded topology is that if the replicating VPLEX Cluster fails or loses connectivity, DR protection is lost. Since the launch of RecoverPoint on VPLEX, quite a few customers have been asking us to add the capability to protect both sides of a VPLEX Metro to a common third site using RecoverPoint. Well, that is exactly what we have done.

    MetroPoint: Operational and Disaster Protection across both sides of a VPLEX Metro

    MetroPoint Topology

    The MetroPoint solution launched April 4th will GA at the end of Q2. This is a joint capability between RecoverPoint and VPLEX. Starting with RecoverPoint 4.1 and GeoSynchrony 5.4, customers will now be able to add Disaster Recovery and Operational Recovery protection to both sides of a distributed volume. With MetroPoint, we took the time to do this right – although the protection is on both sides of a distributed volume, only one of the sides is replicating data. The data goes to a single copy of a DR leg. In other words, no additional bandwidth or storage is needed to enable MetroPoint as compared to enabling a standard DR scenario.

    To enable this, we have created a new kind of consistency group called the MetroPoint consistency group. This enables replication on both sides of a distributed volume. Another characteristic of the MetroPoint consistency group is that you can choose which site is the primary replication site for each group, and so balance the replication load across the two sites. If there is a failure on the primary replication site, the replication will AUTOMATICALLY switch to the surviving site. In other words, there is no loss of DR protection even if you lose the primary replication site.
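The behaviour described above (both sides protected, only one side replicating, automatic switch on failure) can be sketched as a tiny state model. This is an illustration under stated assumptions, not the RecoverPoint or VPLEX API: the class name, method names and site labels are hypothetical, and what happens when a failed site comes back is deliberately left out because it is not covered here.

```python
class ToyMetroPointGroup:
    """Hypothetical model of one consistency group protected from two Metro sites."""

    def __init__(self, sites, preferred):
        if preferred not in sites:
            raise ValueError("preferred site must be one of the protected sites")
        self.healthy = {site: True for site in sites}
        self.preferred = preferred  # the primary replication site for this group

    def fail(self, site):
        """Mark a site as failed or unreachable."""
        self.healthy[site] = False

    def active_replicator(self):
        """Return the single site currently sending data to the DR copy, or None."""
        if self.healthy[self.preferred]:
            return self.preferred
        for site, is_up in self.healthy.items():
            if is_up:
                return site  # automatic switch to the surviving site
        return None  # both sides down: nothing left to replicate from


if __name__ == "__main__":
    group = ToyMetroPointGroup(["site_a", "site_b"], preferred="site_a")
    print(group.active_replicator())  # the preferred site replicates
    group.fail("site_a")
    print(group.active_replicator())  # replication moves to the survivor
```

Giving different groups different preferred sites is how the replication load would be spread across the two sides in this model.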

    To me, one of the more exciting implications is the extension of the VMware HA and VMware SRM use cases to the MetroPoint topology. Here is what this looks like:

    MetroPoint with VMware HA and SRM

    The VPLEX Metro sites are protected with VMware HA and the remote DR site is protected with VMware SRM. This now gives our customers simultaneous HA and DR.

    One comment here: We talk about MetroPoint as a three site deployment and that is true. However, it is worth remembering that there are a number of customers who deploy VPLEX Metro within a data center either to protect multiple floors or multiple SANs or across a campus type environment. In those scenarios, customers can use MetroPoint to protect to a second site. There is a lot of interest in this deployment model.

    More coolness – along the way, we were able to meet one more request from our customers. With the MetroPoint consistency group, we were able to provide operational recovery on both sides of a VPLEX Metro. And this does not need a third site!

    Operational Recovery on both sides of a VPLEX Metro

    To top this all off, MetroPoint is completely heterogeneous. All these goodies work with both EMC and non-EMC arrays. So long as the storage array is supported by VPLEX, you are good to go.

    Here is a short video that Paul Danahy and I put together to give you a brief overview of MetroPoint:

    With MetroPoint, we have raised the bar on continuous availability and disaster recovery. This has been the result of collaboration between the VPLEX and RecoverPoint engineering teams, with a lot of input from some of our lead customers. To all those who helped us get here, a very BIG thank you!