This blog post is about an internal effort we have started within EMC. We talked about it at EMC World and, based on the initial response, interest in this effort seems to be quite high. Here is some more information about it.
The challenge and the concept behind the solution
Over the past few years, customers have increasingly adopted, and come to expect, continuous availability in their data centers. While it may be obvious, it still deserves saying that continuous availability is an end-to-end paradigm: it starts with the application and extends through multi-pathing, SAN configuration, IP configuration and capabilities like VPLEX Metro, down to, last but not least, the physical storage.
We have always recognized that this has an impact on how customers view and purchase solutions. In other words, when a customer thinks about continuous availability, they think about continuous availability for their SAP Environment running on VMware in a SAN with multiple data centers etc. This has major implications for how we think about testing and validating what customers are deploying.
If you think of the normal testing paradigm for any product team, their responsibility is testing the product's capabilities and its handling of failure conditions, as well as performance, scale and other system testing needs. There is a second envelope of testing that is a superset of all of this: interoperability testing. EMC has built a core capability around interoperability testing with the world-class ELab within the EMC family. ELab is responsible for interoperability and protocol testing, and for certifying products to work with EMC products. The output of this work is the Support Matrices. Customers and the field treat these support matrices as their bibles for how to configure and deploy products for interoperability. One more envelope around this testing is solution testing, which takes the end-to-end pieces that are supported, deploys them together and tests them for functionality and performance.
One critical piece is still missing, especially given the focus that customers are putting on continuous availability. With the paradigm rapidly moving to 6 9s and 7 9s availability, it is not sufficient to test the individual pieces and trust that interop and solution testing will result in customers reaching those hallowed availability levels. Instead, what is needed is proactive stress and failure testing of these end-to-end deployments. It is also important that we understand the operational approach a customer is likely to take in such a deployment.
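To put those numbers in perspective, here is a small back-of-the-envelope sketch of what "N 9s" means as a yearly downtime budget, and of how availability compounds when every layer of an end-to-end stack has to be up at the same time. The function names, the six-layer stack and the assumption that layers fail independently and in series are mine, purely for illustration; they are not measurements from any EMC product.

```python
import math

# Average year length in seconds (365.25 days); an illustrative convention.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, for an availability of N nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


def serial_availability(layer_availabilities) -> float:
    """Availability of a chain in which every layer must be up.

    Assumes independent failures, which is a simplification.
    """
    return math.prod(layer_availabilities)


for n in (5, 6, 7):
    print(f"{n} nines: {downtime_budget_seconds(n):.2f} seconds of downtime per year")

# Hypothetical stack: application, multi-pathing, SAN, IP network,
# virtualization layer, physical storage, each at 6 nines on its own.
stack = [0.999999] * 6
print(f"end-to-end availability of the stack: {serial_availability(stack):.7f}")
```

Under these assumptions, six layers that each reach 6 9s in isolation deliver somewhat less than 6 9s end to end, which is one way to see why testing the pieces separately is not enough.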
How are we solving this challenge?
As you can imagine, in a multi-business unit company such as EMC, this is a herculean effort. You need different business units to buy into the concept of solution level failure and stress testing and then align on what is needed to validate and test this capability. Ultimately, our vision as EMC was to deliver to customers a continuous availability experience at the data center level. Talk about setting ambitious goals. But then, our goal was to deliver value to our customers. And setting goals only because they are achievable is not the way to get there.
Similar to when we built ELab, the decision was to invest in a new competency center – Mission Critical Center (MCC).
The mission of the MCC is to build a platform to test and demonstrate greater than 6 9s availability in production for products in the EMC portfolio.
And when we say production, we mean it. For our internal purposes, we treat the MCC exactly as we would treat a customer. They file an SR, and escalations to engineering go through exactly the support route that a customer would follow. Upgrades to systems are done the same way customers would do them. For all practical purposes, they get exactly the same handling and care that EMC would provide in a customer environment. This teaches us not only how the product behaves but also what impact our support processes have from a customer perspective. Finally, this helps us start to look at the problem holistically, i.e. we do not approach debugging a problem from a product perspective but rather from the perspective of the complete solution that the customer deploys.
Mission Critical Center: What is in place and where are we going?
Now that we have talked through the concept, let’s look at what the MCC team has done so far. The MCC team was started as a ground-up effort looking for like-minded and interested stakeholders across different business units (translation: it has largely been built through a lot of conviction and convincing). The team is essentially a shared collaboration between a number of business units (VMAX, VNX, RecoverPoint and VPLEX). Here is the configuration they have put together.
For readers of this blog, this topology should be very familiar: it represents the cascaded VPLEX and RecoverPoint topology discussed here and the specific topology captured here. The team has built use-cases around stretched Oracle RAC across DC1 and DC2, stretched VMware HA and other applications, all running production level workloads across DC1 and DC2 and protected in DC3. Once this mission critical platform was built, their focus was to run I/Os and then start accelerated failure testing (i.e. simulate data center type failure scenarios to understand what failures happen across the entire solution set). The goal of this is _NOT_ to test interoperability of VPLEX with VMAX or VPLEX with VNX, or to test the performance of any one component. The goal is to take real world customer workloads, deploy them across infrastructure the way a customer would, and learn the customer's operational challenges as well as how the infrastructure handles and recovers from failures. So, the MCC team will often fail WAN links, fail entire arrays, do tech refreshes, introduce a fabric wide zoning change, simulate a data center disaster, … you get the idea. Needless to say, I am a big fan!
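To make the idea of walking through failure scenarios a little more concrete, here is a deliberately tiny sketch that enumerates site failures for a three-site layout like the one above and classifies the outcome. The rules in it are my own simplification for illustration (the application stays available while either DC1 or DC2 is up, and is recoverable from DC3 otherwise); it is a toy model, not how the MCC team's tooling or the products themselves actually decide anything.

```python
from itertools import combinations

# Site names follow the post; everything else here is a hypothetical model.
SITES = ("DC1", "DC2", "DC3")
STRETCHED_SITES = {"DC1", "DC2"}  # sites running the stretched workload
PROTECTION_SITE = "DC3"           # site holding the protected copy


def service_state(failed_sites) -> str:
    """Classify the outcome of a set of failed sites under the toy rules."""
    failed = set(failed_sites)
    if STRETCHED_SITES - failed:
        return "available"    # at least one stretched site still serves I/O
    if PROTECTION_SITE not in failed:
        return "recoverable"  # both stretched sites gone, protected copy survives
    return "lost"


def enumerate_site_failures(max_failures: int) -> dict:
    """Map every combination of up to max_failures failed sites to its outcome."""
    outcomes = {}
    for count in range(max_failures + 1):
        for combo in combinations(SITES, count):
            outcomes[combo] = service_state(combo)
    return outcomes


for combo, state in enumerate_site_failures(len(SITES)).items():
    label = ", ".join(combo) if combo else "no failures"
    print(f"{label}: {state}")
```

A real failure campaign covers far more than whole-site loss (WAN links, individual arrays, zoning changes, upgrades in flight), but even a toy enumeration like this shows how quickly the scenario list grows once failures are combined.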
The team has some very concrete plans on how to take this forward. This configuration is now being morphed into the MetroPoint configuration. That way, they can implement this new and exciting capability in much the same way as a customer would, and with it comes a whole new set of failure modes to test and simulate. We will continue to add more applications (SQL, SAP HANA, Hadoop), more infrastructure variances (data center moves, network outages, rolling outages and the like) and then more of EMC’s product families (DataDomain, Networker, Avamar, ViPR).
Mission Critical Center: The call to action
As the team is building their capabilities, we have a very real need for active guides / participants to build a strong community around the mission critical center. So, here are the concrete asks:
- If you are a customer / field person with solutions / design experience and would like to participate in this effort, do reach out to me and I can put you in touch with the team. You can contribute as often or as little as you like. Your role will be to guide the team on what they should look for, to help them understand the operational processes on your end, and to show us how your data center is evolving, so that our products keep providing the same world-class capabilities there as they do in your environments today.
- If there are specific scenarios / applications that you think would be worthy additions to this environment, please reach out to me and we can work to get those on our TODO list for the Mission Critical Center.
In the end, this is a community of some very talented engineers within EMC volunteering a big chunk of their time (in addition to doing their day jobs) to enable EMC products to deliver a 6 9s experience in customer data centers. Your help will get us there sooner and make this process more effective. Do consider contributing to this effort!