Customer Focus: The EMC Way

Every company has some core founding principles / values – some rather overt, some implicit. These are not principles related to product or technology. Nor are they related to vision. More often than not, these values are not written down. You only learn them through the oral tradition of stories / myths told at bars by people who have been in the organization long enough. Rarely do you get to experience them first hand.

So why bring this up now?

Having shipped myself to the west coast, I am in the odd position of being in a minority (a person who was with EMC but moved to Isilon, from Hopkinton to Seattle). I am a conduit of these very myths – some I have learned (the ‘yes, it does snow in New England’ variety), others I have lived. This is one in the latter category.

This story is from many _many_ years ago. All names (except some key principals who I am sure won’t mind) have been kept confidential for obvious reasons.

I had recently taken on a management role within the engineering organization. I was responsible for SW development and customer escalation management.

As most stories go, this one started with a rather innocuous request from a customer (BTW, they have since become one of my favorite customers – visionary, driving technology, taking calculated risks and, in every way, partnering with us to build better products. It also helps that they are a household name – one of the few ways I can help my non-techie family members understand what it is I work on). The request was for help migrating between data centers, with a specific ask that engineering be involved.

As it came to us, this seemed like a normal request and we assumed that engineering involvement was needed largely for review. Then the oddness started – we had a product specifically designed for migrations, but the customer was apparently not using it for this migration.

We dug in and contacted the account team, who contacted the customer. It turned out it was a migration, except it wasn’t a copy of the data but rather a physical move of the infrastructure. And the customer wanted to keep their operations online through the physical move and was convinced they could do this with our product.

I have to share this in all honesty – as the person responsible for carrying the quality banner, I was petrified. While the customer, in theory, was RIGHT (aren’t they always?), we had never quite anticipated a customer contemplating using the product in this manner. Oh yeah, and the move was going to happen on Thursday, and we learned about this on the Monday of that week.

Once we got past the seven stages of coming to terms with reality, we got down to brass tacks (and yes, involvement from engineering was going to be more than just review :-)). One of the engineers from my team was going to head down to the customer site (a drivable distance from Hopkinton) to perform this ‘move’. He got to work testing and practicing the move procedure, working out any kinks. So far so good, and nothing too far out of the ordinary as far as customer escalations go.

One more thing …

So we had worked through the kinks, and the engineer going to the customer site was feeling confident. Thursday morning we ran through a last check. Lo and behold, we found out that the long-distance SFPs needed to enable the migration had been misplaced at the customer site. For some reason that I cannot remember, these were not SFPs that were just lying around (I seem to remember that the default was to use short-distance SFPs). So here we were at noon on Thursday with everything set except the key ingredient to make the move successful.

I remember going to my manager (@MattWaxman) with a completely dead look that basically said, ‘I am out of options’. In an inspired moment, he suggested something off the wall – since then I have learned that desperation makes you creative – “Let’s email all the people we know at EMC (our PMTs, BMTs, execs, support) with a system-wide SOS that says something to the effect of ‘We need long-distance SFPs for a customer in the next three hours – here is the model number. Please contact us if you have any of these lying around. We need 24.’”

We sent that note – expecting this to be a complete Hail Mary with no chance of success.

Not having much else to do, we ran through one more dry run for the move and let the account team know that we might have to cancel since we didn’t have the SFPs. The final go/no-go was set for 3:00 PM. The engineer was leaving at 4:00 PM for the drive over.

Here is what happened instead.

I returned to my office after that dry run. And on my desk I had ten to twenty different packages of SFPs – some from people I knew, most from people I didn’t. One guy had driven from our factory in Franklin, MA with a box of these SFPs. His exact statement to me was – ‘Someone told me that they had heard about you needing SFPs for a customer. I have tested all of these – they work. Make the customer successful’. I had sticky notes that said the same thing.

Needless to say, the engineer was able to take these along on his drive to the customer site. The move was executed flawlessly. In fact, the customer’s end customers didn’t see a single app bounce. This customer and that account team became some of our biggest advocates.

What did I learn?

Customer focus was something Dick Egan intentionally drove into the EMC culture. Many, many years removed from his direct involvement with the company, EMC employee #1 continued to cast a large shadow. It tells you how important founders (and the culture they establish) are.

Customer success is everyone’s responsibility, whether you work on a product or not. Customers make the world go round. Over time I have been in many situations where my direct responsibilities alone would not have required me to act. In all those situations, I try to act with the same proactive ownership culture that I was the beneficiary of.

Always focus on what’s right for the customer. The easy answer above would have been to declare the use case as unsupported. The harder answer was to look past the execution risk and focus on what the customer justifiably needed. Many thanks to the customer for pushing us.

Even as I type this blog many years removed from the actual incident, I continue to be touched by the camaraderie and the sheer stick-to-itiveness of the EMC culture that did not allow us and the customer to fail.

It is one of those rare moments where it feels like the entire company stands behind you as an individual helping you succeed. In my day-to-day interactions within EMC, I continue to use this incident as the yardstick by which I measure myself.

InsightIQ: Basic workflow demos

InsightIQ is Isilon’s software for capacity planning / reporting and performance troubleshooting / reporting. Over the past few releases, we have been working diligently on some major shifts in how IIQ workflows function and what capabilities the product can provide.

Instead of trying to describe these workflow changes, our TME team (specifically the awesome Robert Chang) has come up with some demos to show these common workflows to our customers.

Use-case 1: Identify demanding NAS Clients

This use-case focuses on how you can identify a client that is consuming network resources. In this case, you start with the external throughput and work your way down to the actual workstation / IP address that is consuming the resources. From there, you can break the client’s traffic down by I/O type and protocol to help narrow down what is happening.

When do you use this? When you start seeing clients that are not able to get the bandwidth they need from Isilon, this can be a great first step toward understanding who is consuming the bandwidth.
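
If you prefer to see the idea in data rather than in the GUI, here is a minimal sketch of the same analysis. It is purely illustrative – the file name and column names (`client_throughput.csv`, `client_ip`, `protocol`, `throughput_mbps`) are hypothetical stand-ins for whatever per-client throughput data you export, not InsightIQ’s actual export format:

```python
import csv
from collections import defaultdict

# Hypothetical export of per-client throughput samples; the file name and
# column names are illustrative, not InsightIQ's actual export format.
samples = defaultdict(list)
with open("client_throughput.csv", newline="") as f:
    for row in csv.DictReader(f):
        samples[(row["client_ip"], row["protocol"])].append(float(row["throughput_mbps"]))

# Rank client/protocol pairs by average throughput to find the heavy hitters.
averages = {key: sum(vals) / len(vals) for key, vals in samples.items()}
for (client, proto), mbps in sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{client:15s} {proto:6s} {mbps:8.1f} Mb/s average")
```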

Use-case 2: Protocol operations average latency

This use-case focuses on identifying the latency of protocol operations within the Isilon cluster. When customers are trying to debug latency issues, this procedure can be very helpful. It is important to understand that Isilon can only help identify latency once the I/O enters the system. There may be network contention outside of Isilon, or even contention on the client itself if multiple hosts are competing for the CPU. In a lot of cases, this is a good sanity check. I know of a couple of customers who keep tabs on this latency as an indicator of overall Isilon system health.
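
For the customers who keep tabs on this latency as a health signal, the check boils down to comparing observed averages against a baseline. Here is a minimal sketch of that idea – again with a hypothetical export format (`protocol_latency.csv` with `protocol` and `latency_ms` columns) and made-up thresholds:

```python
import csv
from collections import defaultdict

# Example thresholds only; tune these per environment and workload.
BASELINE_MS = {"nfs3": 5.0, "smb2": 8.0}

# Hypothetical export of per-operation latency samples; format is illustrative.
latencies = defaultdict(list)
with open("protocol_latency.csv", newline="") as f:
    for row in csv.DictReader(f):
        latencies[row["protocol"]].append(float(row["latency_ms"]))

for proto, vals in latencies.items():
    avg = sum(vals) / len(vals)
    baseline = BASELINE_MS.get(proto)
    status = "OK" if baseline is None or avg <= baseline else "INVESTIGATE"
    print(f"{proto:6s} avg latency {avg:6.2f} ms  {status}")
```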

Use-case 3: Capacity utilization

Isilon runs a job called File System Analytics (FSA), which collects metadata for files. This is then combined with the raw performance and capacity data to derive some very helpful information. In this particular demo, Robert walks through capacity utilization.

When is this useful? The primary case is when you are trying to understand where your capacity is being consumed and, more importantly, which client has seen the biggest change in capacity. Note that this is only one mechanism for debugging capacity – you should always manage capacity through proper use of soft and hard quotas.
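
The underlying analysis is essentially a diff of two point-in-time capacity reports. Here is a minimal sketch of that, assuming two hypothetical CSV reports with `path` and `bytes_used` columns (not FSA’s actual output format):

```python
import csv

def load_usage(path):
    """Load a hypothetical capacity report: one row per path with bytes used."""
    with open(path, newline="") as f:
        return {row["path"]: int(row["bytes_used"]) for row in csv.DictReader(f)}

# Two point-in-time capacity reports; file names and format are illustrative.
before = load_usage("capacity_last_week.csv")
after = load_usage("capacity_today.csv")

# Rank paths by growth to find where capacity is being consumed fastest.
deltas = {p: after.get(p, 0) - before.get(p, 0) for p in set(before) | set(after)}
for path, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{path:40s} {delta / 2**30:+8.2f} GiB")
```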

And we are just getting started – there is a lot more we plan to tell you about the InsightIQ space. Stay tuned!

Back to blogging

Over the past year, I have taken an extended break from blogging.

A lot has been going on in my small corner of the world. I transitioned to the Isilon Product Management team last year and moved from Boston to the west coast. It was a new product (to me), a new market, a new team, and just a ton going on. As much as I tried to put pen to paper, stuff kept coming up. What that means is that I have accumulated a lot of topics to write about.

In a lot of ways, I am still getting used to the west coast – no snow in winter was a pleasant change. And what a winter it was in Boston! I feel settled enough that I can get back to writing again. I am taking this opportunity to capture all that I am learning about Isilon and unstructured data. Some areas I know well – other areas, I know enough to be dangerous but have a lot more to learn. Either way, it should be a fun ride. Join me!

Mission Critical Center: A community for continuous availability

This blog post is about an internal effort we have started within EMC. We talked about this at EMC World, and based on the initial response, the interest level behind this effort seems to be quite high. Here is some more information about the effort.

The challenge and the concept behind the solution

Over the past few years, customers have increasingly been adopting / expecting continuous availability in their data centers. While it may be obvious, it still deserves saying that continuous availability is an end-to-end paradigm, spanning the application, multi-pathing, SAN configuration, IP configuration, capabilities like VPLEX Metro, and last but not least the physical storage.

We have always recognized that this has an impact on how customers view and purchase solutions. In other words, when a customer thinks about continuous availability, they think about continuous availability for their SAP environment running on VMware on a SAN across multiple data centers, and so on. This has major implications for how we think about testing and validating what customers are deploying.

If you think of the normal testing paradigm for any product team, their responsibility is to test the product’s capabilities, its handling of failure conditions, and its performance, scale and other system testing needs. There is a second envelope of testing that is a superset of all of this – interoperability testing. EMC has built a core capability around interoperability testing with the world-class ELab within the EMC family. ELab is responsible for interoperability and protocol testing and for certifying products to work with EMC products. This results in the Support Matrices. Customers and the field treat these support matrices as their bibles for how to configure and deploy products for interoperability. One more envelope around this is solution testing – taking the end-to-end pieces that are supported, deploying them, and testing them for functionality and performance.

One critical piece is still missing – especially with the focus that customers are putting on continuous availability. With the paradigm rapidly moving to 6 9s and 7 9s availability, it is not sufficient to test the individual pieces and trust that interop and solution testing will result in customers reaching those hallowed availability levels. Instead, what is needed is proactive stress and failure testing of these end-to-end deployments. It is also important that we understand the operational model a customer is likely to adopt in such a deployment.
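
To put those numbers in perspective, here is the downtime budget each availability level implies over a 365-day year:

```python
# Downtime budget implied by an availability target over a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds

for label, availability in [("5 nines", 0.99999), ("6 nines", 0.999999), ("7 nines", 0.9999999)]:
    downtime_s = SECONDS_PER_YEAR * (1 - availability)
    print(f"{label}: {downtime_s:6.1f} seconds of downtime per year")
```

At 6 9s, the entire end-to-end stack gets roughly half a minute of downtime per year, which is why testing the parts in isolation cannot be the whole answer.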

How are we solving this challenge?

As you can imagine, in a multi-business-unit company such as EMC, this is a herculean effort. You need different business units to buy into the concept of solution-level failure and stress testing and then align on what is needed to validate and test this capability. Ultimately, our vision as EMC was to deliver to customers a continuous availability experience at the data center level. Talk about setting ambitious goals. But then, our goal was to deliver value to our customers, and setting goals only because they are achievable is not the way to get there.

Similar to when we built ELab, the decision was to invest in a new competency center – Mission Critical Center (MCC).

The mission of the MCC is to build a platform to test and demonstrate greater than 6 9s availability in production for products in the EMC portfolio.

And when we say production, we mean it. For our internal purposes, we treat the MCC exactly as we would treat a customer. They file SRs, and escalations to engineering go through exactly the support route that a customer would follow. System upgrades are done the same way customers would go through them. For all practical purposes, they get exactly the same handling and care that EMC would provide in a customer environment. This teaches us not only how the product behaves but also what the impact of our support processes is from a customer’s perspective. Finally, this also helps us start to look at the problem holistically – i.e., we do not approach debugging from a product perspective but rather from the perspective of the complete solution that the customer deploys.

Mission Critical Center: What is in place and where are we going?

Now that we have talked through the concept, let’s look at what the MCC team has done so far. The MCC team was started as a ground-up effort looking for like-minded and interested stakeholders across different business units (translation: it has largely been built through a lot of conviction and convincing). The team is essentially built through a shared collaboration between a number of business units (VMAX, VNX, RecoverPoint and VPLEX). Here is the configuration they have put together.

Mission Critical Center Architecture

For readers of this blog, this topology should be very familiar – it represents the cascaded VPLEX and RecoverPoint topology discussed here, with the specific topology captured here. The team has built use-cases around stretched Oracle RAC across DC1 and DC2, stretched VMware HA and other applications, all running production-level workloads across DC1 and DC2 and protected in DC3. Once this mission critical platform was built, the focus was to run I/O and then start doing accelerated failure testing (i.e. simulate data center type failure scenarios to understand what happens across the entire solution set). The goal of this is _NOT_ to test interoperability of VPLEX with VMAX or VPLEX with VNX, or to test the performance of any one component. The goal is to take real-world customer workloads, deploy them across infrastructure the way a customer would, and learn about the operational challenges as well as how the infrastructure handles and recovers from failures. So the MCC team will often fail WAN links, take down entire arrays, do tech refreshes, introduce a fabric-wide zoning change, simulate the disaster of a data center, … you get the idea. Needless to say, I am a big fan!
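
To give a flavor of what a failure-injection run looks like (a simplified, hypothetical sketch – not the MCC’s actual tooling), the loop is: pick a scenario, inject the failure, watch whether the workload stays available, and record how long recovery takes:

```python
import random
import time

# Hypothetical failure scenarios of the kind described above; the injection and
# health-check steps below are simulated stand-ins, not real infrastructure calls.
SCENARIOS = [
    "fail WAN link between DC1 and DC2",
    "power off an entire array in DC1",
    "push a fabric-wide zoning change",
    "simulate loss of DC2",
]

def workload_healthy() -> bool:
    """Stand-in for checking that the production workload is still serving I/O."""
    return random.random() > 0.05  # simulated: healthy most of the time

def run_failure_test(scenario: str) -> None:
    print(f"Injecting: {scenario}")
    start = time.time()
    # ... inject the failure here (simulated) ...
    while not workload_healthy():
        time.sleep(1)  # poll until the end-to-end solution recovers
    print(f"  workload available again after {time.time() - start:.1f}s")
    # ... restore the failed component and verify steady state (simulated) ...

if __name__ == "__main__":
    run_failure_test(random.choice(SCENARIOS))
```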

The team has some very concrete plans on how to take this forward. This configuration is now being morphed into the MetroPoint configuration. That way, they can implement this new and exciting capability in much the same way as a customer would, which brings a whole new set of failure modes to test and simulate. We will continue to add more applications (SQL, SAP HANA, Hadoop), more infrastructure variations (data center moves, network outages, rolling outages and the like) and then more of EMC’s product families (DataDomain, Networker, Avamar, ViPR).

Mission Critical Center: The call to action

As the team builds out its capabilities, we have a very real need for active guides / participants to build a strong community around the Mission Critical Center. So, here are the concrete asks:

  1. If you are a customer / field person with solutions / design experience and would like to participate in this effort, do reach out to me and I can put you in touch with the team. You can contribute as much or as little as you like. Your role will be to provide guidance on what the team should look for, help us understand the operational processes on your end, and help us along the journey as your data center evolves, so that our products provide the same world-class capabilities there as they do in your environments today.
  2. If there are specific scenarios / applications that you think would be worthy additions to this environment, please reach out to me and we can work to get those onto the TODO list for the Mission Critical Center.

In the end, this is a community of some very talented engineers within EMC volunteering a big chunk of their time (in addition to doing their day jobs) to enable EMC products to deliver a 6 9s experience in customer data centers. Your involvement will help us get there sooner and make the process more effective. Do consider contributing to this effort!

ViPR 2.0: New use-cases to support VPLEX and RecoverPoint

The GA of ViPR 2.0 was announced in time for EMC World. While there are significant announcements in ViPR 2.0, I will focus on the pieces that benefit VPLEX and RecoverPoint in this new integration.

A quick recap of what was supported prior to the 2.0 release is available here.

Support for Snaps and Clones on arrays behind VPLEX

In the 2.0 release, ViPR now supports full lifecycle management of snaps and clones on arrays behind VPLEX. This gives customers a single pane of glass for managing snaps and clones. This seamless experience makes it easy for customers to take advantage of the performance and scale of these capabilities on the underlying arrays without compromising on ease of use. Here is a demo of this capability.

Setting up a Local Mirror (RAID-1)

Another addition in the ViPR 2.0 release is the ability to add a local mirror leg to a given virtual volume to create a RAID-1. This allows the volume to be protected across arrays. Here is a demo of this capability:

VPLEX and RP Protection

One of the big additions in the ViPR 2.0 release was common management for RecoverPoint within the VPLEX context. This allows RecoverPoint protection for VPLEX volumes to be configured through the same user interface. Combined with the end-to-end VPLEX provisioning through ViPR, you can now accomplish complete VPLEX provisioning with RecoverPoint. Please note that ViPR 2.0 does not support the MetroPoint topology; this is targeted for future releases.

Updated Provisioning use-case

The VPLEX provisioning workflow has been updated since ViPR 1.0. Here is a demo of the updated provisioning workflow.
