VMworld 2012: They did WHAT?!?

[DISCLAIMER] This is about the future – everything here is being looked at / worked on but there is no guarantee if or when this capability will become available. This does not impact any existing support statements. So there, you have been warned. On to the coolness ..

When one of my prior posts talked about VM granular storage, this was what I could not talk about. But, now that the curtains are off at Barcelona, I am able to post this. Chad has posted the demo on his blog here (“VMworld 2012 – Psst… Want to see the future of storage with VMware and EMC?”).

Here is the demo itself.

What EMC and VMware demonstrated at VMworld Barcelona is a proof of concept displaying virtual machines moving non-disruptively across asynchronous latencies and under load, using VM granular storage from VMware and VPLEX Geo from EMC.

This demo won the partner demo challenge in the Steve Herrod keynote:

(Chad presented this at the 47 min mark – thank you to all those that voted!)

Can’t I do this today? aka What’s the big deal?

In a word, NO. You can move VMs from one side to another with VPLEX Metro (synchronous latencies now up to 10 msec). However, going asynchronous is a whole different ball of wax. Why is that? (By the way, this is a topic of discussion in ~100% of my VPLEX Geo conversations so this post is long overdue).

The answer lies in the interaction between vmfs and the asynchronous behavior of VPLEX Geo.

When vmfs was originally designed, it was a file system expecting disk attached to a server. It was extensible to storage coming from the SAN. Then a technology like VPLEX Metro extended vmfs across data centers. However, the common thread running through all of this is that the disk underlying vmfs is ‘synchronous’. In other words, when a write is issued from the host, before success is returned to the host, the write is on the media (yes, I understand that it is on the cache in the array but it is ‘on the box’ and will be on media should a failure happen).

This paradigm breaks when you go to disks that are asynchronously replicated. In this case, the big difference is that when a write is acknowledged on one side, the peer (asynchronously replicated) leg(s) of the disk will not have access to the data until the write is flushed from one side to the other. This should have made active / active on asynchronous disks impossible (after all, you should not be able to maintain a single consistent disk image, or read on the second side the data that you just wrote on the first side, until the flush has completed).

VPLEX Geo solves this by creating an intelligent distributed multi-site coherent cache (AccessAnywhere™) which is able to fetch the most current data even if the underlying disk is asynchronous. The data on the disk can come later (with the real flush of the data from site 1) while maintaining write order consistency.

With me so far?

The problem happens when there are failures in this scenario (either a site goes down or sites partition). Now, the ESX Cluster on the second side is expecting data on the disk to match what was acknowledged (i.e., synchronous) but the underlying disk data has not reached the second site (i.e., asynchronous). This risk is what caused both EMC and VMware to back away from supporting the combination of vSphere and VPLEX Geo.
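To make the acknowledgement gap concrete, here is a minimal Python sketch of a two-site disk. It is a toy model of my own, not VPLEX code: the `ReplicatedDisk` class, its method names and the `metro` / `geo` variable names are all hypothetical, and the coherent cache is deliberately left out so that only the on-disk state is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One leg of a replicated disk: block address -> data (toy model)."""
    blocks: dict = field(default_factory=dict)


class ReplicatedDisk:
    """Hypothetical two-site disk; `synchronous` decides when the peer is updated."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = Site()
        self.remote = Site()
        self.pending = []  # acknowledged writes not yet flushed to the peer

    def write(self, block: int, data: str) -> str:
        self.local.blocks[block] = data
        if self.synchronous:
            # The peer leg is updated before the host sees the acknowledgement.
            self.remote.blocks[block] = data
        else:
            # The host is acknowledged now; the peer leg catches up at flush time.
            self.pending.append((block, data))
        return "ack"

    def flush(self) -> None:
        """Ship pending writes to the peer in the order they were issued."""
        for block, data in self.pending:
            self.remote.blocks[block] = data
        self.pending.clear()

    def lost_if_local_site_fails(self) -> list:
        """Acknowledged blocks the surviving site would not find on its disk."""
        return [block for block, _ in self.pending]


metro = ReplicatedDisk(synchronous=True)   # stands in for a synchronous disk
geo = ReplicatedDisk(synchronous=False)    # stands in for an asynchronous disk
for disk in (metro, geo):
    disk.write(7, "journal update")

print(metro.lost_if_local_site_fails())  # []
print(geo.lost_if_local_site_fails())    # [7]
```

The second result is the window the paragraph above describes: the host was told the write succeeded, but a site failure before the flush leaves the surviving side without it.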

A second layer to the problem

If you imagine VMs working with shared storage and now stretch that across data centers over asynchronous latencies, one potential way that you can imagine solving the above problem is by having knowledge of which VM is accessing which portions of the data (you can already see VM granular concepts starting to eke their way here). If one is able to make that determination, you can now allow the partition scenario to play out in very interesting ways. So long as you ensure that the data remains current for a given VM on the side that it is active, you have the inside track to avoiding the situation above.

As it stands (in the world of the here and now), vmfs and VMware HA use heartbeat timeouts to help determine the health of the vmdk (even when VMs might not be active on the ESX server). Again, now switching to the view from a VPLEX perspective, it appears to the VPLEX Geo instance as if both sides are writing and therefore, both sides of the VPLEX Geo instance are active. Furthermore, the VM boundaries are not known at the storage layer. This prevents the storage from doing anything intelligent with the writes received.

Bottom line, when site failures or partitions happen, the failures cannot be limited to the VMs on the failing side (in a site failure scenario) or to the VMs on the non-preferred side (as would be the case with VPLEX Metro for instance). Rather all VMs are impacted.

Okay, I get it – VPLEX Geo is not supported with VMware. What are you doing to fix that?

That is probably the immediate follow on question after the details above are unwillingly accepted by most customers I interact with. As you can imagine, prior to VMworld 2012 Barcelona, a lot of it was ‘yes we are working on it’. But, as VMware has gone public with VM granular storage as a tech preview, this allows partners such as EMC to be a bit more open about what we are cooking.

Both VMware and EMC recognized this gap a while back. A team of product managers, architects and developers from both companies have been working very closely with each other over the last two years vetting the use cases, understanding the potential technical options and finally, what is needed to bring this solution to the market. (To all the customers and partners who participated in giving us input, answering our annoying questions, our ‘what if’ scenarios, THANK YOU!)

The solution is built using the VM granular storage infrastructure that is built to resolve other problems which have a similar symptom (i.e. impedance mismatch between LUN and the storage needed by a VM). Spelling out where a VM lies via vvols allows VPLEX Geo to understand where a particular VM is active. Even if the volume is distributed, since the vvol will be uniquely used by a particular VM, only one side of the vvol will continue to be accessed. As a Geo vMotion gets initiated, VPLEX Geo can now start to optimize the availability of the complete data on the disk on the other side. What this means is that a vvol based solution for Geo vMotion is no longer subject to the failure conditions that were described above. Before the engineers jump all over this post – Naturally, I am oversimplifying. There is a TON of work that needs to happen on both the VMware and EMC sides to deliver this.
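The difference in blast radius can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the function `impacted_vms`, the site labels and the VM names are hypothetical, and the real behavior depends on far more than which side a VM is running on.

```python
def impacted_vms(vm_sites: dict, failed_site: str, vm_granular: bool) -> set:
    """Return the VMs affected when `failed_site` goes down (toy model).

    vm_sites maps a VM name to the site where that VM is currently running.
    """
    if vm_granular:
        # The storage knows each VM's volume is active on one side only,
        # so only the VMs running on the failed side are affected.
        return {vm for vm, site in vm_sites.items() if site == failed_site}
    # Shared LUN: both sides look active to the storage, VM boundaries are
    # unknown, and every VM on the volume is affected.
    return set(vm_sites)


vms = {"web01": "site-a", "db01": "site-a", "app01": "site-b"}

print(sorted(impacted_vms(vms, "site-b", vm_granular=False)))  # ['app01', 'db01', 'web01']
print(sorted(impacted_vms(vms, "site-b", vm_granular=True)))   # ['app01']
```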

The coolest part of the demo for me is decidedly the least ‘sexy’ part of the demo. If you have used vMotion before, doing this over Geo latencies is pretty underwhelming. You do EXACTLY what you did before. You right click and migrate the VM, and underneath, vSphere and VPLEX Geo weave their magic and the VM is transported live to the remote side. Good stuff!

Finally, a BIG shout out to the vMotion and vvol team at VMware (Jennifer, Patrick, Haripriya, Gabe and the rest) and the VPLEX Project Baltimore team at EMC (Mike, Brad, Roel, Ranjit, Bill, Amir, Brian, Kurt, Thomas, Justin, Rob and several others). Great job guys in being able to pull the demo off!

As one of the VMware PMs remarked at VMworld, ‘If and when this GAs, it will be awesome!’ 😉

New VPLEX / VMware qualifications

With the latest release of the E-Lab Simple Support Matrix (ESSM) for VPLEX, there are some new VMware related qualifications that you should be aware of:

  1. VPLEX support for Metro vMotion: With the latest qualification, vMotion with VPLEX Metro is now supported up to 10 msec RTT (up until now, VPLEX was officially supported up to 5 msec RTT with vMotion and you had to file an RPQ for any additional distance support). This now means that you get official VPLEX support for the same latency as Metro vMotion (with your Enterprise Plus license). Please note that, per the vMSC guidelines, while vMotion is supported up to 10 msec latency, VMware HA is supported up to 5 msec ONLY. For environments other than Metro vMotion, RPQs will still be needed.
  2. VMware FT with VPLEX Metro: If you were dreaming about 0 RTO / 0 RPO end-to-end across data centers (and who doesn’t :-)), we just enabled that with VPLEX Metro and VMware FT. Consistent with the bounds of VMware FT, this is supported up to 1 msec RTT only. At that latency, we think it is far more likely that you will want to take advantage of the cross connect topology. So, the official support for VMware FT is with VPLEX Metro up to 1 msec with a cross-connect topology. NOTE: this is outside of the bounds of vMSC (which only covers HA and vMotion).
  3. VPLEX Witness with VMware FT: We had also received questions from customers about providing protection for the VPLEX Witness. This is now also supported with the VPLEX Witness VM being protected by VMware FT. So, if the witness VM or the machine running it fails, you can continue without missing a heartbeat (no pun intended ;-)).

If you are interested in the details of (2) and (3), please refer to the white paper published (Using VPLEX Metro with VMware HA and Fault Tolerance for Ultimate Availability). You should definitely read up on this white paper before deploying this topology.
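The latency bounds quoted in the list above can be written down as a small lookup. This is a sketch for keeping the numbers straight, not a substitute for the ESSM: the `MAX_RTT_MSEC` table and the `supported_features` function are names I made up, and the only rules encoded are the ones stated in this post (10 msec for vMotion, 5 msec for VMware HA, 1 msec plus cross-connect for VMware FT).

```python
# Round-trip latency limits in msec, as quoted in the list above.
MAX_RTT_MSEC = {
    "vMotion": 10.0,
    "VMware HA": 5.0,
    "VMware FT": 1.0,
}


def supported_features(rtt_msec: float, cross_connect: bool = False) -> list:
    """Return the features within the quoted bounds for a VPLEX Metro link."""
    supported = []
    for feature, limit in MAX_RTT_MSEC.items():
        if rtt_msec > limit:
            continue
        # Per the post, official FT support assumes a cross-connect topology.
        if feature == "VMware FT" and not cross_connect:
            continue
        supported.append(feature)
    return supported


print(supported_features(8.0))                      # ['vMotion']
print(supported_features(4.0))                      # ['vMotion', 'VMware HA']
print(supported_features(0.8, cross_connect=True))  # ['vMotion', 'VMware HA', 'VMware FT']
```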

And, oh yeah, vSphere 5.1 (vMSC and all) is now officially supported and posted on the VMware HCL.

VMworld 2012: VM granular storage aka vvols – Bridging the gap between VMs and storage

The second session at VMworld that in my mind introduced another game changer was the tech preview session introducing VM granular storage.

This post is a circuitous one that starts with a problem (especially if you are an ESX administrator who prefers to deal with block storage). Let me set this up a bit.

You love what VMware is offering in terms of features and function. You are leading the crusade towards more and more virtualization. Heck, some days it even feels like you are winning the battle. Through your evangelization you have been able to get your organization to adopt a ‘virtual first’ stance (in other words, as new applications are provisioned, they are deployed on virtual instances by default). The burden of proof is on the application admin to justify deploying on a non-virtualized environment. You have been able to drive tremendous efficiency within your organization and now have an environment that is 60+% virtualized. So where is the problem?

When you reach this level of virtualization, an interesting thing happens. You start to think of your basic unit of operation as a virtual machine. Since you are now at a common platform across the majority of your data center, you would like to operate at that level since it introduces a lot of operational efficiency for you. It works great until you think about your storage and network. In other words, of the three pillars of your data center infrastructure, you have solved compute and are now raring to solve the other two.

If this describes you, then the rest of the post is for you (at least for the storage piece anyways).

Let me next walk through your provisioning process today. If you need to allocate a virtual machine, the first task is to check whether there is space available on your vmdk. Let us assume there isn’t. You then have to work with the application admin to figure out what their application characteristics are and what SLA levels they require for this app. Next you go to the storage admin and start to translate those into LUNs, pools and the ESX servers that need access to them. Deprovisioning is a similar story. Multiply this by the 500+ VMs (or many more) that you are provisioning and deprovisioning in the course of a year. Doesn’t sound like fun, does it? Other things that are problematic in this world view:
1. There is an impedance mismatch between the storage characteristics and the VM requirements. Think about operations creating a copy of some sort of the datastore for a VM. Unless you do it as a clone from the VM, you are creating a copy of the entire LUN.
2. Application requirements are more dynamic than a one time allocation. That change does not get expressed without another round of coordination between the application admin, you and the storage admin.

So how do you solve this? In comes VM granular storage. Imagine a world in which the VM was directly expressed to the storage. So, the application admin says ‘Create a virtual machine with 200 GB of storage with permission to grow up to 500 GB and provisioned on silver storage’. In your organization, silver translates to 5% flash, 45% fibre channel and 50% SATA with a local snap taken once every two hours and DR protection with an RPO of 8 hours. Based on this, there is a charge back to the line of business. This (through the magic of VMware and storage platforms) is passed directly onto the storage which then allocates the corresponding storage. As the application requirements change, those characteristics are passed through to the storage layer and the adjustment is automatic. When the VM is deprovisioned, the storage is AUTOMATICALLY freed up and returned back to the pool. Who would have thunk that?
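As a rough illustration of what such a request could carry, here is a Python sketch of the ‘silver’ example. It is not a VMware or EMC data model: the `StoragePolicy` and `VmStorageRequest` classes and all of their field names are hypothetical, and only the numbers come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class StoragePolicy:
    """Hypothetical service level; tier_mix_percent maps tier -> % of capacity."""
    name: str
    tier_mix_percent: dict
    snapshot_interval_hours: int
    dr_rpo_hours: int

    def __post_init__(self):
        if sum(self.tier_mix_percent.values()) != 100:
            raise ValueError("tier mix must add up to 100%")


@dataclass
class VmStorageRequest:
    """What the application admin asks for, expressed per VM rather than per LUN."""
    initial_gb: int
    max_gb: int
    policy: StoragePolicy

    def tier_allocation_gb(self) -> dict:
        """Split the initial capacity across tiers according to the policy."""
        return {
            tier: self.initial_gb * percent / 100
            for tier, percent in self.policy.tier_mix_percent.items()
        }


SILVER = StoragePolicy(
    name="silver",
    tier_mix_percent={"flash": 5, "fibre_channel": 45, "sata": 50},
    snapshot_interval_hours=2,
    dr_rpo_hours=8,
)

request = VmStorageRequest(initial_gb=200, max_gb=500, policy=SILVER)
print(request.tier_allocation_gb())
# {'flash': 10.0, 'fibre_channel': 90.0, 'sata': 100.0}
```

The point of the sketch is the shape of the request: capacity, growth limit and service level travel together with the VM, so a later change means resubmitting the same structure rather than another round of meetings.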

So how can this be achieved? Last year, VMware introduced the vStorage APIs for Storage Awareness (VASA). This is an out-of-band management interface which creates a standard protocol through which vCenter can talk to the storage platform. This layer is being expanded to allow precisely the kind of communication that I described here. On the storage side, what this allows us to do is to realize what the bounds of the VM are (without truly changing the SCSI characteristics of the platform). Being aware of the bounds of the VM allows the services that the storage platform provides to be delivered at the level of a VM. The snap or clone is no longer at the level of a LUN, it is at the level of a VM. In case you are wondering, the same paradigm applies to other services – backup, encryption, dedupe, etc. Have I whetted your appetite? Want to learn more? Here are ways that you can:

  1. Session from VMworld 2011: Link here
  2. Attend session INF-STO2223 at VMworld 2012: Link here

What do you think? Does this solve problems you are encountering? What does implementing something like this mean for you?