[DISCLAIMER] This is about the future – everything here is being looked at / worked on but there is no guarantee if or when this capability will become available. This does not impact any existing support statements. So there, you have been warned. On to the coolness ..
When one of my prior posts talked about VM granular storage, this was what I could not talk about. But, now that the curtains are off at Barcelona, I am able to post this. Chad has posted the demo on his blog here (“VMworld 2012 – Psst… Want to see the future of storage with VMware and EMC?”).
Here is the demo itself.
What EMC and VMware demonstrated at VMworld Barcelona is a proof of concept displaying virtual machines moving non-disruptively across asynchronous latencies and under load, using VM granular storage from VMware and VPLEX Geo from EMC.
This demo won the partner demo challenge in the Steve Herrod keynote:
(Chad presented this at the 47 min mark – thank you to all those that voted!)
Can’t I do this today? aka What’s the big deal?
In a word, NO. You can move VMs from one side to another with VPLEX Metro (synchronous latencies now up to 10 msec). However, going asynchronous is a whole different ball of wax. Why is that? (By the way, this is a topic of discussion in ~100% of my VPLEX Geo conversations so this post is long overdue).
The answer lies in the interaction between vmfs and the asynchronous behavior of VPLEX Geo.
When vmfs was originally designed, it was a file system expecting disk attached to a server. It was extensible to storage coming from the SAN. Then a technology like VPLEX Metro extended vmfs across data centers. However, the common thread running through all of this is that the disk underlying vmfs is ‘synchronous’. In other words, when a write is issued from the host, before success is returned to the host, the write is on the media (yes, I understand that it is on the cache in the array but it is ‘on the box’ and will be on media should a failure happen).
This paradigm breaks when you go to disks that are asynchronously replicated. In this case, the big difference is that when a write is acknowledged on one side, the peer (asynchronously replicated) leg(s) of the disk will not have access to the data until the write is flushed from one side to the other. This should have made active / active on asynchronous disks impossible (after all, until the flush has completed, you should not be able to maintain a single consistent disk image, or to read on the second side the data that you just wrote on the first side).
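To make the acknowledgement gap concrete, here is a toy Python sketch. It is only an illustration of the general idea of asynchronous replication, not of how any array or VPLEX actually implements it; the class `AsyncReplicatedDisk` and the site names are made up for this example.

```python
class AsyncReplicatedDisk:
    """Toy model (hypothetical): one disk with two legs, replicated asynchronously."""

    def __init__(self):
        self.legs = {"site1": {}, "site2": {}}
        self.pending = []  # writes acknowledged locally but not yet on the peer leg

    def write(self, site, block, data):
        # The host gets its acknowledgement as soon as the local leg has the data.
        self.legs[site][block] = data
        peer = "site2" if site == "site1" else "site1"
        self.pending.append((peer, block, data))
        return "ack"

    def read(self, site, block):
        # A plain read only sees what has physically reached this leg.
        return self.legs[site].get(block)

    def flush(self):
        # Ship pending writes to the peer in the order they were issued.
        for peer, block, data in self.pending:
            self.legs[peer][block] = data
        self.pending.clear()


disk = AsyncReplicatedDisk()
disk.write("site1", 7, "new")
before = disk.read("site2", 7)  # None: acknowledged at site1, not yet at site2
disk.flush()
after = disk.read("site2", 7)   # "new"
```

Between the acknowledgement and the flush, the two sides simply disagree about what is on the disk.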
VPLEX Geo solves this by creating an intelligent distributed multi-site coherent cache (AccessAnywhere™) which is able to fetch the most current data even if the underlying disk is asynchronous. The data on the disk can come later (with the real flush of the data from site 1) while maintaining write order consistency.
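One way to picture a coherent cache sitting on top of asynchronous disk is the rough Python sketch below. To be clear, this is my own simplification: the `owner` directory, the `log`, and the class name `CoherentCache` are assumptions made for illustration and say nothing about how AccessAnywhere is really built.

```python
class CoherentCache:
    """Toy model (hypothetical): a directory tracks which site holds the newest copy."""

    def __init__(self, sites=("site1", "site2")):
        self.cache = {s: {} for s in sites}  # per-site cache of not-yet-flushed blocks
        self.disk = {s: {} for s in sites}   # per-site disk leg
        self.owner = {}                      # block -> site holding the newest copy
        self.log = []                        # (block, data) in the order written

    def write(self, site, block, data):
        self.cache[site][block] = data
        self.owner[block] = site
        self.log.append((block, data))

    def read(self, site, block):
        holder = self.owner.get(block)
        if holder is not None:
            # Newest copy is still in a cache, possibly at the other site: fetch it.
            return self.cache[holder][block]
        return self.disk[site].get(block)

    def flush(self):
        # Apply writes to every leg in issue order, so each leg stays write-order consistent.
        for block, data in self.log:
            for leg in self.disk.values():
                leg[block] = data
        self.log.clear()
        self.owner.clear()
        for cached in self.cache.values():
            cached.clear()


geo = CoherentCache()
geo.write("site1", 7, "v1")
geo.write("site1", 7, "v2")
seen_remotely = geo.read("site2", 7)       # "v2", served from site1's cache
on_disk_before = geo.disk["site2"].get(7)  # None, the disk has not caught up yet
geo.flush()
on_disk_after = geo.disk["site2"].get(7)   # "v2"
```

The read at site 2 returns current data even though site 2's disk leg is behind, which is the behavior described above.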
With me so far?
The problem happens when there are failures in this scenario (either a site goes down or sites partition). Now, the ESX Cluster on the second side is expecting data on the disk to match what was acknowledged (i.e., synchronous) but the underlying disk data has not reached the second site (i.e., asynchronous). This risk is what caused both EMC and VMware to back away from supporting the combination of vSphere and VPLEX Geo.
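Here is a small Python sketch of that failure case, under the simplifying assumption that writes reach the second site strictly in order and that site 1 is lost after only some of them have been flushed. The function name and its arguments are hypothetical.

```python
def site2_after_site1_failure(acknowledged, flushed):
    """Toy model (hypothetical).

    acknowledged: ordered list of (block, data) writes acknowledged to hosts at site1
    flushed:      how many of those writes had reached site2's disk leg
    Returns the disk image left at site2 and the blocks that are behind.
    """
    disk = {}
    for block, data in acknowledged[:flushed]:
        disk[block] = data

    # What the hosts believe is on disk, because every write was acknowledged.
    expected = {}
    for block, data in acknowledged:
        expected[block] = data

    stale = sorted(b for b in expected if disk.get(b) != expected[b])
    return disk, stale


acked = [(1, "a"), (2, "b"), (1, "c")]
disk, stale = site2_after_site1_failure(acked, flushed=2)
# disk is {1: "a", 2: "b"} and stale is [1]
```

The surviving image is consistent in write order, but it is older than what the hosts were told was safely written, and that is the mismatch a synchronous-minded file system is not expecting.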
A second layer to the problem
If you imagine VMs working with shared storage and now stretch that across data centers over asynchronous latencies, one potential way to solve the above problem is to know which VM is accessing which portions of the data (you can already see VM granular concepts starting to creep in here). If one is able to make that determination, you can now allow the partition scenario to play out in very interesting ways. So long as you ensure that the data for a given VM remains current on the side where that VM is active, you have the inside track to avoiding the situation above.
As it stands (in the world of the here and now), vmfs and VMware HA use heartbeat timeouts to help determine the health of the vmdk (even when VMs might not be active on the ESX server). Again, now switching to the view from a VPLEX perspective, it appears to the VPLEX Geo instance as if both sides are writing and therefore, both sides of the VPLEX Geo instance are active. Furthermore, the VM boundaries are not known at the storage layer. This prevents the storage from doing anything intelligent with the writes received.
Bottom line, when site failures or partitions happen, the failures cannot be limited to the VMs on the failing side (in a site failure scenario) or to the VMs on the non-preferred side (as would be the case with VPLEX Metro for instance). Rather all VMs are impacted.
Okay, I get it – VPLEX Geo is not supported with VMware. What are you doing to fix that?
That is probably the immediate follow-on question after the details above are reluctantly accepted by most customers I interact with. As you can imagine, prior to VMworld 2012 Barcelona, the answer was mostly ‘yes we are working on it’. But, as VMware has gone public with VM granular storage as a tech preview, this allows partners such as EMC to be a bit more open about what we are cooking.
Both VMware and EMC recognized this gap a while back. A team of product managers, architects and developers from both companies has been working very closely together over the last two years, vetting the use cases, understanding the potential technical options and, finally, working out what is needed to bring this solution to the market. (To all the customers and partners who participated in giving us input, answering our annoying questions and our ‘what if’ scenarios, THANK YOU!)
The solution is built using the VM granular storage infrastructure, which is designed to resolve other problems that share a similar symptom (i.e., an impedance mismatch between the LUN and the storage needed by a VM). Spelling out where a VM lies via vvols allows VPLEX Geo to understand where a particular VM is active. Even if the volume is distributed, since the vvol will be uniquely used by a particular VM, only one side of the vvol will continue to be accessed. As a Geo vMotion gets initiated, VPLEX Geo can now start to optimize the availability of the complete data on the disk on the other side. What this means is that a vvol based solution for Geo vMotion is no longer subject to the failure conditions that were described above. Before the engineers jump all over this post: naturally, I am oversimplifying. There is a TON of work that needs to happen on both the VMware and EMC sides to deliver this.
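For those who like to see the difference in granularity spelled out, here is a rough Python sketch comparing a shared datastore with per-VM vvols on a site failure. It is a deliberately crude model: the rule "a storage object written from both sides cannot be resolved safely" and the function `impacted_vms` are my assumptions for illustration, not a description of the actual VPLEX Geo or vSphere logic.

```python
def impacted_vms(objects, failed_site):
    """Toy model (hypothetical).

    objects: name -> (set of sites the storage sees writing to it, list of VMs on it)
    A storage object is in trouble if both sides appear active on it,
    or if its only active side is the one that failed.
    """
    hit = set()
    for writers, vms in objects.values():
        if len(writers) > 1 or failed_site in writers:
            hit.update(vms)
    return sorted(hit)


# One shared datastore: heartbeats make both sides look active to the storage.
shared_lun = {"datastore1": ({"site1", "site2"}, ["vm1", "vm2", "vm3"])}

# One vvol per VM: each is only accessed from the side where its VM runs.
vvols = {
    "vvol-vm1": ({"site1"}, ["vm1"]),
    "vvol-vm2": ({"site1"}, ["vm2"]),
    "vvol-vm3": ({"site2"}, ["vm3"]),
}

lun_result = impacted_vms(shared_lun, "site1")  # ['vm1', 'vm2', 'vm3']
vvol_result = impacted_vms(vvols, "site1")      # ['vm1', 'vm2']
```

With the shared datastore every VM is caught up in the failure; with per-VM volumes only the VMs that were running on the failed side are.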
The coolest part of the demo for me is decidedly the least ‘sexy’ part of the demo. If you have used vMotion before, doing this over Geo latencies is pretty underwhelming. You do EXACTLY what you did before. You right click and migrate the VM, and underneath, vSphere and VPLEX Geo weave their magic and the VM is transported live to the remote side. Good stuff!
Finally, a BIG shout out to the vMotion and vvol team at VMware (Jennifer, Patrick, Haripriya, Gabe and the rest) and the VPLEX Project Baltimore team at EMC (Mike, Brad, Roel, Ranjit, Bill, Amir, Brian, Kurt, Thomas, Justin, Rob and several others). Great job, guys, on pulling the demo off!
As one of the VMware PMs remarked at VMworld, ‘If and when this GAs, it will be awesome!’ 😉