Virtualize Windows Storage Replica cluster on vSphere

I have recently been testing a Windows Storage Replica cluster (available in the upcoming version of Windows Server, currently available as a Technical Preview) on VMware, with the goal of building geographically dispersed Windows failover clusters without depending on shared SAN disk.

I created a two-node Windows failover cluster with the following configuration. The two VMs run in separate datacenters connected by stretched VLANs, and a Windows file share witness located in a third site provides quorum. Each site is managed by its own vCenter server.
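For reference, the cluster and the file share witness can be stood up with a few PowerShell commands. This is only a minimal sketch; the node names, cluster IP address and witness share path below are placeholders rather than the values from my lab.

    # Install the failover clustering feature on both nodes first
    Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools

    # Optional but recommended: run cluster validation
    Test-Cluster -Node "SR-NODE1", "SR-NODE2"

    # Create the two-node cluster without adding any storage yet
    New-Cluster -Name "SR-CLUSTER" -Node "SR-NODE1", "SR-NODE2" `
        -StaticAddress "192.168.10.50" -NoStorage

    # Point quorum at a file share in the third witness site
    Set-ClusterQuorum -Cluster "SR-CLUSTER" -FileShareWitness "\\WITNESS01\SR-Witness"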

VM Configuration

  • ESXi 5.5 update 2
  • VMX-10
  • Guest OS – Set to Windows Server 2012

Networking

  • Production Network using VMXNET3
  • Cluster Heartbeat Network using E1000 – Separate physical network with no L3 routing

Storage

  • 1x LSI Logic SAS SCSI controller for the guest OS VMDK
    • 1x thin provisioned VMDK for the OS
  • 1x LSI Logic SAS SCSI controller for Storage Replica, with physical bus sharing enabled
    • 1x 40GB eager zeroed thick VMDK for Storage Replica data
    • 1x 10GB eager zeroed thick VMDK for the Storage Replica log

[Screenshot: VM configuration]

One problem I discovered is that the virtual SCSI controller must be set to physical bus sharing mode, because Storage Replica uses SCSI-3 persistent reservations to lock access to the disk on each node, even though replication happens in-guest over SMB3.  If SCSI bus sharing is not enabled, you will not be able to import the disks into Failover Cluster Manager and therefore will not be able to enable Storage Replica.

Note: If you have problems creating an eager zeroed thick VMDK, enable Fault Tolerance on the VM; it will inflate the disks automatically for you.
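If you prefer to script the disk layout, something like the following PowerCLI sketch should get you close. The vCenter, VM and datastore names are placeholders, and you would run the same against the node in the second site.

    # Minimal PowerCLI sketch (names are placeholders): add the data and log VMDKs
    # and place them on a dedicated LSI Logic SAS controller with physical bus sharing.
    Connect-VIServer -Server "vcenter01.example.local"

    $vm = Get-VM -Name "SR-NODE1"

    # Eager zeroed thick disks for Storage Replica data (40GB) and log (10GB)
    $dataDisk = New-HardDisk -VM $vm -CapacityGB 40 -Datastore "DS01" -StorageFormat EagerZeroedThick
    $logDisk  = New-HardDisk -VM $vm -CapacityGB 10 -Datastore "DS01" -StorageFormat EagerZeroedThick

    # New SCSI controller with physical bus sharing, required for the in-guest
    # SCSI-3 persistent reservations that Storage Replica relies on
    New-ScsiController -HardDisk $dataDisk, $logDisk -Type VirtualLsiLogicSAS -BusSharingMode Physical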

If the disks are configured correctly as above, you should be able to import them into Windows Failover Cluster Manager and configure Storage Replica between the data VMDKs (40GB in the screenshot below). The 10GB VMDKs are used for the Storage Replica log.

[Screenshot: adding the disks in Failover Cluster Manager]
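In the guest this boils down to bringing the new disks online, formatting them, and adding them as available cluster storage, roughly as sketched below. Disk numbers and drive letters are examples only; check yours with Get-Disk first.

    # Minimal sketch, run on the node that currently owns the new disks
    Set-Disk -Number 1 -IsOffline $false
    Initialize-Disk -Number 1 -PartitionStyle GPT
    New-Partition -DiskNumber 1 -UseMaximumSize -DriveLetter D | Format-Volume -FileSystem NTFS

    Set-Disk -Number 2 -IsOffline $false
    Initialize-Disk -Number 2 -PartitionStyle GPT
    New-Partition -DiskNumber 2 -UseMaximumSize -DriveLetter E | Format-Volume -FileSystem NTFS

    # Add everything the cluster can see as available storage
    Get-ClusterAvailableDisk | Add-ClusterDisk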

To set up the actual Storage Replica cluster, I recommend following the excellent Storage Replica guide created by the Windows storage team: http://go.microsoft.com/fwlink/?LinkID=514902

One thing I noticed is that the SOURCE replica disk must be a member of a cluster role before replication can be enabled, or Cluster Shared Volumes must be enabled on the disk.

[Screenshot: enabling replication]
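Once the disks are in a cluster role (or CSV is enabled), the replica itself can be created with New-SRPartnership. This is only a sketch; the node names, replication group names and drive letters are placeholders, with D: being the 40GB data volume and E: the 10GB log volume on each node.

    # Minimal sketch: enable synchronous replication between the two nodes
    New-SRPartnership -SourceComputerName "SR-NODE1" -SourceRGName "rg01" `
        -SourceVolumeName "D:" -SourceLogVolumeName "E:" `
        -DestinationComputerName "SR-NODE2" -DestinationRGName "rg02" `
        -DestinationVolumeName "D:" -DestinationLogVolumeName "E:"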

Once the replica is configured, you will notice that the “Replication Role” in Failover Cluster Manager becomes active. This indicates which disks are the replication source and destination.

[Screenshot: replication status in Failover Cluster Manager]

Initial testing looks good; Task Manager showed around 1.6Gb/s of network throughput while the initial sync was running.
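If you want to track the initial sync from PowerShell rather than Task Manager, the Storage Replica cmdlets expose the remaining byte count on the replica objects, along the lines of the sketch below (run against the destination node).

    # Minimal sketch: show how much data is left to copy for each replicated volume
    (Get-SRGroup).Replicas | Select-Object DataVolume, NumOfBytesRemaining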

One other issue I discovered was that I was unable to reverse replication by failing over the CSV from one node to another; however, reversing replication using the new Set-SRPartnership PowerShell cmdlet worked fine. This should be fixed in a later beta of Windows Server.
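Reversing the replication direction with the cmdlet looks roughly like this; the node and replication group names are placeholders matching the earlier sketch.

    # Minimal sketch: make SR-NODE2 the new source and SR-NODE1 the destination
    Set-SRPartnership -NewSourceComputerName "SR-NODE2" -SourceRGName "rg02" `
        -DestinationComputerName "SR-NODE1" -DestinationRGName "rg01"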

I see some really good use cases for Storage Replica in the future; one example would be SQL Server clustering. Today, SQL clustering without shared SAN disk requires database replication using AlwaysOn Availability Groups, which are only available at a high cost with the Enterprise edition of SQL Server.

SQL Standard edition still supports clustering, but only with shared disk. With Storage Replica, you could potentially build SQL failover clusters using the cheaper SQL Standard edition and place the databases on Storage Replica volumes instead of using database replication. SQL Server 2014 Standard also supports Cluster Shared Volumes, which would add a few more good improvements when paired with Storage Replica.

Why modern enterprises should move from a virtual first to a virtual only strategy

Virtualization has changed how modern enterprises are run. Most companies by now have completed, or are currently finishing, a virtualization program in which all legacy physical servers are migrated to a virtualized infrastructure, both increasing efficiency and lowering operational costs.

Most companies, once the virtualization program has been completed and the people and processes are mature and reliable, switch to a virtual first policy. Under this policy all new services and applications are delivered as virtual machines by default, and physical machines are only offered if a request matches certain exception criteria. These exception criteria often include, but are not limited to, physical boundaries such as specialized hardware, monster VMs, and huge amounts of storage. Often the argument is: if an application consumes an entire host’s resources, what is the point in virtualizing it? Doesn’t it get expensive to purchase virtualization licenses for just a single VM?

However, I see issues with physical servers still being issued under exception criteria, especially once the enterprise has reached more than roughly 95% virtualization. Yes, the initial cost of virtualizing monster VMs will be higher from day one, but if a proper TCO calculation is run you will begin to see why physical servers end up costing more towards the middle and end of the hardware’s lifecycle.

Here are some examples of the operational boundaries and limitations introduced by bringing physical servers back into a highly virtualized infrastructure, along with reasons why virtualization should be the default for all workloads.

  • You put the responsibility for maintaining the physical hardware lifecycle back on service and application owners. The application lifecycle becomes re-coupled to the physical hardware lifecycle, so any delay in application upgrades or replacements forces you to either keep renewing the physical hardware support or take the risk of running hardware without support.
  • Patching, hardware and driver updates become complex. Separate driver packages, firmware updates and routines for physical servers must be maintained and managed separately from the virtual environment. Since you cannot vMotion a physical server, maintenance windows between operational teams and service owners must be re-established, often requiring overtime work.
  • Failover procedures must be maintained separately. Any datacenter failover must involve separate procedures and tests, independent of the rest of the virtualized environment. High availability must be 100% solid and handled separately from your standardized infrastructure; with physical servers, VMware HA is not available as a fallback if application-level high availability fails.
  • Backup and restore procedures must be maintained and operated separately. Backup agents need to be installed and managed in physical servers with separate backup schedules and policies. Restore procedures become complex if the entire server fails.
  • Different server deployment procedures must be maintained for the physical and virtual environments. Many companies deploy VMs using templates whilst deploying physical servers using PXE, so both deployment methods must continue to be managed separately, sometimes even by different teams.
  • The monster VMs of today will not be the monster VMs of tomorrow. The performance of modern x86 CPUs continues to grow in line with Moore’s law. Five years ago a typical large SQL database server ran a dual socket, quad core configuration with 64-128GB of RAM; you wouldn’t think twice about virtualizing that kind of workload today.
  • Virtualization enables a faster hardware refresh cycle. Once application decoupling has been completed, many enterprises move to a much faster hardware refresh cycle in their virtual environment. Production virtualization hosts are moved to test environments sooner, and VMs are migrated without application owners even noticing. Applications see an increase in performance during their normal lifecycle, which does not happen on physical hardware.
  • Everything can be virtualized with proper design. Claims that virtualization hurts performance have no real technical basis today if the application and underlying hardware are properly sized and tuned. The overhead imposed by hypervisors, especially with paravirtualized SCSI and network adapters, is negligible. Low latency voice applications can be virtualized using the latency sensitivity feature in vSphere 5.5. If an application somehow requires performance that exceeds the limits of modern hypervisors, consider scaling the application out instead of up, and consider hiring expert consultants to analyze your most demanding applications before deciding to run them on physical hardware.
  • Have applications that require huge amounts of storage, such as Exchange 2013? Consider smarter storage solutions that enable compression and/or deduplication; you could see considerable savings in required capacity and datacenter space when this functionality is moved to the array level. Properly evaluate the TCO, risks and operational overhead of maintaining cheap storage in DAS cabinets versus enterprise storage with lower failure rates.

As with everything, a proper TCO calculation must be run early in the project phase to determine the true cost of introducing physical servers into a highly virtualized environment. Make sure all stakeholders are involved and are aware of the extra operational cost of maintaining a separate, non-standardized physical silo of infrastructure.

Eliminating RDM complexity with storage replica in the next version of Windows Server

Recently the new features available in the next version of Windows Server were announced, along with a public preview. One hot feature that caught my attention was Storage Replica, which enables block level synchronous or asynchronous replication between two storage-agnostic volumes over SMB3.

If synchronous replication is used, you can create metro clusters using Windows Failover Cluster Manager. You select two volumes that support SCSI-3 persistent reservations, create the replica, and the volume appears as a standard clustered disk resource in Failover Cluster Manager which can be failed over to other nodes in the cluster.
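The Storage Replica PowerShell module also provides a Test-SRTopology cmdlet (at least in current builds) that can check ahead of time whether the volumes and the network between the two sites are suitable for synchronous replication. A rough sketch with placeholder server names, drive letters and result path:

    # Minimal sketch: validate the proposed replication topology and write a report
    Test-SRTopology -SourceComputerName "SR-NODE1" -SourceVolumeName "D:" -SourceLogVolumeName "E:" `
        -DestinationComputerName "SR-NODE2" -DestinationVolumeName "D:" -DestinationLogVolumeName "E:" `
        -DurationInMinutes 30 -ResultPath "C:\Temp"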

Asynchronous replication can be used for scenarios such as data migration, as you can create replication partnerships with other servers or even between volumes on the same server. Since the replication is block based rather than file based, open files such as SQL databases are not a problem.
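A server-to-server asynchronous partnership for a migration scenario is created with the same cmdlet, just with the replication mode switched. Again, this is a sketch with placeholder server, replication group and volume names.

    # Minimal sketch: asynchronous server-to-server replication for data migration
    New-SRPartnership -SourceComputerName "FS01" -SourceRGName "rg-data" `
        -SourceVolumeName "F:" -SourceLogVolumeName "G:" `
        -DestinationComputerName "FS02" -DestinationRGName "rg-data-dr" `
        -DestinationVolumeName "F:" -DestinationLogVolumeName "G:" `
        -ReplicationMode Asynchronous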

Many VMware customers, including myself, utilize in-guest virtualized metro clusters to create high availability across two or more datacenters for mission critical tier-1 applications. These applications require four or more nines of availability and cannot depend on a single VM for their HA.

Unfortunately, not all applications that require high availability support application based replication; many instead depend on shared clustered disk for this functionality. Designs are therefore based on SAN disk that is virtualized and replicated between two geographic locations at the back end by products such as EMC VPLEX, and then presented to the guest as an RDM device.

You can create a cluster-in-a-box scenario with a single shared VMDK, but unless the multi-writer flag is enabled you cannot run the two cluster VMs across more than a single host. Windows failover clustering requires SCSI persistent reservations to lock access to the disk, so unfortunately this solution, which is commonly used for Oracle RAC, also won’t work for Microsoft clusters.

So, as things stand, the only way to create virtualized Windows-based metro clusters that require shared cluster disk is to use RDM devices across two or more guests.

I have the following issues with RDMs used for in-guest clustering:

  • They create operational dependencies between your virtualization and storage departments. Resizing an RDM requires the virtualization administrator to coordinate with the storage administrator to resize the backend LUN, which is difficult to automate without third party products.
  • They create operational dependencies between application owners, OS administrators and virtualization teams. RDMs using SCSI bus sharing require the virtual SCSI adapter to be configured in physical bus sharing mode, and with physical bus sharing enabled, live vMotion is disabled. Any maintenance on the VMware hosts therefore requires all of these teams to coordinate and agree on downtime as cluster resources are failed over. Unfortunately, Storage Replica in synchronous mode still requires SCSI reservations; one way around this limitation is to use the Windows in-guest iSCSI initiator and target in loopback mode. Hopefully in future VMware versions we will be able to vMotion with physical bus sharing enabled.
  • SAN migrations become more complex. Yes, with solutions like VPLEX you can present another SAN behind the VPLEX controller and migrate data on the fly, but what if you want to move to another vendor’s mirroring product entirely? This requires potential downtime as data is manually copied in-guest from one array vendor to another, unless yet another third party block level replication product is used. Clusters demand high uptime by design, so getting approval for these outage windows can take weeks of negotiation.
  • The 256 LUN limit per VMware host allows less consolidation of VMs per host and can cause you to reach the LUN limit faster, especially if you use products like Veritas Storage Foundation with in-guest mirroring, as this requires a minimum of two RDMs per logical volume.
  • RDMs are complex to manage. Unless this can be orchestrated in some way, it can be difficult to identify and select the correct RDM when adding disks to a new VM.

With Storage Replica, managing virtualized metro clusters is simplified, as we can use VMDKs the same as for all other virtual machines. The replication dependency moves away from the underlying hardware and closer to the application level, where it belongs. I have demoed and automated the creation of virtualized metro clusters running on VMware in my lab, and I will share these guides in upcoming blog posts. If you want to get started yourself, the following Microsoft resources have good information.

Windows Server Technical Preview Storage Replica Guide –

http://go.microsoft.com/fwlink/?LinkID=514902

What’s New in Storage Services in Windows Server Technical Preview

http://technet.microsoft.com/en-us/library/dn765475.aspx

What to look out for when comparing hyper-converged solutions. A customer’s viewpoint.

Over the past nine or so months my colleagues and I have been investigating hyper-converged systems running software-defined storage to take the next step in our virtualization initiative.

We are very close to reaching 100% x86 virtualization across all our datacenters by decommissioning or P2Ving all legacy physical servers, and I believe hyper-converged infrastructure and software-defined storage is the next logical step towards our goal of a totally software-defined and automated datacenter. The main driver for this journey was to lower the cost of maintaining and running storage and to eliminate unneeded complexity in our infrastructure. Enterprise storage is in for a shake-up over the coming years.

During this phase we investigated many, if not all, of the solutions out there in various forms of testing, and I would like to share what you should look for when running your own investigation. I will leave the list vendor agnostic until the end; these are my own opinions, I do not work for a vendor, and this is based purely on my thoughts as a customer.

The list is long, but hopefully it will save you some time and help you during your investigation and POC phase.

  • Gartner is key to many purchasing decisions, so how is the solution represented on Gartner’s new magic quadrant for converged systems? If the vendor is listed as a leader, are they a true hyper-converged solution, or are they what I would consider version 1 of converged systems, e.g. a rack of pre-built and tested legacy components that work well together but still leave you facing fork-lift upgrades?
  • Will the vendor allow you to power off a node while testing? How about two or even three nodes simultaneously? How about pulling disks out while doing this? What happens to running VMs while this happens? How hesitant do they look when you suggest these tests? Remember, data integrity is ALWAYS more important than performance. You do not want to be restoring from backups because data was corrupted when an entire rack or datacenter lost power. Even if the restore process is easy, there are still applications that require an RPO of zero, and data should never be at risk of corruption on enterprise storage.
  • Are customer references available on request at the same scale or in the same industry that you work in? Take these calls. Do the customers seem enthusiastic about the product?
  • How easy is it to install new nodes? Can you easily scale up or even down, when new projects come on board or systems are decommissioned? Can the entire node installation or cluster installation process be automated? Or are there a set of manual tasks that must be completed after the first cluster is configured?
  • How easy is it to train your operations team on the product? Does it require a lengthy set of complex standard operating procedures?
  • How many manual tasks can be eliminated with the product? A business case presented to the stakeholders showing savings in operational hours and tasks should be included during a decision to show true TCO. Make sure to involve the teams responsible for managing your existing infrastructure, and have them list daily, weekly and monthly tasks on current infrastructure.
  • How well is the support team rated? Is it easy to open support tickets and keep track of the nodes you have purchased? Does the support team have experts with experience across the entire converged infrastructure stack, e.g. hypervisor, network, and even application tuning? Are multiple support tiers available? Consider whether you really need a four hour response time if the solution is robust and has no single points of failure.
  • Can the solution automatically notify the vendor of any hardware or software issues, like a traditional SAN, e.g. a disk that has failed and not been replaced after a set number of days? Does the solution support standard monitoring protocols such as SNMP?
  • How is logging handled? Can logs be forwarded to a centralised logging solution if this is required for compliance?
  • How well does the product work with existing enterprise backup solutions? Are features such as CBT and NBD transport supported?
  • Ask if you can sign an NDA and look at the company’s product roadmap. Do you see innovation coming from the company in the future? Also, don’t just check the future roadmap; check the roadmaps from previous years to see whether upcoming features were actually delivered on time. If they weren’t, ask why they were delayed. It’s fine if the features required extra QA, as I believe a product should be stable before release. If features from previous roadmaps have been removed completely from later roadmaps, ask why, especially if they were announced as hot features in product marketing material.
  • Does the solution support multiple hypervisors? How easy is it to switch from one hypervisor to another? This may not be an important feature depending on the customer.
  • Are updates non-disruptive and non-destructive? Is the product dependent on underlying hardware which could exclude you from future updates? How automated is the update process? Do updates require hunting down firmware and upgrade packages from various websites?
  • Can storage pools be grown and shrunk independently of compute clusters? This is important for infrastructure workloads as storage capacity, compute, and memory requirements will grow differently depending on the application.
  • Can different node types be mixed and matched in clusters, e.g. storage heavy nodes mixed with compute heavy nodes?
  • Does the solution support deduplication and compression? Can these features be disabled or tuned per VM or per datastore? Remember that some applications benefit more from data locality than from distributed deduplication.
  • How does the product communicate with the underlying hypervisor? Does it require dependencies on underlying infrastructure components such as vCenter or SCVMM? This can cause issues when planning hypervisor updates.
  • How easily can the product be managed and monitored? Is the GUI intuitive, or does it require separate documentation just to manage it? How easy is it to troubleshoot any performance related issues?
  • What is the impact if a node fails? How long will you be unprotected for while the cluster rebalances itself? Are there any performance implications that must be designed for?
  • Can the solution easily be sized as a common platform for ALL infrastructure workloads? What about monster VMs with extremely large storage requirements that exceed the capacity of a single node?
  • Does the solution have API support so you can tie it into your existing automation initiatives?


So, in case you didn’t guess, we decided to go with Nutanix for our new hyper-converged platform. I have had my eye on them for a while, ever since a number of top VCDXs began working for them, and I wondered what they were all about. After some research I discovered the blog of Steven Poitras (Solutions Architect at Nutanix) and began reading; it is linked here if you want a technical deep dive into how Nutanix works.

http://stevenpoitras.com/the-nutanix-bible/ 

This got me thinking: if a company is willing to openly show me, as a technical person, how the product actually works and how data integrity and availability are maintained, while risking other companies copying their formula, they must be on to something good and must be confident in the reliability of their product. This is something I have not seen in the industry before.

Our reasons for choosing Nutanix were simple: it was the only solution so far that ticked all the boxes in our requirements. So far I have been impressed with the level of support we have had from Nutanix, and we have some big upcoming infrastructure projects running purely on Nutanix that I hope to blog about. If you want to try them out, I recommend giving your local SE a call and asking for a POC. They will be happy to help.