Nutanix Foundation Scale-out Using Nutanix Community Edition on an Intel NUC

In my current role at Nutanix as a Nutanix Services Consultant, my colleagues and I are often tasked with assisting customers with large deployments of 100+ new Nutanix nodes.  Since Nutanix hosts ship from the factory with the Nutanix AHV hypervisor, a tool called Nutanix Foundation is used to reimage the nodes with ESXi, Hyper-V or XenServer, depending on the customer's requirements.  Nutanix Foundation is available in two separate downloadable packages.  The first is known as the "Foundation Applet", a Java applet you download and run on your laptop. The Foundation Applet discovers Nutanix nodes using IPv6 multicast and then calls the Foundation process running on the Nutanix CVM.  A web browser can then be launched to configure the nodes with the target hypervisor and AOS package for installation.  The Foundation Applet first images one node with the target hypervisor and AOS package, and then transfers the imaging process to the rest of the nodes.  One requirement of the Foundation Applet is that your laptop must be on the same subnet as the target Nutanix cluster.

The second method is known as the "Foundation VM" and is available to Nutanix staff and partners.  The Foundation VM runs as a virtual machine on VMware Workstation/Fusion, ESXi, VirtualBox or Nutanix AHV, deploys multiple nodes sequentially, and also allows you to configure multi-homing, which is useful when nodes are initially deployed on a flat switch before being connected to the customer's target network. Using multi-homing, you can configure the Foundation VM with multiple IP addresses to match the target nodes' IP address ranges for the hypervisor/CVM and IPMI.

Typically I have been deploying Nutanix nodes using the Foundation VM, but in some larger deployments customers have differing requirements, such as different IP ranges for some clusters, different hypervisors, or different AOS versions.  Whilst I could deploy these one by one by running a single Foundation instance at a time, I have found it more efficient and faster to run multiple Foundation instances at the same time.  To achieve this, I have turned an Intel NUC into a mobile Nutanix Foundation deployment server running Nutanix Community Edition. Nutanix Community Edition runs Nutanix AHV and can be downloaded free of charge from https://www.nutanix.com/products/community-edition/

Community Edition allows you to configure a single-node, three-node or four-node Nutanix cluster on your own hardware. In this instance I have configured a single-node Nutanix cluster.

The specs and BOM I am using for the Intel NUC are below.

 

 

 

So it is quite a powerful, yet compact, device that I can easily take with me to customer sites for deployments.

Screen Shot 2017-08-30 at 1.31.06 pm     Untitled 4

To allow me to run four different Foundation instances on the Intel NUC, I have configured each Foundation VM with 2x vCPU and 3GB of RAM.  The Nutanix Controller VM is configured with 2x vCPU and 16GB of RAM.  Whilst you could potentially lower the memory requirements of the CVM, four Foundation VMs were more than enough for my requirements.

20615879_146760099239798_1417052543558323646_o

The following diagram shows the IP address and physical configuration of the Intel NUC (AHV + CVM), my laptop (a MacBook Pro), and the corresponding Foundation VM instances, along with the clusters each Foundation instance will be building.  Since Nutanix Community Edition requires internet access to start, I also have a router running a Linux distribution called "Zentyal" as a VMware Fusion VM on my MacBook. Zentyal is configured with the default gateway address for the NUC (192.168.25.1) and a secondary interface tethered to Wi-Fi.  Zentyal also provides NTP services for the new clusters, so the time is set correctly when each cluster is built.

Capture

Each Foundation VM is configured with three IP addresses, as follows (a quick sketch of adding the extra addresses from the command line follows the list).

  • IP address on the same subnet as my macbook, using the 192.168.25.x/24 range
  • IP address on the same subnet as the Hypervisor Hosts/CVM
  • IP address on the same subnet as the IPMI interfaces.
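Foundation's multi-homing settings take care of adding these extra addresses, but if you ever need to do it by hand from the Foundation VM's console, something along these lines works on the Linux-based Foundation VM (the interface name and addresses below are illustrative, and the changes don't persist across a reboot):

ip addr add 192.168.25.21/24 dev eth0     # management subnet shared with the laptop
ip addr add 10.0.10.21/24 dev eth0        # hypervisor/CVM subnet (example)
ip addr add 10.0.20.21/24 dev eth0        # IPMI subnet (example)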

I have also duplicated the Thunderbolt adapter's network service on my MacBook and set an IP address in the same range as each target cluster, so each cluster can be configured directly from my MacBook using the Nutanix Prism interface once it is built.

Screen Shot 2017-08-30 at 11.20.34 am.png

Here is an overview of how to set up your own NUC to run Nutanix Community Edition as a Nutanix Foundation server.

 

Screen Shot 2017-08-30 at 11.41.51 am

  • Upload the foundation qcow image to Nutanix CE using the image upload service

Screen Shot 2017-08-30 at 11.49.37 am
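The same upload can also be done from the CVM command line with acli; the image name, URL and container below are placeholders (I did mine through the Prism image service as shown above):

acli image.create Foundation-Image source_url=nfs://<server>/<path>/foundation.qcow2 container=default image_type=kDiskImage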

  • Create three new VMs, using whatever naming standard you want, and set the following hardware specifications (a CLI equivalent is sketched below the screenshot):
    1. vCPU = 2 vCPUs, cores per vCPU = 1
    2. RAM = 3GB

Screen Shot 2017-08-30 at 11.54.57 am
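For reference, the equivalent VM creation from the CVM's acli would look something like this (the VM name is a placeholder):

acli vm.create Foundation-VM1 num_vcpus=2 num_cores_per_vcpu=1 memory=3G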

  • Add a hard disk to the VM, ensure "Clone from image service" is selected, and select the qcow image uploaded earlier.
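Again, if you prefer the command line, cloning the disk from the image service (and adding a NIC while you are at it) can be done with acli; the VM, image and network names below are placeholders:

acli vm.disk_create Foundation-VM1 clone_from_image=Foundation-Image
acli vm.nic_create Foundation-VM1 network=<your-ahv-network>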

 

  • Start each Foundation VM. When it boots up, run the "Set Foundation IP address" shortcut on the desktop and set a static IP address for the Foundation VM on the same subnet you have configured for Nutanix AHV and the CVM.

Screen Shot 2017-08-30 at 12.02.01 pm

  • Once each Foundation VM is configured with a static IP address, you can launch Foundation straight from a web browser on your laptop over port 8000.
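In my setup that meant browsing to an address along these lines (substitute your Foundation VM's static IP; the exact path may vary between Foundation versions):

http://<foundation-vm-ip>:8000/gui/index.html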

Screen Shot 2017-08-30 at 12.04.02 pm

  • After the Foundation VMs are configured, you can begin cabling all the nodes to your deployment switches.  In this example, 58 nodes have been cabled simultaneously to three 24-port switches and can be discovered from a single Foundation instance. If you are having trouble determining which nodes should be imaged from which instance, you can temporarily uncable the nodes you don't want to be discovered.

58-Nodes-AvailableToFoundation

 

  • Once you get to the node imaging page, you can upload your target hypervisor and AOS packages directly to the Foundation VM by clicking the "Manage" button.

Screen Shot 2017-08-30 at 1.59.50 pm

  • Browse to the image on your local laptop and click upload.

Screen Shot 2017-08-30 at 2.03.20 pm

  • Next, start Foundation, and repeat on each Foundation VM for any nodes which require a different AOS version, subnet or hypervisor. The screenshot below shows a successful run of 38 nodes on a single Foundation instance running on the NUC.

38 nodes1

Using an Intel NUC running Nutanix Community Edition for large Nutanix deployments has definitely made things easier for larger installs, and it will be accompanying me whenever time is a constraint for getting nodes deployed.  I have been impressed with the performance of the Intel NUC and Nutanix Community Edition; in the screenshot below you can see one of the Foundation VMs pulling 1,195 read IOPS during the image validation phase that happens at the start of the Foundation imaging process.

20543622_146760092573132_6378805662166625505_o

 

 

 

 

Automated Storage Reclaim on Nutanix Acropolis Hypervisor (AHV)

Storage reclaim can be a major pain point for organizations running traditional three-tier architecture.  With companies trying to deal with an ever-increasing storage footprint, many of them turn to manual storage reclaim procedures to get capacity back after files or virtual disk images are deleted.  These problems are exacerbated if you are running vSphere as your hypervisor on traditional VMFS-formatted LUNs, as any Storage vMotion or snapshot operation results in space not being automatically reclaimed from the source LUNs.  An example of the manual procedure required to reclaim space on traditional infrastructure running vSphere 6 is outlined below.

  • If the EnableBlockDelete advanced setting is not enabled on the ESXi host to identify disks as thin, block-zeroing software such as sdelete must be run at the guest level. This creates I/O overhead, so it should be run during a maintenance window.
  • Next, after the storage is reclaimed at the guest level, the VMFS LUN must be reclaimed by running esxcli storage vmfs unmap against each LUN manually from the ESXi command line (both commands are sketched after this list). Due to the performance overhead, this task is not scheduled automatically.
  • Finally, the storage needs to be reclaimed at the array level by executing a command to reclaim zeroed pages. On some arrays this is automated, but not on all.
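For reference, the two commands referred to above look something like the following; the drive letter and datastore name are placeholders, and the reclaim unit on the unmap command is optional:

sdelete -z E:
esxcli storage vmfs unmap -l Datastore01 -n 200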

As you can see, this is a manual three-step process that creates operational overhead.  Whilst these procedures could potentially be automated, I don't really see the point in automating operations on legacy infrastructure when you instead have the option today of replacing your infrastructure with smarter software-defined solutions that automate themselves.  This gives your system administrators or automation team one less thing to worry about.

Earlier, I blogged about SCSI unmap capabilities using Acropolis Block Services on Acropolis base software 4.7 HERE.  The good news is that this functionality is also built into Nutanix's own free hypervisor, AHV.  It works by passing SCSI unmaps issued at the guest level straight through the virtual disks to the Nutanix Distributed Storage Fabric.

As you can see below, on Acropolis base software older than version 4.7 the disks show as standard hard disks in Windows 2012 R2.  The reason is that previously the unmap commands were not translated from the guest OS down to the virtual disk layer.

Screen Shot 2016-07-27 at 2.47.38 PM

After you upgrade to Acropolis 4.7, all disks are shown as thin provisioned in the disk optimizer, as it now supports unmap.  For existing disks to show as thin, the VM first requires a reboot after Acropolis base software is updated to 4.7.  If a disk is hot-added to the VM before the reboot, the new disk will show as thin straight away. Either way, the change is transparent.

Screen Shot 2016-07-27 at 4.46.37 PM

To show SCSI unmap in action on Acropolis Hypervisor, I tested across three Jetstress Exchange 2013 VMs holding 4.7TB of databases.

image002

After the databases were deleted, Windows converted the file delete operations into corresponding SCSI unmap requests, as per the following MSDN article:  https://msdn.microsoft.com/windows/hardware/drivers/storage/thin-provisioning

To hurry things along, I also manually ran an optimize operation using the Optimize Drives utility to reclaim space.  Whilst the file delete worked, I noticed the process was slightly faster using the disk optimizer, as I assume file delete operations queue SCSI unmaps in the background.
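If you prefer PowerShell to the Optimize Drives GUI, the built-in Optimize-Volume cmdlet triggers the same retrim pass; the drive letter below is just an example:

Optimize-Volume -DriveLetter E -ReTrim -Verbose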

image001

After the optimizer was run, you can see the used storage is back to a couple of GB per VM.

image003

This is just another way Nutanix makes everyday operations invisible, as your admins no longer need to worry about manually reclaiming storage.  It just works. As there are no LUNs, you also don't need to worry about doing a LUN-level reclaim.

Together with space-saving features such as compression, deduplication, erasure coding and now SCSI unmap, this allows you to make the most of the usable capacity on your Nutanix infrastructure without wasting capacity due to dead space.

Special thanks to Josh Odgers for letting me delete files on his test Exchange Server VMs to try out the unmap capabilities available in Acropolis Hypervisor.

Testing SCSI Unmap on Acropolis Block Services

One of the new features available with Acropolis Block Services on Nutanix Acropolis 4.7 is SCSI unmap support.  SCSI unmap (similar to TRIM used on SSDs) is part of the T10 SCSI spec and commands a disk region to deallocate deleted space so it can be reused.  The only way to reclaim storage at the virtualized guest level with virtual disks that do not support unmap is to manually write a string of zeroes at the guest level. Depending on your storage vendor, this can cause a performance impact due to the large amount of writes required for the operation.  With Windows Server 2012 R2, unmap is scheduled weekly using disk defragmenter, or "Optimize Drives" as it is now called.

Screen Shot 2016-07-26 at 4.59.36 PM

As you can see in the Optimize Drives screenshot, the media type differs according to whether or not the disk supports unmap (you can also confirm this from inside the guest, as sketched after the list below).

  • Hard disk drive – In this case a traditional VMDK stored on an NFS volume, since my hypervisor is vSphere. Even though the VMDK is thin, traditional defragmentation will be performed during optimization, as SCSI unmap capabilities are not translated to NFS commands.  The only way to reclaim storage in this case is to write a string of zeroes in the free space of the disk, which will be picked up and reclaimed during a Nutanix Curator scan (every six hours).
  • Thin provisioned drive – This is a Nutanix vdisk presented to the guest via in-guest iSCSI using Acropolis Block Services. Since the disk reports itself as thin, defrag is disabled and the disk will send unmap commands instead when the "Optimize" button is clicked.  The space is then reclaimed on the Nutanix distributed storage fabric.
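As mentioned above, you can also confirm from inside the guest whether Windows will send unmap at all; on Windows Server 2012 R2 the following returns 0 when delete notifications (TRIM/unmap) are enabled:

fsutil behavior query DisableDeleteNotify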

To see these capabilities in action, I created a single 200GB vdisk in a volume group and attached it to a Windows 2012 R2 test VM running on vSphere 6.0.  I formatted the disk as "F:" and then filled it with 100GB of data, as you can see in the screenshots below.
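I created the volume group through Prism, but from memory something along these lines achieves the same from the CVM's acli; the names, size and initiator IQN below are placeholders, and the exact parameters may differ between AOS versions:

acli vg.create TestVG
acli vg.disk_create TestVG container=default create_size=200G
acli vg.attach_external TestVG iqn.1991-05.com.microsoft:testvm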

Screen Shot 2016-07-26 at 9.14.33 PM

Screen Shot 2016-07-26 at 5.48.17 PM

You can see the virtual disk now reports its physical usage as 99.59 GiB in Nutanix Prism.

Screen Shot 2016-07-26 at 9.17.08 PM

Next I deleted the files and performed an optimize operation using disk defragmenter.  Unmap is then performed on the thin disks; you can track progress by watching how much of the disk has been "Trimmed" in the current status column.

Screen Shot 2016-07-26 at 9.38.16 PM

It only took around 10 seconds to unmap 100GB of data, much quicker and less resource-intensive than manually writing zeroes to the disk.

Now you can see the disk's physical usage is back to its original capacity in Nutanix Prism.

Screen Shot 2016-07-26 at 5.01.00 PM

As you can see, there is not much effort required to reclaim storage when using Acropolis Block Services.  This functionality also works if you are using Acropolis Hypervisor (AHV), as all vdisks are presented using iSCSI, which supports unmap.  I will be testing this in a future blog post.

 

 

 

Extend your Nutanix ROI and save licensing costs by converting existing nodes to storage only

Storage-only Nutanix nodes have been available for over a year now and have seen great adoption with Nutanix customers looking to solve the issue of an ever-increasing storage footprint.

This addresses one of the false arguments that some people make against hyper-converged infrastructure: that you are required to scale compute at the same time as storage.  That is not the case with Nutanix.  The storage-only nodes run Acropolis Hypervisor (AHV) with a single CVM to join the storage pool, so they do not require separate vSphere licensing.

Adding storage nodes is simple.  Simply connect them to the network, power them on, wait for them to be auto-discovered, add IP addresses, then click Expand in Nutanix Prism. The storage container presented to vSphere via NFS is then automatically extended with no downtime. No knowledge of KVM is required, as the storage-only nodes are managed via Nutanix Prism. Much less complex than expanding a legacy storage array by adding disk shelves and carving out new LUNs.

One of the many announcements that came out of the Nutanix .NEXT 2016 conference in Las Vegas was the ability to convert any Nutanix node to storage only.  This functionality is expected to be released in a Nutanix software update around October 2016. My colleague Josh Odgers gives a good overview of this capability on his blog here: http://www.joshodgers.com/2016/06/15/whats-next-2016-any-node-can-be-storage-only/

An interesting scenario this capability enables is converting your old existing Nutanix nodes to storage only when you are ready to expand your existing Nutanix infrastructure. Since Intel CPUs get much faster with each release cycle and also support larger amounts of memory, this allows you to potentially decrease vSphere cluster sizes and run even more VMs on less hardware without needing to reduce storage capacity.  Since the storage-only nodes still run a CVM that is part of the Nutanix cluster, there is no loss in overall cluster resiliency, and it also allows you to explore enabling new functionality such as erasure coding (ECX) on your cluster.

Converting nodes to storage only also has the distinct advantage of not mixing a vSphere cluster with non-homogeneous nodes of different CPU generations and RAM capacities. Whilst this is supported via VMware EVC, it is not always ideal if you want consistent memory capacity and CPU performance across your vSphere cluster.  EVC will need to mask some newer CPU features, which could decrease performance.  VMware HA will also need some careful calculations to ensure there is enough capacity in your cluster if one of the larger hosts fails.

A good example of converting nodes to storage only would be if you had previously purchased a cluster of NX-6020 nodes, which were originally designed for storage-heavy applications that do not require much compute capacity.

The NX-6020 was released back in 2013 with the following per-node specifications:

Dual Intel Ivy Bridge E5-2630v2
[12 cores / 2.6 GHz]
1x 800GB SSD
5x 4TB HDD
128GB RAM

Now say you had purchased a cluster of four of these models and are running ESXi as your hypervisor. The cluster was purchased in 2014 with a five-year warranty for the purpose of virtualizing an email archiving application, which is typically a low-IOPS, high-capacity workload.

After the initial success of virtualizing this application, you decide to migrate some more compute- and memory-heavy workloads, so you go with four NX-3060-G5 nodes with the following specifications:

Dual Intel Broadwell E5-2680v4
[28 cores / 2.4 GHz]
2x 800GB SSD
4x 2TB HDD
512GB RAM

With this new purchase, you have the following options for expansion.  Nutanix supports adding hosts of different capacities to the same cluster.

  • Add the new nodes to the existing Nutanix cluster and the existing vSphere cluster, for a total vSphere cluster size of eight hosts with varying RAM and CPU sizes.  EVC will need to be enabled at the Ivy Bridge generation.
  • Add the new nodes to the existing Nutanix cluster and create a new vSphere cluster of four hosts.  You will have two vSphere clusters consisting of a total of eight hosts.
  • Add the new nodes to the existing Nutanix cluster.  Create a new vSphere cluster and vMotion the VMs to it.  Convert the old nodes to storage only. You will have a single vSphere cluster consisting of four hosts with more CPU and RAM capacity than was available on the NX-6020, whilst still re-using your existing storage.

The following graphs show the performance and capacity differences between four NX-6020 nodes and four NX-3060-G5 nodes. The CPU performance alone is almost three times the capacity when you look at the combined SPECint rating of the CPUs: the Ivy Bridge CPUs have a SPECint rating of 442 per node, whilst the new Broadwell CPUs have a rating of 1301.  As the new nodes are compute-heavy, the NX-6020 nodes have more storage capacity due to their larger 4TB 3.5-inch drives.

size

During sizing calculations you determine you can easily fit the new workloads, along with the existing email archiving application, on a single vSphere cluster of four nodes, whilst the existing NX-6020 cluster is converted to serve storage only to the cluster.  This gives a total usable capacity of 52.16 TB, as you can see in the following diagram.  Sizing has been done using http://designbrewz.com/ and shows usable, not raw, capacity.

rf2

But wait, there's more! Since the Nutanix cluster has been expanded from four nodes to eight, you also decide to enable erasure coding on the cluster for even more capacity savings. Whilst this is supported on four nodes with a 2/1 stripe size, it is recommended for larger Nutanix clusters of more than six nodes, which can take advantage of a 4/1 stripe size. With erasure coding enabled, our usable capacity has now increased from 52.16TB to 83.45TB, calculated at 1.25x raw capacity overhead for RF2+ECX vs 2x for RF2 only.
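The maths behind that jump is straightforward: 52.16 TB usable at the 2x RF2 overhead implies roughly 104.3 TB of raw capacity, and dividing that same raw capacity by the 1.25x RF2+ECX overhead gives approximately 83.45 TB usable.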

So what have we gained by adding four NX-3060-G5 nodes and converting our old nodes to storage only?

  • vSphere licensing costs have stayed exactly the same with the addition of new hardware.
  • Our RAM capacity has increased fourfold by upgrading each node from 128GB to 512GB.
  • Our CPU capacity has tripled, whilst the socket count is the same.
  • Our storage capacity has more than doubled, even though we added compute-heavy nodes, by taking advantage of erasure coding delivered entirely in software.
  • Cluster resiliency and performance have increased by growing from four CVMs to eight.
  • Rack space has only increased by 2RU in the datacenter.

Not a bad result, and it really shows the advantages you gain by running software-defined infrastructure that increases value and ROI purely via software updates throughout its lifecycle.

 

 

 

 

 

vSphere Client timeout after replacing Machine SSL Certificate

Over the last few days I have been troubleshooting a very strange issue with the C# vSphere Client on a new vSphere 6.0 install for a customer.  The vSphere Client was initially working fine, until I replaced the Machine SSL certificate for vCenter.  After the Machine SSL certificate was replaced, the vSphere Client would time out on connection. The issue only occurred when connecting to vCenter; connecting the vSphere Client directly to hosts worked fine.

If I reverted back to the VMCA-signed certs, the vSphere Client would begin working again. To make it even stranger, sometimes the client would actually connect, but it would take upwards of 60 seconds to do so.

This particular customer is using an externally published CA. To clarify, the vSphere Web Client was working; it was just the C# client that was causing issues.

The error that was shown by the vSphere client on login is as follows

error2

To begin troubleshooting, I used BareTail to tail the vi-client logs whilst the vSphere Client was connecting.  This is an excellent tool that is available for free here.

https://www.baremetalsoft.com/baretail/

I created a filter to highlight text with the word “Error” in red and “Warning” in yellow and opened the vi-client log located in the following directory.

C:\Users\user_name\Local Settings\AppData\Local\VMware\vpx\viclient-x.log

The following log snippet shows a socket error whilst the client is connecting, just before the connection fails.

error3

The relevant text from the log is below.  I have masked the name of the customer's vCenter server.

[viclient:Error :W: 6] 2016-05-28 17:48:19.743 RMI Error Vmomi.ServiceInstance.RetrieveContent - 1
<Error type="VirtualInfrastructure.Exceptions.RequestTimedOut">
<Message>The request failed because the remote server 'SERVER FQDN' took too long to respond. (The command has timed out as the remote server is taking too long to respond.)</Message>
<InnerException type="System.Net.WebException">
<Message>The command has timed out as the remote server is taking too long to respond.</Message>
<Status>Timeout</Status>
</InnerException>
<Title>Connection Error</Title>
<InvocationInfo type="VirtualInfrastructure.MethodInvocationInfoImpl">
<StackTrace type="System.Diagnostics.StackTrace">
<FrameCount>12</FrameCount>
</StackTrace>
<MethodName>Vmomi.ServiceInstance.RetrieveContent</MethodName>
<Target type="ManagedObject">ServiceInstance:ServiceInstance [SERVER FQDN]</Target>
</InvocationInfo>
<WebExceptionStatus>Timeout</WebExceptionStatus>
<SocketError>Success</SocketError>
</Error>

To dig deeper into why I was getting a socket error, I fired up Procmon from Sysinternals to find out what the client was doing when it failed.  In Procmon I created a filter to only show activity generated by vpxclient.exe.

procmnon filter

Whilst Procmon was running, I noticed a TCP Reconnect happening to an Akamai URL.

procmon error1

Notice the time difference of seven seconds between the two TCP Reconnects.  This TCP reconnect would recur multiple times until the vSphere Client timed out and subsequently failed.

I was curious about the status of this TCP connection, so I started another great Sysinternals tool called Process Explorer. Process Explorer allows you to check a process's network status, including remote addresses and ports, along with the status of the connection.  Selecting vpxclient.exe in Process Explorer showed the following under TCP/IP.

SYN Sent

You can see the same remote connection to Akamai in process explorer.  The status of the connection is SYN_SENT, yet the connection is never established.

I was certain this external connection was causing the vSphere Client to time out.  Since the customer is using a third-party issued cert, the client checks the CRL of the cert on the internet.  This is why I did not see the error when using the self-signed VMCA vCenter Machine SSL cert. You can see the cert is using an external CRL distribution point in the screenshot below.

crl

I ran an NSLOOKUP on the CRL distribution point hostname, and the address matched the Akamai address space, with a CNAME pointing to the CRL.

After all this, I began troubleshooting why the vSphere Client could not connect to the CRL distribution point.  It turned out the corporate proxy was not configured in Internet Explorer, so the management servers where the vSphere Client was installed could not access the CRL address for the certs.
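In hindsight, a quick way to test CRL retrieval from a management server is Windows' built-in certutil, which walks the certificate chain and attempts to download the AIA and CRL URLs; the filename below is a placeholder for an exported copy of the Machine SSL certificate:

certutil -urlfetch -verify machine_ssl.cer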

Once I had the proxy details and configured them in Internet Explorer, the vSphere Client successfully created a TCP connection to the CRL on login and then connected to vCenter with no timeout.  This only seemed to need to be configured once: I removed the proxy for subsequent logins and the vSphere Client connected fine.

My recommendation, if you do replace vSphere certificates, is to use an internally managed enterprise CA with a certificate revocation list that can be accessed internally.  Also add a copy of Procmon, Process Explorer and BareTail to your troubleshooting toolkit if you haven't already; they are all great tools that have helped me multiple times in the past.

 

 

 

 

 

 

Error while installing vCenter 6.0 “Failed to Register Service in Component Manager”

SPOILER: Check the time sync between your Platform Services Controller and vCenter Server.

I hit this strange error while standing up two new vCenter instances with a single external Platform Services Controller.

vCenter Server version: vCenter Server 6.0 Update 1b, Build 3343019
Operating system: Windows 2012 R2
Deployment topology: vCenter Server with an external Platform Services Controller

The Platform Services Controller installed fine; however, whilst installing the two vCenter Servers I ran into the same error on both of them:

"Unable to call to Component Manager: Failed to register service in Component Manager; url:http://localhost:18090/cm/sdk/?hostid=75fc9250-0c07-11e6-ac93-000c290481b4, id:b2646ddb-aa7a-4f2d-a2aa-37be891d6e49"

Unable to call to component manager

After that error, I would get the following.

"Installation of component VCSServiceManager failed with error code "1603". Check the logs for more details.

Capture1

Reviewing the installation logs showed the following

2016-04-27T09:35:25.317+10:00 [main ERROR com.vmware.cis.cli.CisReg] Operation failed
com.vmware.cis.cli.exception.ComponentManagerCallException: Failed to register service in Component Manager; url:http://localhost:18090/cm/sdk/?hostid=75fc9250-0c07-11e6-ac93-000c290481b4, id:b2646ddb-aa7a-4f2d-a2aa-37be891d6e49
at com.vmware.cis.cli.util.CmUtil.cmRegister(CmUtil.java:87)
at com.vmware.cis.cli.CisReg.registerService(CisReg.java:927)
at com.vmware.cis.cli.CisReg.doMain(CisReg.java:776)
at com.vmware.cis.cli.CisReg.main(CisReg.java:709)
Caused by: java.util.concurrent.ExecutionException: (cis.cm.fault.ComponentManagerFault) {
faultCause = null,
faultMessage = null,
errorCode = 0,
errorMessage = UNKNOWN
}
at com.vmware.vim.vmomi.core.impl.BlockingFuture.get(BlockingFuture.java:70)
at com.vmware.cis.cm.client.ComponentManagerClient.registerService(ComponentManagerClient.java:757)
at com.vmware.cis.cli.util.CmUtil.cmRegister(CmUtil.java:77)
... 3 more
Caused by: (cis.cm.fault.ComponentManagerFault) {
faultCause = null,
faultMessage = null,
errorCode = 0,
errorMessage = UNKNOWN
}
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

The logs weren't exactly clear on why the installation was failing, but after some troubleshooting I noticed the time sync between the Platform Services Controller and the vCenter Servers was slightly out.  Once the time sync issues were resolved, the installation completed successfully.

In my case, I temporarily fixed the time sync issues by moving both the vCenter Servers and the PSC onto a single host and configuring VMware Tools to sync time with the host, which was itself configured to point to an external NTP server.
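A quick way to compare the clocks before kicking off the installer is the built-in w32tm utility; the FQDN below is a placeholder for your PSC:

w32tm /query /status
w32tm /stripchart /computer:psc.yourdomain.local /samples:5 /dataonly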

VCAP6 – Datacenter Virtualization Deployment Study Guide

With the VCAP6 exams about to be released, I have spent the last couple of months studying for the VCAP6 – Datacenter Virtualization Deployment exam. Since I already hold a VCAP5 in Datacenter Design, I will automatically be upgraded to VCIX6-DCV status when I (hopefully) pass the Deployment exam. An overview of upgrade paths from v5 to v6 is listed here.
VCIX6 Upgrade Paths

One of the key things I do to prepare for any VMware exam is simple: read the blueprint! Basically, if it's in the exam, it will be in the blueprint.

VMware blueprints used to be available in PDF format for offline reading, but are now only web based. As I get distracted easily, I prefer to study offline with a hard copy of the blueprint and all related documentation.

I have found the best way to prepare is to download the blueprint and put it into Excel format, along with a self-assessment of how competent I am in each topic. The simple grading I use is as follows:

  • High – Have done the objective without assistance or documentation
  • Medium – Could do the objective by reading it.
  • Low – Need to Lab the Objective first and learn it.

In addition to this, I add a link to the Official VMware Documentation and Page with instructions on how to complete the Objectives.

To save everyone some time, I have created an Excel Copy of the VCAP6 – DCV Deployment Blueprint along with Links to the exact VMware documentation pages on how to complete the Objectives. As the Deployment Exam is purely lab based and focused on Administration and not design, this is what you need to learn.

blueprintexcel1

Even though the exam is currently in beta, I don't believe the objectives will change much for the final exam when it is released. Some of the documentation I link to is from the GA version of vSphere 6, whilst some is from Update 1. Even though Update 2 is now out, along with updates to some documentation, the concepts remain the same for the lab exam.

Here is the link to VMware’s beta blueprint for VCAP6 – Datacenter Virtualization Deployment

VMware Certified Advanced Professional 6 – Data Center Virtualization Deployment Beta Exam

And here is a copy of my Excel Guide for the Exam.

VCAP6 – Deploy Blueprint Study Guide

Happy Studying!

Implementing chained certificates for Machine SSL (Reverse Proxy) in vSphere 6.0

Recently I have been implementing a vSphere 6 design for a customer that uses the hybrid certificate implementation (custom Machine SSL certificates) across multiple vCenter Servers and Platform Services Controllers.  This VMware blog describes the general concept behind it:

Custom certificate on the outside, VMware CA (VMCA) on the inside – Replacing vCenter 6.0’s SSL Certificate

One of the design decisions behind this choice of a hybrid deployment was that the customer relies on a third-party managed PKI service and does not wish to change standard operating procedures to issue a subordinate CA template, which would be required to provision the VMCA as a subordinate CA.

The third-party managed PKI requires the full certificate chain, including the intermediate and root CAs, to be installed along with the web (machine) certificate on all Platform Services Controllers and vCenter Servers.

Whilst you can (and it is recommended to) use the VMware Certificate Manager utility to do this, I recommend reviewing the following two KB articles and workarounds before you attempt the process, to see if they apply to your environment:

Using the Certificate Manager Utility in vSphere 6.0 does not utilize the Certool.cfg for CSR generation (2129706)

Replacing certificates using VMware vSphere 6.0 Certificate Manager fails at 0% with the error: Operation failed, performing automatic rollback (2111571)

These are the steps I used to implement CA-signed Machine SSL certificates.  This is using Windows-based vCenter Servers, but the process is similar if you are using the vCenter Server Appliance.

Start the certificate replacement on your Platform Services Controller first if it is external to vCenter. If it is embedded, the process is the same.

Step 1. Create a CSR Configuration File

Browse to C:\Program Files\VMware\vCenter Server\vmcad\certool.cfg and edit as follows according to your environment (back it up first)

Country = country code
Name = FQDN or short name of relevant vCenter or PSC
Organization = Organization name
OrgUnit = Organizational unit
State = state
Locality = your city
Email = Email address (optional)
Hostname = FQDN of vCenter or PSC 

1

Step 2. Generate a CSR

Browse to C:\Program Files\VMware\vCenter Server\vmcad and run the following certool command to generate a CSR

certool.exe --initcsr --privkey=priv.key --pubkey=pub.key --csrfile=csr.csr --config=certool.cfg

The CSR, public key and private key will be exported to
C:\Program Files\VMware\vCenter Server\vmcad

You can check your generated CSR at https://cryptoreport.websecurity.symantec.com/checker/views/csrCheck.jsp  to see if it is valid and formatted correctly before submitting to your CA.

2

Step 3. Submit your CSR to your relevant certificate authority.

Submit your certificate request to your certificate authority. Once you get your signed certificate back, make sure you also grab a copy of the intermediate and root CA certificates along with the machine certificate. The certificates must be in PEM format (CER, PEM, or CRT) using Base64 encoding.

If you are missing a copy of the intermediate or root CAs, you can automatically generate a chain from your machine certificate at https://whatsmychaincert.com/

3

Make sure you tick the box “Include Root Certificate”

Step 4.  Join the Machine, Intermediate and Root Certificates

Once you have a copy of all the certificates, open them in Notepad and join them together in the following order:

  • Machine Certificate
  • Intermediate Certificate
  • Root Certificate

Make sure no white space exists in the certs and that they are in the correct order, otherwise the certificate replacement will fail.  Save a copy as fullchain.cer.
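If you would rather not paste the files together in Notepad, they can be concatenated from a command prompt instead; the input filenames below are placeholders for whatever your CA returned:

copy /b machine.cer+intermediate.cer+root.cer fullchain.cer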

The chained certificate should look as follows.

4

Step 5. Replace the Machine SSL certificate with the chained CA certificate using vecs-cli

Next, you can replace the certificates using one of two methods: manually using vecs-cli, or using VMware's certificate management script.  I prefer using vecs-cli myself.
This method is supported and documented in the vSphere 6.0 security documentation here:

Replace Machine SSL Certificates With Custom Certificates

If you are running an external platform services controller, remember to replace its certificate first before any vCenter Servers.

First, copy both the previously created fullchain.cer and the priv.key generated in Step 2 to C:\Program Files\VMware\vCenter Server\vmafdd

Next, stop all services, and then start only the services that handle certificate creation, propagation, and storage, from an administrative command prompt.
On Windows this is done using service-control from an administrative command prompt at
C:\Program Files\VMware\vCenter Server\bin

service-control --stop --all
service-control --start VMWareAfdService
service-control --start VMWareDirectoryService
service-control --start VMWareCertificateService

The service names are case sensitive, so you need to spell VMware 'incorrectly' with a capital W, as shown above.

Next, switch to C:\Program Files\VMware\vCenter Server\vmafdd and run the following to delete the old Machine SSL certificate entry in VECS:

vecs-cli entry delete --store MACHINE_SSL_CERT --alias __MACHINE_CERT

Then, add your custom chained certificate to the store along with the private key you generated previously  in Step Two.

vecs-cli entry create --store MACHINE_SSL_CERT --alias __MACHINE_CERT --cert fullchain.cer --key priv.key 

Note that the __MACHINE_CERT alias has two underscores. If you type it incorrectly, you will end up with multiple entries in VECS.

You can check the certificate was added correctly to the store by running

vecs-cli entry list --store MACHINE_SSL_CERT --text

Finally, start the services again by running service-control --start --all

After the services have started, browse to your vSphere Web Client GUI and check the certificate.  It should include the full certificate chain, like the screenshot below.

5

If you would prefer to use the VMware Certificate Manager rather than vecs-cli to replace the Machine SSL certificate, refer to the following KB article.

Replacing a vSphere 6.0 Machine SSL certificate with a Custom Certificate Authority Signed Certificate (2112277)

You should provide the full certificate chain at the step 'Please provide valid custom certificate for Machine SSL'.  If you don't, you won't get an intermediate cert on the machine certificate that is presented to clients.

UPDATE: 14/04/2016

If you have issues with the vSphere thick client taking a long time to connect after SSL replacement, check that the certificate request included a subjectAltName that matches the hostname of the vCenter Server.  This is only an issue if your certificate common name is different to the hostname.  This problem does not affect the Web Client.

Migrate homelab NFS storage from Nexenta CE to Nutanix CE

I have been running Nexenta CE as my primary homelab storage for the past year or so with varying results.  I use a combination of NFS and iSCSI, and present a single NFS mount and a single iSCSI datastore to two ESXi hosts. I have been fairly happy with it, but every few months or so the NFS services fail due to locking issues, bringing down my NFS mounts, which is not ideal.  The only way I have found to fix it is to clear the NFS locks with the following commands in Nexenta expert mode:

  • svcadm clear nlockmgr
  • svcadm enable nlockmgr

I was getting a little tired of troubleshooting this, so I decided to rebuild my homelab storage as a single-node Nutanix CE cluster (can you really call it a cluster?) and export the storage as an NFS mount to my VMware hosts.  This way I get to play with Nutanix CE and also (hopefully) have a more reliable and performant storage environment.

The hardware I am currently running Nexenta CE on is as follows

  • ASRock C2750D4I with 8 core Intel Atom Avoton Processor
  • 16GB DDR3 RAM PC12800/1600MHz with ECC (2x8GB DIMMS)
  • 2x 3TB WD Reds
  • 1x 240GB Crucial M500 SSD for ZFS L2ARC
  • 1x 32GB HP SSD for ZFS ZIL
  • 1x Dual Port Intel NIC Pro/1000 PT

The dual-port NIC is LACP-bonded and used for my iSCSI datastore, while the onboard NICs on the ASRock C2750D4I are used for NFS and Nexenta management.

For the Nutanix CE rebuild I decided to make the following changes:

  • Remove the 32GB ZIL Device
  • Remove the Dual port NIC since I won’t be using iSCSI
  • Use the two onboard nics for Nutanix CE and export a NFS mount to my ESXi hosts

Since I run NFS storage on a separate VLAN from the rest of my lab environment, I decided to keep this design and install Nutanix CE on my NFS VLAN.  Since my switch doesn't support L3 routing, I added an additional NIC to my virtualized router (pfSense) and configured an OPT1 interface on my storage network.

GATEWAYS

Now my storage network is routable, and I can manage Nutanix CE from any device on my network.

To prevent a chicken-and-egg situation if my pfSense VM goes down, I also added a separate management NIC to my desktop on the same NFS VLAN for out-of-band management.

One issue I had was the requirement for internet access from my Nutanix CE installation before you can manage it; this is needed to register with your Nutanix Community account and connect to Nutanix Pulse (phone home).  Since my upgrade was disruptive and my pfSense router VM was offline during the migration, I had no internet connection while installing Nutanix CE.  I fixed this by temporarily reconfiguring my cable modem in NAT mode, moving it onto the same NFS VLAN, and configuring it as the default gateway.

Perhaps a way Nutanix could fix this in the future is to generate some kind of unique serial number after install that uses your hardware as a thumbprint, so you can register the cluster on a separate internet-connected device which then provides a code to access the cluster.  Phone home could then begin working after, say, 24 hours.  Just a suggestion to any Nutanix people reading this 🙂

To migrate from Nexenta to Nutanix CE, I used the following steps:

  • Power down all VMs
  • Export the VMware VMs on the Nexenta datastores to OVF templates
  • Write the Nutanix CE image (ce-2015.06.08-beta.img) to a USB3 device using Win32DiskImager
  • Remove the old Nexenta iSCSI datastores and NFS mounts
  • Boot up Nutanix CE from the USB disk; the Nexenta CE disks are wiped automatically
  • Configure a Nutanix storage pool and container
  • Create an NFS whitelist and present the container to my ESXi hosts
  • Re-import the OVF templates

The Nutanix installation was extremely simple: just boot from the USB3 device created earlier and begin the install.

Use the "install" username. One cool thing about the mainboard I am using (ASRock C2750D4I) is that it has onboard IPMI.

pre-install

Select Keyboard layout

keybaord

Configure the IP addresses of both the host and the CVM.  I selected the "Create single-node cluster" option.

Read the EULA; this is by far the longest part of the install, as you need to scroll all the way down and can't skip it.

eula

The install will then run; let it be and go grab a coffee.  The disks are formatted automatically.

install

After Prism is installed, the cluster should be created automatically.  If not, SSH onto the CVM and run the following commands to create a single-node cluster.

Username: nutanix, PW: nutanix/4u

#create cluster

cluster -s CVM-IP -f create

#add DNS servers (required for the first logon, for the internet access mentioned earlier)

ncli cluster add-to-name-servers servers="your dns server"

You can then start a web browser, and you will be prompted for a cluster admin username and password.

prism

Nutanix CE will then check for Pulse connectivity and a registered .NEXT account.

pulse

 

After this, create both a Nutanix storage pool and a Nutanix container using Prism.

first logon

Since I wanted the single-node cluster to present storage to my ESXi hosts, I configured two filesystem whitelists.  My ESXi hosts access NFS storage from 192.168.55.10 and 192.168.55.2.

NFS whitelist

Then mount the NFS export on each ESXi host.  Use the container name; in my case it is simply "container".
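If you prefer the command line over the vSphere Client, the mount can also be added with esxcli from each host; the CVM IP and datastore name below are placeholders, and the share matches my container name:

esxcli storage nfs add -H <cvm-ip> -s /container -v NutanixCE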

ESXi NFS mount

NFS mount successfully created.

nfs mount created

Finally, redeploy all the OVF templates you exported previously from Nexenta.  Luckily for me, all the OVFs imported successfully.

import vm

So far I have been happy with the performance and stability of Nutanix CE. I don't have any data to back this up, but performance-wise I have noticed an increase in read performance over Nexenta, with a slight decrease in write performance.  The read performance increase is probably due to the extent cache (RAM cache) design in Nutanix, and the write performance is reduced because I removed the 32GB ZIL device that Nexenta was using.

I also noticed the performance is more consistent. With ZFS and Nexenta, reads and writes would be good until the cache filled up, and then performance would drop off.

However, this setup is not a performance powerhouse.  The Atom CPUs I am using are pretty limited, and they do not support VT-d, so I need to use the rather slow onboard SATA controllers instead of a dedicated LSI controller which I have sitting around.

In the future I hope to upgrade my lab equipment, get a three-node cluster set up, and migrate most of my VMware installation to Nutanix Acropolis, possibly with 10GbE if switch prices come down.  Since I have a thing for Mini-ITX, one board I have my eye on is the Supermicro X10SDV-TLN4, which has an integrated 8-core Xeon D-1540, 2x 10GbE, and supports up to 128GB of DDR4 RAM.

If you want to give Nutanix CE a try yourself, you can download the beta version here.   http://www.nutanix.com/products/community-edition/

An install guide which I used is available here

http://www.technologyug.co.uk/How-To-Guides/Nutanix-Community-Edition-Primer-TechUG-Labs-Q2-20.pdf

You can also run Nutanix CE nested on other hypervisors if you don't have spare lab hardware; a good guide is here from fellow Nutanix Technical Champion Joep Piscaer:

https://www.virtuallifestyle.nl/2015/06/nextconf-running-nutanix-community-edition-nested-on-fusion/

 

 

 

 

 

How to change CVM resources on Nutanix CE using virsh

I set up Nutanix Community Edition today in my homelab, and I was interested in how to change both the CVM vCPU count and vRAM.  By default my CVM was running 4 vCPUs and 12GB of RAM, and I wanted to change this to 8 vCPUs and 15GB of RAM (my CE whitebox is running an 8-core Atom CPU and 16GB of RAM).

KVM is very new to me, so for my own documentation's sake here are the steps I used.

SSH onto the KVM hypervisor using your favorite SSH client. In my case KVM was running on IP address 192.168.55.4, whilst the CVM was using 192.168.55.3.

Log on using the default Nutanix KVM credentials:

Username: root

Password: nutanix/4u

First run virsh list to get the name of your Nutanix CVM, in my case it is NTNX-72c234e3-A-CVM

virsh list

Next, run virsh dominfo NTNX-72c243e3-A-CVM to confirm the number of CPUs and the amount of RAM.

To change the amount of RAM (in my case I increased it from 12GB to 15GB), run the following commands, substituting the appropriate CVM name:

#Shutdown CVM

virsh shutdown NTNX-72c243e3-A-CVM

#Set vRAM (note the double dash before config)

virsh setmaxmem NTNX-72c243e3-A-CVM 15G --config

virsh setmem NTNX-72c243e3-A-CVM 15G --config

#start VM again

virsh start NTNX-72c243e3-A-CVM

To change the number of CPUs, edit the VM definition with virsh.

#edit virsh xml

virsh edit NTNX-72c243e3-A-CVM

cpu change

This will open the VM's XML file in the vi editor. Use the following commands to edit the file (I always forget how to edit in vi, so I will show the steps here for my own sake):

  1. Press “i” to enter insert mode
  2. Use the arrow keys to move to the following line <vcpu placement=’static’>4</vcpu>
  3. Change the 4 to whatever you want, in my case I did 8
  4. Press “esc” to exit insert mode
  5. Type “:wq” to write out and save the file

#Shutdown the Nutanix CVM

virsh shutdown NTNX-72c243e3-A-CVM

#Start the Nutanix CVM again

virsh start NTNX-72c243e3-A-CVM

Run virsh dominfo again to confirm the changes were successful

virsh dominfo NTNX-72c243e3-A-CVM

virsh dom info
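As an aside, virsh can also change the vCPU count without hand-editing the XML; something like the following should work while the CVM is shut down (I used the XML edit above, so treat this as an untested alternative):

virsh setvcpus NTNX-72c243e3-A-CVM 8 --maximum --config
virsh setvcpus NTNX-72c243e3-A-CVM 8 --config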

In most cases with Nutanix CE, the defaults are fine for 99% of people, so test whether it is really necessary to increase or decrease CVM resources for your workload and hardware specifications.  In my case I saw no difference, so I set the defaults back.