How to Change the vSAN Service Subnet with Zero Downtime

Introduction

In this blog we walk through changing a vSAN service subnet without any downtime. In our case we also want to keep the vSAN VLAN we are currently using.

vSAN is a sensitive part of a cluster. With a small cluster (three or four hosts) or with fault domains, it can be difficult to change the subnet of a production vSAN environment.

My environment has a total of three hosts in the cluster, and all three are in use because vSAN places two data components and one witness across them. This means we cannot simply change the IP address or portgroup of the vSAN network.

Solution

Fortunately there is a solution. We can create a second VLAN and subnet on our switch and create a temporary portgroup. After the portgroup is created, we add one extra VMkernel adapter on every host.
This VMkernel adapter will carry the vSAN service, but on the new VLAN and temporary subnet.

With this structure we can move our hosts to the temporary portgroup and reconfigure the primary portgroup. Once the primary portgroup is updated, we move the hosts back to it and clean up the temporary portgroup.

Preparation

Let’s begin with the preparation of the following parts:

  • Create new temporary vlan
  • Create new temporary portgroup
  • Configure a new VMKernel on all hosts of the cluster

Create a new VLAN and subnet on the switch

1. Log in to your switch and create a new VLAN with a new subnet. The size of the subnet does not matter.
2. After creating the VLAN, tag it on the interfaces where the hosts are connected.
3. Once the VLAN is tagged, we can create a portgroup on the VDS in vCenter.

Create temporary portgroup

4. Login to vCenter and create a new portgroup.

5. Give the portgroup a name. Personally I name it something with “Temp”; this makes the cleanup afterwards easier.
6. Click on “Next”

7. Configure the VLAN part. Use the VLAN configured on the switches in step 1.
8. Click “Next”.

9. Click “Finish”

Create VMKernel port on the hosts in the cluster

10. Navigate to the hosts in the cluster.
11. Choose one host and go to “Configure” tab.
12. Under “Networking” you will see “VMkernel adapters”; click on it.
13. Click on “Add Networking”.

14. For the connection type, choose “VMkernel Network Adapter” and click “Next”.
15. For the target device search for the temporary portgroup you just created.
16. Select the temporary portgroup and click “Next”.

17. On the Portgroup properties choose the vSAN service and click “Next”.

18. On the IPv4 settings page choose “Use static IPv4 settings”.
19. Fill in the new subnet IP Address, Subnet mask and Default gateway for the first host and click “Next”.

20. Click on “Finish”.
21. Repeat the steps for the other hosts in the cluster. Every host should have a unique IP address.
22. Once we have all the hosts configured with the new VMKernel we can proceed with the actual move.
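As a sanity check before clicking through the wizard on each host, the IP plan for the temporary VMkernel adapters can be sketched in a few lines of Python. The host names and the 192.168.99.0/24 subnet below are made-up placeholders; substitute your own.

```python
import ipaddress

def plan_vmk_ips(hosts, subnet, first_offset=10):
    """Pick one unique static IP per host from the temporary subnet."""
    usable = list(ipaddress.ip_network(subnet).hosts())
    return {host: str(usable[first_offset + i]) for i, host in enumerate(hosts)}

# Hypothetical three-host cluster and temporary vSAN subnet
hosts = ["esx01", "esx02", "esx03"]
plan = plan_vmk_ips(hosts, "192.168.99.0/24")
for host, ip in plan.items():
    print(f"{host} -> {ip}/24")
```

Each host gets a unique address from the same subnet, which is exactly the requirement from step 21.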

Move hosts to temporary VMKernel

Warning! Before changing anything on a host, always make sure to put that host in maintenance mode via vCenter!

1. Click on the last host in the cluster.
2. Put the host in maintenance mode and choose “Ensure accessibility”.
3. Once the host is in maintenance mode go to “Configure” -> “VMkernel adapters”
4. Now we will disable the vSAN service on our primary vSAN VMKernel adapter.
5. Click the three dots and choose “Edit”.
6. Under “Enabled services” uncheck “vSAN”.
7. Click on “OK”.
8. Take the host out of maintenance mode.
9. Repeat these steps for the other hosts, working from the last host to the first.

Warning! In the process you will see some alarms. You can ignore those.

10. Once the vSAN service has been disabled on the primary VMkernel adapter of all hosts, we can start reconfiguring the primary VMkernel adapter with the new subnet.
11. Reconfigure the primary subnet and/or VLAN on the switches.

Move hosts back to primary VMKernel

Enable primary VMKernel on all hosts

1. Click on the last host in the cluster.
2. Go to “Configure” -> “VMkernel adapters”
3. Now we will enable the vSAN service and reconfigure the IPv4 settings of the primary vSAN VMkernel adapter with our new subnet.
4. Click the three dots and choose “Edit”.
5. Under “Enabled services” check “vSAN”.
6. On the left side click on “IPv4 Settings”.
7. Reconfigure the IPv4 settings with our new subnet.
8. Click on “OK”.
9. Repeat these steps for the other hosts, working from the last host to the first.
10. Now that the primary VMkernel adapter has vSAN enabled again, we can disable the vSAN service on the temporary VMkernel adapter.

Disable temporary VMKernel on all hosts

Warning! Before changing anything on a host, always make sure to put that host in maintenance mode via vCenter!

1. Click on the last host in the cluster.
2. Put the host in maintenance mode and choose “Ensure accessibility”.
3. Once the host is in maintenance mode go to “Configure” -> “VMkernel adapters”
4. Now we will disable the vSAN service on our temporary vSAN VMKernel adapter.
5. Click the three dots and choose “Edit”.
6. Under “Enabled services” uncheck “vSAN”.
7. Click on “OK”.
8. Take the host out of maintenance mode.
9. Repeat these steps for the other hosts, working from the last host to the first.

Warning! In the process you will see some alarms. You can ignore those.

Cleanup

After we have moved all hosts back to the primary vSAN VMkernel adapter and everything is running without issues, we can start the cleanup.

We will clean up the following components:

  • The temporary VMkernel adapter on each host.
  • The temporary portgroup.
  • The temporary VLAN on the switches (including the VLAN tags on the interfaces).
Awid Dashtgoli

Fix: Cloud Director 10.6 Upgrade Fails with “Failed dependencies” Error

Introduction

In this short blog we will show how to fix an error that can occur when upgrading to Cloud Director 10.6.

Issue

When upgrading from Cloud Director version 10.4 or 10.5 to version 10.6, you can run into an error with the message “Failed dependencies”. This issue prevents Cloud Director from upgrading.

If you look at /opt/vmware/var/log/vami/updatecli.log, you will see the following:

error: Failed dependencies:
        libcrypto.so.1.0.0()(64bit) is needed by (installed) xml-security-c-1.7.3-4.ph2.x86_64
        libssl.so.1.0.0()(64bit) is needed by (installed) xml-security-c-1.7.3-4.ph2.x86_64
05/07/2024 05:56:45 [ERROR] Failed with exit code 65024
05/07/2024 05:56:45 [INFO] Update status: Running post-install scripts
Failed with status of 2 while installing version 10.6.0.11510

Understanding the Cause

This issue is caused by leftovers from an earlier Photon OS upgrade that were not properly cleaned up.

In Cloud Director 10.6 the version of Photon OS is 4.0, while in Cloud Director 10.4/10.5 version 3.0 of Photon OS is used.

Solution

To solve this issue, we first need to verify that the package mentioned in updatecli.log is present. After verification, we remove the package and rerun the upgrade.

Warning! Before starting with the solution, always take a snapshot of all Cloud Director cells!

Verify package exists on Cloud Director cells

Let’s start by verifying that the “xml-security-c-1.7.3-4.ph2.x86_64” package is present.

1. Open an SSH session to each Cloud Director cell.
2. Log in with the root account.
3. To verify if the package is present run the following command:

					rpm -qa | grep xml-security-c-1.7.3-4.ph2.x86_64
				

4. If the package is installed, you will see its name, “xml-security-c-1.7.3-4.ph2.x86_64”, in the output.

5. Run this command on every Cloud Director cell.
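If you are checking many cells, you can capture the `rpm -qa` output from each and filter it in a small script. A minimal sketch; the sample output below is invented:

```python
def find_package(rpm_qa_output, name):
    """Filter `rpm -qa` output lines down to the package we care about."""
    return [line for line in rpm_qa_output.splitlines() if name in line]

# Made-up sample of what one cell might return
sample = "openssl-1.1.1-1.ph3.x86_64\nxml-security-c-1.7.3-4.ph2.x86_64\n"
matches = find_package(sample, "xml-security-c-1.7.3-4.ph2.x86_64")
```

An empty result means the cell is already clean and needs no action.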

Remove the package from Cloud Director cells

Now that we have verified the package is present, we can remove it from all Cloud Director cells.

1. To remove the package run the following command:

					rpm -e xml-security-c-1.7.3-4.ph2.x86_64
				

2. Run this command on every Cloud Director cell.
3. Finally, verify the package is no longer present by running the following command:

					rpm -qa | grep xml-security-c-1.7.3-4.ph2.x86_64
				

4. There should be no package named “xml-security-c-1.7.3-4.ph2.x86_64” anymore.
5. Now we can start the upgrade to Cloud Director 10.6 without any issues.
Make sure to remove the snapshots after the upgrade is complete.

Awid Dashtgoli

Commission & Decommission a host with VMware Cloud Foundation

Introduction

In this blog we will show how to commission and decommission hosts with VMware Cloud Foundation.

Commissioning a host

In this section we start with commissioning a host. For commissioning, the ESXi host must be prepared with the correct ESXi version and configuration. The setup and configuration should be identical to the current hosts in the VMware Cloud Foundation cluster.

Prerequisites

Before we can commission the new host, we need to make sure it has the correct ESXi version installed. Sometimes the version is not available as an ISO; in that case we need to create a custom ESXi ISO.

I have a blog where I explain step-by-step how to create a custom ESXi ISO from a depot.

Another important step is to assign the license to the ESXi host before commissioning it in the VMware Cloud Foundation SDDC Manager.

1. Log in to the ESXi host via the web GUI.

2. Go to “Manage” and “Licensing”.

3. Click “Assign License”.

4. Paste the license key and click “Check License”.

5. If you see the message “License key is valid for …”, you can click “Assign License”.

6. Your license has been assigned.

Add a host to VMware Cloud Foundation

1. Open VMware Cloud Foundation SDDC manager.

2. Under inventory go to “Hosts” and choose “Commission Hosts”.

3. You will see a checklist. Check all the boxes and click “Proceed”

4. Fill in all the fields and click on “Add”.

5. After the host has been added to the list, check the “Confirm Fingerprint” icon and click on “Validate All”.

6. After the validation is complete, the “Validation Status” will show “Valid”. Click “Next”.
If you run into any errors, go through the checklist again and resolve the issues.

7. Make sure all the information is correct and click “Commission”.

8. You will see a task under “Tasks”. Wait until the task is finished.

9. After the task is “Successful” you have successfully added the new host to VMware Cloud Foundation. 

Add a host to the cluster of VMware Cloud Foundation

Now that we have commissioned the host to VMware Cloud Foundation, we can add it to a cluster.

Add a host to the cluster of VMware Cloud Foundation via GUI

1. Open VMware Cloud Foundation SDDC manager.

2. Under inventory go to “Workload Domains” and choose the domain you want to add the host to.

3. Click on “Actions” and “Add Host”.

4. You can add the available hosts to the cluster.

5. After you have added the hosts, you will see a task running. Wait until the task is done.

6. Your host has been successfully added to the cluster.

Add a host to the cluster of VMware Cloud Foundation via API Explorer

In some cases it is necessary to use the API Explorer to add hosts to the cluster. This can be the case, for example, with clusters that have multiple VDS switches.

First we need to obtain the host and cluster IDs before we can add the host to the specific cluster.

1. Open VMware Cloud Foundation SDDC manager.

2. Under Developer Center go to “API Explorer”.

3. Open “APIs for managing Hosts” and open the “GET  /v1/hosts”.

4. In the “Status” parameter we need to add a value named “UNASSIGNED_USABLE”.
The “UNASSIGNED_USABLE” will only show the unassigned available hosts.

5. Click on “Execute”. This will run the API query.

6. Under “Response” you will find the output of the API query. Click on “PageOfHost”, here you will see the available/useable hosts. Click on the host.
You will see all the information about the host. The “ID” is the important part for us.
Save the ID after “ID of the host”, it will look something like “d6259566a-9826-8845-98b9-9a0b445b803c”.
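If you prefer to pull the host ID with a script instead of reading it from the API Explorer output, a sketch along these lines works. The response body is trimmed and hypothetical; the key names mirror the PageOfHost structure but may differ by VCF version:

```python
def unassigned_host_ids(page_of_hosts):
    """Pull the IDs of hosts reported as UNASSIGNED_USABLE from a
    PageOfHost-style response (key names are illustrative)."""
    return [h["id"] for h in page_of_hosts.get("elements", [])
            if h.get("status") == "UNASSIGNED_USABLE"]

# Trimmed-down, hypothetical response body
response = {"elements": [
    {"id": "d6259566a-9826-8845-98b9-9a0b445b803c", "status": "UNASSIGNED_USABLE"},
]}
ids = unassigned_host_ids(response)
```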

7. Now we will obtain the cluster ID.
Under “APIs for managing Clusters” choose “GET  /v1/clusters”.

8. Click on “Execute”. This will run the API query.

9. Wait for the query to complete.

10. Under “Response” you will find the output of the API query. Click on “PageOfCluster”, here you will see all the clusters. Click on the cluster you want to add your host to.

You will see all the information about the cluster. The “ID” is the important part for us.
Save the ID after “ID of the cluster”, it will look something like
“d6259566a-9826-8845-98b9-9a0b445b803c”.

11. Now that we have all the information, we can validate our configuration.
Under “APIs for managing Clusters” choose “POST  /v1/clusters/{id}/validations”.
For the “id” parameter fill in the cluster id copied from the previous step.

12. For the “clusterUpdateSpec” parameter we need to create a JSON body.
Below you will find an example of the JSON:

					{
    "clusterExpansionSpec": {
      "hostSpecs": [
        {
          "id": "d6259566a-9826-8845-98b9-9a0b445b803c",
          "licensekey": "00000-00000-00000-00000-00000",
          "hostNetworkSpec": {
            "vmNics": [
              {
                "id": "vmnic0",
                "vdsName": "DASH-M-vds01",
                "moveToNvds": false
              },
              {
                "id": "vmnic1",
                "vdsName": "DASH-M-vds02",
                "moveToNvds": false
              },
              {
                "id": "vmnic2",
                "vdsName": "DASH-M-vds01",
                "moveToNvds": false
              },
              {
                "id": "vmnic3",
                "vdsName": "DASH-M-vds02",
                "moveToNvds": false
              }
            ]
          }
        }
      ],
      "interRackExpansion" : false
    }
}
				

13. Copy and paste the JSON into the “clusterUpdateSpec” parameter and click “Execute”.

14. You will get a message “Are you sure?”. Click on “Continue”.

15. Under Response you will see the “Validation”. If you open the “Validation” you will see under “ResultStatus” the message “SUCCEEDED”. This means our JSON is working and the validation is successful.
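The JSON body from step 12 can also be generated programmatically, which helps avoid typos in the vmnic-to-VDS mapping when you repeat this for several hosts. A minimal sketch reusing the IDs and names from the example above:

```python
import json

def expansion_spec(host_id, license_key, vmnic_to_vds):
    """Build the clusterUpdateSpec body used for both the validation
    and the PATCH call, mirroring the example JSON above."""
    return {
        "clusterExpansionSpec": {
            "hostSpecs": [{
                "id": host_id,
                "licensekey": license_key,
                "hostNetworkSpec": {
                    "vmNics": [
                        {"id": nic, "vdsName": vds, "moveToNvds": False}
                        for nic, vds in vmnic_to_vds.items()
                    ]
                },
            }],
            "interRackExpansion": False,
        }
    }

spec = expansion_spec(
    "d6259566a-9826-8845-98b9-9a0b445b803c",
    "00000-00000-00000-00000-00000",
    {"vmnic0": "DASH-M-vds01", "vmnic1": "DASH-M-vds02"},
)
body = json.dumps(spec, indent=2)
```

The same `body` can then be pasted into both the validation call (step 11) and the PATCH call (step 16).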

16. Now we will execute this JSON to start adding the host to the cluster.
Under “APIs for managing Clusters” choose “PATCH  /v1/clusters/{id}”.
For the “id” and “clusterUpdateSpec” parameters fill in the exact same information as the validation step and click on “Execute”.

17. You will again get a message “Are you sure?”. Click on “Continue”.

18. Under Response you will see the “Task”. If you open the “Task” you will see under “status” the message “IN_PROGRESS”. This means the task has been started and the host will be added to the cluster.

19. Under “Tasks” you will see a new task running.

20. Once the task is completed, its status will show “Successful”.
The host has successfully been added to the cluster.

Decommissioning a host

In this section we decommission a host. The host decommissioning can be done through the GUI.

Prerequisites

Before we can decommission the host we need to make sure the host is in maintenance mode.

We can perform this on the vCenter WEB GUI.

1. Login to the vCenter Server.

2. Put the host you want to decommission in maintenance mode and choose “Full data migration”.
This will make sure all the data will be moved from the host and redundancy will still be in place for vSAN.

The “Full data migration” can take some time depending on your environment.

Remove a host from VMware Cloud Foundation cluster

1. Open VMware Cloud Foundation SDDC manager.

2. Under inventory go to “Workload Domains” and choose the domain you want to remove the host from.

3. Click on “Hosts” and select the host you want to delete. After selecting the host choose “Remove Selected Hosts”.

4. If necessary check the “Force Remove Host” option. After that click on “Remove”.

5. A task under “Tasks” will start.

6. Once the task is “Successful” the host is removed from the cluster.

7. The host is now in a “Usable” state. This means you can put the host in another VMware Cloud Foundation Cluster if you want.

Decommission a host from VMware Cloud Foundation

To completely remove the host from VMware Cloud Foundation, we need to decommission the host.

1. Open VMware Cloud Foundation SDDC manager.

2. Under inventory go to “Hosts” and choose “Unassigned Hosts”.

3. Select the host you want to decommission and choose “Decommission Selected Hosts”.

4. You will get a confirmation message. Select “Skip failed hosts during decommissioning” and choose “Confirm”.

5. The host will start decommissioning and you will see a task starting under “Tasks”.

6. Once completed the task will show “Successful”.

7. You have successfully decommissioned a host from VMware Cloud Foundation.

Awid Dashtgoli

Create a Custom VMware ESXi ISO

Introduction

In this blog we will show how to create a custom ESXi ISO. This is useful for deploying ESXi hosts that must comply with a baseline, or for commissioning hosts in a VCF environment.

To meet the specific needs of your environment, you can create a custom ISO file for ESXi using VMware PowerCLI or vSphere Lifecycle Manager.

You may need to create a custom ESXi ISO image in these scenarios:

  • The ESXi version listed in the VMware Cloud Foundation BOM lacks an ISO file on VMware Customer Connect, often seen with ESXi patch releases.
  • An asynchronous patch version of ESXi is required.
  • A vendor-specific (OEM) ISO file is necessary.

Prerequisites

  1. Make sure you have a Windows PC with VMware PowerCLI and Python installed.
2. Download the ESXi base offline bundle from the Broadcom support portal (see the Preparation section below).
  3. Obtain the required drivers or software from the VMware/Product Site to include in the custom ISO.

Before creating the custom ESXi image, understand these key terms:

DepotBaseImages: A DepotBaseImage refers to the core components and base images stored within a software depot. A software depot in VMware is a repository that contains various software packages, including image profiles and VIBs (vSphere Installation Bundles). These depots are used to manage and distribute ESXi software updates and installations.

DepotAddons: A DepotAddon refers to additional software packages or components that can be added to a base ESXi image. These addons are stored in a software depot, similar to the base images, and are used to extend the functionality or customize the ESXi installation.

Preparation

Before we can create the custom ISO, we need to download the ESXi depot from the Broadcom support portal.

1. Navigate to https://broadcom.com/

2. Click on “Support Portal” and “Go to Portal”.

3. Login with your Broadcom credentials

4. Click on the “Product Icon” and choose “VMware Cloud Foundation”.

5. Go to “My Download” and search for “vSphere”.

6. Click on “VMware vSphere”.

7. Go to “Solutions”.

8. Click on “VMware vSphere” and choose your “Product” and “Version”.

In my case this will be: VMware ESXi 7.0U3p (23307199)

9. Download the .zip file under the “Solution Downloads” tab.
The file is called: VMware-ESXi-7.0U3p-23307199-depot.zip

10. Save the file and place in a location of choice.
For me this will be “C:\Temp\VMware-ESXi-7.0U3p-23307199-depot.zip”

11. If you want, you can also download driver addons. In my case this is not necessary.

Create the custom ISO

Custom ISO creation with PowerCLI

In this part we will show you how to create a custom ESXi ISO with PowerCLI.

1. Open PowerShell

2. Run the following command:

					Get-DepotBaseImages "c:\temp\VMware-ESXi-7.0U3p-23307199-depot.zip"
				

You will get a result like this:

					Version              Vendor       Release date
-------              ------       ------------
7.0.3-0.110.23307199 VMware, Inc. 03/04/2024 23:00:00
				

3. You can also check the “DepotAddons” by running:

					Get-DepotAddons "c:\temp\Dell-i40en-Addon-depot.zip"
				

4. Now that we have the information, we can create the software spec JSON file. The software spec is a JSON file that contains information about the ESXi version and vendor add-on (if applicable).

Here is an example:

					{
    "add_on": {
        "name": "Dell-i40en-Addon",
        "version": "2.5.11.0"
    },
    "base_image": {
        "version": "7.0.3-0.110.23307199"
    },
    "components": null,
    "hardware_support": null,
    "solutions": null
}
				

5. Save this file with a .json extension in the same location as the “DepotBaseImage” and “DepotAddons” files.
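The software spec above can also be generated and saved programmatically, which is handy when you build images for several hardware generations. A minimal sketch; the temp-directory path below stands in for c:\temp:

```python
import json, os, tempfile

# Values copied from the example spec above; adjust for your depot and add-on
software_spec = {
    "add_on": {"name": "Dell-i40en-Addon", "version": "2.5.11.0"},
    "base_image": {"version": "7.0.3-0.110.23307199"},
    "components": None,
    "hardware_support": None,
    "solutions": None,
}

# Using a temp directory here; on your machine this would be c:\temp
path = os.path.join(tempfile.gettempdir(), "VMware-ESXi-7.0U3p-23307199-depot.json")
with open(path, "w") as f:
    json.dump(software_spec, f, indent=4)
```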

6. Now create the new ISO by running the following command:

					New-IsoImage -SoftwareSpec "c:\temp\VMware-ESXi-7.0U3p-23307199-depot.JSON" -Depots "c:\temp\VMware-ESXi-7.0U3p-23307199-depot.zip","c:\temp\Dell-i40en-Addon-depot.zip" -Destination "c:\temp\VMware-ESXi-7.0U3p-23307199-custom.iso"
				

7. Your ISO file is saved to the location specified in the “-Destination” parameter.

Custom ISO creation with vSphere Lifecycle Manager

In this part we will show you how to create a custom ESXi ISO with vSphere Lifecycle Manager.

1. Login to your vCenter.

2. Create a new cluster.
If your cluster is already using “Manage all hosts in the cluster with a single image”, you can skip to step X.

3. Give the cluster a name, check the box “Manage all hosts in the cluster with a single image” and click on “Next”.

4. Choose your ESXI version and click on “Next”.

5. Go to your newly created cluster, choose “Updates” and “Edit” the image.

6. Click on “Add Components”.

7. Make sure to show “Independent components and Vendor Addon Components”.

8. Choose your “Addon” and choose the “Version”. After that click on “Select”

9. Save the changes.

10. Click on the “Three dots” and choose “Export”.

11. Select “ISO” and click on “Export”.

12. Your download will start in your browser.

Awid Dashtgoli

Extending NSX Security to Additional VLAN-Backed VDS

Introduction

When NSX is already deployed on a cluster with an existing Virtual Distributed Switch (VDS), you may need to extend NSX security protections to workloads on a separate, VLAN-backed VDS, for example when physical uplinks on ESXi hosts are connected to another VDS for different services. This guide outlines two approaches to secure those workloads using VMware NSX.

Configuration

Protect Workloads Using NSX VLAN-Backed Segments

This method leverages NSX segments mapped to VLAN transport zones to secure traffic for workloads on the new VDS.

1. Create a VLAN Transport Zone

  • Define a new VLAN transport zone in NSX Manager.
  • Ensure the transport zone is unique per host switch within your transport node profile.

2. Review the Current Transport Node Profile

  • Examine the existing transport node profile to understand which host switches are already configured.

3. Edit the Host Switch Configuration

  • Modify the host switch section of the transport node profile to include both the original and new VDS definitions.

4. Add the Second VDS

  • In the host switch configuration, add the second VDS connected to the VLAN-backed networks.

5. Create NSX VLAN Segments

  • On the newly created VLAN transport zone, define NSX VLAN segments corresponding to the VLANs used by your workloads.

6. Migrate and Test

  • Move a test VM from the VDS port group into the matching NSX VLAN segment.
  • Validate network connectivity, such as by pinging the gateway or other VMs, to confirm segment functionality.

7. Apply Distributed Firewall (DFW) Rules

  • Define and publish DFW policies to control traffic between VMs attached to the NSX segments.

8. Verify Security

  • Test your policies (e.g., block ICMP between test VMs) to confirm that NSX Distributed Firewall rules are enforced as expected.
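Steps 2 through 4 essentially amount to appending a second host switch entry to the transport node profile while leaving the original one untouched. A rough sketch of that merge; the dict keys are modeled on the NSX transport node profile payload but should be treated as illustrative for your NSX version:

```python
def add_host_switch(profile, new_switch):
    """Append a second VDS definition to the profile's host switch list,
    refusing duplicates and leaving the original switch untouched."""
    switches = profile["host_switch_spec"]["host_switches"]
    if any(s["host_switch_name"] == new_switch["host_switch_name"] for s in switches):
        raise ValueError("host switch already present in profile")
    switches.append(new_switch)
    return profile

# Hypothetical profile with the original overlay VDS already configured
profile = {"host_switch_spec": {"host_switches": [
    {"host_switch_name": "vds-overlay", "transport_zone": "tz-overlay"},
]}}
add_host_switch(profile, {"host_switch_name": "vds-vlan", "transport_zone": "tz-vlan"})
```

The duplicate check matters because reapplying a profile with a repeated host switch name would fail validation.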
Awid Dashtgoli

How to Redeploy an NSX Edge Node Safely

Introduction

In VMware NSX, Edge nodes play a critical role by providing north-south routing, NAT, firewalling, and load-balancing services. In certain situations, such as failed upgrades, configuration corruption, or persistent health issues, it may be necessary to redeploy an NSX Edge node.

This article explains when a redeploy is appropriate and provides a safe, step-by-step approach to redeploy an Edge node without disrupting your environment more than necessary.

When Should You Redeploy an NSX Edge?

Redeploying an Edge node is typically considered when:

  • An Edge node is stuck in a degraded or failed state

  • Services (routing, BGP, NAT) are not functioning correctly

  • An Edge upgrade failed or was interrupted

  • Edge health checks consistently fail

  • Reboots and service restarts do not resolve the issue

Redeploying effectively rebuilds the Edge VM while preserving its logical configuration.

Important Considerations Before You Start

Before redeploying an Edge node, keep the following in mind:

  • Traffic impact: Redeploying will cause traffic interruption on that Edge

  • HA pairs: In an active/standby setup, redeploy the standby Edge first

  • Maintenance window: Always perform this during a planned window

  • Configuration safety: Logical configuration is preserved, but runtime state is reset

If possible, validate that another Edge is available to handle traffic.

Solution

Verify the NSX Edge Node

  1. Open an SSH session and connect to the NSX Edge node.

2. From the CLI, verify the logical routers configured on the NSX Edge node:

					get logical-routers
				

3. Power off the NSX Edge node.

Verify Disconnection from NSX Manager

Run the following API call to check the transport node state:

					GET /api/v1/transport-nodes/<edgenode>/state
				

The node_deployment_state should display:

					"node_deployment_state": {
    "state": "MPA_Disconnected"
}
				

A state of MPA_Disconnected confirms that the Edge node is disconnected and safe to redeploy.
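If you script this check, it reduces to a small helper on the response of the state call above. The sample payloads are abbreviated to the one field that matters:

```python
def safe_to_redeploy(transport_node_state):
    """Check the GET .../state response: only MPA_Disconnected is safe."""
    state = transport_node_state.get("node_deployment_state", {}).get("state")
    return state == "MPA_Disconnected"

ready = safe_to_redeploy({"node_deployment_state": {"state": "MPA_Disconnected"}})
blocked = safe_to_redeploy({"node_deployment_state": {"state": "Node Ready"}})
```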

Important

  • If the node_deployment_state is Node Ready, NSX Manager will block the redeployment.

  • In this case, the following error will be displayed:

					Error 78006: Manager connectivity to Edge node must be down
				

Using the NSX UI (Alternative Method)

You can also verify the connectivity status from the Edge Transport Node page in the NSX UI.
A disconnected NSX Edge node will display the following system message:

					Configuration Error: Edge VM MPA Connectivity is down
				

Redeploying an Auto-Deployed NSX Edge Node

Retrieve the Transport Node Payload

For an auto-deployed NSX Edge node, first retrieve the transport node configuration using the NSX Manager API.

  1. Run the following API call:
					GET https://<NSX-Manager-IPaddress>/api/v1/transport-nodes/<edgenode>
				

2. Save the full output payload.
This payload will be reused during the redeployment process.

Prepare the Redeploy Payload

  1. Copy the previously retrieved payload.

  2. Paste it into the request body of the redeploy API call.

  3. Verify that the deployment_config section contains correct values for:

    • Compute Manager

    • Compute Cluster

    • Datastore

    • Management Network

    • Data Networks

  4. Ensure these values are consistent with the configuration defined under the node_settings section.

NSX Manager relies on the information in the deployment_config section to redeploy the NSX Edge node to the specified location and resources.
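The verification in step 3 can be scripted as a quick pre-flight check before the POST call. The key names below are illustrative, so compare them against the actual deployment_config and node_settings sections of your saved payload:

```python
# Key names below are illustrative; compare them against the actual
# deployment_config section of your saved payload.
REQUIRED = ["compute_id", "storage_id", "management_network_id", "data_network_ids"]

def missing_deploy_fields(payload):
    """List required placement fields that are empty or absent."""
    cfg = payload.get("deployment_config", {}).get("vm_deployment_config", {})
    return [key for key in REQUIRED if not cfg.get(key)]

# Abbreviated, hypothetical payload
payload = {"deployment_config": {"vm_deployment_config": {
    "compute_id": "domain-c8",
    "storage_id": "datastore-11",
    "management_network_id": "dvportgroup-20",
    "data_network_ids": ["dvportgroup-21", "dvportgroup-22"],
}}}
problems = missing_deploy_fields(payload)
```

An empty `problems` list means the placement section is at least populated; only then submit the redeploy call.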

Once the payload has been verified, initiate the redeployment by running:

					POST /api/v1/transport-nodes/<transport-node-id>?action=redeploy
				
Awid Dashtgoli

Fix: “CloudCell Is Not Active” in VMware Cloud Director

Introduction

When working with VMware Cloud Director, you may encounter the message:

CloudCell is not active

This issue typically prevents normal operation of Cloud Director services and can impact tenant access, UI availability, or API functionality. In most cases, the problem is related to stopped services, database connectivity issues, or an unhealthy cell state.

This article explains what the error means and how to safely troubleshoot and resolve it.

Issue

When working with VMware Cloud Director, you may encounter the message:

					CloudCell is not active
				

Understanding the Cause

The issue is usually triggered by one or more of the following:

  • Cloud Director services are not running

  • The cell cannot connect to the database

  • A previous upgrade or reboot did not complete cleanly

  • Disk space exhaustion (especially under /opt/vmware/vcloud-director)

  • Certificate or trust issues

  • Network or load balancer misconfiguration (in multi-cell setups)

Solution

  1. Log in to the Cloud Director cell.
  2. Check the Cloud Director service status:

					systemctl status vmware-vcd
				

3. Start or restart the services, then wait a few minutes and monitor the startup process:

					systemctl start vmware-vcd
systemctl restart vmware-vcd
				

4. Check Cloud Director Logs

					cd /opt/vmware/vcloud-director/logs
tail -f vcloud-container-debug.log
				

Look for:

  • Database connection errors

  • Certificate validation failures

  • Disk or permission errors

  • Service dependency failures

These logs usually point directly to the root cause.
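A small helper can pre-sort a captured log excerpt into the buckets above. The keyword lists and the sample line are illustrative; real log phrasing varies between Cloud Director versions:

```python
# Keyword buckets mirroring the checklist above; actual log phrasing
# varies between Cloud Director versions, so treat these as illustrative.
BUCKETS = {
    "database": ("jdbc", "database", "connection refused"),
    "certificate": ("certificate", "ssl", "truststore"),
    "disk/permissions": ("no space left", "permission denied"),
}

def classify(line):
    """Return the bucket names whose keywords appear in a log line."""
    low = line.lower()
    return [name for name, words in BUCKETS.items() if any(w in low for w in words)]

hits = classify("ERROR Unable to open JDBC connection to the vCD database")
```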

Awid Dashtgoli

Fix: “Error Installing HA Components Failed” in vSphere 8

Introduction

When enabling vSphere High Availability (HA) on a cluster running vSphere 8, you may encounter the error:

Error installing HA components. Failed to install HA components on the hosts.

This issue prevents HA from being enabled and is commonly caused by certificate trust or agent communication problems between ESXi hosts and vCenter.

In this article, we explain the root cause and walk through a safe, step-by-step resolution.

Issue

					Error installing HA components. Failed to install HA components on the hosts.
				

What Causes This Error?

The HA installation process relies on vCenter deploying and configuring HA agents (fdm) on all ESXi hosts in the cluster. This process can fail due to:

  • Corrupted or outdated ESXi host certificates

  • Certificate trust mismatch between vCenter and ESXi

  • Leftover HA agent files from a previous configuration

  • Management service communication issues

  • Interrupted upgrades or host reprovisioning

As a result, vCenter cannot successfully install or validate the HA components.

Verify the Exact Error

Start by checking the vSphere Client task details when HA fails to enable.
You’ll usually see errors referencing:

  • fdm

  • Install agent

  • Trust relationship

  • Cannot verify host certificate

This confirms the issue is agent- or certificate-related.

Solution

Disable HA on the Cluster (If Partially Enabled)

If HA is in a failed or partial state:

  1. Go to Cluster Settings

  2. Disable vSphere HA

  3. Wait until the configuration fully completes

This ensures a clean baseline before remediation.

Restart ESXi Management Services

On each ESXi host in the affected cluster:

  1. Enable SSH

  2. Log in as root

  3. Restart management agents:

					/etc/init.d/hostd restart
/etc/init.d/vpxa restart
				

This clears stale agent communication issues.

Remove Old HA Agent Files (If Present)

On each ESXi host, check for existing HA configuration remnants:

					ls -l /etc/opt/vmware/fdm

				

If the directory exists, stop services and remove it:

					/etc/init.d/hostd stop
/etc/init.d/vpxa stop
rm -rf /etc/opt/vmware/fdm
/etc/init.d/hostd start
/etc/init.d/vpxa start
				
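If you script this cleanup across many hosts, the check-then-remove logic looks like the sketch below. It is demonstrated against a scratch directory rather than a live host; on a real ESXi host you would still stop hostd and vpxa first, and the directory would be /etc/opt/vmware/fdm:

```python
import shutil, tempfile
from pathlib import Path

def remove_fdm_remnants(root):
    """Delete the leftover HA agent directory if present (no-op otherwise)."""
    path = Path(root)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False

# Demonstrated against a throwaway directory instead of a live host,
# where root would be /etc/opt/vmware/fdm
demo = Path(tempfile.mkdtemp()) / "fdm"
demo.mkdir()
removed = remove_fdm_remnants(demo)
```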

Refresh Host Certificates (Most Common Fix)

Refresh Certificates via vCenter:

  • In the vSphere Client, select the host

  • Go to Configure → System → Certificate

  • Click Renew or Refresh Certificates

  • Reconnect the host if prompted

Awid Dashtgoli


Fixing a 502 Bad Gateway on VMware SDDC Manager by Freeing Up Disk Space (Log Rotation Issue)

Introduction

A 502 Bad Gateway or 503 Service Unavailable error on an SDDC Manager can often stem from a full disk, most commonly caused by failed log rotation. In this article, we walk through practical steps to identify and fix this issue so you can regain access to the SDDC Manager UI.

Issue

When the root partition of your SDDC Manager VM fills up, critical services — including the web UI — fail to start properly. This often results in browser errors like:

					502 Bad Gateway – nginx
				

Understanding the Cause

Before diving into updates or configuration changes, the first step is to confirm if disk space is the root cause.

1. SSH Into the SDDC Manager

					ssh vcf@<SDDC-Manager-IP>
su root
				

2. Check Disk Usage
Once logged in as root, check if any partitions, especially /, are at or near 100% capacity:

					df -h
				

If the root volume (e.g., /dev/mapper/vg_system_lv_root) is full, you’re likely dealing with a large log or journal file.

Solution

Clear Large Log Files

To identify oversized logs in /var/log, run:

					du -ah /var/log | sort -rh | head -n 20
				

This command shows the largest files, for example messages.1, audit.log, or auth.log. If they are unusually large, log rotation may have failed.
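For reference, the same largest-files report can be produced with a short script, which is handy if you want to automate the check across several appliances. It is demonstrated here against a throwaway directory rather than a live /var/log:

```python
import os, tempfile

def largest_files(root, top=20):
    """Rough equivalent of `du -ah | sort -rh | head` for regular files."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                pass  # file vanished or unreadable; skip it
    return sorted(sizes, reverse=True)[:top]

# Demo against a throwaway directory rather than a live /var/log
logdir = tempfile.mkdtemp()
with open(os.path.join(logdir, "messages.1"), "w") as f:
    f.write("x" * 5000)
with open(os.path.join(logdir, "auth.log"), "w") as f:
    f.write("x" * 10)
top = largest_files(logdir)
```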

Fix Log Rotation

After truncating or removing the oversized files, confirm the logrotate configuration is correct and force a rotation run with the following command:

					logrotate -f /etc/logrotate.conf
				

Now reboot the SDDC Manager VM with the following command:

					reboot
				
Awid Dashtgoli