Recently, I observed a strange behaviour during an NSX-T upgrade from version 2.5.3 to 3.0.3 at a customer. The NSX-T Manager upgrade failed for some reason while the Edge nodes and ESXi transport nodes had already been upgraded successfully.

The upgrade of the appliance was stuck and cannot proceed. Leveraging get upgrade progress-status on the upgrade orchestrator node shows that the upgrade fails at resume_other_nodes:

Trying to resume the upgrade at that step manually by executing start upgrade-bundle Vmware-NSX-unified-appliance-3.0.3.0.0.17777744 step resume_other_nodes didn’t work and resulted in the same error.

Together with GSS we then rebooted the two non orchestrator nodes and resumed the remaining steps of the upgrade manually from the CLI:

start upgrade-bundle Vmware-NSX-unified-appliance-3.0.3.0.0.17777744 step restore_datastore_cluster
start upgrade-bundle Vmware-NSX-unified-appliance-3.0.3.0.0.17777744 step update_upgrade_status
start upgrade-bundle Vmware-NSX-unified-appliance-3.0.3.0.0.17777744 step finish_upgrade

This time the upgrade finished successfully.