
This article continues the series Troubleshooting Node down Scenarios in Azure Service Fabric.


 


Scenario#5:


The virtual machine associated with the node is healthy, but an unhealthy Service Fabric extension can cause the node to go down in the Service Fabric cluster.


Analysis:


RDP into the node that is down. Open Task Manager and observe the Fabric processes.




If Fabric.exe and FabricHost.exe are crashing and restarting often, go to Mitigation#1.


If ServiceFabricNodeBootstrapAgent.exe is crashing and restarting often, go to Mitigation#2.


If FabricInstallerSvc.exe is crashing and restarting often, go to Mitigation#3.
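
The Task Manager check above can also be scripted. The following is a minimal sketch, not an official tool: it assumes Python is available on the node, takes two snapshots of the process list with the Windows `tasklist` command, and flags any of the four processes that is missing or whose process ID changed (a sign that it restarted). The function names and the 60-second sampling interval are illustrative choices.

```python
import csv
import io
import subprocess
import time

# Process-to-mitigation mapping taken from the analysis steps above.
MITIGATION = {
    "fabric.exe": "Mitigation#1",
    "fabrichost.exe": "Mitigation#1",
    "servicefabricnodebootstrapagent.exe": "Mitigation#2",
    "fabricinstallersvc.exe": "Mitigation#3",
}


def parse_tasklist(csv_text):
    """Parse `tasklist /FO CSV /NH` output into {lowercase image name: set of PIDs}."""
    pids = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and row[1].isdigit():
            pids.setdefault(row[0].lower(), set()).add(int(row[1]))
    return pids


def suspects(before, after):
    """Return watched processes that are missing from the later snapshot
    or whose PIDs changed between the two snapshots."""
    return sorted(
        name
        for name in MITIGATION
        if not after.get(name) or before.get(name, set()) != after[name]
    )


def snapshot():
    """Capture the current process list (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist(result.stdout)


def watch(interval_seconds=60):
    """Take two snapshots and print which mitigation to try for each suspect."""
    first = snapshot()
    time.sleep(interval_seconds)
    for name in suspects(first, snapshot()):
        print(f"{name}: see {MITIGATION[name]}")
```

Calling `watch()` on the affected node prints one line per suspect process. A single PID change is only a hint; repeating the check a few times gives a better picture of how often a process restarts.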


 


Mitigation#1:



  • Locate ClusterManifest.current.xml on the node (see Path below).

  • Does it match the manifest for the cluster? Compare it with the manifest shown in SFX.

  • If it does not match:

    • Does SFX indicate that an upgrade is in progress?



  • If no upgrade is in progress:

    • Go to <Path>.

    • Open ClusterManifest.current.xml.

    • Replace the contents of ClusterManifest.current.xml with the contents of the manifest in SFX.

    • Save the file.

    • In Task Manager, select Fabric.exe if it is running and click the “End Task” button.

    • If Fabric.exe is not running, reboot the VM.

    • It will take a few minutes for the node to become healthy.

    • If the node does not become healthy, start again from the beginning.




Path: D:\SvcFab\<Nodename>\Fabric\ClusterManifest.current.xml
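
Comparing the two manifests by eye is error-prone, so the comparison can be sketched in code. The example below assumes the manifest from SFX has been copied into a string or saved to a file; it canonicalizes both XML documents with Python's standard library (Python 3.8 or later) so that differences in whitespace and attribute order are ignored. The function names are hypothetical.

```python
from xml.etree.ElementTree import canonicalize


def manifests_match(local_xml, sfx_xml):
    """Return True if two manifest XML strings are equivalent after
    C14N canonicalization with whitespace-only text stripped."""
    return (
        canonicalize(local_xml, strip_text=True)
        == canonicalize(sfx_xml, strip_text=True)
    )


def manifest_files_match(local_path, sfx_path):
    """Same comparison for two files on disk, for example the node's
    ClusterManifest.current.xml and a saved copy of the SFX manifest."""
    return (
        canonicalize(from_file=local_path, strip_text=True)
        == canonicalize(from_file=sfx_path, strip_text=True)
    )
```

If the function returns `False`, the node's manifest differs from the cluster's, and the replacement steps above apply once no upgrade is in progress.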


 


Mitigation#2:


Check whether ServiceFabricNodeBootstrapAgent.exe appears in the list of processes in Task Manager.



  • If “Yes”:

    • Wait a while to see whether the node heals itself.

    • This process tries to heal the failure at a coarse level by restarting the VM and reinstalling the SF runtime.

    • It waits for 15 minutes after an attempt to heal before taking the next action.

    • Check ServiceFabricNodeBootstrapAgent.InstallLog (see “From the Node”). Path: C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\<version>\Service\ServiceFabricNodeBootstrapAgent.InstallLog

    • If the node does not heal, go to “Event Viewer logs” for error details.




 



  • If “No”:

    • Go to the Services tab in Task Manager and click the Open Services link at the bottom.

    • Check the startup mode for the bootstrap service and make sure it is Automatic.

    • Start the service.

    • If it stays running, go to the “Yes” section above.
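
The startup-mode check in the “No” branch can also be done from a script. This is a rough sketch that parses the output of the Windows `sc qc` and `sc query` commands. The service name `ServiceFabricNodeBootstrapAgent` is an assumption based on the process name; confirm the actual name in the Services console before relying on it. The helper names are illustrative.

```python
import re
import subprocess

# Assumption: the service is named after the process. Verify in the Services console.
SERVICE_NAME = "ServiceFabricNodeBootstrapAgent"


def start_type(sc_qc_output):
    """Extract the START_TYPE label (for example AUTO_START) from `sc qc` output."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def state(sc_query_output):
    """Extract the STATE label (for example RUNNING) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_query_output)
    return match.group(1) if match else None


def run_sc(*args):
    """Run the Windows `sc` tool and return its standard output."""
    return subprocess.run(["sc", *args], capture_output=True, text=True).stdout


def check_bootstrap_service(name=SERVICE_NAME):
    """Return (is_automatic, is_running) for the named service (Windows only)."""
    return (
        start_type(run_sc("qc", name)) == "AUTO_START",
        state(run_sc("query", name)) == "RUNNING",
    )
```

If `check_bootstrap_service()` reports that the service is not set to start automatically or is not running, follow the manual steps above to correct the startup mode and start it.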




 


Mitigation#3:


Check whether network connectivity on the node is working.


For more details, refer to Part III – Troubleshooting Node down Scenarios.
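
As a first connectivity test, a plain TCP connection attempt from the node can rule out basic network problems. The sketch below is a generic reachability check, not the procedure from Part III; which hosts and ports to test depends on the cluster, so every endpoint passed in is a placeholder to fill in for your environment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints):
    """Map each (host, port) pair to whether it is reachable over TCP."""
    return {(host, port): can_connect(host, port) for host, port in endpoints}
```

For example, `check_endpoints([("<host>", <port>)])` returns a dictionary that shows which endpoints accepted a connection. A `False` entry indicates a blocked or unreachable endpoint worth investigating with the Part III guidance.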
