
A node can go down for several reasons. The scenarios below cover the probable causes of nodes going down in a Service Fabric cluster.


 


Scenario#1:


Check whether the virtual machine associated with the node still exists, or has been deleted or deallocated.


Go to Azure Portal -> VMSS Resource -> Instances
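The same check can be scripted instead of using the portal. The following is a minimal sketch, assuming the Az.Compute PowerShell module is installed and you are already signed in with Connect-AzAccount. The resource group and scale set names are hypothetical placeholders.

```powershell
# Hypothetical names; replace with your own resource group and scale set.
$resourceGroup = 'my-sf-rg'
$scaleSet      = 'nt1vm'

# List the instances that still exist in the scale set.
# A node whose VM does not appear here has been deleted.
$instances = Get-AzVmssVM -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSet
$instances | Select-Object InstanceId, Name, ProvisioningState

# Read the power state of each instance to spot deallocated VMs.
foreach ($vm in $instances) {
    $view = Get-AzVmssVM -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSet `
        -InstanceId $vm.InstanceId -InstanceView
    $power = ($view.Statuses | Where-Object { $_.Code -like 'PowerState/*' }).DisplayStatus
    [pscustomobject]@{ InstanceId = $vm.InstanceId; PowerState = $power }
}
```

An instance reported as deallocated still exists and can be started again. Only a VM that is missing from the list is a candidate for node state removal.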




If the virtual machine doesn't exist, remove the node state from the Service Fabric cluster using either of the methods below.


From Service Fabric Explorer (SFX):



  • Go to the Service Fabric Explorer of the cluster.

  • Select the Advanced mode check box in the cluster settings.





  • Click the ellipsis (…) next to the down node to reveal the “Remove node state” option, then click it. This removes the node state from the cluster.


 


From PowerShell:


PS cmd: Remove-ServiceFabricNodeState -NodeName _node_5 -Force


Reference: https://docs.microsoft.com/en-us/powershell/module/servicefabric/remove-servicefabricnodestate?view=azureservicefabricps
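For context, the command above needs an active cluster connection. The following is a minimal sketch, assuming the Service Fabric PowerShell module is available on the machine. The connection endpoint is a hypothetical placeholder, and a secure cluster also needs certificate parameters on Connect-ServiceFabricCluster.

```powershell
# Hypothetical endpoint; secure clusters also require certificate parameters here.
Connect-ServiceFabricCluster -ConnectionEndpoint 'mycluster.westus.cloudapp.azure.com:19000'

# List the nodes that the cluster currently reports as Down.
Get-ServiceFabricNode |
    Where-Object { $_.NodeStatus -eq 'Down' } |
    Select-Object NodeName, NodeStatus, HealthState, IpAddressOrFQDN

# Remove the state only for a node whose VM is confirmed to be gone.
# '_node_5' is the node name from the example above.
Remove-ServiceFabricNodeState -NodeName '_node_5' -Force
```

Removing node state tells the cluster that the node will not come back, so run it only after confirming that the virtual machine no longer exists.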


 


Scenario#2:


Check whether the virtual machine associated with the node is healthy in the VMSS.


Go to Azure Portal-> VMSS Resource -> Instances -> Click on the Instance -> Properties




If the Virtual Machine Guest Agent status is “Not Ready”, reach out to the Azure VM team for a root cause analysis (RCA).
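The guest agent status can also be read from the instance view. The following is a minimal sketch, assuming the Az.Compute module and an authenticated session. The resource names and the instance ID are hypothetical placeholders.

```powershell
# Hypothetical names and instance ID; replace with your own values.
$resourceGroup = 'my-sf-rg'
$scaleSet      = 'nt1vm'
$instanceId    = '0'

# Fetch the instance view, which carries the VM agent status.
$view = Get-AzVmssVM -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSet `
    -InstanceId $instanceId -InstanceView

# DisplayStatus shows whether the agent is ready.
$view.VmAgent.Statuses | Select-Object Code, DisplayStatus, Message, Time
```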


 


Possible Mitigation:



  • Restart the virtual machine from the VMSS blade.

  • Re-image the virtual machine.
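Both mitigations can also be run from PowerShell rather than the portal. The following is a sketch under the same assumptions (Az.Compute module, authenticated session), with hypothetical resource names and instance ID.

```powershell
# Hypothetical names and instance ID; replace with your own values.
$resourceGroup = 'my-sf-rg'
$scaleSet      = 'nt1vm'
$instanceId    = '0'

# Option 1: restart the single affected instance.
Restart-AzVmss -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSet -InstanceId $instanceId

# Option 2: re-image the instance. This resets the OS disk to the scale set image,
# so try it only if a restart does not help.
Set-AzVmss -ResourceGroupName $resourceGroup -VMScaleSetName $scaleSet -InstanceId $instanceId -Reimage
```

Work on one instance at a time so that the remaining nodes keep the cluster healthy.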


 


Scenario#3:


Check the performance of the virtual machine, such as CPU and memory usage.




 


If CPU or memory usage is high, the Fabric-related processes cannot start their instances, which causes the node to go down.
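When you are connected to the virtual machine over RDP, a quick snapshot of CPU and memory pressure can be taken from PowerShell. The following is a minimal sketch using built-in Windows cmdlets. The counter path assumes an English-language Windows installation.

```powershell
# Current total CPU usage in percent (counter name assumes English Windows).
$cpu = (Get-Counter '\Processor(_Total)\% Processor Time').CounterSamples.CookedValue
"CPU usage: {0:N1} %" -f $cpu

# Physical memory usage in percent (both values are reported in KB).
$os = Get-CimInstance Win32_OperatingSystem
$memUsed = 100 * (1 - $os.FreePhysicalMemory / $os.TotalVisibleMemorySize)
"Memory usage: {0:N1} %" -f $memUsed

# Five processes with the largest working set, with accumulated CPU seconds.
Get-Process |
    Sort-Object WorkingSet64 -Descending |
    Select-Object -First 5 Name, Id, CPU, WorkingSet64
```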


 


Mitigation:



  • Check which process is consuming high CPU/Memory from the Task Manager to investigate the root cause and fix the issue permanently.


Collect dumps using one of the tools below to determine the root cause:


DebugDiag:


Download Debug Diagnostic Tool v2 Update 3 from Official Microsoft Download Center


 


(or) Procdump:


ProcDump – Windows Sysinternals | Microsoft Docs



  • Restart the virtual machine from the VMSS blade.
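For the dump collection step with ProcDump, the following sketch captures a full memory dump of the process that has used the most CPU time. It assumes procdump.exe has been downloaded and is on the PATH. The output folder is a hypothetical placeholder.

```powershell
# Hypothetical output folder; pick a drive with enough free space.
$dumpFolder = 'D:\dumps'
New-Item -ItemType Directory -Path $dumpFolder -Force | Out-Null

# Pick the process with the highest accumulated CPU time.
$target = Get-Process | Sort-Object CPU -Descending | Select-Object -First 1
"Dumping {0} (PID {1})" -f $target.Name, $target.Id

# -ma writes a full dump; -accepteula suppresses the license prompt.
procdump.exe -accepteula -ma $target.Id $dumpFolder
```

Take the dump before restarting the virtual machine, because the restart discards the process state needed for root cause analysis.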


 


Scenario#4:


Check the disk usage of the virtual machine. Running out of disk space can cause the node to go down.


For disk space issues, we recommend using the ‘windirstat’ tool mentioned in the article https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Cluster/Out%20of%20Diskspace.md to understand which folders are consuming the most space.
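If installing a tool on the node is not practical, a rough equivalent can be scripted with built-in cmdlets. The following is a minimal sketch. The root folder is an assumption and should point to wherever the Service Fabric data root lives on your nodes.

```powershell
# Free and total space per local fixed disk.
Get-CimInstance Win32_LogicalDisk -Filter 'DriveType=3' |
    Select-Object DeviceID,
        @{ Name = 'SizeGB'; Expression = { [math]::Round($_.Size / 1GB, 1) } },
        @{ Name = 'FreeGB'; Expression = { [math]::Round($_.FreeSpace / 1GB, 1) } }

# Assumed root folder; adjust to your node's data root.
$root = 'D:\SvcFab'

# Ten largest top-level folders under the root.
Get-ChildItem -Path $root -Directory -ErrorAction SilentlyContinue | ForEach-Object {
    $bytes = (Get-ChildItem -Path $_.FullName -Recurse -File -ErrorAction SilentlyContinue |
        Measure-Object -Property Length -Sum).Sum
    [pscustomobject]@{ Folder = $_.FullName; SizeGB = [math]::Round($bytes / 1GB, 2) }
} | Sort-Object SizeGB -Descending | Select-Object -First 10
```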


 


Mitigation:


Free up disk space to bring the node back up.


 
