
This is a continuation of Troubleshooting Node down Scenarios in Azure Service Fabric here.


 


Scenario#6:


Check the network connectivity between the nodes:



  • Open a command prompt.

  • Run ping <IP address of the other node>.


[Screenshot: ping output]


If the request times out, the nodes cannot reach each other.


Mitigation:


Check whether a network security group (NSG) is blocking connectivity between the nodes.
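Because an NSG rule can block individual TCP ports even when ping succeeds (or block ICMP while TCP still works), a per-port check can complement the ping test. The following is a minimal sketch, not part of the original article: the function names `tcp_port_open` and `check_ports` are hypothetical, and the ports mentioned in the usage note are commonly used Service Fabric defaults that are assumed here; confirm the actual values in your cluster manifest.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True (reachable) or False (blocked or closed)."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

For example, running `check_ports("<IP address of the other node>", [1025, 1026, 19000, 19080])` from one node would report which of the assumed cluster, lease, client and HTTP gateway ports accept connections on the other node. A `False` result only shows that the connection failed; it does not by itself prove that an NSG is the cause.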


 


Scenario#7:


A node-to-node communication failure caused by any of the following reasons can bring a node down.



  • The cluster certificate has expired.

  • The Service Fabric (SF) extension on the virtual machine scale set (VMSS) resource points to an expired certificate. When the VM reboots, the node may go down because of this expired certificate.


"extensionProfile": {
    "extensions": [
        {
            "properties": {
                "autoUpgradeMinorVersion": true,
                "settings": {
                    "clusterEndpoint": "https://xxxxx.servicefabric.azure.com/runtime/clusters/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
                    "nodeTypeRef": "sys",
                    "dataPath": "D:\\SvcFab",
                    "durabilityLevel": "Bronze",
                    "enableParallelJobs": true,
                    "nicPrefixOverride": "10.0.0.0/24",
                    "certificate": {
                        "thumbprint": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
                        "x509StoreName": "My"
                    }


 



  • The certificate is not ACL'd to the NETWORK SERVICE account. Make sure it is.

  • The reverse proxy certificate has expired.

  • If all of the above have been addressed, go to Scenario#8.
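To compare the thumbprint configured in the SF extension with the certificate actually installed on the node, it can help to extract and normalize it programmatically. The sketch below is illustrative only: it assumes the VMSS profile has been loaded into a Python dict shaped like the JSON fragment above, and the helper names (`normalize_thumbprint`, `extension_thumbprints`, `is_expired`) are hypothetical. The certificate's expiry date must be obtained separately, for example from the certificate store.

```python
from datetime import datetime, timezone

HEX_DIGITS = set("0123456789abcdefABCDEF")


def normalize_thumbprint(value):
    """Drop spaces, separators and invisible characters; upper-case the rest."""
    return "".join(ch for ch in value if ch in HEX_DIGITS).upper()


def extension_thumbprints(vm_profile):
    """Collect certificate thumbprints from extensions shaped like the fragment."""
    found = []
    extensions = vm_profile.get("extensionProfile", {}).get("extensions", [])
    for ext in extensions:
        settings = ext.get("properties", {}).get("settings", {})
        cert = settings.get("certificate")
        if cert and "thumbprint" in cert:
            found.append(normalize_thumbprint(cert["thumbprint"]))
    return found


def is_expired(not_after, now=None):
    """Return True if the timezone-aware expiry time not_after has passed."""
    now = now or datetime.now(timezone.utc)
    return now >= not_after
```

If the normalized thumbprint from the extension does not match the normalized thumbprint of a valid, unexpired certificate on the node, the extension is likely still pointing at the old certificate.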


 


Scenario#8:


If Node1 cannot establish a lease with a neighboring node (Node2), Node1 can go down.


From the SF traces:


For example, in the logs we see that a node with Node ID “e4eac25286f23859b79b5483964ab0c8” (Node1) failed to establish a lease with a node with Node ID “c196867202638ea43655614031736e9” (Node2).


[Screenshot: SF trace showing the failed lease between Node1 and Node2]


The focus should now be on the node with which the lease cannot be established (Node2), rather than on the node that is down (Node1).


[Screenshot: SF traces containing the error code]


From the traces above, we get the error code c0000017.


To understand what this error code means, download the Microsoft Error Lookup Tool.


Then run the executable, passing the error code as a parameter:


[Screenshot: Error Lookup Tool output for the error code]
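As a rough cross-check of the lookup, the code can also be decoded by hand. The sketch below assumes that c0000017 is an NTSTATUS value, which its 0xC prefix suggests but the article does not state. The function name `decode_ntstatus` is hypothetical, and the table is a small hand-picked subset rather than a replacement for the Error Lookup Tool.

```python
# Hand-picked subset of well-known NTSTATUS values (not exhaustive).
KNOWN_NTSTATUS = {
    0x00000000: "STATUS_SUCCESS",
    0xC0000017: "STATUS_NO_MEMORY",
    0xC0000022: "STATUS_ACCESS_DENIED",
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
}

# The top two bits of an NTSTATUS value encode its severity.
SEVERITY = ["success", "informational", "warning", "error"]


def decode_ntstatus(text):
    """Split a hex NTSTATUS string such as 'c0000017' into its bit fields."""
    value = int(text, 16) & 0xFFFFFFFF
    return {
        "value": f"0x{value:08X}",
        "severity": SEVERITY[value >> 30],
        "customer_defined": bool((value >> 29) & 1),
        "facility": (value >> 16) & 0x0FFF,
        "code": value & 0xFFFF,
        "name": KNOWN_NTSTATUS.get(value, "unknown (not in this table)"),
    }
```

Under that assumption, `decode_ntstatus("c0000017")` reports error severity and the name STATUS_NO_MEMORY, which indicates that not enough virtual memory or paging file quota was available.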


 


Mitigation:


Restart Node2. This can free up virtual memory so that Node2 can re-establish the lease with Node1, which brings Node1 back up.


 


 
