Azure Redis Timeouts - Network Issues - Dr. Ware Technology Services

Overview

Timeouts on the Redis client side can have many causes, on the client, on the network, or on the server, and the error message may differ depending on the client library used.

Timeouts in Azure Cache for Redis occur on the client side when the client application does not receive a response from the Redis server in time.
Whenever the client application does not receive a response before one of its timeout values expires, a timeout occurs and is logged in the client application logs as a Redis timeout error.

On the network side, Redis timeouts can occur at any time because of connection failures, network blips, exceeded client or server network bandwidth limits, or any other network issue that degrades Redis connectivity.

Ordered from most to least common, the main network causes are:

1- Network blips

2- Client Network bandwidth limit exceeded

3- Server Network bandwidth limit exceeded

For Client or Server side Redis timeout causes, please see these Tech Community articles:

Azure Redis Timeouts – Client Side Issues
Azure Redis Timeouts – Server Side Issues

Microsoft recommendation:

Microsoft recommends using Azure Cache for Redis in the same Azure region as the Redis client application.

The main reason is to take advantage of the high-bandwidth, low-latency network backbone within each Azure region, and to avoid public networks with less bandwidth.

The Azure network backbone has many mechanisms that make it resilient to failures.

Within an Azure region, a failure on the network backbone would affect all Azure services and all customers, not only connections to Azure Cache for Redis.

This is the main reason to avoid running Azure Cache for Redis in a different Azure region from the client application.
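
One rough way to see the effect of region placement is to time TCP connections from the client environment to the cache endpoint. The sketch below is only an illustration, not an official diagnostic tool: the function names are hypothetical, and the host name and port in the usage note are placeholders to replace with your own cache endpoint.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, timeout=2.0):
    """Time one TCP connection setup to host:port, in milliseconds."""
    start = time.perf_counter()
    # create_connection raises OSError (including timeouts) on failure
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed as soon as it is established
    return (time.perf_counter() - start) * 1000.0


def median_connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Take several measurements and return the median to smooth out spikes."""
    timings = [tcp_connect_latency_ms(host, port, timeout) for _ in range(samples)]
    return statistics.median(timings)
```

For example, `median_connect_latency_ms("mycache.redis.cache.windows.net", 6380)` (a placeholder host name) run from the client machine gives a figure that can be compared between a same-region and a cross-region deployment. It measures connection setup only, not Redis command latency.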

1- Network blips

Network blips can be caused by Azure Load Balancer operations, by failovers due to Redis or host updates/patches, or by other reasons.

Because network blips are transient, it is not always possible to identify their cause from the client side.

Network blips are to be expected, and Microsoft recommends using a retry policy in the client application to deal with them.
With a retry policy, the client application should recover from a network blip on retry, avoiding any downtime for the end user.
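
A retry policy of this kind can be sketched as follows. This is a minimal illustration using exponential backoff with jitter, not the policy of any particular client library: `TransientNetworkError`, `call_with_retry`, and all default values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical stand-in for a client library's timeout or connection error."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                    retry_on=(TransientNetworkError, TimeoutError, ConnectionError),
                    sleep=time.sleep):
    """Run operation(); on a transient error, wait and try again.

    The upper bound of the wait doubles after each failed attempt (capped at
    max_delay), and the actual wait is randomized so that many clients do not
    all retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A call such as `call_with_retry(lambda: cache.get("key"))`, where `cache` is whatever Redis client object the application already uses, would then absorb a short blip. Only idempotent operations should be retried blindly, since a command may have reached the server even though its response was lost.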

The StackExchange.Redis client library has a setting named AbortOnConnectFail (abortConnect in the connection string) that controls how it handles connectivity errors like this. The default value for this setting is “True”, meaning that it will not reconnect automatically in some cases. Microsoft recommends setting it to false so that the ConnectionMultiplexer reconnects automatically.

2- Client Network bandwidth limit exceeded

To determine whether Redis timeouts are caused by exceeding the client network bandwidth limit, investigate network usage in the client environment (AppService, VM, VMSS, AKS, etc.) that hosts the Redis client application, and check whether it is reaching a network interface limit or a network bandwidth limit (inbound or outbound).
Keep in mind that every application in the client environment contributes to network utilization, not only the Redis client application. Also, timeouts do not require 100% bandwidth utilization: values approaching 100% can already increase network latency and cause Redis timeouts.
Troubleshooting this may require network tools on the client side, depending on the client environment (AppService, VM, VMSS, AKS, etc.).
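
The utilization check described above comes down to simple arithmetic on two readings of a network interface byte counter. The sketch below shows that arithmetic only. The function names and the 80% warning level are assumptions made for illustration; where the counter readings and the bandwidth limit come from depends on the client environment.

```python
def bandwidth_utilization(bytes_before, bytes_after, interval_seconds, limit_mbps):
    """Return (throughput in Mbit/s, percent of limit) between two counter samples.

    bytes_before / bytes_after are cumulative byte counters for one direction
    (inbound or outbound), read interval_seconds apart.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if limit_mbps <= 0:
        raise ValueError("limit_mbps must be positive")
    bits = (bytes_after - bytes_before) * 8
    mbps = bits / interval_seconds / 1_000_000
    return mbps, 100.0 * mbps / limit_mbps


def is_near_limit(percent_used, warn_at_percent=80.0):
    """Flag utilization before it reaches 100%, since latency rises earlier."""
    return percent_used >= warn_at_percent
```

For instance, 125,000,000 bytes transferred in 10 seconds on a link limited to 1000 Mbit/s is 100 Mbit/s, or 10% of the limit. Inbound and outbound traffic should be checked separately, and the counters cover all applications on the machine, not only the Redis client.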

3- Server Network bandwidth limit exceeded

To determine whether Redis timeouts are caused by exceeding the server network bandwidth limit, use two metrics available in the Azure portal, on the Redis resource or in the Azure Monitor blade: Cache Read and Cache Write. They show the amount of data read from and written to the cache in megabytes per second (MB/s) during the specified reporting interval (see Available metrics and reporting intervals).
Cache Read and Cache Write are available for the whole Redis instance or, for clustered instances, per shard. In that case you may need to check each shard's network usage to find out whether a specific shard is exceeding its bandwidth limit.
You can then compare these metrics with the network performance listed for your Redis tier (see Azure Cache for Redis performance).
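
The comparison can be sketched as below, assuming the peak Cache Read and Cache Write values per shard have already been copied out of the portal. The function names, the 80% warning ratio, and the choice to compare each direction separately against the tier limit are assumptions for illustration. The conversion multiplies MB/s by 8 to get Mbit/s, the unit assumed here for the tier bandwidth figure.

```python
def mb_per_s_to_mbps(mb_per_s):
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


def shards_near_limit(peaks_mb_per_s, tier_limit_mbps, warn_ratio=0.8):
    """Return [(shard, busiest direction in Mbit/s)] for shards near the limit.

    peaks_mb_per_s maps a shard name to (peak Cache Read, peak Cache Write),
    both in MB/s. A shard is flagged when either direction reaches
    warn_ratio * tier_limit_mbps.
    """
    threshold = tier_limit_mbps * warn_ratio
    flagged = []
    for shard, (read_mb, write_mb) in sorted(peaks_mb_per_s.items()):
        busiest = max(mb_per_s_to_mbps(read_mb), mb_per_s_to_mbps(write_mb))
        if busiest >= threshold:
            flagged.append((shard, busiest))
    return flagged
```

With made-up peaks of `{"shard0": (10.0, 2.0), "shard1": (110.0, 5.0)}` and a made-up limit of 1000 Mbit/s, only `shard1` is flagged, at 880 Mbit/s. The real limit for a given tier should be taken from the Azure Cache for Redis performance page.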

Any other cause of Redis network connectivity issues should be investigated from the client side, using a network trace to locate the source of the problem.
A network packet trace captured on the client side can help the network team identify it.

If you need to troubleshoot network connectivity specifically, see Troubleshooting Azure Redis Connectivity Issues (Tech Community article).

Conclusion:

Redis timeouts on connections to Azure Cache for Redis can have many different causes, on the client side, on the server side, or on the network.

For Client or Server side Redis timeout causes, please see these Tech Community articles:

Azure Redis Timeouts – Client Side Issues
Azure Redis Timeouts – Server Side Issues

Related documentation:

Timeout issues
Redis Client handling – TCP keepalive

Best practices for Azure Cache for Redis (general and client library specific)

StackExchange.Redis best practices

Available metrics and reporting intervals

Azure Cache for Redis performance

Troubleshooting Azure Redis Connectivity Issues (Tech Community article)

I hope this is useful!
