Investigation of a Cross-regional Network Performance Issue | by Netflix Technology Blog

By Team Entertainer | August 5, 2024 (Updated: August 6, 2024) | 13 Mins Read
Netflix Technology Blog
10 min read


Apr 24, 2024

Hechao Li, Roger Cruz

Netflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming and gaming services. Utilizing Amazon AWS, our infrastructure is hosted across multiple geographic regions worldwide. This global distribution allows our applications to deliver content more effectively by serving traffic closer to our customers. Like any distributed system, our applications occasionally require data synchronization between regions to maintain seamless service delivery.

The following diagram shows a simplified cloud network topology for cross-region traffic.

Our Cloud Network Engineering on-call team received a request to address a network issue affecting an application with cross-region traffic. Initially, it appeared that the application was experiencing timeouts, likely due to suboptimal network performance. As we all know, the longer the network path, the more devices the packets traverse, increasing the likelihood of issues. For this incident, the client application is located in an internal subnet in the US region while the server application is located in an external subnet in a European region. Therefore, it is natural to blame the network, since packets have to travel long distances through the internet.

As network engineers, our initial reaction when the network is blamed is usually, "No, it can't be the network," and our task is to prove it. Given that there were no recent changes to the network infrastructure and no reported AWS issues impacting other applications, the on-call engineer suspected a noisy neighbor issue and sought assistance from the Host Network Engineering team.

In this context, a noisy neighbor issue occurs when a container shares a host with other network-intensive containers. These noisy neighbors consume excessive network resources, causing other containers on the same host to suffer degraded network performance. Despite each container having bandwidth limits, oversubscription can still lead to such issues.

Upon investigating other containers on the same host (most of which were part of the same application), we quickly eliminated the possibility of noisy neighbors. The network throughput for both the problematic container and all others was significantly below the set bandwidth limits. We attempted to resolve the issue by removing these bandwidth limits, allowing the application to use as much bandwidth as necessary. However, the problem persisted.

We observed some TCP packets in the network marked with the RST flag, a flag indicating that a connection should be immediately terminated. Although the frequency of these packets was not alarmingly high, the presence of any RST packets still raised suspicion about the network. To determine whether this was indeed a network-induced issue, we performed a tcpdump on the client. In the packet capture file, we observed one TCP stream that was closed after exactly 30 seconds.

SYN at 18:47:06

After the 3-way handshake (SYN, SYN-ACK, ACK), the traffic started flowing normally. Nothing unusual until FIN at 18:47:36 (30 seconds later)

The packet capture results clearly indicated that it was the client application that initiated the connection termination by sending a FIN packet. Following this, the server continued to send data; however, since the client had already decided to close the connection, it responded with RST packets to all subsequent data from the server.

To ensure that the client wasn't closing the connection due to packet loss, we also performed a packet capture on the server side to verify that all packets sent by the server were received. This task was complicated by the fact that the packets passed through a NAT gateway (NGW), which meant that on the server side, the client's IP and port appeared as those of the NGW, differing from those seen on the client side. Consequently, to accurately match TCP streams, we needed to identify the TCP stream on the client side, locate the raw TCP sequence number, and then use this number as a filter on the server side to find the corresponding TCP stream.

With packet capture results from both the client and server sides, we confirmed that all packets sent by the server were correctly received before the client sent a FIN.

Now, from the network point of view, the story is clear. The client initiated the connection requesting data from the server. The server kept sending data to the client with no problem. However, at a certain point, despite the server still having data to send, the client chose to terminate the reception of data. This led us to suspect that the issue might be related to the client application itself.

In order to fully understand the problem, we now need to understand how the application works. As shown in the diagram below, the application runs in the us-east-1 region. It reads data from cross-region servers and writes the data to clients within the same region. The client runs as containers, whereas the servers are EC2 instances.

Notably, the cross-region read was problematic while the write path was smooth. Most importantly, there is a 30-second application-level timeout for reading the data. The application (client) errors out if it fails to read an initial batch of data from the servers within 30 seconds. When we increased this timeout to 60 seconds, everything worked as expected. This explains why the client initiated a FIN: it lost patience waiting for the server to transfer the data.

Could it be that the server was updated to send data more slowly? Could it be that the client application was updated to receive data more slowly? Could it be that the data volume became too large to be completely sent out within 30 seconds? Unfortunately, we received negative answers to all 3 questions from the application owner. The server had been operating without changes for over a year, there were no significant updates in the latest rollout of the client, and the data volume had remained consistent.

If neither the network nor the application had changed recently, then what changed? In fact, we discovered that the issue coincided with a recent Linux kernel upgrade from version 6.5.13 to 6.6.10. To test this hypothesis, we rolled back the kernel upgrade, and it did restore normal operation to the application.

Honestly speaking, at that time I didn't believe it was a kernel bug, because I assumed the TCP implementation in the kernel should be solid and stable (Spoiler alert: How wrong was I!). But we were also out of ideas from other angles.

There were about 14k commits between the good and bad kernel versions. Engineers on the team methodically and diligently bisected between the two versions. When the bisecting narrowed down to a few commits, a change with "tcp" in its commit message caught our attention. The final bisecting confirmed that this commit was our culprit.

Interestingly, while reviewing the email history related to this commit, we found that another user had reported a Python test failure following the same kernel upgrade. Although their solution was not directly applicable to our situation, it suggested that a simpler test might also reproduce our problem. Using strace, we observed that the application configured the following socket options when communicating with the server:

[pid 1699] setsockopt(917, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
[pid 1699] setsockopt(917, SOL_TCP, TCP_NODELAY, [1], 4) = 0
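
The same set of options can be reproduced from a higher-level language. The sketch below (in Python, purely illustrative, not the actual application code) sets the options seen in the strace output and shows that Linux stores double the requested SO_RCVBUF value:

```python
import socket

# Apply the same socket options the application was observed setting via
# strace (assumes an IPv6-capable Linux host, as implied by IPV6_V6ONLY).
s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 131072)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# On Linux, getsockopt(SO_RCVBUF) returns the doubled value that the
# kernel actually stored (see the man 7 socket excerpt discussed later).
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```

On a Linux host, the final print shows 131072, i.e. twice the requested 65536.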

We then developed a minimal client-server C application that transfers a file from the server to the client, with the client configuring the same set of socket options. During testing, we used a 10M file, which represents the volume of data typically transferred within 30 seconds before the client issues a FIN. On the old kernel, this cross-region transfer completed in 22 seconds, whereas on the new kernel, it took 39 seconds to finish.

With the help of this minimal reproduction setup, we were eventually able to pinpoint the root cause of the problem. In order to understand the root cause, it's necessary to have a grasp of the TCP receive window.

TCP Receive Window

Simply put, the TCP receive window is how the receiver tells the sender "This is how many bytes you can send me without me ACKing any of them". Assuming the sender is the server and the receiver is the client, then we have:

The Window Size

Now that we know the TCP receive window size may affect the throughput, the question is, how is the window size calculated? As an application writer, you can't decide the window size; however, you can decide how much memory you want to use for buffering received data. This is configured using the SO_RCVBUF socket option we saw in the strace result above. However, note that the value of this option means how much application data can be queued in the receive buffer. In man 7 socket, there is

SO_RCVBUF

Sets or gets the maximum socket receive buffer in bytes.
The kernel doubles this value (to allow space for
bookkeeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2). The
default value is set by the
/proc/sys/net/core/rmem_default file, and the maximum
allowed value is set by the /proc/sys/net/core/rmem_max
file. The minimum (doubled) value for this option is 256.

This means, when the user gives a value X, the kernel stores 2X in the variable sk->sk_rcvbuf. In other words, the kernel assumes that the bookkeeping overhead is as much as the actual data (i.e. 50% of the sk_rcvbuf).

sysctl_tcp_adv_win_scale

However, the assumption above may not hold, because the actual overhead really depends on various factors such as the Maximum Transmission Unit (MTU). Therefore, the kernel provided this sysctl_tcp_adv_win_scale which you can use to tell the kernel what the actual overhead is. (I believe 99% of people also don't know how to set this parameter correctly, and I'm definitely one of them. You're the kernel; if you don't know the overhead, how can you expect me to know?)

According to the sysctl doc,

tcp_adv_win_scale - INTEGER

Obsolete since linux-6.6. Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is <= 0.

Possible values are [-31, 31], inclusive.

Default: 1

For 99% of people, we're just using the default value 1, which in turn means the overhead is calculated as rcvbuf/2^tcp_adv_win_scale = 1/2 * rcvbuf. This matches the assumption made when setting the SO_RCVBUF value.

Let's recap. Assume you set SO_RCVBUF to 65536, which is the value set by the application as shown in the setsockopt syscall. Then we have:

  • SO_RCVBUF = 65536
  • rcvbuf = 2 * 65536 = 131072
  • overhead = rcvbuf / 2 = 131072 / 2 = 65536
  • receive window size = rcvbuf - overhead = 131072 - 65536 = 65536

(Note, this calculation is simplified. The real calculation is more complex.)

In short, the receive window size before the kernel upgrade was 65536. With this window size, the application was able to transfer 10M of data within 30 seconds.
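
The recap above can be expressed as a small helper (a simplified sketch of the pre-6.6 heuristic, handling only the default positive tcp_adv_win_scale):

```python
def old_receive_window(so_rcvbuf: int, tcp_adv_win_scale: int = 1) -> int:
    """Simplified pre-6.6 window estimate: the kernel doubles SO_RCVBUF,
    then subtracts overhead = rcvbuf / 2**tcp_adv_win_scale."""
    rcvbuf = 2 * so_rcvbuf                  # kernel stores 2X in sk_rcvbuf
    overhead = rcvbuf >> tcp_adv_win_scale  # default scale 1 -> half of rcvbuf
    return rcvbuf - overhead

print(old_receive_window(65536))  # 65536
```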

The Change

This commit obsoleted sysctl_tcp_adv_win_scale and introduced a scaling_ratio that can more accurately calculate the overhead or window size, which is the right thing to do. With the change, the window size is now rcvbuf * scaling_ratio.

So how is scaling_ratio calculated? It is calculated using skb->len/skb->truesize, where skb->len is the length of the TCP data in an skb and truesize is the total size of the skb. This is certainly a more accurate ratio based on real data rather than a hardcoded 50%. Now, here is the next question: during the TCP handshake, before any data is transferred, how do we decide the initial scaling_ratio? The answer is, a magic and conservative ratio was chosen, with the value being roughly 0.25.

Now we have:

  • SO_RCVBUF = 65536
  • rcvbuf = 2 * 65536 = 131072
  • receive window size = rcvbuf * 0.25 = 131072 * 0.25 = 32768

In short, the receive window size halved after the kernel upgrade. Hence the throughput was cut in half, causing the data transfer time to double.
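
Comparing the two heuristics side by side makes the regression visible (again a simplified sketch, using the conservative initial ratio of 0.25 described above):

```python
def new_initial_window(so_rcvbuf: int, scaling_ratio: float = 0.25) -> int:
    """Simplified post-6.6 initial window: rcvbuf * initial scaling_ratio."""
    rcvbuf = 2 * so_rcvbuf
    return int(rcvbuf * scaling_ratio)

old = 131072 - 131072 // 2        # pre-6.6 estimate for SO_RCVBUF=65536
new = new_initial_window(65536)   # 131072 * 0.25
print(old, new, new / old)        # 65536 32768 0.5
```

With a window half the size, the sender can have only half as many unacknowledged bytes in flight, which over a high-latency cross-region path roughly doubles the transfer time.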

Naturally, you may ask: I understand that the initial window size is small, but why doesn't the window grow when we have a more accurate ratio of the payload later (i.e. skb->len/skb->truesize)? With some debugging, we eventually found that the scaling_ratio does get updated to a more accurate skb->len/skb->truesize, which in our case is around 0.66. However, another variable, window_clamp, is not updated accordingly. window_clamp is the maximum receive window allowed to be advertised, which is also initialized to 0.25 * rcvbuf using the initial scaling_ratio. As a result, the receive window size is capped at this value and can't grow bigger.

In theory, the fix is to update window_clamp along with scaling_ratio. However, in order to have a simple fix that doesn't introduce other unexpected behaviors, our final fix was to increase the initial scaling_ratio from 25% to 50%. This makes the receive window size backward compatible with the original default sysctl_tcp_adv_win_scale.

Meanwhile, note that the problem was caused not only by the changed kernel behavior but also by the fact that the application sets SO_RCVBUF and has a 30-second application-level timeout. In fact, the application is Kafka Connect, and both settings are the default configurations (receive.buffer.bytes=64k and request.timeout.ms=30s). We also created a Kafka ticket to change receive.buffer.bytes to -1 to allow Linux to auto-tune the receive window.
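
On the application side, the mitigation can be sketched as an override in the Kafka Connect worker configuration (property names are the standard Kafka client settings mentioned above; treat this as an illustrative fragment, not the exact change that was deployed):

```properties
# Let the Linux kernel auto-tune the TCP receive buffer instead of
# pinning SO_RCVBUF to the 64 KiB Kafka default.
receive.buffer.bytes=-1

# Alternatively, relaxing the 30s read timeout also worked in our case:
# request.timeout.ms=60000
```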

This was a very interesting debugging exercise that covered many layers of Netflix's stack and infrastructure. While it technically wasn't the "network" to blame, this time it turned out the culprit was the software components that make up the network (i.e. the TCP implementation in the kernel).

If tackling such technical challenges excites you, consider joining our Cloud Infrastructure Engineering teams. Explore opportunities by visiting Netflix Jobs and searching for Cloud Engineering positions.

Special thanks to our stunning colleagues Alok Tiagi, Artem Tkachuk, Ethan Adams, Jorge Rodriguez, Nick Mahilani, Tycho Andersen and Vinay Rayini for investigating and mitigating this issue. We would also like to thank Linux kernel network expert Eric Dumazet for reviewing and applying the patch.


