22 hours ago
By: Hechao Li and Marcelo Mayworm
With special thanks to our stunning colleagues Amer Ather, Itay Dafna, Luca Pozzi, Matheus Leão, and Ye Ji.
At Netflix, the Analytics and Developer Experience organization, part of the Data Platform, offers a product called Workbench. Workbench is a remote development workspace based on Titus that allows data practitioners to work with big data and machine learning use cases at scale. A common use case for Workbench is running JupyterLab Notebooks.
Recently, several users reported that their JupyterLab UI becomes slow and unresponsive when running certain notebooks. This document details the intriguing process of debugging this issue, all the way from the UI down to the Linux kernel.
Machine Learning engineer Luca Pozzi reported to our Data Platform team that the JupyterLab UI on their Workbench becomes slow and unresponsive when running some of their Notebooks. Restarting the ipykernel process, which runs the Notebook, might temporarily alleviate the problem, but the frustration persists as more notebooks are run.
While we observed the issue firsthand, the term "UI being slow" is subjective and difficult to measure. To investigate this issue, we needed a quantitative analysis of the slowness.
Itay Dafna devised an effective and simple method to quantify the UI slowness. Specifically, we opened a terminal via JupyterLab and held down a key (e.g., "j") for 15 seconds while running the user's notebook. The input to stdin is sent to the backend (i.e., JupyterLab) via a WebSocket, and the output to stdout is sent back from the backend and displayed on the UI. We then exported the .har file recording all communications from the browser and loaded it into a Notebook for analysis.
Using this method, we observed latencies ranging from 1 to 10 seconds, averaging 7.4 seconds.
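The post doesn't show the analysis code itself, but the measurement can be sketched roughly as follows. This is a minimal illustration, assuming Chrome's HAR export, which records WebSocket frames under the non-standard `_webSocketMessages` entry field; the function name and the naive send/receive pairing are assumptions, not the team's actual tooling:

```python
import json

def websocket_latencies(har_path):
    """Pair each keypress frame sent over the terminal WebSocket with the
    next echo frame received, and return the gap (seconds) for each pair.

    Assumes Chrome's HAR export, where WebSocket frames appear under the
    non-standard `_webSocketMessages` field of an entry."""
    with open(har_path) as f:
        har = json.load(f)
    latencies = []
    for entry in har["log"]["entries"]:
        pending = []  # timestamps of sent frames still awaiting an echo
        for frame in entry.get("_webSocketMessages", []):
            if frame["type"] == "send":
                pending.append(frame["time"])
            elif frame["type"] == "receive" and pending:
                latencies.append(frame["time"] - pending.pop(0))
    return latencies
```

Averaging the returned list then gives a single objective number for "how slow the UI feels".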
Now that we have an objective metric for the slowness, let's officially start our investigation. If you have read the symptom carefully, you should have noticed that the slowness only occurs when the user runs certain notebooks but not others.
Therefore, the first step is scrutinizing the specific Notebook experiencing the issue. Why does the UI always slow down after running this particular Notebook? Naturally, you would think that there must be something wrong with the code running in it.
Upon closely examining the user's Notebook, we noticed that a library called pystan, which provides Python bindings to a native C++ library called stan, looked suspicious. Specifically, pystan uses asyncio. However, because there is already an existing asyncio event loop running in the Notebook process and asyncio cannot be nested by design, in order for pystan to work, the authors of pystan suggest injecting pystan into the existing event loop by using a package called nest_asyncio, a library that became unmaintained because the author sadly passed away.
Given this seemingly hacky usage, we naturally suspected that the events injected by pystan into the event loop were blocking the handling of the WebSocket messages used to communicate with the JupyterLab UI. This reasoning sounds very plausible. However, the user claimed that there were cases when a Notebook not using pystan was run and the UI also became slow.
Moreover, after several rounds of discussion with ChatGPT, we learned more about the architecture and realized that, in theory, the usage of pystan and nest_asyncio should not cause the slowness in handling the UI WebSocket, for the following reasons:
Even though pystan uses nest_asyncio to inject itself into the main event loop, the Notebook runs in a child process (i.e., the ipykernel process) of the jupyter-lab server process, which means the main event loop being injected into by pystan is that of the ipykernel process, not the jupyter-server process. Therefore, even if pystan blocks its event loop, it should not impact the jupyter-lab main event loop that is used for UI WebSocket communication. See the diagram below:
In other words, pystan events are injected into event loop B in this diagram instead of event loop A. So, it should not block the UI WebSocket events.
You might also think that because event loop A handles both the WebSocket events from the UI and the ZeroMQ socket events from the ipykernel process, a high volume of ZeroMQ events generated by the notebook could block the WebSocket. However, when we captured packets on the ZeroMQ socket while reproducing the issue, we didn't observe heavy traffic on this socket that could cause such blocking.
A stronger piece of evidence to rule out pystan was that we were ultimately able to reproduce the issue even without it, which I will dive into later.
The Workbench instance runs as a Titus container. To efficiently utilize our compute resources, Titus employs a CPU oversubscription feature, meaning the combined virtual CPUs allocated to containers exceed the number of available physical CPUs on a Titus agent. If a container is unlucky enough to be scheduled alongside other "noisy" containers, i.e., those that consume a lot of CPU resources, it could suffer from CPU deficiency.
However, after examining the CPU usage of neighboring containers on the same Titus agent as the Workbench instance, as well as the overall CPU usage of the Titus agent, we quickly ruled out this hypothesis. Using the top command on the Workbench, we observed that when running the Notebook, the Workbench instance uses only 4 out of the 64 CPUs allocated to it. Simply put, this workload is not CPU-bound.
The next theory was that the network between the web browser UI (on the laptop) and the JupyterLab server was slow. To investigate, we captured all the packets between the laptop and the server while running the Notebook and repeatedly pressing 'j' in the terminal.
When the UI experienced delays, we observed a 5-second pause in packet transmission from server port 8888 to the laptop. Meanwhile, traffic from other ports, such as port 22 for SSH, remained unaffected. This led us to conclude that the pause was caused by the application running on port 8888 (i.e., the JupyterLab process) rather than the network.
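As a sketch of the kind of analysis involved (an illustration, not the team's actual tooling), given per-packet timestamps and source ports exported from a capture (e.g., with tshark), finding such pauses is straightforward:

```python
def find_pauses(packets, src_port, threshold=1.0):
    """Return (start_time, gap) for every gap longer than `threshold`
    seconds between consecutive packets sent from `src_port`.

    `packets` is an iterable of (timestamp, src_port) tuples, e.g.
    exported from a packet capture."""
    times = sorted(t for t, port in packets if port == src_port)
    return [
        (prev, cur - prev)
        for prev, cur in zip(times, times[1:])
        if cur - prev > threshold
    ]
```

With the capture described above, port 8888 shows a single ~5-second gap while port 22 shows none, which is what points the finger at the application rather than the network.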
As previously mentioned, another strong piece of evidence proving the innocence of pystan was that we could reproduce the issue without it. By gradually stripping down the "bad" Notebook, we eventually arrived at a minimal snippet of code that reproduces the issue without any third-party dependencies or complex logic:
import time
import os
from multiprocessing import Process

N = os.cpu_count()

def launch_worker(worker_id):
    time.sleep(60)

if __name__ == '__main__':
    with open('/root/2GB_file', 'r') as file:
        data = file.read()
    processes = []
    for i in range(N):
        p = Process(target=launch_worker, args=(i,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
The code does only two things:
- Read a 2GB file into memory (the Workbench instance has 480GB of memory in total, so this memory usage is almost negligible).
- Start N processes, where N is the number of CPUs. The N processes do nothing but sleep.
There is no doubt that this is the silliest piece of code I've ever written. It is neither CPU bound nor memory bound. Yet it can cause the JupyterLab UI to stall for as many as 10 seconds!
There are a couple of interesting observations that raise several questions:
- We noticed that both steps are required in order to reproduce the issue. If you don't read the 2GB file (which isn't even used!), the issue is not reproducible. Why can using 2GB out of 480GB of memory impact the performance?
- When the UI delay occurs, the jupyter-lab process's CPU usage spikes to 100%, hinting at contention on the single-threaded event loop in this process (event loop A in the diagram before). What does the jupyter-lab process need the CPU for, given that it is not the process that runs the Notebook?
- The code runs in a Notebook, which means it runs in the ipykernel process, which is a child process of the jupyter-lab process. How can anything that happens in a child process cause the parent process to have CPU contention?
- The workbench has 64 CPUs. But when we printed os.cpu_count(), the output was 96. That means the code starts more processes than the number of CPUs. Why is that?
Let's answer the last question first. In fact, if you run the lscpu and nproc commands inside a Titus container, you will also see different results: the former gives you 96, which is the number of physical CPUs on the Titus agent, whereas the latter gives you 64, which is the number of virtual CPUs allocated to the container. This discrepancy is due to the lack of a "CPU namespace" in the Linux kernel, causing the number of physical CPUs to be leaked to the container when certain functions are called to get the CPU count. The assumption here is that Python's os.cpu_count() uses the same underlying function as the lscpu command, causing it to get the CPU count of the host instead of that of the container. Python 3.13 has a new call that can be used to get the accurate CPU count, but it's not GA'ed yet.
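As a workaround sketch (ours, not from the post): on Linux, the scheduler affinity mask usually reflects the container's cpuset, so it gives a more container-accurate count than os.cpu_count(), and Python 3.13's os.process_cpu_count() formalizes exactly this:

```python
import os

def usable_cpu_count():
    """Best-effort count of CPUs this process can actually run on.

    Prefers Python 3.13's process_cpu_count(), then the Linux scheduler
    affinity mask (which reflects cgroup cpusets), and only falls back
    to the host-wide os.cpu_count()."""
    if hasattr(os, "process_cpu_count"):  # Python 3.13+
        return os.process_cpu_count()
    if hasattr(os, "sched_getaffinity"):  # Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count()
```

Note that affinity-based counts still won't see CFS quota limits, only cpuset restrictions, so this is an improvement rather than a complete fix.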
It will be confirmed later that this inaccurate number of CPUs can be a contributing factor to the slowness.
Next, we used py-spy to profile the jupyter-lab process. Note that we profiled the parent jupyter-lab process, not the ipykernel child process that runs the reproduction code. The profiling result is as follows:
As one can see, a lot of CPU time (89%!!) is spent on a function called __parse_smaps_rollup. In comparison, the terminal handler used only 0.47% of the CPU time. From the stack trace, we see that this function runs inside event loop A, so it can definitely cause the UI WebSocket events to be delayed.
The stack trace also shows that this function is ultimately called by a function used by a JupyterLab extension called jupyter_resource_usage. We then disabled this extension and restarted the jupyter-lab process. As you may have guessed, we could no longer reproduce the slowness!
But our puzzle is not solved yet. Why does this extension cause the UI to slow down? Let's keep digging.
From the name of the extension and the names of the other functions it calls, we can infer that this extension is used to get resource usage information such as CPU and memory usage. Examining the code, we see that this function call stack is triggered when the API endpoint /metrics/v1 is called from the UI. The UI apparently calls this endpoint periodically, according to the network traffic tab in Chrome's Developer Tools.
Now let's look at the implementation, starting from the call get (jupyter_resource_usage/api.py:42). The full code is here and the key lines are shown below:
cur_process = psutil.Process()
all_processes = [cur_process] + cur_process.children(recursive=True)
for p in all_processes:
    info = p.memory_full_info()
Basically, it gets all child processes of the jupyter-lab process recursively, including both the ipykernel Notebook process and all processes created by the Notebook. Obviously, the cost of this function is linear in the number of all child processes. In the reproduction code, we create 96 processes. So here we will have at least 96 (sleep processes) + 1 (ipykernel process) + 1 (jupyter-lab process) = 98 processes, when it should actually be 64 (allocated CPUs) + 1 (ipykernel process) + 1 (jupyter-lab process) = 66 processes, because the number of CPUs allocated to the container is, in fact, 64.
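To see how that process count builds up, here is a stdlib-only sketch of what psutil's children(recursive=True) does under the hood by walking /proc (Linux-only, and an illustration rather than the extension's actual code):

```python
import os

def descendant_pids(root_pid):
    """Return root_pid plus the pids of all of its descendants,
    mirroring psutil's Process.children(recursive=True) on Linux."""
    children = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/{}/stat".format(pid)) as f:
                # /proc/<pid>/stat is "pid (comm) state ppid ...";
                # comm may contain spaces, so split after the last ")".
                ppid = int(f.read().rsplit(")", 1)[1].split()[1])
        except (FileNotFoundError, ProcessLookupError):
            continue  # process exited while we were scanning
        children.setdefault(ppid, []).append(int(pid))
    result, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        result.append(pid)
        stack.extend(children.get(pid, []))
    return result
```

Every pid this returns costs the extension one /proc/<pid>/smaps_rollup read per poll, which is why the 96 extra sleepers matter.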
This is truly ironic. The more CPUs we have, the slower we are!
At this point, we have answered one question: Why does starting many grandchild processes in the child process cause the parent process to be slow? Because the parent process runs a function whose cost is linear in the number of all child processes, counted recursively.
However, this solves only half of the puzzle. If you remember the previous analysis, starting many child processes ALONE doesn't reproduce the issue. If we don't read the 2GB file, even if we create 2x more processes, we can't reproduce the slowness.
So now we must answer the next question: Why does reading a 2GB file in the child process affect the parent process's performance, especially when the workbench has as much as 480GB of memory in total?
To answer this question, let's look closely at the function __parse_smaps_rollup. As the name implies, this function parses the file /proc/<pid>/smaps_rollup.
def _parse_smaps_rollup(self):
    uss = pss = swap = 0
    with open_binary("{}/{}/smaps_rollup".format(self._procfs_path, self.pid)) as f:
        for line in f:
            if line.startswith(b"Private_"):
                # Private_Clean, Private_Dirty, Private_Hugetlb
                uss += int(line.split()[1]) * 1024
            elif line.startswith(b"Pss:"):
                pss = int(line.split()[1]) * 1024
            elif line.startswith(b"Swap:"):
                swap = int(line.split()[1]) * 1024
    return (uss, pss, swap)
Naturally, you might think that when memory usage increases, this file becomes larger in size, causing the function to take longer to parse. Unfortunately, this is not the answer, because:
- First, the number of lines in this file is constant for all processes.
- Second, this is a special file in the /proc filesystem, which should be seen as a kernel interface rather than a regular file on disk. In other words, I/O operations on this file are handled by the kernel rather than the disk.
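The first point is easy to verify with a quick check (our illustration, not from the post): the file keeps the same number of lines even after the process grows its memory footprint.

```python
import os

def smaps_rollup_line_count():
    # /proc/<pid>/smaps_rollup has one header line plus a fixed set of
    # aggregated fields, regardless of how much memory the process uses.
    with open("/proc/{}/smaps_rollup".format(os.getpid())) as f:
        return sum(1 for _ in f)

before = smaps_rollup_line_count()
blob = bytearray(64 * 1024 * 1024)  # grow virtual memory by ~64MB
after = smaps_rollup_line_count()
# line count is unchanged; only the values in each line change
```

So whatever is slow here, it is not the parsing of a growing file.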
This file was introduced in this commit in 2017, with the purpose of improving the performance of user programs that determine aggregate memory statistics. Let's first focus on the handler of the open syscall on this /proc/<pid>/smaps_rollup file.
Following the single_open function, we will find that it uses the function show_smaps_rollup for the show operation, which translates to the read system call on the file. Next, we look at the show_smaps_rollup implementation. You will notice a do-while loop whose cost is linear in the size of the virtual memory areas.
static int show_smaps_rollup(struct seq_file *m, void *v)
{
    ...
    vma_start = vma->vm_start;
    do {
        smap_gather_stats(vma, &mss, 0);
        last_vma_end = vma->vm_end;
        ...
    } for_each_vma(vmi, vma);
    ...
}
This perfectly explains why the function gets slower when a 2GB file is read into memory: reading the file grows the process's virtual memory, so the handler for reading the smaps_rollup file now takes longer to run the loop. Basically, even though smaps_rollup already improved the performance of getting memory information compared to the old method of parsing the /proc/<pid>/smaps file, its cost is still linear in the virtual memory used.
Even though at this point the puzzle is solved, let's conduct a more quantitative analysis. How big is the time difference when reading the smaps_rollup file with small versus large virtual memory usage? Let's write some simple benchmark code like the below:
import os

def read_smaps_rollup(pid):
    with open("/proc/{}/smaps_rollup".format(pid), "rb") as f:
        for line in f:
            pass

if __name__ == "__main__":
    pid = os.getpid()
    read_smaps_rollup(pid)
    with open("/root/2G_file", "rb") as f:
        data = f.read()
    read_smaps_rollup(pid)
This program performs the following steps:
- Reads the smaps_rollup file of the current process.
- Reads a 2GB file into memory.
- Repeats step 1.
We then use strace to find the accurate time spent reading the smaps_rollup file.
$ sudo strace -T -e trace=openat,read python3 benchmark.py 2>&1 | grep "smaps_rollup" -A 1
openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 <0.000023>
read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 <0.000259>
...
openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 <0.000029>
read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 <0.027698>
As you can see, both times, the read syscall returned 670, meaning the file size remained the same at 670 bytes. However, the second read (0.027698 seconds) took 100x as long as the first (0.000259 seconds)! This means that if there are 98 processes, the time spent on reading this file alone could be 98 * 0.027698 = 2.7 seconds! Such a delay can significantly affect the UI experience.
This extension is used to display the CPU and memory usage of the notebook process on the bar at the bottom of the Notebook:
We confirmed with the user that disabling the jupyter-resource-usage extension meets their requirements for UI responsiveness, and that this extension is not essential to their use case. Therefore, we provided a way for them to disable the extension.
This was a challenging issue that required debugging from the UI all the way down to the Linux kernel. It is fascinating that the problem's cost is linear in both the number of CPUs and the virtual memory size, two dimensions that are usually viewed separately.
Overall, we hope you enjoyed the irony of:
- The extension used to monitor CPU usage causing CPU contention.
- An interesting case where the more CPUs you have, the slower you get!
If you are interested in tackling such technical challenges and want the opportunity to solve complex problems and drive innovation, consider joining our Data Platform teams. Help shape the future of Data Security and Infrastructure, Data Developer Experience, Analytics Infrastructure and Enablement, and more. Explore the impact you could make with us!