The sched_wakeup and sched_wakeup_new hooks are invoked when a process changes state from 'sleeping' to 'runnable.' They let us identify when a process is ready to run and is waiting for CPU time. During this event, we generate a timestamp and store it in an eBPF hash map using the process ID as the key.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u64));
} runq_enqueued SEC(".maps");

SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)
{
    struct task_struct *task = (void *)ctx[0];
    u32 pid = task->pid;
    u64 ts = bpf_ktime_get_ns();

    // record when the task became runnable, keyed by PID
    bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
    return 0;
}
Conversely, the sched_switch hook is triggered when the CPU switches between processes. This hook provides pointers to the process currently using the CPU and the process about to take over. We use the upcoming task's process ID (PID) to fetch the timestamp from the eBPF map. This timestamp represents when the process entered the queue, which we had previously stored. We then calculate the run queue latency by simply subtracting the timestamps.
SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    struct task_struct *prev = (struct task_struct *)ctx[1];
    struct task_struct *next = (struct task_struct *)ctx[2];
    u32 prev_pid = prev->pid;
    u32 next_pid = next->pid;

    // fetch timestamp of when the next task was enqueued
    u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
    if (tsp == NULL) {
        return 0; // missed enqueue
    }

    // calculate runq latency before deleting the stored timestamp
    u64 now = bpf_ktime_get_ns();
    u64 runq_lat = now - *tsp;

    // delete pid from enqueued map
    bpf_map_delete_elem(&runq_enqueued, &next_pid);
    ....
One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads, also known as tasks in kernel terminology. This feature enables access to a wealth of information stored about a process. We required the process's cgroup ID to associate it with a container for our specific use case. However, the cgroup information in the process struct is safeguarded by an RCU (Read Copy Update) lock.
To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs. There are kfuncs available to lock and unlock RCU read-side critical sections. These functions ensure that our eBPF program remains safe and efficient while retrieving the cgroup ID from the task struct.
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

u64 get_task_cgroup_id(struct task_struct *task)
{
    struct css_set *cgroups;
    u64 cgroup_id;

    // task->cgroups is RCU-protected, so read it inside a read-side critical section
    bpf_rcu_read_lock();
    cgroups = task->cgroups;
    cgroup_id = cgroups->dfl_cgrp->kn->id;
    bpf_rcu_read_unlock();
    return cgroup_id;
}
Once the data is ready, we must package it and send it to userspace. For this purpose, we chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without necessitating extra memory copying or syscalls. However, the sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data.
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, RINGBUF_SIZE_BYTES);
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u64));
    __uint(value_size, sizeof(u64));
} cgroup_id_to_last_event_ts SEC(".maps");

struct runq_event {
    u64 prev_cgroup_id;
    u64 cgroup_id;
    u64 runq_lat;
    u64 ts;
};

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    // ....
    // The previous code
    // ....

    u64 prev_cgroup_id = get_task_cgroup_id(prev);
    u64 cgroup_id = get_task_cgroup_id(next);

    // per-cgroup-id-per-CPU rate-limiting
    // to balance observability with performance overhead
    u64 *last_ts =
        bpf_map_lookup_elem(&cgroup_id_to_last_event_ts, &cgroup_id);
    u64 last_ts_val = last_ts == NULL ? 0 : *last_ts;

    // check the rate limit for the cgroup_id in consideration
    // before doing more work
    if (now - last_ts_val < RATE_LIMIT_NS) {
        // Rate limit exceeded, drop the event
        return 0;
    }

    struct runq_event *event;
    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);

    if (event) {
        event->prev_cgroup_id = prev_cgroup_id;
        event->cgroup_id = cgroup_id;
        event->runq_lat = runq_lat;
        event->ts = now;
        bpf_ringbuf_submit(event, 0);

        // Update the last event timestamp for the current cgroup_id
        bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id,
                            &now, BPF_ANY);
    }

    return 0;
}
Our userspace application, developed in Go, processes events from the ring buffer to emit metrics to our metrics backend, Atlas. Each event includes a run queue latency sample with a cgroup ID, which we associate with containers running on the host. We categorize it as a system service if no such association is found. When a cgroup ID is associated with a container, we emit a percentile timer Atlas metric (runq.latency) for that container. We also increment a counter metric (sched.switch.out) to monitor preemptions occurring for the container's processes. Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it's due to a process within the same container (or cgroup), a process in another container, or a system service.
It's important to highlight that both the runq.latency metric and the sched.switch.out metric are needed to determine if a container is affected by noisy neighbors, which is the goal we aim to achieve. Relying solely on the runq.latency metric can lead to misconceptions. For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU quota. However, simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue.
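One way to make this reasoning concrete is a small check that flags a container only when both signals spike together. The sketch below is purely illustrative: WindowStats, likelyNoisyNeighbor, and the two multiplier parameters are hypothetical, and the thresholds are assumptions rather than values we use.

```go
package main

// WindowStats summarizes one container over one reporting window.
// The type and its fields are hypothetical, chosen for this example.
type WindowStats struct {
	RunqLatencyP99Us float64 // 99th percentile of runq.latency, in microseconds
	SwitchOutSame    uint64  // sched.switch.out caused by the same container
	SwitchOutOther   uint64  // sched.switch.out caused by a different container
	SwitchOutSystem  uint64  // sched.switch.out caused by a system service
}

// likelyNoisyNeighbor reports whether cur shows a spike in BOTH run queue
// latency and externally caused preemptions, relative to a baseline window.
// latFactor and preemptFactor are assumed multipliers over the baseline.
func likelyNoisyNeighbor(cur, base WindowStats, latFactor, preemptFactor float64) bool {
	latSpike := cur.RunqLatencyP99Us > base.RunqLatencyP99Us*latFactor

	// Only preemptions caused by other containers or system services count
	// as evidence of a neighbor; same-container switches are excluded.
	curExt := float64(cur.SwitchOutOther + cur.SwitchOutSystem)
	baseExt := float64(base.SwitchOutOther + base.SwitchOutSystem)
	preemptSpike := curExt > 0 && curExt > baseExt*preemptFactor

	// A latency spike alone may just mean the container hit its CPU quota.
	return latSpike && preemptSpike
}
```

With this shape, a throttled container whose latency rises while its external preemption counts stay flat is not flagged, which matches the CPU quota case described above.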
Below is the runq.latency metric for a server running a single container with ample CPU capacity. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters.
