Optimizing Recommendation Systems with JDK’s Vector API | by Netflix Technology Blog | Mar, 2026

Netflix Technology Blog

By Harshad Sane

Ranker is one of the largest and most complex services at Netflix. Among many things, it powers the personalized rows you see on the Netflix homepage, and it runs at an enormous scale. When we looked at CPU profiles for this service, one feature kept standing out: video serendipity scoring, the logic that answers a simple question:

"How different is this new title from what you've been watching so far?"

This single feature was consuming about 7.5% of total CPU on every node running the service. What began as a simple idea ("just batch the video scoring feature") turned into a deeper optimization journey. Along the way we introduced batching, re-architected the memory layout, and tried various libraries to handle the scoring kernels.

Read on to learn how we achieved the same serendipity scores at a meaningfully lower CPU cost per request, resulting in a reduced cluster footprint.

Problem: The Hotspot in Ranker

At a high level, serendipity scoring works like this: a candidate title and every item in a member's viewing history are represented as embeddings in a vector space. For each candidate, we compute its similarity against the history embeddings, find the maximum similarity, and convert that into a "novelty" score. That score becomes an input feature to the downstream recommendation logic.

The original implementation was straightforward but expensive: for each candidate we fetch its embedding, loop over the history to compute cosine similarity one pair at a time, and track the maximum similarity score. Although it is easy to reason about, at Ranker's scale this results in significant sequential work, repeated embedding lookups, scattered memory access, and poor cache locality. Profiling confirmed this.

Figure: Flamegraph showing inefficient scoring

A flamegraph made it clear: one of the top hotspots in the service was Java dot products inside the serendipity encoder. Algorithmically, the hotspot was a nested loop over M candidates × N history items, where each pair generates its own cosine similarity, i.e. O(M×N) separate dot product operations.

Solution

The Original Implementation: Single-video cosine loop

In simplified form, the code looked like this:

for (Video candidate : candidates) {
    Vector c = embedding(candidate); // D-dimensional embedding of the candidate
    double maxSim = -1.0;

    for (Video h : history) {
        // one cosine similarity per (candidate, history) pair; track the maximum
        maxSim = Math.max(maxSim, cosine(c, embedding(h)));
    }

    double serendipity = 1.0 - maxSim;
    emitFeature(candidate, serendipity);
}

The nested for loop with O(M×N) separate dot products brought its own overheads. One interesting detail we discovered by instrumenting traffic shapes: most requests (about 98%) were single-video, but the remaining 2% were large batch requests. Because those batches were so large, the total volume of videos processed ended up roughly 50:50 between single and batch requests. This made batching worth pursuing even though it wouldn't help the median request.

Step 1: Batching, from Nested Loops to Matrix Multiply

The first idea was to stop thinking in terms of "many small dot products" and instead treat the work as a matrix operation: for batch requests, lay out the data so the math happens in a single operation, a matrix multiply. If D is the embedding dimension:

  1. Pack all candidate embeddings into a matrix A of shape M x D.
  2. Pack all history embeddings into a matrix B of shape N x D.
  3. Normalize all rows to unit length.
  4. Compute the cosine similarities as [ C = A x B^T ], where C is an M x N matrix of cosine similarities (made explicit just below).
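Making steps 3 and 4 explicit: for unit-normalized rows, the dot product is exactly the cosine similarity, so one matrix multiply yields every pairwise score:

\[ \cos(a_i, b_j) = \frac{a_i \cdot b_j}{\lVert a_i \rVert \, \lVert b_j \rVert} \quad\Rightarrow\quad C_{ij} = a_i \cdot b_j \ \text{once rows } a_i \text{ of } A \text{ and } b_j \text{ of } B \text{ are unit length.} \]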

In pseudo‑code:

// Build matrices
double[][] A = new double[M][D]; // candidates
double[][] B = new double[N][D]; // history

for (int i = 0; i < M; i++) {
    A[i] = embedding(candidates[i]).toArray();
}
for (int j = 0; j < N; j++) {
    B[j] = embedding(history[j]).toArray();
}

// Normalize rows to unit vectors
normalizeRows(A);
normalizeRows(B);

// Compute C = A * B^T
// C[i][j] = cosine(candidates[i], history[j])
double[][] C = matmul(A, B);

// Derive serendipity
for (int i = 0; i < M; i++) {
    double maxSim = max(C[i][0..N-1]);   // pseudo-code: maximum over row i
    double serendipity = 1.0 - maxSim;
    emitFeature(candidates[i], serendipity);
}

This turns M×N separate dot products into a single matrix multiply, which is exactly what CPUs and optimized kernels are built for. We integrated this into the existing framework by supporting both encode() for single videos and batchEncode() for batches, while maintaining backward compatibility (an illustrative shape of that dual API is sketched below). At this point it seemed like we were "done", but we weren't.
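For illustration only, the two entry points could look roughly like this; the SerendipityEncoder name and signatures are assumptions, not the actual Ranker interface:

/** Illustrative only: the real encoder operates on Ranker's internal types. */
interface SerendipityEncoder {

    /** Existing single-video path, kept for backward compatibility. */
    double encode(Video candidate, List<Video> history);

    /** New batched path: scores every candidate against the history in one pass. */
    default double[] batchEncode(List<Video> candidates, List<Video> history) {
        double[] scores = new double[candidates.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = encode(candidates.get(i), history); // naive default; a real implementation batches
        }
        return scores;
    }
}

A default method like this keeps existing single-video callers untouched while letting batch callers opt into the new path.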

Step 2: When Batching Isn't Enough

Once we had a batched implementation, we ran canaries and saw something surprising: about a 5% performance regression. The algorithm wasn't the problem; turning M×N separate dot products into a matrix multiplication is mathematically sound. The issue was the overhead we introduced in the first implementation.

  1. Our initial version built double[][] matrices for candidates, history, and results on every batch. These large, short-lived allocations created GC pressure, and the double[][] layout itself is non-contiguous in memory, which meant extra pointer chasing and worse cache behavior.
  2. On top of that, the first-cut Java matrix multiply was a straightforward scalar implementation, so it couldn't take advantage of SIMD. In other words, we paid the cost of batching without getting the compute efficiency we were aiming for.

The lesson came quickly: algorithmic improvements don't matter if the implementation details (memory layout, allocation strategy, and the compute kernel) work against you. That set up the next step: making the data layout cache-friendly and eliminating per-batch allocations before revisiting the matrix multiply kernel.

Step 3: Flat Buffers & ThreadLocal Reuse

We reworked the data layout to be cache-friendly and allocation-light. Instead of double[m][n], we moved to flat double[] buffers in row-major order. That gave us contiguous memory and predictable access patterns. Then we introduced a ThreadLocal<BufferHolder> that owns reusable buffers for candidates, history, and other scratch space. Buffers grow as needed but never shrink, which avoids per-request allocation while keeping each thread isolated (no contention). A simplified sketch:

class BufferHolder {
    double[] candidatesFlat = new double[0];
    double[] historyFlat = new double[0];

    double[] getCandidatesFlat(int required) {
        if (candidatesFlat.length < required) {
            candidatesFlat = new double[required]; // grow, never shrink
        }
        return candidatesFlat;
    }

    double[] getHistoryFlat(int required) {
        if (historyFlat.length < required) {
            historyFlat = new double[required]; // grow, never shrink
        }
        return historyFlat;
    }
}

private static final ThreadLocal<BufferHolder> threadBuffers =
    ThreadLocal.withInitial(BufferHolder::new);

This change alone made the batched path far more predictable: fewer allocations, less GC pressure, and better cache locality. Packing a batch into those buffers is just a row-major copy, as in the sketch below.
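A sketch of that packing step, reusing the embedding() helper and BufferHolder from above; the packCandidates name and the List<Video> signature are illustrative assumptions:

void packCandidates(List<Video> candidates, int D, BufferHolder buffers) {
    double[] flat = buffers.getCandidatesFlat(candidates.size() * D);
    for (int i = 0; i < candidates.size(); i++) {
        double[] e = embedding(candidates.get(i)).toArray(); // D-dimensional embedding
        System.arraycopy(e, 0, flat, i * D, D);              // row i occupies [i*D, i*D + D)
    }
}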


Now the remaining question was the one we initially thought we were answering: what's the best way to do the matrix multiply?

Step 4: BLAS: Great in Tests, Not in Production

The obvious next step was BLAS (Basic Linear Algebra Subprograms). In isolation, microbenchmarks looked promising. But once integrated into the real batch scoring path, the gains didn't materialize. A few things were working against us:

  • The default netlib-java path was using F2J (Fortran-to-Java) BLAS rather than a truly native implementation.
  • Even with native BLAS, we paid overhead for setup and JNI transitions.
  • Java's row-major layout doesn't match the column-major expectations of many BLAS routines, which can introduce conversions and temporary buffers (see the sketch after this list).
  • Those extra allocations and copies mattered in the full pipeline, especially alongside the TensorFlow embedding work.
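To make the layout point concrete, here is the kind of per-batch transpose copy that a column-major BLAS routine can force on row-major Java data; this is an illustrative sketch, not our production code:

// Copy a row-major (rows x cols) matrix into a column-major temporary for a BLAS routine.
static double[] toColumnMajor(double[] rowMajor, int rows, int cols) {
    double[] colMajor = new double[rows * cols];             // extra short-lived allocation per batch
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            colMajor[c * rows + r] = rowMajor[r * cols + c]; // extra copy pass over the data
        }
    }
    return colMajor;
}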

BLAS was still a useful experiment: it clarified where time was being spent, but it wasn't the drop-in win we wanted. What we needed was something that stayed pure Java, fit our flat-buffer layout, and could still exploit SIMD.

Step 5: JDK Vector API to the rescue

A Quick Note on the JDK Vector API: the JDK Vector API is an incubating feature that provides a portable way to express data-parallel operations in Java (think "SIMD without intrinsics"). You write in terms of vectors and lanes, and the JIT maps those operations to the best SIMD instructions available on the host CPU (SSE/AVX2/AVX-512), with a scalar fallback when needed. More crucially for us, it's pure Java: no native dependencies, no JNI transitions, and a development model that feels like normal Java code rather than platform-specific assembly or intrinsics.

This was a good fit for our workload because we had already moved embeddings into flat, contiguous double[] buffers, and the hot loop was dominated by large numbers of dot products. The final step was to replace BLAS with a pure-Java SIMD implementation using the JDK Vector API. By this point we already had the right shape for high performance: batching, flat buffers, and ThreadLocal reuse. So the remaining work was to swap out the compute kernel without introducing JNI overhead or platform-specific code. We did that behind a small factory. At class load time, MatMulFactory selects the best available implementation (sketched after the list below):

  • If jdk.incubator.vector is available, use a Vector API implementation.
  • Otherwise, fall back to a scalar implementation with a highly optimized loop-unrolled dot product (implemented by my colleague Patrick Strawderman, inspired by patterns used in Lucene).
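A minimal sketch of that selection logic; the class and method names here are illustrative assumptions, not the actual Ranker code:

final class MatMulFactory {

    /** True only when the service was started with --add-modules=jdk.incubator.vector. */
    static final boolean VECTOR_API_AVAILABLE = detectVectorApi();

    private static boolean detectVectorApi() {
        try {
            Class.forName("jdk.incubator.vector.DoubleVector");
            return true;   // pick the Vector API matmul implementation
        } catch (ClassNotFoundException | LinkageError e) {
            return false;  // pick the scalar fallback implementation
        }
    }

    /** Flavor of the scalar fallback: a 4-way loop-unrolled dot product over flat, row-major buffers. */
    static double unrolledDot(double[] a, int aOff, double[] b, int bOff, int d) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int k = 0;
        for (; k + 4 <= d; k += 4) {
            s0 += a[aOff + k]     * b[bOff + k];
            s1 += a[aOff + k + 1] * b[bOff + k + 1];
            s2 += a[aOff + k + 2] * b[bOff + k + 2];
            s3 += a[aOff + k + 3] * b[bOff + k + 3];
        }
        double sum = s0 + s1 + s2 + s3;
        for (; k < d; k++) {
            sum += a[aOff + k] * b[bOff + k];  // scalar tail
        }
        return sum;
    }
}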

In the Vector API implementation, the inner loop computes a dot product by accumulating a * b into a vector accumulator using fma() (fused multiply-add). DoubleVector.SPECIES_PREFERRED lets the runtime pick an appropriate lane width for the machine. Here's a simplified sketch of the inner loop:

// Vector API path (simplified)
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {

        DoubleVector acc = DoubleVector.zero(SPECIES);
        int k = 0;
        // SPECIES.length() lanes per step (typically 4 doubles on AVX2, 8 doubles on AVX-512).
        for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
            DoubleVector a = DoubleVector.fromArray(SPECIES, candidatesFlat, i * D + k);
            DoubleVector b = DoubleVector.fromArray(SPECIES, historyFlat, j * D + k);
            acc = a.fma(b, acc); // fused multiply-add: acc += a * b, lane by lane
        }
        double dot = acc.reduceLanes(VectorOperators.ADD);
        // handle the scalar tail k..D-1
        for (; k < D; k++) {
            dot += candidatesFlat[i * D + k] * historyFlat[j * D + k];
        }
        similaritiesFlat[i * N + j] = dot;
    }
}

The figure below shows how the Vector API uses SIMD hardware to process multiple doubles per instruction (e.g., 4 lanes on AVX2 and 8 lanes on AVX-512). What was many scalar multiply-adds becomes a smaller number of vector fma() operations plus a reduction: same algorithm, much better use of the CPU's vector units.

Figure: Vectorization with SIMD
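If you're curious which lane width the runtime picks on your own hardware, a tiny check like the following will print it (an illustrative snippet, not from the original post):

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class LaneCheck {
    public static void main(String[] args) {
        // Run with: java --add-modules=jdk.incubator.vector LaneCheck
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        // Typically 4 double lanes on AVX2 (256-bit) and 8 on AVX-512 (512-bit).
        System.out.println("Preferred double lanes: " + species.length());
    }
}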

Fallbacks & Safety: When the Vector API Isn't Available

Because the Vector API is still incubating, it requires a runtime flag: --add-modules=jdk.incubator.vector. We didn't want correctness or availability to depend on that flag, so we designed the fallback behavior explicitly: at startup, we detect Vector API support and use the SIMD batched matmul when available; otherwise we fall back to an optimized scalar path, with single-video requests continuing to use the per-item implementation.

That gives us a clean operational story: services can opt in to the Vector API for maximum performance, but the system stays safe and predictable without it.

Results in Production

With the full design in place (batching, flat buffers, ThreadLocal reuse, and the Vector API), we ran canaries against production traffic. We saw a ~7% drop in CPU utilization and a ~12% drop in average latency. To normalize across any small throughput variations, we also tracked CPU/RPS (CPU consumed per request-per-second). That metric improved by roughly 10%, meaning we could handle the same traffic with about 10% less CPU, and we saw similar numbers hold after full production rollout.

Figure: CPU/RPS on Ranker

At the function (operator) level, we saw the feature's CPU share drop from the initial 7.5% to a mere ~1% with the optimization in place. At the assembly level, the shift was clear: from loop-unrolled scalar dot products to a vectorized matrix multiply on AVX-512 hardware.

Figure: Assembly snippet from batchEncode

Closing Thoughts

This optimization ended up being less about finding the "fastest library" and more about getting the fundamentals right: choosing the right computation shape, keeping the data layout cache-friendly, and avoiding overheads that can erase theoretical wins. Once those pieces were in place, the JDK Vector API was a great fit, since it let us express SIMD-style math in pure Java, without JNI, while still keeping a safe fallback path. Another bonus was the low developer overhead: compared to lower-level approaches, the Vector API let us replace a much larger, more complicated implementation with a relatively small amount of readable Java code, which made it easier to review, maintain, and iterate on.

Have you tried the Vector API in a real service yet? I'd love to hear which workloads it helped (or didn't), and what you learned about benchmarking and rollout in production.


