By Nipun Kumar, Rajat Shah, Peter Chng
Introduction
This is the first blog post in a multi-part series that shares technical insights into how our ML model serving infrastructure powers multiple personalized experiences at scale across various domains (e.g., title recommendations, commerce). In this introductory post, we'll dive into the domain-independent API abstraction and the traffic routing capabilities that the central ML model serving platform exposes to multiple domain-specific microservices for model inference. This single API, or entry point, into the ML model serving platform has significantly increased the speed of innovation for iterating on newer versions of existing ML experiences, as well as enabling entirely new product experiences with ML.
Machine learning use cases powering member experiences at Netflix require rapid iteration and evolution in response to new learnings. The success of our ML model serving infrastructure largely depends on enabling researchers to quickly experiment with new hypotheses and safely launch their models into production at scale. Equally important is enabling multiple microservices at Netflix to seamlessly get model inference without being exposed to the complexities of ML model inference. To achieve this in a uniform and scalable manner, we created a centralized ML serving platform. As of 2025, the platform serves hundreds of model types and versions at over 1 million requests per second. In this post, we'll zoom in on a core challenge of any large-scale ML serving system: how to route traffic to the right model instance, on the right cluster shard, for the right client and use case, while preserving a simple abstraction for both client services and model researchers.
Background
Models at Netflix
To properly frame our discussion, let's first clarify the distinction between model serving and model inference. At Netflix, the definition of an ML model has historically been somewhat unique. While model inference typically focuses solely on an infer(features) -> score capability, models at Netflix act as self-contained workflows that transform inputs into outputs. A "model" encapsulates pre- and post-processing, feature computation logic, and an optional ML-trained component, all packaged in a standard format suitable for use across multiple contexts. We refer to the end-to-end execution of this workflow as model serving. This distinction matters because our routing and API abstractions operate at the level of workflows, not just individual scoring functions.
A few simplified examples of model serving use cases:
Use case: Personalized Continue Watching row on the Netflix Homepage
- Input: UserId, Country, Device ID
- Output: Ranked list of movies and shows (aka titles): [titleId1, titleId2, titleId3, …]
Use case: Payment Fraud Detection
- Input: UserId, Country, Payment transaction details
- Output: Probability of the transaction being fraudulent
A typical flow of this serving workflow is depicted below:
To achieve this higher level of abstraction, the model definition contains a list of facts (raw, unprocessed data or observations constructed as state in different business workflows) that it needs to compute features, and it relies on the model serving platform to supply these facts at serving time by calling several other microservices. Likewise, during offline training, Netflix's ML fact store provides snapshots for bulk access to facilitate feature computation.
The important takeaway from this model definition is that calling services only need to provide standard request context (such as userId, country, device) and the relevant domain context (such as titles to rank, or the payment transaction for fraud detection), and the model can itself compute features and perform inference as part of the execution flow. This common set of request contexts across domains enables them to share a standard API abstraction and standardizes how various client microservices can uniformly integrate with the serving app. Additionally, clients are shielded from model selection and execution, allowing the model architecture and data inputs to evolve with minimal client coordination.
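To make this concrete, here is a minimal sketch of what such requests might look like; the field names are illustrative assumptions, not the platform's actual schema:
// Hypothetical request payloads illustrating the shared abstraction across domains.
// Field names are assumptions for illustration, not the platform's actual schema.
const continueWatchingRequest = {
  objective: "ContinueWatchingRanking",                          // business use case (see "Objective" below)
  requestContext: { userId: "user-123", country: "US", deviceId: "tv-987" },
  domainContext: { titleIdsToRank: [101, 202, 303] }             // domain context: candidate titles
};

const fraudDetectionRequest = {
  objective: "PaymentFraudDetection",
  requestContext: { userId: "user-123", country: "US" },
  domainContext: { transaction: { amountCents: 1599, method: "credit_card" } }
};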
This post focuses on the technical details that support this design paradigm. We'll first describe how we implemented this abstraction with Switchboard, a centralized routing service, and then discuss the operational challenges we encountered at scale and how they led us to the Lightbulb architecture.
ML Model Serving Platform Principles
We envisioned a central model serving platform for all of Netflix's member-facing ML model serving needs. This ambitious effort required principled thinking to provide the right level of abstraction for both researchers and client applications. The following principles, which are relevant to the topic of this blog post, ensured that the platform acts as an enabler of rapid ML innovation and limits the exposure of ML model iterations to the client apps:
- Model innovation independent of client apps: There should be only a one-time integration effort by the calling app with the ML serving platform for a new use case. After that, virtually all model iterations, including intermediate model A/B experiments, should be mostly opaque to the calling apps. This implies that the platform should handle tasks such as model selection based on a user's A/B allocation, fetching additional data needed by experimental models, logging for further training or observability, and more. This also benefits the ML researcher, as they only need to coordinate with one platform for model innovation.
- Decouple clients from model sharding: Models are distributed across multiple serving compute cluster shards, each with its own Virtual IP (VIP) address. Various factors, such as traffic patterns, SLAs, model architecture, and CPU/memory availability, affect the model-to-cluster mapping, and changes to this mapping result in changes to the VIP address at which a model is reachable. The serving platform should make clients agnostic to such frequent VIP address changes while ensuring high availability.
- Flexible traffic routing rules: Support flexible mechanisms to introduce new traffic routing rules. This includes supporting traffic routing based on A/B experiments, providing a knob to slowly shift traffic to new models and VIP addresses, and allowing client overrides.
Introducing Switchboard
Standard out-of-the-box API gateway solutions (such as AWS API Gateway or a standalone service mesh proxy) did not meet all of our requirements. Specifically, we needed first-class integration with Netflix's experimentation platform, the ability to expose gRPC endpoints to clients, and the ability to use rich domain-specific context for routing customizations, which generic proxies were not designed to handle. Additionally, the platform required customizations for model-specific lifecycle stages (shadow mode, canaries, rollbacks) to enable safe rollouts and migrations.
Hence, we embarked on building a custom service that serves as a flexible proxy layer for all traffic, handling over 1 million requests per second while maintaining high availability and reliability. We named it Switchboard.
Switchboard serves as the central entry point for the system, acting as a mandatory interface for all clients to access the right model based on their context. Its role is to perform context-aware routing and to apply any configured context enrichment to the model inputs.
Here is a visual representation of the request flow from different clients to different serving clusters:
Objective Abstraction
To support this system design, we introduce the concept of an "Objective". It is an enumeration defined by the serving platform that every request into the system must provide. It serves three key purposes:
In short, an Objective is the serving platform's name for a specific business use case (e.g., ContinueWatchingRanking), which decouples clients from concrete models and guides the platform's routing and model selection decisions.
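As a minimal sketch, assuming hypothetical names (the actual enumeration and its values are defined by the platform), the contract looks roughly like this:
// Illustrative sketch of an Objective enumeration; the real values are platform-defined.
const Objectives = Object.freeze({
  ContinueWatchingRanking: "ContinueWatchingRanking",
  PaymentFraudDetection: "PaymentFraudDetection"
});

// Every request names its Objective; the platform resolves it to a concrete model.
function buildRequest(objective, requestContext, domainContext) {
  return { objective, requestContext, domainContext };
}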
Key Capabilities of Switchboard
To summarize, these are the key capabilities of Switchboard:
- Common Client Abstraction: Switchboard provides a single point of contact for all of our clients' model needs. When clients want to consume additional models for new ML applications addressing the same business need, there is no new service dependency to introduce or new clients to manage in order to make requests to the models. From an MLOps perspective, this also gives us knobs to control client rate limits across model versions and manage central concurrency limits to deal with misbehaving clients.
- Context-Aware Routing: Switchboard can route a request based on a rich set of contextual features, such as the user's current device, locale, ranking surface type (e.g., home page vs. search results), or the current A/B test a user is in.
- Dynamic Traffic Splitting: It enables real-time traffic splitting for canary deployments and experimentation. This allows engineers to safely roll out a new model version to a small, controlled percentage of users before a full launch.
- Model Versioning and Lifecycle Management: Switchboard inherently manages concurrent request traffic to multiple versions of the same model. This is crucial for:
- Shadow Mode Testing: Routing production traffic to a new model version without affecting the user experience, enabling performance comparisons.
- Instant Rollback: Rapidly switching traffic away from a problematic new model version back to a stable one.
But is this the whole story? Not quite. Introducing this routing layer adds complexity to our model deployment cycles. In addition, we need a mechanism to collect the context-based routing information from researchers when they choose to deploy model variants.
The Glue — Switchboard Rules
Given that Objectives serve as the contract between clients and the serving platform, we needed a way for researchers to attach model variants, experiments, and traffic splits to those Objectives without changing client code. This is where Switchboard Rules comes in.
The primary UX for model researchers to define the models associated with an Objective in a flexible manner is a JavaScript configuration, which we call Switchboard Rules. It is used to produce a set of rules (typically a JSON file) that primarily dictate the following things to the serving platform:
- The default model to use for a given Objective
- A/B experiments to configure for a set of Objectives and the corresponding models to load for those experiments
- Customizations to gradually shift traffic to a new model (a sketch of such a rule follows the A/B example below)
Here is an example of an A/B test rule in the context of the Continue Watching row:
/**
Configuration rule written by a Model Researcher to add an A/B experiment in the Model Serving system.
Cell 1: Uses the default, currently productized model
Cell 2 and Cell 3: Use different experimental (candidate) models
**/
function defineAB12345Rule() {
  const abTestId = 12345;
  const objectives = Objectives.ContinueWatchingRanking;
  const abTestCellToModel = {
    1: {name: "netflix-continue-watching-model-default"},
    2: {name: "netflix-continue-watching-model-cell-2"},
    3: {name: "netflix-continue-watching-model-cell-3"}
  };
  return {
    cellToModel: abTestCellToModel,
    abTestId: abTestId,
    targetObjectives: [objectives],
    modelInputType: constants.TITLE_INPUT_TYPE,
    modelType: 'SCORER'
  };
}
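Beyond A/B experiments, the rules can also ramp traffic gradually to a new model, as noted in the third bullet above. The following is only a sketch of what such a rule could look like; the field names are assumptions rather than the exact Switchboard Rules schema:
// Illustrative sketch of a gradual traffic-shift rule; field names are assumptions,
// not the exact Switchboard Rules schema.
function defineContinueWatchingRolloutRule() {
  return {
    targetObjectives: [Objectives.ContinueWatchingRanking],
    defaultModel: {name: "netflix-continue-watching-model-default"},
    candidateModel: {name: "netflix-continue-watching-model-v2"},
    candidateTrafficPercent: 5,   // knob to slowly ramp traffic toward the new model
    modelType: 'SCORER'
  };
}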
These rules are consumed by both Switchboard and the Model Serving clusters. Given these rules, the serving platform components can take various actions, some of which are detailed below:
Control Plane Flow:
- Assignment: Produce the model-to-cluster shard assignment.
- Validation: Load all specified models onto the Serving Cluster Shard and validate model dependencies to ensure successful execution.
- Mapping: Provide the model-to-shard VIP address mapping to Switchboard.
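A minimal sketch of what such a mapping might look like; the shard names and VIP identifiers below are made up for illustration:
// Hypothetical model-to-shard VIP mapping produced by the control plane; values are illustrative.
const modelToShardVip = {
  "netflix-continue-watching-model-default": "vip://cw-serving-shard-a",
  "netflix-continue-watching-model-cell-2": "vip://cw-serving-shard-b",
  "netflix-continue-watching-model-cell-3": "vip://cw-serving-shard-b"
};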
Data Plane Flow:
- Allocation: If the request is for Objective=ContinueWatchingRanking, query the Experimentation Platform for the userId's cell allocation.
- Model Selection: Use the allocation and the A/B test rule to select the appropriate model.
- Request Routing: Route the request to the serving cluster shard with the selected model and context.
- Model Execution (on the serving host): Run the model workflow steps and return the response.
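Putting these steps together, a hedged sketch of the data plane flow might look like the following; the helper names (findRuleFor, experimentationPlatform, servingCluster) are hypothetical, not Switchboard's actual API:
// Illustrative sketch of the data plane flow; helper names are assumptions, not Switchboard's API.
async function handleRequest(request, rules, modelToShardVip, deps) {
  const { experimentationPlatform, servingCluster } = deps;       // hypothetical collaborators
  const rule = rules.findRuleFor(request.objective);              // e.g., the A/B rule defined above
  const cell = await experimentationPlatform.getCellAllocation(   // Allocation
    request.requestContext.userId, rule.abTestId);
  const model = rule.cellToModel[cell] || rule.defaultModel;      // Model Selection
  const vip = modelToShardVip[model.name];                        // Request Routing
  return servingCluster.execute(vip, model, request);             // Model Execution (on the serving host)
}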
A key highlight of this setup is the decoupling of the experimentation config from the serving platform code. This includes having an independent release cycle for the rules, separate from code deployments. Netflix's Gutenberg system provides an excellent ecosystem that enables a flexible pub-sub architecture, facilitating proper versioning, dynamic loading, easy rollbacks, and more. Both Switchboard and the Serving Cluster hosts subscribe to the same Switchboard Rules configuration.
To prevent race conditions and ensure proper synchronization of the dynamic Switchboard Rules configuration, the following flow is used:
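As one illustrative sketch of the general idea only (not necessarily the exact mechanism used), each routed request can carry the rules version Switchboard resolved against, so a serving host that is still on an older configuration can refresh before selecting a model; the names below are hypothetical:
// Illustrative sketch only; names are hypothetical and this is not the exact mechanism used.
function tagWithRulesVersion(request, rules) {
  return { ...request, rulesVersion: rules.version };   // record the version Switchboard routed with
}

async function serveWithVersionCheck(request, localRules, refreshRules, executeModel) {
  if (localRules.version < request.rulesVersion) {
    localRules = await refreshRules();   // catch up before model selection to avoid a stale-config race
  }
  return executeModel(request, localRules);
}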
Evolving Challenges
Switchboard solved the primary problem of improving model iteration and innovation velocity, and provided an excellent ML serving abstraction to over 30 service clients. However, as the system's scale increased, a few challenges and concerns with this design became apparent:
- Single point of failure: Switchboard's presence in the critical request path highlights the risk of cutting off access to all serving hosts in extreme cases, such as accidental bugs or noisy neighbors sending excessive traffic.
- Why this matters: Switchboard became a shared dependency whose failure would degrade or disable multiple ML-powered experiences at Netflix.
- Added latency due to an extra network hop: Switchboard in the request path adds between 10–20ms of latency due to serialization and deserialization operations, depending on payload size. It also further exposes a request to tail latency amplification.
- Why this matters: The added latency is unacceptable for some latency-sensitive clients, resulting in end-user impact due to service timeouts.
- Reduced client flexibility: Switchboard obscures visibility into client request origins from the serving clusters. Consequently, distinguishing data logged for real vs. synthetic traffic, which is essential for model training, is difficult and requires ongoing customization and increased MLOps overhead.
- Why this matters: It makes it harder to do tenant separation and test traffic isolation.
What Next? — Lightbulb
The aforementioned challenges of operating Switchboard at scale compelled us to rethink the core implementation while retaining its key features. Our goal was not to throw away Switchboard's design, but to refactor where and how its responsibilities were executed, keeping the benefits while reducing risk and latency. Notably:
- Common Client Abstraction
- Decouple clients from model sharding
- Flexible traffic routing rules
- Lightweight system client
- Single place to define model and experimentation config
- Fast experimentation config propagation
- Fallback and client-side caching in case of failures
However, we did want to revisit a few of the earlier design decisions as we moved forward:
- Remove the routing service from the direct request path: Having a single service in the active request path introduces another failure mode and limits fallback flexibility. While routing rules change infrequently, maintaining consistency comes at the cost of increased availability risks.
- Separate model inputs from the request metadata: In certain cases, the request payload can be quite large. Needing to deserialize and then re-serialize the payload as it flowed through Switchboard to make a routing decision was a significant contributor to latency and increased serving costs.
- Provide better isolation for the routing layer: Consolidating multiple use cases (tenants) into a single routing cluster poses two main challenges. First, error propagation posed a risk, as a surge of problematic requests from one tenant could cascade errors back to Switchboard, potentially impacting other users. Second, the cluster had to accommodate varying latency requirements because requests from different use cases varied considerably in complexity.
This required some changes to our setup flow: while it largely remained unchanged, we created separate components for Routing and Model Selection (Lightbulb):
We now take the rules for an Objective and break them into distinct sets of configuration:
- Model Serving Configuration: This allows us to determine which model should be used at request time, along with the required metadata.
- Routing Rules: Given a model we want to serve at request time, this tells us which VIP the request should be routed to.
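A sketch of this split, with illustrative shapes and names rather than the exact configuration format, is shown below:
// Illustrative sketch of the split configuration; shapes and names are assumptions.
// Model Serving Configuration: Objective -> model (and metadata) to use at request time.
const modelServingConfig = {
  ContinueWatchingRanking: {
    modelName: "netflix-continue-watching-model-default",
    modelType: "SCORER",
    modelInputType: "TITLE_INPUT_TYPE"
  }
};

// Routing Rules: model -> routingKey -> cluster VIP, consumed at the routing layer.
const routingRules = {
  "netflix-continue-watching-model-default": {
    routingKey: "cw-default",
    clusterVip: "vip://cw-serving-shard-a"
  }
};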
The Data Plane changes also reflect this separation, as we now rely on Envoy to handle the routing details:
Envoy is already used for all egress communication between apps at Netflix, and it can route requests to different clusters (VIPs) based on the configurable Routing Rules published from our control plane. However, it lacks the information needed to make routing decisions and the ability to enrich the request body with additional serving parameters required for A/B testing model variants. We introduced Lightbulb to cover this gap:
- Lightbulb consumes the minimal request context, which contains use-case information, and provides the metadata mapping required for routing at the Envoy layer.
- Lightbulb resolves the request context to determine a routingKey configuration along with the ObjectiveConfig, which is where we place the model id and other request-specific configurations required for model execution. This is done to separate the config resolution associated with the request from the placement and routing information needed to reach the model on the inference cluster.
- While the routingKey is added to the headers for the Envoy proxy to consume, the client adds the ObjectiveConfig parameters to the request body itself. This is done to avoid bloating the request headers while still passing additional parameters the model needs to process the request correctly.
- The routing of the actual request is performed by the Envoy proxy, which has the metadata to map the routingKey to the specific cluster VIP running the model. Because the routingKey is in a header, this determination can be made with minimal overhead.
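Taken together, here is a hedged sketch of the client-side flow; the header name and helper shapes are assumptions for illustration, not the actual implementation:
// Illustrative sketch of the client-side flow with Lightbulb and Envoy; names are assumptions.
function prepareRequest(request, modelServingConfig, routingRules) {
  const objectiveConfig = modelServingConfig[request.objective];   // resolved via Lightbulb
  const { routingKey } = routingRules[objectiveConfig.modelName];
  return {
    headers: { "x-routing-key": routingKey },    // Envoy maps this header to the cluster VIP
    body: { ...request, objectiveConfig }        // model id and serving params travel in the body
  };
}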
These changes retain the advantages of Switchboard, such as a single integration point, abstraction of model identity from the use case, and context-aware routing, while addressing the challenges we observed over time.
Conclusion
The evolution from Switchboard to Lightbulb marks a significant architectural refinement of our ML model serving infrastructure. While Switchboard provided the initial abstraction layer essential for rapid innovation, its added latency and single-point-of-failure risk posed scaling hurdles. The subsequent adoption of Lightbulb, a decoupled service focused solely on routing metadata, together with its integration with Envoy, resolved these challenges. The new architecture preserves the key benefits of seamless client integration and flexible experimentation, while ensuring reliable, efficient, and scalable delivery of personalized member experiences and positioning us well for future ML growth.
In future posts in this series, we'll dive deeper into other aspects of our ML serving platform, including inference and feature fetching, and how they interact with the routing architecture described here.
