Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Saish Sali, Nipun Kumar, Sura Elamurugu

Introduction

As Netflix has grown, machine learning continues to support our ability to deliver value to members and drive excellence across multiple areas of our business. When Netflix began investing in machine learning over a decade ago, it was primarily focused on a single domain: personalization. Scala was the industry standard, our ML teams were relatively small, and optimizing member engagement was our primary use case. Fast forward to today, and machine learning has become the backbone of Netflix’s business transformation. We now apply ML across various business domains, including:

Personalization: Optimizing engagement and helping members discover content they’ll love
Studio: Pre- and post-production workflows
Payments: Fraud detection, payment routing, and recurring billing optimization
Ads: Our newest domain, requiring real-time decisioning and targeting

… and a growing number of additional use cases across the company

Each domain operates with a different tech stack, different business metrics, and a distinct organizational structure. While this diversity is a testament to how machine learning has evolved to drive value across many verticals at Netflix, this growth introduces a new challenge: enabling cross-pollination of models and data across domains.

The Challenge: A Fragmented ML Landscape

As our ML investments scaled across these domains, a critical problem emerged: the models produced largely became black boxes. Without any discovery infrastructure, ML practitioners couldn’t easily collaborate or share work across business verticals.

Consider a concrete example: content embeddings. Our Studio teams create sophisticated embeddings that identify scene boundaries, detect visual transitions, and understand content structure. These embeddings were originally built for production workflows.

But those same embeddings could be highly useful elsewhere. Ads could hypothetically use content embeddings for context matching (ensuring advertisements align with the tone and content of what’s currently playing). Personalization could leverage them for episodic merchandising and recommendations (matching the topic or mood of an episode with a user’s viewing preferences). Yet making this cross-pollination happen is very difficult.

Why? Our ML tools exist in silos, each with its own backend services and user interface. The model registry is unaware of which A/B tests are using its models, and the pipeline orchestrator is unaware of downstream model dependencies. ML practitioners have to traverse multiple systems to answer basic questions about their work. Finding a model requires opening the model registry, understanding its lineage means switching to the pipeline orchestrator, and tracking which A/B tests use that model requires navigating to the experimentation platform. This fragmentation prevents practitioners from answering critical questions:

Discovery: What features exist? What data sources are available for generating features for a model?
Lineage: Which pipeline is producing data for a specific model? What data sources feed these features?
Impact: Which A/B tests are running this model? Which models will break if I change this feature? Who owns each piece of this chain?

The Hard Problem: Connecting Everything

The real challenge wasn’t just building a consolidated UI. We needed to connect the different pieces of infrastructure our ML practitioners were using to perform different parts of the ML lifecycle.

Our ML ecosystem generates metadata from dozens of sources:

Pipeline orchestration systems emit execution details, stage dependencies, and data transformations
Deployed model registry tracks model versions, artifacts, staleness, and deployment history
Experimentation platform manages A/B tests and their configurations
Feature store catalogs feature definitions and usage
AI Dataset platform tracks the creation, management, discovery, and loading of datasets
Identity platform maintains user, team, and group metadata

Each system employs different formats, identifiers, and mental models. The hard technical problem we had to solve was: How do we collect this heterogeneous metadata, transform it into a unified entity model, and build a connected graph that enables true exploration and collaboration across business domains?

The Solution: Metadata Service and the Model Lifecycle Graph

Our answer was the Metadata Service (MDS), which builds a Model Lifecycle Graph that indexes and connects ML-related entities across Netflix. MDS is optimized for real-time ingestion of ML metadata (e.g., models, features, pipelines, experiments, datasets) and for answering cross-domain questions such as “Which experiments are running this model?” or “Which models share these features?” It is the foundation that enables discovery: it ingests events from diverse sources, enriches them with context, and materializes relationships across entities.

Our vision: to make every ML asset at Netflix discoverable, understandable, and reusable by every ML practitioner, regardless of their team or domain.

Core Abstractions: The Vocabulary of the System

Before diving into the technical implementation, it’s helpful to understand the conceptual model that underpins MDS. This vocabulary enables consistent communication across teams and systems:

Component: Any object that is uniquely addressable using an AI Platform (AIP) Uniform Resource Identifier (URI). An AIP URI follows the format aip://<componentType>/<platformId>/<resourceId>, ensuring global uniqueness. For example:

Models: aip://model/registry/ranking-v5
Users: aip://user/identity/alice
Pipelines: aip://pipeline/orchestrator/weekly-training
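
To make the URI format concrete, here is a minimal Python sketch that parses an AIP URI into its three segments and formats it back. The `AipUri` class and `parse_aip_uri` function are illustrative names, not part of MDS, and allowing slashes inside the resource ID is an assumption the post does not confirm.

```python
from dataclasses import dataclass

AIP_SCHEME = "aip://"


@dataclass(frozen=True)
class AipUri:
    """The three segments of aip://<componentType>/<platformId>/<resourceId>."""

    component_type: str
    platform_id: str
    resource_id: str

    def __str__(self) -> str:
        return f"{AIP_SCHEME}{self.component_type}/{self.platform_id}/{self.resource_id}"


def parse_aip_uri(uri: str) -> AipUri:
    """Split an AIP URI into its parts, rejecting anything malformed."""
    if not uri.startswith(AIP_SCHEME):
        raise ValueError(f"not an AIP URI: {uri!r}")
    # Split at most twice so a resource id may itself contain slashes
    # (an assumption; the post does not say whether that is allowed).
    parts = uri[len(AIP_SCHEME):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected aip://<componentType>/<platformId>/<resourceId>: {uri!r}")
    return AipUri(*parts)


model = parse_aip_uri("aip://model/registry/ranking-v5")
print(model.component_type, model.platform_id, model.resource_id)
```

Because the parsed value is a frozen dataclass, it is hashable and can be used directly as a dictionary key or set member when building indexes over components.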

Entity: A component within the ML ecosystem, characterized by additional properties such as name, description, creation date, and owners. Entities represent ML-specific assets, such as models, features, and pipelines.

Entity Type: A group of entities that share the same data shape. A data shape is a set of property constraints that specify the attributes and relationships an entity must have.

Domain: A functional grouping of related entity types that defines the abstract interface for a category of ML assets. For example, the Models domain defines what a Model and Model Instance look like, while the Pipelines domain defines Schedules, Requests, and Executions.

Provider: A concrete implementation of a domain, backed by a specific source system. For example, the Models domain is currently backed by our internal model registry. This separation allows MDS to support multiple providers for the same domain. If a new model registry were introduced, it could be added as an additional provider without changing the domain interface.
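
The domain/provider split can be sketched in Python as an interface plus one implementation. Everything here is hypothetical: the `ModelsProvider` protocol, its `get_model` method, and the in-memory `RegistryProvider` are stand-ins chosen for illustration, not the MDS API.

```python
from typing import Dict, Protocol


class ModelsProvider(Protocol):
    """Abstract interface of the Models domain (hypothetical method name)."""

    def get_model(self, resource_id: str) -> Dict[str, str]: ...


class RegistryProvider:
    """Provider backed by an in-memory stand-in for a model registry."""

    def __init__(self, records: Dict[str, Dict[str, str]]) -> None:
        self._records = records

    def get_model(self, resource_id: str) -> Dict[str, str]:
        record = self._records[resource_id]
        return {"id": f"aip://model/registry/{resource_id}", "name": record["name"]}


def resolve_model(providers: Dict[str, ModelsProvider], uri: str) -> Dict[str, str]:
    """Dispatch on the platform id segment of an aip://model/... URI."""
    _, platform_id, resource_id = uri[len("aip://"):].split("/", 2)
    return providers[platform_id].get_model(resource_id)


# Adding a second registry would mean adding one more entry to this mapping.
providers: Dict[str, ModelsProvider] = {
    "registry": RegistryProvider({"ranking-v5": {"name": "Ranking v5"}}),
}
print(resolve_model(providers, "aip://model/registry/ranking-v5"))
```

Callers depend only on the protocol, so a new provider can be registered under its own platform ID without touching code that consumes the domain.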

We can summarize these concepts with a concrete example:

[Figure: concrete example of the core abstractions]

This URI-based addressing scheme is crucial because it allows any service to reference any ML asset with a single string, and MDS can resolve that reference back to rich, connected metadata.

From Events to Entities to Graph

The journey from raw system events to a queryable graph happens in stages. Let’s walk through each with a concrete example: connecting a model to its A/B tests through relationship inference.

1 Event Ingestion

MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real time. Source systems emit thin events that include an identifier and an event type.

Example event:

{
  "event_type": "model_instance_created",
  "instance_id": "ranking-model-v5-20XX0101",
  ...
}

This design keeps producers simple. Source systems only need to announce that a change occurred, without building full payloads or understanding downstream requirements.

Each source system has dedicated event handlers in MDS:

Pipeline Orchestration: Ingests pipeline execution events, including node definitions, schedules, requests, and job attempts
Model Registry: Captures model deployments, configurations, and version updates
Feature Store: Tracks feature definitions and their versions
Experimentation Platform: Monitors A/B test configurations and allocations
Datasets: Tracks ML datasets and their versions
Identity Platform: Maintains ownership and team membership information

2 Entity Enrichment

MDS implements a hydration contract for each event type. When an event arrives, MDS:

Validates the event schema
Calls the source system’s API to fetch the complete, current state
Transforms the response into a normalized entity

This design has a critical property: the order of events doesn’t matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes.

This notification-of-change pattern has a few important tradeoffs. On the plus side, it keeps producers simple, makes us robust to out-of-order or dropped events, and ensures that MDS can always reconcile to the latest state by reading from the source of truth. The tradeoff is that we place additional read load on source systems during hydration and must be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don’t overload them.
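
The order-independence property can be illustrated with a small Python sketch. It assumes an in-memory dictionary standing in for the source of truth; the `HydratingHandler` class, the `model_instance_updated` event type, and the `revision` field are hypothetical and only serve the demonstration.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
FetchState = Callable[[str], Dict[str, Any]]


class HydratingHandler:
    """Treats each event as a hint to re-read state, not as the state itself."""

    def __init__(self, fetch_current_state: FetchState) -> None:
        self._fetch = fetch_current_state
        self.entities: Dict[str, Dict[str, Any]] = {}

    def handle(self, event: Event) -> None:
        # 1. Validate the thin event.
        if "event_type" not in event or "instance_id" not in event:
            raise ValueError(f"malformed event: {event!r}")
        # 2. Ignore any payload and fetch the complete, current state.
        instance_id = event["instance_id"]
        self.entities[instance_id] = self._fetch(instance_id)


# In-memory stand-in for the source of truth (hypothetical fields).
registry = {"ranking-model-v5-20XX0101": {"id": "ranking-model-v5-20XX0101", "revision": 2}}
handler = HydratingHandler(lambda instance_id: dict(registry[instance_id]))

# The "updated" event arrives before "created", and "created" arrives twice.
events: List[Event] = [
    {"event_type": "model_instance_updated", "instance_id": "ranking-model-v5-20XX0101"},
    {"event_type": "model_instance_created", "instance_id": "ranking-model-v5-20XX0101"},
    {"event_type": "model_instance_created", "instance_id": "ranking-model-v5-20XX0101"},
]
for event in events:
    handler.handle(event)

print(handler.entities["ranking-model-v5-20XX0101"])
```

Whatever the arrival order, the stored entity ends up equal to the registry's current record, at the cost of one read against the source per event. That read is where rate limiting, caching, and backoff would be applied.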

For our ranking model example, when the model_instance_created event arrives, MDS calls the Model Registry API: GET /api/v1/instances/ranking-model-v5-20XX0101

The registry responds with a full descriptor. Example response (key fields only):

{
  "id": "ranking-model-v5-20XX0101",
  "pipeline_run_id": "train-weekly-ranking-20XX0101",
  "owner_emails": ["alice@netflix.com"],
  "labels": [{"key": "team", "value": "personalization"}],
  ...
}

3 Data Transformation and Normalization

Raw events are heterogeneous, and each source system has its own schema and semantics. MDS workers transform these events into a unified entity model with standardized fields.

Without normalization, downstream consumers would need to understand every source system’s schema. Normalization creates a consistent interface, allowing queries and relationships to work across all entity types. Here is an example.

Normalized MDS entity:

{
  "id": "aip://model/registry/ranking-model-v5-20XX0101",
  "pipeline_run": "aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101",
  "entity_type": "ModelInstance",
  "owners": ["aip://user/identity/alice"],
  "tags": [{"tag": "team", "value": "personalization"}],
  ...
}

The normalization process standardizes field names and formats. For example, platform-specific IDs become global AIP URIs, owner_emails becomes owners with resolved user URIs, and labels become tags. Foreign keys like pipeline_run_id are transformed into entity references. However, there is still no reference to which A/B tests are using this model. The Model Registry doesn’t track experiments, and the Experimentation Platform doesn’t track which pipeline produced a given model. This is where knowledge enrichment becomes critical.
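
The mapping from the registry response to the normalized entity can be sketched as one Python function. The function name is illustrative, and deriving the user URI from the local part of the email address is an assumption. A real resolver would presumably consult the identity platform.

```python
from typing import Any, Dict


def normalize_model_instance(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a registry descriptor onto the unified entity shape."""
    return {
        # Platform-specific IDs become global AIP URIs.
        "id": f"aip://model/registry/{raw['id']}",
        # Foreign keys become entity references.
        "pipeline_run": f"aip://pipeline-run/orchestrator/{raw['pipeline_run_id']}",
        "entity_type": "ModelInstance",
        # Assumption: the user id is the local part of the email address.
        "owners": [
            f"aip://user/identity/{email.split('@', 1)[0]}"
            for email in raw.get("owner_emails", [])
        ],
        # labels (key/value) become tags (tag/value).
        "tags": [
            {"tag": label["key"], "value": label["value"]}
            for label in raw.get("labels", [])
        ],
    }


raw_response = {
    "id": "ranking-model-v5-20XX0101",
    "pipeline_run_id": "train-weekly-ranking-20XX0101",
    "owner_emails": ["alice@netflix.com"],
    "labels": [{"key": "team", "value": "personalization"}],
}
entity = normalize_model_instance(raw_response)
print(entity["id"], entity["owners"])
```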

4 Storage and Indexing

Once normalized, entities are persisted to Datomic and immediately indexed in Elasticsearch. This happens synchronously within the event processing flow.

Datomic for Caching and Relationships
Normalized entities are first written to Datomic, which serves as both a local cache and a graph database.

Why Datomic? Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. Its immutable fact model means we can continuously add relationships without losing the original entity state.

What we store:

All entity attributes as facts
Entity references (foreign keys that may point to entities not yet fully resolved)
All relationships as reified edges (added by enrichment processes)
Entity lifecycle state (tracking which entities are fully enriched vs. awaiting hydration)

This enables:

Complex graph traversals: Navigate from a model to its features to their data sources in a single query
Entity relationships: Join across multiple domains without N+1 query problems
Flexible schema evolution: Easy to add new entity types and attributes as the catalog grows
Progressive enrichment: Background jobs efficiently identify and process entities requiring additional hydration, enabling gradual graph completion without reprocessing fully enriched entities

In practice, we use Datomic for relationship-heavy, navigational queries such as:

Starting from this model instance, show me all upstream datasets and downstream experiments.
Given this feature, list all consuming models and their owning teams.

These queries often span multiple hops in the graph and benefit from Datomic’s immutable fact model and efficient joins across entity relationships.
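
To show what a multi-hop navigational query does, here is a plain-Python sketch over reified edges stored as (source, relation, target) tuples. It is not Datomic or Datalog. The relation names and entity identifiers are invented for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)

# Hypothetical edges; identifiers are shortened for readability.
EDGES: List[Edge] = [
    ("model-instance:ranking-v5", "uses_feature", "feature:watch-history"),
    ("feature:watch-history", "derived_from", "dataset:playback-events"),
    ("model-instance:ranking-v5", "tested_in", "experiment:12345"),
    ("model-instance:search-v2", "uses_feature", "feature:watch-history"),
]


def reachable(
    edges: Iterable[Edge], start: str, relations: Set[str], reverse: bool = False
) -> List[str]:
    """Breadth-first walk over edges whose relation is in `relations`."""
    adjacency: Dict[str, List[str]] = {}
    for source, relation, target in edges:
        if relation in relations:
            head, tail = (target, source) if reverse else (source, target)
            adjacency.setdefault(head, []).append(tail)
    seen: Set[str] = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Everything upstream of a model instance: its features, then their datasets.
upstream = reachable(EDGES, "model-instance:ranking-v5", {"uses_feature", "derived_from"})
# All models consuming a feature: follow uses_feature edges backwards.
consumers = reachable(EDGES, "feature:watch-history", {"uses_feature"}, reverse=True)
print(upstream)
print(consumers)
```

The same walk answers both example questions by changing only the starting entity, the set of relations, and the direction.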

Elasticsearch for Discovery
Immediately after writing to Datomic, entities are indexed in Elasticsearch to power fast, full-text search across the catalog.

What we index:

Primary fields: Entity name, description, entity type, owner names
Relationship metadata: Names of related entities (e.g., a model’s features, pipelines, A/B tests) stored in the related field
Tags: Domain-specific metadata stored as key-value pairs (e.g., team::personalization, env::production, model.state::launched)

Index structure:

Single entities index: All entity types (models, features, pipelines, etc.) are indexed in a single unified index, differentiated by the entityType field
Separate owners index: Dedicated index for users and groups to enable cross-entity owner searches
Relevance boosting: Exact name matches rank higher than other relevant matches

This enables:

Multi-field text search across entity names, descriptions, tags, and related metadata
Relevance ranking with boosting (exact name matches rank significantly higher)
Complex filtering by entity type, ownership, tags, and domain-specific attributes (stored as tags)
Fuzzy matching to handle typos and partial queries

Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page. Indexing happens in near real time as part of the ingestion and enrichment workflows, so changes are usually visible in the Portal with a short delay that is acceptable for interactive use.
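
A search request combining these behaviors might look like the following sketch, which builds an Elasticsearch query body as a Python dictionary. The `entityType` and `related` fields come from the description above. The other field names, the `name.keyword` sub-field, and the boost values are assumptions rather than the actual MDS mapping.

```python
import json
from typing import Any, Dict, Optional


def build_search_body(text: str, entity_type: Optional[str] = None) -> Dict[str, Any]:
    """Build a bool query: exact-name boost, fuzzy multi-field match, optional filter."""
    bool_query: Dict[str, Any] = {
        "should": [
            # Exact name matches rank significantly higher (assumed keyword sub-field).
            {"term": {"name.keyword": {"value": text, "boost": 10.0}}},
            # Fuzzy multi-field search; "name^3" weights the name field more heavily.
            {
                "multi_match": {
                    "query": text,
                    "fields": ["name^3", "description", "tags", "related"],
                    "fuzziness": "AUTO",
                }
            },
        ],
        "minimum_should_match": 1,
    }
    if entity_type is not None:
        # Filters narrow results without affecting the relevance score.
        bool_query["filter"] = [{"term": {"entityType": entity_type}}]
    return {"query": {"bool": bool_query}}


body = build_search_body("ranking", entity_type="ModelInstance")
print(json.dumps(body, indent=2))
```

The resulting dictionary is what would be sent as the body of a search request against the single entities index.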

5 Knowledge Enrichment and Graph Formation

Once entity metadata is persisted in Datomic, scheduled background processes take over to discover and materialize relationships. These enrichment jobs run periodically, scanning for uncached or partially resolved entities (entities that exist only as references without full metadata).

The enrichment workflow:

Identify candidates: Find entities marked as uncached or with unresolved references
Hydrate relationships: Query source-of-truth systems to fetch related entity details
Materialize edges: Write discovered relationships back to Datomic
Re-index: Trigger Elasticsearch indexing for updated entities
Mark as enriched: Update entity status to prevent redundant processing

This asynchronous approach allows MDS to handle the computational cost of graph formation without blocking real-time event ingestion. It also enables retry logic and gradual enrichment as new entities become available.

Because enrichment is asynchronous, newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it’s safe to rely on a particular relationship for debugging or impact analysis.
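
The five workflow steps and the last-enriched timestamp can be sketched as a single pass over an in-memory store. This is an illustration under stated assumptions: the `status`, `edges`, and `last_enriched` field names, the `SourceUnavailable` exception, and the callback signatures are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


class SourceUnavailable(Exception):
    """Raised by a fetcher when the source system cannot answer right now."""


def run_enrichment_pass(
    entities: Dict[str, Dict[str, Any]],
    fetch_related: Callable[[str], List[str]],
    reindex: Callable[[str], None],
) -> int:
    """Run one pass over the store and return how many entities were enriched."""
    enriched = 0
    for uri, entity in entities.items():
        # 1. Identify candidates: skip anything already enriched.
        if entity.get("status") == "enriched":
            continue
        # 2. Hydrate relationships from the source of truth.
        try:
            related = fetch_related(uri)
        except SourceUnavailable:
            continue  # stays a candidate, so the next pass retries it
        # 3. Materialize edges.
        entity["edges"] = sorted(set(entity.get("edges", [])) | set(related))
        # 4. Re-index the updated entity.
        reindex(uri)
        # 5. Mark as enriched and record when, so staleness can be surfaced.
        entity["status"] = "enriched"
        entity["last_enriched"] = datetime.now(timezone.utc)
        enriched += 1
    return enriched


store: Dict[str, Dict[str, Any]] = {
    "aip://model/registry/a": {"status": "uncached"},
    "aip://model/registry/b": {"status": "enriched", "edges": []},
}
reindexed: List[str] = []


def fake_fetch(uri: str) -> List[str]:
    return ["aip://pipeline/orchestrator/weekly-training"]


first = run_enrichment_pass(store, fake_fetch, reindexed.append)
second = run_enrichment_pass(store, fake_fetch, reindexed.append)
print(first, second, reindexed)
```

Marking the entity only after its edges are written and re-indexed means a failure partway through leaves it eligible for the next scheduled pass.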

Why enrich? Source systems are purpose-built and don’t know about entities in other domains. Enrichment discovers and materializes cross-system relationships that enable powerful lineage and impact queries.

Example: Connecting Models to A/B Tests

When MDS processes a new model instance, background enrichment jobs discover relationships through multi-hop inference:

Step 1: Direct link to pipeline

The model references a pipeline_run_id. An enrichment job hydrates the pipeline run and discovers its A/B test associations: GET /api/v1/pipeline-runs/train-weekly-ranking-20XX0101

Response:

{
  "run_id": "train-weekly-ranking-20XX0101",
  "pipeline": "weekly-ranking-trainer",
  "ab_test_cells": [
    {"test_id": "12345", "cell_number": 2, "cell_name": "treatment_ranking_v5"}
  ],
  ...
}

Step 2: Discover A/B test context
The enrichment job discovers that the pipeline ran for A/B test cell #2 and queries the Experimentation Platform for test details: GET /api/v1/tests/12345

{
  "test_id": "12345",
  "name": "Ranking Model v5 vs v4",
  "status": "ACTIVE",
  "cells": [{"cell_number": 1, "name": "control_ranking_v4"}],
  ...
}

Step 3: Infer transitive relationships
The enrichment job now has the complete chain:

Model Instance was produced by Pipeline Run
Pipeline Run was executed for A/B Test Cell #2
A/B Test Cell #2 belongs to A/B Test “Ranking Model v5 vs v4”
Model Instance is now associated with this A/B Test

The job writes the inferred relationship back to Datomic and triggers re-indexing, materializing these edges in the graph. MDS doesn’t just store what it’s told; it derives new knowledge by walking the graph in the background.
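
The three steps above can be sketched as one Python function that walks from a model instance to its pipeline run and on to the A/B test. The two fetcher callbacks stand in for the orchestrator and experimentation APIs. The edge shape, the `associated_ab_test` relation name, and the `aip://ab-test/experimentation/...` URI form are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List


def infer_ab_test_edges(
    model_instance: Dict[str, Any],
    get_pipeline_run: Callable[[str], Dict[str, Any]],
    get_test: Callable[[str], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Derive model-instance -> A/B-test edges by following the pipeline run."""
    # Hop 1: model instance -> pipeline run (last URI segment is the run id).
    run_id = model_instance["pipeline_run"].rsplit("/", 1)[-1]
    run = get_pipeline_run(run_id)
    edges: List[Dict[str, Any]] = []
    for cell in run.get("ab_test_cells", []):
        # Hop 2: pipeline run -> A/B test cell -> A/B test.
        test = get_test(cell["test_id"])
        # Hop 3: record the inferred, transitive relationship.
        edges.append(
            {
                "source": model_instance["id"],
                "relation": "associated_ab_test",  # hypothetical relation name
                "target": f"aip://ab-test/experimentation/{test['test_id']}",  # hypothetical URI form
                "cell_number": cell["cell_number"],
            }
        )
    return edges


# In-memory stand-ins built from the example responses above.
pipeline_runs = {
    "train-weekly-ranking-20XX0101": {
        "run_id": "train-weekly-ranking-20XX0101",
        "ab_test_cells": [
            {"test_id": "12345", "cell_number": 2, "cell_name": "treatment_ranking_v5"}
        ],
    }
}
tests = {"12345": {"test_id": "12345", "name": "Ranking Model v5 vs v4"}}
model_instance = {
    "id": "aip://model/registry/ranking-model-v5-20XX0101",
    "pipeline_run": "aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101",
}

inferred = infer_ab_test_edges(model_instance, pipeline_runs.__getitem__, tests.__getitem__)
print(inferred)
```

Storing each inferred edge with both endpoints is what lets the same relationship be followed in either direction later.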

Why this matters: Without MDS, answering “Which A/B tests are using this model?” requires:

Looking up the model in the Model Registry
Finding which pipeline produced it
Checking the Pipeline Orchestrator for A/B test tags
Querying the Experimentation Platform for test details

With the Model Lifecycle Graph, it’s a single query:

query {
  model(id: "aip://model/registry/ranking-model-v5-20XX0101") {
    name
    owners { name }
    currentInstance {
      version
      pipeline {
        name
        owners { name }
      }
      features {
        edges {
          node {
            name
            data { edges { node { name } } }
          }
        }
      }
      associatedAbTests {
        name
        cells { number name }
      }
    }
  }
}

The reverse query also works: “What models are being tested in experiment 12345?”

Enabling Exploration, Not Just Search

With the Model Lifecycle Graph in place, we shift from entity search to entity exploration. Discovery isn’t just about finding a model; it’s about traversing relationships:

Start with a model, explore its features
From features, navigate to the core data driving them
From the data, trace back to the pipelines producing it
From pipelines, see which teams own and depend on them
From experiments, understand which models are being tested

For example, consider an engineer investigating a degraded engagement metric for a personalization model. They might:

Start with the model instance powering the affected recommendations in the AIP Portal.
Inspect the model’s features and follow a suspicious feature to its upstream dataset.
From the dataset page, see that its pipeline recently had failed runs and identify the owning team.
Check which A/B tests are currently running this model instance to understand which members and surfaces are impacted.

Before MDS and the Model Lifecycle Graph, this required manual checks across multiple tools (model registry, pipeline orchestrator, experimentation platform). Now it’s a contiguous journey in a single interface.

This graph-based exploration answers questions that were previously impossible:

Lineage queries: What is the full lineage of this model, from training data to production experiments?
Impact analysis: Which models would be affected if I change this feature?
Usage discovery: Which A/B tests are using this model?
Dependency mapping: What data sources does my pipeline transitively depend on?
Deprecation planning: Which entities are no longer being used and can be retired?

Every entity has deep context: its creation time, ownership, update history, and most importantly, its relationships to other entities.

The Model Lifecycle Graph is surfaced to practitioners through the AIP Portal, a unified interface that provides full-text search across all entity types, detailed entity pages with navigable relationships, and personalized views for teams and individuals.

A typical interaction in the AIP Portal looks like:

Search: Type a model, feature, dataset, or team name into the single search box backed by Elasticsearch.
Inspect: Land on an entity page that shows key metadata (description, owners, domains, tags) alongside a relationships panel.
Explore: Click through to related entities (upstream datasets, downstream experiments, and sibling model versions) to navigate the Model Lifecycle Graph without leaving the portal.

When new entity types are introduced into MDS, the portal automatically provides baseline search, entity pages, and relationship navigation, and we can then layer on domain-specific visualizations (such as model deployment history or dataset version timelines) over time.

The Road Ahead: Open Challenges

Building the ML lifecycle graph is an ongoing journey. Significant challenges remain, and these represent future opportunities for us:

Tool Proliferation: As new ML tools emerge, we need robust integration patterns that scale. How do we design plugin architectures that make adding new sources seamless? If we don’t keep up with new tools, practitioners will be forced back into fragmented views, and the Model Lifecycle Graph will lose coverage and trust.
Domain-Specific Visualizations: Different entity types require distinct visualization experiences. Model pages should display deployment history, A/B test associations, and performance metrics. Feature pages should highlight data lineage and consuming models. Pipeline pages must show execution history, dependencies, and schedules. Dataset pages require versioning timelines and downstream consumers. How do we design a flexible UI framework that allows each entity type to have its own tailored experience while maintaining consistent navigation and interaction patterns across the portal? Without rich, domain-specific experiences, the portal risks becoming a generic catalog rather than a tool that ML practitioners rely on in their daily workflows.
Metadata Quality: Today, MDS ensures data consistency through source-of-truth hydration and schema validation at ingestion. Background enrichment jobs continuously infer relationships and materialize entities from source systems. However, challenges remain in ensuring completeness and timeliness at scale. When source systems fail to emit events, when ownership information becomes stale, or when entities lack descriptions and contextual metadata, the graph’s utility degrades. How do we build automated validation and enrichment systems to detect metadata anomalies, suggest missing relationships, and maintain quality benchmarks across millions of entities? Poor or stale metadata erodes practitioner trust: if the graph is incomplete or incorrect, teams will revert to ad hoc knowledge and one-off integrations rather than using MDS as their source of truth.
Advanced Relationship Inference: Beyond explicit relationships declared in source systems, how do we infer implicit connections? Can we detect that two models serve similar purposes based on shared features? Can we recommend features based on usage patterns from similar pipelines? We’re in the early stages of exploring these ideas. Done well, they would turn MDS from a passive catalog into an active recommendation engine for ML assets, accelerating reuse and reducing duplicate work across domains.

Acknowledgments

This work represents the collective effort of stunning colleagues across the AI Platform organization: Emma Carney, Megan Ren, Nadeem Ahmad, Pat Oleniuk, Prateek Agarwal, Tigran Hakobyan, Yinglao Liu.

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
