The journey from raw system events to a queryable graph happens in stages. Let's walk through each one with a concrete example: connecting a model to its A/B tests through relationship inference.
1 Event Ingestion
MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real time. Source systems emit thin events that include an identifier and an event type.
Example event:
{
  "event_type": "model_instance_created",
  "instance_id": "ranking-model-v5-20XX0101",
  ...
}

This design keeps producers simple. Source systems only need to announce that a change occurred, without building full payloads or understanding downstream requirements.
Each source system has dedicated event handlers in MDS:
- Pipeline Orchestration: Ingests pipeline execution events, including node definitions, schedules, requests, and job attempts
- Model Registry: Captures model deployments, configurations, and version updates
- Feature Store: Tracks feature definitions and their versions
- Experimentation Platform: Monitors A/B test configurations and allocations
- Datasets: Tracks ML datasets and their versions
- Identity Platform: Maintains ownership and team membership information
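The per-source handler setup above can be illustrated with a small dispatch table. This is a hypothetical sketch, not MDS's actual code: the registry dict, decorator, and handler body are assumptions for illustration; only the `model_instance_created` event shape comes from the example payload in the post.

```python
# Hypothetical sketch of thin-event dispatch to per-source handlers.
# The HANDLERS registry and handler behavior are illustrative assumptions.
from typing import Callable, Dict

HANDLERS: Dict[str, Callable[[dict], str]] = {}

def handler(event_type: str):
    """Register a handler for one thin event type."""
    def register(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[event_type] = fn
        return fn
    return register

@handler("model_instance_created")
def handle_model_instance(event: dict) -> str:
    # A thin event carries only an identifier and a type; the handler
    # decides which source-of-truth API to hydrate from later.
    return f"hydrate model-registry instance {event['instance_id']}"

def dispatch(event: dict) -> str:
    fn = HANDLERS.get(event["event_type"])
    if fn is None:
        raise ValueError(f"no handler for {event['event_type']}")
    return fn(event)
```

Because producers only announce the change, adding a new source system amounts to registering one more handler rather than teaching every producer about downstream consumers.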
2 Entity Enrichment
MDS implements a hydration contract for each event type. When an event arrives, MDS:
- Validates the event schema
- Calls the source system's API to fetch the complete, current state
- Transforms the response into a normalized entity
This design has a crucial property: the order of events doesn't matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes.
This notification-of-change pattern has a few important tradeoffs. On the plus side, it keeps producers simple, makes us robust to out-of-order or dropped events, and ensures that MDS can always reconcile to the latest state by reading from the source of truth. The tradeoff is that we place extra read load on source systems during hydration and must be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don't overload them.
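The validate-fetch-transform contract with backoff can be sketched as follows. This is a minimal sketch under stated assumptions: the required-field check, the injected `fetch_latest` callable, and the exponential backoff parameters are all illustrative, not MDS's real validation or API client.

```python
# Minimal sketch of the hydration contract: validate the thin event,
# then fetch the complete current state from the source of truth,
# backing off between attempts to avoid overloading the source system.
# Schema check, fetch function, and retry parameters are assumptions.
import time

def hydrate(event, fetch_latest,
            required=("event_type", "instance_id"),
            retries=3, backoff_s=0.0):
    # 1. Validate the event schema.
    missing = [k for k in required if k not in event]
    if missing:
        raise ValueError(f"invalid event, missing fields: {missing}")
    # 2. Fetch the complete, current state from the source of truth.
    for attempt in range(retries):
        try:
            return fetch_latest(event["instance_id"])
        except IOError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError("source system unavailable after retries")
```

Note that because the fetch always returns the latest state, replaying the same event (or processing events out of order) is idempotent: each hydration simply reconciles to whatever the source of truth currently says.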
For our ranking model example, when the model_instance_created event arrives, MDS calls the Model Registry API: GET /api/v1/instances/ranking-model-v5-20XX0101
The registry responds with a full descriptor. Example response (key fields only):
{
  "id": "ranking-model-v5-20XX0101",
  "pipeline_run_id": "train-weekly-ranking-20XX0101",
  "owner_emails": ["alice@netflix.com"],
  "labels": [{"key": "team", "value": "personalization"}],
  ...
}

3 Data Transformation and Normalization
Raw events are heterogeneous, and each source system has its own schema and semantics. MDS workers transform these events into a unified entity model with standardized fields.
Without normalization, downstream consumers would need to understand every source system's schema. Normalization creates a consistent interface, allowing queries and relationships to work across all entity types. Here is an example.
Normalized MDS entity:
{
  "id": "aip://model/registry/ranking-model-v5-20XX0101",
  "pipeline_run": "aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101",
  "entity_type": "ModelInstance",
  "owners": ["aip://user/identity/alice"],
  "tags": [{"tag": "team", "value": "personalization"}],
  ...
}

The normalization process standardizes field names and formats. For example, platform-specific IDs become global AIP URIs, owner_emails becomes owners with resolved user URIs, and labels become tags. Foreign keys like pipeline_run_id are transformed into entity references. However, there is still no reference to which A/B tests are using this model. The Model Registry doesn't track experiments, and the Experimentation Platform doesn't track which pipeline produced a given model. This is where data enrichment becomes essential.
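The field-by-field mapping between the two payloads above can be sketched as a small transform. The function name is hypothetical and the email-to-user-URI resolution (splitting on `@`) is an assumption for illustration; the input and output field names mirror the example payloads.

```python
# Illustrative sketch of the normalization step: platform-specific IDs
# become global AIP URIs, owner_emails become owners with user URIs,
# and labels become tags. The URI construction is an assumption.

def normalize_model_instance(raw: dict) -> dict:
    return {
        # Platform-specific ID -> global AIP URI.
        "id": f"aip://model/registry/{raw['id']}",
        "entity_type": "ModelInstance",
        # Foreign key -> entity reference.
        "pipeline_run": f"aip://pipeline-run/orchestrator/{raw['pipeline_run_id']}",
        # owner_emails -> owners with resolved user URIs (naive resolution here).
        "owners": [f"aip://user/identity/{e.split('@')[0]}"
                   for e in raw["owner_emails"]],
        # labels -> tags with standardized key names.
        "tags": [{"tag": l["key"], "value": l["value"]}
                 for l in raw["labels"]],
    }
```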
4 Storage and Indexing
Once normalized, entities are persisted to Datomic and immediately indexed in Elasticsearch. This happens synchronously within the event processing flow.
Datomic for Caching and Relationships
Normalized entities are first written to Datomic, which serves as both a local cache and a graph database.
Why Datomic? Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. Its immutable fact model means we can continuously add relationships without losing the original entity state.
What we store:
- All entity attributes as facts
- Entity references (foreign keys that may point to entities not yet fully resolved)
- All relationships as reified edges (added by enrichment processes)
- Entity lifecycle state (tracking which entities are fully enriched vs. awaiting hydration)
This enables:
- Complex graph traversals: Navigate from a model to its features to their data sources in a single query
- Entity relationships: Join across multiple domains without N+1 query problems
- Flexible schema evolution: Easy to add new entity types and attributes as the catalog grows
- Progressive enrichment: Background jobs efficiently identify and process entities requiring additional hydration, enabling gradual graph completion without reprocessing fully enriched entities
In practice, we use Datomic for relationship-heavy, navigational queries such as:
- Starting from this model instance, show me all upstream datasets and downstream experiments.
- Given this feature, list all consuming models and their owning teams.
These queries often span multiple hops in the graph and benefit from Datomic's immutable fact model and efficient joins across entity relationships.
Elasticsearch for Discovery
Immediately after writing to Datomic, entities are indexed in Elasticsearch to power fast, full-text search across the catalog.
What we index:
- Primary fields: Entity name, description, entity type, owner names
- Relationship metadata: Names of related entities (e.g., a model's features, pipelines, A/B tests) stored in the related field
- Tags: Domain-specific metadata stored as key-value pairs (e.g., team::personalization, env::production, model.state::launched)
Index structure:
- Single entities index: All entity types (models, features, pipelines, etc.) are indexed in a single unified index, differentiated by the entityType field
- Separate owners index: Dedicated index for users and groups to enable cross-entity owner searches
- Relevance boosting: Exact name matches score higher than other relevant matches
This enables:
- Multi-field text search across entity names, descriptions, tags, and related metadata
- Relevance ranking with boosting (exact name matches score significantly higher)
- Complex filtering by entity type, ownership, tags, and domain-specific attributes (stored as tags)
- Fuzzy matching to handle typos and partial queries
Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page. Indexing happens in near real time as part of the ingestion and enrichment workflows, so changes are usually visible in the Portal with a short delay that's acceptable for interactive use.
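The kind of search request this implies might look like the following query body. This is a hedged sketch: the post does not show the actual mapping or query DSL, so the field names (`name`, `description`, `tags.value`, `related`, `entityType`), boost factors, and filter shape are assumptions consistent with the index description above.

```python
# Hypothetical sketch of an Elasticsearch query body for catalog search:
# multi-field matching with a name boost, fuzzy matching for typos, and
# a term filter on entityType. Field names and boosts are assumptions.
from typing import Optional

def search_body(text: str, entity_type: Optional[str] = None) -> dict:
    query = {
        "bool": {
            "must": [{
                "multi_match": {
                    "query": text,
                    # Boost name matches above description/tag/related hits.
                    "fields": ["name^3", "description", "tags.value", "related"],
                    "fuzziness": "AUTO",  # tolerate typos and partial queries
                }
            }],
            "filter": [],
        }
    }
    if entity_type:
        # Restrict to one entity type within the unified entities index.
        query["bool"]["filter"].append({"term": {"entityType": entity_type}})
    return {"query": query}
```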
5 Knowledge Enrichment and Graph Formation
Once entity metadata is persisted in Datomic, scheduled background processes take over to discover and materialize relationships. These enrichment jobs run periodically, scanning for uncached or partially resolved entities (entities that exist only as references without full metadata).
The enrichment workflow:
- Identify candidates: Find entities marked as uncached or with unresolved references
- Hydrate relationships: Query source-of-truth systems to fetch related entity details
- Materialize edges: Write discovered relationships back to Datomic
- Re-index: Trigger Elasticsearch indexing for updated entities
- Mark as enriched: Update entity status to prevent redundant processing
This asynchronous approach allows MDS to handle the computational cost of graph formation without blocking real-time event ingestion. It also allows retry logic and gradual enrichment as new entities become available.
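One pass of this workflow can be sketched as a loop over pending entities. The in-memory store, the `status`/`edges` field names, and the injected `fetch_related` and `index` callables are illustrative assumptions standing in for Datomic and Elasticsearch.

```python
# Sketch of one background enrichment pass: find pending entities,
# hydrate their relationships, materialize edges, re-index, and mark
# them enriched. Store shape and field names are assumptions.

def run_enrichment_pass(store, fetch_related, index):
    enriched = 0
    # 1. Identify candidates: uncached or partially resolved entities.
    pending = [e for e in store.values() if e.get("status") != "enriched"]
    for entity in pending:
        # 2. Hydrate relationships from the source-of-truth system.
        related = fetch_related(entity["id"])
        # 3. Materialize the discovered edges (Datomic in MDS).
        entity.setdefault("edges", []).extend(related)
        # 4. Re-index so the new relationships become searchable.
        index(entity)
        # 5. Mark as enriched to prevent redundant processing.
        entity["status"] = "enriched"
        enriched += 1
    return enriched
```

Because each pass only touches entities not yet marked enriched, repeated runs are cheap and the graph completes gradually without reprocessing finished entities.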
Because enrichment is asynchronous, newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it's safe to rely on a particular relationship for debugging or impact analysis.
Why enrich? Source systems are purpose-built and don't know about entities in other domains. Enrichment discovers and materializes cross-system relationships that enable powerful lineage and impact queries.
Example: Connecting Models to A/B Tests
When MDS processes a new model instance, background enrichment jobs discover relationships through multi-hop inference:
Step 1: Direct link to pipeline
The model references a pipeline_run_id. An enrichment job hydrates the pipeline and discovers its A/B test associations: GET /api/v1/pipeline-runs/train-weekly-ranking-20XX0101
Response:
{
  "run_id": "train-weekly-ranking-20XX0101",
  "pipeline": "weekly-ranking-trainer",
  "ab_test_cells": [
    {"test_id": "12345", "cell_number": 2, "cell_name": "treatment_ranking_v5"}
  ],
  ...
}

Step 2: Discover A/B test context
The enrichment job discovers that the pipeline ran for A/B test cell #2 and queries the Experimentation Platform for test details: GET /api/v1/tests/12345
{
  "test_id": "12345",
  "name": "Ranking Model v5 vs v4",
  "status": "ACTIVE",
  "cells": [{"cell_number": 1, "name": "control_ranking_v4"}],
  ...
}

Step 3: Infer transitive relationships
The enrichment job now has the complete chain:
- Model Instance was produced by Pipeline Run
- Pipeline Run was executed for A/B Test Cell #2
- A/B Test Cell #2 belongs to A/B Test "Ranking Model v5 vs v4"
- Model Instance now gets associated with this A/B Test
The job writes the inferred relationship back to Datomic, triggers re-indexing, and materializes these edges in the graph. MDS doesn't just store what it's told; it derives new knowledge by walking the graph in the background.
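The transitive step boils down to composing two edge sets: model → pipeline run and pipeline run → A/B test cells. A minimal sketch, with a hypothetical function name and an assumed dict-based edge representation standing in for the Datomic graph:

```python
# Minimal sketch of the multi-hop inference: compose model -> pipeline-run
# and pipeline-run -> A/B-test-cell edges into direct model -> A/B-test
# associations. Edge representation is an assumption for illustration.

def infer_model_ab_tests(model_to_run: dict, run_to_cells: dict) -> dict:
    """Derive model -> sorted list of A/B test ids by walking two hops."""
    inferred = {}
    for model_id, run_id in model_to_run.items():
        cells = run_to_cells.get(run_id, [])
        # Several cells may belong to the same test; deduplicate test ids.
        inferred[model_id] = sorted({cell["test_id"] for cell in cells})
    return inferred
```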
Why this matters: Without MDS, answering "Which A/B tests are using this model?" requires:
- Looking up the model in the Model Registry
- Finding which pipeline produced it
- Checking the Pipeline Orchestrator for A/B test tags
- Querying the Experimentation Platform for test details
With the model lifecycle graph, it's a single query:
query {
  model(id: "aip://model/registry/ranking-model-v5-20XX0101") {
    name
    owners { name }
    currentInstance {
      version
      pipeline {
        name
        owners { name }
      }
      features {
        edges {
          node {
            name
            data { edges { node { name } } }
          }
        }
      }
      associatedAbTests {
        name
        cells { number name }
      }
    }
  }
}

The reverse query also works: "What models are being tested in experiment 12345?"
Enabling Exploration, Not Just Search
With the Model Lifecycle Graph in place, we shift from entity search to entity exploration. Discovery isn't just about finding a model; it's about traversing relationships:
- Start with a model, explore its features
- From features, navigate to the core data driving them
- From the data, trace back to the pipelines producing it
- From pipelines, see which teams own and depend on them
- From experiments, understand which models are being tested
For example, consider an engineer investigating a degraded engagement metric for a personalization model. They might:
- Start with the model instance powering the affected recommendations in the AIP Portal.
- Inspect the model's features and follow a suspicious feature to its upstream dataset.
- From the dataset page, see that its pipeline recently had failed runs and identify the owning team.
- Confirm which A/B tests are currently running this model instance to understand which members and surfaces are impacted.
Before MDS and the Model Lifecycle Graph, this required manual checks across multiple tools (model registry, pipeline orchestrator, experimentation platform). Now it's a contiguous journey in a single interface.
This graph-based exploration answers questions that were previously impossible:
- Lineage queries: What is the full lineage of this model, from training data to production experiments?
- Impact analysis: Which models would be affected if I change this feature?
- Usage discovery: Which A/B tests are using this model?
- Dependency mapping: What data sources does my pipeline transitively depend on?
- Deprecation planning: Which entities are no longer being used and can be retired?
Every entity has deep context: its creation time, ownership, update history, and most importantly, its relationships to other entities.
The Model Lifecycle Graph is surfaced to practitioners through the AIP Portal, a unified interface that provides full-text search across all entity types, detailed entity pages with navigable relationships, and personalized views for teams and individuals.
A typical interaction in the AIP Portal looks like:
- Search: Type a model, feature, dataset, or team name into the single search box backed by Elasticsearch.
- Inspect: Land on an entity page that shows key metadata (description, owners, domains, tags) alongside a relationships panel.
- Explore: Click through to related entities (upstream datasets, downstream experiments, and sibling model versions) to navigate the Model Lifecycle Graph without leaving the portal.
When new entity types are introduced into MDS, the portal automatically provides baseline search, entity pages, and relationship navigation, and we can then layer on domain-specific visualizations (such as model deployment history or dataset version timelines) over time.
The Road Ahead: Open Challenges
Building the ML lifecycle graph is an ongoing journey. Significant challenges remain, and these represent future opportunities for us:
- Tool Proliferation: As new ML tools emerge, we need robust integration patterns that scale. How do we design plugin architectures that make adding new sources seamless? If we don't keep up with new tools, practitioners will be forced back into fragmented views, and the Model Lifecycle Graph will lose coverage and trust.
- Domain-Specific Visualizations: Different entity types require distinct visualization experiences. Model pages should display deployment history, A/B test associations, and performance metrics. Feature pages should highlight data lineage and consuming models. Pipeline pages must show execution history, dependencies, and schedules. Dataset pages require versioning timelines and downstream consumers. How do we design a flexible UI framework that allows each entity type to have its own tailored experience while maintaining consistent navigation and interaction patterns across the portal? Without rich, domain-specific experiences, the portal risks becoming a generic catalog rather than a tool that ML practitioners rely on in their daily workflows.
- Metadata Quality: Today, MDS ensures data consistency through source-of-truth hydration and schema validation at ingestion. Background enrichment jobs continuously infer relationships and materialize entities from source systems. However, challenges remain in ensuring completeness and timeliness at scale. When source systems fail to emit events, when ownership information becomes stale, or when entities lack descriptions and contextual metadata, the graph's utility degrades. How do we build automated validation and enrichment systems to detect metadata anomalies, suggest missing relationships, and maintain quality benchmarks across millions of entities? Poor or stale metadata erodes practitioner trust: if the graph is incomplete or incorrect, teams will revert to ad hoc knowledge and one-off integrations rather than using MDS as their source of truth.
- Advanced Relationship Inference: Beyond explicit relationships declared in source systems, how do we infer implicit connections? Can we detect that two models serve similar purposes based on shared features? Can we recommend features based on usage patterns from similar pipelines? We're in the early stages of exploring these ideas. Done well, they would turn MDS from a passive catalog into an active recommendation engine for ML assets, accelerating reuse and reducing duplicate work across domains.
Acknowledgments
This work represents the collective effort of amazing colleagues across the AI Platform organization: Emma Carney, Megan Ren, Nadeem Ahmad, Pat Olenik, Prateek Agarwal, Tigran Hakobyan, Yinglao Liu
