MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix

By Team Entertainer | February 23, 2026 (Updated: February 24, 2026) | 9 Mins Read


Netflix Technology Blog

Avneesh Saluja, Santiago Castro, Bowei Yan, Ashish Rastogi

Introduction

Netflix’s core mission is to connect millions of members around the world with stories they’ll love. This requires not just an incredible catalog, but also a deep, machine-level understanding of every piece of content in that catalog, from the biggest blockbusters to the most niche documentaries. As we onboard new types of content such as live events and podcasts, the need to scalably understand these nuances becomes even more critical to our productions and member-facing experiences.

Many of these media-related tasks require sophisticated long-form video understanding, e.g., identifying subtle narrative dependencies and emotional arcs that span entire episodes or films. Earlier work has found that to truly grasp the content’s essence, our models must leverage the full multimodal signal. For example, the audio soundtrack is a crucial, non-visual modality that can help more precisely identify clip-level tones or when a new scene begins. Can we use our collection of shows and movies to learn how to a) fuse modalities like audio, video, and subtitle text together and b) develop robust representations that leverage the narrative structure present in long-form entertainment? Consisting of tens of millions of individual shots across multiple titles, our diverse yet entertainment-specific dataset provides the perfect foundation to train multimodal media understanding models that enable many capabilities across the company, such as ads relevancy, clip popularity prediction, and clip tagging.

For these reasons, we developed the Netflix Media Foundation Model (MediaFM), our new, in-house, multimodal content embedding model. MediaFM is the first tri-modal (audio, video, text) model pretrained on portions of the Netflix catalog. Its core is a multimodal, Transformer-based encoder designed to generate rich, contextual embeddings¹ for shots from our catalog by learning the temporal relationships between them through integrating visual, audio, and textual information. The resulting shot-level embeddings are powerful representations designed to create a deeper, more nuanced, and machine-readable understanding of our content, providing the essential backbone for effective cold start of newly launching titles in recommendations, optimized promotional assets (like artwork and trailers), and internal content analysis tools.

Figure 1: MediaFM Architecture

Input Representation & Preprocessing

The model’s fundamental unit of input is a shot, derived by segmenting a movie or episode (collectively referred to as a “title”) using a shot boundary detection algorithm. For each shot, we generate three distinct embeddings from its core modalities:

  • Video: an internal model called SeqCLIP (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots
  • Audio: the audio samples from the same shots are embedded using Meta FAIR’s wav2vec2
  • Timed Text: OpenAI’s text-embedding-3-large model is used to encode the corresponding timed text (e.g., closed captions, audio descriptions, or subtitles) for each shot

For each shot, the three embeddings² are concatenated and unit-normed to form a single 2304-dimensional fused embedding vector. The transformer encoder is trained on sequences of shots, so each example in our dataset is a temporally-ordered sequence of these fused embeddings from the same movie or episode (up to 512 shots per sequence). We also have access to title-level metadata, which is used to provide global context for each sequence (via the [GLOBAL] token). The title-level embedding is computed by passing title-level metadata (such as synopses and tags) through the text-embedding-3-large model.
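A minimal sketch of this fusion step. The post only states that the fused vector is 2304-dimensional; the equal 768-dimensional per-modality split below is an assumption for illustration:

```python
import numpy as np

def fuse_shot_embeddings(video_emb, audio_emb, text_emb):
    """Concatenate per-shot modality embeddings and unit-normalize the result.

    For shots without timed text, a zero vector is passed in (the post notes
    that missing timed text is zero-padded).
    """
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused

# Hypothetical per-modality sizes; only the 2304-dim total is stated.
video = np.random.randn(768)   # SeqCLIP embedding, pooled per shot
audio = np.random.randn(768)   # wav2vec2 embedding
text = np.zeros(768)           # zero-padded: a shot without dialogue

fused = fuse_shot_embeddings(video, audio, text)
print(fused.shape)  # (2304,)
```

Unit-norming keeps every shot on the same scale regardless of which modalities carry signal, which matters downstream when the training loss is cosine-based.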

Model Architecture and Training Objective

The core of our model is a transformer encoder, architecturally similar to BERT. A sequence of preprocessed shot embeddings is passed through the following stages:

  1. Input Projection: The fused shot embeddings are first projected down to the model’s hidden dimension via a linear layer.
  2. Sequence Construction & Special Tokens: Before entering the Transformer, two special embeddings are prepended to the sequence:
    • a learnable [CLS] embedding is added at the very beginning.
    • the title-level embedding is projected to the model’s hidden dimension and inserted after the [CLS] token as the [GLOBAL] token, providing title-level context to every shot in the sequence and participating in the self-attention process.
  3. Contextualization: The sequence is augmented with positional embeddings and fed through the Transformer stack to produce shot representations based on their surrounding context.
  4. Output Projection: The contextualized hidden states from the Transformer are passed through a final linear layer, projecting them from the hidden dimension back up to the 2304-dimensional fused embedding space for prediction.
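The four stages can be traced shape-by-shape in a small sketch. The hidden dimension (512), the title-metadata dimension (3072, the native size of text-embedding-3-large), and the single softmax self-attention layer standing in for the full BERT-style stack are all illustrative assumptions, not details from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

FUSED_DIM, HIDDEN_DIM, N_SHOTS = 2304, 512, 6  # HIDDEN_DIM is illustrative

# Learned parameters (random stand-ins here)
W_in = rng.normal(size=(FUSED_DIM, HIDDEN_DIM)) * 0.02   # input projection
W_out = rng.normal(size=(HIDDEN_DIM, FUSED_DIM)) * 0.02  # output projection
W_title = rng.normal(size=(3072, HIDDEN_DIM)) * 0.02     # title-metadata projection
cls_tok = rng.normal(size=(1, HIDDEN_DIM))               # learnable [CLS]
pos_emb = rng.normal(size=(N_SHOTS + 2, HIDDEN_DIM))     # positional embeddings

shots = rng.normal(size=(N_SHOTS, FUSED_DIM))            # fused shot embeddings
title_meta = rng.normal(size=(3072,))                    # title-level text embedding

# 1. Input projection down to the hidden dimension
h = shots @ W_in                                          # (N_SHOTS, HIDDEN_DIM)
# 2. Prepend the [CLS] token and the projected [GLOBAL] token
glob = (title_meta @ W_title)[None, :]
seq = np.concatenate([cls_tok, glob, h], axis=0)          # (N_SHOTS + 2, HIDDEN_DIM)
# 3. Add positional embeddings; one softmax self-attention layer as a stand-in
seq = seq + pos_emb
scores = seq @ seq.T / np.sqrt(HIDDEN_DIM)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
ctx = attn @ seq                                          # contextualized states
# 4. Output projection back to the fused-embedding space (skip special tokens)
pred = ctx[2:] @ W_out
print(pred.shape)  # (6, 2304)
```

Because the [GLOBAL] token participates in self-attention, every shot representation can attend to the title-level metadata at each layer.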

We train the model using a Masked Shot Modeling (MSM) objective. In this self-supervised task, we randomly mask 20% of the input shot embeddings in each sequence by replacing them with a learnable [MASK] embedding. The model’s objective is to predict the original, unmasked fused embedding for these masked positions. The model is optimized by minimizing the cosine distance between its predicted embedding and the ground-truth embedding for each masked shot.
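A minimal sketch of the MSM corruption and loss, with random stand-ins for the learned [MASK] embedding and the model's predictions:

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_shots(seq, mask_emb, mask_prob=0.2):
    """Replace ~20% of shot embeddings with a learnable [MASK] embedding."""
    masked_idx = rng.random(seq.shape[0]) < mask_prob
    corrupted = seq.copy()
    corrupted[masked_idx] = mask_emb
    return corrupted, masked_idx

def msm_loss(pred, target, masked_idx):
    """Mean cosine distance between prediction and target at masked positions."""
    p, t = pred[masked_idx], target[masked_idx]
    cos = np.sum(p * t, axis=1) / (np.linalg.norm(p, axis=1) * np.linalg.norm(t, axis=1))
    return float(np.mean(1.0 - cos))

seq = rng.normal(size=(512, 2304))    # one training sequence of fused embeddings
mask_emb = rng.normal(size=(2304,))   # stand-in for the learnable [MASK]
corrupted, masked_idx = mask_shots(seq, mask_emb)

# A perfect model would reproduce the targets exactly, giving zero loss.
print(round(msm_loss(seq, seq, masked_idx), 6))  # 0.0
```

Note the loss is computed only at masked positions; unmasked shots serve purely as context.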

We optimized the hidden parameters with Muon and the remaining parameters with AdamW. It’s worth noting that the switch to Muon resulted in noticeable improvements.

Evaluation

To evaluate the learned embeddings, we train task-specific linear layers on top of frozen representations (i.e., linear probes). Most of the tasks are clip-level, i.e., each example is a short clip, ranging from a few seconds to a minute, of the kind often presented to our members when recommending a title to them. When embedding these clips, we find that “embedding in context”, namely extracting the embeddings from within a larger sequence (e.g., the episode containing the clip), does much better than embedding only the shots from the clip.

Tasks

Our embeddings are foundational, and we find that they bring value to applications across Netflix. Here are a few:

  • Ad Relevancy: A multilabel classification task to categorize Netflix clips for relevant ad placement, measured by Average Precision. In this task, the representations operate at the retrieval stage, where they help identify the candidate set and in turn are fed into the ad serving system for relevance optimization.
  • Clip Popularity Ranking: A ranking task to predict the relative performance (in click-through rate, CTR) of a media clip relative to other clips from that show or movie, measured via ten-fold evaluation with Kendall’s tau correlation coefficient.
  • Clip Tone: A multi-label classification of hook clips into 100 tone categories (e.g., creepy, scary, funny) from our internal Metadata & Rankings team, measured by micro Average Precision (averaged across tone categories).
  • Clip Genre: A multi-label classification of clips into eleven core genres (Action, Anime, Comedy, Documentary, Drama, Fantasy, Horror, Kids, Romance, Sci-fi, Thriller) derived from the genre of the parent title, measured by macro Average Precision (averaged across genres).
  • Clip Retrieval: a binary classification of clips from movies or episodes into “clip-worthy” (i.e., a good clip to showcase the title) or not, as determined by human annotators, and as measured by Average Precision. The positive to negative clip ratio is 1:3, and for each title we select 6–10 positive clips and the corresponding number of negatives.
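Most of these probes are scored with Average Precision. A minimal sketch of the metric itself (the clip scores and labels below are hypothetical stand-ins for one probe output):

```python
import numpy as np

def average_precision(labels, scores):
    """AP: mean of precision@k, evaluated at the rank of each positive."""
    order = np.argsort(-np.asarray(scores))  # rank clips by descending score
    ranked = np.asarray(labels)[order]
    hits, precisions = 0, []
    for k, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

# Hypothetical probe scores for five clips on one label, e.g. "clip-worthy"
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
# Ranked labels: [1, 0, 1, 0, 0] -> precisions at positives: 1/1 and 2/3
print(round(average_precision(labels, scores), 4))  # 0.8333
```

Micro vs. macro AP then differ only in whether the metric is pooled over all (clip, label) pairs or averaged per label.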

It’s worth noting that for the tasks above (as well as other tasks that use our model), the model outputs are used as information the relevant teams consider when driving toward a decision rather than being used in a fully end-to-end fashion. Many of the improvements are also in various stages of deployment.

Results

Figure 2³ compares MediaFM to several strong baselines:

Figure 2: Performance of MediaFM vs. external and internal models.

On all tasks, MediaFM is better than the baselines. Improvements appear to be larger for tasks that require more detailed narrative understanding, e.g., predicting the most relevant ads for an ad break given the surrounding context. We look further into this next.

Ablations

MediaFM’s main improvements over earlier Netflix work stem from two key areas: combining multiple modalities and learning to contextualize shot representations. To determine the contribution of each factor across different tasks, we compared MediaFM to a baseline. This baseline concatenates the three input embeddings, essentially providing the same full, shot-level input as MediaFM but without the contextualization step. This comparison allows us to isolate which tasks benefit most from the contextualization aspect.

Additional modalities help significantly for tone, but the main improvement comes from contextualization.

Oddly, multiple uncontextualized modalities hurt the clip popularity ranking model, but adding contextualization significantly improves performance.

For clip retrieval, we see a natural progression of around 15% for each improvement.

Next Steps

MediaFM presents a way to learn how to fuse and/or contextualize shot-level information by leveraging Netflix’s catalog in a self-supervised manner. With this perspective, we are actively investigating how pretrained multimodal (audio, video/image, text) LLMs like Qwen3-Omni, where the modality fusion has already been learned, can provide an even stronger starting point for subsequent model generations.

Next in this series of blog posts, we will present our methodology for embedding title-level metadata and adapting it to our needs. Stay tuned!

Footnotes

  1. We chose embeddings over generative text outputs to prioritize modular design. This provides a tighter, cleaner abstraction layer: we generate the representation once, and it is consumed across our entire suite of services. This avoids the architectural fragility of fine-tuning, allowing us to enhance our existing embedding-based workflows with new modalities more flexibly.
  2. All of our data has audio and video; we zero-pad for missing timed text data, which is relatively likely to occur (e.g., in shots without dialogue).
  3. The title-level tasks could not be evaluated with the VertexAI MM and Marengo embedding models because the videos exceed the length limit set by the APIs.

Acknowledgements

We would like to thank Matt Thanabalan and Chaitanya Ekanadham for their contributions to this work.


