By Jiangwei Pan, Gary Tang, Henry Wang, and Justin Basilico
Our mission at Netflix is to entertain the world. Our personalization algorithms play an important role in delivering on this mission for all members by recommending the right shows, movies, and games at the right time. This goal extends beyond immediate engagement; we aim to create an experience that brings lasting enjoyment to our members. Traditional recommender systems often optimize for short-term metrics like clicks or engagement, which may not fully capture long-term satisfaction. We strive to recommend content that not only engages members in the moment but also enhances their long-term satisfaction, which increases the value they get from Netflix, and thus makes them more likely to remain members.
One simple way to view recommendations is as a contextual bandit problem. When a member visits, that becomes a context for our system, which selects an action of what recommendations to show; the member then provides various types of feedback. These feedback signals can be immediate (skips, plays, thumbs up/down, or adding items to their list) or delayed (completing a show or renewing their subscription). We can define reward functions to reflect the quality of the recommendation from these feedback signals and then train a contextual bandit policy on historical data to maximize the expected reward.
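As a toy illustration of this loop, a contextual bandit policy might be sketched as follows. Everything here (the epsilon-greedy linear policy, the item count, and the context dimension) is an assumption for the sketch, not Netflix's production system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 candidate items, 8-dimensional member context.
n_items, dim = 5, 8
weights = rng.normal(size=(n_items, dim))  # illustrative per-item policy weights

def select_action(context: np.ndarray, epsilon: float = 0.1) -> int:
    """Epsilon-greedy contextual bandit: occasionally explore a random item,
    otherwise pick the item with the highest predicted reward for this context."""
    if rng.random() < epsilon:
        return int(rng.integers(n_items))
    return int(np.argmax(weights @ context))

context = rng.normal(size=dim)  # features describing the visiting member
action = select_action(context)
```

In practice the member's feedback on the chosen action would be logged, converted into a reward, and used to retrain the policy.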
There are many ways a recommendation model can be improved: more informative input features, more data, different architectures, more parameters, and so on. In this post, we focus on a less-discussed aspect of improving the recommender objective: defining a reward function that tries to better reflect long-term member satisfaction.
Member retention might seem like an obvious reward for optimizing long-term satisfaction, because members should stay if they are satisfied. However, it has several drawbacks:
- Noisy: Retention can be influenced by numerous external factors, such as seasonal trends, marketing campaigns, or personal circumstances unrelated to the service.
- Low Sensitivity: Retention is only sensitive for members on the verge of canceling their subscription, so it does not capture the full spectrum of member satisfaction.
- Hard to Attribute: Members might cancel only after a series of bad recommendations.
- Slow to Measure: We only get one signal per account per month.
Because of these challenges, optimizing for retention alone is impractical.
Instead, we can train our bandit policy to optimize a proxy reward function that is highly aligned with long-term member satisfaction while remaining sensitive to individual recommendations. The proxy reward r(user, item) is a function of the user's interaction with the recommended item. For example, if we recommend "One Piece" and a member plays, subsequently completes, and gives it a thumbs-up, a simple proxy reward might be defined as r(user, item) = f(play, complete, thumb).
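A minimal sketch of such a proxy reward, using the signal names from the example above (the weights are made up for illustration):

```python
def proxy_reward(play: bool, complete: bool, thumb: int) -> float:
    """Illustrative r(user, item) = f(play, complete, thumb).
    thumb is +1 (thumbs-up), -1 (thumbs-down), or 0 (no rating);
    all weights are invented for this sketch."""
    return 0.5 * play + 1.0 * complete + 1.5 * thumb

# A member plays, completes, and thumbs-up the recommended title:
r = proxy_reward(play=True, complete=True, thumb=1)  # 3.0
```

Reward engineering, discussed below, is largely about choosing which signals enter f and how heavily each is weighted.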
Click-through rate (CTR)
Click-through rate (CTR), or in our case play-through rate, can be viewed as a simple proxy reward where r(user, item) = 1 if the user clicks a recommendation and 0 otherwise. CTR is a common feedback signal that generally reflects user preference, and it is a simple yet strong baseline for many recommendation applications. In some cases, such as ads personalization where the click is the target action, CTR may even be a reasonable reward for production models. In general, however, over-optimizing CTR can lead to promoting clickbaity items, which may harm long-term satisfaction.
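As a sketch, the CTR proxy reward is a simple indicator, and an item's empirical CTR is just the mean of that reward over its logged impressions:

```python
def ctr_reward(clicked: bool) -> float:
    """CTR as a proxy reward: 1 for a click (or play), 0 otherwise."""
    return 1.0 if clicked else 0.0

# Toy impression log for one item: the empirical CTR is the mean reward.
impressions = [True, False, False, True, False]
ctr = sum(ctr_reward(c) for c in impressions) / len(impressions)  # 0.4
```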
Beyond CTR
To align the proxy reward function more closely with long-term satisfaction, we need to look beyond simple interactions, consider all types of user actions, and understand their true implications for user satisfaction.
We give a few examples in the Netflix context:
- Fast season completion ✅: Completing a season of a recommended TV show in one day is a strong sign of enjoyment and long-term satisfaction.
- Thumbs-down after completion ❌: Completing a TV show over several weeks followed by a thumbs-down indicates low satisfaction despite the significant time spent.
- Playing a movie for just 10 minutes ❓: In this case, the user's satisfaction is ambiguous. The brief engagement might indicate that the user decided to abandon the movie, or it could simply mean the user was interrupted and plans to finish the movie later, perhaps the next day.
- Discovering new genres ✅ ✅: Watching more Korean or game shows after "Squid Game" suggests the user is discovering something new. This discovery was likely even more valuable, since it led to a variety of engagements in a new area for the member.
Reward engineering is the iterative process of refining the proxy reward function to align with long-term member satisfaction. It is similar to feature engineering, except that the reward can be derived from data that is not available at serving time. Reward engineering involves four stages: hypothesis formation, defining a new proxy reward, training a new bandit policy, and A/B testing. Below is a simple example.
User feedback used in the proxy reward function is often delayed or missing. For example, a member may play a recommended show for just a few minutes on the first day and take several weeks to fully complete it. This completion feedback is therefore delayed. Additionally, some user feedback may never arrive: while we might wish otherwise, not all members provide a thumbs-up or thumbs-down after completing a show, leaving us uncertain about their level of enjoyment.
We could allow a longer window to observe feedback, but how long should we wait for delayed feedback before computing the proxy rewards? If we wait too long (e.g., weeks), we miss the opportunity to update the bandit policy with the latest data. In a highly dynamic environment like Netflix, a stale bandit policy can degrade the user experience and be particularly bad at recommending newer items.
Solution: predict missing feedback
We aim to update the bandit policy shortly after making a recommendation, while also defining the proxy reward function based on all user feedback, including delayed feedback. Since delayed feedback has not been observed at the time of policy training, we can predict it. This prediction is made for each training example with delayed feedback, using already-observed feedback and other relevant information up to the training time as input features. As a result, the prediction also gets better as time progresses.
The proxy reward is then calculated for each training example using both observed and predicted feedback. These training examples are used to update the bandit policy.
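One way to sketch this step: the unobserved binary signal is replaced by its predicted probability, yielding an expected proxy reward. The weights and the single delayed signal here are illustrative assumptions:

```python
def expected_proxy_reward(completed: bool, p_thumbs_up: float) -> float:
    """Expected proxy reward for a training example whose delayed thumbs-up
    has not yet been observed: the prediction stands in for the binary signal.
    Weights are made up for this sketch."""
    return 1.0 * completed + 1.5 * p_thumbs_up

# The member completed the show; a (hypothetical) delayed-feedback model
# predicts a thumbs-up with probability 0.8:
r = expected_proxy_reward(completed=True, p_thumbs_up=0.8)
```

Once the real thumbs-up (or its absence) is eventually observed, the example could be relabeled with the actual signal instead of the prediction.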
But aren't we still only relying on observed feedback in the proxy reward function? Yes, because delayed feedback is predicted based on observed feedback. However, it is simpler to reason about rewards using all feedback directly. For instance, the delayed thumbs-up prediction model may be a complex neural network that takes into account all observed feedback (e.g., short-term play patterns). It is more straightforward to define the proxy reward as a simple function of the thumbs-up feedback rather than a complex function of short-term interaction patterns. The prediction model can also be used to adjust for potential biases in how feedback is provided.
The reward engineering diagram is updated with an optional delayed feedback prediction step.
Two types of ML models
It is worth noting that this approach employs two types of ML models:
- Delayed Feedback Prediction Models: These models predict p(final feedback | observed feedback). The predictions are used to define and compute proxy rewards for bandit policy training examples. As a result, these models are used offline during bandit policy training.
- Bandit Policy Models: These models are used in the bandit policy π(item | user; r) to generate recommendations online, in real time.
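The offline/online split between the two model types might be sketched as follows. All function bodies here are hypothetical stand-ins, not the actual Netflix models:

```python
# Offline: the delayed-feedback model labels each training example with a
# proxy reward before the bandit policy is (re)trained on the labeled data.
def delayed_model(observed: dict) -> float:
    # Toy stand-in for p(final feedback | observed feedback).
    return 0.8 if observed.get("completed") else 0.1

def compute_proxy_reward(observed: dict, p_thumbs_up: float) -> float:
    # Made-up weights combining observed and predicted feedback.
    return float(observed.get("completed", False)) + 1.5 * p_thumbs_up

def label_training_example(example: dict) -> dict:
    p = delayed_model(example["observed"])
    return {**example, "reward": compute_proxy_reward(example["observed"], p)}

# Online: only the trained bandit policy runs, scoring candidates in real time.
def recommend(policy, member_context, candidates):
    return max(candidates, key=lambda item: policy(member_context, item))

labeled = label_training_example({"observed": {"completed": True}})
top = recommend(lambda ctx, item: len(item), None, ["A", "BB", "CCC"])
```

Keeping the prediction models offline means the online serving path stays as lightweight as the bandit policy itself.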
Improved input features or neural network architectures often lead to better offline model metrics (e.g., AUC for classification models). However, when these improved models are subjected to A/B testing, we often observe flat or even negative online metrics, which quantify long-term member satisfaction.
This online-offline metric disparity usually occurs when the proxy reward used by the recommendation policy is not fully aligned with long-term member satisfaction. In such cases, a model may achieve higher proxy rewards (offline metrics) but result in worse long-term member satisfaction (online metrics).
Nevertheless, the model improvement is genuine. One approach to resolving this is to further refine the proxy reward definition to align better with the improved model. When this tuning results in positive online metrics, the model improvement can be effectively productized. See [1] for more discussion of this challenge.
In this post, we presented an overview of our reward engineering efforts to align Netflix recommendations with long-term member satisfaction. While retention remains our north star, it is not easy to optimize directly. Therefore, our efforts focus on defining a proxy reward that is aligned with long-term satisfaction and sensitive to individual recommendations. Finally, we discussed the unique challenge of delayed user feedback at Netflix and proposed an approach that has proven effective for us. Refer to [2] for an earlier overview of the reward innovation efforts at Netflix.
As we continue to improve our recommendations, several open questions remain:
- Can we learn a good proxy reward function automatically by correlating behavior with retention?
- How long should we wait for delayed feedback before using its predicted value in policy training?
- How can we leverage Reinforcement Learning to further align the policy with long-term satisfaction?
[1] Deep learning for recommender systems: A Netflix case study. AI Magazine 2021. Harald Steck, Linas Baltrunas, Ehtsham Elahi, Dawen Liang, Yves Raimond, Justin Basilico.
[2] Reward innovation for long-term member satisfaction. RecSys 2023. Gary Tang, Jiangwei Pan, Henry Wang, Justin Basilico.
