Writer: Keertana Chidambaram, Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya
(*The work was completed when Keertana interned at Netflix.)
Introduction
This blog post focuses on post-training generative recommenders. Generative recommenders (GRs) represent a new paradigm in the field of recommendation systems (e.g. HSTU, OneRec). These models draw inspiration from recent advances in transformer architectures used for language and vision tasks. They approach the recommendation problem, including both ranking and retrieval, as a sequential transduction task. This perspective enables generative training, where the model learns by imitating the next event in a sequence of user actions, thereby effectively modeling user behavior over time.
However, a key challenge with simply replicating observed user patterns is that it may not always lead to the best possible recommendations. User interactions are influenced by a variety of factors, such as trends or external features, and the system's view of those interactions is inherently limited. For example, if a user tries a popular show but later indicates it wasn't a good fit, a model that only imitates this behavior might continue to recommend similar content, missing the chance to improve the user's experience.
This highlights the importance of incorporating user preferences and feedback, rather than relying solely on observed behavior, to improve recommendation quality. In the context of recommendation systems, we benefit from a wealth of user feedback, which includes explicit signals such as ratings and reviews, as well as implicit signals like watch time, click-through rates, and overall engagement. This abundance of feedback serves as a valuable resource for improving model performance.
Given the recent success of reinforcement learning methods in post-training large language models, such as DPO and GRPO, this study investigates whether similar methods can be applied to generative recommenders. Ultimately, our goal is to identify both the opportunities and challenges in using these methods to improve the quality and relevance of recommendations.
Unlike language models, post-training generative recommenders presents unique challenges. One of the most significant is the difficulty of obtaining counterfactual feedback in recommendation scenarios. Recommendation feedback is generated on-policy: it reflects users' real-time interactions with the system as they naturally use it. Since a typical user sequence can span weeks or even years of activity, it is impractical to ask users to assess or provide feedback on hypothetical, counterfactual experiences. As a result, the absence of counterfactual data makes it challenging to apply post-training methods such as PPO or DPO, which require feedback on counterfactual user sequences.
Moreover, post-training methods typically rely on a reward model, either implicit or explicit, to guide optimization. The quality of the reward model heavily influences the effectiveness of post-training. In the context of recommendation systems, however, reward signals tend to be much noisier. For instance, if we use watch time as an implicit reward, it may not always accurately reflect user satisfaction: a viewer might stop watching a favorite show simply because of time constraints, while finishing a lengthy show does not necessarily indicate genuine enjoyment.
To address these post-training challenges, we introduce a novel algorithm called Advantage-Weighted Supervised Fine-tuning (A-SFT). Our analysis first demonstrates that reward models in recommendation systems often exhibit higher uncertainty due to the issues discussed above. Rather than relying solely on these uncertain reward models, A-SFT combines supervised fine-tuning with the advantage function to more effectively guide post-training optimization. This approach proves especially effective when the reward model has high variance but still provides valuable directional signals. We benchmark A-SFT against four other representative methods, and our results show that A-SFT achieves better alignment between the pre-trained generative recommendation model and the reward model.
In Figure 1, we conceptualize the pros and cons of different post-training paradigms. For example, online reinforcement learning is most useful when the reward model generalizes well, and behavior cloning is suitable when no reward model is available. Applying each algorithm to its fitting use case is the key to successful post-training. For example, over-exploiting a noisy reward model will hurt task performance, as the guidance from the reward model can be merely noise. Conversely, not leveraging a good reward model leaves potential improvements on the table. We find A-SFT hits the sweet spot between offline reinforcement learning and behavior cloning, where it benefits from the directional signals in these noisy estimates while being less dependent on reward accuracy.
Figure 1: The landscape of RL algorithms based on the reward model's accuracy
Challenges in Post-training for Recommendation
Reinforcement Learning from Human Feedback (RLHF) is the most popular framework for post-training large language models. In this framework, human annotators evaluate and rank different outputs generated by a model. This feedback is then used to train a reward model that predicts how well a model output aligns with human preferences. The reward model then serves as a proxy for human judgment during reinforcement learning, guiding the model to generate outputs that are more likely to be preferred by humans.
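Reward models in this framework are commonly fit with a pairwise (Bradley-Terry) loss on chosen-versus-rejected outputs. A minimal NumPy sketch of that loss (names and numbers are illustrative, not from this post):

```python
import numpy as np

def pairwise_reward_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)) computed stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -margin)))

# A pair the reward model already separates correctly incurs a small loss...
good = pairwise_reward_loss(np.array([2.0]), np.array([-2.0]))
# ...while a misordered pair incurs a large one.
bad = pairwise_reward_loss(np.array([-2.0]), np.array([2.0]))
assert good < bad
```

Minimizing this loss pushes the reward head to score human-preferred outputs above rejected ones.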
While traditional RLHF methods like PPO or DPO are effective for aligning LLMs, there are several challenges in applying them directly to large-scale recommendation systems:
1. Lack of Counterfactual Observations
As in typical RLHF settings, collecting real-time feedback from a diverse user base across a wide range of items is both costly and impractical. The data in recommendation are generated by users' real-time interests. Neither third-party annotators nor even the users themselves have a practical means to evaluate an alternative reality. For example, it is impractical to ask Netflix users to evaluate hundreds of unseen movies. Consequently, we lack a live environment in which to perform reinforcement learning.
2. Noisy Reward Models
In addition to the limited counterfactual data, the recommendation task itself is inherently more random. Recommendation data has less structure than language data: users choose to watch certain shows not because of a grammar rule dictating that nouns must be followed by verbs. In fact, users' choices usually exhibit a degree of permutation invariance, where swapping the order of events in a user's history still yields a valid activity sequence. This randomness in behavior makes learning a good reward model extremely difficult. Often the reward models we learn still have a large margin of error.
Here is an ablation study we performed on reward model performance with O(millions) of users and O(billions) of tokens. The reward model uses an open-sourced HSTU architecture for ease of reproducing this study. We adopt the standard RLHF approach of training a reward model on offline, human-collected feedback. We start by creating a proxy reward, scored on a scale from 1 to 5 for ease of interpretation. This reward model is co-trained as a shallow reward head on top of the generative recommender. It predicts the reward for the most recently selected title based on a user's interaction history. To evaluate its effectiveness, we compare the model's performance against two simple baselines: (1) predicting the next reward as the average reward the user has given in their past interactions, and (2) predicting it as the average reward that all users have assigned to that particular title in previous interactions.
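The two baselines are simple averages over the interaction log; a toy sketch (the log schema here is illustrative, not the production format):

```python
# Toy interaction log: (user_id, title_id, reward on a 1-5 scale)
log = [("u1", "t1", 5), ("u1", "t2", 3), ("u2", "t1", 4), ("u2", "t3", 2)]

def user_mean_baseline(log, user):
    """Baseline (1): predict the average reward this user gave in past interactions."""
    rewards = [r for u, _, r in log if u == user]
    return sum(rewards) / len(rewards)

def title_mean_baseline(log, title):
    """Baseline (2): predict the average reward all users gave this title."""
    rewards = [r for _, t, r in log if t == title]
    return sum(rewards) / len(rewards)

print(user_mean_baseline(log, "u1"))   # (5 + 3) / 2 = 4.0
print(title_mean_baseline(log, "t1"))  # (5 + 4) / 2 = 4.5
```

A learned reward model that cannot beat these averages is, in effect, ignoring the interaction history.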
Table 1: Reward model performance metrics
We observe that the model's predictions do not significantly outperform the simple baselines. This result is intuitive, as a user's historical interactions typically cover only a small subset of titles, making it difficult to accurately predict their responses to the vast number of unexplored titles in the catalogue. We expect this to be a potential challenge for any large recommendation system where the ratio of explored to unexplored titles is very small.
3. Lack of a Logging Policy
In recommendation systems, the policy that generated the logged data is often unknown and cannot be directly estimated. Offline reinforcement learning methods often rely on Inverse Propensity Scoring (IPS) to debias such data by reweighting interactions according to the logging policy's action probabilities. However, estimating the logging policy accurately is challenging and error-prone, which can introduce additional biases, and IPS itself is known to suffer from high variance. Consequently, offline RL approaches that depend on IPS are ill-suited to our setting.
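For concreteness, the IPS estimator we rule out reweights each logged reward by the ratio of the target policy's probability to the logging policy's probability; a toy sketch (all numbers illustrative, and note that in practice the logging probabilities below are exactly what we do not know):

```python
def ips_value_estimate(logged, target_probs):
    """Inverse Propensity Scoring: estimate the target policy's value from
    logged (action, reward, logging_prob) tuples by importance reweighting."""
    total = 0.0
    for action, reward, logging_prob in logged:
        # The weight blows up when logging_prob is small: the high-variance problem.
        weight = target_probs[action] / logging_prob
        total += weight * reward
    return total / len(logged)

# Two logged interactions from a (here, known) uniform logging policy.
logged = [("a", 1.0, 0.5), ("b", 0.0, 0.5)]
target = {"a": 0.9, "b": 0.1}
print(ips_value_estimate(logged, target))  # (0.9/0.5 * 1 + 0.1/0.5 * 0) / 2 = 0.9
```

Any error in the assumed logging probabilities propagates directly into the value estimate, which is why A-SFT avoids this estimator.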
Advantage-Weighted Supervised Fine-Tuning
Given the three challenges outlined above, we propose a new algorithm, Advantage-Weighted SFT (A-SFT). It combines supervised fine-tuning with advantage reweighting from reinforcement learning. The key observation is as follows: although the reward estimate for each individual event has high uncertainty, we find that the reward estimates contain directional signals between high-reward and low-reward events. These signals can help better align the model during post-training.
A central factor in this study is the generalization ability of the reward model. Better generalization allows more accurate predictions of user preferences for unseen titles, thereby making exploration more effective. For reward models with moderate to high generalization power, both online RL methods such as PPO and offline RL methods such as CQL can perform effectively. However, in our setting, reward model generalization is worse than that of its language counterparts, which makes these algorithms less appropriate. In addition, using methods like inverse propensity scoring (IPS) introduces a heightened risk of high-variance estimates, prompting us to exclude algorithms such as off-policy REINFORCE.
Our proposed method, A-SFT, does not rely on IPS. Without needing prior knowledge of the logging policy, it can be applied broadly to cases where observations of the environment are limited or biased. This is particularly useful in the recommendation setting due to the user feedback loop and distribution shifts over time. Even without knowing the logging policy, A-SFT provides a means to control the deviation between the current policy and the logging policy by tuning a parameter. This design offers significant control over the bias learned from uncertain reward models. We show that A-SFT outperforms baseline behavior cloning by directly optimizing observed rewards.
The advantage-weighted SFT algorithm is as follows:
For the results presented in this blog post, we treat the recommendation problem as a contextual bandit, i.e. given a history of user interactions as the context, can we recommend a high-reward next title for the user?
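As a rough illustration only, an advantage-weighted SFT objective under this contextual-bandit framing might look like the sketch below, in the spirit of advantage-weighted regression. The exact loss, baseline, clipping, and temperature here are assumptions for exposition, not the production implementation:

```python
import numpy as np

def a_sft_loss(log_probs, rewards, baseline, beta=1.0, clip=10.0):
    """Hypothetical advantage-weighted SFT loss: negative log-likelihood of the
    observed next item, reweighted by exp(advantage / beta).
    beta acts as the deviation-control parameter: beta -> infinity recovers
    plain SFT (behavior cloning); small beta trusts the reward model more."""
    advantage = rewards - baseline                        # noisy but directional signal
    weights = np.minimum(np.exp(advantage / beta), clip)  # clip for stability
    return float(-np.mean(weights * log_probs))

# Toy example: two observed events with model log-probs and estimated rewards.
log_probs = np.array([-1.2, -0.7])
rewards = np.array([4.0, 2.0])
loss = a_sft_loss(log_probs, rewards, baseline=3.0, beta=1.0)
```

With a huge beta the weights collapse to 1 and the loss reduces to the ordinary SFT objective, which is how the single parameter interpolates between behavior cloning and reward-following.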
Benchmarks
We compared representative algorithms including PPO, IPO, DPO, CQL, and SFT as the baselines:
- Reward-Weighted Behavior Cloning: This benchmark algorithm modifies supervised fine-tuning (SFT) by weighting the loss with the raw reward of the selected item, instead of weighting the loss with the advantage as in the proposed algorithm.
- Rejection Sampling Direct Preference Optimization / Identity Preference Optimization (RS DPO/IPO): This is a variant of DPO/IPO where, for each user history x, we generate contrasting response pairs by training an ensemble of reward models to estimate confidence intervals for the rewards of several potential responses y. If the lower bound of the reward confidence interval for one response is higher than the upper bound for another response, that pair is used to train DPO/IPO.
- Conservative Q-Learning (CQL): This is a standard offline RL algorithm that learns a conservative Q-function, penalizing overestimation of Q-values, particularly in regions of the state-action space with little or no reward data.
- Proximal Policy Optimization (PPO): This is a standard RLHF (Reinforcement Learning from Human Feedback) algorithm that uses reward models as an online environment. PPO learns an advantage function and optimizes the policy to maximize expected reward while maintaining proximity to the initial policy.
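The confidence-interval filter behind the RS DPO/IPO pairs can be made concrete; a sketch assuming per-response reward samples from the ensemble (the 2-sigma interval is an illustrative choice):

```python
import statistics

def confident_pairs(ensemble_scores, z=2.0):
    """Keep (winner, loser) pairs whose reward confidence intervals do not overlap:
    the winner's lower bound must exceed the loser's upper bound."""
    intervals = {}
    for resp, scores in ensemble_scores.items():
        mean = statistics.mean(scores)
        spread = z * statistics.stdev(scores)  # sample std dev across ensemble members
        intervals[resp] = (mean - spread, mean + spread)
    pairs = []
    for a, (lo_a, _) in intervals.items():
        for b, (_, hi_b) in intervals.items():
            if lo_a > hi_b:  # a is confidently preferred over b
                pairs.append((a, b))
    return pairs

# Ensemble of 3 reward models scoring two candidate responses for one user history.
scores = {"y1": [4.0, 4.2, 4.1], "y2": [1.0, 1.3, 1.1]}
print(confident_pairs(scores))  # [('y1', 'y2')]
```

Pairs whose intervals overlap are discarded, so only preferences the noisy ensemble agrees on reach DPO/IPO training.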
We sampled a separate test set of O(millions) of users, collected at a future date after training.
Offline Evaluation Results
We evaluate our algorithm on a dataset of high-reward user trajectories. For simplicity, we consider a trajectory to have high reward if its accumulated reward is higher than the population median. We report the following metrics on the held-out test dataset:
- NDCG@k: This measures the ranking quality of the recommended items up to position k. It accounts for the position of relevant items in the recommendation list, assigning higher scores when relevant items appear higher in the ranking. The gain is discounted logarithmically at lower ranks, and the result is normalized by the ideal ranking (i.e., the best possible ordering of items).
- HR@k: This measures the proportion of test cases in which the ground-truth selected item y appears in the top k recommendations. It is a binary metric per test case (hit or miss) and is averaged over all test cases.
- MRR: MRR evaluates ranking quality by measuring the reciprocal of the rank at which the selected item appears in the recommendation list. The metric is averaged across all test cases.
- Reward Model as a Judge: We use the reward model to evaluate the policy on future user events. We propose using an ensemble of reward models for this evaluation to increase confidence. The result is based on the discounted reward generated over a few steps. The standard deviation is less than 4%.
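In the single-relevant-item setting used here, the three ranking metrics all reduce to functions of the rank of the ground-truth title; a reference sketch (per-test-case values, averaged over the test set in practice):

```python
import math

def hit_rate_at_k(rank: int, k: int) -> float:
    """HR@k: 1 if the ground-truth item ranks within the top k, else 0 (rank is 1-based)."""
    return 1.0 if rank <= k else 0.0

def reciprocal_rank(rank: int) -> float:
    """Per-case MRR contribution: 1 / rank of the ground-truth item."""
    return 1.0 / rank

def ndcg_at_k(rank: int, k: int) -> float:
    """NDCG@k with a single relevant item: DCG is 1/log2(rank + 1) if within top k;
    the ideal DCG (item at rank 1) is 1/log2(2) = 1, so no further normalization is needed."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

# Ground-truth title ranked 3rd in a top-10 list:
print(hit_rate_at_k(3, 10))        # 1.0
print(round(ndcg_at_k(3, 10), 3))  # 1/log2(4) = 0.5
```

With multiple relevant items per case, NDCG would instead sum discounted gains and divide by the ideal DCG.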
We measure the percentage improvement in each metric relative to the baseline, Reward-Weighted Behavior Cloning (BC). We find that advantage-weighted SFT shows the largest improvement, outperforming BC as well as reward-model-dependent algorithms like CQL, PPO, DPO, and IPO.
Our experiments show that advantage-weighted SFT is a simple but promising approach for post-training generative recommenders, as it copes with poor reward model generalization and the lack of IPS. More specifically, we find that PPO, IPO, and DPO achieve good reward scores but also overfit to the reward model. Conservative Q-Learning achieves more robust improvements but does not fully capture the potential signals in the reward model. A-SFT achieves both better recommendation metrics and better reward scores.
