# RECFLOW: AN INDUSTRIAL FULL FLOW RECOMMENDATION DATASET

Published as a conference paper at ICLR 2025

Qi Liu1, Kai Zheng2, Rui Huang2, Wuchao Li1, Kuo Cai2, Yuan Chai2, Yanan Niu2, Yiqun Hui2, Bing Han2, Na Mou2, Hongning Wang4, Wentian Bao3, Yunen Yu3, Guorui Zhou2, Han Li2, Yang Song2, Defu Lian1

1University of Science and Technology of China, 2Kuaishou, 3Independent, 4Tsinghua University

{qiliu67,liwuchao}@mail.ustc.edu.cn, {liandefu}@ustc.edu.cn, {zhengkai,huangrui06,caikuo,niuyanan,chaiyuan}@Kuaishou.com, {huiyiqun,hanbing,zhouguorui,lihan08,songyang}@Kuaishou.com, {hw5x}@virginia.edu, {wb2328}@columbia.edu, {yuenyun}@126.com, {285208254,gai.kun}@qq.com

Industrial recommendation systems (RS) rely on a multi-stage pipeline to balance effectiveness and efficiency when delivering items from a vast corpus to users. Existing RS benchmark datasets primarily focus on the exposure space, where novel RS algorithms are trained and evaluated. However, when these algorithms transition to real-world industrial RS, they face two critical challenges: (1) handling unexposed items, a significantly larger space than the exposed one, which profoundly impacts their practical performance; and (2) overlooking the intricate interplay between multiple stages of the recommendation pipeline, resulting in suboptimal system performance. To bridge the gap between offline RS benchmarks and real-world online environments, we introduce RecFlow, an industrial full-flow recommendation dataset. Unlike existing datasets, RecFlow includes samples not only from the exposure space but also from unexposed items filtered at each stage of the RS funnel. RecFlow comprises 38 million interactions from 42,000 users across nearly 9 million items, with an additional 1.9 billion stage samples collected from 9.3 million online requests over 37 days and spanning 6 stages.
Leveraging RecFlow, we conduct extensive experiments to demonstrate its potential in designing novel algorithms that enhance effectiveness by incorporating stage-specific samples. Some of these algorithms have already been deployed online at Kuaishou, consistently yielding significant gains. We propose RecFlow as the first comprehensive whole-pipeline benchmark dataset for the RS community, enabling research on algorithm design across the entire recommendation pipeline, including selection bias study, debiased algorithms, multi-stage consistency and optimality, multi-task recommendation, and user behavior modeling.

1 INTRODUCTION

Recommendation systems (RS) are pivotal in modern web and mobile applications that handle vast amounts of information. Their primary objective is to deliver personalized recommendations from an extensive corpus of items, based on estimated user preferences. To meet stringent online latency requirements, industrial RS predominantly employ a multi-stage funnel-like pipeline (Covington et al., 2016), striking a balance between effectiveness and efficiency. Substantial efforts have been devoted to designing algorithms within this system, aiming to enhance its effectiveness as measured by user feedback on selected items.

A typical multi-stage RS consists of successive stages: retrieval → pre-ranking → ranking → re-ranking. During online serving, the retrieval stage (Hidasi et al., 2015; Kang & McAuley, 2018; Zhu et al., 2018) retrieves thousands of preferred items from the entire corpus. The pre-ranking stage (Huang et al., 2013; Wang et al., 2020) filters out less favorable items from the retrieved set, forwarding hundreds of more promising items to the ranking stage. In turn, the ranking stage (Cheng et al., 2016; Zhou et al., 2018; Bian et al., 2022) selects the most appealing items from this refined set.
Finally, the re-ranking stage (Pei et al., 2019; Bello et al., 2018) determines the final items to be displayed, considering both diversity and business objectives. Notably, as we progress through the stages, the model complexity tends to increase, incorporating additional features and interleaving them at shallow layers of deep neural network models. Importantly, the latter three stages typically learn from the exposure space, which captures actual user feedback (both positive and negative) on the displayed items.

Despite the maturity of industrial RS, two significant shortcomings persist. First, a discrepancy exists between the data distribution in the training space and that in the serving space (Qin et al., 2022). The former corresponds to the exposed space, while the latter primarily resides in the unexposed space. We refer to this discrepancy as the distribution shift problem. For instance, consider the pre-ranking model (Wang et al., 2020): it must score thousands of items, yet only a few of these items are exposed to users and stored as training data in each request. Most of the remaining samples have not been exposed even once. Consequently, a pre-ranking model trained solely on the exposure space may inaccurately predict preferences in the retrieved space, leading to suboptimal recommendations (Wei et al., 2024). Similar issues arise in the ranking and re-ranking stages. Second, there is a discrepancy between the learning and serving environments. Although models at different stages are learned and evaluated separately, they must collaborate as a cohesive system to meet user preferences. Insufficient knowledge about other stages during the learning process can result in suboptimal performance when these learned models serve online. For example, the online performance of a retrieval algorithm depends not only on the algorithm itself but also on the subsequent stages.
Incorporating knowledge from subsequent stages can enhance the retrieval algorithm's performance (Ding et al., 2019; Lou et al., 2022; Zheng et al., 2024).

Large-scale datasets serve as the bedrock for advancing various machine learning algorithms. For instance, ImageNet (Deng et al., 2009) has significantly contributed to computer vision, while GLUE (Wang et al., 2018) has played a crucial role in natural language processing. However, in the RS domain, existing datasets (Harper & Konstan, 2015; Ni et al., 2019; Asghar, 2016; Zhu et al., 2018; Yuan et al., 2022; Gao et al., 2022a;b; Sun et al., 2023), though instrumental in fueling RS research, have a limitation: they are exclusively collected from the exposure space. These datasets cannot fully capture the true dynamics of online recommendation services. Moreover, this inherent bias prevents them from effectively addressing the discrepancy between training and serving in RS.

We propose RecFlow, an industrial large-scale full-flow dataset collected from a real industrial RS. The industrial RS's multi-stage funnel-like pipeline encompasses the following stages: retrieval, pre-ranking, coarse ranking, ranking, re-ranking, and edge ranking. Unlike all previous RS benchmarks, RecFlow samples representative unexposed items from each stage of the funnel in a single request alongside all the exposed items. The inclusion of full-stage samples in each request provides several merits. (1) By recording items from the serving space, RecFlow enables the study of how to alleviate the discrepancy between training and serving for specific stages during both the learning and evaluation processes (Qin et al., 2022). (2) RecFlow also records the stage information for different stage samples, facilitating research on joint modeling of multiple stages, such as stage consistency or optimal multi-stage RS (Zheng et al., 2024).
(3) The positive and negative samples from the exposure space are suitable for classical click-through rate prediction or sequential recommendation tasks (Zhou et al., 2018; Kang & McAuley, 2018). (4) RecFlow stores multiple types of positive feedback (e.g., effective view, long view, like, follow, share, comment), supporting research on multi-task recommendation (Ma et al., 2018a; Zhao et al., 2019; Tang et al., 2020; Liu et al., 2023). (5) Information about video duration and playing time for each exposed video allows the study of learning through implicit feedback, such as predicting playing time (Covington et al., 2016; Lin et al., 2023). (6) RecFlow includes a request identifier feature, which can contribute to studying the re-ranking problem (Pei et al., 2019; Bello et al., 2018). (7) Timestamps for each sample enable the aggregation of user feedback in chronological order, facilitating the study of user behavior sequence modeling algorithms (Zhou et al., 2018; 2019; Chang et al., 2023; Hou et al., 2023). (8) RecFlow incorporates context, user, and video features beyond identity features (e.g., user ID and video ID), making it suitable for context-based recommendation (Huang et al., 2019; Wang et al., 2022). (9) The rich information recorded about RS and user feedback allows the construction of more accurate RS simulators or user models in feed scenarios (Shi et al., 2019; Zhao et al., 2023). (10) Rich stage data may help estimate selection bias more accurately and design better unbiased algorithms (Chen et al., 2023).

Figure 1: The overall collection process of RecFlow. (Diagram labels: Retrieval, Pre-ranking, Coarse Ranking, Ranking, Re-Ranking, Edge Ranking; sampled sets prerank_neg, coarse_neg, rank_neg, rerank_neg; Top-10 selections; marked items are collected into RecFlow.)
Furthermore, RecFlow is a large-scale dataset, containing 38 million exposure samples and 1.9 billion stage samples, ensuring the credibility of algorithm improvements based on its data.

Given these characteristics, RecFlow can be utilized across a broad spectrum of RS algorithms. In this paper, we primarily conduct pioneering experiments to explore its potential in each stage of the RS funnel. In the retrieval stage, we investigate the effectiveness of using filtered videos from each stage as hard negative samples and explore the interplay between retrieval and subsequent stages. For the coarse ranking stage, we leverage corresponding stage samples to address the distribution shift problem and model mutual effects between stages. Motivated by existing works, we explore how to exploit stage samples for designing auxiliary ranking tasks and behavior sequence modeling algorithms to improve classical AUC metrics. Similar exploration experiments are also conducted for the ranking stage. Notably, RecFlow also introduces a new recall metric to assess the performance of different methods based on stage samples to mitigate the gap between training and serving environments.

RecFlow is the first RS dataset containing stage samples. It stands as one of the largest and most comprehensive datasets for RS, covering nearly all recommendation tasks. We have made the dataset and source code publicly available to promote reproducibility and advance RS research at https://github.com/RecFlow-ICLR/RecFlow. The dataset is licensed under the CC-BY-NC-SA-4.0 International License.

2 DATASET CHARACTERISTIC

2.1 COLLECTION

RecFlow is the first RS dataset containing intermediate filtered videos of each stage in the industrial RS funnel. The multi-stage funnel-like pipeline of the industrial RS contains six stages: retrieval → pre-ranking → coarse ranking → ranking → re-ranking → edge ranking. The number of videos output at each stage is 8000 → 3000 → 500 → 120 → 10 → 6.
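To make the funnel concrete, the short sketch below tabulates, for each stage after retrieval, how many videos enter, how many survive, and how many are filtered out. Only the six stage names and output sizes come from the text; the function name `funnel_summary` and the tuple layout are illustrative assumptions. The filtered counts indicate the size of the pool from which unexposed stage samples could be drawn.

```python
# Stage output sizes as stated in Section 2.1 of the paper.
STAGES = [
    ("retrieval", 8000),
    ("pre-ranking", 3000),
    ("coarse ranking", 500),
    ("ranking", 120),
    ("re-ranking", 10),
    ("edge ranking", 6),
]


def funnel_summary(stages):
    """Return (stage, n_in, n_out, n_filtered, keep_ratio) for each stage after the first.

    Hypothetical helper for illustration; it is not part of the RecFlow code base.
    """
    rows = []
    for (_, n_in), (name, n_out) in zip(stages, stages[1:]):
        rows.append((name, n_in, n_out, n_in - n_out, n_out / n_in))
    return rows


summary = funnel_summary(STAGES)
for name, n_in, n_out, n_filtered, ratio in summary:
    print(f"{name:>14}: {n_in:>5} -> {n_out:>4}  filtered={n_filtered:>5}  kept={ratio:.1%}")
```

Under these numbers, every stage discards the majority of its input except edge ranking, which is consistent with the observation that most scored videos are never exposed.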
We collected the online request logs from January 13 to February 18, 2024. The collection process is as follows. We randomly sample 42K seed users on January 12, 2024, and store each recommendation request of the seed users from January 13, 2024. As shown in Figure 1, we sample some filtered videos from each stage but adopt a stage-wise strategy. From January 13 to February 04, 2024 (the 1st period), we collect for each request: 10 filtered videos of the pre-ranking stage, named pre-rank neg; 10 filtered videos of the coarse ranking stage, named coarse neg; the top 10 ranking videos as rank pos and 10 filtered videos sampled after the 120-th re-ranking video as rank neg in the ranking stage; the top 10 re-ranking videos as rerank pos and 10 filtered videos sampled after the 80-th re-ranking video as rerank neg in the re-ranking stage; and the user's various feedback on the exposed videos. Note that the recommendation scenario is feed-style: the user can only watch one video on the screen at a time. So, the 6 output videos of the RS may not all be exposed to the user because the user can leave the APP at any time. We define the realshow field to identify whether the user has watched the video.

Table 1: Detailed quantity information of various aspects in RecFlow.

| Period | #Stage Sample | #Request | #Users | #Realshow videos | #All videos |
|---|---|---|---|---|---|
| 1st Period | 352,120,401 | 6,062,348 | 38,193 | 5,984,924 | 30,305,725 |
| 2nd Period | 1,572,217,303 | 3,308,233 | 35,073 | 3,627,694 | 55,665,503 |
| Total | 1,924,337,704 | 9,370,581 | 42,472 | 8,773,147 | 82,216,301 |

| Period | #Realshow | #Like | #Long view | #Effective view | #Follow |
|---|---|---|---|---|---|
| 1st Period | 24,523,473 | 1,027,013 | 5,853,054 | 9,343,776 | 69,495 |
| 2nd Period | 13,721,842 | 618,158 | 3,111,439 | 5,063,751 | 37,558 |
| Total | 38,245,315 | 1,645,171 | 8,964,493 | 14,407,527 | 107,053 |

| Period | #Forward | #Comment | #Prerank neg | #coarse neg | #Rank pos |
|---|---|---|---|---|---|
| 1st Period | 45,966 | 175,896 | 60,623,480 | 60,623,480 | 60,624,430 |
| 2nd Period | 23,769 | 114,741 | 132,329,320 | 132,329,320 | 33,082,330 |
| Total | 69,735 | 290,637 | 192,952,800 | 192,952,800 | 93,706,760 |

| Period | #Rank neg | #Rank | #Rerank pos | #Rerank neg | #Re-rank |
|---|---|---|---|---|---|
| 1st Period | 60,624,012 | 121,248,442 | 60,624,613 | 60,623,606 | 121,248,219 |
| 2nd Period | 33,082,330 | 1,307,558,663 | 33,082,330 | 33,082,330 | 1,307,558,663 |
| Total | 93,706,342 | 1,428,807,105 | 93,706,943 | 93,705,936 | 1,428,806,882 |

From February 05 to February 18, 2024 (the 2nd period), we expand the amount of stage samples. Both the pre-ranking neg and the coarse neg go up to 40 per request. For the ranking, re-ranking, and edge ranking stages, we save all the videos that appear in these stages. We still obtain the rank pos, rank neg, rerank pos, rerank neg, and realshow under the same stage-wise strategy as in the previous period.

We collect stage samples in this way to balance storage pressure against information integrity. The 2nd period has more complete stage information than the 1st period, which gives researchers more choices to further process the dataset based on their needs. We sample 10/40 filtered videos from the pre-ranking and coarse ranking stages because keeping all of the filtered videos would impose huge storage pressure. Besides, the videos filtered by the first three stages are less important.
For the latter three stages, we keep the information integrity of the stage. The videos appearing in these stages are closer to the user's preference and are few in number.

2.2 FEATURES

The format of each instance in RecFlow is {request id, request timestamp, user id, device id, age, gender, province, video id, author id, category level one, category level two, upload type, upload timestamp, duration, realshow, rerank pos, rerank neg, rank pos, rank neg, coarse neg, pre-rank neg, rank index, rerank index, playing time, effective view, long view, like, follow, forward, comment}. realshow indicates whether the user has watched the video. The same convention applies to the other pos/neg fields. For example, when the video ranks in the top 10 of the ranking stage, rank pos is set to 1, otherwise 0. To preserve the original industrial RS information, we also retain the ranking position of each video in the ranking and re-ranking stages through the rank index and rerank index fields. We record seven types of positive feedback that reflect the user's varying degrees of preference towards videos. playing time is the time the user spends watching the video. Details of the other features are given in Appendix A.1 (Feature Description).

2.3 ANALYSIS

In this section, we conduct a basic statistical analysis to show RecFlow's characteristics. RecFlow contains about 9 million requests, 38 million exposure samples, and 1.9 billion stage samples (including exposure samples). These samples cover 42K users, 8.7 million exposed videos, and 82 million videos in total. Nearly 89% of videos are not exposed. This characteristic does not exist in existing RS datasets. During the first period, the quantity of each defined stage's samples is about 60 million. Stage samples are 14.8x larger than exposed samples. The difference between stage samples and exposure samples increases to 236 times in the 2nd period.
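Two of these headline figures can be re-derived from the counts in Table 1, as the following arithmetic sketch shows. The unexposed share follows directly from the #Realshow videos and #All videos totals. For the 14.8x ratio, the choice of which columns to sum (#Prerank neg, #coarse neg, #Rank, and #Re-rank of the 1st period) is an assumption made here because it reproduces the reported value; the paper does not spell out the exact definition.

```python
# Counts copied from Table 1 (Total row and 1st Period row).
realshow_videos_total = 8_773_147
all_videos_total = 82_216_301

# Share of distinct videos that were never exposed to any user.
unexposed_share = 1 - realshow_videos_total / all_videos_total

# Assumption: "stage samples" in the 14.8x claim means the sum of these
# four 1st-period columns; this is our reading, not a stated definition.
prerank_neg_1st = 60_623_480
coarse_neg_1st = 60_623_480
rank_1st = 121_248_442
rerank_1st = 121_248_219
realshow_1st = 24_523_473

stage_samples_1st = prerank_neg_1st + coarse_neg_1st + rank_1st + rerank_1st
stage_to_exposure_ratio_1st = stage_samples_1st / realshow_1st

print(f"unexposed share of videos: {unexposed_share:.1%}")
print(f"1st-period stage/exposure ratio: {stage_to_exposure_ratio_1st:.1f}x")
```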
The huge quantity difference is the foundation for studying the distribution shift problem. The detailed quantities of the dataset are shown in Table 1. In Figure 3, the horizontal axis represents the range of the number of videos interacted with by users, and the vertical axis shows the number and percentage of users within that range. The figure illustrates that user frequency exhibits a long-tail distribution. In Figure 4, the horizontal axis represents the logarithm of the frequency of video appearances, while the vertical axis shows the video quantity corresponding to that frequency. The left chart only includes videos whose realshow is 1, which are the exposed videos, while the right chart includes videos from all stages. The two charts thus show the frequency of videos in the exposure space and in the all-stages space, respectively. The left chart shows that exposed video frequency follows a long-tail distribution. The right chart reveals that video frequency across all stages also obeys a long-tail distribution, which is a new finding.

Figure 3: User Distribution. (Horizontal axis: #Normal Interactions.)

Figure 4: Video Distribution. (Left panel: Log10 Frequency vs. Log10 #Realshow Videos. Right panel: Log10 Frequency vs. Log10 #Stage Videos.)

2.4 COMPARISON

We compare RecFlow with existing RS datasets. MovieLens (Harper & Konstan, 2015) contains users' rating data for movies. The Amazon dataset (Ni et al., 2019) contains users' review information on products. Yelp (Asghar, 2016) is a dataset for location recommendation. These three datasets only contain a single type of positive user feedback. Taobao (Zhu et al., 2018), an e-commerce dataset, has four types of positive user feedback. Tenrec (Yuan et al., 2022) is a comprehensive recommendation dataset that captures multiple types of user feedback across four distinct recommendation scenarios. KuaiRec (Gao et al., 2022a) is a fully observed video recommendation dataset.
KuaiRand (Gao et al., 2022b) is an unbiased sequential video recommendation dataset with randomly exposed videos. KuaiSAR (Sun et al., 2023) is a unified search and recommendation dataset. These three datasets were released for dedicated research problems. RecFlow differs from those datasets because it contains samples from each recommendation stage. Table 8 in Appendix A.2 (Dataset Comparison) gives a detailed comparison between RecFlow and existing RS datasets. We also discuss the limitations of RecFlow in Appendix A.3.

2.5 USER CONSENT AND PRIVACY PROTECTION

We only collect interaction data from users who have made their personal information public (such as user id, age, gender, and province). This public information allows for some level of data sharing, according to the privacy policy that users voluntarily agreed to when they signed up for an account. Besides, we anonymize all features that contain personal information. In detail, we anonymize each feature ID by first adding a random large integer to the raw ID value and then remapping the result to a new ID through a hash algorithm. The real-world identity of a person cannot be determined from the anonymized data. The General Data Protection Regulation of the European Union has confirmed that personal information that has been anonymized no longer counts as personal information. Therefore, anonymized personal information does not carry the corresponding compliance obligations, and companies can freely process it without the consent of individuals. Thus, our open-source dataset meets legal requirements. We have anonymized all features that contain personal information, including request id, user id, device id, age, gender, province, video id, author id, category level one, category level two, and upload type.
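As a rough illustration of the "random integer plus hash" idea described above, the sketch below assigns each raw ID a private random offset and hashes the shifted value. Everything here is an assumption for illustration: the class name `IdAnonymizer`, the 64-bit offset size, the use of SHA-256, and the 16-character truncation are hypothetical choices, since the paper names neither the hash function nor the noise range.

```python
import hashlib
import secrets


class IdAnonymizer:
    """Hypothetical sketch of noise-then-hash anonymization; not the authors' code."""

    def __init__(self, noise_bits=64):
        self._noise_bits = noise_bits
        # raw ID -> private random offset; this mapping must never be published.
        self._noise = {}

    def anonymize(self, raw_id: int) -> str:
        # Draw the offset once per raw ID so the same raw ID always maps
        # to the same anonymized ID within one dataset release.
        if raw_id not in self._noise:
            self._noise[raw_id] = secrets.randbits(self._noise_bits)
        shifted = raw_id + self._noise[raw_id]
        digest = hashlib.sha256(str(shifted).encode("utf-8")).hexdigest()
        return digest[:16]  # shortened for readability (assumed length)


anonymizer = IdAnonymizer()
print(anonymizer.anonymize(123), anonymizer.anonymize(123), anonymizer.anonymize(124))
```

Because the offsets are random and kept private, hashing candidate raw IDs does not reproduce the published IDs, which is the property the noise step is meant to add over plain hashing.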
We first anonymize each feature ID by adding a random large integer to the raw ID value and then remapping the result to a new ID through a hash algorithm. Note that each raw ID value is assigned its own unique large integer. The rest of the features are stage labels and the user's feedback labels, which are not related to privacy. Anonymizing data with random noise and a hash algorithm satisfies the privacy protection requirements of European Union law. RecFlow's anonymization is stricter than that of previous public recommendation datasets, including Amazon (Ni et al., 2019), Taobao (Zhu et al., 2018), KuaiRec (Gao et al., 2022a), and Tenrec (Yuan et al., 2022): we add random large-integer noise before applying the hash algorithm, whereas the others do not. It is nearly impossible to recover raw personal information, such as the real-world identity of a person.

3 EXPERIMENTS

We explore how to utilize stage samples to alleviate distribution shift and to distill knowledge of subsequent stages for improving the RS's performance. We focus on the typical retrieval, coarse ranking, and ranking stages. For each stage, we briefly introduce its duty and existing learning paradigm. Then, we state the motivation and the ways of exploiting stage samples. Finally, we report the experiment results and analysis. We run all experiments five times with PyTorch (Imambi et al., 2021) on Nvidia 32G V100 GPUs and report the average result and standard deviation. For all methods and all experiments, we train the neural models for only one epoch, with no early stopping. Thus, all methods are compared fairly. There are two reasons for using only one epoch. First, all online recommendation models of the industrial RS are trained for one epoch, and we stay consistent with the online configuration. Second, there exists a one-epoch phenomenon (Zhang et al., 2022) in training recommendation models, which indicates that multi-epoch training does not bring improvement.
3.1 RETRIEVAL

Retrieval is the first stage of the industrial RS. It aims at retrieving thousands of videos that the user potentially prefers from a video corpus on the scale of 100 million. Given the large candidate pool, the retrieval stage mostly adopts a lightweight two-tower model together with approximate nearest neighbor search to retrieve items quickly. To ensure that the user's preferred videos are obtained, retrieval models usually learn with positive feedback videos as positive samples and randomly sampled videos as negative samples. We choose SASRec (Kang & McAuley, 2018) with one head and one layer for exploration experiments. We use the effective view videos as positive samples and randomly sample 200 videos as negative samples for each positive. To stay consistent with the real industrial RS's online learning mode, we train SASRec on the first 36 days of data, day by day. The data from the last day is used for evaluation. We utilize standard top-N ranking metrics, namely Recall@K and NDCG@K, with K set to 100, 500, and 1000. The input feature is the user's 50 past effective view videos. We apply embedding to the video id feature and set the embedding dimension to 8. The batch size is 4,096 and the learning rate is 1e-1. BPR (Rendle et al., 2012) is the loss function, and Adam (Kingma & Ba, 2014) is used for optimization.

3.1.1 HARD NEGATIVE MINING

Recent research (Zhang et al., 2013; Rendle & Freudenthaler, 2014; Lian et al., 2020) has shown that hard negative mining usually not only accelerates convergence but also improves the accuracy of the retrieval model. Hard negative samples are videos that are similar to the positive videos but uninteresting to the user. The multi-stage RS pipeline aims at estimating the user's preference. Videos that fail to be exposed to the user during the pipeline are similar to the displayed positive video but very likely less attractive to the user.
Thus, we think the unexposed stage samples indeed satisfy the definition of hard negative samples. We conduct experiments to explore the effectiveness of the stage samples as hard negative samples. In the experiments, we replace some randomly sampled easy negative samples with the same number of hard negative stage samples. The total number of negative videos for each positive video is 200. We have the following findings from the results in Table 2. (1) Applying filtered videos from any stage as hard negatives improves performance on the Recall/NDCG metrics. (2) As K in Recall/NDCG@K becomes smaller, the performance improvement becomes larger. For example, when we add 1 pre-rank neg as a hard negative, the relative improvements of Recall@100, 500, 1000 are 24.7%, 18.2%, 9.2% respectively, and the relative improvements of NDCG@100, 500, 1000 are 28.3%, 20.7%, 12.6% respectively. (3) Hard negatives from rerank pos outperform those from the other stages. We think that videos from rerank pos are negative samples of appropriate difficulty. We also vary the number of hard negative samples to observe the changes in effectiveness. The experiment results and analysis are in Appendix A.4.

Table 2: Recall (R) and NDCG (N) results (mean ± std) obtained by using a single different stage sample as the hard negative sample during the retrieval stage, with units of %. The best and baseline results are based on the paired t-test at the significance level 5%.

| Hard Negative Type | R@100 | N@100 | R@500 | N@500 | R@1000 | N@1000 |
|---|---|---|---|---|---|---|
| Baseline | 0.461 ± 0.085 | 0.099 ± 0.085 | 1.593 ± 0.229 | 0.241 ± 0.045 | 2.685 ± 0.186 | 0.356 ± 0.040 |
| Prerank neg | 0.575 ± 0.095 | 0.127 ± 0.028 | 1.883 ± 0.170 | 0.291 ± 0.030 | 2.931 ± 0.142 | 0.401 ± 0.030 |
| Coarse neg | 0.555 ± 0.066 | 0.121 ± 0.021 | 1.729 ± 0.152 | 0.267 ± 0.033 | 2.758 ± 0.169 | 0.376 ± 0.035 |
| Rank neg | 0.462 ± 0.126 | 0.094 ± 0.030 | 1.695 ± 0.230 | 0.249 ± 0.043 | 2.733 ± 0.221 | 0.359 ± 0.042 |
| Rank pos | 0.648 ± 0.074 | 0.134 ± 0.017 | 1.794 ± 0.187 | 0.277 ± 0.028 | 2.737 ± 0.173 | 0.376 ± 0.025 |
| Rerank neg | 0.577 ± 0.091 | 0.119 ± 0.019 | 1.804 ± 0.208 | 0.274 ± 0.034 | 2.724 ± 0.242 | 0.371 ± 0.036 |
| Rerank pos | 0.687 ± 0.087 | 0.144 ± 0.018 | 1.889 ± 0.108 | 0.295 ± 0.021 | 2.892 ± 0.105 | 0.401 ± 0.020 |
| Exposure neg | 0.603 ± 0.093 | 0.137 ± 0.016 | 1.860 ± 0.207 | 0.295 ± 0.032 | 2.902 ± 0.221 | 0.405 ± 0.033 |

Table 3: Recall (R) and NDCG (N) results (mean ± std) obtained by using a single different stage sample as the cascade sample during the retrieval stage, with units of %. The best and baseline results are based on the paired t-test at the significance level 5%.

| Cascade Type | R@100 | N@100 | R@500 | N@500 | R@1000 | N@1000 |
|---|---|---|---|---|---|---|
| Baseline | 0.461 ± 0.085 | 0.099 ± 0.085 | 1.593 ± 0.229 | 0.241 ± 0.045 | 2.685 ± 0.186 | 0.356 ± 0.040 |
| Prerank neg | 0.677 ± 0.061 | 0.167 ± 0.041 | 2.268 ± 0.129 | 0.367 ± 0.048 | 3.446 ± 0.111 | 0.492 ± 0.042 |
| Coarse neg | 0.665 ± 0.120 | 0.163 ± 0.045 | 2.253 ± 0.052 | 0.361 ± 0.037 | 3.371 ± 0.090 | 0.479 ± 0.038 |
| Rank neg | 0.704 ± 0.150 | 0.173 ± 0.049 | 2.282 ± 0.250 | 0.373 ± 0.055 | 3.410 ± 0.203 | 0.491 ± 0.052 |
| Rank pos | 0.685 ± 0.094 | 0.151 ± 0.025 | 2.191 ± 0.085 | 0.340 ± 0.023 | 3.346 ± 0.078 | 0.462 ± 0.019 |
| Rerank neg | 0.707 ± 0.083 | 0.163 ± 0.024 | 2.273 ± 0.121 | 0.359 ± 0.024 | 3.338 ± 0.083 | 0.471 ± 0.022 |
| Rerank pos | 0.795 ± 0.108 | 0.176 ± 0.025 | 2.263 ± 0.078 | 0.361 ± 0.017 | 3.394 ± 0.048 | 0.480 ± 0.016 |
| Exposure neg | 0.692 ± 0.071 | 0.156 ± 0.028 | 2.150 ± 0.108 | 0.340 ± 0.033 | 3.266 ± 0.183 | 0.458 ± 0.036 |
| FS-LTR | 0.803 ± 0.095 | 0.215 ± 0.027 | 2.466 ± 0.090 | 0.425 ± 0.029 | 3.606 ± 0.060 | 0.545 ± 0.024 |

3.1.2 INTERPLAY BETWEEN RETRIEVAL AND SUBSEQUENT STAGES

The most important characteristic of industrial RS is its multi-stage structure. Every stage has its own duty and mature paradigm.
The goal of every stage is the same: to fit the user's preference. Although the models of all stages aim at fitting the user's preference, they cannot capture it perfectly. Few people focus on the interplay between stages. Academic researchers lack available datasets, and industrial engineers only devote effort to their assigned stage. Zheng et al. (2024) pointed out that two factors influence a video's exposure and the user's feedback. The first is the user's preference for the video. The second is the preference of the subsequent stages towards the video. For example, a video that the user likes may be retrieved during the retrieval stage but filtered out by the ranking model due to its imperfect preference estimation ability. This video is useless to the whole RS because it cannot be exposed to the user at all. The optimal solution for the model of each stage is to select videos that satisfy the preference of the user and of the subsequent stages simultaneously. FS-LTR (Zheng et al., 2024) proposes the Generalized Probability Ranking Principle (GPRP) to prove theoretically that this solution is optimal. We implement FS-LTR in this section to see its effectiveness. The user's preference can be learned from the positive feedback samples and randomly sampled negative samples. To learn the preference of subsequent stages, we introduce an additional ranking loss, which forces the logits of samples from high-priority stages to be larger than the logits of samples from low-priority stages. The stage priorities are {positive:6, exposure neg:5, rerank pos:4, rank pos:4, rerank neg:3, rank neg:3, coarse neg:2, pre-rank neg:1, random neg:0}. Exposure neg represents a video that has been exposed to the user (realshow=1) but obtains negative feedback. This definition of priority applies throughout the paper. We always keep one positive sample with 200 negative samples.
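The priority-based ranking idea can be sketched as a generic pairwise, BPR-style loss: for every pair of samples whose stage priorities differ, the loss penalizes the higher-priority sample for not scoring above the lower-priority one. This is a minimal illustration under stated assumptions, not the exact formulation used by FS-LTR or by the paper's equations. The underscore-separated stage keys, the function names, and the choice to average over all ordered pairs are hypothetical.

```python
import math

# Stage priorities as listed in the text; key spellings are assumed.
PRIORITY = {
    "positive": 6, "exposure_neg": 5, "rerank_pos": 4, "rank_pos": 4,
    "rerank_neg": 3, "rank_neg": 3, "coarse_neg": 2, "prerank_neg": 1,
    "random_neg": 0,
}


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def priority_pairwise_loss(logits, stages):
    """Mean of -log(sigmoid(s_i - s_j)) over pairs where stage i outranks stage j.

    Samples with equal priority (e.g. rank_pos and rerank_pos) form no pair.
    """
    total, n_pairs = 0.0, 0
    for s_i, stage_i in zip(logits, stages):
        for s_j, stage_j in zip(logits, stages):
            if PRIORITY[stage_i] > PRIORITY[stage_j]:
                total += softplus(-(s_i - s_j))  # equals -log(sigmoid(s_i - s_j))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


stages = ["positive", "rank_pos", "random_neg"]
print(priority_pairwise_loss([3.0, 1.0, -2.0], stages))   # logits agree with priorities
print(priority_pairwise_loss([-2.0, 1.0, 3.0], stages))   # logits contradict priorities
```

With only a positive and random negatives in the batch, this reduces to the usual BPR objective; adding stage samples introduces the extra ordering constraints between stages.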
We first introduce the stage preference one stage at a time by replacing random negatives with stage samples with BPR as Eq( 1): j {k:pk