Set-to-Sequence Ranking-Based Concept-Aware Learning Path Recommendation

Xianyu Chen1, Jian Shen1, Wei Xia2, Jiarui Jin1, Yakun Song1, Weinan Zhang1, Weiwen Liu2, Menghui Zhu1, Ruiming Tang2, Kai Dong3, Dingyin Xia3, Yong Yu1*

1 Shanghai Jiao Tong University  2 Huawei Noah's Ark Lab  3 Huawei Technologies Co., Ltd.
{xianyujun,rocky,jinjiarui97,ereboas,wnzhang,zerozmi7}@sjtu.edu.cn, yuyong@apex.sjtu.edu.cn
{xiawei24,liuweiwen8,tangruiming,dongkai4,xiadingyin}@huawei.com

*Corresponding author. Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

With the development of online education systems, personalized education recommendation has played an essential role. In this paper, we focus on developing path recommendation systems that aim to generate and recommend an entire learning path to a given user in each session. Noticing that existing approaches fail to consider the correlations of concepts in the path, we propose a novel framework named Set-to-Sequence Ranking-Based Concept-Aware Learning Path Recommendation (SRC), which formulates the recommendation task under a set-to-sequence paradigm. Specifically, we first design a concept-aware encoder module that can capture the correlations among the input learning concepts. The outputs are then fed into a decoder module that sequentially generates a path through an attention mechanism handling the correlations between the learning and target concepts. Our recommendation policy is optimized by policy gradient. In addition, we introduce an auxiliary module based on knowledge tracing to enhance the model's stability by evaluating students' learning effects on the learning concepts. We conduct extensive experiments on two real-world public datasets and one industrial dataset, and the experimental results demonstrate the superiority and effectiveness of SRC. The code is available at https://gitee.com/mindspore/models/tree/master/research/recommend/SRC.

Introduction

Different from traditional learning, which provides the same learning content for all students in each classroom session, adaptive learning aims to tailor learning objectives to the individual needs of different learners (Carbonell 1970). Existing recommendation methods for learning content can be summarized into two categories: (i) recommending the next learning item for students step by step in real time, where the interaction at each step (i.e., the student's answer) is integrated into the recommendation for the next step (Liu et al. 2019; Cai, Zhang, and Dai 2019; Huang et al. 2019); and (ii) planning a learning path of a certain length for students at one time, because users sometimes want to know the entire learning path at the beginning (for example, universities need to organize courses for students) (Joseph, Abraham, and Mani 2022; Chen et al. 2022; Shao, Guo, and Pardos 2021; Bian et al. 2019; Dwivedi, Kant, and Bharadwaj 2018). As the latter direction is more restricted and complex (e.g., larger search space, less available feedback), it is more challenging and is the main focus of this paper.
Previous studies (Piaget and Duckworth 1970; Inhelder et al. 1976; Pinar et al. 1995) reveal that cognitive structure greatly influences adaptive learning, which includes both the relationships between items (e.g., prerequisite and synergy relationships) and the characteristics of students' dynamic development through learning. Most existing methods for learning path planning are either based on a knowledge graph (or some relationship between concepts) to constrain path generation (Liu et al. 2019; Shi et al. 2020; Wang et al. 2022), or based on collaborative filtering of features to search for paths (Joseph, Abraham, and Mani 2022; Chen et al. 2022; Nabizadeh, Jorge, and Leal 2019). However, these models cannot fully capture the important features of the cognitive structure, and the models themselves are relatively simple, so the generated paths are either poorly personalized or yield a poor learning effect. From this perspective, we argue that effectively mining the correlations among concepts and the important characteristics of students in learning path planning is still challenging, and we summarize the specific challenges as follows:

Figure 1: Illustration of the student learning process to improve a student's mastery of target concept D (Machine Learning) by learning a path composed of three learning concepts: A (Mathematical Analysis), B (Probability Theory), and C (Linear Algebra). The student is given a test before and after the path (scores 5 and 80) to measure his mastery of D. The bottom four tables show the student's underlying mastery of all concepts at each step, which is not accessible during the training process; we can only observe the mastery of the current learning concept. For example, after learning B, we know his mastery of B is 80.

(C1) How to effectively explore the correlations among concepts? There may be complex and diverse correlations between concepts, such as prerequisite and synergy relationships, which affect students' learning of concepts (Tong et al. 2020; Liu et al. 2019). As shown in Figure 1, mastery of course A (Mathematical Analysis) is of greater help to mastery of course B (Probability Theory), and of less help to mastery of course C (Linear Algebra). Therefore, these correlations should be taken into account when planning the learning path.

(C2) How to evaluate and optimize the generation algorithm by effectively using the students' learning effect on the target concepts? As shown in Figure 1, we expect students to achieve the best improvement on the target concept D (Machine Learning). However, existing path recommendation algorithms either do not use this feedback but rely on indirect factors such as similarity degree and occurrence probability (Joseph, Abraham, and Mani 2022; Shao, Guo, and Pardos 2021), or lack effective generation algorithms (Zhou et al. 2018; Nabizadeh, Jorge, and Leal 2019). As a result, it is difficult for them to provide an efficient learning path. The underlying difficulty is optimizing a path using only feedback that becomes available at the end of the path. In contrast, in the stepwise recommendation scenario, immediate feedback can be obtained at the end of each step, which allows more advanced reinforcement learning (RL) algorithms (Sun et al. 2021; Li et al. 2021) to be applied.
(C3) How can student feedback on the learning concepts be incorporated into the model? As shown in Figure 1, students give different learning feedback for concepts A, B, and C on the path after learning them. In the field of knowledge tracing (KT), this information plays a great role in modeling students' knowledge levels. Many models (Piech et al. 2015; Yang et al. 2020; Zhang et al. 2017) take students' past answers as features to predict the current answer. In Liu et al. (2019), a DKT (Piech et al. 2015) module uses this information to trace students' knowledge levels in real time and adjust the recommendation for the next step. However, in path recommendation this feedback can only be obtained after the path ends, so the above approach is difficult to apply here.

To address these challenges, we propose a novel framework, Set-to-Sequence Ranking-Based Concept-Aware Learning Path Recommendation (SRC), which formulates the learning path recommendation task under a set-to-sequence paradigm. First, in order to mine the correlations between concepts (C1), we design a concept-aware encoder module. This module globally calculates the correlation between each learning concept and the other learning concepts in the set, so as to obtain a richer representation of each concept. In the decoder module, we use a recurrent neural network to update the state of the student, and we use an attention mechanism to calculate the correlations between the remaining learning concepts in the set and the target concepts, so as to select the most suitable concept for the current position of the path. Second, we need to effectively utilize feedback on the target concepts (C2). Since this feedback is generally continuous and the path space is large, the policy gradient algorithm is well suited here: the correlations between the learning concepts and the target concepts calculated by the decoder can be expressed as selection probabilities, which yields a parameterized policy whose parameters can be updated to maximize the reward. Finally, we design an auxiliary module to utilize feedback on the learning concepts (C3). Similar to the KT task, the student state updated by the decoder at each step is fed into an MLP to predict the student's answer at that step. In this way, students' feedback on the learning concepts participates in the updating of the model parameters and enhances the stability of the algorithm.

Related Works

Learning Path Recommendation. One branch of existing methods (Joseph, Abraham, and Mani 2022; Chen et al. 2022; Zhou et al. 2018; Nabizadeh, Jorge, and Leal 2019; Liu and Li 2020; Shao, Guo, and Pardos 2021) models the task as a general sequence recommendation task, dedicated to reconstructing the user's behavior sequence. For example, Zhou et al. (2018) use KNN (Cover and Hart 1967) for collaborative filtering and then an RNN to estimate the learning effect; Shao, Guo, and Pardos (2021) directly apply the BERT (Devlin et al. 2018) paradigm to this problem. Another branch (Liu et al. 2019; Shi et al. 2020; Wang et al. 2022; Zhu et al. 2018) focuses on mining the role of knowledge structure. For example, Zhu et al. (2018) formulate rules based on the knowledge structure to constrain the generation of paths. In general, most of the above methods fail to take full advantage of student feedback on the target concepts.
One of the better ones is Liu et al. (2019), which uses this feedback through reinforcement learning to optimize the generative model. However, on the one hand, it can obtain real-time feedback on the learning concepts, so its application scenario is actually step-by-step recommendation; on the other hand, it uses the concept relationship graph as a rule to constrain path generation without mining the correlations deeply, which makes it challenging to apply in the general case. In our method, we use an attention mechanism to mine inter-concept correlations and make full use of the various kinds of student feedback to optimize the modeling of these correlations, which makes our method more general.

Learning Item Recommendation. In step-by-step learning item recommendation, immediate feedback is available. This allows these methods (Cai, Zhang, and Dai 2019; Huang et al. 2019; Sun et al. 2021; Li et al. 2021) to use more complex RL algorithms: for example, Sun et al. (2021) use DQN (Mnih et al. 2013), and Cai, Zhang, and Dai (2019) use Advantage Actor-Critic. Our method also uses the policy gradient in RL for optimization, but since we have no immediate feedback, only delayed feedback after the path ends, training may be more difficult. We therefore introduce the KT auxiliary task to enhance model stability.

Set-to-Sequence Formulation. The set-to-sequence task aims to permute and organize a set of unordered candidate items into a sequence; its solutions can be roughly divided into three fields: point-wise, pair-wise, and list-wise. Among them, the point-wise method is the most widely used; it scores each item individually and then ranks the items in descending order of their scores (Friedman 2001). The pair-wise methods (Burges et al. 2005; Joachims 2006) do not care about the specific score of each item; instead, they focus on predicting the relative order within each pair of items. The list-wise algorithms (Burges 2010; Cao et al. 2007; Xia et al. 2008) treat the entire sequence as a whole, which allows the model to carefully mine the deep correlations among the items. Noticing that a student's feedback on a concept is likely to be significantly affected by the other concepts on the same path, we design our model in a list-wise manner. The main difficulty with the list-wise approach is that the sorting process is not completely differentiable, because no gradients are available for sorting operations (Xia et al. 2008). One class of solutions optimizes the ranking network through continuous relaxations (Grover et al. 2019; Swezey et al. 2020). Another branch, the Plackett-Luce (PL) ranking model (Burges 2010; Luce 2012; Plackett 1975), represents ranking as a series of decision problems, where each decision is made by a softmax operation. Its probabilistic nature leads to more robust performance (Bruch et al. 2020), but computing the gradient of the PL model requires iterating over every possible permutation. A solution proposed in the recent literature (Oosterhuis 2021) is the policy gradient algorithm (Williams 1992).

Problem Formulation

Consider a student $u$ whose historical concept-learning sequence is $H = \{h_1, h_2, \ldots, h_k\}$. The record $h_t = \{c_t, y_t\}$ at each time $t$ includes the learned concept $c_t$ and the degree of mastery $y_t$ of that concept. Now, given a set $S = \{s_1, s_2, \ldots, s_m\}$ consisting of $m$ candidate concepts, the student $u$ is to learn $n$ non-repetitive concepts from $S$ in some order (hence $m \geq n$). Through the study of such a learning path $\pi = \{\pi_1, \pi_2, \ldots, \pi_n\}$, the student can improve his mastery of some target concepts $T = \{t_1, t_2, \ldots\}$. Following Liu et al. (2019), we quantify the learning effect as

$$\mathcal{E}_T = \frac{E_e - E_b}{E_{sup} - E_b}, \tag{1}$$

where $E_b$ and $E_e$ represent the student's mastery of the target concepts before and after the path $\pi$, respectively (which can be obtained through exams), and $E_{sup}$ represents the upper bound of mastery.
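To make the metric concrete, here is a minimal Python sketch of Eq. (1); the function name and the 0–100 score scale (borrowed from the scores shown in Figure 1) are illustrative assumptions rather than details specified in the paper.

```python
def learning_effect(e_b: float, e_e: float, e_sup: float = 100.0) -> float:
    """Normalized learning effect E_T of Eq. (1).

    e_b:   mastery of the target concepts before the path (pre-test score)
    e_e:   mastery of the target concepts after the path (post-test score)
    e_sup: upper bound of mastery; the 0-100 scale is an assumption
           taken from the scores shown in Figure 1.
    """
    return (e_e - e_b) / (e_sup - e_b)

# Worked example from Figure 1: the student scores 5 on target concept D
# before the path and 80 after it, so E_T = (80 - 5) / (100 - 5) ~= 0.79.
print(learning_effect(e_b=5, e_e=80))
```

Normalizing by $E_{sup} - E_b$ makes gains comparable across students with different starting levels: the same absolute improvement counts for more when there is less room left to improve.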
At the same time, we can also observe the student's mastery $Y_\pi = \{y_{\pi_1}, y_{\pi_2}, \ldots, y_{\pi_n}\}$ of the learning concepts after the end of the path. We can then formulate our problem as follows:

Definition 1. Learning Path Recommendation. Given a student's historical learning sequence $H$, target concepts $T$, and candidate concept set $S$, select $n$ concepts from $S$ without repetition and rank them to generate a path $\pi$ to recommend to the student. The ultimate goal is to maximize the learning effect $\mathcal{E}_T$ at the end of the path.

SRC Framework

Figure 2 shows the overall framework of our SRC model. First, we design a concept-aware encoder to model the correlations among the candidate learning concepts and obtain their global representations. Then, in the decoder module, we use a recurrent neural network to model the knowledge state of the student along the path, and we calculate the correlations between the learning concepts and the target concepts through an attention mechanism to determine the most suitable concept for each position. In addition, based on the knowledge state obtained in the decoder, we further predict the student's answer to each learning concept. At the end of the learning path, we pass the obtained feedback $\mathcal{E}_T$ and $Y_\pi$ to the model to optimize the parameters.

Figure 2: The overview of our framework. SRC is composed of the encoder (embedding layer, self-attention, and MLP layer), the decoder (attention layer, softmax, and sampling of the output path), and the KT auxiliary module (KT prediction layer). The encoder captures the correlations between concepts in the candidate set $S$ to obtain the concept representations $E_s$. The decoder generates a ranking of $S$ based on $E_s$, $T$, and $H$, and outputs the policy $\pi$. The KT auxiliary module is responsible for predicting the correct probability at each step on the path.

Encoder

First, for each concept $s_i$ in the candidate concept set $S$, we query the embedding layer to obtain its continuous representation $x_{s_i}$. However, as discussed in the introduction, there are complex and diverse correlations between concepts, and these correlations can strongly affect the final learning effect of the path. The embedding representation reflects only the characteristics of each concept in isolation and cannot reflect the correlations between concepts. Therefore, we need a function $f^e$ to capture these correlations within the set and fuse them into the concept representations to obtain the global representation $E_s$:

$$E_s = f^e(X_s), \quad X_s = [x_{s_1}, x_{s_2}, \ldots, x_{s_m}]^\top. \tag{2}$$

For the implementation of $f^e$, a simple approach is to add a pooled summary of the concept representations (e.g., average pooling) to each concept; unfortunately, this cannot model complex correlations. Recent literature (Pang et al. 2020; Lee et al. 2019) uses the more complex Transformer to extract information, but it mainly focuses on correlations and thus pays less attention to the unique characteristics of each concept itself. Note also that our training follows the policy gradient paradigm with only one reward per path. With such label sparsity, complex models like the Transformer are extremely difficult to train due to potential over-smoothing issues (Liu et al. 2021). We verify this empirically in our experiments, which motivates us to combine the above two approaches. First, we apply the self-attention mechanism to $X_s$:

$$E^a_s = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V, \tag{3}$$

$$Q = X_s W^Q, \quad K = X_s W^K, \quad V = X_s W^V, \tag{4}$$

where $W^Q$, $W^K$, and $W^V$ are trainable weights and $d$ is the dimension of the embedding.
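For concreteness, the following is a minimal PyTorch sketch of the self-attention branch in Eqs. (3)–(4). The single-head formulation, the absence of dropout and residual connections, and the tensor shapes are our assumptions for readability; the paper does not specify these implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptSelfAttention(nn.Module):
    """Single-head self-attention over the candidate set, per Eqs. (3)-(4).

    A sketch under assumptions: one attention head and no dropout or
    residual connection, since the paper leaves such details unspecified.
    """
    def __init__(self, d: int):
        super().__init__()
        self.d = d
        # W^Q, W^K, W^V of Eq. (4) as bias-free linear maps.
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)

    def forward(self, x_s: torch.Tensor) -> torch.Tensor:
        # x_s: (m, d) embedding matrix X_s of the m candidate concepts.
        q, k, v = self.w_q(x_s), self.w_k(x_s), self.w_v(x_s)
        # Eq. (3): softmax(Q K^T / sqrt(d)) V, so every concept's
        # representation aggregates information from the whole set.
        attn = F.softmax(q @ k.T / self.d ** 0.5, dim=-1)
        return attn @ v  # (m, d): the correlation-aware E^a_s

# Usage: m = 6 candidate concepts with d = 32 dimensional embeddings.
e_a_s = ConceptSelfAttention(d=32)(torch.randn(6, 32))
```

The MLP-plus-pooling branch of Eq. (5) below can be implemented analogously and concatenated with this output as in Eq. (6).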
At the same time, we pass the embeddings through a simple multilayer perceptron (MLP) and add the average-pooled summary:

$$E^l_s = E^l + \frac{\sum_{i=1}^{m} e^l_i}{m}, \quad E^l = f^l(X_s), \tag{5}$$

where $f^l$ is the MLP and $e^l_i$ is the feature of the $i$-th concept in $E^l$. The final representation of the learning concepts in $S$ is then

$$E_s = [E^a_s ; E^l_s], \tag{6}$$

where $[\cdot\,;\cdot]$ is the concatenation operation. In this way, we obtain representations $E_s$ that are aware of the other concepts in the set while retaining each concept's own characteristics.

Decoder

After obtaining the representation of each learning concept, we generate their permutation and its probability in the decoder module:

$$\pi, P = f^d(E_s, H, T). \tag{7}$$

The implementation of $f^d$ follows the Pointer Network (Vinyals, Fortunato, and Jaitly 2015). First, we design an LSTM (denoted as $g$) (Hochreiter and Schmidhuber 1997) to trace student states. The initial state $v_0$ of the student in $g$ before the start of the path should be related to the student's past learning sequence $H$. Considering that each step $i$ in $H$ contains both the learning concept $c_i$ and the mastery degree $y_i$, $v_0$ is calculated as

$$v_0 = g([x_{c_1}; y_1]W_h, [x_{c_2}; y_2]W_h, \ldots, [x_{c_k}; y_k]W_h), \tag{8}$$

where $x_{c_i}$ is the embedding of concept $c_i$ and $W_h$ is a trainable matrix that transforms $[x_{c_i}; y_i]$ into the same input dimension as $E_s$. Now assume that the state after the $(i-1)$-th concept $\pi_{i-1}$ in the learning path is $v_{i-1}$.
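To illustrate how the pieces fit together, below is a hypothetical sketch of one rollout of the decoder with the two training signals described above: the path-level reward $\mathcal{E}_T$ optimized by the policy gradient (REINFORCE), and the KT auxiliary head that predicts the observed mastery $Y_\pi$ at each step. This is a simplified reading of the framework, not the authors' implementation: scoring the remaining candidates with an MLP over the candidate representation, the student state, and a pooled target representation stands in for the paper's attention mechanism, and all shapes, names, and the BCE form of the KT loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCDecoderSketch(nn.Module):
    """Pointer-style decoder with a KT head (hypothetical sketch).

    The MLP scorer below stands in for the paper's attention between the
    remaining learning concepts and the target concepts; it is an
    assumption, not the authors' exact formulation.
    """
    def __init__(self, d: int):
        super().__init__()
        self.cell = nn.LSTMCell(d, d)   # g: traces the student state v_i
        self.score = nn.Sequential(     # scores each remaining candidate
            nn.Linear(3 * d, d), nn.Tanh(), nn.Linear(d, 1))
        self.kt_head = nn.Linear(d, 1)  # KT auxiliary predictor

    def forward(self, e_s, t_repr, v0, n):
        # e_s: (m, d) encoder outputs; t_repr: (d,) pooled target concepts;
        # v0: (d,) initial student state from Eq. (8); n: path length.
        m = e_s.size(0)
        h, c = v0, torch.zeros_like(v0)
        picked = torch.zeros(m, dtype=torch.bool)  # mask of chosen concepts
        log_probs, kt_logits, path = [], [], []
        for _ in range(n):
            feats = torch.cat(
                [e_s, h.expand(m, -1), t_repr.expand(m, -1)], dim=-1)
            logits = self.score(feats).squeeze(-1)
            logits = logits.masked_fill(picked, float("-inf"))
            dist = torch.distributions.Categorical(logits=logits)
            pick = dist.sample()             # sample pi_i from the policy
            log_probs.append(dist.log_prob(pick))
            picked[pick] = True
            path.append(int(pick))
            h, c = self.cell(e_s[pick].unsqueeze(0),
                             (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
            kt_logits.append(self.kt_head(h))  # predicted mastery, this step
        return path, torch.stack(log_probs), torch.cat(kt_logits)

# One hypothetical training step with terminal-only feedback.
d, m, n = 32, 6, 4
decoder = SRCDecoderSketch(d)
path, logp, kt = decoder(torch.randn(m, d), torch.randn(d), torch.randn(d), n)
e_t = 0.5             # reward E_T observed only at the end of the path
y_pi = torch.rand(n)  # observed mastery Y_pi of the learning concepts
# REINFORCE on the path-level reward plus the KT auxiliary (BCE) loss.
loss = -e_t * logp.sum() + F.binary_cross_entropy_with_logits(kt, y_pi)
loss.backward()
```

Sampling from the masked softmax over unpicked candidates makes the generated path a draw from a Plackett-Luce-style policy, which is exactly the setting in which the REINFORCE estimator applies.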
Impact of Length

As shown in Figure 3, the rewards no longer improve once the path length exceeds 20. This is probably because the concepts that make up the path in this scenario are selectable by the model: the number of concepts that are helpful for learning the target concepts is limited, and they are already selected when the path is short, so concepts added on longer paths have little value and are offset by factors such as forgetting.

Figure 3: Impact of length. Rewards of SRC, GRU4Rec, Rule-Based, and DQN for path lengths from 10 to 30 (the right panel shows the setting p = 2).

Result for Industrial Dataset

To verify the effectiveness of SRC in practical applications, we deploy our model in the online education department of Huawei. Table 4 shows experimental results on the company's internal industrial dataset, which includes 159 students and 614 concepts, with an average trajectory length of 108.99. SRC again shows the best performance, and the behavior of the other methods is similar to that on the public datasets. The main difference is that negative rewards appear more frequently here. Under DKT, the random method even has negative rewards close to -1; correspondingly, under DKT, SRC can learn rewards close to the upper bound of 1. Under CoKT, although the reward of the learned optimal path is still negative in some cases, the fluctuation range is greatly reduced. These results reflect the different properties of the difficulty and forgetting curves under simulators built on different datasets. Our model shows the best performance in all of these situations, indicating the effectiveness and generalization of our method. Further online experiments are being deployed, and results will be collected in the coming weeks.

Table 4: Results on the industrial dataset (top block: DKT simulator; bottom block: CoKT simulator).

| Simulator | Model | p=0 | p=1 | p=2 | p=3 |
|---|---|---|---|---|---|
| DKT | Rule-based | 0.0319 | 0.2092 | 0.1622 | 0.9507 |
| DKT | Random | -0.8202 | -0.7088 | -0.8098 | -0.8885 |
| DKT | DQN | 0.4495 | -0.2504 | -0.6021 | 0.8800 |
| DKT | SRC | 0.9319* | 0.4701* | 0.2842* | 0.9861* |
| CoKT | Rule-based | -0.0445 | -0.0631 | -0.0819 | -0.0595 |
| CoKT | Random | -0.0637 | -0.0848 | -0.0832 | -0.0548 |
| CoKT | DQN | 0.0101 | -0.0092 | -0.0215 | -0.0504 |
| CoKT | SRC | 0.1288* | -0.0042* | -0.0159* | 0.2577* |

Conclusion

In this paper, we formulate path recommendation in online education systems as a set-to-sequence task and propose a new set-to-sequence ranking-based concept-aware framework named SRC. Specifically, we first design a concept-aware encoder module that captures the correlations between the input learning concepts. The output is then fed into a decoder module that sequentially generates a path through an attention mechanism handling the correlations between the learning and target concepts. The recommendation policy is optimized through policy gradient. In addition, we introduce an auxiliary module based on knowledge tracing to enhance the stability of the model by evaluating students' learning effects on the learned concepts. We conduct extensive experiments on two real-world public datasets and one industrial proprietary dataset, where SRC demonstrates its superiority over the other baselines. In future work, it might be interesting to further explore the relationships between concepts, for example using graph neural networks. We also plan to further deploy our model in real-world online education systems.

Acknowledgements

The SJTU team is partially supported by the National Natural Science Foundation of China (62177033). The work is also sponsored by the Huawei Innovation Research Program. We gratefully acknowledge the support of MindSpore (MindSpore 2022), CANN (Compute Architecture for Neural Networks), and the Ascend AI Processor used for this research. We also thank Liang Yin from Shanghai Jiao Tong University.

References

Bian, C.-L.; Wang, D.-L.; Liu, S.-Y.; Lu, W.-G.; and Dong, J.-Y. 2019. Adaptive learning path recommendation based on graph theory and an improved immune algorithm. KSII Transactions on Internet and Information Systems (TIIS), 13(5): 2277–2298.

Bruch, S.; Han, S.; Bendersky, M.; and Najork, M. 2020. A stochastic treatment of learning to rank scoring functions. In Proceedings of the 13th International Conference on Web Search and Data Mining, 61–69.

Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, 89–96.

Burges, C. J. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11(23-581): 81.

Cai, D.; Zhang, Y.; and Dai, B. 2019. Learning path recommendation based on knowledge tracing model and reinforcement learning. In 2019 IEEE 5th International Conference on Computer and Communications (ICCC), 1881–1885. IEEE.

Cao, Z.; Qin, T.; Liu, T.-Y.; Tsai, M.-F.; and Li, H. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, 129–136.

Carbonell, J. R. 1970. AI in CAI: An artificial-intelligence approach to computer-assisted instruction. IEEE Transactions on Man-Machine Systems, 11(4): 190–202.
Chang, H.-S.; Hsu, H.-J.; and Chen, K.-T. 2015. Modeling Exercise Relationships in E-Learning: A Unified Approach. In EDM, 532–535.

Chen, Y.-H.; Huang, N.-F.; Tzeng, J.-W.; Lee, C.-A.; Huang, Y.-X.; and Huang, H.-H. 2022. A Personalized Learning Path Recommender System with Line Bot in MOOCs Based on LSTM. In 2022 11th International Conference on Educational and Information Technology (ICEIT), 40–45. IEEE.

Cover, T.; and Hart, P. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1): 21–27.

Deisenroth, M.; and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 465–472. Citeseer.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dwivedi, P.; Kant, V.; and Bharadwaj, K. K. 2018. Learning path recommendation based on modified variable length genetic algorithm. Education and Information Technologies, 23(2): 819–836.

Feng, M.; Heffernan, N.; and Koedinger, K. 2009. Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction, 19(3): 243–266.

Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.

Grover, A.; Wang, E.; Zweig, A.; and Ermon, S. 2019. Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850.

Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8): 1735–1780.

Hu, Y.; Da, Q.; Zeng, A.; Yu, Y.; and Xu, Y. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 368–377.

Huang, Z.; Liu, Q.; Zhai, C.; Yin, Y.; Chen, E.; Gao, W.; and Hu, G. 2019. Exploring multi-objective exercise recommendations in online education systems. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 1261–1270.

Inhelder, B.; Chipman, H. H.; Zwingmann, C.; et al. 1976. Piaget and His School: A Reader in Developmental Psychology. Springer.

Joachims, T. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 217–226.

Joseph, L.; Abraham, S.; and Mani, B. P. 2022. Exploring the Effectiveness of Learning Path Recommendation based on Felder-Silverman Learning Style Model: A Learning Analytics Intervention Approach. Journal of Educational Computing Research, 07356331211057816.

Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; and Teh, Y. W. 2019. Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, 3744–3753. PMLR.

Li, X.; Xu, H.; Zhang, J.; and Chang, H.-H. 2021. Optimal hierarchical learning path design with reinforcement learning. Applied Psychological Measurement, 45(1): 54–70.

Liu, D.; Wang, S.; Ren, J.; Wang, K.; Yin, S.; and Zhang, Q. 2021. Trap of Feature Diversity in the Learning of MLPs. arXiv preprint arXiv:2112.00980.

Liu, H.; and Li, X. 2020. Learning path combination recommendation based on the learning networks. Soft Computing, 24(6): 4427–4439.
Liu, Q.; Tong, S.; Liu, C.; Zhao, H.; Chen, E.; Ma, H.; and Wang, S. 2019. Exploiting cognitive structure for adaptive learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 627–635.

Long, T.; Qin, J.; Shen, J.; Zhang, W.; Xia, W.; Tang, R.; He, X.; and Yu, Y. 2022. Improving Knowledge Tracing with Collaborative Information. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 599–607.

Luce, R. D. 2012. Individual Choice Behavior: A Theoretical Analysis. Courier Corporation.

MindSpore. 2022. MindSpore. https://www.mindspore.cn/.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Nabizadeh, A. H.; Jorge, A. M.; and Leal, J. P. 2019. Estimating time and score uncertainty in generating successful learning paths under time constraints. Expert Systems, 36(2): e12351.

Oosterhuis, H. 2021. Computationally efficient optimization of Plackett-Luce ranking models for relevance and fairness. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1023–1032.

Pang, L.; Xu, J.; Ai, Q.; Lan, Y.; Cheng, X.; and Wen, J. 2020. SetRank: Learning a permutation-invariant ranking model for information retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 499–508.

Piaget, J.; and Duckworth, E. 1970. Genetic epistemology. American Behavioral Scientist, 13(3): 459–480.

Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L. J.; and Sohl-Dickstein, J. 2015. Deep knowledge tracing. Advances in Neural Information Processing Systems, 28.

Pinar, W. F.; Reynolds, W. M.; Slattery, P.; Taubman, P. M.; et al. 1995. Understanding Curriculum: An Introduction to the Study of Historical and Contemporary Curriculum Discourses, volume 17. Peter Lang.

Plackett, R. L. 1975. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2): 193–202.

Shao, E.; Guo, S.; and Pardos, Z. A. 2021. Degree planning with PLAN-BERT: Multi-semester recommendation using future courses of interest. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 14920–14929.

Shi, D.; Wang, T.; Xing, H.; and Xu, H. 2020. A learning path recommendation model based on a multidimensional knowledge graph framework for e-learning. Knowledge-Based Systems, 195: 105618.

Sun, Y.; Zhuang, F.; Zhu, H.; He, Q.; and Xiong, H. 2021. Cost-effective and interpretable job skill recommendation with deep reinforcement learning. In Proceedings of the Web Conference 2021, 3827–3838.

Swezey, R.; Grover, A.; Charron, B.; and Ermon, S. 2020. PiRank: Learning to rank via differentiable sorting. arXiv preprint arXiv:2012.06731.

Tong, S.; Liu, Q.; Huang, W.; Huang, Z.; Chen, E.; Liu, C.; Ma, H.; and Wang, S. 2020. Structure-based knowledge tracing: an influence propagation view. In 2020 IEEE International Conference on Data Mining (ICDM), 541–550. IEEE.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. Advances in Neural Information Processing Systems, 28.

Wang, X.; Liu, K.; Wang, D.; Wu, L.; Fu, Y.; and Xie, X. 2022. Multi-level recommendation reasoning over knowledge graphs with reinforcement learning. In Proceedings of the ACM Web Conference 2022, 2098–2108.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3): 229–256.

Xia, F.; Liu, T.-Y.; Wang, J.; Zhang, W.; and Li, H. 2008. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, 1192–1199.

Yang, Y.; Shen, J.; Qu, Y.; Liu, Y.; Wang, K.; Zhu, Y.; Zhang, W.; and Yu, Y. 2020. GIKT: a graph-based interaction model for knowledge tracing. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 299–315. Springer.

Zhang, J.; Shi, X.; King, I.; and Yeung, D.-Y. 2017. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, 765–774.

Zhou, Y.; Huang, C.; Hu, Q.; Zhu, J.; and Tang, Y. 2018. Personalized learning full-path recommendation model based on LSTM neural networks. Information Sciences, 444: 135–152.

Zhu, H.; Tian, F.; Wu, K.; Shah, N.; Chen, Y.; Ni, Y.; Zhang, X.; Chao, K.-M.; and Zheng, Q. 2018. A multi-constraint learning path recommendation algorithm based on knowledge map. Knowledge-Based Systems, 143: 102–114.