# Computerized Adaptive Testing via Collaborative Ranking

Zirui Liu1, Yan Zhuang1, Qi Liu1,2, Jiatong Li1, Yuren Zhang1, Zhenya Huang1, Jinze Wu3, Shijin Wang3
1: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
2: Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
3: iFLYTEK Co., Ltd
{liuzirui,zykb,cslijt,yr160698,hxwjz}@mail.ustc.edu.cn
{qiliuql,huangzhy}@ustc.edu.cn, sjwang3@iflytek.com

Abstract

With the deep integration of machine learning and intelligent education, Computerized Adaptive Testing (CAT) has received increasing research attention. Compared to traditional paper-and-pencil tests, CAT delivers personalized and interactive assessments by automatically adjusting the testing questions according to students' performance during the test. CAT is therefore recognized as an efficient testing methodology capable of accurately estimating a student's ability with a minimal number of questions, which has led to its widespread adoption in mainstream selective exams such as the GMAT and GRE. However, merely improving the accuracy of ability estimation is far from satisfactory in real-world scenarios, since an accurate ranking of students is usually more important (e.g., in high-stakes exams). Considering the shortcomings of existing CAT solutions in student ranking, this paper emphasizes the importance of aligning test outcomes (student ranks) with the true underlying abilities of students. Along this line, departing from the conventional paradigm of testing students independently, we propose a novel collaborative framework, Collaborative Computerized Adaptive Testing (CCAT), that leverages inter-student information to enhance student ranking. By using collaborative students as anchors to assist in ranking test-takers, CCAT provides both theoretical guarantees and experimental validation for ensuring ranking consistency.

1 Introduction

With the rapid advancement of computer science, online education has undergone a significant transformation, reshaping and displacing traditional offline educational assessment techniques. In this evolving landscape, Computerized Adaptive Testing (CAT) [1, 2] has emerged as a prominent methodology for standardized testing, widely adopted in selective exams such as the GMAT [3], GRE [4], and TOEFL [5]. Diverging from traditional paper-and-pencil tests, CAT offers personalized and interactive assessments, where the difficulty and characteristics of questions are continuously adapted based on real-time responses. By aligning questions with the current estimate of a student's ability, CAT refines the estimation at each iterative step [6]. Upon test completion, the final ability score shown in Figure 1(a) is provided to students as a score report. This score plays a pivotal role in shaping their educational and career prospects.

Corresponding Author. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1: (a) The score report provided by the GRE and an example showing that a low MSE cannot guarantee the correct ranking of students' testing results. (b) This line chart shows the ranking performance of previous CAT methods; the method with state-of-the-art accuracy (BECAT) may perform no better than random selection in ranking.
However, while massive efforts have been devoted to optimizing the accuracy of ability estimation by improving question selection algorithms [7, 8, 9, 10, 11, 12], it is crucial to underscore that accurate ability estimation does not inherently guarantee correct student ranking. As illustrated in Figure 1(a), minimizing the mean squared error (MSE) of ability scores does not always translate into accurate rankings of students. In fact, even state-of-the-art (SOTA) question selection algorithms with superior accuracy can exhibit inconsistent ranking performance, sometimes performing worse than random selection, as presented in Figure 1(b). Meanwhile, the asynchronicity and independence of different students during the CAT test process [13, 14] pose a significant technical challenge for achieving accurate ability ranking: the testing information of all students cannot be used jointly for question selection to enhance ranking precision, thereby complicating the resolution of the ranking consistency issue in CAT.

To address this challenge, we propose a novel framework, Collaborative Computerized Adaptive Testing (CCAT), which introduces a collaborative learning [15, 16] approach that leverages data from collaborative students as ranking anchors. This framework facilitates interaction among test-takers, allowing for more robust ranking results. Importantly, we also present a theoretical analysis demonstrating that, with a sufficient number of collaborative students, the ranking consistency error can be reduced to an acceptable level. In summary, our contributions are:

- To the best of our knowledge, this is the first research to unveil the ranking consistency dilemma inherent in CAT, by providing its formal definition and approximation. This discovery enables us to significantly refine the objectives of CAT, a vital advancement for its deployment in high-stakes examination contexts.
- We introduce a novel, collaboration-based methodology that enhances both question selection and ability estimation to minimize ranking inconsistency, providing theoretical guarantees for ranking consistency even with a limited number of questions. Our methodology is general enough to integrate with existing question selection algorithms.
- Empirical results on extensive real-world educational datasets prove the effectiveness of CCAT, manifesting in an average 5% rise in ranking consistency compared with other methods; this improvement is even more significant in short-test scenarios.

2 Related Work

CAT is designed to efficiently and accurately estimate students' abilities [2]. It is widely employed in various competitive exams, including the GRE. CAT essentially operates in two stages: first, it uses methods such as Item Response Theory (IRT) [17] to estimate students' abilities; then it uses these estimates to select the next question for each student. The following paragraphs separately outline Item Response Theory and the common question selection algorithms used in CAT.

Item Response Theory (IRT). IRT is a psychological measurement theory predominantly employed in education to estimate students' abilities [17, 18, 19]. It posits that an examinee's ability remains constant throughout a test, and that their performance depends solely on their ability and the information provided by the questions.
The standard model is the two-parameter logistic (2PL) model: $P_j(\theta) = P(y_j = 1) = \sigma(a_j(\theta - b_j))$, where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function and $y_j = 1$ indicates a correct response to question $j$. The parameters $a_j, b_j \in \mathbb{R}$ represent the discrimination and difficulty of question $j$, and are estimated before testing by algorithms such as Markov Chain Monte Carlo (MCMC) [20, 21] and Gradient Descent (GD) [22, 23]. $\theta \in \mathbb{R}$ represents the student's ability, which is estimated using the maximum likelihood method at each step $t$:

$$\theta^t = \arg\max_{\theta \in \Theta} \sum_{j=1}^{t} \big[ y_j \ln P_j(\theta) + (1 - y_j) \ln\big(1 - P_j(\theta)\big) \big]. \quad (1)$$

In recent years, a growing number of studies [24, 25, 26, 27, 28] leveraging rapid advances in deep learning (e.g., neural networks) have significantly enhanced the accuracy of student ability estimation. For example, NeuralCD [24] leverages a non-negative fully connected neural network to capture complex student-question interactions and achieve more accurate estimation.

Selection Algorithms. Research on selection algorithms can be categorized into two main approaches: traditional rule-based algorithms and data-driven algorithms. Traditional question selection algorithms [29, 30, 31] view CAT as a parameter estimation problem. They compute the information value of each question based on the student's current proficiency and select the question with the maximum information value [32], typically using metrics such as Fisher Information (FSI) [32] and Kullback-Leibler Information (KLI) [33]. Subsequently, in order to directly optimize the accuracy of the test result, researchers have proposed methods such as MAAT [34], BOBCAT [35], and NCAT [36], which are based on active learning [37], meta-learning [38], and reinforcement learning [39], respectively. Recently, BECAT [40] proposed using the ability estimated from a student's full responses on the entire question bank as the true value and solving the CAT problem with a data-efficiency method [41].

In fact, in many exams, especially selective exams, the ranking of grades is usually one of the most important bases for admission and employment. We therefore argue that what students need from CAT is not necessarily a more precise estimate of their abilities on the test set; rather, CAT should ensure that students with stronger abilities receive better rankings. Consequently, we establish the ranking consistency of CAT as our primary objective.

3 Ranking Consistency of CAT

We first assume that the test length in CAT is uniformly $T$ steps and that all selected questions come from the question bank $Q$. The questions answered by each student constitute a subset $S \subseteq Q$. At each step $t$, the student's ability estimated by IRT is $\theta^t$, and the student's final result is $\theta^T$ when the test stops. For traditional CAT methods, the goal is that the test result $\theta^T$ should be as close as possible to the student's true ability $\theta^*$ using as few questions as possible [40, 42]:

$$\min_{|S|=T} \|\theta^T - \theta^*\|, \quad (2)$$

where $\theta^*$ is approximated by the ability estimated from the student's full responses to the entire question bank $Q$ [40].
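For concreteness, the following minimal Python sketch illustrates the 2PL response model and the step-wise maximum-likelihood ability update of Equation (1) via plain gradient ascent. It is illustrative only, not the implementation used in this paper, and all variable names and values are placeholders.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response: sigma(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b, lr=0.05, n_iter=500):
    """Maximum-likelihood ability update of Equation (1) via gradient ascent.

    responses: 0/1 answers to the questions administered so far.
    a, b: pre-calibrated discrimination and difficulty of those questions.
    """
    theta = 0.0
    for _ in range(n_iter):
        p = p_correct(theta, a, b)
        grad = np.sum(a * (responses - p))  # derivative of the 2PL log-likelihood w.r.t. theta
        theta += lr * grad
    return theta

# Toy usage with three answered questions (illustrative parameters).
a = np.array([1.2, 0.8, 1.5])
b = np.array([-0.3, 0.5, 1.0])
y = np.array([1, 1, 0])
print(estimate_theta(y, a, b))
```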
However, as previously mentioned, CAT often needs to prioritize the ranking among students over merely improving the accuracy of $\theta^T$. For instance, if students learn that a peer with lower true ability outperforms them in CAT, they may question the fairness of the exam [43]. Thus, we define the ranking consistency of CAT as follows:

Definition 1. (Ranking Consistency of CAT) In computerized adaptive testing, let the true abilities of two students be $\theta_1^*$ and $\theta_2^*$, and let their testing results on subsets $S_1$ and $S_2$ of the question bank $Q$ be $\theta_1^T$ and $\theta_2^T$. Ranking consistency demands that students with higher true abilities also obtain higher testing results:

$$\max_{|S_1|=|S_2|=T} P(\theta_1^T > \theta_2^T \mid \theta_1^* > \theta_2^*). \quad (3)$$

Figure 2: The structure of the CCAT framework. CCAT consists of two parts: question selection and ability estimation. The question selection part utilizes the performance of collaborative students on various questions to select appropriate questions for the tested student, and the ability estimation part ranks the tested student against the collaborative students and uses this ranking as the test result.

Given the varied performance, queries, and progress of the students undergoing testing, they remain independent during the CAT process. Consequently, it is impractical to intervene in ranking consistency by selecting questions based on each other's performance in the test, which complicates the direct optimization of this problem.

4 The CCAT Framework

To address the problem of ranking consistency, in this section we first introduce the concept of collaborative students as anchors for the tested students. We then elucidate their application in question selection and ability estimation. Finally, we conduct a theoretical analysis of the collaborative student method, demonstrating that while the ranking of the tested students among the collaborative students may not be entirely accurate, the probability of achieving ranking consistency in CAT can reach at least $1 - \delta$ when a sufficient number of collaborative students are available.

Definition 2. (Collaborative Students) Collaborative students are a group of $M$ students who are utilized as anchors to assist in ranking test-takers [44, 45]. It can be assumed that the collaborative students have already answered all questions in the question bank $Q$, so their abilities on the question bank $Q$ or on a subset $S$ ($|S| = T$), denoted $\theta_c^*$ and $\theta_c^T$, can be obtained easily.

Because no information is disclosed between any two students during the testing process, we cannot directly intervene in their ranking relationship. Nonetheless, since the collaborative students have answered every question in the question bank, we can assume that each collaborative student accompanies the tested students in responding to the same questions during the test. This facilitates establishing relationships among the tested students. Specifically, when two students, A and B, answer distinct sets of questions, say $\{q_1, q_2, q_3\}$ for student A and $\{q_4, q_5, q_6\}$ for student B, inconsistencies may arise due to the dissimilarity of the questions. However, each collaborative student can compare her performance with both students A and B: she can assess her performance on questions $q_1, q_2, q_3$ alongside student A and on questions $q_4, q_5, q_6$ alongside student B. If the collaborative student finds that her ability exceeds that of student A but falls short of student B, she provides valuable information for correctly ranking students A and B.
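As a toy illustration of this anchoring idea (a sketch with made-up responses and discrimination values, not data from the paper), the snippet below compares one collaborative student against students A and B on their respective question subsets using the discrimination-weighted margin that Lemma 1 in Section 4.1 formalizes.

```python
import numpy as np

# Discrimination parameters of the six questions (illustrative values).
a = {"q1": 1.0, "q2": 1.3, "q3": 0.9, "q4": 1.1, "q5": 0.8, "q6": 1.4}

# Responses (1 = correct, 0 = wrong). The collaborative student has answered everything.
student_A = {"q1": 1, "q2": 0, "q3": 0}
student_B = {"q4": 1, "q5": 1, "q6": 1}
collaborator = {"q1": 1, "q2": 1, "q3": 0, "q4": 1, "q5": 0, "q6": 0}

def weighted_margin(tested, anchor, a):
    """Positive if the tested student beats the anchor on the shared questions."""
    return sum(a[q] * (tested[q] - anchor[q]) for q in tested)

margin_A = weighted_margin(student_A, collaborator, a)  # negative: the anchor beats A
margin_B = weighted_margin(student_B, collaborator, a)  # positive: B beats the anchor
if margin_A < 0 < margin_B:
    print("The anchor sits between A and B, so B is ranked above A.")
```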
4.1 Problem Approximation

As previously mentioned, our goal is to establish the ranking relationship between tested students by comparing them with collaborative students. Obviously, the first step in ensuring ranking consistency among tested students is to establish ranking consistency between the collaborative students and the tested students:

$$\max_{|S|=T} P(\theta^T > \theta_c^T \mid \theta^* > \theta_c^*, S). \quad (4)$$

In Section 2, we outlined the estimation method for $\theta$ in Item Response Theory, as presented in Equation (1). Using this formula, we can derive the following lemma, which helps simplify the optimization objective.

Lemma 1. Given two students whose responses on $S$ ($|S| = T$) are $y_1, y_2, \ldots, y_T$ and $\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_T$, let $\theta^T$ and $\tilde{\theta}^T$ be their testing abilities on $S$, estimated by IRT with parameters $a_i, b_i$. Then $(\theta^T - \tilde{\theta}^T) > 0$ if and only if $\sum_{i=1}^{T} a_i (y_i - \tilde{y}_i) > 0$.

Lemma 1 states that if two students are tested on the same question subset, the term $\sum_{i=1}^{T} a_i (y_i - y_i^c)$ can be used in place of $\theta^T - \theta_c^T$, because the two quantities share the same sign. This substitution leads to a more streamlined formulation of the objective:

$$P(\theta^T > \theta_c^T \mid \theta^* > \theta_c^*, S) = P\Big(\sum_{q_i \in S} a_i (y_i - y_i^c) > 0 \;\Big|\; S, \theta^* > \theta_c^*\Big) \propto \sum_{q_i \in S} a_i\, P(y_i > y_i^c \mid \theta^* > \theta_c^*) = \sum_{q_i \in S} R(q_i \mid \theta^* > \theta_c^*), \quad (5)$$

where $R(q_i \mid \theta^* > \theta_c^*) = a_i\, P(y_i = 1 \mid \theta^*)\, P(y_i^c = 0 \mid \theta^* > \theta_c^*)$, and $y_i^c$ and $y_i$ denote the responses of the collaborative student and the tested student to question $q_i$, respectively. The above derivation assumes that all questions in the question bank $Q$ are independent and that students with high abilities should perform well on each question. This formula indicates that, for each tested student, answering questions that students with weaker abilities cannot answer correctly enhances ranking consistency. Considering the asymmetry between collaborative students and tested students, we also need to consider the situation where collaborative students have stronger abilities than the tested students:

$$P(\theta^T < \theta_c^T \mid \theta^* < \theta_c^*, S) \propto \sum_{q_i \in S} R(q_i \mid \theta^* < \theta_c^*), \quad (6)$$

where $R(q_i \mid \theta^* < \theta_c^*) = a_i\, P(y_i = 0 \mid \theta^*)\, P(y_i^c = 1 \mid \theta^* < \theta_c^*)$. Similar to Equation (5), the objective here is to shield students from being assessed on questions that even students with higher abilities may struggle to answer correctly. By utilizing the constraints from Equations (5) and (6), we can select specific questions for the tested students based on their collaborative students:

$$q_t = \arg\max_{q \in Q \setminus S_{t-1}} P(\theta^* < \theta_c^*)\, R(q \mid \theta^* < \theta_c^*) + P(\theta^* > \theta_c^*)\, R(q \mid \theta^* > \theta_c^*). \quad (7)$$

Here $S_{t-1}$ represents the subset of questions selected up to step $t-1$, with $S_t = S_{t-1} \cup \{q_t\}$, where $q_t$ is the question selected at step $t$. This rule seeks questions that collaborative students with higher abilities are likely to answer correctly while the tested student may struggle with, and, at the same time, questions that collaborative students with lower abilities are unlikely to answer correctly while the tested student may answer correctly. The selection method enhances the performance of originally strong students while diminishing that of weaker ones, helping tested students determine their ranking among the collaborative students. After testing, each tested student receives their performance on $S$ as well as their ranking relationship with each collaborative student. In this study, we use the mean ranking relationship with the collaborative students as the test result for the tested student:

$$\hat{\theta}^T = \mathbb{E}\big[\mathbb{I}(\theta^T > \theta_c^T)\big] = \mathbb{E}\Big[\mathbb{I}\Big(\sum_{q_i \in S} a_i (y_i - y_i^c) > 0\Big)\Big], \quad (8)$$

where $\mathbb{I}(\cdot)$ is the indicator function and the expectation is taken over the collaborative students.
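The following Python sketch mirrors the spirit of the selection score in Equation (7) and the collaborative estimate in Equation (8). It is a simplified, illustrative version rather than the paper's implementation: the probabilities $P(\theta^* > \theta_c^*)$ and $P(y_i^c = 0 \mid \cdot)$ are replaced by crude empirical fractions over the anchors, and all names are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_correct(theta, a, b):
    """2PL probability of a correct answer."""
    return sigmoid(a * (theta - b))

def ccat_select(theta_hat, theta_c_hat, a, b, y_c, answered):
    """Pick the next question in the spirit of Equation (7).

    theta_hat   : current ability estimate of the tested student (scalar).
    theta_c_hat : ability estimates of the M collaborative students, shape (M,).
    a, b        : 2PL parameters of the |Q| questions, shape (|Q|,).
    y_c         : collaborative responses, shape (M, |Q|), entries in {0, 1}.
    answered    : boolean mask over Q marking already-administered questions.
    """
    weaker = theta_c_hat < theta_hat        # anchors the student is currently ahead of
    stronger = ~weaker
    p_up, p_down = weaker.mean(), stronger.mean()  # stand-ins for P(theta* > theta_c*), P(theta* < theta_c*)
    p1 = p_correct(theta_hat, a, b)                # P(y_q = 1 | current estimate)

    # R(q | theta* > theta_c*): the student answers correctly, weaker anchors answered wrong.
    frac_wrong_weaker = (1.0 - y_c[weaker]).mean(axis=0) if weaker.any() else np.zeros_like(a)
    r_up = a * p1 * frac_wrong_weaker
    # R(q | theta* < theta_c*): the student answers wrong, stronger anchors answered correctly.
    frac_right_stronger = y_c[stronger].mean(axis=0) if stronger.any() else np.zeros_like(a)
    r_down = a * (1.0 - p1) * frac_right_stronger

    score = p_up * r_up + p_down * r_down
    score[answered] = -np.inf                      # never re-ask a question
    return int(np.argmax(score))

def ccat_estimate(y, y_c, a_answered):
    """Collaborative estimate in the spirit of Equation (8): the smoothed fraction
    of collaborative students the tested student outranks on the answered set."""
    margins = (a_answered * (y - y_c)).sum(axis=1)  # Lemma 1 weighted margin vs. each anchor
    return sigmoid(margins).mean()

# Toy usage with randomly simulated anchors (illustrative only).
rng = np.random.default_rng(0)
M, Qn = 50, 30
a = rng.uniform(0.5, 2.0, Qn); b = rng.normal(size=Qn)
theta_c_hat = rng.normal(size=M)
y_c = (rng.random((M, Qn)) < p_correct(theta_c_hat[:, None], a, b)).astype(float)
answered = np.zeros(Qn, dtype=bool)
print(ccat_select(0.3, theta_c_hat, a, b, y_c, answered))
```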
Due to the uncertainty about the tested students' abilities and the incomplete responses from collaborative students during the testing process, we further approximate and elucidate the optimization problem in Appendix C.

Algorithm 1: The CCAT framework
Require: question bank $Q$; IRT estimation method.
Initialize: randomly initialize the tested student's ability $\theta^0$; initialize the question subset $S_0 \leftarrow \emptyset$, the tested student's record $Y \leftarrow \emptyset$, and the collaborative students' records $Y^c \leftarrow \emptyset$.
1. for $t = 1$ to $T$ do
2.     Select question: $q_t \leftarrow \arg\max_{q \in Q \setminus S_{t-1}} P(\theta^* < \theta_c^*)\, R(q \mid \theta^* < \theta_c^*) + P(\theta^* > \theta_c^*)\, R(q \mid \theta^* > \theta_c^*)$; $S_t \leftarrow S_{t-1} \cup \{q_t\}$.
3.     Collect the tested student's and the collaborative students' answers: $Y \leftarrow Y \cup \{y_t\}$, $Y^c \leftarrow Y^c \cup \{(y^c_{1,t}, \ldots, y^c_{M,t})\}$.
4.     Update the student's ability estimate: $\theta^t = \arg\min_{\theta \in \Theta} \sum_{q_i \in S_t} -\log p_\theta(q_i, y_i)$.
   end for
5. Calculate the tested student's rank among the collaborative students: $\hat{\theta}^T \leftarrow \frac{1}{M} \sum_{i=1}^{M} \sigma\big(\sum_{t=1}^{T} a_{q_t} (y_t - y^c_{i,t})\big)$.
Output: the student's final estimated ranking ability $\hat{\theta}^T$.

4.2 Theoretical Analyses of CCAT

Through the above derivation and approximation, we obtain the selection algorithm and estimation method of CCAT, which ensure a high degree of ranking consistency between collaborative and tested students. This ranking is then used as the test result for the tested student, denoted $\hat{\theta}^T$. Regarding the test result $\hat{\theta}^T$ in ability estimation, we have the following conclusion:

Theorem 1. Given two students A and B, their relationships with the collaborative students are $r_1, r_2, \ldots, r_M$ and $\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_M$, where $r_i, \tilde{r}_i \in \{0, 1\}$ indicate whether student A (respectively B) outperforms collaborative student $i$ in a given test. Assume the probability that student A outperforms collaborative student $i$ is $P(r_i = 1) = \zeta_1$ and that student B outperforms collaborative student $i$ is $P(\tilde{r}_i = 1) = \zeta_2$. Then the following conclusions hold: (1) If $M > \frac{\ln(1/\delta)}{2(\zeta_1 - \zeta_2)^2}$ collaborative students are provided, the predicted ranking is consistent with probability at least $1 - \delta$. (2) When the number of test questions $T$ is small, the assessment of the ranking relationship between the tested students and the collaborative students may be inaccurate. Assuming an error probability of $\rho \in (0, 0.5)$, we can still derive that if $M > \frac{\ln(1/\delta)}{2(1-2\rho)^2(\zeta_1 - \zeta_2)^2}$ collaborative students are provided, the predicted ranking is consistent with probability at least $1 - \delta$.

Drawing from Theorem 1, we deduce that a sufficient number of collaborative students ensures a consistent ranking of abilities among all tested students, even in the presence of rank errors between the tested and collaborative students. Meanwhile, our question selection algorithm reduces the ranking error $\rho$ by maximizing the ranking consistency between collaborative and tested students, thereby theoretically increasing the ranking consistency.
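As a quick numerical reading of Theorem 1, the required number of collaborative students can be computed directly from the bound. This is a sketch; the values of $\delta$, $\rho$, and the gap $\zeta_1 - \zeta_2$ below are arbitrary examples, not figures from the paper.

```python
import math

def required_anchors(delta, zeta_gap, rho=0.0):
    """Smallest integer M with M > ln(1/delta) / (2 * (1 - 2*rho)^2 * (zeta1 - zeta2)^2)."""
    bound = math.log(1.0 / delta) / (2.0 * (1.0 - 2.0 * rho) ** 2 * zeta_gap ** 2)
    return math.floor(bound) + 1

# Example: delta = 0.05 and an outperform-probability gap zeta1 - zeta2 = 0.1.
print(required_anchors(0.05, 0.1))           # noiseless pairwise comparisons (rho = 0)
print(required_anchors(0.05, 0.1, rho=0.2))  # each comparison flipped with probability 0.2
```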
Algorithm 1 outlines the pseudo-code for the CCAT framework. During the question selection phase, the complexity of our proposed question selection algorithm is $O(|Q|TMN)$, as selecting the most appropriate question from the question bank $Q$ costs $O(|Q|M)$ per step for each tested student. Here, $T$ denotes the total number of questions in the test, $M$ is the number of collaborative students, and $N$ is the number of students being tested. The time complexity of CCAT is thus comparable to the inference cost of data-driven CAT methods. However, CCAT circumvents the time-consuming training process by storing collaborative students. Although this does increase space complexity, it significantly reduces the time required for training and eliminates the need to repeatedly retrain models when the system changes.

5 Experiments

In this section, to demonstrate the effectiveness of CCAT on ranking consistency, we compare CCAT against baseline methods on the ranking consistency metric on two real-world datasets. In addition, we conduct a case study comparing IRT-based and collaborative ability estimation to gain deeper insight into how collaborative ability estimation leads to ranking consistency.

5.1 Experimental Setup

Evaluation Method. The goal is consistency between the ranking of tested students' test results on the subsets $S$ and the ranking of their abilities on all questions in the question bank $Q$. In this study, we use the Kendall coefficient [46] between the abilities of tested students on the subsets $S$ and on the question bank $Q$, which we call intra-class ranking consistency:

$$K = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \operatorname{sgn}\big(\theta_i^T - \theta_j^T\big)\, \operatorname{sgn}\big(\theta_i^* - \theta_j^*\big),$$

where $N$ is the number of tested students.

A Proofs of Lemma 1

Lemma 1. Given two students whose responses on $S$ ($|S| = T$) are $y_1, y_2, \ldots, y_T$ and $\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_T$, let $\theta^T$ and $\tilde{\theta}^T$ be their testing abilities on $S$, estimated by IRT with parameters $a_i, b_i$. Then $(\theta^T - \tilde{\theta}^T) > 0$ if and only if $\sum_{i=1}^{T} a_i (y_i - \tilde{y}_i) > 0$.

Proof. Since $\theta^T$ and $\tilde{\theta}^T$ are the maximum likelihood estimates of IRT in Equation (1), they satisfy the first-order conditions

$$\sum_{i=1}^{T} a_i\big(P_i(\theta^T) - y_i\big) = \sum_{i=1}^{T} a_i\big(P_i(\tilde{\theta}^T) - \tilde{y}_i\big) = 0.$$

According to the Lagrange mean value theorem [52], there exist $\zeta_i$ between $\tilde{\theta}^T$ and $\theta^T$ such that

$$\sum_{i=1}^{T} a_i (y_i - \tilde{y}_i) = \sum_{i=1}^{T} a_i\big(P_i(\theta^T) - P_i(\tilde{\theta}^T)\big) = \sum_{i=1}^{T} a_i P_i'(\zeta_i)\,\big(\theta^T - \tilde{\theta}^T\big).$$

Since $P_i'(\zeta_i) = a_i P_i(\zeta_i)\big(1 - P_i(\zeta_i)\big)$ and $0 < P_i(\zeta_i) < 1$, it follows that

$$\sum_{i=1}^{T} a_i (y_i - \tilde{y}_i) = \Big(\sum_{i=1}^{T} a_i^2 P_i(\zeta_i)\big(1 - P_i(\zeta_i)\big)\Big)\big(\theta^T - \tilde{\theta}^T\big).$$

Because the factor in parentheses is strictly positive, $\sum_{i=1}^{T} a_i (y_i - \tilde{y}_i)$ and $\theta^T - \tilde{\theta}^T$ share the same sign:

$$\sum_{i=1}^{T} a_i (y_i - \tilde{y}_i) > 0 \iff \theta^T - \tilde{\theta}^T > 0.$$

B Proofs of Theorem 1

Theorem 1. Given two students A and B, their relationships with the collaborative students are $r_1, r_2, \ldots, r_M$ and $\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_M$, where $r_i, \tilde{r}_i \in \{0, 1\}$ indicate whether student A (respectively B) outperforms collaborative student $i$ in a given test. Assume the probability that student A outperforms collaborative student $i$ is $P(r_i = 1) = \zeta_1$ and that student B outperforms collaborative student $i$ is $P(\tilde{r}_i = 1) = \zeta_2$. Then the following conclusions hold: (1) If $M > \frac{\ln(1/\delta)}{2(\zeta_1 - \zeta_2)^2}$ collaborative students are provided, the predicted ranking is consistent with probability at least $1 - \delta$. (2) When the number of test questions $T$ is small, the assessment of the ranking relationship between the tested students and the collaborative students may be inaccurate. Assuming an error probability of $\rho \in (0, 0.5)$, we can still derive that if $M > \frac{\ln(1/\delta)}{2(1-2\rho)^2(\zeta_1 - \zeta_2)^2}$ collaborative students are provided, the predicted ranking is consistent with probability at least $1 - \delta$.

Proof. Let the ranking abilities of the two students be $\hat{\theta}_1^T$ and $\hat{\theta}_2^T$. Without loss of generality, suppose $\zeta_1 > \zeta_2$. Define the random variable $X_i = r_i - \tilde{r}_i$ as the relationship between the two students' rankings with respect to collaborative student $i$. Given $M$ collaborative students, let $X = \frac{1}{M}\sum_{i=1}^{M} X_i$. Clearly,

$$\mathbb{E}[\hat{\theta}_1^T - \hat{\theta}_2^T] = \mathbb{E}[X] = \frac{1}{M}\sum_{i=1}^{M} \mathbb{E}[X_i] = (1-\rho)\zeta_1 + \rho(1-\zeta_1) - (1-\rho)\zeta_2 - \rho(1-\zeta_2) = (1-2\rho)(\zeta_1 - \zeta_2).$$

According to Hoeffding's inequality [53, 54], we have

$$P(\hat{\theta}_1^T < \hat{\theta}_2^T) = P\Big(\hat{\theta}_1^T - \hat{\theta}_2^T - \mathbb{E}[\hat{\theta}_1^T - \hat{\theta}_2^T] < -(1-2\rho)(\zeta_1 - \zeta_2)\Big) < \exp\big(-2M\big[(1-2\rho)(\zeta_1 - \zeta_2)\big]^2\big).$$

Setting $\delta = \exp\big(-2M(1-2\rho)^2(\zeta_1 - \zeta_2)^2\big)$, we obtain that when $M > \frac{\ln(1/\delta)}{2(1-2\rho)^2(\zeta_1 - \zeta_2)^2}$, $P(\hat{\theta}_1^T < \hat{\theta}_2^T) < \delta$, i.e., the prediction error is smaller than $\delta$.

C Implementation Details of CCAT

Due to the incomplete information available during the test, we also make the following approximations:

(1) Approximate Collaborative Students.
Since no collaborative student in the real world has answered every question as we assumed, we use $P_i(\theta_c^*)$ to impute the answer for question $i$:

$$\bar{y}_i^c = \begin{cases} 1, & y_i^c = 1, \\ 0, & y_i^c = 0, \\ P_i(\theta_c^*), & y_i^c = \text{None}. \end{cases} \quad (11)$$

Based on this, if $y_i^c$ is not provided, $\mathbb{I}(y_i^c = 1)$ is replaced by $P_i(\theta_c^*)$ and $\mathbb{I}(y_i^c = 0)$ by $1 - P_i(\theta_c^*)$ in the question selection part. In the ability estimation part, $\mathbb{I}\big(\sum_{q_i \in S} a_i (y_i - y_i^c) > 0\big)$ can be approximated as $\sigma\big(\sum_{q_i \in S} a_i (y_i - \bar{y}_i^c)\big)$, which applies regardless of whether the supplemented value $P_i(\theta_c^*)$ is used or not.

(2) Approximate Outperform Probability. Our method selects questions using the information of whether the tested student outperforms a collaborative student, $\mathbb{I}(\theta^* > \theta_c^*)$. However, the ground truths $\theta^*$ and $\theta_c^*$ are unknown during testing, so we use $\theta^t$ and $\theta_c^t$ to approximate $\theta^*$ and $\theta_c^*$ at each step $t$. Considering that there is some error between the estimates at step $t$ and the actual state, we use the sigmoid function $\sigma(\theta^t - \theta_c^t)$ to approximate $\mathbb{I}(\theta^* > \theta_c^*)$: the further ahead of a collaborative student the tested student is at step $t$, the higher the likelihood that their true ability surpasses that of the collaborative student. Through the above approximations, the question selection rule can be rewritten as:

$$q_t = \arg\max_{q_i \in Q \setminus S_{t-1}} R(q_i, \theta^* > \theta_c^* \mid \theta^{t-1}) + R(q_i, \theta^* < \theta_c^* \mid \theta^{t-1}), \quad (12)$$

where

$$R(q_i, \theta^* > \theta_c^* \mid \theta^{t-1}) = a_i\, P(y_i = 1 \mid \theta^{t-1}) \Big[\sum_{y^c \in C} (1 - \bar{y}_i^c)\, \sigma\Big(\sum_{q_j \in S_{t-1}} a_j (y_j - \bar{y}_j^c)\Big)\Big],$$

$$R(q_i, \theta^* < \theta_c^* \mid \theta^{t-1}) = a_i\, P(y_i = 0 \mid \theta^{t-1}) \Big[\sum_{y^c \in C} \bar{y}_i^c\, \sigma\Big(\sum_{q_j \in S_{t-1}} a_j (\bar{y}_j^c - y_j)\Big)\Big],$$

$P(y_i = 1 \mid \theta^{t-1})$ and $P(y_i = 0 \mid \theta^{t-1})$ are computed by the IRT model, and $C$ is the set of collaborative students.

D Details of Experiments

D.1 Statistics of the datasets.

Table 3: Statistics of the datasets.

| | NIPS-EDU | JUNYI |
| --- | --- | --- |
| #Students | 4,914 | 8,852 |
| #Questions | 900 | 702 |
| #Response logs | 1,382,173 | 801,270 |
| #Response logs per student | 281.27 | 90.52 |
| #Response logs per question | 1,535.75 | 1,141.41 |

D.2 Detailed Evaluation Method

Statistics for Ranking Consistency. For CAT tasks, many methods are sensitive to the initial abilities of students, including Random, FSI, KLI, MAAT, BECAT, and the CCAT proposed in this article. In contrast, data-driven methods such as BOBCAT and NCAT are often insensitive to students' initial abilities. Therefore, this study randomly initializes the students' initial abilities 5 times and reports the mean and standard deviation of the ranking consistency of each question selection algorithm, as shown in Tables 4 and 5. Although the current ability estimates are used in the selection process, CCAT is almost unaffected by the initialization of student abilities. This indicates that CCAT not only performs well in ranking consistency but is also more stable than other strategies.
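For reference, the intra-class ranking consistency reported in Tables 4 and 5 can be computed with an off-the-shelf Kendall coefficient, as in the sketch below. The arrays are placeholders, and this is only one straightforward way to compute the metric, not necessarily the exact script used for the paper.

```python
import numpy as np
from scipy.stats import kendalltau

def intra_class_ranking_consistency(theta_test, theta_full):
    """Kendall coefficient between test-time ability estimates (on the selected
    subset S) and reference abilities estimated on the full question bank Q."""
    tau, _ = kendalltau(theta_test, theta_full)
    return tau

# Placeholder example with five tested students.
theta_full = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])   # abilities on the whole bank
theta_test = np.array([-0.9, -0.5, 0.3, 0.6, 1.7])   # abilities after T adaptive steps
print(intra_class_ranking_consistency(theta_test, theta_full))
```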
Table 4: The detailed performance of different question selection algorithms on NIPS-EDU. Algorithm X-C means using algorithm X for question selection but reporting the collaborative ability estimate proposed in CCAT as the test result instead of the ability estimated by IRT. Bold font denotes a statistically significant improvement over the baseline.

(a) Intra-class ranking consistency on NIPS-EDU with IRT estimated by GD

| Algorithm | Step 5 | Step 10 | Step 15 | Step 20 |
| --- | --- | --- | --- | --- |
| Random | 0.7041±0.007 | 0.7434±0.005 | 0.7680±0.007 | 0.7856±0.004 |
| FSI | 0.7236±0.004 | 0.7889±0.003 | 0.8192±0.002 | 0.8321±0.002 |
| KLI | 0.7328±0.004 | 0.7868±0.005 | 0.8142±0.003 | 0.8316±0.002 |
| MAAT | 0.6725±0.001 | 0.7095±0.002 | 0.7359±0.002 | 0.7535±0.001 |
| BECAT | 0.7087±0.007 | 0.7542±0.004 | 0.7802±0.005 | 0.7957±0.005 |
| CCAT (w/o C) | 0.7320±0.002 | 0.7870±0.002 | 0.8177±0.002 | 0.8279±0.002 |
| Random-C | 0.6988±0.008 | 0.7444±0.004 | 0.7715±0.005 | 0.7909±0.004 |
| FSI-C | 0.7340±0.005 | 0.8031±0.003 | 0.8339±0.002 | 0.8546±0.001 |
| KLI-C | 0.7399±0.003 | 0.7982±0.003 | 0.8304±0.002 | 0.8509±0.001 |
| MAAT-C | 0.6689±0.002 | 0.7175±0.003 | 0.7475±0.002 | 0.7603±0.002 |
| BECAT-C | 0.7292±0.006 | 0.7959±0.003 | 0.8279±0.002 | 0.8438±0.007 |
| CCAT | 0.7533±0.000 | 0.8081±0.001 | 0.8364±0.000 | 0.8543±0.000 |

(b) Intra-class ranking consistency on NIPS-EDU with IRT estimated by MCMC

| Algorithm | Step 5 | Step 10 | Step 15 | Step 20 |
| --- | --- | --- | --- | --- |
| Random | 0.7411±0.005 | 0.8061±0.005 | 0.8348±0.004 | 0.8540±0.005 |
| FSI | 0.7912±0.005 | 0.8570±0.003 | 0.8846±0.001 | 0.8975±0.001 |
| KLI | 0.7821±0.005 | 0.8532±0.003 | 0.8804±0.001 | 0.8965±0.002 |
| MAAT | 0.6762±0.005 | 0.8083±0.007 | 0.8588±0.004 | 0.8843±0.002 |
| BECAT | 0.7685±0.005 | 0.8441±0.002 | 0.8766±0.002 | 0.8958±0.002 |
| CCAT (w/o C) | 0.7982±0.001 | 0.8561±0.001 | 0.8832±0.001 | 0.8955±0.000 |
| Random-C | 0.7531±0.004 | 0.8084±0.005 | 0.8363±0.004 | 0.8547±0.004 |
| FSI-C | 0.7933±0.005 | 0.8573±0.003 | 0.8848±0.001 | 0.8977±0.001 |
| KLI-C | 0.7839±0.006 | 0.8530±0.003 | 0.8805±0.001 | 0.8966±0.002 |
| MAAT-C | 0.6909±0.005 | 0.8090±0.003 | 0.8595±0.004 | 0.8848±0.002 |
| BECAT-C | 0.7680±0.004 | 0.8449±0.001 | 0.8771±0.002 | 0.8961±0.001 |
| CCAT | 0.8149±0.002 | 0.8635±0.001 | 0.8851±0.001 | 0.8969±0.000 |

(c) Inter-class ranking consistency on NIPS-EDU with IRT estimated by MCMC

| Algorithm | Step 5 | Step 10 | Step 15 | Step 20 |
| --- | --- | --- | --- | --- |
| Random | 0.7798±0.003 | 0.8325±0.003 | 0.8590±0.002 | 0.8760±0.002 |
| FSI | 0.8258±0.003 | 0.8785±0.002 | 0.9013±0.001 | 0.9126±0.001 |
| KLI | 0.8195±0.003 | 0.8758±0.002 | 0.8985±0.001 | 0.9119±0.001 |
| MAAT | 0.7242±0.004 | 0.8373±0.002 | 0.8807±0.002 | 0.9023±0.001 |
| BECAT | 0.8045±0.003 | 0.8676±0.001 | 0.8948±0.001 | 0.9104±0.001 |
| CCAT | 0.8476±0.001 | 0.8839±0.000 | 0.9013±0.000 | 0.9116±0.000 |

Table 5: The detailed performance of different question selection algorithms on JUNYI. Algorithm X-C means using algorithm X for question selection but reporting the collaborative ability estimate proposed in CCAT as the test result instead of the ability estimated by IRT. Bold font denotes a statistically significant improvement over the baseline.
(a) Intra-class ranking consistency on JUNYI with IRT estimated by GD

| Algorithm | Step 5 | Step 10 | Step 15 | Step 20 |
| --- | --- | --- | --- | --- |
| Random | 0.6875±0.008 | 0.7350±0.005 | 0.7671±0.003 | 0.7914±0.003 |
| FSI | 0.7639±0.004 | 0.8284±0.003 | 0.8586±0.002 | 0.8740±0.002 |
| KLI | 0.7748±0.002 | 0.8340±0.001 | 0.8623±0.001 | 0.8817±0.001 |
| MAAT | 0.6908±0.000 | 0.7465±0.000 | 0.7817±0.000 | 0.8113±0.000 |
| BECAT | 0.7248±0.003 | 0.7712±0.003 | 0.7920±0.003 | 0.8030±0.003 |
| CCAT (w/o C) | 0.8026±0.001 | 0.8560±0.001 | 0.8819±0.000 | 0.8978±0.000 |
| Random-C | 0.6862±0.008 | 0.7383±0.007 | 0.7734±0.004 | 0.7979±0.003 |
| FSI-C | 0.7736±0.005 | 0.8313±0.003 | 0.8623±0.002 | 0.8768±0.002 |
| KLI-C | 0.7813±0.001 | 0.8367±0.002 | 0.8671±0.001 | 0.8847±0.002 |
| MAAT-C | 0.7040±0.000 | 0.7822±0.000 | 0.8222±0.000 | 0.8464±0.000 |
| BECAT-C | 0.7603±0.003 | 0.8295±0.002 | 0.8603±0.002 | 0.8769±0.001 |
| CCAT | 0.8092±0.001 | 0.8647±0.000 | 0.8911±0.000 | 0.9066±0.000 |

(b) Intra-class ranking consistency on JUNYI with IRT estimated by MCMC

| Algorithm | Step 5 | Step 10 | Step 15 | Step 20 |
| --- | --- | --- | --- | --- |
| Random | 0.6527±0.002 | 0.7759±0.005 | 0.8292±0.002 | 0.8600±0.002 |
| FSI | 0.8212±0.003 | 0.8820±0.001 | 0.9092±0.001 | 0.9257±0.000 |
| KLI | 0.8124±0.003 | 0.8795±0.002 | 0.9082±0.001 | 0.9244±0.001 |
| MAAT | 0.7404±0.008 | 0.8506±0.001 | 0.8925±0.001 | 0.9161±0.001 |
| BECAT | 0.7857±0.002 | 0.8699±0.001 | 0.9031±0.001 | 0.9225±0.000 |
| CCAT (w/o C) | 0.8190±0.001 | 0.8823±0.001 | 0.9098±0.000 | 0.9277±0.000 |
| Random-C | 0.7511±0.005 | 0.8074±0.005 | 0.8429±0.002 | 0.8667±0.002 |
| FSI-C | 0.8226±0.002 | 0.8820±0.001 | 0.9090±0.001 | 0.9251±0.001 |
| KLI-C | 0.8146±0.002 | 0.8795±0.001 | 0.9079±0.001 | 0.9237±0.001 |
| MAAT-C | 0.7441±0.003 | 0.8512±0.001 | 0.8926±0.001 | 0.9157±0.001 |
| BECAT-C | 0.7932±0.002 | 0.8706±0.001 | 0.9027±0.001 | 0.9217±0.000 |
| CCAT | 0.8448±0.007 | 0.8875±0.000 | 0.9100±0.000 | 0.9273±0.000 |

(c) Inter-class ranking consistency on JUNYI with IRT estimated by MCMC

| Algorithm | Step 5 | Step 10 | Step 15 | Step 20 |
| --- | --- | --- | --- | --- |
| Random | 0.7651±0.003 | 0.8298±0.004 | 0.8648±0.001 | 0.8865±0.001 |
| FSI | 0.8575±0.001 | 0.9050±0.000 | 0.9249±0.000 | 0.9363±0.000 |
| KLI | 0.8502±0.002 | 0.9028±0.001 | 0.9240±0.000 | 0.9353±0.000 |
| MAAT | 0.7830±0.002 | 0.8767±0.001 | 0.9069±0.001 | 0.9249±0.001 |
| BECAT | 0.8287±0.001 | 0.8961±0.001 | 0.9204±0.001 | 0.9341±0.000 |
| CCAT | 0.8736±0.001 | 0.9082±0.000 | 0.9255±0.000 | 0.9373±0.000 |

Figure 5: The NDCG performance of different question selection algorithms on the NIPS-EDU dataset for the IRT model estimated by MCMC (x-axis: test step; curves: Random, FSI, KLI, MAAT, NCAT, BECAT, CCAT).

NDCG. NDCG [55, 56, 57], an important metric for ranking problems in recommender systems, is also used as a reference metric for the CAT ranking problem. At each step of the test, CAT provides students with an ability estimate, while a selective exam can be viewed as recalling students. Specifically, we assume that 60% of students will be admitted or eliminated, which corresponds to recalling the top 60% of students (NDCG@A60%) or the bottom 60% of students (NDCG@D60%). From Figure 5, it can be seen that CCAT, a CAT method designed for the ranking problem, also performs outstandingly on these recall tasks, indicating that CCAT can provide a fairer selection for selective exams.
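The sketch below shows one way an NDCG score for the admitted group could be computed when students are ranked by their estimated abilities. The 60% cutoff and the toy arrays are placeholders, and this simplified binary-relevance variant is not necessarily the exact protocol used for Figure 5.

```python
import numpy as np

def ndcg_at_top_fraction(theta_est, theta_true, fraction=0.6):
    """NDCG for recalling the top `fraction` of students (by true ability)
    when students are ranked by their estimated ability."""
    n = len(theta_true)
    k = int(np.ceil(fraction * n))
    # Binary relevance: 1 if the student truly belongs to the admitted group.
    rel = np.zeros(n)
    rel[np.argsort(-theta_true)[:k]] = 1.0
    order = np.argsort(-theta_est)           # ranking induced by the test results
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(rel[order][:k] * discounts)
    idcg = np.sum(discounts)                 # the ideal list places all k relevant students on top
    return dcg / idcg

# Placeholder example with five students.
theta_true = np.array([1.4, 0.2, -0.5, 0.9, -1.1])
theta_est = np.array([1.1, 0.4, -0.7, 0.3, -0.9])
print(ndcg_at_top_fraction(theta_est, theta_true))
```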
Figure 6: Rating charts for different student pairs estimated by collaborative students in NIPS-EDU. Panel titles give each pair's ability estimates: (a) -0.23 vs -0.39, (b) 0.36 vs 0.12, (c) -0.11 vs 0.27, (d) -0.72 vs -0.93, (e) -0.37 vs -1.03, (f) -0.36 vs -0.58, (g) 0.32 vs -0.65, (h) -0.05 vs -0.84, (i) -0.43 vs -0.82, (j) 0.43 vs -0.05.

Figure 7: Rating charts for different student pairs estimated by collaborative students in JUNYI. Panel titles give each pair's ability estimates: (a) 0.91 vs 0.67, (b) -0.91 vs -1.53, (c) 0.58 vs -0.07, (d) -0.99 vs -1.43, (e) 0.22 vs -0.71, (f) 0.04 vs -0.45, (g) -0.76 vs -1.37, (h) 0.69 vs -0.29, (i) 0.86 vs -0.08, (j) 0.69 vs -0.24.

Case Study Supplement. Figures 6 and 7 illustrate the responses of the collaborative students for each pair. Each point's coordinates denote the comparative performance of the student pair relative to an individual collaborative student. The intensity of each point's color corresponds to the response time, with darker hues indicating later responses. Based on Figures 6 and 7, CCAT determines the ranking of the two students at each moment by comparing the number of collaborative students in the upper and lower triangles. The light-colored points are mainly distributed near the center, while the dark ones are distributed toward the edges, indicating that as the number of test questions increases, each collaborative student's judgment of the two students gradually changes from vague to clear. The collaborative ability estimation method is essentially collaborative students voting for the tested students, and the collaborative students concentrated in the upper-left or lower-right corner of the figure ultimately distinguish the two students.

NeurIPS Paper Checklist

The checklist is designed to encourage best practices for responsible machine learning research, addressing issues of reproducibility, transparency, research ethics, and societal impact. Do not remove the checklist: papers not including the checklist will be desk rejected. The checklist should follow the references and the (optional) supplemental material. The checklist does NOT count towards the page limit. Please read the checklist guidelines carefully for information on how to answer these questions. For each question in the checklist: You should answer [Yes], [No], or [NA]. [NA] means either that the question is Not Applicable for that particular paper or the relevant information is Not Available. Please provide a short (1-2 sentence) justification right after your answer (even for NA). The checklist answers are an integral part of your paper submission. They are visible to the reviewers, area chairs, senior area chairs, and ethics reviewers. You will be asked to also include it (after eventual revisions) with the final version of your paper, and its final version will be published with the paper. The reviewers of your paper will be asked to use the checklist as one of the factors in their evaluation.
While "[Yes] " is generally preferable to "[No] ", it is perfectly acceptable to answer "[No] " provided a proper justification is given (e.g., "error bars are not reported because it would be too computationally expensive" or "we were unable to find the license for the dataset we used"). In general, answering "[No] " or "[NA] " is not grounds for rejection. While the questions are phrased in a binary way, we acknowledge that the true answer is often more nuanced, so please just use your best judgment and write a justification to elaborate. All supporting evidence can appear either in the main paper or the supplemental material, provided in appendix. If you answer [Yes] to a question, in the justification please point to the section(s) where related material for the question can be found. IMPORTANT, please: Delete this instruction block, but keep the section heading Neur IPS paper checklist", Keep the checklist subsection headings, questions/answers and guidelines below. Do not modify the questions and only use the provided macros for your answers. Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We briefly elaborated in the abstract that this article investigates the issue of ranking in CAT, and described the main contributions of this study in the last paragraph of the introduction. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: In the experimental section and appendix section D, we discussed the limitations of our method under long test durations and the disadvantages of the GD method in ranking consistency problems. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. 
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: In sections A and B of the appendix of this article, we provide the proof process for Lemma1 and Theorem 1, respectively. In section C, we present all the hypotheses used to construct collaborative students and help with question selection. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We have introduced the dataset used in the experimental section and appendix section D of the article, and included the complete code and some data in the supplemental materials to reproduce the results of the article. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. 
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code for the experimental part of this article is included in the supplemental materials, and the datasets used in our experiments are all public datasets Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? 
Answer: [Yes] Justification: This article introduces the method of data splits in appendix section D, and the implementation method of this article has been detailed in section 4 of the main text and appendix section C. In addition, this article is based on theoretical derivation, so there are no technical details such as hyperparameters, optimizers, etc. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We have provided detailed information on the experiment in appendix section D and provided statistical information on the experiment, such as the number of tests and the variance obtained. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: The resources used in the experiment are introduced in appendix section D of the article. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. 
Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We have read and promise that our research conform with Neuro IPS ethical standards in every respect. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: In fact, the fundamental purpose of proposing the issue of CAT for ranking in our work is to address the issue of educational equity. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: This paper poses no such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. 
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: The code used in the article is all original and the dataset used is open-source, which can be used after being referenced. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: This document is provided in the supplementary materials. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. 
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.