# Knowledge Graph Completion by Intermediate Variables Regularization

Changyi Xiao, Yixin Cao
School of Computer Science, Fudan University
changyi_xiao@fudan.edu.cn, caoyixin2011@gmail.com

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Knowledge graph completion (KGC) can be framed as a 3rd-order binary tensor completion task. Tensor decomposition-based (TDB) models have demonstrated strong performance in KGC. In this paper, we provide a summary of existing TDB models and derive a general form for them, serving as a foundation for further exploration of TDB models. Despite the expressiveness of TDB models, they are prone to overfitting. Existing regularization methods merely minimize the norms of embeddings to regularize the model, leading to suboptimal performance. Therefore, we propose a novel regularization method for TDB models that addresses this limitation. The regularization is applicable to most TDB models and ensures tractable computation. Our method minimizes the norms of intermediate variables involved in the different ways of computing the predicted tensor. To support our regularization method, we provide a theoretical analysis that proves its effect in promoting a low trace norm of the predicted tensor to reduce overfitting. Finally, we conduct experiments to verify the effectiveness of our regularization technique as well as the reliability of our theoretical analysis. The code is available at https://github.com/changyi7231/IVR.

1 Introduction

A knowledge graph (KG) can be represented as a 3rd-order binary tensor, in which each entry corresponds to a triplet of the form (head entity, relation, tail entity). A value of 1 denotes a known true triplet, while 0 denotes a false triplet. Despite containing a large number of known triplets, KGs are often incomplete, with many triplets missing. Consequently, the 3rd-order tensors representing the KGs are incomplete. The objective of knowledge graph completion (KGC) is to infer the true or false values of the missing triplets based on the known ones, i.e., to predict which of the missing entries in the tensor are 1 or 0.

A number of models have been proposed for KGC, which can be classified into translation-based models, tensor decomposition-based (TDB) models and neural network models [Zhang et al., 2021]. We focus only on TDB models in this paper due to their wide applicability and strong performance [Lacroix et al., 2018, Zhang et al., 2020]. TDB models can be broadly categorized into two groups: CANDECOMP/PARAFAC (CP) decomposition-based models, including CP [Lacroix et al., 2018], DistMult [Yang et al., 2014] and ComplEx [Trouillon et al., 2017], and Tucker decomposition-based models, including SimplE [Kazemi and Poole, 2018], ANALOGY [Liu et al., 2017], QuatE [Zhang et al., 2019] and TuckER [Balažević et al., 2019]. To provide a thorough understanding of TDB models, we present a summary of existing models and derive a general form that unifies them, which provides a fundamental basis for further exploration of TDB models.

TDB models have been proven to be theoretically fully expressive [Trouillon et al., 2017, Kazemi and Poole, 2018, Balažević et al., 2019], meaning they can represent any real-valued tensor. In practice, however, TDB models frequently fall prey to severe overfitting. To counteract this issue, various regularization techniques have been employed in KGC.
One commonly used technique is the squared Frobenius norm regularization [Nickel et al., 2011, Yang et al., 2014, Trouillon et al., 2017]. Lacroix et al. [2018] proposed another regularization, the N3 norm regularization, based on the tensor nuclear 3-norm, which outperforms the squared Frobenius norm. Additionally, Zhang et al. [2020] introduced DURA, a regularization technique based on the duality of TDB models and distance-based models that yields significant improvements on benchmark datasets. Nevertheless, N3 and DURA rely on the CP decomposition, limiting their applicability to CP [Lacroix et al., 2018] and ComplEx [Trouillon et al., 2017]. As a result, there is a pressing need for a regularization technique that is widely applicable and can effectively alleviate the overfitting issue.

In this paper, we introduce a novel regularization method for KGC to improve the performance of TDB models. Our regularization focuses on preventing overfitting while preserving the expressiveness of TDB models as much as possible. It is applicable to most TDB models, while also ensuring tractable computation. Existing regularization methods for KGC rely on minimizing the norms of embeddings to regularize the model [Yang et al., 2014, Lacroix et al., 2018], leading to suboptimal performance. To achieve superior performance, we present an intermediate variables regularization (IVR) approach that minimizes the norms of the intermediate variables involved in the processes of computing the predicted tensor of TDB models. Additionally, our approach fully accounts for the different ways of computing the predicted tensor, because different computation orders may generate different intermediate variables.

To support the efficacy of our regularization approach, we further provide a theoretical analysis. We prove that our regularization is an upper bound of the overlapped trace norm [Tomioka et al., 2011]. The overlapped trace norm is the sum of the trace norms of the unfolding matrices along each mode of a tensor [Kolda and Bader, 2009], which can be considered a surrogate measure of the rank of a tensor. Thus, the overlapped trace norm reflects the correlation among entities and relations, which poses a constraint on the joint learning of entity and relation embeddings. Specifically, entities and relations in KGs are usually highly correlated. For example, some relations are mutual inverses, or a relation may be a composition of two other relations [Zhang et al., 2021]. By minimizing the upper bound of the overlapped trace norm, we encourage a high correlation among entities and relations, which brings strong regularization and alleviates the overfitting problem.

The main contributions of this paper are listed below:

1. We present a detailed overview of a wide range of TDB models and establish a general form to serve as a foundation for further TDB model analysis.
2. We introduce a new regularization approach for TDB models based on the general form to mitigate the overfitting issue, which is notable for its generality and effectiveness.
3. We provide a theoretical proof of the efficacy of our regularization and validate its practical utility through experiments.

2 Related Work

Tensor Decomposition Based Models. Research in KGC has been vast, with TDB models garnering attention due to their superior performance. The two primary TDB approaches that have been extensively studied are CP decomposition [Hitchcock, 1927] and Tucker decomposition [Tucker, 1966].
CP decomposition represents a tensor as a sum of rank-one tensors, while Tucker decomposition decomposes a tensor into a core tensor and a set of matrices. Kolda and Bader [2009] give more details about CP decomposition and Tucker decomposition. Several techniques have been developed for applying CP decomposition and Tucker decomposition in KGC. For instance, Lacroix et al. [2018] employed the original CP decomposition, whereas DistMult [Yang et al., 2014], a variant of CP decomposition, made the embedding matrices of head entities and tail entities identical to simplify the model. SimplE [Kazemi and Poole, 2018] tackled the problem of independence among the embeddings of head and tail entities within CP decomposition. ComplEx [Trouillon et al., 2017] extended DistMult to the complex space to handle asymmetric relations. QuatE [Zhang et al., 2019] explored the hypercomplex space to further enhance KGC. Other techniques include HolE [Nickel et al., 2016], which operates on circular correlation, and ANALOGY [Liu et al., 2017], which explicitly utilizes the analogical structures of KGs. Notably, Liu et al. [2017] showed that HolE is equivalent to ComplEx. Additionally, Balažević et al. [2019] introduced TuckER, which is based on the Tucker decomposition and has achieved state-of-the-art performance across various benchmark datasets for KGC. Nickel et al. [2011] proposed RESCAL, a three-way decomposition over each relational slice of the tensor. We refer the reader to [Zhang et al., 2021] or [Ji et al., 2021] for a more detailed discussion of KGC models and TDB models.

Regularization. Although TDB models are highly expressive [Trouillon et al., 2017, Kazemi and Poole, 2018, Balažević et al., 2019], they can suffer severely from overfitting in practice. Consequently, several regularization approaches have been proposed. A common approach is to apply the squared Frobenius norm to the model parameters [Nickel et al., 2011, Yang et al., 2014, Trouillon et al., 2017]. However, this approach does not correspond to a proper tensor norm, as shown by Lacroix et al. [2018]. Therefore, they proposed a novel regularization method, N3, based on the tensor nuclear 3-norm, which is an upper bound of the tensor nuclear norm. Likewise, Zhang et al. [2020] introduced DURA, a regularization method that exploits the duality of TDB models and distance-based models and serves as an upper bound of the tensor nuclear 2-norm. However, both N3 and DURA are derived from the CP decomposition, and thus are only applicable to the CP and ComplEx models.

In Section 3.1, we begin by providing an overview of existing TDB models and derive a general form for them. Thereafter, we present our intermediate variables regularization (IVR) approach in Section 3.2. Finally, in Section 3.3, we provide a theoretical analysis to support the efficacy of our proposed regularization technique.

3.1 General Form

To facilitate the subsequent theoretical analysis, we first provide a summary of existing TDB models and derive a general form for them. Given a set of entities $E$ and a set of relations $R$, a KG contains a set of triplets $S = \{(i, j, k)\} \subseteq E \times R \times E$. Let $X \in \{0, 1\}^{|E| \times |R| \times |E|}$ represent the KG tensor, with $X_{ijk} = 1$ iff $(i, j, k) \in S$, where $|E|$ and $|R|$ denote the numbers of entities and relations, respectively. Let $H \in \mathbb{R}^{|E| \times D}$, $R \in \mathbb{R}^{|R| \times D}$ and $T \in \mathbb{R}^{|E| \times D}$ be the embedding matrices of head entities, relations and tail entities, respectively, where $D$ is the embedding dimension.
Various TDB models can be obtained by partitioning the embedding matrices into P parts. We reshape $H \in \mathbb{R}^{|E| \times D}$, $R \in \mathbb{R}^{|R| \times D}$ and $T \in \mathbb{R}^{|E| \times D}$ into $H \in \mathbb{R}^{|E| \times (D/P) \times P}$, $R \in \mathbb{R}^{|R| \times (D/P) \times P}$ and $T \in \mathbb{R}^{|E| \times (D/P) \times P}$, respectively, where P is the number of parts. Different choices of P yield different TDB models.

CP/DistMult. Let P = 1. CP [Lacroix et al., 2018] can be represented as
$$X_{ijk} = \langle H_{i:1}, R_{j:1}, T_{k:1} \rangle := \sum_{d=1}^{D} H_{id1} R_{jd1} T_{kd1},$$
where $\langle \cdot, \cdot, \cdot \rangle$ is the dot product of three vectors. DistMult [Yang et al., 2014] is a particular case of CP that shares the embedding matrices of head entities and tail entities, i.e., H = T.

ComplEx/HolE. Let P = 2. ComplEx [Trouillon et al., 2017] can be represented as
$$X_{ijk} = \langle H_{i:1}, R_{j:1}, T_{k:1} \rangle + \langle H_{i:2}, R_{j:1}, T_{k:2} \rangle + \langle H_{i:1}, R_{j:2}, T_{k:2} \rangle - \langle H_{i:2}, R_{j:2}, T_{k:1} \rangle.$$
Liu et al. [2017] proved that HolE [Nickel et al., 2016] is equivalent to ComplEx.

SimplE. Let P = 2. SimplE [Kazemi and Poole, 2018] can be represented as
$$X_{ijk} = \langle H_{i:1}, R_{j:1}, T_{k:2} \rangle + \langle H_{i:2}, R_{j:2}, T_{k:1} \rangle.$$

ANALOGY. Let P = 4. ANALOGY [Liu et al., 2017] can be represented as
$$X_{ijk} = \langle H_{i:1}, R_{j:1}, T_{k:1} \rangle + \langle H_{i:2}, R_{j:2}, T_{k:2} \rangle + \langle H_{i:3}, R_{j:3}, T_{k:3} \rangle + \langle H_{i:3}, R_{j:4}, T_{k:4} \rangle + \langle H_{i:4}, R_{j:3}, T_{k:4} \rangle - \langle H_{i:4}, R_{j:4}, T_{k:3} \rangle.$$

QuatE. Let P = 4. QuatE [Zhang et al., 2019] can be represented as
$$\begin{aligned}
X_{ijk} = {} & \langle H_{i:1}, R_{j:1}, T_{k:1} \rangle - \langle H_{i:2}, R_{j:2}, T_{k:1} \rangle - \langle H_{i:3}, R_{j:3}, T_{k:1} \rangle - \langle H_{i:4}, R_{j:4}, T_{k:1} \rangle \\
& + \langle H_{i:1}, R_{j:2}, T_{k:2} \rangle + \langle H_{i:2}, R_{j:1}, T_{k:2} \rangle + \langle H_{i:3}, R_{j:4}, T_{k:2} \rangle - \langle H_{i:4}, R_{j:3}, T_{k:2} \rangle \\
& + \langle H_{i:1}, R_{j:3}, T_{k:3} \rangle - \langle H_{i:2}, R_{j:4}, T_{k:3} \rangle + \langle H_{i:3}, R_{j:1}, T_{k:3} \rangle + \langle H_{i:4}, R_{j:2}, T_{k:3} \rangle \\
& + \langle H_{i:1}, R_{j:4}, T_{k:4} \rangle + \langle H_{i:2}, R_{j:3}, T_{k:4} \rangle - \langle H_{i:3}, R_{j:2}, T_{k:4} \rangle + \langle H_{i:4}, R_{j:1}, T_{k:4} \rangle.
\end{aligned}$$

TuckER. Let P = D. TuckER [Balažević et al., 2019] can be represented as
$$X_{ijk} = \sum_{l=1}^{P} \sum_{m=1}^{P} \sum_{n=1}^{P} W_{lmn} H_{i1l} R_{j1m} T_{k1n},$$
where $W \in \mathbb{R}^{P \times P \times P}$ is the core tensor.

General Form. Through our analysis, we observe that all these TDB models can be expressed as a linear combination of several dot products. The key distinguishing factors among the models are the choice of the number of parts P and of the core tensor W. The number of parts P determines the dimensions of the dot products of the embeddings, while the core tensor W determines the strength of the dot products. It is important to note that TuckER uses a parameter tensor as its core tensor, whereas the core tensors of the other models are predetermined constant tensors. Therefore, we can derive a general form of these models as
$$X_{ijk} = \sum_{l,m,n=1}^{P} W_{lmn} \langle H_{i:l}, R_{j:m}, T_{k:n} \rangle = \sum_{l,m,n=1}^{P} W_{lmn} \Big( \sum_{d=1}^{D/P} H_{idl} R_{jdm} T_{kdn} \Big) = \sum_{d=1}^{D/P} \Big( \sum_{l,m,n=1}^{P} W_{lmn} H_{idl} R_{jdm} T_{kdn} \Big), \quad (1)$$
or, equivalently, in tensor form,
$$X = \sum_{d=1}^{D/P} W \times_1 H_{:d:} \times_2 R_{:d:} \times_3 T_{:d:}, \quad (2)$$
where $\times_n$ is the mode-n product [Kolda and Bader, 2009] and $W \in \mathbb{R}^{P \times P \times P}$ is the core tensor, which can be a parameter tensor or a predetermined constant tensor. The general form Eq.(2) can also be viewed as a sum of D/P TuckER decompositions, which is also called a block-term decomposition [De Lathauwer, 2008]; Eq.(2) is a block-term decomposition with a shared core tensor W. This general form is easy to understand, facilitates a better understanding of TDB models, and paves the way for further exploration of them. It presents a unified view of TDB models and helps researchers understand the relationships between different TDB models. Moreover, it motivates new methods and unified theoretical frameworks that are applicable to most TDB models. Our proposed regularization in Section 3.2 and the theoretical analysis in Section 3.3 are examples of such contributions.
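To make the general form concrete, here is a minimal NumPy sketch (not the authors' released implementation) that scores all triplets with Eq.(1)/(2) for a given number of parts P and core tensor W. The ComplEx core tensor below follows the signs listed above; the reshape and indexing convention is one concrete choice made for illustration.

```python
import numpy as np

def general_form_scores(H, R, T, W, P):
    """Score tensor of Eq.(2): X = sum_d W x1 H[:, d, :] x2 R[:, d, :] x3 T[:, d, :].

    H: (n_ent, D), R: (n_rel, D), T: (n_ent, D), W: (P, P, P) core tensor.
    Returns X with shape (n_ent, n_rel, n_ent).
    """
    n_ent, D = H.shape
    n_rel = R.shape[0]
    # Reshape embeddings into (num, D/P, P) parts, as in Section 3.1.
    Hp = H.reshape(n_ent, D // P, P)
    Rp = R.reshape(n_rel, D // P, P)
    Tp = T.reshape(n_ent, D // P, P)
    # Eq.(1): X_ijk = sum_d sum_{l,m,n} W_lmn H_idl R_jdm T_kdn.
    return np.einsum("lmn,idl,jdm,kdn->ijk", W, Hp, Rp, Tp)

# ComplEx corresponds to P = 2 with a fixed core tensor (0-based indices here):
W_complex = np.zeros((2, 2, 2))
W_complex[0, 0, 0] = 1.0   # <H_i:1, R_j:1, T_k:1>
W_complex[1, 0, 1] = 1.0   # <H_i:2, R_j:1, T_k:2>
W_complex[0, 1, 1] = 1.0   # <H_i:1, R_j:2, T_k:2>
W_complex[1, 1, 0] = -1.0  # -<H_i:2, R_j:2, T_k:1>

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8)); R = rng.normal(size=(3, 8)); T = rng.normal(size=(5, 8))
X = general_form_scores(H, R, T, W_complex, P=2)  # shape (5, 3, 5)
```

Swapping in a different core tensor (for instance the TuckER parameter tensor with P = D) recovers the other models listed above under the same interface.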
The Number of Parameters and Computational Complexity. The parameters of Eq.(2) come from two parts: the core tensor W and the embedding matrices H, R and T. The number of parameters of the core tensor W is $P^3$ if W is a parameter tensor and 0 otherwise. The number of parameters of the embedding matrices is $|E|D + |R|D$ if H = T and $2|E|D + |R|D$ otherwise. The computational complexity of Eq.(2) is $O(DP^2|E|^2|R|)$. The larger the number of parts P, the more expressive the model and the greater the computation. Therefore, the choice of P is a trade-off between expressiveness and computation.

Figure 1: Left: a 3rd-order tensor $X \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. Middle: the corresponding mode-i fibers of the tensor. Fibers are the higher-order analogue of matrix rows and columns; a fiber is obtained by fixing every index but one. Right: the corresponding mode-i unfoldings $X_{(1)} \in \mathbb{R}^{n_1 \times (n_2 n_3)}$, $X_{(2)} \in \mathbb{R}^{n_2 \times (n_3 n_1)}$ and $X_{(3)} \in \mathbb{R}^{n_3 \times (n_1 n_2)}$. The mode-i unfolding of a tensor arranges the mode-i fibers as the columns of the resulting matrix.

TuckER and Eq.(2). TuckER [Balažević et al., 2019] also demonstrated that TDB models can be represented as a Tucker decomposition by setting specific core tensors W. Nevertheless, we stress that TuckER does not explicitly consider the number of parts P and the core tensor W, which determine the number of parameters and the computational complexity of TDB models. Moreover, in Appendix A, we show that the conditions for a TDB model to learn logical rules also depend on P and W. By selecting appropriate P and W, TDB models can learn symmetry rules, antisymmetry rules and inverse rules.

3.2 Intermediate Variables Regularization

TDB models are theoretically fully expressive [Trouillon et al., 2017, Kazemi and Poole, 2018, Balažević et al., 2019], i.e., they can represent any real-valued tensor. However, TDB models suffer from overfitting in practice [Lacroix et al., 2018]. Therefore, several regularization methods have been proposed, such as the squared Frobenius norm method [Yang et al., 2014] and the nuclear 3-norm method [Lacroix et al., 2018], which minimize the norms of the embeddings {H, R, T} to regularize the model. Nonetheless, merely minimizing the embeddings tends to have a suboptimal impact on model performance. To enhance performance, we introduce a new regularization method that minimizes the norms of the intermediate variables involved in the processes of computing X. To ensure broad applicability, our regularization is rooted in the general form of TDB models, Eq.(2).

To compute X in Eq.(2), we can first compute the intermediate variable $W \times_1 H_{:d:} \times_2 R_{:d:}$ and then combine it with $T_{:d:}$ to compute X. Thus, in addition to minimizing the norm of $T_{:d:}$, we also need to minimize the norm of $W \times_1 H_{:d:} \times_2 R_{:d:}$. Since different ways of computing X result in different intermediate variables, we fully consider the possible computation orders of X. Eq.(2) can also be written as
$$X = \sum_{d=1}^{D/P} (W \times_2 R_{:d:} \times_3 T_{:d:}) \times_1 H_{:d:} = \sum_{d=1}^{D/P} (W \times_3 T_{:d:} \times_1 H_{:d:}) \times_2 R_{:d:}.$$
Thus, we also need to minimize the norms of the intermediate variables $\{W \times_2 R_{:d:} \times_3 T_{:d:},\ W \times_3 T_{:d:} \times_1 H_{:d:}\}$ and $\{H_{:d:}, R_{:d:}\}$. In summary, we should minimize the (powers of Frobenius) norms $\{\|H_{:d:}\|_F^\alpha, \|R_{:d:}\|_F^\alpha, \|T_{:d:}\|_F^\alpha\}$ and $\{\|W \times_1 H_{:d:} \times_2 R_{:d:}\|_F^\alpha, \|W \times_2 R_{:d:} \times_3 T_{:d:}\|_F^\alpha, \|W \times_3 T_{:d:} \times_1 H_{:d:}\|_F^\alpha\}$, where $\alpha$ is the power of the norms.

Since computing X is equivalent to computing $X_{(1)}$, $X_{(2)}$ or $X_{(3)}$, we can also minimize the norms of the intermediate variables involved in computing $X_{(1)}$, $X_{(2)}$ and $X_{(3)}$, where $X_{(n)}$ is the mode-n unfolding of the tensor X [Kolda and Bader, 2009]; see Figure 1 for an illustration of the notation $X_{(n)}$.
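As a small illustration of the unfolding notation, the sketch below builds the three unfoldings of a tensor. The exact column ordering inside each unfolding is a convention choice (Kolda and Bader use Fortran ordering; the version below uses C ordering), which does not affect ranks or trace norms.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding X_(n): arrange the mode-n fibers of X as the columns of a matrix.

    For X of shape (n1, n2, n3), unfold(X, 0) has shape (n1, n2*n3), and so on.
    """
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X = np.arange(24, dtype=float).reshape(2, 3, 4)
print(unfold(X, 0).shape, unfold(X, 1).shape, unfold(X, 2).shape)
# (2, 12) (3, 8) (4, 6)
```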
We can represent $X_{(1)}$, $X_{(2)}$ and $X_{(3)}$ as [Kolda and Bader, 2009]
$$X_{(1)} = \sum_{d=1}^{D/P} (W \times_1 H_{:d:})_{(1)} (T_{:d:} \otimes R_{:d:})^T, \quad X_{(2)} = \sum_{d=1}^{D/P} (W \times_2 R_{:d:})_{(2)} (T_{:d:} \otimes H_{:d:})^T, \quad X_{(3)} = \sum_{d=1}^{D/P} (W \times_3 T_{:d:})_{(3)} (R_{:d:} \otimes H_{:d:})^T,$$
where $\otimes$ is the Kronecker product. Thus, the intermediate variables also include $\{W \times_1 H_{:d:}, W \times_2 R_{:d:}, W \times_3 T_{:d:}\}$ and $\{T_{:d:} \otimes R_{:d:}, T_{:d:} \otimes H_{:d:}, R_{:d:} \otimes H_{:d:}\}$. Therefore, we should minimize the (powers of Frobenius) norms $\{\|W \times_1 H_{:d:}\|_F^\alpha, \|W \times_2 R_{:d:}\|_F^\alpha, \|W \times_3 T_{:d:}\|_F^\alpha\}$ and $\{\|T_{:d:} \otimes R_{:d:}\|_F^\alpha = \|T_{:d:}\|_F^\alpha \|R_{:d:}\|_F^\alpha,\ \|T_{:d:} \otimes H_{:d:}\|_F^\alpha = \|T_{:d:}\|_F^\alpha \|H_{:d:}\|_F^\alpha,\ \|R_{:d:} \otimes H_{:d:}\|_F^\alpha = \|R_{:d:}\|_F^\alpha \|H_{:d:}\|_F^\alpha\}$.

Our intermediate variables regularization (IVR) is defined as the combination of all these norms:
$$\mathrm{reg} = \sum_{d=1}^{D/P} \Big[ \lambda_1 \big( \|H_{:d:}\|_F^\alpha + \|R_{:d:}\|_F^\alpha + \|T_{:d:}\|_F^\alpha \big) + \lambda_2 \big( \|T_{:d:}\|_F^\alpha \|R_{:d:}\|_F^\alpha + \|T_{:d:}\|_F^\alpha \|H_{:d:}\|_F^\alpha + \|R_{:d:}\|_F^\alpha \|H_{:d:}\|_F^\alpha \big) + \lambda_3 \big( \|W \times_1 H_{:d:}\|_F^\alpha + \|W \times_2 R_{:d:}\|_F^\alpha + \|W \times_3 T_{:d:}\|_F^\alpha \big) + \lambda_4 \big( \|W \times_2 R_{:d:} \times_3 T_{:d:}\|_F^\alpha + \|W \times_3 T_{:d:} \times_1 H_{:d:}\|_F^\alpha + \|W \times_1 H_{:d:} \times_2 R_{:d:}\|_F^\alpha \big) \Big], \quad (3)$$
where $\{\lambda_i > 0 \mid i = 1, 2, 3, 4\}$ are the regularization coefficients. In conclusion, our proposed regularization term is the sum of the norms of the variables involved in the different ways of computing the tensor X.

We can easily obtain the weighted version of Eq.(3), in which the regularization term corresponds to the sampled training triplets only [Lacroix et al., 2018, Zhang et al., 2020]. For a training triplet (i, j, k), the weighted version of Eq.(3) is
$$\mathrm{reg}(X_{ijk}) = \sum_{d=1}^{D/P} \Big[ \lambda_1 \big( \|H_{id:}\|_F^\alpha + \|R_{jd:}\|_F^\alpha + \|T_{kd:}\|_F^\alpha \big) + \lambda_2 \big( \|T_{kd:}\|_F^\alpha \|R_{jd:}\|_F^\alpha + \|T_{kd:}\|_F^\alpha \|H_{id:}\|_F^\alpha + \|R_{jd:}\|_F^\alpha \|H_{id:}\|_F^\alpha \big) + \lambda_3 \big( \|W \times_1 H_{id:}\|_F^\alpha + \|W \times_2 R_{jd:}\|_F^\alpha + \|W \times_3 T_{kd:}\|_F^\alpha \big) + \lambda_4 \big( \|W \times_2 R_{jd:} \times_3 T_{kd:}\|_F^\alpha + \|W \times_3 T_{kd:} \times_1 H_{id:}\|_F^\alpha + \|W \times_1 H_{id:} \times_2 R_{jd:}\|_F^\alpha \big) \Big]. \quad (4)$$
The computational complexity of Eq.(4) is the same as that of Eq.(1), i.e., $O(DP^2)$, which ensures that our regularization is computationally tractable. The hyper-parameters $\lambda_i$ make IVR scalable: we can easily reduce the number of hyper-parameters by setting some of them to zero or equal to each other, which allows a balance between performance and efficiency, as shown in Section 4.3. We set $\lambda_1 = \lambda_3$ and $\lambda_2 = \lambda_4$ for all models to reduce the number of hyper-parameters. Please refer to Appendix C for more details about the setting of hyper-parameters.

We use the same loss function, the multiclass log-loss, as in [Lacroix et al., 2018]. For a training triplet (i, j, k), our loss function is
$$\ell(X_{ijk}) = -X_{ijk} + \log \Big( \sum_{k'=1}^{|E|} \exp(X_{ijk'}) \Big) + \mathrm{reg}(X_{ijk}).$$
At test time, we use $X_{i,j,:}$ to rank tail entities for a query (i, j, ?).
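The following is a minimal NumPy sketch of the weighted regularizer Eq.(4) and the training loss for a single triplet (i, j, k). It assumes the reshaped embeddings and core tensor of Section 3.1 and is illustrative rather than the released code.

```python
import numpy as np

def ivr_reg(h, r, t, W, lams, alpha):
    """Weighted IVR term reg(X_ijk) of Eq.(4) for one triplet (i, j, k).

    h, r, t: reshaped embeddings H_i::, R_j::, T_k::, each of shape (D/P, P).
    W: (P, P, P) core tensor; lams = (lam1, lam2, lam3, lam4); alpha: norm power.
    """
    lam1, lam2, lam3, lam4 = lams
    nrm = lambda x: np.linalg.norm(x) ** alpha  # Frobenius norm raised to the power alpha
    reg = 0.0
    for d in range(h.shape[0]):                       # sum over d = 1, ..., D/P
        hd, rd, td = h[d], r[d], t[d]                 # P-vectors H_id:, R_jd:, T_kd:
        Wh  = np.einsum("lmn,l->mn", W, hd)           # W x1 H_id:
        Wr  = np.einsum("lmn,m->ln", W, rd)           # W x2 R_jd:
        Wt  = np.einsum("lmn,n->lm", W, td)           # W x3 T_kd:
        Wrt = np.einsum("lmn,m,n->l", W, rd, td)      # W x2 R_jd: x3 T_kd:
        Wth = np.einsum("lmn,n,l->m", W, td, hd)      # W x3 T_kd: x1 H_id:
        Whr = np.einsum("lmn,l,m->n", W, hd, rd)      # W x1 H_id: x2 R_jd:
        reg += lam1 * (nrm(hd) + nrm(rd) + nrm(td))
        reg += lam2 * (nrm(td) * nrm(rd) + nrm(td) * nrm(hd) + nrm(rd) * nrm(hd))
        reg += lam3 * (nrm(Wh) + nrm(Wr) + nrm(Wt))
        reg += lam4 * (nrm(Wrt) + nrm(Wth) + nrm(Whr))
    return reg

def triplet_loss(scores_ij, k, reg):
    """Multiclass log-loss for the query (i, j, ?) plus the IVR term.

    scores_ij: the slice X_{i,j,:} over all candidate tail entities; k: true tail index.
    """
    return -scores_ij[k] + np.log(np.sum(np.exp(scores_ij))) + reg
```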
3.3 Theoretical Analysis

To support the effectiveness of our regularization IVR, we provide a deeper theoretical analysis of its properties. The theoretical framework of IVR is inspired by Lemma 1 in Appendix B, which relates the Frobenius norm and the trace norm of a matrix. Lemma 1 shows that the trace norm of a matrix is upper-bounded by a function of the Frobenius norms of several intermediate variables, which prompts us to establish the relationship between IVR and the trace norm. Based on Lemma 1, we prove that IVR is an upper bound of the overlapped trace norm of the predicted tensor, which promotes a low trace norm of the predicted tensor to regularize the model. The overlapped trace norm [Kolda and Bader, 2009] of a 3rd-order tensor is defined as
$$L(X; \alpha) := \|X_{(1)}\|_*^{\alpha/2} + \|X_{(2)}\|_*^{\alpha/2} + \|X_{(3)}\|_*^{\alpha/2},$$
where $\alpha$ is the power coefficient in Eq.(3), and $\|X_{(1)}\|_*$, $\|X_{(2)}\|_*$ and $\|X_{(3)}\|_*$ are the matrix trace norms of $X_{(1)}$, $X_{(2)}$ and $X_{(3)}$, i.e., the sums of singular values of the respective matrices.

The matrix trace norm is widely used as a convex surrogate for the matrix rank, owing to the non-differentiability of the rank [Goldfarb and Qin, 2014, Lu et al., 2016, Mu et al., 2014]. Thus, $L(X; \alpha)$ serves as a surrogate for $\mathrm{rank}(X_{(1)})^{\alpha/2} + \mathrm{rank}(X_{(2)})^{\alpha/2} + \mathrm{rank}(X_{(3)})^{\alpha/2}$, where $\mathrm{rank}(X_{(1)})$, $\mathrm{rank}(X_{(2)})$ and $\mathrm{rank}(X_{(3)})$ are the matrix ranks of $X_{(1)}$, $X_{(2)}$ and $X_{(3)}$, respectively. In KGs, each head entity, each relation and each tail entity uniquely corresponds to a row of $X_{(1)}$, $X_{(2)}$ and $X_{(3)}$, respectively. Therefore, $\mathrm{rank}(X_{(1)})$, $\mathrm{rank}(X_{(2)})$ and $\mathrm{rank}(X_{(3)})$ measure the correlation among the head entities, the relations and the tail entities, respectively. Entities and relations in KGs are highly correlated; for instance, some relations are mutual inverses, or one relation may be a composition of two other relations [Zhang et al., 2021]. Thus, the overlapped trace norm $L(X; \alpha)$ can pose a constraint on the embeddings of entities and relations. Minimizing $L(X; \alpha)$ encourages a high correlation among entities and relations, which brings strong regularization and reduces overfitting.

We next establish the relationship between our regularization term Eq.(3) and $L(X; \alpha)$ through Proposition 1 and Proposition 2, which show that Eq.(3) is an upper bound of $L(X; \alpha)$.

Proposition 1. For any X and any decomposition of X, $X = \sum_{d=1}^{D/P} W \times_1 H_{:d:} \times_2 R_{:d:} \times_3 T_{:d:}$, we have
$$2\sqrt{\lambda_1 \lambda_4}\, L(X; \alpha) \le \sum_{d=1}^{D/P} \lambda_1 \big( \|H_{:d:}\|_F^\alpha + \|R_{:d:}\|_F^\alpha + \|T_{:d:}\|_F^\alpha \big) + \lambda_4 \big( \|W \times_2 R_{:d:} \times_3 T_{:d:}\|_F^\alpha + \|W \times_3 T_{:d:} \times_1 H_{:d:}\|_F^\alpha + \|W \times_1 H_{:d:} \times_2 R_{:d:}\|_F^\alpha \big). \quad (5)$$
If $X_{(1)} = U_1 \Sigma_1 V_1^T$, $X_{(2)} = U_2 \Sigma_2 V_2^T$ and $X_{(3)} = U_3 \Sigma_3 V_3^T$ are compact singular value decompositions of $X_{(1)}$, $X_{(2)}$ and $X_{(3)}$ [Bai et al., 2000], respectively, then there exists a decomposition of X, $X = \sum_{d=1}^{D/P} W \times_1 H_{:d:} \times_2 R_{:d:} \times_3 T_{:d:}$, such that the two sides of Eq.(5) are equal.

Proposition 2. For any X and any decomposition of X, $X = \sum_{d=1}^{D/P} W \times_1 H_{:d:} \times_2 R_{:d:} \times_3 T_{:d:}$, we have
$$2\sqrt{\lambda_2 \lambda_3}\, L(X; \alpha) \le \sum_{d=1}^{D/P} \lambda_2 \big( \|T_{:d:}\|_F^\alpha \|R_{:d:}\|_F^\alpha + \|T_{:d:}\|_F^\alpha \|H_{:d:}\|_F^\alpha + \|R_{:d:}\|_F^\alpha \|H_{:d:}\|_F^\alpha \big) + \lambda_3 \big( \|W \times_1 H_{:d:}\|_F^\alpha + \|W \times_2 R_{:d:}\|_F^\alpha + \|W \times_3 T_{:d:}\|_F^\alpha \big). \quad (6)$$
Moreover, there exists some X' such that, for any decomposition of X', the two sides of Eq.(6) cannot achieve equality.

Please refer to Appendix B for the proofs. Proposition 1 establishes that the r.h.s. of Eq.(5) is a tight upper bound of $2\sqrt{\lambda_1 \lambda_4} L(X; \alpha)$, while Proposition 2 shows that the r.h.s. of Eq.(6) is an upper bound of $2\sqrt{\lambda_2 \lambda_3} L(X; \alpha)$, but this bound is not always tight. Our proposed regularization term, Eq.(3), combines these two upper bounds by adding the r.h.s. of Eq.(5) and the r.h.s. of Eq.(6). As a result, minimizing Eq.(3) effectively minimizes $L(X; \alpha)$ to regularize the model.

The two sides of Eq.(5) can achieve equality if P = D, i.e., when the TDB model is the TuckER model [Balažević et al., 2019]. Although $L(X; \alpha)$ may not be a tight lower bound of the r.h.s. of Eq.(5) for TDB models other than TuckER, it remains a common lower bound for all TDB models. To obtain a tighter lower bound, the exact values of P and W are required. For example, in the case of the CP model [Lacroix et al., 2018] (P = 1 and W = 1), the nuclear 2-norm $\|X\|_*$ is a tighter lower bound. The nuclear 2-norm is defined as
$$\|X\|_* := \min \Big\{ \sum_{d=1}^{D} \|H_{:d1}\|_F \|R_{:d1}\|_F \|T_{:d1}\|_F \;\Big|\; X = \sum_{d=1}^{D} W \times_1 H_{:d1} \times_2 R_{:d1} \times_3 T_{:d1} \Big\},$$
where W = 1.
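For concreteness, here is a short sketch of how $L(X; \alpha)$ can be computed for a small predicted tensor (this is the quantity reported in Section 4.4); it is an illustrative helper, not code from the paper.

```python
import numpy as np

def overlapped_trace_norm(X, alpha):
    """L(X; alpha) = sum_n ||X_(n)||_*^(alpha/2): sum of trace norms of the unfoldings."""
    total = 0.0
    for mode in range(3):
        Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)  # mode-n unfolding
        trace_norm = np.sum(np.linalg.svd(Xn, compute_uv=False)) # sum of singular values
        total += trace_norm ** (alpha / 2)
    return total

X = np.random.default_rng(0).normal(size=(10, 4, 10))  # e.g. a small predicted tensor
print(overlapped_trace_norm(X, alpha=2.0))
```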
The following proposition establishes the relationship between $\|X\|_*$ and $L(X; 2)$.

Proposition 3. For any X and any decomposition of X, $X = \sum_{d=1}^{D} W \times_1 H_{:d1} \times_2 R_{:d1} \times_3 T_{:d1}$ with W = 1, we have
$$2\sqrt{\lambda_1 \lambda_4}\, L(X; 2) \le 6\sqrt{\lambda_1 \lambda_4}\, \|X\|_* \le \sum_{d=1}^{D} \lambda_1 \big( \|H_{:d1}\|_F^2 + \|R_{:d1}\|_F^2 + \|T_{:d1}\|_F^2 \big) + \lambda_4 \big( \|W \times_2 R_{:d1} \times_3 T_{:d1}\|_F^2 + \|W \times_3 T_{:d1} \times_1 H_{:d1}\|_F^2 + \|W \times_1 H_{:d1} \times_2 R_{:d1}\|_F^2 \big).$$

Although the r.h.s. of Eq.(6) is not always a tight upper bound of $2\sqrt{\lambda_2 \lambda_3} L(X; \alpha)$, unlike the r.h.s. of Eq.(5), we observe that minimizing the combination of the two bounds, Eq.(3), leads to better performance. The reason is that the r.h.s. of Eq.(5) is neither an upper bound nor a lower bound of the r.h.s. of Eq.(6) for all X; we present Proposition 4 in Appendix B to prove this claim.

4 Experiments

We first introduce the experimental settings in Section 4.1 and present the results in Section 4.2. We then conduct ablation studies in Section 4.3. Finally, we verify the reliability of our proposed upper bounds in Section 4.4. Please refer to Appendix C for more experimental details.

4.1 Experimental Settings

Datasets. We evaluate the models on three KGC datasets: WN18RR [Dettmers et al., 2018], FB15k-237 [Toutanova et al., 2015] and YAGO3-10 [Dettmers et al., 2018].

Models. We use CP, ComplEx, SimplE, ANALOGY, QuatE and TuckER as baselines. We denote CP with the squared Frobenius norm method [Yang et al., 2014] as CP-F2, CP with the N3 method [Lacroix et al., 2018] as CP-N3, CP with the DURA method [Zhang et al., 2020] as CP-DURA, and CP with the IVR method as CP-IVR. The notations for the other models are analogous.

Table 1: Knowledge graph completion results on the WN18RR, FB15k-237 and YAGO3-10 datasets.

| Model | WN18RR MRR | WN18RR H@1 | WN18RR H@10 | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@10 | YAGO3-10 MRR | YAGO3-10 H@1 | YAGO3-10 H@10 |
|---|---|---|---|---|---|---|---|---|---|
| CP | 0.438 | 0.416 | 0.485 | 0.332 | 0.244 | 0.507 | 0.567 | 0.495 | 0.696 |
| CP-F2 | 0.449 | 0.420 | 0.506 | 0.331 | 0.243 | 0.507 | 0.570 | 0.499 | 0.699 |
| CP-N3 | 0.469 | 0.432 | 0.541 | 0.355 | 0.261 | 0.542 | 0.575 | 0.504 | 0.703 |
| CP-DURA | 0.471 | 0.433 | 0.545 | 0.364 | 0.269 | 0.554 | 0.577 | 0.504 | 0.704 |
| CP-IVR | 0.478 | 0.437 | 0.554 | 0.365 | 0.270 | 0.555 | 0.579 | 0.507 | 0.708 |
| ComplEx | 0.464 | 0.431 | 0.526 | 0.348 | 0.256 | 0.531 | 0.574 | 0.501 | 0.704 |
| ComplEx-F2 | 0.467 | 0.431 | 0.538 | 0.349 | 0.260 | 0.529 | 0.576 | 0.502 | 0.709 |
| ComplEx-N3 | 0.491 | 0.445 | 0.578 | 0.367 | 0.272 | 0.559 | 0.577 | 0.504 | 0.707 |
| ComplEx-DURA | 0.484 | 0.440 | 0.571 | 0.372 | 0.277 | 0.563 | 0.585 | 0.512 | 0.714 |
| ComplEx-IVR | 0.494 | 0.449 | 0.581 | 0.370 | 0.275 | 0.561 | 0.586 | 0.515 | 0.714 |
| SimplE | 0.443 | 0.421 | 0.488 | 0.337 | 0.248 | 0.514 | 0.565 | 0.491 | 0.696 |
| SimplE-F2 | 0.451 | 0.422 | 0.506 | 0.338 | 0.249 | 0.514 | 0.566 | 0.494 | 0.699 |
| SimplE-IVR | 0.470 | 0.436 | 0.537 | 0.357 | 0.264 | 0.544 | 0.578 | 0.504 | 0.707 |
| ANALOGY | 0.458 | 0.424 | 0.526 | 0.348 | 0.255 | 0.530 | 0.573 | 0.502 | 0.703 |
| ANALOGY-F2 | 0.467 | 0.434 | 0.533 | 0.349 | 0.258 | 0.529 | 0.573 | 0.501 | 0.705 |
| ANALOGY-IVR | 0.482 | 0.439 | 0.568 | 0.367 | 0.272 | 0.558 | 0.582 | 0.509 | 0.713 |
| QuatE | 0.460 | 0.430 | 0.518 | 0.349 | 0.258 | 0.530 | 0.566 | 0.489 | 0.702 |
| QuatE-F2 | 0.468 | 0.435 | 0.534 | 0.349 | 0.259 | 0.529 | 0.566 | 0.489 | 0.705 |
| QuatE-IVR | 0.493 | 0.447 | 0.580 | 0.369 | 0.274 | 0.561 | 0.582 | 0.509 | 0.712 |
| TuckER | 0.446 | 0.423 | 0.490 | 0.321 | 0.233 | 0.498 | 0.551 | 0.476 | 0.689 |
| TuckER-F2 | 0.449 | 0.423 | 0.496 | 0.327 | 0.239 | 0.503 | 0.566 | 0.492 | 0.700 |
| TuckER-IVR | 0.501 | 0.460 | 0.579 | 0.368 | 0.274 | 0.555 | 0.581 | 0.508 | 0.712 |

Table 2: Results on the WN18RR, FB15k-237 and YAGO3-10 datasets with different upper bounds.

| Model | WN18RR MRR | WN18RR H@1 | WN18RR H@10 | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@10 | YAGO3-10 MRR | YAGO3-10 H@1 | YAGO3-10 H@10 |
|---|---|---|---|---|---|---|---|---|---|
| TuckER | 0.446 | 0.423 | 0.490 | 0.321 | 0.233 | 0.498 | 0.551 | 0.476 | 0.689 |
| TuckER-IVR-1 | 0.497 | 0.455 | 0.578 | 0.366 | 0.272 | 0.553 | 0.576 | 0.501 | 0.709 |
| TuckER-IVR-2 | 0.459 | 0.423 | 0.518 | 0.336 | 0.241 | 0.518 | 0.568 | 0.493 | 0.700 |
| TuckER-IVR | 0.501 | 0.460 | 0.579 | 0.368 | 0.274 | 0.555 | 0.581 | 0.508 | 0.712 |

Table 3: Results on the Kinship dataset with different upper bounds.
| Model | $\|X_{(1)}\|_*$ | $\|X_{(2)}\|_*$ | $\|X_{(3)}\|_*$ | $L(X)$ |
|---|---|---|---|---|
| TuckER | 10,719 | 8,713 | 11,354 | 30,786 |
| TuckER-IVR-1 | 5,711 | 5,021 | 6,271 | 17,003 |
| TuckER-IVR-2 | 6,145 | 5,441 | 6,744 | 18,330 |
| TuckER-IVR | 3,538 | 3,511 | 3,988 | 11,037 |

Evaluation Metrics. We use the filtered MRR and Hits@N (H@N) [Bordes et al., 2013] as evaluation metrics and choose the hyper-parameters with the best filtered MRR on the validation set. We run each model three times with different random seeds and report the mean results.

4.2 Results

See Table 1 for the results. For CP and ComplEx, the models to which N3 and DURA are applicable, the results show that N3 enhances the models more than F2, and DURA outperforms both F2 and N3, leading to substantial improvements. IVR achieves better performance than DURA on the WN18RR dataset and similar performance to DURA on the FB15k-237 and YAGO3-10 datasets. For SimplE, ANALOGY, QuatE and TuckER, the improvement offered by F2 is minimal, while IVR significantly boosts model performance. In summary, these results demonstrate the effectiveness and generality of IVR.

4.3 Ablation Studies

We conduct ablation studies to examine the effectiveness of the upper bounds. Our notation is as follows: the model with the upper bound of Eq.(5) only is denoted IVR-1, the model with the upper bound of Eq.(6) only is denoted IVR-2, and the model with the combined bound of Eq.(3) is denoted IVR. See Table 2 for the results. We use TuckER as the baseline. IVR-1, which uses fewer regularization coefficients, achieves results comparable to vanilla IVR, which shows that IVR can still perform well with fewer hyper-parameters. IVR-1 outperforms IVR-2 due to the tightness of Eq.(5).

4.4 Upper Bounds

We verify that minimizing the upper bounds effectively minimizes $L(X; \alpha)$. $L(X; \alpha)$ measures the correlation of X: lower values of $L(X; \alpha)$ encourage higher correlations among entities and relations, and thus impose a strong constraint for regularization. After training the models, we compute $L(X; \alpha)$. As computing $L(X; \alpha)$ for large KGs is impractical, we conduct experiments on a small KG dataset, Kinship [Kok and Domingos, 2007], which consists of 104 entities and 25 relations. We use TuckER as the baseline and compare it against IVR-1 (Eq.(5)), IVR-2 (Eq.(6)) and IVR (Eq.(3)). See Table 3 for the results. The results demonstrate that the upper bounds are effective in minimizing $L(X; \alpha)$. All three upper bounds lead to a decrease in $L(X; \alpha)$, achieving better performance (Table 2) through more effective regularization. The $L(X; \alpha)$ of IVR-1 is smaller than that of IVR-2 because the upper bound in Eq.(5) is tight. IVR, which combines IVR-1 and IVR-2, produces the largest reduction of $L(X; \alpha)$, suggesting that combining the two upper bounds is more effective. Overall, our experimental results confirm the reliability of our theoretical analysis.

5 Conclusion

In this paper, we undertake an analysis of TDB models in KGC. We first offer a summary of TDB models and derive a general form that facilitates further analysis. TDB models often suffer from overfitting, and we therefore propose a regularization based on our derived general form. It is applicable to most TDB models and improves model performance. We further provide a theoretical analysis to support our regularization and experimentally validate this analysis.
Our regularization is limited to TDB models; we hope that similar regularization will be proposed for other types of models, such as translation-based models and neural network models. We also intend to explore how to apply our regularization to other fields, such as tensor completion [Song et al., 2019].

Acknowledgements

This work is supported by the National Key Research and Development Program of China (2020AAA0106000), the National Natural Science Foundation of China (U19A2079), and the CCCD Key Lab of Ministry of Culture and Tourism.

References

Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst. Templates for the solution of algebraic eigenvalue problems: a practical guide. SIAM, 2000.

Ivana Balažević, Carl Allen, and Timothy Hospedales. TuckER: Tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5185-5194, 2019.

James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24, 2011.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26, 2013.

Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Reexamining low rank matrix factorization for trace norm regularization. arXiv preprint arXiv:1706.08934, 2017.

Lieven De Lathauwer. Decompositions of a higher-order tensor in block terms - Part I: Lemmas for partitioned matrices. SIAM Journal on Matrix Analysis and Applications, 30(3):1022-1032, 2008.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Donald Goldfarb and Zhiwei Qin. Robust low-rank tensor recovery: Models and algorithms. SIAM Journal on Matrix Analysis and Applications, 35(1):225-253, 2014.

Frank L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164-189, 1927.

Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and S. Yu Philip. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33(2):494-514, 2021.

Seyed Mehran Kazemi and David Poole. SimplE embedding for link prediction in knowledge graphs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4289-4300, 2018.

Stanley Kok and Pedro Domingos. Statistical predicate invention. In Proceedings of the 24th International Conference on Machine Learning, pages 433-440, 2007.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, pages 2863-2872. PMLR, 2018.

Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical inference for multi-relational embeddings. In International Conference on Machine Learning, pages 2168-2178. PMLR, 2017.
Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, and Shuicheng Yan. Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5249-5257, 2016.

Cun Mu, Bo Huang, John Wright, and Donald Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73-81. PMLR, 2014.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning, pages 809-816, 2011.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings of knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Qingquan Song, Hancheng Ge, James Caverlee, and Xia Hu. Tensor completion algorithms in big data analytics. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(1):1-48, 2019.

Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, and Hisashi Kashima. Statistical performance of convex tensor decomposition. Advances in Neural Information Processing Systems, 24, 2011.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499-1509, 2015.

Théo Trouillon, Christopher R. Dance, Éric Gaussier, Johannes Welbl, Sebastian Riedel, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. Journal of Machine Learning Research, 18:1-38, 2017.

Ledyard R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279-311, 1966.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.

Jing Zhang, Bo Chen, Lingxi Zhang, Xirui Ke, and Haipeng Ding. Neural, symbolic and neural-symbolic reasoning on knowledge graphs. AI Open, 2:14-35, 2021.

Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. Quaternion knowledge graph embeddings. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 2735-2745, 2019.

Zhanqiu Zhang, Jianyu Cai, and Jie Wang. Duality-induced regularizer for tensor factorization based knowledge graph completion. Advances in Neural Information Processing Systems, 33, 2020.

A Logical Rules

KGs often involve logical rules that capture inductive capacity [Zhang et al., 2021]. Thus, we analyze how to design models that can learn the symmetry, antisymmetry and inverse rules. We have derived a general form for TDB models, Eq.(1); next, we study how to enable Eq.(1) to learn logical rules. We first define the symmetry, antisymmetry and inverse rules. We denote $X_{ijk} = f(H_{i::}, R_{j::}, T_{k::})$ as $f(h, r, t)$ for simplicity. We assume H = T, which mainly aims to reduce overfitting [Yang et al., 2014]; existing TDB models for KGC, except CP [Lacroix et al., 2018], all force H = T.

A relation r is symmetric if $\forall h, t$, $(h, r, t) \in S \Leftrightarrow (t, r, h) \in S$. A model is able to learn the symmetry rules if
$$\exists r \in \mathbb{R}^D, r \neq 0, \; \forall h, t \in \mathbb{R}^D, \; f(h, r, t) = f(t, r, h).$$
A relation r is antisymmetric if $\forall h, t$, $(h, r, t) \in S \Rightarrow (t, r, h) \notin S$.
A model is able to learn the antisymmetry rules if
$$\exists r \in \mathbb{R}^D, r \neq 0, \; \forall h, t \in \mathbb{R}^D, \; f(h, r, t) = -f(t, r, h).$$
A relation $r_1$ is inverse to a relation $r_2$ if $\forall h, t$, $(h, r_1, t) \in S \Leftrightarrow (t, r_2, h) \in S$. A model is able to learn the inverse rules if
$$\forall r_1 \in \mathbb{R}^D, \exists r_2 \in \mathbb{R}^D, \; \forall h, t \in \mathbb{R}^D, \; f(h, r_1, t) = f(t, r_2, h).$$
We restrict $r \neq 0$ because $r = 0$ would make f identically zero. By choosing different P and W, we can define different TDB models, as discussed in Section 3.1. Next, we give a theoretical analysis that establishes the relationship between logical rules and TDB models.

Theorem 1. Assume a model can be represented in the form of Eq.(1). Then the model is able to learn the symmetry rules iff $\mathrm{rank}(W_{(2)}^T - SW_{(2)}^T) < P$; it is able to learn the antisymmetry rules iff $\mathrm{rank}(W_{(2)}^T + SW_{(2)}^T) < P$; and it is able to learn the inverse rules iff $\mathrm{rank}(W_{(2)}^T) = \mathrm{rank}([W_{(2)}^T, SW_{(2)}^T])$, where $S \in \mathbb{R}^{P^2 \times P^2}$ is the permutation matrix with $S_{(i-1)P+j,(j-1)P+i} = 1$ $(i, j = 1, 2, \ldots, P)$ and 0 otherwise, and $[W_{(2)}^T, SW_{(2)}^T]$ is the concatenation of the matrices $W_{(2)}^T$ and $SW_{(2)}^T$.

Proof. According to the symmetry rules, for any triplet (i, j, k) we need $f(i, j, k) = f(k, j, i)$, i.e., $f(H_{i:}, R_{j:}, T_{k:}) = f(H_{k:}, R_{j:}, T_{i:}) = f(T_{k:}, R_{j:}, H_{i:})$ (using H = T). Replacing $H_{i:}$ by h, $R_{j:}$ by r and $T_{k:}$ by t, we need $f(h, r, t) = f(t, r, h)$, i.e.,
$$\sum_{l,m,n=1}^{P} W_{lmn} \langle h_{:l}, r_{:m}, t_{:n} \rangle = \sum_{l,m,n=1}^{P} W_{lmn} \langle t_{:l}, r_{:m}, h_{:n} \rangle = \sum_{l,m,n=1}^{P} W_{nml} \langle h_{:l}, r_{:m}, t_{:n} \rangle,$$
which is equivalent to
$$\sum_{l,m,n=1}^{P} (W_{lmn} - W_{nml}) \langle h_{:l}, r_{:m}, t_{:n} \rangle = \sum_{l,m,n=1}^{P} (W_{lmn} - W_{nml})\, r_{:m}^T (h_{:l} \odot t_{:n}) = \sum_{l,n=1}^{P} \Big( \sum_{m=1}^{P} (W_{lmn} - W_{nml})\, r_{:m}^T \Big) (h_{:l} \odot t_{:n}) = 0,$$
where $\odot$ is the Hadamard product. Since this equation must hold for any $h, t \in \mathbb{R}^D$, we get
$$\sum_{m=1}^{P} (W_{lmn} - W_{nml})\, r_{:m}^T = 0 \quad (l, n = 1, 2, \ldots, P).$$
Therefore, a model is able to learn the symmetry rules iff
$$\exists r \in \mathbb{R}^D, r \neq 0, \quad \sum_{m=1}^{P} (W_{lmn} - W_{nml})\, r_{:m}^T = 0.$$
Thus, the symmetry rule is transformed into a system of linear equations, which has a non-zero solution iff $\mathrm{rank}(W_{(2)}^T - SW_{(2)}^T) < P$. Hence, a model is able to learn the symmetry rules iff $\mathrm{rank}(W_{(2)}^T - SW_{(2)}^T) < P$. Similarly, a model is able to learn the antisymmetry rules iff $\mathrm{rank}(W_{(2)}^T + SW_{(2)}^T) < P$.

For the inverse rules, $\forall r_1 \in \mathbb{R}^D, \exists r_2 \in \mathbb{R}^D, \forall h, t \in \mathbb{R}^D, f(h, r_1, t) = f(t, r_2, h)$, a model is able to learn the inverse rules iff the equation
$$W_{(2)}^T r_1 = SW_{(2)}^T r_2$$
holds for any $r_1 \in \mathbb{R}^D$ and some $r_2 \in \mathbb{R}^D$. Thus, the column vectors of $W_{(2)}^T$ can be expressed linearly by the column vectors of $SW_{(2)}^T$. Since S is a permutation matrix with $S = S^T$, we have $S^2 = I$, and thus $SW_{(2)}^T r_1 = S^2 W_{(2)}^T r_2 = W_{(2)}^T r_2$; that is, for any $r_1 \in \mathbb{R}^D$ there exists $r_2 \in \mathbb{R}^D$ such that this equation holds, so the column vectors of $SW_{(2)}^T$ can also be expressed linearly by the column vectors of $W_{(2)}^T$. Therefore, the column space of $W_{(2)}^T$ is equivalent to the column space of $SW_{(2)}^T$, and we have $\mathrm{rank}(W_{(2)}^T) = \mathrm{rank}(SW_{(2)}^T) = \mathrm{rank}([W_{(2)}^T, SW_{(2)}^T])$. Conversely, if $\mathrm{rank}(W_{(2)}^T) = \mathrm{rank}([W_{(2)}^T, SW_{(2)}^T])$, then the columns of $W_{(2)}^T$ can be expressed linearly by the columns of $SW_{(2)}^T$, and thus $\forall r_1 \in \mathbb{R}^D, \exists r_2 \in \mathbb{R}^D, W_{(2)}^T r_1 = SW_{(2)}^T r_2$. Therefore, a model is able to learn the inverse rules iff $\mathrm{rank}(W_{(2)}^T) = \mathrm{rank}([W_{(2)}^T, SW_{(2)}^T])$.

By this theorem, we only need to compare P with the matrix ranks involving $W_{(2)}$. ComplEx, SimplE, ANALOGY and QuatE design specific core tensors that enable the models to learn these logical rules, and we can easily verify that these models satisfy the conditions of the theorem.
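As a quick numerical companion to Theorem 1, the sketch below checks the three rank conditions for a given core tensor, here instantiated with the ComplEx core tensor (P = 2). The row ordering chosen for the mode-2 unfolding and the permutation S is one consistent convention (it does not affect the ranks), and the printed result agrees with the worked example that follows.

```python
import numpy as np

def rule_checks(W):
    """Check the Theorem 1 conditions for a core tensor W of shape (P, P, P)."""
    P = W.shape[0]
    # Rows are indexed by the pair (l, n), columns by m, so A[(l, n), m] = W[l, m, n];
    # this is W_(2)^T up to a fixed ordering of the (l, n) pairs.
    A  = np.array([[W[l, m, n] for m in range(P)] for l in range(P) for n in range(P)])
    SA = np.array([[W[n, m, l] for m in range(P)] for l in range(P) for n in range(P)])
    symmetry     = np.linalg.matrix_rank(A - SA) < P
    antisymmetry = np.linalg.matrix_rank(A + SA) < P
    inverse      = np.linalg.matrix_rank(A) == np.linalg.matrix_rank(np.hstack([A, SA]))
    return symmetry, antisymmetry, inverse

W = np.zeros((2, 2, 2))                      # ComplEx core tensor (0-based indices)
W[0, 0, 0] = W[1, 0, 1] = W[0, 1, 1] = 1.0
W[1, 1, 0] = -1.0
print(rule_checks(W))                        # (True, True, True)
```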
For example, for Compl Ex, we have that P = 2 and 1 0 0 1 0 1 1 0 , SW T (2) = 1 0 0 1 0 1 1 0 , [W(2), SW T (2)] = 1 0 1 0 0 1 0 1 0 1 0 1 1 0 1 0 rank(W T (2) SW T (2)) = rank( 0 0 0 2 0 2 0 0 ) = 1 < P = 2 rank(W T (2) + SW T (2)) = rank( 2 0 0 0 0 0 2 0 ) = 1 < P = 2 rank(W T (2)) = rank([W(2), SW T (2)]) = 2 Thus, Compl Ex is able to learn the symmetry rules, antisymmetry rules and inverse rules. To prove the Proposition 1 and Proposition 2, we first prove the following lemma. Z α = min Z=UV T 1 2(λ U 2α F + 1 where α > 0, λ > 0 and {Z, U, V } are real matrices. If Z = ˆUΣ ˆV T is a singular value decomposition of Z, then equality holds for the choice U = λ 1 2α ˆU Σ and V = λ 1 2α ˆV Σ is the element-wise square root of Σ. Proof. The proof of Lemma 1 is based on the proof of Lemma 9 in [Ciliberto et al., 2017]. Let the singular value decomposition of Z Rm n be Z = ˆUΣ ˆV T , where ˆU Rm r, Σ Rr r, ˆV Rn r, r = rank(Z), ˆU T ˆU = Ir r and ˆV T ˆV = Ir r. We choose any U, V such that Z = UV T , then we have Σ = ˆU T UV T ˆV . Moreover, since ˆU, ˆV have orthogonal columns, ˆU T U F U F , ˆV T V F V F . Then Z α = Tr(Σ)α = Tr( ˆU T UV T ˆV )α ˆU T U α F V ˆV T α F λ U α F )( 1 λ V α F ) 1 2(λ U 2α F + 1 where the first upper bound is Cauchy-Schwarz inequality and the third upper bound is AM-GM inequality. Let U = λ 1 2α ˆU Σ and V = λ 1 2α ˆV Σ, we have that 1 2(λ U 2α F + 1 λ V 2α F ) = 1 Σ 2α F + ˆV Σ 2α F ) = 1 Σ 2α F ) = Z α In summary, Z α = min z=UV T 1 2(λ U 2α F + 1 Proposition 1. For any X, and for any decomposition of X, X = PD/P d=1 W 1H:d: 2R:d: 3T:d:, we have λ1λ4L(X; α) d=1 λ1( H:d: α F + R:d: α F + T:d: α F ) + λ4( W 2 R:d: 3 T:d: α F + W 3 T:d: 1 H:d: α F + W 1 H:d: 2 R:d: α F ) (7) If X(1) = U1Σ1V T 1 , X(2) = U2Σ2V T 2 , X(3) = U3Σ3V T 3 are compact singular value decompositions of X(1), X(2), X(3), respectively, then there exists a decomposition of X, X = PD/P d=1 W 1 H:d: 2 R:d: 3 T:d:, such that the two sides of Eq.(5) equal, where P = D, α U1 Σ1, R:1: = p α U2 Σ2, T:1: = p α U3 Σ3, W = p Σ 1 1 U T 1 2 q Σ 1 2 U T 2 3 q Σ 1 3 U T 3 . Proof. Let the n-rank [Kolda and Bader, 2009] of X Rn1 n2 n3 be (r1, r2, r3), then U1 Rn1 r1, U2 Rn2 r2, U3 Rn3 r3, Σ1 Rr1 r1, Σ2 Rr2 r2, Σ3 Rr3 r3, W Rr1 r2 r3. If X = 0, the above proposition is obviously true, we define 0 1 := 0 here. For any X = 0, since X(1) = PD/P d=1 H:d:(W(1)(T:d: R:d:)T ), X(2) = PD/P d=1 R:d:(W(2)(T:d: H:d:)T ), X(3) = PD/P d=1 T:d:(W(3)(R:d: H:d:)T ), by applying Lemma 1 to X(1), X(2), X(3), we d=1 H:d:(W(1)(T:d: R:d:)T ) α/2 + R:d:(W(2)(T:d: H:d:)T ) α/2 + T:d:(W(3)(R:d: H:d:)T ) α/2 d=1 λ( H:d: α F + R:d: α F + T:d: α F ) λ( W(1)(T:d: R:d:)T α F + W(2)(T:d: H:d:)T α F + W(3)(R:d: H:d:)T α F ) d=1 λ( H:d: α F + R:d: α F + T:d: α F ) λ( W 2 R:d: 3 T:d: α F + W 3 T:d: 1 H:d: α F + W 1 H:d: 2 R:d: α F ) λ1/λ4, we have that λ1λ4L(X; α) d=1 λ1( H:d: α F + R:d: α F + T:d: α F ) + λ4( W 2 R:d: 3 T:d: α F + W 3 T:d: 1 H:d: α F + W 1 H:d: 2 R:d: α F ) Since U T 1 U1 = Ir1 r1, U T 2 U2 = Ir2 r2, U T 3 U3 = Ir3 r3, thus we have X(1) = U1U T 1 X(1), X(2) = U2U T 2 X(2), X(3) = U3U T 3 X(3), then X =X 1 (U1U T 1 ) 2 (U2U T 2 ) 3 (U3U T 3 ) Σ 1 1 U T 1 ) 2 (U2 p Σ 1 2 U T 2 ) 3 (U3 p Σ 1 3 U T 3 ) Σ 1 1 U T 1 2 q Σ 1 2 U T 2 3 q Σ 1 3 U T 3 1 U1 p Let P = D and H:1: = p α U1 Σ1, R:1: = p α U2 Σ2, T:1: = p α U3 Σ3, W = p Σ 1 1 U T 1 2 q Σ 1 2 U T 2 3 q Σ 1 3 U T 3 , then X = W 1 H:1: 2 R:1: 3 T:1: is a decomposition of X. 
Since Σ1W(1)(T:1: R:1:)T = U1 p Σ2W(2)(T:1: H:1:)T = U2 p Σ3W(3)(R:1: H:1:)T = U3 p W(1)(T:1: R:1:)T = p W(2)(T:1: H:1:)T = p W(3)(R:1: H:1:)T = p If P = D, we have that d=1 λ1( H:d: α F + R:d: α F + T:d: α F ) +λ4( W(1)(T:d: R:d:)T α F + W(2)(T:d: H:d:)T α F + W(3)(R:d: H:d:)T α F ) =λ1( H:1: α F + R:1: α F + T:1: α F ) +λ4( W 2 R:1: 3 T:1: α F + W 3 T:1: 1 H:1: α F + W 1 H:1: 2 R:1: α F )) =λ1( H:1: α F + R:1: α F + T:1: α F ) +λ4( W(1)(T:1: R:1:)T α F + W(2)(T:1: H:1:)T α F + W(3)(R:1: H:1:)T α F ) Σ1 α F + U2 p Σ2 α F + U3 p Σ1V T 1 α F + p Σ2V T 2 α F + p Σ3V T 3 α F ) Σ3 α F ) + p λ1λ4( X(1) α + X(2) α + X(3) α + X(1) α + X(2) α + X(3) α ) Proposition 2. For any X, and for any decomposition of X, X, X = PD/P d=1 W 1 H:d: 2 R:d: 3 T:d:, we have d=1 λ2( T:d: α F R:d: α F + T:d: α F H:d: α F + R:d: α F H:d: α F ) + λ3( W 1 H:d: α F + W 2 R:d: α F + W 3 T:d: α F ) (8) And there exists some X , and for any decomposition of X , such that the two sides of Eq.(6) can not achieve equality. Proof. Since X(1) = PD/P d=1 (H:d:W(1))(T:d: R:d:)T , X(2) = PD/P d=1 (R:d:W(2))(T:d: H:d:)T , X(3) = PD/P d=1 (T:d:W(3))(R:d: H:d:)T and for any matrix A, B, A B α F = A α F B α F , by applying Lemma 1 to X(1), X(2), X(3), we have that d=1 (H:d:W(1))(T:d: R:d:)T α/2 + (R:d:W(2))(T:d: H:d:)T α/2 + (T:d:W(3))(R:d: H:d:)T α/2 d=1 λ( T:d: R:d: α F + T:d: H:d: α F + R:d: H:d: α F ) λ( H:d:W(1) α F + R:d:W(2) α F + T:d:W(3) α F ) d=1 λ( T:d: α F R:d: α F + T:d: α F H:d: α F + R:d: α F H:d: α F ) + λ( W 1 H:d: α F + W 2 R:d: α F + W 3 T:d: α F ) λ2/λ3, we have that d=1 λ2( T:d: α F R:d: α F + T:d: α F H:d: α F + R:d: α F H:d: α F ) + λ3( W 1 H:d: α F + W 2 R:d: α F + W 3 T:d: α F ) (9) Let X R2 2 2, X 1,1,1 = X 2,2,2 = 1 and X ijk = 0 otherwise. Then X (1) = X (2) = X (3) = 1 0 0 0 0 0 0 1 Thus rank(X (1)) = rank(X (2)) = rank(X (3)) = 2. In Lemma 1, a necessary condition of the equality holds is that rank(U) = rank(V ) = rank(Z). Thus, the equality holds only if P = D and rank(X (1)) = rank(T:1: R:1:) = 2, rank(X (2)) = rank(T:1: H:1:) = 2 rank(X (3)) = rank(R:1: H:1:) = 2 Since for any matrix A, B, rank(A B) = rank(A) rank(B), we have that rank(T:1:) rank(R:1:) = 2, rank(T:1:) rank(H:1:) = 2, rank(R:1:) rank(H:1:) = 2 The above equations have no non-negative integer solution, thus there is no decomposition of X d=1 λ2( T:d: α F R:d: α F + T:d: α F H:d: α F + R:d: α F H:d: α F ) + λ3( W 1 H:d: α F + W 2 R:d: α F + W 3 T:d: α F ) Proposition 3. For any X, and for any decomposition of X, X = PD d=1 W 1H:d1 2R:d1 3T:d1, we have λ1λ4L(X; 2) 6 p d=1 λ1( H:d1 2 F + R:d1 2 F + T:d1 2 F ) + λ4( W 2 R:d1 3 T:d1 2 F + W 3 T:d1 1 H:d1 2 F + W 1 H:d1 2 R:d1 2 F ) where W = 1. Proof. X, let S1 = {(W , H, R, T )|X = PD d=1 W 1 H:d: 2 R:d: 3 T:d:}, S2 = {(W , H, R, T )|X = PD d=1 W 1 H:d1 2 R:d1 3 T:d1, W = 1}. We have that d=1 H:d1 F R:d1 F T:d1 F ) d=1 λ4 R:d1 2 F T:d1 2 F )1/2( d=1 λ1 H:d1 2 F )1/2 d=1 λ1 H:d1 2 F + λ4 W 2 R:d: 3 T:d: 2 F where W = 1. The equality holds if and only if λ4 R:d1 2 F T:d1 2 F = λ1 H:d1 2 F , i.e., p λ4/λ1 R:d1 2 T:d1 2 = H:d1 2. For a CP decomposition of X, X = PD d=1 W 1 H:d1 2 R:d1 3 T:d1, we let H :d1 = q R:d1 2 T:d1 2 H:d1 2 H:i, R :i = q H:d1 2 R:d1 2 T:d1 2 R:i, T :i = T:i if H:d1 2 = 0, R:d1 2 = 0, T:d1 2 = 0 and otherwise H :d = 0, R :d = 0, T :d = 0. Then X = PD d=1 W 1 H :d1 2 R :d1 3 T :d1 is another CP decomposition of X and R :d1 2 T :d1 2 = H :d1 2. 
Thus λ1λ4 X = min S2 d=1 λ1 H:d1 2 F + λ4 W 2 R:d: 3 T:d: 2 F Similarly, we can get λ1λ4 X = min S2 d=1 λ1 R:d1 2 F + λ4 W 2 T:d: 3 H:d: 2 F λ1λ4 X = min S2 d=1 λ1 T:d1 2 F + λ4 W 2 H:d: 3 R:d: 2 F By Propostion 1, we have that λ1λ4 X(1) = d=1 λ1 H:d1 2 F + λ4 W 2 R:d: 3 T:d: 2 F Since S2 is a subset of S1, we have that Similarly, we can prove that X(2) X and X(3) X , thus λ1λ4L(X; 2) 6 p d=1 λ1( H:d1 2 F + R:d1 2 F + T:d1 2 F ) + λ4( W 2 R:d1 3 T:d1 2 F + W 3 T:d1 1 H:d1 2 F + W 1 H:d1 2 R:d1 2 F ) Proposition 4. For any X, there exists a decomposition of X = PD/P d=1 ˆ W 1 ˆ H:d: 2 ˆR:d: 3 ˆT:d:, such that d=1 λ1( ˆ H:d: α F + ˆR:d: α F + ˆT:d: α F ) +λ4( ˆ W 2 ˆR:d: 3 ˆT:d: α F + ˆ W 3 ˆT:d: 1 ˆ H:d: α F + ˆ W 1 ˆ H:d: 2 ˆR:d: α F ) d=1 λ2( ˆT:d: α F ˆR:d: α F + ˆT:d: α F ˆ H:d: α F + ˆR:d: α F ˆ H:d: α F ) +λ3( ˆ W 1 ˆ H:d: α F + ˆ W 2 ˆR:d: α F + ˆ W 3 ˆT:d: α F ) Furthermore, for some X, there exists a decomposition of X, X = PD/P d=1 ˆ W 1 ˆ H:d: 2 ˆR:d: 3 ˆT:d:, such that d=1 λ1( H:d: α F + R:d: α F + T:d: α F ) +λ4( W 2 R:d: 3 T:d: α F + W 3 T:d: 1 H:d: α F + W 1 H:d: 2 R:d: α F ) d=1 λ2( T:d: α F R:d: α F + T:d: α F H:d: α F + R:d: α F H:d: α F ) +λ3( W 1 H:d: α F + W 2 R:d: α F + W 3 T:d: α F ) Proof. To simplify the notations, we denote ˆ H:d: as ˆ H, ˆR:d: as ˆR and ˆT:d: as ˆT . To prove the inequality, we only need to prove p λ1/λ4( ˆ H α F + ˆR α F + ˆT α F ) λ4/λ1( ˆ W 2 ˆR 3 ˆT α F + ˆ W 3 ˆT 1 ˆ H α F + ˆ W 1 ˆ H 2 ˆR α F ) λ2/λ3( ˆT α F ˆR α F + ˆT α F ˆ H α F + ˆR α F ˆ H α F ) λ3/λ2( ˆ W 1 ˆ H α F + ˆ W 2 ˆR α F + ˆ W 3 ˆT α F ) Let c1 = λ1λ3/ λ1λ3, for any X, X = PD/P d=1 (c 3 1 ˆ W ) 1 (c1 ˆ H:d:) 2 (c1 ˆR:d:) 3 (c1 ˆT:d:) is a decomposition of X, thus we only need to prove c2( ˆ H α F + ˆR α F + ˆT α F ) + 1 c2 ( ˆ W 2 ˆR 3 ˆT α F + ˆ W 3 ˆT 1 ˆ H α F + ˆ W 1 ˆ H 2 ˆR α F ) >c2( ˆT α F ˆR α F + ˆT α F ˆ H α F + ˆR α F ˆ H α F ) + 1 c2 ( ˆ W 1 ˆ H α F + ˆ W 2 ˆR α F + ˆ W 3 ˆT α F ) where c2 = p If X R1 1 1, let c = X2 + 100, ˆ H = ˆR = ˆT = c, ˆ W = X c3 , we have c2( ˆ H α F + ˆR α F + ˆT α F ) + 1 c2 ( ˆ W 2 ˆR 3 ˆT α F + ˆ W 3 ˆT 1 ˆ H α F + ˆ W 1 ˆ H 2 ˆR α F ) =c2(X2 + 100)2 + 1 (X2 + 100)2 1, let ˆ W be a diagonal tensor, i.e., ˆ Wi,j,k = 1 if i = j = k and otherwise 0 and let ˆ W Rn1n2n3 n1n2n3, ˆ H Rn1 n1n2n3, ˆR Rn2 n1n2n3, ˆT Rn3 n1n2n3, ˆ Hi,m = Xijk, ˆRj,m = 1, ˆTk,m = 1, m = (i 1)n2n3 + (j 1)n3 + k and otherwise 0. 
An example of X R2 2 2 is as follows: ˆ H = X1,1,1 X1,1,2 X1,2,1 X1,2,2 0 0 0 0 0 0 0 0 X2,1,1 X2,1,2 X2,2,1 X2,2,2 ˆR = 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 ˆT = 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 Thus X = ˆ W 1 ˆ H 2 ˆR 3 ˆT is a decomposition of X, for X = (c 3 1 ˆ W ) 1 (c1 ˆ H) 2 (c1 ˆR) 3 (c1 ˆT ) we have c2( ˆ H α F + ˆR α F + ˆT α F ) + 1 c2 ( ˆ W 2 ˆR 3 ˆT α F + ˆ W 3 ˆT 1 ˆ H α F + ˆ W 1 ˆ H 2 ˆR α F ) =c2( ˆ H α F + ˆR α F + ˆT α F ) + 1 c2 ( ˆ W(1)( ˆT ˆR)T α F + ˆ W(2)( ˆT ˆ H)T α F + ˆ W(3)( ˆR ˆ H)T α F ) c2 ( ( ˆT ˆR)T α F + ( ˆT ˆ H)T α F + ( ˆR ˆ H)T α F ) + c2( ˆ H α F + ˆR α F + ˆT α F ) =c2( ˆT α F ˆR α F + ˆT α F ˆ H α F + ˆR α F ˆ H α F ) + 1 c2 ( ˆ H ˆ W(1) α F + ˆR ˆ W(2) α F + ˆT ˆ W(3) α F ) =c2( ˆT α F ˆR α F + ˆT α F ˆ H α F + ˆR α F ˆ H α F ) + 1 c2 ( ˆ W 1 ˆ H α F + ˆ W 2 ˆR α F + ˆ W 3 ˆT α F ) For any X Rn1 n2 n3, let W = X/2 2In1 n1, R = 2In2 n2, T = 2In3 n3, thus c2( H α F + R α F + T α F ) + 1 c2 ( W 2 R 3 T α F + W 3 T 1 H α F + W 1 H 2 R α F ) =c2(2n1 + 2n2 + 2n3) + 1 c2 ( X(1) 2 F /2 + X(2) 2 F /2 + X(3) 2 F /2) c2( T α F R α F + T α F H α F + R α F H α F ) + 1 c2 ( W 1 H α F + W 2 R α F + W 3 T α F ) =c2(4n3n2 + 4n3n1 + 4n2n1) + 1 c2 ( X(1) 2 F /4 + X(2) 2 F /4 + X(3) 2 F /4) Thus if X(1) 2 F > 16c2 2(n3n2 + n3n1 + n2n1), we have c2( H α F + R α F + T α F ) + 1 c2 ( W 2 R 3 T α F + W 3 T 1 H α F + W 1 H 2 R α F ) >c2( T α F R α F + T α F H α F + R α F H α F ) + 1 c2 ( W 1 H α F + W 2 R α F + W 3 T α F ) end the proof. C Experimental details Datasets The statistics of the datasets, WN18RR, FB15k-237, YAGO3-10 and Kinship, are shown in Table 4. Table 4: The statistics of the datasets. Dataset #entity #relation #train #valid #test WN18RR 40,943 11 86,835 3,034 3,134 FB15k-237 14,541 237 272,115 17,535 20,466 YGAO3-10 123,188 37 1,079,040 5,000 5,000 Kinship 104 25 8,548 1,069 1,069 Table 5: The settings for total embedding dimension D and number of parts P. WN18RR FB15k-237 YAGO3-10 D P D P D P CP 2000 1 2000 1 1000 1 Compl Ex 4000 2 4000 2 2000 2 Simpl E 4000 2 4000 2 2000 2 ANALOGY 4000 4 4000 4 2000 4 Quat E 4000 4 4000 4 2000 4 Tuck ER 256 1 256 1 256 1 Table 6: The settings for power α and regularization coefficients λi. WN18RR FB15k-237 YAGO3-10 α λ1 λ2 α λ1 λ2 α λ1 λ2 CP 3.0 05 0.07 2.25 0.005 0.07 2.25 0.0007 0.005 Compl Ex 3.0 0.05 0.03 2.25 0.005 0.07 2.25 0.0005 0.007 Simpl E 3.0 0.03 0.07 2.25 0.005 0.07 2.25 0.0007 0.007 ANALOGY 3.0 0.01 0.07 2.25 0.005 0.07 2.25 0.0007 0.007 Quat E 3.0 0.07 0.03 2.25 0.005 0.05 2.25 0.0005 0.005 Tuck ER 2.0 0.01 0.03 2.0 0.001 0.01 2.0 0.0003 0.003 Evaluation Metrics MR= 1 N PN i=1 ranki, where ranki is the rank of ith triplet in the test set and N is the number of the triplets. Lower MR indicates better performance. N PN i=1 1 ranki . Higher MRR indicates better performance. Hits@N = 1 N PN i=1 I(ranki N), where I( ) is the indicator function. Hits@N is the ratio of the ranks that no more than N, Higher Hits@N indicates better performance. Hyper-parameters We use a heuristic approach to choose the hyper-parameters and reduce the computation cost with the help of Hyperopt, a hyper-parameter optimization framework based on TPE [Bergstra et al., 2011]. For the hyper-parameter α, we search for the best α in {2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5} with fixed hyper-parameters λi = 0.0001. 
For the hyper-parameter λi, we set λ1 = λ3 and λ2 = λ4 for all models to reduce the number of hyper-parameters, because we notice that for CP and ComplEx the first row of Eq.(4), $\|H_{id:}\|_F^\alpha + \|R_{jd:}\|_F^\alpha + \|T_{kd:}\|_F^\alpha$, is equal to the third row of Eq.(4), $\|W \times_1 H_{id:}\|_F^\alpha + \|W \times_2 R_{jd:}\|_F^\alpha + \|W \times_3 T_{kd:}\|_F^\alpha$, and the second row of Eq.(4), $\|T_{kd:}\|_F^\alpha \|R_{jd:}\|_F^\alpha + \|T_{kd:}\|_F^\alpha \|H_{id:}\|_F^\alpha + \|R_{jd:}\|_F^\alpha \|H_{id:}\|_F^\alpha$, is equal to the fourth row of Eq.(4), $\|W \times_2 R_{jd:} \times_3 T_{kd:}\|_F^\alpha + \|W \times_3 T_{kd:} \times_1 H_{id:}\|_F^\alpha + \|W \times_1 H_{id:} \times_2 R_{jd:}\|_F^\alpha$. We then search the regularization coefficients λ1 and λ2 in {0.001, 0.003, 0.005, 0.007, 0.01, 0.03, 0.05, 0.07} for the WN18RR and FB15k-237 datasets, and in {0.0001, 0.0003, 0.0005, 0.0007, 0.001, 0.003, 0.005, 0.007} for the YAGO3-10 dataset. A full grid search would thus need 64 runs per model to find the best λi. To further reduce the number of runs, we use Hyperopt to tune the hyper-parameters; in our experiments, we only need 20 runs to find the best λi. We use Adagrad [Duchi et al., 2011] with learning rate 0.1 as the optimizer. We set the batch size to 100 for the WN18RR and FB15k-237 datasets and 1000 for the YAGO3-10 dataset. We train the models for 200 epochs. The settings for the total embedding dimension D and the number of parts P are shown in Table 5. The settings for the power α and the regularization coefficients λi are shown in Table 6.

Random Initialization. We run each model three times with different random seeds and report the mean results. We do not report error bars because our model has very small errors with respect to random initialization. For example, the standard deviations of MRR, H@1 and H@10 of the CP model with our regularization are 0.00037784, 0.00084755 and 0.00058739 on the WN18RR dataset, respectively. This indicates that our model is not sensitive to random initialization.

The hyper-parameter α. We analyze the impact of the power of the Frobenius norm, α. We run experiments on the WN18RR dataset with the ComplEx model and set α to {2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5}. See Table 7 for the results. The performance generally first increases and then decreases as α increases; the best α for the WN18RR dataset is 3.0. Therefore, an appropriately chosen α yields better performance.

Table 7: The results on the WN18RR dataset with different α.

| α | 2.0 | 2.25 | 2.5 | 2.75 | 3.0 | 3.25 | 3.5 |
|---|-----|------|-----|------|-----|------|-----|
| MRR | 0.483 | 0.486 | 0.485 | 0.486 | 0.494 | 0.486 | 0.487 |
| H@1 | 0.443 | 0.445 | 0.441 | 0.442 | 0.449 | 0.443 | 0.444 |
| H@10 | 0.556 | 0.564 | 0.573 | 0.572 | 0.581 | 0.572 | 0.570 |

The hyper-parameter λi. We analyze the impact of the regularization coefficients λi. We run experiments on the WN18RR dataset with the ComplEx model and set λi to {0.001, 0.003, 0.005, 0.007, 0.01, 0.03, 0.05, 0.07, 0.1}. See Table 8 and Table 9 for the results. The model performance first increases and then decreases as λi increases, without any oscillation. Thus, we can choose suitable regularization coefficients to prevent overfitting while maintaining the expressiveness of TDB models as much as possible.
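A sketch of the Hyperopt/TPE search over the regularization coefficients described above; `validation_mrr` is a hypothetical placeholder standing in for training a model with the sampled coefficients and returning its validation MRR, and is not a function from the released code.

```python
from hyperopt import fmin, hp, tpe

# Search grid used for WN18RR / FB15k-237 (Appendix C).
grid = [0.001, 0.003, 0.005, 0.007, 0.01, 0.03, 0.05, 0.07]
space = {
    "lambda1": hp.choice("lambda1", grid),  # also reused as lambda3
    "lambda2": hp.choice("lambda2", grid),  # also reused as lambda4
}

def objective(params):
    # Placeholder: train with (lambda1, lambda2, lambda3=lambda1, lambda4=lambda2)
    # and return the negative validation MRR so that fmin minimizes it.
    return -validation_mrr(lam1=params["lambda1"], lam2=params["lambda2"])

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
print(best)
```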
Random Initialization We run each model three times with different random seeds and report the mean results. We do not report error bars because the results vary very little with the random initialization. For example, the standard deviations of MRR, H@1 and H@10 of the CP model with our regularization on the WN18RR dataset are 0.00037784, 0.00084755 and 0.00058739, respectively. This indicates that our method is not sensitive to the random initialization.

The hyper-parameter α We analyze the impact of the hyper-parameter α, the power of the Frobenius norm. We run experiments on the WN18RR dataset with the ComplEx model and set α to {2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5}. See Table 7 for the results. The performance first increases and then decreases as α grows, and the best α for the WN18RR dataset is 3.0. Therefore, α should be set appropriately to obtain the best performance.

Table 7: The results on the WN18RR dataset with different α.
α      2.0    2.25   2.5    2.75   3.0    3.25   3.5
MRR    0.483  0.486  0.485  0.486  0.494  0.486  0.487
H@1    0.443  0.445  0.441  0.442  0.449  0.443  0.444
H@10   0.556  0.564  0.573  0.572  0.581  0.572  0.570

The hyper-parameter λi We analyze the impact of the regularization coefficients λi. We run experiments on the WN18RR dataset with the ComplEx model and set λi to {0.001, 0.003, 0.005, 0.007, 0.01, 0.03, 0.05, 0.07, 0.1}. See Table 8 and Table 9 for the results. The performance first increases and then decreases as λi grows, without any oscillation. Thus, we can choose suitable regularization coefficients to prevent overfitting while maintaining the expressiveness of TDB models as much as possible.

Table 8: The performance of ComplEx on the WN18RR dataset with different λ1.
λ1     0.001  0.003  0.005  0.007  0.01   0.03   0.05   0.07   0.1
MRR    0.473  0.475  0.476  0.476  0.480  0.487  0.491  0.486  0.467
H@1    0.435  0.436  0.436  0.436  0.439  0.442  0.445  0.436  0.411
H@10   0.545  0.555  0.555  0.557  0.562  0.576  0.583  0.585  0.570

Table 9: The performance of ComplEx on the WN18RR dataset with different λ2.
λ2     0.001  0.003  0.005  0.007  0.01   0.03   0.05   0.07   0.1
MRR    0.464  0.464  0.464  0.465  0.465  0.467  0.471  0.473  0.472
H@1    0.429  0.430  0.430  0.430  0.432  0.432  0.435  0.437  0.435
H@10   0.528  0.528  0.529  0.529  0.530  0.536  0.540  0.540  0.540

The Number of Parts P In Section 3.1, we show that the number of parts P affects both the expressiveness and the computation. Thus, we study the impact of P on the model performance. We evaluate the model on the WN18RR dataset, set the total embedding dimension D to 256, and set P to {1, 2, 4, 8, 16, 32, 64, 128, 256}. See Table 10 for the results. The reported time is the running time on the test set on an AMD Ryzen 7 4800U CPU. Both the model performance and the running time generally increase as P increases. Thus, the larger P is, the more expressive the model and the higher the computation cost.

Table 10: The results on the WN18RR dataset with different P.
P      1       2       4       8       16      32      64      128     256
MRR    0.437   0.449   0.455   0.455   0.463   0.466   0.485   0.497   0.501
H@1    0.405   0.413   0.413   0.415   0.425   0.428   0.446   0.456   0.460
H@10   0.499   0.520   0.536   0.533   0.536   0.539   0.565   0.574   0.579
Time   1.971s  2.096s  2.111s  2.145s  2.301s  2.704s  3.520s  6.470s  16.336s

A Pseudocode for IVR

We present the pseudocode of our method in Alg. (1).

Algorithm 1 A pseudocode for IVR
Input: core tensor W, embeddings (H, R, T), triplet (i, j, k), regularization coefficients λl (l = 1, 2, 3, 4), number of parts P, power coefficient α.
Output: the model output of Eq. (1), X_{ijk}, and the IVR regularization of Eq. (6), reg(X_{ijk}).
1: Initialization: reg := 0, x_{1d} := H_{id:}, x_{2d} := R_{jd:}, x_{3d} := T_{kd:}
2: reg := reg + \sum_{d=1}^{D/P} λ1(\|x_{1d}\|_F^\alpha + \|x_{2d}\|_F^\alpha + \|x_{3d}\|_F^\alpha)
3: reg := reg + \sum_{d=1}^{D/P} λ2(\|x_{1d}\|_F^\alpha\|x_{2d}\|_F^\alpha + \|x_{1d}\|_F^\alpha\|x_{3d}\|_F^\alpha + \|x_{2d}\|_F^\alpha\|x_{3d}\|_F^\alpha)
4: x_{1d} := W \times_1 H_{id:}, x_{2d} := W \times_2 R_{jd:}, x_{3d} := W \times_3 T_{kd:}
5: reg := reg + \sum_{d=1}^{D/P} λ3(\|x_{1d}\|_F^\alpha + \|x_{2d}\|_F^\alpha + \|x_{3d}\|_F^\alpha)
6: x_{1d} := x_{1d} \times_2 R_{jd:}, x_{2d} := x_{2d} \times_3 T_{kd:}, x_{3d} := x_{3d} \times_1 H_{id:}
7: reg := reg + \sum_{d=1}^{D/P} λ4(\|x_{1d}\|_F^\alpha + \|x_{2d}\|_F^\alpha + \|x_{3d}\|_F^\alpha)
8: x := \sum_{d=1}^{D/P} x_{1d} \times_3 T_{kd:}
9: return x, reg
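For readers who prefer running code over pseudocode, the following is a minimal NumPy sketch of Alg. (1), written under the assumption (suggested by the algorithm) that each D-dimensional embedding is split into D/P parts of size P and scored against a P x P x P core tensor W; the function name and argument layout are our own illustrative choices, not the interface of the released code.

import numpy as np

def ivr_score_and_reg(W, h, r, t, lams, alpha, P):
    # W: (P, P, P) core tensor; h, r, t: D-dimensional head, relation and
    # tail embeddings; lams: (lambda1, lambda2, lambda3, lambda4).
    parts = h.shape[0] // P                     # D / P parts of size P
    H, R, T = h.reshape(parts, P), r.reshape(parts, P), t.reshape(parts, P)
    fro = lambda x: np.linalg.norm(x) ** alpha  # Frobenius norm to the power alpha
    score, reg = 0.0, 0.0
    for d in range(parts):
        hd, rd, td = H[d], R[d], T[d]
        # rows 1 and 2 of the regularizer: part norms and their pairwise products
        reg += lams[0] * (fro(hd) + fro(rd) + fro(td))
        reg += lams[1] * (fro(hd) * fro(rd) + fro(hd) * fro(td) + fro(rd) * fro(td))
        # row 3: intermediate variables after one mode product with W
        x1 = np.einsum('pqr,p->qr', W, hd)      # W x_1 H_{id:}
        x2 = np.einsum('pqr,q->pr', W, rd)      # W x_2 R_{jd:}
        x3 = np.einsum('pqr,r->pq', W, td)      # W x_3 T_{kd:}
        reg += lams[2] * (fro(x1) + fro(x2) + fro(x3))
        # row 4: intermediate variables after two mode products
        y1 = np.einsum('qr,q->r', x1, rd)       # W x_1 H_{id:} x_2 R_{jd:}
        y2 = np.einsum('pr,r->p', x2, td)       # W x_2 R_{jd:} x_3 T_{kd:}
        y3 = np.einsum('pq,p->q', x3, hd)       # W x_3 T_{kd:} x_1 H_{id:}
        reg += lams[3] * (fro(y1) + fro(y2) + fro(y3))
        score += y1 @ td                        # contract the remaining mode with T_{kd:}
    return score, reg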
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our claims reflect well the effect of our proposed regularization on knowledge graph completion.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We briefly discuss the limitations in the Conclusion section.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations but they are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting, or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We have provided the detailed assumptions and proofs in Appendix B.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in the appendix or supplemental material. Theorems and lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We have provided the experimental details in Appendix C.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We have provided the code in the supplemental material.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
Providing as much information as possible in the supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyper-parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes] (see Appendix C, paragraph "Random Initialization")
Justification: We have provided the experimental details in Appendix C.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in the appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We have provided the information about the statistical significance of the experiments in Appendix C, paragraph "Random Initialization".
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We have provided the related compute resources in Appendix C.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Our research complies in all respects with the NeurIPS Code of Ethics.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: There is no societal impact of the work performed.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our paper poses no such risks.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: Our paper does not use existing assets.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: Our paper does not release new assets.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our paper does not involve crowdsourcing nor research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our paper does not involve crowdsourcing nor research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.