# Human Cognition-Inspired Hierarchical Fuzzy Learning Machine

Junbiao Cui 1, Qin Yue 1, Jianqing Liang 1, Jiye Liang 1

1 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China. Correspondence to: Jianqing Liang, Jiye Liang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Classification is a cornerstone of machine learning research. Most existing classifiers assume that the concepts corresponding to classes can be precisely defined. This notion diverges from the widely accepted understanding in cognitive science, which posits that real-world concepts are often inherently ambiguous. To bridge this gap, we propose a Human Cognition-Inspired Hierarchical Fuzzy Learning Machine (HC-HFLM), which leverages a novel hierarchical alignment loss to integrate rich class knowledge from the human knowledge system into the learning process. We further prove theoretically that minimizing this loss aligns the hierarchical structure derived from data with the one contained in the class knowledge, resulting in clear semantics and high interpretability. Systematic experiments verify that the proposed method achieves significant gains in interpretability and generalization performance.

1. Introduction

Classification stands as one of the most fundamental and extensively studied problems in machine learning. Over the years, a vast array of classifiers has been developed (Delgado et al., 2014), with their methodologies broadly categorized into two main approaches. The first approach relies directly on the discrete 0-1 loss, as seen in classifiers such as k-nearest neighbors (Cover & Hart, 1967), decision trees (Quinlan, 1986), and naive Bayes (Domingos & Pazzani, 1997). The second approach employs continuous functions as proxies for the 0-1 loss, including the hinge loss used in support vector machines (Cortes & Vapnik, 1995), the exponential loss adopted by AdaBoost (Freund & Schapire, 1997; Friedman et al., 2000), and the cross-entropy loss widely adopted in deep neural networks (LeCun et al., 2015). Despite their differences, these loss functions share a common underlying assumption: the concepts corresponding to classes can be precisely defined. This assumption inherently implies that class relations are crisp, where any two classes are either identical or distinct, leaving no room for ambiguity.

In the classification problem, each class corresponds to a concept. Unlike existing classifiers, humans solve the classification problem mainly through concept cognition (Murphy, 2004). For centuries, the classical theory of concept representation prevailed, positing that all concepts can be precisely defined. In 1953, however, the philosopher Ludwig Wittgenstein challenged this proposition by exploring the concept of "game", questioning whether a precise definition of such a concept truly exists (Wittgenstein, 1953). This seminal inquiry sparked a paradigm shift, leading the philosophy community to widely accept that real-world concepts are often inherently ambiguous and resist precise definition. Building on this consensus, modern cognitive science has developed several influential theories of concept representation.
Prototype theory (Rosch, 1975) models a concept using an idealized prototype that encapsulates its most representative features. Exemplar theory (Medin & Schaffer, 1978) represents a concept through a set of typical and representative instances. Knowledge theory (Murphy & Medin, 1985) posits that concepts are deeply embedded within the human knowledge system, asserting that human cognition of a concept is inseparable from the contextual and relational structure of this system. As the most sophisticated framework, knowledge theory not only aligns closely with human cognitive processes but also provides a compelling foundation for advancing machine learning models that aim to mimic human-like cognition.

Recently, we proposed a fuzzy learning machine (FLM) inspired by concept cognition (Cui & Liang, 2022). FLM leverages fuzzy set theory (Zadeh, 1965) to capture the inherent fuzziness of concepts (McCloskey & Glucksberg, 1978; Marti et al., 2023) and employs exemplar theory to capture typicality effects (Smith et al., 1974), achieving good interpretability, robustness, and generalization. However, FLM still faces limitations in understanding concepts. It employs exemplar theory for concept representation, an approach well suited to purely data-driven machine learning; unlike knowledge theory (Murphy & Medin, 1985), however, it separates concept representation from the human knowledge system, limiting its ability to form a profound understanding of concepts. Recent research in cognitive science further highlights that many concepts are understood through their relations with other concepts (Piantadosi et al., 2024). Meanwhile, the community has accumulated a wealth of digitized knowledge bases, including conventional knowledge graphs, e.g., WordNet (Miller, 1995) and ConceptNet (Speer et al., 2017), and fashionable pre-trained language models, e.g., Word2Vector (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and GPT (OpenAI, 2023). These resources make it feasible to develop knowledge-driven methods for concept cognition. Accordingly, this paper explores the relations between concepts contained in the human knowledge system and then leverages these relations to guide the learning of concepts from data, enhancing the model's ability to understand and represent concepts. The main contributions are as follows.

(1) We propose a unified modeling framework that represents various types of knowledge as a fuzzy similarity relation (FSR) on the class space, resulting in the knowledge-infused concept representation.

(2) We design a novel hierarchical alignment (HA) loss inspired by the principles of concept cognition, enabling the knowledge-infused concept representation to guide the learning process.

(3) We introduce the FSR-based quotient space theory and prove that minimizing the HA loss aligns the hierarchical structures derived from data and from knowledge in quotient space.

(4) Extensive experiments verify the effectiveness of the proposed method in improving interpretability and generalization performance.

2. Notions and Related Works

2.1. Problem Formalization

Task. Given an $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, $\mathcal{X} \subseteq \mathbb{R}^d$, $\mathcal{Y}$, and $f_c : \mathcal{X} \to \mathcal{Y}$ are the input space, the class space, and the unknown classification function, respectively. To solve the classification problem, we need to determine the unknown classification function $f_c$. For the sake of discussion, the classes in $\mathcal{Y}$ are numbered $1, 2, \ldots, |\mathcal{Y}|$.

Data. Let $\mathcal{D} = \mathcal{X} \times \mathcal{Y}$ be the data space.
In machine learning, experience is often given in the form of training data, denoted as $D = \{(x_i, y_i) \mid x_i \in \mathcal{X}, y_i = f_c(x_i) \in \mathcal{Y}\}_{i=1}^{n} \subseteq \mathcal{D}$. The goal is to use $D$ to obtain $\hat{f}_c : \mathcal{X} \to \mathcal{Y}$ as close as possible to the unknown $f_c$. $\forall k \in \mathcal{Y}$, let $X_k = \{x_i \mid (x_i, y_i) \in D, y_i = k\}$ be the set of training samples belonging to class $k$. Let $X = \{x_i \mid (x_i, y_i) \in D\}$ be the set of all training samples.

Class knowledge. Let $\mathbb{K}$ be the human knowledge system, which contains various types of knowledge. In practice, each class in $\mathcal{Y}$ corresponds to a concept, and the concept name corresponds to a word in natural language. Therefore, we can collect the class knowledge about $\mathcal{Y}$ by concept name, denoted as $K \subseteq \mathbb{K}$.

2.2. Preliminaries

The fuzzy similarity relation is the core notion of the proposed method. Specifically,

Definition 2.1. (Bandler & Kohout, 1988) Given a set $A$, the mapping $F : A \times A \to [0, 1]$ is called a fuzzy binary relation on $A$.

Definition 2.2. (Bandler & Kohout, 1988) Given a set $A$ and a fuzzy binary relation $F$ on $A$. If $F$ satisfies (1) reflexivity: $\forall a \in A$, $F((a, a)) = 1$, and (2) symmetry: $\forall a, b \in A$, $F((a, b)) = F((b, a))$, then $F$ is called a fuzzy similarity relation (FSR) on $A$.

More preliminaries, including the binary relation, the fuzzy binary relation, and the fuzzy quotient space, are given in Appendix A.

2.3. Fuzzy Learning Machine

In FLM (Cui & Liang, 2022), we achieve concept cognition according to exemplar theory. Specifically,

Learning. The training process of FLM is

$$\min_{\Theta} \sum_{i,j=1, i \neq j}^{n} L_{FP}\left(f((x_i, x_j); \Theta), y_i, y_j\right) + \gamma R(\Theta), \tag{1}$$

where $L_{FP}$ and $R$ are the fuzziness permissible loss and the regularization term, respectively, $\gamma > 0$ is a trade-off parameter, and $f((\cdot, \cdot); \Theta)$ is an FSR on $\mathcal{X}$ with learnable parameters $\Theta$. The fuzziness permissible (FP) loss is defined as

$$L_{FP}(s_{ij}, y_i, y_j) = \begin{cases} \max(s_{ij} - \alpha, 0), & y_i \neq y_j \\ \max(\beta - s_{ij}, 0), & y_i = y_j \end{cases}, \tag{2}$$

where $s_{ij} = f((x_i, x_j); \Theta)$ is the predicted similarity between $x_i$ and $x_j$, and $0 < \alpha < \beta < 1$ are hyper-parameters that control the degree of concept fuzziness.

Concept representation. Let $\Theta^*$ be the local optimal solution of formula (1). The FSR $f((\cdot, \cdot); \Theta^*)$ forms the basis of concept representation. According to exemplar theory, the set of exemplars of class $k$ is defined as

$$E_k = \left\{x \mid x \in X_k,\ \mu(x, k) \text{ is among the top-}n_k^{exe} \text{ largest values in } \{\mu(x_i, k) \mid x_i \in X_k\}\right\}, \tag{3}$$

where $\mu(x, k) = \frac{1}{|X_k|} \sum_{x_j \in X_k} f((x, x_j); \Theta^*)$ and $n_k^{exe}$ is a manually specified parameter.

Predicting. Given a test sample $x \in \mathcal{X}$, FLM predicts the class label of $x$ according to the following formula:

$$\hat{y} = \arg\max_{k \in \mathcal{Y}} \sum_{x_i \in E_k} f((x, x_i); \Theta^*). \tag{4}$$

To sum up, FLM captures the fuzziness of concepts well. However, it ignores the relations between concepts contained in the human knowledge system, which limits its concept cognition. This paper aims to alleviate this limitation.
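For concreteness, a minimal PyTorch sketch of the FP loss in formula (2) and the exemplar-based prediction rule in formula (4). The function names, the batched interface, and the values of alpha and beta are our assumptions for illustration, not the paper's implementation.

```python
import torch

def fp_loss(s_ij, same_class, alpha=0.3, beta=0.7):
    """Fuzziness permissible (FP) loss of formula (2).

    s_ij: predicted pairwise similarities in [0, 1], shape (m,).
    same_class: boolean tensor, True where y_i == y_j.
    alpha, beta: fuzziness hyper-parameters, 0 < alpha < beta < 1.
    """
    pull_up = torch.clamp(beta - s_ij, min=0.0)     # same class: push s_ij above beta
    push_down = torch.clamp(s_ij - alpha, min=0.0)  # different class: pull s_ij below alpha
    return torch.where(same_class, pull_up, push_down).sum()

def flm_predict(f, x, exemplars):
    """Exemplar-based prediction of formula (4).

    f: the learned pairwise similarity f((., .); Theta*).
    exemplars: dict mapping class k to a tensor of exemplar samples E_k.
    """
    scores = {k: sum(f(x, e) for e in E_k) for k, E_k in exemplars.items()}
    return max(scores, key=scores.get)
```

Selecting $E_k$ itself amounts to ranking the class-wise mean similarities $\mu(x, k)$ and keeping the top $n_k^{exe}$ samples, as formula (3) states.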
3. Proposed Method

The overall framework of the proposed HC-HFLM is shown in Figure 1, comprising the method design and the theory analysis. The method design is given below; the theory analysis is given in Section 4.

[Figure 1. The overall framework of HC-HFLM: the method design mines a class FSR on $\mathcal{Y}$ from the human knowledge system and learns a sample FSR from data; the theory analysis represents both as hierarchical structures (the D-, K-, and (D, K)-representations) in quotient space and aligns them.]

3.1. Method Design

We first introduce the following two definitions.

Definition 3.1. Given $\alpha \in (0, 1)$, a set $A$, and an FSR $S$ on $A$. If $\forall a, b \in A$, $a \neq b$, $0 < S((a, b)) < \alpha$, then $S$ is called an $\alpha$-FSR on $A$.

Definition 3.2. Given a finite set $A = \{a_i\}_{i=1}^{|A|} \subseteq \mathbb{R}$. $\forall a_i \in A$, let $\text{sort}(a_i; A) \in \{1, 2, \ldots, |A|\}$ be the rank of $a_i$ in $A$ in ascending order. Let $b = \text{sortvec}(A) = (b_1, b_2, \ldots, b_{|A|})^T \in \mathbb{R}^{|A|}$ be the vector formed by arranging the elements of $A$ in ascending order. Obviously, the following propositions are true. (1) $\forall a_i, a_j \in A$, $i \neq j$, $a_i > a_j$ iff $\text{sort}(a_i; A) > \text{sort}(a_j; A)$. (2) $\forall a_i \in A$, $a_i = b_{\text{sort}(a_i; A)}$. (3) $\forall i = 1, 2, \ldots, |A|$, $b_i \in A$ and $\text{sort}(b_i; A) = i$.

(1) Mining Class FSR from Human Knowledge System

According to knowledge theory (Murphy & Medin, 1985), to understand concepts profoundly we need to obtain the knowledge-infused concept representation from the human knowledge system $\mathbb{K}$. It is well known that $\mathbb{K}$ contains various types of knowledge. In this paper, we do not limit the type of class knowledge $K \subseteq \mathbb{K}$. It can be the attribute description vector of the concept name used in the literature (Lampert et al., 2014). It can be the embedding vector of the concept name produced by pre-trained language models, such as Word2Vector (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and GPT (OpenAI, 2023). It can be a knowledge graph, such as WordNet (Miller, 1995) or ConceptNet (Speer et al., 2017). It can even be humans themselves. To ensure universality, different types of class knowledge $K$ are uniformly modeled as an $\alpha$-FSR on $\mathcal{Y}$, i.e., $S^K \in (0, 1]^{|\mathcal{Y}| \times |\mathcal{Y}|}$, where $\forall i, j \in \mathcal{Y}$,

$$s^K_{ij} = \begin{cases} 1, & i = j \\ \text{sim}(i, j; K), & i \neq j \end{cases}, \tag{5}$$

where $\text{sim}(i, j; K) \in (0, \alpha)$ is the similarity between classes $i$ and $j$ measured by the class knowledge $K$, and $\alpha$ is the fuzziness parameter. Appendix B.1 gives several types of class knowledge and the corresponding $\text{sim}(i, j; K)$.

In practice, the quality of class knowledge $K$ varies significantly, so we need to further refine the class FSR. Let

$$T = \begin{cases} S^K, & \text{case 1: } K \text{ is of high quality} \\ S^{(D,K)}_*, & \text{case 2: } K \text{ is of medium quality} \\ S^{(D,K)}_{coa}, & \text{case 3: } K \text{ is of low quality} \end{cases} \tag{6}$$

be the refined class FSR, where $S^K$ is obtained by formula (5). Case 1: $S^K$ needs no refinement. Case 2: $S^K$ needs to be calibrated by the training data; Appendix B.2 gives the procedure for computing $S^{(D,K)}_*$. Case 3: $S^K$ needs to be coarsened and then calibrated by the training data; Appendix B.3 gives the procedure for computing $S^{(D,K)}_{coa}$.

In formula (6), $T$ is a symmetric matrix. Each row of $T$ is a $|\mathcal{Y}|$-dimensional embedding of one concept, composed of its relations to the other concepts. Meanwhile, $T$ is mined from the human knowledge system. Therefore, $T$ can be regarded as the knowledge-infused concept representation.

(2) Learning Sample FSR from Data

We design a novel hierarchical alignment (HA) loss, which uses the class FSR $T$, the knowledge-infused concept representation, to guide the learning of the sample FSR from training data, enhancing the model's concept cognition.

HA loss. Let $T$ be the $\alpha$-FSR obtained by formula (6). Let

$$V = \{t_{ij} \mid i, j \in \mathcal{Y}\}, \quad U = V \cup \{0, \alpha, \beta\}, \quad u = \text{sortvec}(U) = (u_1, u_2, \ldots, u_{|U|})^T, \tag{7}$$

where $0 < \alpha < \beta < 1$ are fuzziness parameters. $\forall (x_i, y_i), (x_j, y_j) \in D$, $x_i \neq x_j$, let $s_{ij} = f((x_i, x_j); \Theta)$ be the predicted similarity between $x_i$ and $x_j$. Based on formula (7), the HA loss is defined as

$$L_{HA}(s_{ij}, y_i, y_j) = \begin{cases} u_{r_{ij}-1} - s_{ij}, & s_{ij} \leq u_{r_{ij}-1} \\ 0, & u_{r_{ij}-1} < s_{ij} < u_{r_{ij}} \\ s_{ij} - u_{r_{ij}}, & s_{ij} \geq u_{r_{ij}} \end{cases}, \tag{8}$$

where $r_{ij} = \text{sort}(t_{y_i y_j}; U)$.
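To make formulas (7) and (8) concrete, a hedged PyTorch sketch; it assumes the class FSR $T$ is given as a tensor, and the hyper-parameter values are placeholders rather than the paper's choices.

```python
import torch

def build_thresholds(T, alpha=0.3, beta=0.7):
    """Formula (7): u = sortvec(U) with U = {t_ij} united with {0, alpha, beta}."""
    vals = set(T.flatten().tolist()) | {0.0, alpha, beta}
    u = torch.tensor(sorted(vals), dtype=T.dtype)
    # rank[i, j] = sort(t_ij; U): 1-based rank of t_ij among the thresholds
    rank = torch.searchsorted(u, T.contiguous()) + 1
    return u, rank

def ha_loss(s_ij, y_i, y_j, u, rank):
    """Formula (8): penalize s_ij for leaving the interval (u_{r-1}, u_r),
    where r is the rank of t_{y_i y_j} in U."""
    r = rank[y_i, y_j]
    lo, hi = u[r - 2], u[r - 1]              # u_{r-1} and u_r, with 0-based indexing
    below = torch.clamp(lo - s_ij, min=0.0)  # fell to or below the next-lower knowledge level
    above = torch.clamp(s_ij - hi, min=0.0)  # reached or exceeded t_{y_i y_j} itself
    return below + above
```

For a same-class pair, $t_{y_i y_i} = 1$ and the target interval degenerates to $(\beta, 1)$, which recovers the same-class behavior of the FP loss; for different-class pairs, the interval is bounded above by the knowledge-measured similarity $t_{y_i y_j}$.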
The underlying cognitive principles are as follows. (a) In the ideal case, samples $x_i$ and $x_j$ perfectly represent the concepts $y_i$ and $y_j$, respectively, so the similarity $s_{ij}$ can accurately depict the similarity between $y_i$ and $y_j$. Therefore, $s_{ij}$ should equal the similarity between the concepts measured by the class knowledge, i.e., $s_{ij} = t_{y_i y_j}$. (b) In the real case, it is difficult for a sample to perfectly represent the corresponding concept. For example, it is almost impossible to find a cat in the real world that perfectly represents the concept "cat". Then $s_{ij}$ cannot perfectly represent the similarity between the concepts $y_i$ and $y_j$. Therefore, $s_{ij}$ should be less than the similarity between the concepts measured by the class knowledge, i.e., $s_{ij} < t_{y_i y_j} = u_{\text{sort}(t_{y_i y_j}; U)}$. (c) In the real case, although $s_{ij}$ cannot perfectly represent the similarity between the concepts $y_i$ and $y_j$, it hardly violates their rank. For example, it is difficult to find a cat, a tiger, and a dog in the real world such that the cat and the dog are more similar than the cat and the tiger. Therefore, $s_{ij}$ should not differ too much from the similarity measured by the class knowledge, i.e., $s_{ij} > u_{\text{sort}(t_{y_i y_j}; U) - 1} = \max\{w \mid w \in U, w < t_{y_i y_j}\}$.

Learning. Replacing the FP loss in formula (1) with the HA loss, the training process of HC-HFLM is

$$\min_{\Theta} \sum_{i,j=1, i \neq j}^{n} L_{HA}\left(f((x_i, x_j); \Theta), y_i, y_j\right) + \gamma R(\Theta), \tag{9}$$

where $\gamma > 0$ is a trade-off parameter. Formula (9) can be solved efficiently by the stochastic gradient descent method (Kingma & Ba, 2015). Let $\Theta^\star$ be the local optimal solution of formula (9); then $f((\cdot, \cdot); \Theta^\star)$ forms an FSR on $\mathcal{X}$. If the HA loss of $f((\cdot, \cdot); \Theta^\star)$ is 0, then $f((\cdot, \cdot); \Theta^\star)$ can effectively approximate a fuzzy equivalence relation (FER). In theory, we have

Theorem 3.3 (see Appendix C.1.1 for proof). For an $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given
(a1) fuzziness parameters $0 < \alpha < \beta < 1$ (see formula (2));
(a2) the training data set $D = \{(x_i, y_i) \mid x_i \in \mathcal{X}, y_i = f_c(x_i) \in \mathcal{Y}\}_{i=1}^{n}$;
(a3) an $\alpha$-FSR $T$ on $\mathcal{Y}$ constructed from the class knowledge $K$ (see Definition 3.1 and formula (6));
(a4) the local optimal solution $\Theta^\star$ of formula (9) and the FSR $f((\cdot, \cdot); \Theta^\star)$ on $\mathcal{X}$;
let
(b1) $X = \{x_i \mid (x_i, y_i) \in D\}$ be the set of training samples;
(b2) $S \in [0, 1]^{n \times n}$, $\forall i, j = 1, 2, \ldots, n$, $s_{ij} = f((x_i, x_j); \Theta^\star)$, be the predicted FSR on $X$;
(b3) $\hat{S} = \text{traclo}(S)$ be the transitive closure of $S$ (see Definition A.19), which is also a FER on $X$ (see Theorem 4.1);
(b4) $L_1 = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} L_{HA}(s_{ij}, y_i, y_j)$ be the corresponding HA loss of $S$ (see formula (8));
(b5) $L_2 = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} L_{FP}(s_{ij}, y_i, y_j)$ be the corresponding FP loss of $S$ (see formula (2));
(b6) $L_3 = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} L_{FP}(\hat{s}_{ij}, y_i, y_j)$ be the corresponding FP loss of $\hat{S}$ (see formula (2)).
If $L_1 = 0$, then (1) $L_2 = 0$ and (2) $L_3 = 0$.
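A minimal sketch of one optimization step of formula (9), reusing ha_loss and the thresholds (u, rank) from the previous sketch; the all-pairs construction within a batch, the squared-L2 choice of $R(\Theta)$, and the value of gamma are illustrative assumptions.

```python
import torch

def train_step(model, x, y, u, rank, optimizer, gamma=1e-4):
    """One stochastic step on formula (9) for a mini-batch (x, y).

    model(a, b) is assumed to output pairwise FSR values s_ij in [0, 1];
    R(Theta) is taken to be the squared L2 norm purely for illustration.
    """
    n = x.shape[0]
    idx_i, idx_j = torch.triu_indices(n, n, offset=1)   # all pairs i < j in the batch
    s = model(x[idx_i], x[idx_j])
    loss = sum(ha_loss(s[k], y[idx_i[k]], y[idx_j[k]], u, rank)
               for k in range(len(s)))
    loss = loss + gamma * sum(p.pow(2).sum() for p in model.parameters())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```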
Theorem 3.3 indicates that the HA loss is more stringent than the FP loss and inherits the ability of the FP loss to approximate a FER.

(3) Concept Representation and Predicting

By replacing $f((\cdot, \cdot); \Theta^*)$ in formulas (3) and (4) with $f((\cdot, \cdot); \Theta^\star)$, concept representation and prediction are completed. Both $f((\cdot, \cdot); \Theta^*)$ learned by FLM and $f((\cdot, \cdot); \Theta^\star)$ learned by HC-HFLM are FSRs on $\mathcal{X}$. Guided by the knowledge-infused concept representation, $f((\cdot, \cdot); \Theta^\star)$ integrates the class knowledge $K$, so its cognition of concepts is more profound. Therefore, $f((\cdot, \cdot); \Theta^\star)$ yields better concept representations and predictions. Appendix B.4 gives the algorithm description of the proposed HC-HFLM.

4. Theory Analysis

This section demonstrates how the knowledge-infused concept representation guides the learning process.

4.1. FSR-based Quotient Space Theory

The literature (Zhang & Zhang, 2014) introduces the FER-based quotient space theory. In practice, it is difficult to obtain a FER directly. Hence, we leverage the FSR as an alternative for modeling data and knowledge. Consequently, we need to develop an FSR-based quotient space theory.

The literature (Bandler & Kohout, 1988) introduces the transitive closure of a fuzzy binary relation. This paper focuses only on the transitive closure of an FSR, which satisfies the following special properties.

Theorem 4.1 (see Appendix C.2.1 for proof). Given a finite set $A$ and an FSR $S$ on $A$, the following propositions are true. (1) $\forall k = 0, 1, \ldots$, $S^{2^k}$ is an FSR on $A$. (2) $\forall k = 0, 1, \ldots$, $S^{2^k} \subseteq S^{2^{k+1}}$. (3) The sequence $t(k) = S^{2^k}$, $k = 0, 1, \ldots$, converges to $\text{traclo}(S)$, and $\text{traclo}(S)$ is the transitive closure of $S$. (4) $\text{traclo}(S)$ is a FER on $A$.

Based on Theorem 4.1 and the definitions of the cut relation and the partition (see Definitions A.17 and A.8), we have the following theorem.

Theorem 4.2 (see Appendix C.2.2 for proof). Given a finite set $A$ and an FSR $S$ on $A$, $\forall \lambda \in [0, 1]$, $\text{traclo}(S)^{[\lambda]} = \text{traclo}(S^{[\lambda]})$.

Based on Theorem 4.2, we can construct a quotient space from an FSR as follows.

Definition 4.3. Given a finite set $A$ and an FSR $S$ on $A$, $Q(A, S) = \left\{ A/\text{traclo}(S^{[\lambda]}) \mid \lambda \in [0, 1] \right\}$ is called the quotient space derived by $S$.

The meaning of Definition 4.3 is as follows. (a) $\forall \lambda \in [0, 1]$, $S^{[\lambda]}$ is the $\lambda$-cut relation of $S$ (see Definition A.17); according to Theorem A.18, $S^{[\lambda]}$ is a SR on $A$ (see Definition A.4). (b) $\text{traclo}(S^{[\lambda]})$ is the transitive closure of $S^{[\lambda]}$ (see Definition A.19); according to Theorem 4.1, $\text{traclo}(S^{[\lambda]})$ is an ER on $A$ (see Definition A.5). (c) $A/\text{traclo}(S^{[\lambda]})$ is the partition of $A$ derived by $\text{traclo}(S^{[\lambda]})$ (see Definitions A.6, A.7, and A.8). (d) Combining (a)-(c), $Q(A, S)$ is a set of partitions of $A$ derived by $S$. (e) Given a finite set $A$ and a FER $T$ on $A$: $\forall \lambda \in [0, 1]$, according to Theorem A.18, $T^{[\lambda]}$ is an ER on $A$, and according to Definition A.19, $\text{traclo}(T^{[\lambda]}) = T^{[\lambda]}$. Then $Q(A, T) = \{A/T^{[\lambda]} \mid \lambda \in [0, 1]\}$ degenerates into the hierarchical structure given in the literature (Zhang & Zhang, 2014).

The main properties of Definition 4.3 are as follows.

Theorem 4.4 (see Appendix C.2.3 for proof). Given a finite set $A$ and an FSR $S$ on $A$, let $V = \{S((a, b)) \mid a, b \in A\}$; then $Q(A, S) = \{A/\text{traclo}(S^{[\lambda]}) \mid \lambda \in V\}$.

Theorem 4.5 (see Appendix C.2.4 for proof). Given a finite set $A$ and an FSR $S$ on $A$, $Q(A, S) = Q(A, \text{traclo}(S))$.

Theorem 4.5 establishes the connection between the quotient spaces derived by an FSR and by a FER. Due to the existence and uniqueness of the transitive closure of an FSR (Bandler & Kohout, 1988), the theories developed for FERs in the literature (Zhang & Zhang, 2014) can easily be extended to the quotient space derived by an FSR.
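A compact numpy sketch of Definition A.20's max-min product and the squaring scheme of Theorem 4.1; this is our illustration, not the authors' code, and it assumes the FSR is given as a matrix in the sense of Definition A.22.

```python
import numpy as np

def maxmin_product(R, S):
    """Definition A.20: (R o S)((a, b)) = max_c min(R((a, c)), S((c, b)))."""
    return np.max(np.minimum(R[:, :, None], S[None, :, :]), axis=1)

def traclo(S):
    """Transitive closure of an FSR by repeated squaring (Theorem 4.1):
    S, S^2, S^4, ... is non-decreasing and converges to traclo(S)."""
    T = np.asarray(S, dtype=float)
    while True:
        T2 = maxmin_product(T, T)
        if np.allclose(T2, T):
            return T2
        T = T2
```

Because the squared relations are non-decreasing and $A$ is finite, the loop terminates after at most $\lceil \log_2 |A| \rceil$ squarings.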
Theorem 4.6 (see Appendix C.2.5 for proof). Given a finite set $A$, an FSR $S$ on $A$, and the relation $\preceq$ described in Definition A.10, the following propositions are true. (1) $(Q(A, S), \preceq)$ is a partially ordered set. (2) $\forall P, Q \in Q(A, S)$, $P \neq Q$, $P$ and $Q$ are comparable under $\preceq$.

Theorem 4.7 (see Appendix C.2.6 for proof). Given a finite set $A$ and an FSR $S$ on $A$, let $U = \{S((a, b)) \mid a, b \in A\}$. The following propositions are true. (1) $\forall P, Q \in Q(A, S)$, if $P \neq Q$, then $|P| \neq |Q|$. (2) $|Q(A, S)| \leq |A|$. (3) If $|U| > |Q(A, S)|$, then $\exists V \subseteq U$, $|V| = |Q(A, S)|$, such that $Q(A, S) = \{A/\text{traclo}(S^{[\lambda]}) \mid \lambda \in V\}$. (4) If $S$ satisfies transitivity, i.e., $S$ is a FER on $A$, then $|U| = |Q(A, S)|$.

Theorem 4.8 (see Appendix C.2.7 for proof). Given a finite set $A$ and two FSRs $R, S$ on $A$, if the ranking of the elements in $R$ is consistent with that in $S$, i.e., $\forall a, b, c, d \in A$,

$$\begin{cases} R((a, b)) = R((c, d)) \text{ iff } S((a, b)) = S((c, d)) \\ R((a, b)) > R((c, d)) \text{ iff } S((a, b)) > S((c, d)) \end{cases}, \tag{10}$$

then $Q(A, R) = Q(A, S)$.

Given the above properties, we can construct a hierarchical structure in the quotient space derived by an FSR as follows.

Definition 4.9. Given a finite set $A$, an FSR $S$ on $A$, and the relation $\preceq$ described in Definition A.10. According to Theorem 4.6, $\forall P, Q \in Q(A, S)$, $P \neq Q$, $P$ and $Q$ are comparable under $\preceq$. According to Theorem 4.7, $|Q(A, S)| \leq |A|$. Without loss of generality, let $Q(A, S) = \{P_i\}_{i=1}^{|Q(A,S)|}$ with $\forall i = 1, 2, \ldots, |Q(A, S)| - 1$, $P_i \prec P_{i+1}$. Then $(Q(A, S), \preceq) = P_1 \prec P_2 \prec \cdots \prec P_{|Q(A,S)|}$ is called the hierarchical structure derived by $S$.

4.2. Representing Data and Knowledge in Quotient Space

Data representation. Guided by the class FSR $T$, formula (9) learns an FSR $f((\cdot, \cdot); \Theta^\star)$ from $D$. Let $S \in [0, 1]^{n \times n}$, $\forall i, j = 1, 2, \ldots, n$, $s_{ij} = f((x_i, x_j); \Theta^\star)$, be the predicted FSR on the training sample set $X$. According to Definitions 4.3 and 4.9, the hierarchical structure derived by $S$ is

$$(Q(X, S), \preceq) = Q_1 \prec Q_2 \prec \cdots \prec Q_{|Q(X,S)|}, \tag{11}$$

denoted as the D-representation.

Knowledge representation. In formula (6), the class knowledge $K$ is modeled as an FSR $T$ on $\mathcal{Y}$. According to Definitions 4.3 and 4.9, the hierarchical structure derived by $T$ is

$$(Q(\mathcal{Y}, T), \preceq) = O_1 \prec O_2 \prec \cdots \prec O_{|Q(\mathcal{Y},T)|}, \tag{12}$$

denoted as the K-representation. It has clear semantics and high interpretability; see Appendix B.5 for a detailed analysis.

4.3. Aligning Data and Knowledge in Quotient Space

In Section 4.2, the K-representation $(Q(\mathcal{Y}, T), \preceq)$ and the D-representation $(Q(X, S), \preceq)$ are defined on $\mathcal{Y}$ and $X$, respectively. Due to the gap between $\mathcal{Y}$ and $X$, it is not feasible to align these two representations directly. Therefore, we need to construct a bridge between them.

Definition 4.10. For an $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given a nonempty set of samples $X \subseteq \mathcal{X}$ and a fuzzy binary relation $F$ on $\mathcal{Y}$, $\forall x_i, x_j \in X$, let $G((x_i, x_j)) = F((f_c(x_i), f_c(x_j)))$. $G$ is called the fuzzy binary relation on $X$ spanned by $F$ and $f_c$, denoted as $G = \text{span}(F, f_c, X)$.

For an $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given the set of training samples $X$ and a class FSR $T$ obtained by formula (6), according to Definitions 4.10, 4.3, and 4.9, the hierarchical structure derived by $\text{span}(T, f_c, X)$ is

$$(Q(X, \text{span}(T, f_c, X)), \preceq) = P_1 \prec P_2 \prec \cdots \prec P_{|Q(X, \text{span}(T, f_c, X))|}, \tag{13}$$

denoted as the (D, K)-representation.
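Putting Theorem 4.4 together with Definitions 4.3 and 4.9, a hedged numpy sketch that enumerates the hierarchical structure derived by an FSR, reusing traclo from the sketch above; the block representation is an illustrative choice.

```python
import numpy as np

def partition_from_er(E):
    """Definition A.8: the blocks of a crisp ER given as a 0/1 matrix."""
    blocks, seen = [], set()
    for a in range(len(E)):
        if a not in seen:
            cls = frozenset(np.flatnonzero(E[a]).tolist())
            blocks.append(cls)
            seen |= cls
    return frozenset(blocks)

def hierarchical_structure(S):
    """Definitions 4.3 and 4.9: the partitions A / traclo(S^[lambda]),
    enumerated only over the thresholds in V (Theorem 4.4), finest first."""
    S = np.asarray(S, dtype=float)
    hs = []
    for lam in sorted({float(v) for v in S.ravel()}, reverse=True):
        cut = (S >= lam).astype(float)             # lambda-cut relation (Definition A.17)
        part = partition_from_er(traclo(cut) > 0)  # ER via transitive closure (Theorem 4.2)
        if part not in hs:
            hs.append(part)                        # duplicates collapse, cf. Theorem 4.7
    return hs
```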
First, we align the K-representation and the (D, K)-representation by the following theorem.

Theorem 4.11 (see Appendix C.2.8 for proof). Given the symbols defined in Theorem 3.3, let $(Q(\mathcal{Y}, T), \preceq)$ and $(Q(X, \text{span}(T, f_c, X)), \preceq)$ be the K-representation given by formula (12) and the (D, K)-representation given by formula (13), respectively. If $\forall y \in \mathcal{Y}$, $\exists x \in X$ such that $f_c(x) = y$, then the alignment between $(Q(\mathcal{Y}, T), \preceq)$ and $(Q(X, \text{span}(T, f_c, X)), \preceq)$ is as follows. (1) $|Q(\mathcal{Y}, T)| = |Q(X, \text{span}(T, f_c, X))|$. (2) $\forall i = 1, 2, \ldots, |Q(\mathcal{Y}, T)|$, the alignment relation between $O_i$ and $P_i$ is: (a) $\forall O \in O_i$, $\{x \mid x \in X, f_c(x) \in O\} \in P_i$; (b) $\forall P \in P_i$, $\{f_c(x) \mid x \in P\} \in O_i$; (c) $|O_i| = |P_i|$.

Then, we align the D-representation and the (D, K)-representation by the following theorem.

Theorem 4.12 (see Appendix C.2.9 for proof). Given the symbols defined in Theorem 3.3, let $(Q(X, S), \preceq)$ and $(Q(X, \text{span}(T, f_c, X)), \preceq)$ be the D-representation given by formula (11) and the (D, K)-representation given by formula (13), respectively. If the HA loss of $S$ is 0, then $\forall i = 1, 2, \ldots, |Q(X, \text{span}(T, f_c, X))|$, $\exists j \in \{1, 2, \ldots, |Q(X, S)|\}$ such that $P_i = Q_j$.

At last, we align the D-representation and the K-representation by the following corollary.

Corollary 4.13. Given the symbols defined in Theorem 3.3, let $(Q(\mathcal{Y}, T), \preceq)$ and $(Q(X, S), \preceq)$ be the K-representation given by formula (12) and the D-representation given by formula (11), respectively. If $\forall y \in \mathcal{Y}$, $\exists x \in X$ such that $f_c(x) = y$, and the HA loss of $S$ is 0, then $(Q(\mathcal{Y}, T), \preceq)$ aligns with $(Q(X, \text{span}(T, f_c, X)), \preceq)$ (see Theorem 4.11), which in turn aligns with $(Q(X, S), \preceq)$ (see Theorem 4.12).

Corollary 4.13 shows that minimizing the HA loss aligns the hierarchical structures of the data and the class knowledge. This means that the class knowledge contained in the human knowledge system is effectively integrated into the learned sample FSR $f((\cdot, \cdot); \Theta^\star)$. Thus, HC-HFLM can utilize the human knowledge system to enhance its understanding of concepts.

5. Experiments

5.1. Interpretability Analysis

We show the working mechanism of HC-HFLM on the data set MNIST (LeCun et al., 1998). The class space consists of the 10 digit characters. For ease of display, 10 samples are randomly selected for each class to form the training data set. Ten volunteers are recruited to construct a class FSR $S^K$, and the calibrated class FSR $S^{(D,K)}_*$ is then obtained by formula (26). Appendix D.1 gives more experimental details. We analyze the experimental results from the following three aspects.

[Table 1. Class knowledge and its representation on the handwritten digit classification task: the class FSRs $S^D$, $S^K$, and $S^{(D,K)}_*$ over the digit classes 0-9, together with the hierarchical structures $(Q(\mathcal{Y}, S^D), \preceq)$, $(Q(\mathcal{Y}, S^K), \preceq)$, and $(Q(\mathcal{Y}, S^{(D,K)}_*), \preceq)$ derived from each.]

(1) Class Knowledge Representation

Table 1 shows the process of class knowledge representation. We have the following observations. (a) $S^K$ is the class FSR constructed by the 10 volunteers, and $(Q(\mathcal{Y}, S^K), \preceq)$ is the hierarchical structure derived by $S^K$. Each class forms an equivalence class in the finest partition $O_1$. In $O_1$, the similarity between classes 6 and 9 is the largest, so they are merged to obtain the second partition $O_2$, and so on. Finally, all the classes are merged into one equivalence class, resulting in the coarsest partition $O_9$. (b) $S^D$ is the class FSR estimated by the pre-trained $f((\cdot, \cdot); \Theta^*)$ (see formula (25)). Comparing $S^D$ and $S^K$, the scales of $S^D$ obtained from data and of $S^K$ obtained from knowledge are quite different; in addition, the rankings of their elements also differ. (c) $(Q(\mathcal{Y}, S^D), \preceq)$ is the hierarchical structure derived by $S^D$. Comparing $(Q(\mathcal{Y}, S^D), \preceq)$ and $(Q(\mathcal{Y}, S^K), \preceq)$, the difference between $S^D$ and $S^K$ directly determines the difference between the hierarchical structures derived from them. Therefore, the hierarchical structure can capture the difference between FSRs.
(d) $S^{(D,K)}_*$ is the optimal solution of formula (26), and $(Q(\mathcal{Y}, S^{(D,K)}_*), \preceq)$ is the hierarchical structure derived by $S^{(D,K)}_*$. Comparing $S^K$ and $S^{(D,K)}_*$, the rankings of their elements are the same; comparing $(Q(\mathcal{Y}, S^K), \preceq)$ and $(Q(\mathcal{Y}, S^{(D,K)}_*), \preceq)$, the two hierarchical structures are also the same (see Theorem 4.8).

(2) Data Representation

The left part of Figure 2 gives the hierarchical structure $(Q(X, S), \preceq)$ derived by $S$, where $S \in [0, 1]^{100 \times 100}$ is the FSR predicted by $f((\cdot, \cdot); \Theta^\star)$. $(Q(X, S), \preceq)$ contains 100 partitions, which equals the number of training samples (see Theorem 4.7). In the finest partition $Q_1$, each sample forms an equivalence class. In the coarsest partition $Q_{100}$, all samples form one equivalence class. $Q_{91}$ contains 10 equivalence classes, and each equivalence class is exactly the set of training samples of one class.

[Figure 2. The alignment between data and class knowledge on the handwritten digit classification task: the hierarchical structure $(Q(X, S), \preceq)$ over the 100 training samples, from $Q_1$ to $Q_{100}$ (left), aligned with the hierarchical structure $(Q(\mathcal{Y}, S^{(D,K)}_*), \preceq)$ over the classes, from $O_1$ to $O_9$ (right).]

(3) Alignment between Data and Class Knowledge

In Figure 2, the left part and the right part are the hierarchical structures of the data and the class knowledge in quotient space, respectively, i.e., $(Q(X, S), \preceq)$ and $(Q(\mathcal{Y}, S^{(D,K)}_*), \preceq)$. The alignment between them is as follows. (a) In the partition $O_1$, each equivalence class corresponds to a class; in the partition $Q_{91}$, each equivalence class corresponds to the set of training samples belonging to one class. The two correspond one-to-one (see the black dotted line in Figure 2). (b) Based on the alignment between $Q_{91}$ and $O_1$, the left and right equivalence classes are merged in the same way, forming the coarser partitions $Q_{92}$ and $O_2$, respectively. Therefore, $Q_{92}$ and $O_2$ are aligned. Analogously, $Q_{93}$ and $O_3$, and $Q_{94}$ and $O_4$, are also aligned. (c) Based on the alignment between $Q_{94}$ and $O_4$, the left equivalence classes are merged twice to obtain the coarser partition $Q_{96}$, while the right equivalence classes are merged once to obtain the coarser partition $O_5$ (see the red dotted line in Figure 2). Therefore, $Q_{96}$ and $O_5$ are aligned. (d) Based on the alignment between the partitions $Q_{96}$ and $O_5$, the equivalence classes on the left and right continue to merge in the same way. Therefore, $\forall k = 1, 2, \ldots, 4$, $Q_{96+k}$ and $O_{5+k}$ are aligned. (e) According to (a)-(d), the substructure $\{Q_{91}, \ldots, Q_{94}, Q_{96}, \ldots, Q_{100}\}$ of the left hierarchical structure is the common data-knowledge hierarchical structure $(Q(X, \text{span}(T, f_c, X)), \preceq)$ of formula (13), Theorem 4.11, and Theorem 4.12.

In summary, this experiment demonstrates that HC-HFLM models the class knowledge and the training data as hierarchical structures in quotient space and further aligns these two hierarchical structures.

5.2. Generalization Analysis

Data sets. This section verifies the generalization performance of HC-HFLM on 6 public data sets.
Two types of knowledge, i.e., the class description vector (CK1) and WordNet (Miller, 1995) (CK2), are integrated into HC-HFLM, denoted as CK1-HFLM and CK2-HFLM, respectively. Table 2 gives the basic information of the data sets, where N/A denotes that CK2 cannot be obtained for the data set.

Table 2. The basic information of the data sets

| Task | Data set | Samples | Features | Classes | Class description vector | Tree(Z, E) derived from WordNet |
|---|---|---|---|---|---|---|
| Coarse-grained classification | APY | 15399 | 2048 | 32 | 64-D, domain expert | 18 layers, 97 nodes |
| Coarse-grained classification | ImageNet1K | 1331167 | 2048 | 1000 | 768-D, LLMs | 18 layers, 1814 nodes |
| Medium-grained classification | AWA1 | 30475 | 2048 | 50 | 85-D, domain expert | 17 layers, 111 nodes |
| Medium-grained classification | AWA2 | 37322 | 2048 | 50 | 85-D, domain expert | 17 layers, 111 nodes |
| Fine-grained classification | FLO | 8189 | 2048 | 102 | 1024-D, LLMs | N/A |
| Fine-grained classification | CUB | 11788 | 2048 | 200 | 312-D, domain expert | N/A |

Setting. Six different classifiers are chosen for comparison. Among them, k-nearest neighbors (KNN), decision tree (DT), support vector machine (SVM), and naive Bayes (NB) are four classical classifiers; the cross-entropy classifier (CEC) and FLM are based on deep neural networks. Appendix D.2 gives more experimental details.

Results and analysis. Table 3 records the test accuracy of the 8 methods on the 6 data sets, where "–" denotes that the result could not be obtained within 7 days.

Table 3. The test accuracy of different methods (%)

| Method | APY | ImageNet1K | AWA1 | AWA2 | FLO | CUB |
|---|---|---|---|---|---|---|
| KNN | 85.35 | 75.49 | 86.61 | 89.83 | 83.39 | 47.30 |
| DT | 63.13 | – | 63.47 | 70.09 | 42.65 | 21.64 |
| SVM | 84.54 | – | 84.26 | 89.06 | 86.56 | 43.89 |
| NB | 76.13 | 73.66 | 84.67 | 87.68 | 85.53 | 60.27 |
| CEC | 89.08 | 75.62 | 88.48 | 91.67 | 93.58 | 61.49 |
| FLM | 89.42 | 76.02 | 89.95 | 92.79 | 94.05 | 66.19 |
| CK1-HFLM | 90.23 | 76.16 | 91.10 | 93.59 | 95.06 | 68.78 |
| CK2-HFLM | 90.21 | 76.20 | 90.87 | 93.35 | N/A | N/A |

We have the following observations. (a) Compared with the classifiers KNN, DT, SVM, and NB, CEC achieves better performance due to the powerful expressive ability of deep neural networks. With the same backbone network, FLM outperforms CEC, because FLM designs a more reasonable optimization objective according to the principles of concept cognition. (b) Compared with FLM, CK1-HFLM and CK2-HFLM achieve better performance, which shows that the concept cognition of the proposed method is improved under the guidance of class knowledge. Especially on the fine-grained classification data set CUB, a significant performance improvement is obtained by integrating class knowledge. (c) On APY, AWA1, and AWA2, CK1-HFLM, which integrates the class description vector, outperforms CK2-HFLM, which integrates WordNet knowledge. On these three data sets, the class description vectors are constructed by domain experts and, compared with WordNet, are more suitable for these tasks. (d) On ImageNet1K, CK2-HFLM, which integrates WordNet knowledge, outperforms CK1-HFLM, which integrates the class description vector. The class space is a subset of the words in WordNet (Russakovsky et al., 2015); compared with the class description vector obtained from the pre-trained language model, the class knowledge contained in WordNet is more suitable for this task.

In summary, HC-HFLM obtains significant performance improvements from the guidance of class knowledge. Moreover, the better the class knowledge meets the requirements of the task, the greater the performance gain.

6. Conclusion

Inspired by human cognition, we propose a hierarchical fuzzy learning machine. The method constructs the knowledge-infused concept representation by mining a class fuzzy similarity relation from the human knowledge system and employs a novel hierarchical alignment loss to guide the learning process. Meanwhile, we introduce the fuzzy similarity relation-based quotient space theory and, based on it, prove that the proposed method aligns the hierarchical structures of data and knowledge, which guarantees that the proposed method can leverage the human knowledge system to enhance concept cognition.
Experimental results validate the proposed method's ability to improve both interpretability and generalization. With its capability to learn from the human knowledge system, the proposed method holds great promise for open-world learning tasks, such as zero-shot learning and continual learning.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 62376141, U21A20473, 62376142) and the UK-China Joint Laboratory of Security and Control on Smart Energy (202104041101020).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

AlMousa, M., Benlamri, R., and Khoury, R. Exploiting non-taxonomic relations for measuring semantic similarity and relatedness in WordNet. Knowledge-Based Systems, 212:106565, 2021.

Bandler, W. and Kohout, L. J. Special properties, closures and interiors of crisp and fuzzy relations. Fuzzy Sets and Systems, 26(3):317-331, 1988.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

Cover, T. M. and Hart, P. E. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21-27, 1967.

Cui, J. and Liang, J. Fuzzy learning machine. In Advances in Neural Information Processing Systems, pp. 36693-36705, 2022.

Delgado, M. F., Cernadas, E., Barro, S., and Amorim, D. G. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133-3181, 2014.

Domingos, P. M. and Pazzani, M. J. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407, 2000.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Lampert, C. H., Nickisch, H., and Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453-465, 2014.

Lastra-Díaz, J. J. and García-Serrano, A. A novel family of IC-based similarity measures with a detailed experimental survey on WordNet. Engineering Applications of Artificial Intelligence, 46:140-153, 2015.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

LeCun, Y., Bengio, Y., and Hinton, G. E. Deep learning. Nature, 521(7553):436-444, 2015.

Marti, L., Wu, S., Piantadosi, S. T., and Kidd, C. Latent diversity in human concepts. Open Mind, 7:79-92, 2023.

McCloskey, M. E. and Glucksberg, S. Natural categories: Well defined or fuzzy sets? Memory & Cognition, 6(4):462-472, 1978.

Medin, D. L. and Schaffer, M. M. Context theory of classification learning. Psychological Review, 85(3):207-238, 1978.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.

Miller, G. A. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

Murphy, G. The Big Book of Concepts. MIT Press, Cambridge, Massachusetts, USA, 2004.

Murphy, G. L. and Medin, D. L. The role of theories in conceptual coherence. Psychological Review, 92(3):289-316, 1985.

OpenAI. GPT-4 technical report. https://doi.org/10.48550/arXiv.2303.08774, 2023.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, pp. 1532-1543, 2014.

Piantadosi, S. T., Muller, D. C., Rule, J. S., Kaushik, K., Gorenstein, M., Leib, E. R., and Sanford, E. Why concepts are (probably) vectors. Trends in Cognitive Sciences, 28(9):844-856, 2024.

Quinlan, J. R. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

Rosch, E. Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104(3):192, 1975.

Rosen, K. H. Discrete Mathematics and Its Applications. McGraw-Hill Education, New York, USA, 2007.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

Silla, C. N. and Freitas, A. A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22:31-72, 2011.

Smith, E. E., Shoben, E. J., and Rips, L. J. Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, 81(3):214-241, 1974.

Speer, R., Chin, J., and Havasi, C. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI Conference on Artificial Intelligence, pp. 4444-4451, 2017.

Wang, Y., Hu, Q., Zhu, P., Li, L., Lu, B., Garibaldi, J. M., and Li, X. Deep fuzzy tree for large-scale hierarchical visual classification. IEEE Transactions on Fuzzy Systems, 28(7):1395-1406, 2020.

Wang, Y., Wang, Z., Hu, Q., Zhou, Y., and Su, H. Hierarchical semantic risk minimization for large-scale classification. IEEE Transactions on Cybernetics, 52(9):9546-9558, 2022.

Wittgenstein, L. Philosophical Investigations. Blackwell Publishing, 1953.

Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251-2265, 2019.

Zadeh, L. A. Fuzzy sets. Information and Control, 8(3):338-353, 1965.

Zhang, L. and Zhang, B. Quotient Space Based Problem Solving: A Theoretical Foundation of Granular Computing. Tsinghua University Press, Beijing, China, 2014.

A. Preliminaries

A.1. Binary Relation

To unify the exposition, we adopt the form of a mapping to define the binary relation and the fuzzy binary relation.

Definition A.1. (Bandler & Kohout, 1988) Given a set $A$, the mapping $R : A \times A \to \{0, 1\}$ is called a binary relation on $A$.

Definition A.2. (Bandler & Kohout, 1988) Given a set $A$ and a binary relation $R$ on $A$, the three properties of $R$ are defined as: (1) reflexivity: $\forall a \in A$, $R((a, a)) = 1$; (2) symmetry: $\forall a, b \in A$, $R((a, b)) = R((b, a))$; (3) transitivity: $\forall a, b, c \in A$, if $R((a, b)) = 1$ and $R((b, c)) = 1$, then $R((a, c)) = 1$. Obviously, transitivity can be equivalently defined as: $\forall a, b \in A$, $R((a, b)) \geq \max_{c \in A} \min[R((a, c)), R((c, b))]$.
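To ground Definitions A.1 and A.2, a small numpy sketch of ours (not code from the paper) that checks the three properties of a crisp relation given in matrix form, cf. Definition A.22.

```python
import numpy as np

def relation_properties(R):
    """Check the Definition A.2 properties of a crisp relation R,
    given as a 0/1 matrix in the sense of Definition A.22."""
    R = np.asarray(R)
    reflexive = bool(np.all(np.diag(R) == 1))
    symmetric = bool(np.array_equal(R, R.T))
    # transitivity: R((a,b)) = 1 and R((b,c)) = 1 must imply R((a,c)) = 1,
    # i.e. the boolean self-product of R stays inside R
    transitive = bool(np.all((R @ R > 0) <= (R == 1)))
    return reflexive, symmetric, transitive
```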
Definition A.3. (Bandler & Kohout, 1988) Given a set $A$, let $R$ and $S$ be two binary relations on $A$. (1) $R \subseteq S$ iff $\forall a, b \in A$, $R((a, b)) \leq S((a, b))$. (2) $R \subset S$ iff $R \subseteq S$ and $\exists a, b \in A$ such that $R((a, b)) < S((a, b))$.

Definition A.4. (Bandler & Kohout, 1988) Given a set $A$ and a binary relation $R$ on $A$, if $R$ satisfies reflexivity and symmetry, then $R$ is called a similarity relation (SR) on $A$.

Definition A.5. (Bandler & Kohout, 1988) Given a set $A$ and a SR $R$ on $A$, if $R$ satisfies transitivity, then $R$ is called an equivalence relation (ER) on $A$.

Definition A.6. (Rosen, 2007) Given two sets $A$ and $\mathcal{P} = \{P_1, P_2, \ldots, P_m\}$, $m \geq 1$, if $\mathcal{P}$ satisfies the following properties: (1) $\forall P_i \in \mathcal{P}$, $P_i \neq \emptyset$; (2) $\bigcup_{P_i \in \mathcal{P}} P_i = A$; (3) $\forall P_i, P_j \in \mathcal{P}$, if $i \neq j$, then $P_i \cap P_j = \emptyset$, then $\mathcal{P}$ is called a partition of $A$. A partition is a special type of quotient set.

Definition A.7. (Rosen, 2007) Given a set $A$ and an ER $R$ on $A$, $\forall a \in A$, $[a]_R = \{b \mid b \in A, R((a, b)) = 1\}$ is called the equivalence class of $a$ derived by $R$.

Definition A.8. Given a set $A$ and an ER $R$ on $A$, $A/R = \{[a]_R \mid a \in A\}$ is obviously a partition of $A$, called the partition of $A$ derived by $R$.

Remark A.9. According to Definitions A.6, A.7, and A.8, given a set $A$, the ERs on $A$ and the partitions of $A$ form a one-to-one mapping. Based on this, in Proposition 1 of the literature (Cui & Liang, 2022), we model the classification problem as an equivalence relation problem on the input space, which establishes the equivalence between the classification problem and a binary classification problem.

Definition A.10. Given a set $A$ and two partitions $\mathcal{P}, \mathcal{Q}$ of $A$, if $\forall P \in \mathcal{P}$, $\exists Q \in \mathcal{Q}$ such that $P \subseteq Q$, we say that $\mathcal{Q}$ is coarser than $\mathcal{P}$, denoted $\mathcal{P} \preceq \mathcal{Q}$; equivalently, we say that $\mathcal{P}$ is finer than $\mathcal{Q}$, denoted $\mathcal{Q} \succeq \mathcal{P}$. Additionally, if $\mathcal{P} \preceq \mathcal{Q}$ and $\exists P \in \mathcal{P}$, $\exists Q \in \mathcal{Q}$ such that $P \subset Q$, we say that $\mathcal{Q}$ is strictly coarser than $\mathcal{P}$, denoted $\mathcal{P} \prec \mathcal{Q}$; equivalently, $\mathcal{P}$ is strictly finer than $\mathcal{Q}$, denoted $\mathcal{Q} \succ \mathcal{P}$.

Remark A.11. According to Remark A.9 and Definition A.10, the following propositions are true. (1) Given a set $A$ and two ERs $R, S$ on $A$, if $R \subseteq S$, then $A/R \preceq A/S$. (2) Given a set $A$ and two partitions $\mathcal{P}, \mathcal{Q}$ of $A$, let $R$ and $S$ be the ERs corresponding to $\mathcal{P}$ and $\mathcal{Q}$, respectively; if $\mathcal{P} \preceq \mathcal{Q}$, then $R \subseteq S$.

Remark A.12. Given a set $A$, let $\mathbb{P}$ be the set of partitions of $A$. Obviously, the $\preceq$ described in Definition A.10 satisfies the following properties. (1) Reflexivity: $\forall \mathcal{P} \in \mathbb{P}$, $\mathcal{P} \preceq \mathcal{P}$. (2) Antisymmetry: $\forall \mathcal{P}, \mathcal{Q} \in \mathbb{P}$, if $\mathcal{P} \preceq \mathcal{Q}$ and $\mathcal{Q} \preceq \mathcal{P}$, then $\mathcal{P} = \mathcal{Q}$. (3) Transitivity: $\forall \mathcal{P}, \mathcal{Q}, \mathcal{R} \in \mathbb{P}$, if $\mathcal{P} \preceq \mathcal{Q}$ and $\mathcal{Q} \preceq \mathcal{R}$, then $\mathcal{P} \preceq \mathcal{R}$. In summary, $\preceq$ is an order relation on $\mathbb{P}$.

Lemma A.13. Given a finite set $A$, let $\mathcal{P}$ and $\mathcal{Q}$ be two partitions of $A$. If $\mathcal{P} \neq \mathcal{Q}$ and $\mathcal{P} \preceq \mathcal{Q}$, then $|\mathcal{P}| \neq |\mathcal{Q}|$.

Proof. By contradiction. Assume that $|\mathcal{P}| = |\mathcal{Q}| = m$. Without loss of generality, let $\mathcal{P} = \{P_1, P_2, \ldots, P_m\}$ and $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_m\}$. Meanwhile, because $\mathcal{P} \preceq \mathcal{Q}$, without loss of generality, let

$$\forall i = 1, 2, \ldots, m, \quad P_i \subseteq Q_i, \tag{15a}$$
$$\forall i = 1, 2, \ldots, m, \quad |P_i| \leq |Q_i|, \text{ i.e., } |Q_i| - |P_i| \geq 0. \tag{15b}$$

Because $\mathcal{P}$ and $\mathcal{Q}$ are both partitions of $A$, according to Definition A.6,

$$|A| = \sum_{i=1}^{m} |P_i|, \tag{16a}$$
$$|A| = \sum_{i=1}^{m} |Q_i|. \tag{16b}$$

Subtracting formula (16a) from formula (16b),

$$0 = \sum_{i=1}^{m} (|Q_i| - |P_i|). \tag{17}$$

According to formulas (17) and (15b),

$$\forall i = 1, 2, \ldots, m, \quad |Q_i| - |P_i| = 0, \text{ i.e., } |Q_i| = |P_i|. \tag{18}$$

According to formulas (18) and (15a),

$$\forall i = 1, 2, \ldots, m, \quad P_i = Q_i. \tag{19}$$

According to formula (19), $\mathcal{P} = \mathcal{Q}$, which contradicts $\mathcal{P} \neq \mathcal{Q}$. Therefore, the original proposition is true.
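Definition A.10's order on partitions is easy to state operationally; a minimal Python sketch of ours, with a toy example consistent with Lemma A.13.

```python
def is_finer(P, Q):
    """Definition A.10: P precedes Q iff every block of P lies inside some block of Q."""
    return all(any(p <= q for q in Q) for p in P)

# A strictly finer partition has strictly more blocks (Lemma A.13)
P = [{1}, {2}, {3, 4}]
Q = [{1, 2}, {3, 4}]
assert is_finer(P, Q) and not is_finer(Q, P) and len(P) > len(Q)
```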
In this paper, we use $O$ to represent partitions of the class space $\mathcal{Y}$ and use $P$ and $Q$ to represent partitions of the set of training samples $X$. Accordingly, we use $O \in O$, $P \in P$, and $Q \in Q$ to represent the equivalence classes in the partitions $O$, $P$, and $Q$, respectively.

A.2. Fuzzy Binary Relation

Definition A.14. (Bandler & Kohout, 1988) Given a set $A$, the mapping $F : A \times A \to [0, 1]$ is called a fuzzy binary relation on $A$. Comparing Definitions A.1 and A.14, the binary relation is a special type of fuzzy binary relation.

Definition A.15. (Bandler & Kohout, 1988) Given a set $A$ and a fuzzy binary relation $F$ on $A$. If $F$ satisfies (1) reflexivity: $\forall a \in A$, $F((a, a)) = 1$, and (2) symmetry: $\forall a, b \in A$, $F((a, b)) = F((b, a))$, then $F$ is called a fuzzy similarity relation (FSR) on $A$.

Definition A.16. (Bandler & Kohout, 1988) Given a set $A$ and an FSR $F$ on $A$. If $F$ satisfies transitivity: $\forall a, c \in A$, $F((a, c)) \geq \max_{b \in A} \min[F((a, b)), F((b, c))]$, then $F$ is called a fuzzy equivalence relation (FER) on $A$.

The following definition intuitively establishes the connection between the binary relation and the fuzzy binary relation.

Definition A.17. (Zhang & Zhang, 2014) Given a set $A$ and a fuzzy binary relation $F$ on $A$, $\forall \lambda \in [0, 1]$, the binary relation

$$\forall a, b \in A, \quad F^{[\lambda]}((a, b)) = I[F((a, b)) \geq \lambda] = \begin{cases} 1, & F((a, b)) \geq \lambda \\ 0, & \text{otherwise} \end{cases}$$

is called the $\lambda$-cut binary relation, or the $\lambda$-cut relation for short.

Based on Definition A.17, the following theorem further establishes the connection between FSR (FER) and SR (ER).

Theorem A.18. (The proof is intuitive and omitted.) Given a set $A$ and a fuzzy binary relation $F$ on $A$, $\forall \lambda \in [0, 1]$, let $F^{[\lambda]}$ be the $\lambda$-cut relation of $F$. (1) If $F$ is an FSR on $A$, then $F^{[\lambda]}$ is a SR on $A$. (2) If $F$ is a FER on $A$, then $F^{[\lambda]}$ is an ER on $A$.

Definition A.19. (Bandler & Kohout, 1988) Given a set $A$ and a fuzzy binary relation $F$ on $A$. If a fuzzy binary relation $G$ on $A$ satisfies (1) transitivity, (2) $F \subseteq G$, and (3) for every fuzzy binary relation $H$ on $A$ with $F \subseteq H$, if $H$ satisfies transitivity then $G \subseteq H$, then $G$ is called the transitive closure of $F$, denoted $G = \text{traclo}(F)$. Obviously, if $F$ satisfies transitivity, then $\text{traclo}(F) = F$.

The literature (Bandler & Kohout, 1988) gives general definitions of P-closures; Definition A.19 is a special case among them. Moreover, this paper focuses only on the transitive closure of an FSR.

Definition A.20. (Bandler & Kohout, 1988) Given a set $A$, let $R$ and $S$ be two fuzzy binary relations on $A$. The product fuzzy binary relation of $R$ and $S$ is defined as $T = R \circ S$, $\forall a, b \in A$, $T((a, b)) = \max_{c \in A} \min[R((a, c)), S((c, b))]$.

Given $\circ$, we can define the powers of a fuzzy binary relation as follows.

Definition A.21. Given a set $A$ and a fuzzy binary relation $F$ on $A$: $F^1 = F$, $F^2 = F^1 \circ F$, $F^3 = F^2 \circ F$, and so on.

Definition A.22. Given a finite set $A$ and a fuzzy binary relation $F$ on $A$. Without loss of generality, we can sequentially number the elements in $A$ as $a_1, a_2, \ldots, a_{|A|}$. Based on this, $F$ can be equivalently represented by a matrix $F \in [0, 1]^{|A| \times |A|}$, $\forall i, j = 1, 2, \ldots, |A|$, $f_{ij} = F((a_i, a_j))$.

Definitions A.19-A.22 are also applicable to the binary relation. In this paper, we use the matrices $S^K, S^D, S^{(D,K)}, T \in [0, 1]^{|\mathcal{Y}| \times |\mathcal{Y}|}$ to represent FSRs on the class space $\mathcal{Y}$, use the matrix $S \in [0, 1]^{n \times n}$ to represent the FSR on the set of training samples $X$, and use $f((\cdot, \cdot); \Theta)$ to represent the parameterized FSR on the input space $\mathcal{X}$.
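As a small illustration of Definitions A.17 and A.19-A.21, a self-contained toy example of ours (the matrix values are arbitrary).

```python
import numpy as np

# A toy FER on A = {a1, a2, a3}: a1 and a2 are similar at level 0.8,
# and both relate to a3 only at level 0.3 (illustrative values)
F = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.3],
              [0.3, 0.3, 1.0]])

# Definition A.17: the lambda-cut relation as an indicator matrix
cut = (F >= 0.5).astype(int)   # merges a1 with a2 only

# Definition A.19's closing remark: a transitive F is its own transitive
# closure, so one max-min squaring (Definitions A.20-A.21) leaves F unchanged
assert np.allclose(np.max(np.minimum(F[:, :, None], F[None, :, :]), axis=1), F)
```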
A.3. Fuzzy Quotient Space

Based on the preliminaries given in Appendices A.1 and A.2, the literature (Zhang & Zhang, 2014) systematically develops the FER-based fuzzy quotient space theory. We list the contents used in this paper as follows.

Theorem A.23. (Zhang & Zhang, 2014) Given a set $A$, the following assertions are equivalent. (1) A FER on $A$ is given (see Definition A.16). (2) A normalized isosceles distance $d$ on $A$ is given, where normalized means $\forall a, b \in A$, $d(a, b) \in [0, 1]$, and isosceles means $\forall a, b, c \in A$, $d(a, b)$, $d(b, c)$, and $d(a, c)$ form an isosceles triangle whose congruent legs are the longest sides. (3) A hierarchical structure on $A$ is given, i.e., a set of partitions of $A$ that forms an ordered queue under the $\preceq$ described in Definition A.10.

According to Definition A.16, a FER needs to satisfy transitivity. In practice, it is relatively difficult to directly obtain a fuzzy binary relation that satisfies transitivity. Therefore, we develop the FSR-based quotient space theory (see Section 4.1). Based on it, the class knowledge and the data are represented as hierarchical structures in the quotient space (see Section 4.2), and by designing an appropriate loss, we align these two hierarchical structures (see Section 4.3). As a result, the understanding of class concepts contained in the class knowledge is integrated into the model.

B. Details of Method

B.1. Knowledge Modeling

This section discusses different types of class knowledge and the corresponding methods for calculating the class FSR.

(1) $K = \{a_i \mid a_i \in \mathbb{R}^{d_K}, i \in \mathcal{Y}\}$, where $a_i$ is a description vector of class $i$. It can be designed by domain experts. It can also be obtained from pre-trained language models. Given $K$, $\forall i, j \in \mathcal{Y}$, $i \neq j$, the similarity between classes $i$ and $j$ can be calculated as follows:¹

$$\text{sim}(i, j; K) = \frac{\alpha}{2}\left(1 + \frac{a_i^T a_j}{\|a_i\|_2 \|a_j\|_2}\right) \in (0, \alpha). \tag{20}$$

¹ In practice, we can introduce a small constant $\epsilon > 0$ to make the value of $\text{sim}(\cdot, \cdot; \cdot)$ fall into the open interval $(0, \alpha)$.
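As an illustration of formulas (5) and (20), a short numpy sketch of ours that turns class description or embedding vectors into the $\alpha$-FSR $S^K$; the epsilon clipping mirrors footnote 1, and the values of alpha and eps are placeholders.

```python
import numpy as np

def class_fsr_from_embeddings(A, alpha=0.3, eps=1e-6):
    """Formulas (5) and (20): the alpha-FSR S^K on Y from class
    description or embedding vectors A (one row per class)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)  # unit-normalize rows
    S = alpha / 2.0 * (1.0 + A @ A.T)                 # cosine mapped into [0, alpha]
    S = np.clip(S, eps, alpha - eps)                  # off-diagonals in (0, alpha), cf. footnote 1
    np.fill_diagonal(S, 1.0)                          # s_ii = 1 by formula (5)
    return S
```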
(2) $K$ is a knowledge graph, e.g., WordNet (Miller, 1995). Each class label corresponds to a word in WordNet. A large number of methods have been proposed to calculate the semantic similarity between two words based on WordNet (Lastra-Díaz & García-Serrano, 2015; AlMousa et al., 2021). Given WordNet, $\forall i, j \in \mathcal{Y}$, $i \neq j$, the similarity between classes $i$ and $j$ can be calculated as follows:²

$$\text{sim}(i, j; K) = \frac{\alpha}{z} \text{WordNet}(w_i, w_j) \in (0, \alpha), \tag{21}$$

where $w_i$ and $w_j$ are the words corresponding to classes $i$ and $j$, respectively, $\text{WordNet}(w_i, w_j)$ is the semantic similarity between $w_i$ and $w_j$, and $z$ is a normalization factor.

² Same as footnote 1.

In addition, WordNet stores the hypernym relation to represent the Is-A relation. For example, the hypernym of "cat" is "animal", corresponding to the semantics that a cat is a type of animal. Based on the hypernym relation, we can obtain the tree structure on the class space by Algorithm 1:

$$Tree(Z, E) = \text{WordNet2ClassTree}(\text{WordNet}, \mathcal{Y}), \tag{22}$$

where $Z$ is the set of nodes and $E$ is the set of edges. In this paper, we assume that there is no hypernym relation between the classes in $\mathcal{Y}$; obviously, $\mathcal{Y} \subseteq Z$ is then the set of leaves. Given a $Tree(Z, E)$ on $\mathcal{Y}$, we can generate FERs on $\mathcal{Y}$ from $Tree(Z, E)$; see Algorithms 2 and 3.

Algorithm 1: Generate Class Tree Structure from WordNet, denoted as WordNet2ClassTree(·)
Input: (1) Knowledge base WordNet. (2) Class space $\mathcal{Y}$. $\forall k \in \mathcal{Y}$, let $w_k$ be the concept name of class $k$; $\forall y_1, y_2 \in \mathcal{Y}$, there is no hypernym relation between $w_{y_1}$ and $w_{y_2}$.
Output: A tree structure $Tree(Z, E)$ on $\mathcal{Y}$, where $Z$ is the set of nodes, $E$ is the set of edges, and $\mathcal{Y} \subseteq Z$ is the set of leaves.
1: Initialize the set of words corresponding to the class space $\mathcal{Y}$: $W \leftarrow \{w_y \mid y \in \mathcal{Y}\}$;
// (1) Mark the nodes corresponding to the words in $W$.
2: Initialize queue: $Q \leftarrow \{$WordNet.entity.n.01$\}$; // entity.n.01 is the common hypernym of all other words
3: while $Q$ is not empty do
4:   node ← Q.out(); // dequeue
5:   if node.word belongs to $W$ then
6:     Mark node and its hypernym nodes as valid, recursively;
7:   else
8:     Mark node as invalid;
9:   end if
10:  node_set ← node.hyponyms(); // the set of hyponyms of node
11:  for node′ ∈ node_set do
12:    Q.in(node′); // enqueue
13:  end for
14: end while
// (2) Delete the invalid nodes.
15: Initialize queue: $Q \leftarrow \{$WordNet.entity.n.01$\}$;
16: while $Q$ is not empty do
17:   node ← Q.out();
18:   if node is valid then
19:     node_set ← node.hyponyms();
20:     for node′ ∈ node_set do
21:       Q.in(node′);
22:     end for
23:   else
24:     Delete the edge between node and its hypernym node;
25:   end if
26: end while
27: Return the tree with WordNet.entity.n.01 as the root;

Algorithm 2: Generate Hierarchical Structure from Class Tree Structure, denoted as ClassTree2HieStru(·)
Input: (1) Class space $\mathcal{Y}$. (2) Class tree structure $Tree(Z, E)$, where $Z$ is the set of nodes, $E$ is the set of edges, and $\mathcal{Y} \subseteq Z$ is the set of leaves.
Output: A set of partitions of $\mathcal{Y}$.
1: Initialize the set of partitions $H \leftarrow \emptyset$;
2: Initialize the partition $P \leftarrow \emptyset$;
3: Initialize the set of nodes $N \leftarrow \{Tree(Z, E).\text{root}\}$;
4: while $|P| < |\mathcal{Y}|$ do
5:   $P' \leftarrow \emptyset$, $N' \leftarrow \emptyset$;
6:   for node ∈ $N$ do
7:     $P' \leftarrow P' \cup \{\text{node.leaf\_set}\}$;
8:     $N' \leftarrow N' \cup \text{node.child\_set}$;
9:   end for
10:  $P \leftarrow P'$, $N \leftarrow N'$;
11:  $H \leftarrow H \cup \{P\}$; // duplicate partitions are deleted
12: end while
13: Return $H$;

Algorithm 3: Generate FER from Class Hierarchical Structure, denoted as ClassHieStru2FER(·)
Input: (1) Class space $\mathcal{Y}$. (2) A hierarchical structure $H$ on $\mathcal{Y}$. (3) Fuzziness parameter $0 < \alpha < 1$. (4) A strictly monotonically increasing function $f : \{1, 2, \ldots, |\mathcal{Y}| - 1\} \to (0, \alpha)$.
Output: An $\alpha$-FER on $\mathcal{Y}$.
1: Initialize the FSR $S \leftarrow \{0\}^{|\mathcal{Y}| \times |\mathcal{Y}|}$;
2: Sort the partitions in $H$ from coarsest to finest according to the $\preceq$ described in Definition A.10; let $P_1 \succeq P_2 \succeq \cdots \succeq P_{|H|}$ be the sorted result;
3: for $i = 1, 2, \ldots, |\mathcal{Y}|$ do
4:   $s_{ii} \leftarrow 1$;
5:   for $j = i + 1, i + 2, \ldots, |\mathcal{Y}|$ do
6:     for $k = |H| - 1, |H| - 2, \ldots, 1$ do
7:       if $\exists A \in P_k$ such that $i, j \in A$ then
8:         $s_{ij} \leftarrow f(|P_k|)$, $s_{ji} \leftarrow f(|P_k|)$;
9:         break; // Theorem 2.5 in (Zhang & Zhang, 2014)
10:      end if
11:    end for
12:  end for
13: end for

(3) In many applications, there is a natural hierarchical structure between classes (Silla & Freitas, 2011; Wang et al., 2022). The class knowledge $K$ can then be represented as a tree structure $Tree(Z, E)$ on $\mathcal{Y}$, where $Z$ is the set of nodes and $E$ is the set of edges. If there is no hypernym relation between the classes in $\mathcal{Y}$, then $\mathcal{Y} \subseteq Z$ is the set of leaves. By Algorithm 2, we can derive a hierarchical structure on $\mathcal{Y}$ from $Tree(Z, E)$, i.e.,

$$H = \text{ClassTree2HieStru}(Tree(Z, E)). \tag{23}$$

Given the above hierarchical structure $H$ on $\mathcal{Y}$, according to Theorem A.23, we can obtain an $\alpha$-FER on $\mathcal{Y}$, i.e.,

$$S = \text{ClassHieStru2FER}(H) \quad \text{(see Algorithm 3)}. \tag{24}$$

Let $S^K = S$, which is the required $\alpha$-FER on $\mathcal{Y}$.
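A rough nltk-based sketch of the idea behind Algorithm 1; this is our simplification, not the paper's code. Instead of pruning the full WordNet tree breadth-first, it collects the hypernym paths of the class synsets, whose union of edges is the pruned structure rooted at entity.n.01 (WordNet hypernyms can form a DAG, which Algorithm 1's marking scheme also has to handle).

```python
from nltk.corpus import wordnet as wn

def class_tree_edges(class_synsets):
    """Collect the hypernym chains of the class synsets; the union of
    their edges is the structure Algorithm 1 keeps after pruning."""
    edges = set()
    for s in class_synsets:
        for path in s.hypernym_paths():               # chains from entity.n.01 down to s
            edges |= {(p.name(), c.name()) for p, c in zip(path, path[1:])}
    return edges

classes = [wn.synset(n) for n in
           ['car.n.01', 'airplane.n.01', 'tiger.n.02', 'cat.n.01', 'dog.n.01']]
edges = class_tree_edges(classes)                     # cf. Example B.1 and Figure 3(b)
```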
The following example demonstrates how to use the hypernym relation in WordNet to construct a class FSR.

Example B.1. Given the class space $\mathcal{Y} = \{\text{car, airplane, tiger, cat, dog}\}$ and WordNet 3.0, Figure 3 shows the process of sequentially calling Algorithms 1, 2, and 3 to obtain a FSR on $\mathcal{Y}$.

[Figure 3. The execution processes of Algorithms 1, 2, and 3: (a) the WordNet tree rooted at entity.n.01; (b) the pruned 15-layer class tree $Tree(Z, E)$ with leaves car.n.01, airplane.n.01, tiger.n.02, cat.n.01, and dog.n.01; (c) the hierarchical structure $H$ of 5 distinct partitions; (d) the resulting FSR $S$.]

(1) As shown in Figure 3(a), the WordNet tree with entity.n.01 as the root node has a total of 96308 nodes, of which 74897 are leaf nodes, and its depth is 20. (2) We input $\mathcal{Y}$ and WordNet into Algorithm 1; by deleting the nodes that are not related to $\mathcal{Y}$, Algorithm 1 obtains a tree structure $Tree(Z, E)$ on $\mathcal{Y}$ (see Figure 3(b)). Its depth is 15, and there are 31 nodes and 5 leaves; each leaf corresponds to a class in $\mathcal{Y}$. In $Tree(Z, E)$, the positions of the classes reflect the semantic similarity between them. For example, the path between "cat" and "tiger" is shorter than the path between "cat" and "dog", which indicates that the semantic similarity between "cat" and "tiger" is larger. (3) Next, we input $Tree(Z, E)$ into Algorithm 2; the intermediate results of Algorithm 2 are shown between Figures 3(b) and 3(c). There are 5 distinct partitions among the 15 layers of $Tree(Z, E)$:

$P_1 = \cdots = P_4 = \{\{\text{car, airplane, tiger, cat, dog}\}\}$,
$P_5 = P_6 = \{\{\text{car, airplane}\}, \{\text{tiger, cat, dog}\}\}$,
$P_7 = \{\{\text{car}\}, \{\text{airplane}\}, \{\text{tiger, cat, dog}\}\}$,
$P_8 = \cdots = P_{13} = \{\{\text{car}\}, \{\text{airplane}\}, \{\text{tiger, cat}\}, \{\text{dog}\}\}$,
$P_{14} = P_{15} = \{\{\text{car}\}, \{\text{airplane}\}, \{\text{tiger}\}, \{\text{cat}\}, \{\text{dog}\}\}$.

Obviously, these 5 partitions form an ordered queue under $\preceq$ (see Figure 3(c)). (4) We input $H$ into Algorithm 3; the intermediate results of Algorithm 3 are shown between Figures 3(c) and 3(d). Given the hierarchical structure $H$ and the set of thresholds $\{\lambda_i\}_{i=1}^{5}$, executing lines 3-12 of Algorithm 3 yields a FSR $S$ on $\mathcal{Y}$, as shown in Figure 3(d). Obviously, $S$ is also a FER on $\mathcal{Y}$.

B.2. Knowledge Calibration

In practice, the sources of the data $D$ and the class knowledge $K$ are different, so the scales by which they characterize class similarity are not entirely the same. In addition, there may be certain differences between the class similarities reflected by the data and by the knowledge (Wang et al., 2020). For example, whales and sharks have significant visual similarities, yet in biology whales and sharks belong to two major categories, mammals and fish, respectively. Therefore, it is necessary to calibrate the $S^K$ obtained by formula (5) before using it to guide the learning process.

B.2.1. Optimization Modeling

By solving formula (1), FLM can learn an FSR on the input space $\mathcal{X}$ from the training data $D$. Let $\Theta^*$ be the (local) optimal solution of formula (1).
B.2. Knowledge Calibration

In practice, the sources of the data D and of the class knowledge K are different, so the scales by which they characterize class similarity are not entirely the same. In addition, there may be certain differences between the class similarity reflected by the data and that reflected by the knowledge (Wang et al., 2020). For example, whales and sharks have significant visual similarities, yet in biology, whales and sharks belong to two major categories, mammals and fish, respectively. Therefore, it is necessary to calibrate the $S^K$ obtained by formula (5) before using it to guide the learning process.

B.2.1. OPTIMIZATION MODELING

By solving formula (1), FLM can learn an FSR on the input space $\mathcal{X}$ from the training data D. Let $\Theta^*$ be the (local) optimal solution of formula (1). We can use $f((\cdot, \cdot); \Theta^*)$ and the training data D to estimate the similarity between classes, i.e.,

$$S^D \in [0,1]^{|\mathcal{Y}| \times |\mathcal{Y}|}, \ \forall i, j \in \mathcal{Y}, \ s^D_{ij} = \begin{cases} 1, & i = j \\ \frac{\sum_{x_k \in X_i} \sum_{x_l \in X_j} f((x_k, x_l); \Theta^*)}{|X_i| \, |X_j|}, & i \neq j \end{cases}. \quad (25)$$

Then, we design the following optimization problem³:

$$\min_{S^{(D,K)}} \|S^{(D,K)} - S^D\|_F^2 \quad \mathrm{s.t.} \quad s^{(D,K)}_{ij} = s^{(D,K)}_{kl}, \ \forall i, j, k, l \in \mathcal{Y}, \ s^K_{ij} = s^K_{kl}; \quad s^{(D,K)}_{ij} > s^{(D,K)}_{kl}, \ \forall i, j, k, l \in \mathcal{Y}, \ s^K_{ij} > s^K_{kl}; \quad s^{(D,K)}_{ii} = 1, \ \forall i \in \mathcal{Y}; \quad s^{(D,K)}_{ij} = s^{(D,K)}_{ji}, \ \forall i, j \in \mathcal{Y}, i \neq j; \quad 0 < s^{(D,K)}_{ij} < \alpha, \ \forall i, j \in \mathcal{Y}, i \neq j. \quad (26)$$

³ In practice, a small constant $\epsilon > 0$ can be introduced to ensure the strict inequality constraints.

For formula (26), we have the following statements. (1) The objective function measures the closeness between the variable $S^{(D,K)}$ and $S^D$. (2) The first two constraints ensure that the ranking of the elements in the optimization variable $S^{(D,K)}$ is consistent with that in $S^K$, which can maintain the hierarchical structure derived by $S^K$ (see Theorem 4.8 and Definition 4.9). (3) The last three constraints require that the optimization variable $S^{(D,K)}$ is an $\alpha$-FSR on $\mathcal{Y}$ (see Definition 3.1). (4) It can be transformed into a standard convex quadratic programming problem (Boyd & Vandenberghe, 2004).

B.2.2. OPTIMIZATION SOLVING

In formula (26), the sizes of the optimization variable and of the constraint set are $O(|\mathcal{Y}|^2)$ and $O(|\mathcal{Y}|^4)$, respectively. When $|\mathcal{Y}|$ is large, solving it directly incurs huge time and space costs. Fortunately, we can transform it into the following problem:

$$\min_{t \in \mathbb{R}^{|V^K|}} t^T D t - 2 \bar{s}^T D t \quad \mathrm{s.t.} \quad t_i < t_{i+1}, \ i = 1, \cdots, |V^K|-1; \quad 0 < t_i < \alpha, \ i = 1, \cdots, |V^K|, \quad (27)$$

where

$$V^K = \{ s^K_{ij} \mid i, j \in \mathcal{Y}, i \neq j \}, \quad (28a)$$
$$\forall i = 1, \cdots, |V^K|, \quad I_i = \{ (k, l) \mid k, l \in \mathcal{Y}, \ \mathrm{sort}(s^K_{kl}; V^K) = i \}, \quad (28b)$$
$$\forall i = 1, \cdots, |V^K|, \quad \bar{s}_i = \frac{\sum_{(k,l) \in I_i} s^D_{kl}}{|I_i|}, \quad (28c)$$
$$\bar{s} = (\bar{s}_1, \bar{s}_2, \cdots, \bar{s}_{|V^K|})^T \in \mathbb{R}^{|V^K|}, \quad (28d)$$
$$D \in \mathbb{R}^{|V^K| \times |V^K|}, \ \forall i, j = 1, \cdots, |V^K|, \ d_{ij} = \begin{cases} |I_i|, & i = j \\ 0, & i \neq j \end{cases}. \quad (28e)$$

According to (28), $|V^K| < \frac{1}{2}|\mathcal{Y}|^2$. In practice, the class knowledge K is usually expressed by higher-order semantic information, i.e., $|V^K| \ll \frac{1}{2}|\mathcal{Y}|^2$. Therefore, formula (27) can be solved much more efficiently. Importantly, by solving formula (27), we can obtain the optimal solution of formula (26).

Theorem B.2 (see Appendix C.3.1 for proof). Let $t^*$ be the optimal solution of formula (27). Let

$$S^{(D,K)} \in [0,1]^{|\mathcal{Y}| \times |\mathcal{Y}|}, \ \forall i, j \in \mathcal{Y}, \ s^{(D,K)}_{ij} = \begin{cases} 1, & i = j \\ t^*_{\mathrm{sort}(s^K_{ij}; V^K)}, & i \neq j \end{cases}. \quad (29)$$

The following propositions are true. (1) $S^{(D,K)}$ is an $\alpha$-FSR on $\mathcal{Y}$. (2) $S^{(D,K)}$ is the optimal solution of formula (26).
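Formulas (27)-(29) amount to fitting one value per similarity level to the averaged data similarities $\bar{s}_i$, under a monotonicity constraint and the box $(0, \alpha)$. Below is a minimal sketch (ours) that solves the non-strict relaxation with scikit-learn's weighted isotonic regression and then applies the footnote-style $\epsilon$ trick to restore strict inequalities; it is an illustration of the calibration step, not the paper's solver.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_fsr(SK, SD, alpha, eps=1e-6):
    n = SK.shape[0]
    iu = np.triu_indices(n, k=1)                 # off-diagonal upper triangle
    levels, inv = np.unique(SK[iu], return_inverse=True)  # V^K, ascending
    counts = np.bincount(inv)                    # |I_i|, formula (28b)
    s_bar = np.bincount(inv, weights=SD[iu]) / counts     # formula (28c)
    # weighted isotonic fit: min sum_i |I_i|(t_i - s_bar_i)^2, t nondecreasing
    t = IsotonicRegression(increasing=True).fit_transform(
        np.arange(len(levels)), s_bar, sample_weight=counts)
    # epsilon trick: make the order strict and keep t inside (0, alpha)
    t = np.clip(t + eps * np.arange(len(levels)), eps, alpha - eps)
    S = np.eye(n)                                # assemble S^(D,K), formula (29)
    S[iu] = t[inv]
    S.T[iu] = t[inv]                             # symmetry
    return S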
$$\hat{D} \in \mathbb{R}^{|V^K_{\mathrm{coa}}| \times |V^K_{\mathrm{coa}}|}, \ \forall i, j = 1, \cdots, |V^K_{\mathrm{coa}}|, \ \hat{d}_{ij} = \begin{cases} |\hat{I}_i|, & i = j \\ 0, & i \neq j \end{cases}. \quad (33e)$$

Comparing formula (32) with formula (27), the sizes of the optimization variable and the constraint set in formula (32) are both $O(|V^K_{\mathrm{coa}}|)$. According to formulas (30b), (31), and (33a), $|V^K_{\mathrm{coa}}| < |V^K|$. In practice, by setting an appropriate coarsening function $f_{\mathrm{coa}}$, we can achieve $|V^K_{\mathrm{coa}}| \ll |V^K|$. This not only reduces the scale of the optimization problem, but also makes more rational use of the class knowledge K. At last, let $r^*$ be the optimal solution of formula (32). Similar to (29), we can obtain a calibrated $\alpha$-FSR on $\mathcal{Y}$:

$$S^{(D,K)}_{\mathrm{coa}} \in [0,1]^{|\mathcal{Y}| \times |\mathcal{Y}|}, \ \forall i, j \in \mathcal{Y}, \ s^{(D,K)}_{\mathrm{coa},ij} = \begin{cases} 1, & i = j \\ r^*_{\mathrm{sort}(f_{\mathrm{coa}}(s^K_{ij}); V^K_{\mathrm{coa}})}, & i \neq j \end{cases}. \quad (34)$$

$S^{(D,K)}_{\mathrm{coa}}$ is not the optimal solution of formula (26). However, $S^{(D,K)}_{\mathrm{coa}}$ selectively preserves the more robust sorting information contained in $S^K$. In addition, let $S^K$ be obtained by formula (5), and let $U = \{ s^K_{ij} \mid i, j \in \mathcal{Y} \}$. According to Theorems 4.4 and 4.7, if $|U| > |\mathcal{Y}|$, then there are at least $|U| - |\mathcal{Y}|$ redundant elements in $U$. Therefore, the coarsening of $S^K$ is reasonable.

B.4. Algorithm Description of HC-HFLM

Algorithm 4 shows the workflow of the proposed method.

Algorithm 4 Human Cognition-Inspired Hierarchical Fuzzy Learning Machine, denoted as HC-HFLM
Input:
Task. The $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem.
Data. The training data $D = \{(x_i, y_i) \mid x_i \in \mathcal{X}, y_i = f_c(x_i) \in \mathcal{Y}\}_{i=1}^{n}$ and the test data $D_{te} = \{x_j \mid x_j \in \mathcal{X}\}_{j=n+1}^{n+m}$.
Knowledge. The class knowledge K related to the concepts in the class space $\mathcal{Y}$.
Setting. The hyper-parameters for controlling the degree of fuzziness $0 < \alpha < \beta < 1$. The network architecture of FLM $f((\cdot, \cdot); \Theta) = g((h(\cdot; \Theta), h(\cdot; \Theta)))$. The regular term $R(\cdot)$. The trade-off parameter $\gamma$. The number of exemplars required for each concept $k \in \mathcal{Y}$, $n_{exe,k}$. The way of pre-processing K, i.e., the value of case in formula (6).
Output: The prediction result $\{(x_j, \hat{y}_j) \mid x_j \in D_{te}\}$.
// (1) Pre-training FLM (see Section 2.3)
1: Initialize the learnable parameters $\Theta$ randomly;
2: Use the stochastic gradient descent method to solve formula (1) and obtain the local optimal solution $\Theta^*$;
// (2) Mining the class FSR from the class knowledge K (see Section 3)
3: Initialize the class FSR $T \leftarrow$ NULL;
4: Construct the class FSR $S^K$ from the class knowledge K by formula (5) and Appendix B.1;
5: if case = case 1 of formula (6) then
6:   $T \leftarrow S^K$;
7: else
8:   Compute $S^D$ from D and $f((\cdot, \cdot); \Theta^*)$ by formula (25);
9:   if case = case 2 of formula (6) then
10:    Construct formula (27) from $S^K$ and $S^D$ by formula (28);
11:    Solve formula (27) and obtain the optimal solution $t^*$;
12:    Construct $S^{(D,K)}$ from $t^*$ by formula (29);
13:    $T \leftarrow S^{(D,K)}$;
14:  else if case = case 3 of formula (6) then
15:    Coarsen $S^K$ by formula (31) and obtain $S^K_{\mathrm{coa}}$;
16:    Construct formula (32) from $S^K_{\mathrm{coa}}$ and $S^D$ by formula (33);
17:    Solve formula (32) and obtain the optimal solution $r^*$;
18:    Construct $S^{(D,K)}_{\mathrm{coa}}$ from $r^*$ by formula (34);
19:    $T \leftarrow S^{(D,K)}_{\mathrm{coa}}$;
20:  end if
21: end if
// (3) Learning the sample FSR, guided by T (see Section 3)
22: Initialize the learnable parameters $\Theta \leftarrow \Theta^*$;
23: Use the stochastic gradient descent method to solve formula (9) and obtain the local optimal solution $\Theta^{**}$;
// (4) Concept representation and prediction (see Section 3)
24: Use $f((\cdot, \cdot); \Theta^{**})$ to select an exemplar set for each class in $\mathcal{Y}$ by formula (3) and obtain $E_k$, $k \in \mathcal{Y}$;
25: $\forall x_j \in D_{te}$, use $f((\cdot, \cdot); \Theta^{**})$ and $E_k$, $k \in \mathcal{Y}$ to obtain the predicted class label $\hat{y}_j$ by formula (4);

B.5. The Interpretability of Knowledge Representation

The following example demonstrates the FSR and hierarchical structure derived from class knowledge.

Example B.3. Given a class space $\mathcal{Y}$ = {cat, tiger, dog}. (1) Based on the concept names corresponding to the classes, we can collect class knowledge K related to $\mathcal{Y}$ from the human knowledge system. Assume that the class FSR constructed from the class knowledge K is T((cat, cat)) = T((tiger, tiger)) = T((dog, dog)) = 1 > T((cat, tiger)) = T((tiger, cat)) = 0.7 > T((cat, dog)) = T((dog, cat)) = 0.6 > T((tiger, dog)) = T((dog, tiger)) = 0.3. The matrix representation of T is

$$T = \begin{pmatrix} 1 & 0.7 & 0.6 \\ 0.7 & 1 & 0.3 \\ 0.6 & 0.3 & 1 \end{pmatrix}.$$

(2) The view of Euclidean space. Given the class FSR matrix T, each row of T corresponds to a description vector of a concept, and each feature dimension corresponds to a concept, which has clear semantics. Therefore, the FSR matrix T has high interpretability. (3) The view of quotient space. Given the class FSR T, according to Definitions 4.3 and 4.9, we can obtain the hierarchical structure derived by T, i.e.,

$$(Q(\mathcal{Y}, T), \preceq) = \{ O_1 = \{\{\text{cat}\}, \{\text{tiger}\}, \{\text{dog}\}\} \preceq O_2 = \{\{\text{cat, tiger}\}, \{\text{dog}\}\} \preceq O_3 = \{\{\text{cat, tiger, dog}\}\} \}.$$

For $(Q(\mathcal{Y}, T), \preceq)$, we have the following observations. When the threshold $\lambda = 1$, we obtain the finest partition $O_1 = \mathcal{Y}/\mathrm{traclo}(T)_{[1]}$; in $O_1$, each class forms an equivalence class. When the threshold $\lambda = 0.7$, "cat" and "tiger" are merged into the equivalence class {cat, tiger}, so we obtain the coarser partition $O_2$. When the threshold $\lambda = 0.6$, "cat" and "dog" are further merged, yielding the equivalence class {cat, tiger, dog}, so we obtain the coarsest partition $O_3$. When the threshold $\lambda \in [0, 0.6)$, we also obtain the coarsest partition $O_3$. From the above process, we can see that the > relation of class similarity is transformed into the $\preceq$ relation between partitions in the quotient space, and the set of partitions forms a hierarchical structure in the quotient space. According to Theorem 4.8, the same class similarity ranking leads to the same hierarchical structure. Therefore, the hierarchical structure can effectively capture the understanding of concepts contained in the class knowledge K. In addition, each element in the hierarchical structure is a partition of $\mathcal{Y}$, and each element in a partition is a subset of $\mathcal{Y}$, both of which have clear semantics and high interpretability.
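The computations behind Example B.3 are short enough to sketch directly: the transitive closure by repeated max-min squaring (Theorem 4.1) and $\lambda$-cuts at the values occurring in T (Theorem 4.4). The function names are ours.

import numpy as np

def traclo(S):
    # Transitive closure by repeated max-min squaring (Theorem 4.1).
    T = S.copy()
    while True:
        T2 = np.max(np.minimum(T[:, :, None], T[None, :, :]), axis=1)
        if np.allclose(T2, T):
            return T
        T = T2

def quotient_hierarchy(S):
    # Partitions A/traclo(S)_[lambda]; by Theorem 4.4 only thresholds in
    # V = {S((a, b))} matter.
    T = traclo(S)
    hierarchy = []
    for lam in sorted({round(v, 12) for v in T.flatten()}, reverse=True):
        cut = T >= lam                      # lambda-cut: an ER on A
        blocks = {tuple(np.nonzero(row)[0]) for row in cut}
        if blocks not in hierarchy:
            hierarchy.append(blocks)
    return hierarchy                        # finest partition first

T = np.array([[1.0, 0.7, 0.6],
              [0.7, 1.0, 0.3],
              [0.6, 0.3, 1.0]])             # Example B.3: cat, tiger, dog
for P in quotient_hierarchy(T):
    print(P)
# {{cat},{tiger},{dog}} -> {{cat,tiger},{dog}} -> {{cat,tiger,dog}}

Note that the closure raises T((tiger, dog)) from 0.3 to 0.6 (via "cat"), so exactly the three partitions $O_1 \preceq O_2 \preceq O_3$ of Example B.3 appear.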
C.1. Proofs of Theorems in Section 3

C.1.1. PROOF OF THEOREM 3.3

Proof. For proposition (1). For the sake of discussion, we copy formula (7) as follows:

$$V = \{t_{ij} \mid i, j \in \mathcal{Y}\}, \quad U = V \cup \{0, \alpha, \beta\}, \quad u = \mathrm{sortvec}(U) = (u_1, u_2, \cdots, u_{|U|})^T.$$

Because T is an $\alpha$-FSR on $\mathcal{Y}$, we have $\forall i \in \mathcal{Y}, t_{ii} = 1$ and $\forall i, j \in \mathcal{Y}, i \neq j, 0 < t_{ij} = t_{ji} < \alpha$; therefore $0 \notin V$, $\alpha \notin V$, $\beta \notin V$. Based on this, according to Definition 3.2,

$$\mathrm{sort}(1; U) = |U|, \ u_{|U|} = 1, \quad (35a)$$
$$\mathrm{sort}(\beta; U) = |U|-1, \ u_{|U|-1} = \beta, \quad (35b)$$
$$\mathrm{sort}(\alpha; U) = |U|-2, \ u_{|U|-2} = \alpha, \quad (35c)$$
$$\mathrm{sort}(0; U) = 1, \ u_1 = 0, \quad (35d)$$
$$\forall v \in V, \ 2 \leq \mathrm{sort}(v; U) \leq |U|. \quad (35e)$$

According to the definition of $L_1$, $L_1 = 0$ iff

$$L_{1a} = \sum_{i,j=1, i \neq j, y_i = y_j}^{n} L_{HA}(s_{ij}, y_i, y_j) = 0 \quad \text{and} \quad L_{1b} = \sum_{i,j=1, i \neq j, y_i \neq y_j}^{n} L_{HA}(s_{ij}, y_i, y_j) = 0.$$

Because $L_{1a} = 0$, according to (8), $\forall i, j = 1, \cdots, n$ with $i \neq j, y_i = y_j$, $L_{HA}(s_{ij}, y_i, y_j) = 0$, i.e., $s_{ij} \in [u_{\mathrm{sort}(t_{y_i y_j}; U)-1}, u_{\mathrm{sort}(t_{y_i y_j}; U)}]$. Because T is an FSR on $\mathcal{Y}$ and $y_i = y_j$, $t_{y_i y_j} = t_{y_i y_i} = 1$. According to formulas (35a) and (35b), $u_{\mathrm{sort}(t_{y_i y_j}; U)-1} = u_{\mathrm{sort}(1; U)-1} = u_{|U|-1} = \beta$ and $u_{\mathrm{sort}(t_{y_i y_j}; U)} = u_{\mathrm{sort}(1; U)} = u_{|U|} = 1$. Therefore $s_{ij} \in [\beta, 1]$. According to formula (2), $L_{FP}(s_{ij}, y_i, y_j) = \max[\beta - s_{ij}, 0] = 0$. Therefore

$$L_{2a} = \sum_{i,j=1, i \neq j, y_i = y_j}^{n} L_{FP}(s_{ij}, y_i, y_j) = 0.$$

Because $L_{1b} = 0$, according to formula (8), $\forall i, j = 1, \cdots, n$ with $i \neq j, y_i \neq y_j$, $L_{HA}(s_{ij}, y_i, y_j) = 0$, i.e., $s_{ij} \in [u_{\mathrm{sort}(t_{y_i y_j}; U)-1}, u_{\mathrm{sort}(t_{y_i y_j}; U)}]$. Because T is an $\alpha$-FSR on $\mathcal{Y}$, $0 < t_{y_i y_j} < \alpha$.
According to Definition 3.2 and formula (35e), we have $2 - 1 \leq \mathrm{sort}(t_{y_i y_j}; U) - 1 < \mathrm{sort}(t_{y_i y_j}; U) < \mathrm{sort}(\alpha; U)$, and $0 = u_1 \leq u_{\mathrm{sort}(t_{y_i y_j}; U)-1} < u_{\mathrm{sort}(t_{y_i y_j}; U)} < u_{\mathrm{sort}(\alpha; U)} = \alpha$. Therefore $s_{ij} \in [0, \alpha)$. According to formula (2), $L_{FP}(s_{ij}, y_i, y_j) = \max[s_{ij} - \alpha, 0] = 0$. Therefore

$$L_{2b} = \sum_{i,j=1, i \neq j, y_i \neq y_j}^{n} L_{FP}(s_{ij}, y_i, y_j) = 0.$$

Therefore $L_2 = L_{2a} + L_{2b} = 0 + 0 = 0$. So proposition (1) is proven.

For proposition (2). Because $L_1 = 0$, according to proposition (1), $L_2 = 0$. According to Theorem 2 in the literature (Cui & Liang, 2022), $L_3 = 0$. So proposition (2) is proven. In summary, propositions (1) and (2) are proven.

C.2. Proofs of Theorems in Section 4

C.2.1. PROOF OF THEOREM 4.1

(1) We give some properties of the $\circ$ operation described in Definition A.20.

Lemma C.1. Given a set A and a fuzzy binary relation F on A, the following propositions are true. (1) If F satisfies reflexivity, then $F^2 = F \circ F$ satisfies reflexivity. (2) If F satisfies reflexivity, then $F \subseteq F^2 = F \circ F$. (3) If F satisfies symmetry, then $F^2 = F \circ F$ satisfies symmetry. (4) F satisfies transitivity iff $F^2 = F \circ F \subseteq F$.

Proof. For proposition (1). If F satisfies reflexivity, i.e., $\forall a \in A, F((a, a)) = 1$, then $\forall b \in A$,

$$F^2((b, b)) = \max_{c \in A} \min[F((b, c)), F((c, b))] = \max\Big[\max_{c \in A, c \neq b} \min[F((b, c)), F((c, b))], \min[F((b, b)), F((b, b))]\Big] = \max\Big[\max_{c \in A, c \neq b} \min[F((b, c)), F((c, b))], 1\Big] = 1,$$

i.e., $F^2$ satisfies reflexivity. So proposition (1) is proven.

For proposition (2). If F satisfies reflexivity, i.e., $\forall c \in A, F((c, c)) = 1$, then $\forall a, b \in A$,

$$F^2((a, b)) = \max_{c \in A} \min[F((a, c)), F((c, b))] = \max\Big[\max_{c \in A, c \neq a} \min[F((a, c)), F((c, b))], \min[1, F((a, b))]\Big] = \max\Big[\max_{c \in A, c \neq a} \min[F((a, c)), F((c, b))], F((a, b))\Big] \geq F((a, b)),$$

i.e., $F \subseteq F^2$. So proposition (2) is proven.

For proposition (3). If F satisfies symmetry, i.e., $\forall a, b \in A, F((a, b)) = F((b, a))$, then $\forall c, d \in A$,

$$F^2((c, d)) = \max_{e \in A} \min[F((c, e)), F((e, d))] = \max_{e \in A} \min[F((e, d)), F((c, e))] = \max_{e \in A} \min[F((d, e)), F((e, c))] = F^2((d, c)),$$

i.e., $F^2$ satisfies symmetry. So proposition (3) is proven.

For proposition (4). $F^2 \subseteq F \Leftrightarrow \forall a, b \in A, F((a, b)) \geq F^2((a, b)) = \max_{c \in A} \min[F((a, c)), F((c, b))]$, which is exactly the transitivity of F, i.e., proposition (4) is proven. In summary, propositions (1)-(4) are proven.

Lemma C.2. Given a set A and two fuzzy binary relations R, S on A, if $R \subseteq S$, then $R^2 \subseteq S^2$.

Proof. Because $R \subseteq S$, according to Definition A.3, $\forall a, b \in A, R((a, b)) \leq S((a, b))$. Therefore $\forall c, d \in A$,

$$R^2((c, d)) = \max_{e \in A} \min[R((c, e)), R((e, d))] \leq \max_{e \in A} \min[S((c, e)), S((e, d))] = S^2((c, d)),$$

i.e., $R^2 \subseteq S^2$.

(2) Based on these conclusions, we prove Theorem 4.1.

Proof. We use mathematical induction to prove proposition (1). When $k = 0$, $S^{2^0} = S^1 = S$ is an FSR on A. When $k = 1$, $S^{2^1} = S^2 = S \circ S$. Because S is an FSR on A, S satisfies reflexivity and symmetry; according to propositions (1) and (3) of Lemma C.1, $S^2$ satisfies reflexivity and symmetry. Therefore $S^2$ is an FSR on A. When $k = m > 1$, assume that $S^{2^m}$ is an FSR on A. When $k = m + 1$, $S^{2^{m+1}} = S^{2^m} \circ S^{2^m}$. Because $S^{2^m}$ is an FSR on A, it satisfies reflexivity and symmetry; according to propositions (1) and (3) of Lemma C.1, $S^{2^{m+1}}$ satisfies reflexivity and symmetry. Therefore $S^{2^{m+1}}$ is an FSR on A. So proposition (1) is proven.

For proposition (2). $\forall k = 0, 1, 2, \cdots$, $S^{2^{k+1}} = S^{2^k} \circ S^{2^k}$. Because S is an FSR on A, according to proposition (1), $S^{2^k}$ is an FSR on A, and therefore $S^{2^k}$ satisfies reflexivity. According to proposition (2) of Lemma C.1, $S^{2^k} \subseteq S^{2^k} \circ S^{2^k} = S^{2^{k+1}}$. So proposition (2) is proven.

For proposition (3).
First, let us prove the convergence. According to proposition (1), $\forall k = 0, 1, 2, \cdots$, $S^{2^k}$ is an FSR on A. According to Definition A.14, the sequence $\{S^{2^k}\}$ has an upper bound. According to proposition (2), the sequence $\{S^{2^k}\}$ is monotonically increasing. Therefore the sequence $\{S^{2^k}\}$ is convergent.

Second, let us prove that the sequence $\{S^{2^k}\}$ converges to the transitive-closure of S. Without losing generality, assume that $S^{2^{\hat{k}}} = S^{2^{\hat{k}+1}} = S^{2^{\hat{k}}} \circ S^{2^{\hat{k}}}$. (a) Because $S^{2^{\hat{k}+1}} = S^{2^{\hat{k}}}$, we have $S^{2^{\hat{k}+1}} \subseteq S^{2^{\hat{k}}}$; according to proposition (4) of Lemma C.1, $S^{2^{\hat{k}}}$ satisfies transitivity. (b) According to proposition (2), $S \subseteq S^{2^{\hat{k}}}$. (c) For any fuzzy binary relation T on A such that $S \subseteq T$ and T satisfies transitivity: because $S \subseteq T$, according to Lemma C.2,

$$S \subseteq T \Rightarrow S^2 \subseteq T^2 \Rightarrow S^4 \subseteq T^4 \Rightarrow \cdots \Rightarrow S^{2^{\hat{k}}} \subseteq T^{2^{\hat{k}}}. \quad (36)$$

Meanwhile, because T satisfies transitivity, according to proposition (4) of Lemma C.1, $T^2 \subseteq T$. According to Lemma C.2,

$$T^2 \subseteq T \Rightarrow T^4 \subseteq T^2 \Rightarrow \cdots \Rightarrow T^{2^{\hat{k}}} \subseteq T^{2^{\hat{k}-1}}. \quad (37)$$

Combining formulas (36) and (37), $S^{2^{\hat{k}}} \subseteq T^{2^{\hat{k}}} \subseteq T^{2^{\hat{k}-1}} \subseteq \cdots \subseteq T^2 \subseteq T$. (d) Combining (a)-(c), $S^{2^{\hat{k}}}$ satisfies the three propositions in Definition A.19. Therefore $S^{2^{\hat{k}}}$ is the transitive-closure of S. So proposition (3) is proven.

For proposition (4). According to proposition (3), $\mathrm{traclo}(S)$ can be written in the form of $S^{2^{\hat{k}}}$. And then, according to proposition (1), $\mathrm{traclo}(S)$ is an FSR on A. Because $\mathrm{traclo}(S)$ satisfies transitivity, according to Definition A.16, $\mathrm{traclo}(S)$ is an FER on A. So proposition (4) is proven. In summary, propositions (1)-(4) are proven.

C.2.2. PROOF OF THEOREM 4.2

(1) We introduce the following lemma.

Lemma C.3. Given a finite set A and an SR matrix S on A, let $G = (A, S)$ be the unweighted graph, where A is the set of nodes and S is the adjacency matrix of the graph. $\forall k \in \{1, 2, \ldots\}$, $\forall a, b \in A$, $S^k((a, b)) = 1 \Leftrightarrow$ in graph G, there is a path $a, \cdots, b$ of length k.

Proof. We use mathematical induction. When $k = 1$, $S^k = S^1 = S$. $\forall a, b \in A$, $S((a, b)) = 1 \Leftrightarrow$ there is an edge $(a, b)$ in graph $G \Leftrightarrow$ there is a path $a, b$ of length 1. When $k = 2$, $S^2 = S \circ S$. $\forall a, b \in A$,

$$S^2((a, b)) = 1 \Leftrightarrow \max_{c \in A} \min[S((a, c)), S((c, b))] = 1 \Leftrightarrow \exists c \in A, \ \mathbb{I}[S((a, c)) = 1] \cdot \mathbb{I}[S((c, b)) = 1] = 1$$

$\Leftrightarrow$ there are two edges $(a, c)$ and $(c, b)$ in graph $G \Leftrightarrow$ there is a path $a, c, b$ of length 2. When $k = m > 2$, assume that $\forall a, b \in A$, $S^m((a, b)) = 1 \Leftrightarrow$ there is a path $a, \cdots, b$ of length m in graph G. When $k = m + 1$, $S^{m+1} = S^m \circ S$. $\forall a, b \in A$,

$$S^{m+1}((a, b)) = 1 \Leftrightarrow \max_{c \in A} \min[S^m((a, c)), S((c, b))] = 1 \Leftrightarrow \exists c \in A, \ \mathbb{I}[S^m((a, c)) = 1] \cdot \mathbb{I}[S((c, b)) = 1] = 1$$

$\Leftrightarrow$ in graph G, there is a path $a, \cdots, c$ of length m and an edge $(c, b) \Leftrightarrow$ there is a path $a, \cdots, b$ of length $m + 1$. In summary, the lemma is proven.

Remark C.4. Comparing Definitions A.1 and A.14, it can be seen that a binary relation is a special type of fuzzy binary relation. Therefore, Definitions A.19, A.20, and A.21 are applicable to binary relations. Based on this, Lemmas C.1 and C.2 are applicable to binary relations, and hence Theorem 4.1 is applicable to binary relations. According to Theorem 4.1, given a finite set A and a similarity relation (SR) S on A, the transitive-closure of S can also be written as $S^{2^k}$.
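Lemma C.3 is easy to check numerically: over 0/1 matrices, the max-min power coincides with the Boolean walk-existence matrix. A small self-check (ours):

import numpy as np

def maxmin_power(S, k):
    # k-fold max-min composition S^k of a 0/1 relation matrix
    T = S.copy()
    for _ in range(k - 1):
        T = np.max(np.minimum(T[:, :, None], S[None, :, :]), axis=1)
    return T

def has_walk(S, a, b, k):
    # A walk of length exactly k exists iff the k-th integer matrix
    # power (which counts walks) is positive at (a, b).
    return np.linalg.matrix_power(S.astype(int), k)[a, b] > 0

S = np.eye(4, dtype=int)          # reflexive SR: every node has a self-loop
S[0, 1] = S[1, 0] = 1
S[2, 3] = S[3, 2] = 1
k = 3
P = maxmin_power(S, k)
assert all(P[a, b] == has_walk(S, a, b, k)
           for a in range(4) for b in range(4))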
(2) Based on the above lemma, we prove Theorem 4.2.

Proof. According to Remark C.4, $\exists k \in \{0, 1, 2, \cdots\}$ such that $\mathrm{traclo}(S) = S^{2^k}$. Because S is an FSR on A, according to Theorem A.18, $\forall \lambda \in [0, 1]$, $S_{[\lambda]}$ is an SR on A. According to Remark C.4, $\exists l \in \{0, 1, 2, \cdots\}$ such that $\mathrm{traclo}(S_{[\lambda]}) = (S_{[\lambda]})^{2^l}$. Let $m = \max[2^k, 2^l]$. According to Theorem 4.1, $\mathrm{traclo}(S) = S^m$ and $\mathrm{traclo}(S_{[\lambda]}) = (S_{[\lambda]})^m$. Therefore, to prove $\forall \lambda \in [0, 1]$, $\mathrm{traclo}(S)_{[\lambda]} = \mathrm{traclo}(S_{[\lambda]})$, we need to prove $\forall \lambda \in [0, 1]$, $(S^m)_{[\lambda]} = (S_{[\lambda]})^m$, i.e., $\forall \lambda \in [0, 1], \forall a, b \in A$, $(S^m)_{[\lambda]}((a, b)) = (S_{[\lambda]})^m((a, b))$. According to Definition A.17, $\forall \lambda \in [0, 1], \forall a, b \in A$, $(S^m)_{[\lambda]}((a, b)) \in \{0, 1\}$ and $(S_{[\lambda]})^m((a, b)) \in \{0, 1\}$. Therefore, we need to prove

$$\forall \lambda \in [0, 1], \forall a, b \in A, \quad (S^m)_{[\lambda]}((a, b)) = 1 \Leftrightarrow (S_{[\lambda]})^m((a, b)) = 1. \quad (38)$$

Next, we prove formula (38). Given the unweighted graph $G = (A, S_{[\lambda]})$, $\forall a, b \in A$, according to Lemma 2 of the literature (Cui & Liang, 2022), $(S^m)_{[\lambda]}((a, b)) = \mathbb{I}[S^m((a, b)) \geq \lambda] = 1 \Leftrightarrow S^m((a, b)) \geq \lambda \Leftrightarrow$ in graph G, there is a path $a, \cdots, b$ of length m. Meanwhile, according to Lemma C.3, in graph G there is a path $a, \cdots, b$ of length $m \Leftrightarrow (S_{[\lambda]})^m((a, b)) = 1$. That is to say, formula (38) is proven. So the theorem is proven.

C.2.3. PROOF OF THEOREM 4.4

Proof. According to Definition A.14, $\forall a, b \in A$, $S((a, b)) \in [0, 1]$. Therefore $V = \{S((a, b)) \mid a, b \in A\} \subseteq [0, 1]$. And then, according to Definition 4.3,

$$Q(A, S) = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\} \cup \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in [0, 1] \setminus V\}.$$

Therefore, to prove $Q(A, S) = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}$, we need to prove $\{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in [0, 1] \setminus V\} \subseteq \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}$, i.e., $\forall \lambda \in [0, 1], \lambda \notin V, \exists \omega \in V$ such that $A/\mathrm{traclo}(S_{[\lambda]}) = A/\mathrm{traclo}(S_{[\omega]})$. According to Definition A.8, it suffices to prove $\mathrm{traclo}(S_{[\lambda]}) = \mathrm{traclo}(S_{[\omega]})$; according to Definition A.19, it suffices to prove $S_{[\lambda]} = S_{[\omega]}$; according to Definition A.17, it suffices to prove $\forall \lambda \in [0, 1], \lambda \notin V, \exists \omega \in V$ such that

$$\forall a, b \in A, \quad S_{[\lambda]}((a, b)) = \mathbb{I}(S((a, b)) \geq \lambda) = \mathbb{I}(S((a, b)) \geq \omega) = S_{[\omega]}((a, b)). \quad (39)$$

$\forall \lambda \in [0, 1], \lambda \notin V$, let $\omega = \min_{v \in V, v > \lambda} v$; such an $\omega$ exists because $1 = S((a, a)) \in V$. Obviously, $\lambda < \omega$. Next, we prove formula (39) in two cases. $\forall a, b \in A$: (a) If $\mathbb{I}(S((a, b)) \geq \lambda) = 1$, i.e., $S((a, b)) \geq \lambda$: because $\lambda \notin V$ and $S((a, b)) \in V$, we have $S((a, b)) > \lambda$, and therefore $S((a, b)) \geq \min_{v \in V, v > \lambda} v = \omega$, i.e., $\mathbb{I}(S((a, b)) \geq \omega) = 1$. (b) If $\mathbb{I}(S((a, b)) \geq \lambda) = 0$, i.e., $S((a, b)) < \lambda$: because $\lambda < \omega$, we have $S((a, b)) < \lambda < \omega$, i.e., $\mathbb{I}(S((a, b)) \geq \omega) = 0$. Combining (a) and (b), formula (39) is true. So the theorem is proven.

C.2.4. PROOF OF THEOREM 4.5

Proof. According to Definition 4.3,

$$Q(A, S) = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in [0, 1]\}, \quad Q(A, \mathrm{traclo}(S)) = \{A/\mathrm{traclo}(\mathrm{traclo}(S)_{[\lambda]}) \mid \lambda \in [0, 1]\} = \{A/\mathrm{traclo}(S)_{[\lambda]} \mid \lambda \in [0, 1]\}.$$

Therefore, to prove $Q(A, S) = Q(A, \mathrm{traclo}(S))$, we need to prove $\forall \lambda \in [0, 1]$, $A/\mathrm{traclo}(S_{[\lambda]}) = A/\mathrm{traclo}(S)_{[\lambda]}$. According to Definition A.8, we need to prove $\forall \lambda \in [0, 1]$, $\mathrm{traclo}(S_{[\lambda]}) = \mathrm{traclo}(S)_{[\lambda]}$. According to Theorem 4.2, the above formula is true. So the theorem is proven.

C.2.5. PROOF OF THEOREM 4.6

Proof. For proposition (1). According to Definition 4.3, $Q(A, S)$ is a set of partitions on A. Meanwhile, according to Remark A.12, $\preceq$ is an order relation on the partitions on A. Therefore $(Q(A, S), \preceq)$ is a partially ordered set. So proposition (1) is proven.

For proposition (2). Let $T = \mathrm{traclo}(S)$. According to Theorem 4.5, $Q(A, S) = Q(A, T) = \{A/T_{[\lambda]} \mid \lambda \in [0, 1]\}$. According to Theorem 4.1, T is an FER on A. According to Theorem A.18, $\forall \lambda \in [0, 1]$, $T_{[\lambda]}$ is an ER on A. $\forall 0 \leq \lambda_1 < \lambda_2 \leq 1$, obviously, $T_{[\lambda_2]} \subseteq T_{[\lambda_1]}$. According to Remark A.11, $A/T_{[\lambda_2]} \preceq A/T_{[\lambda_1]}$. So proposition (2) is proven. In summary, the theorem is proven.

C.2.6. PROOF OF THEOREM 4.7

Proof. For proposition (1).
Because $P, Q \in Q(A, S)$ and $P \neq Q$, according to proposition (2) of Theorem 4.6, P and Q are comparable under the sense of $\preceq$. Without losing generality, assume that $P \preceq Q$; then $P \prec Q$. According to Lemma A.13, $|P| \neq |Q|$. So proposition (1) is proven.

For proposition (2). According to Definition 4.3, $\forall P \in Q(A, S)$, P is a partition on A. And then, according to Definition A.6,

$$1 = |\{A\}| \leq |P| \leq |\{\{a\} \mid a \in A\}| = |A|. \quad (40)$$

Combining proposition (1) and formula (40), $|Q(A, S)| \leq |A|$. So proposition (2) is proven.

For proposition (3). Let

$$C = \{|P| \mid P \in Q(A, S)\}, \quad (41a)$$
$$\forall c \in C, \ \pi(c) = \{u \mid u \in U, \ |A/\mathrm{traclo}(S_{[u]})| = c\}, \quad (41b)$$
$$V = \{\max_{u \in \pi(c)} u \mid c \in C\}. \quad (41c)$$

First, we prove $|V| = |Q(A, S)|$. According to proposition (1), $|C| = |Q(A, S)|$. And then, according to formula (41), $|C| = |V|$. Therefore $|V| = |Q(A, S)|$.

Second, we prove $V \subsetneq U$. According to formula (41), $V \subseteq U$. Because $|U| > |Q(A, S)| = |V|$, we have $V \subsetneq U$.

At last, we prove $Q(A, S) = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}$. According to Theorem 4.4, $Q(A, S) = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in U\}$. Therefore, we need to prove $\{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in U\} = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}$, i.e.,

$$\{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in U\} \supseteq \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}, \quad (42a)$$
$$\{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in U\} \subseteq \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}. \quad (42b)$$

Because $V \subseteq U$, (42a) is true. Next, we prove formula (42b). To prove formula (42b), we need to prove $\forall u \in U, \exists v \in V$ such that $A/\mathrm{traclo}(S_{[u]}) = A/\mathrm{traclo}(S_{[v]})$. $\forall u \in U$, $\exists c \in C$ such that $|A/\mathrm{traclo}(S_{[u]})| = c$. According to formula (41), $\exists v \in V$ such that $|A/\mathrm{traclo}(S_{[v]})| = c$, i.e., $\forall u \in U, \exists v \in V$ such that $|A/\mathrm{traclo}(S_{[u]})| = |A/\mathrm{traclo}(S_{[v]})|$. According to the contrapositive of proposition (1), $A/\mathrm{traclo}(S_{[u]}) = A/\mathrm{traclo}(S_{[v]})$, i.e., formula (42b) is true. So far, formulas (42a) and (42b) are proven. So proposition (3) is proven.

For proposition (4). According to Theorem 4.4, $Q(A, S) = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in U\}$; therefore $|U| \geq |Q(A, S)|$. Therefore, to prove $|U| = |Q(A, S)|$, we need to prove $\forall \lambda_1, \lambda_2 \in U, \lambda_1 < \lambda_2$, $A/\mathrm{traclo}(S_{[\lambda_1]}) \neq A/\mathrm{traclo}(S_{[\lambda_2]})$. Because S is an FER on A, according to Theorem A.18, $\forall u \in U$, $S_{[u]}$ is an ER on A, and therefore $S_{[u]}$ satisfies transitivity. According to Definition A.19, $\mathrm{traclo}(S_{[u]}) = S_{[u]}$; therefore we need to prove $\forall \lambda_1, \lambda_2 \in U, \lambda_1 < \lambda_2$, $A/S_{[\lambda_1]} \neq A/S_{[\lambda_2]}$. According to Definition A.8, we need to prove $S_{[\lambda_1]} \neq S_{[\lambda_2]}$, i.e., $\exists a, b \in A$ such that $S_{[\lambda_1]}((a, b)) \neq S_{[\lambda_2]}((a, b))$. Because $\lambda_1 \in U$, $\exists a, b \in A$ such that $S((a, b)) = \lambda_1$. In this case,

$$S_{[\lambda_1]}((a, b)) = \mathbb{I}[S((a, b)) \geq \lambda_1] = \mathbb{I}[\lambda_1 \geq \lambda_1] = 1 \neq 0 = \mathbb{I}[\lambda_1 \geq \lambda_2] = \mathbb{I}[S((a, b)) \geq \lambda_2] = S_{[\lambda_2]}((a, b)),$$

therefore proposition (4) is proven. In summary, propositions (1)-(4) are proven. So the theorem is proven.

C.2.7. PROOF OF THEOREM 4.8

Proof. Let

$$U = \{R((a, b)) \mid a, b \in A\}, \quad u = \mathrm{sortvec}(U) = (u_1, u_2, \cdots, u_{|U|})^T,$$
$$V = \{S((a, b)) \mid a, b \in A\}, \quad v = \mathrm{sortvec}(V) = (v_1, v_2, \cdots, v_{|V|})^T.$$

According to formula (10),

$$|U| = |V|, \quad (43a)$$
$$\forall a, b \in A, \quad \mathrm{sort}(R((a, b)); U) = \mathrm{sort}(S((a, b)); V), \quad (43b)$$
$$\forall a, b, c, d \in A, \quad R((a, b)) < R((c, d)) \text{ iff } S((a, b)) < S((c, d)). \quad (43c)$$

According to Theorem 4.4, $Q(A, R) = \{A/\mathrm{traclo}(R_{[\lambda]}) \mid \lambda \in U\}$ and $Q(A, S) = \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}$. Therefore, to prove $Q(A, R) = Q(A, S)$, we need to prove

$$\{A/\mathrm{traclo}(R_{[\lambda]}) \mid \lambda \in U\} \subseteq \{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\}, \quad (44a)$$
$$\{A/\mathrm{traclo}(S_{[\lambda]}) \mid \lambda \in V\} \subseteq \{A/\mathrm{traclo}(R_{[\lambda]}) \mid \lambda \in U\}. \quad (44b)$$

First, we prove formula (44a).
To prove formula (44a), we need to prove $\forall u \in U, \exists v \in V$ such that $A/\mathrm{traclo}(R_{[u]}) = A/\mathrm{traclo}(S_{[v]})$. According to Definition A.8, it suffices to prove $\mathrm{traclo}(R_{[u]}) = \mathrm{traclo}(S_{[v]})$; according to Definition A.19, it suffices to prove $R_{[u]} = S_{[v]}$; according to Definition A.17, it suffices to prove $\forall u \in U, \exists v \in V$ such that

$$\forall a, b \in A, \quad R_{[u]}((a, b)) = S_{[v]}((a, b)), \text{ i.e., } \mathbb{I}(R((a, b)) \geq u) = \mathbb{I}(S((a, b)) \geq v). \quad (45)$$

$\forall u \in U$: according to Definition 3.2 and formula (43a), $\mathrm{sort}(u; U) \in \{1, 2, \cdots, |U|\} = \{1, 2, \cdots, |V|\}$; therefore, let $v = v_{\mathrm{sort}(u; U)} \in V$. Because $u \in U$, $\exists c, d \in A$ such that $u = R((c, d))$. According to (43b), $\mathrm{sort}(S((c, d)); V) = \mathrm{sort}(R((c, d)); U) = \mathrm{sort}(u; U) = \mathrm{sort}(v; V)$; therefore $v = v_{\mathrm{sort}(u; U)} = v_{\mathrm{sort}(S((c,d)); V)} = S((c, d))$. Next, we prove formula (45) in two cases. (a) $\mathbb{I}(R((a, b)) \geq u) = 1$, i.e., $R((a, b)) \geq u = R((c, d))$: according to (10), $S((a, b)) \geq S((c, d)) = v$, i.e., $\mathbb{I}(S((a, b)) \geq v) = 1$. (b) $\mathbb{I}(R((a, b)) \geq u) = 0$, i.e., $R((a, b)) < u = R((c, d))$: according to (43c), $S((a, b)) < S((c, d)) = v$, i.e., $\mathbb{I}(S((a, b)) \geq v) = 0$. Combining (a) and (b), formula (45) is true, so formula (44a) is true. Similarly, we can prove formula (44b). So far, formula (44) is proven. So the theorem is proven.

C.2.8. PROOF OF THEOREM 4.11

(1) We introduce the following lemmas.

Lemma C.5. For a $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given a set of samples $\emptyset \neq \bar{X} \subseteq \mathcal{X}$ and a fuzzy binary relation S on $\mathcal{Y}$, let $\mathrm{span}(S, f_c, \bar{X})$ be the fuzzy binary relation on $\bar{X}$ described in Definition 4.10. The following propositions are true. (1) If S satisfies reflexivity, then $\mathrm{span}(S, f_c, \bar{X})$ satisfies reflexivity. (2) If S satisfies symmetry, then $\mathrm{span}(S, f_c, \bar{X})$ satisfies symmetry. (3) If S satisfies transitivity, then $\mathrm{span}(S, f_c, \bar{X})$ satisfies transitivity. (4) $\forall \lambda \in [0, 1]$, $\mathrm{span}(S, f_c, \bar{X})_{[\lambda]} = \mathrm{span}(S_{[\lambda]}, f_c, \bar{X})$. (5) $\mathrm{span}(\cdot, \cdot, \cdot)$ can be directly applied to binary relations, and propositions (1)-(4) are also true for binary relations.

Proof. For proposition (1). Because S satisfies reflexivity, i.e., $\forall y \in \mathcal{Y}, S((y, y)) = 1$: $\forall x \in \bar{X}$, $\mathrm{span}(S, f_c, \bar{X})((x, x)) = S((f_c(x), f_c(x))) = 1$, i.e., $\mathrm{span}(S, f_c, \bar{X})$ satisfies reflexivity. So proposition (1) is proven.

For proposition (2). Because S satisfies symmetry, i.e., $\forall y_1, y_2 \in \mathcal{Y}, S((y_1, y_2)) = S((y_2, y_1))$: $\forall x_1, x_2 \in \bar{X}$, $\mathrm{span}(S, f_c, \bar{X})((x_1, x_2)) = S((f_c(x_1), f_c(x_2))) = S((f_c(x_2), f_c(x_1))) = \mathrm{span}(S, f_c, \bar{X})((x_2, x_1))$, i.e., $\mathrm{span}(S, f_c, \bar{X})$ satisfies symmetry. So proposition (2) is proven.

For proposition (3). Because $\bar{X} \subseteq \mathcal{X}$, we have $\bar{\mathcal{Y}} = \{f_c(x) \mid x \in \bar{X}\} \subseteq \mathcal{Y}$. Because S satisfies transitivity, i.e., $\forall y_1, y_3 \in \mathcal{Y}$, $S((y_1, y_3)) \geq \max_{y_2 \in \mathcal{Y}} \min[S((y_1, y_2)), S((y_2, y_3))]$: $\forall x_1, x_3 \in \bar{X}$,

$$\mathrm{span}(S, f_c, \bar{X})((x_1, x_3)) = S((f_c(x_1), f_c(x_3))) \geq \max_{y_2 \in \mathcal{Y}} \min[S((f_c(x_1), y_2)), S((y_2, f_c(x_3)))] \geq \max_{y_2 \in \bar{\mathcal{Y}}} \min[S((f_c(x_1), y_2)), S((y_2, f_c(x_3)))] = \max_{x_2 \in \bar{X}} \min[\mathrm{span}(S, f_c, \bar{X})((x_1, x_2)), \mathrm{span}(S, f_c, \bar{X})((x_2, x_3))],$$

i.e., $\mathrm{span}(S, f_c, \bar{X})$ satisfies transitivity. So proposition (3) is proven.

For proposition (4). $\forall \lambda \in [0, 1], \forall x_1, x_2 \in \bar{X}$,

$$\mathrm{span}(S, f_c, \bar{X})_{[\lambda]}((x_1, x_2)) = \mathbb{I}(\mathrm{span}(S, f_c, \bar{X})((x_1, x_2)) \geq \lambda) = \mathbb{I}(S((f_c(x_1), f_c(x_2))) \geq \lambda) = S_{[\lambda]}((f_c(x_1), f_c(x_2))) = \mathrm{span}(S_{[\lambda]}, f_c, \bar{X})((x_1, x_2)),$$

i.e., proposition (4) is proven. For proposition (5). Obviously, the above proofs of propositions (1)-(4) are also valid under the sense of binary relations. In summary, propositions (1)-(5) are proven. So the lemma is proven.

Lemma C.6. For a $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given a set of samples $\emptyset \neq \bar{X} \subseteq \mathcal{X}$ and a fuzzy binary relation S on $\mathcal{Y}$,
let $\mathrm{span}(S, f_c, \bar{X})$ be the fuzzy binary relation on $\bar{X}$ described in Definition 4.10. If $\forall y \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = y$, then $\mathrm{span}(S, f_c, \bar{X})^2 = \mathrm{span}(S^2, f_c, \bar{X})$.

Proof. Because $\forall y \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = y$, we have $\bar{\mathcal{Y}} = \{f_c(x) \mid x \in \bar{X}\} = \mathcal{Y}$. $\forall x_1, x_3 \in \bar{X}$,

$$\mathrm{span}(S, f_c, \bar{X})^2((x_1, x_3)) = \max_{x_2 \in \bar{X}} \min[\mathrm{span}(S, f_c, \bar{X})((x_1, x_2)), \mathrm{span}(S, f_c, \bar{X})((x_2, x_3))] = \max_{x_2 \in \bar{X}} \min[S((f_c(x_1), f_c(x_2))), S((f_c(x_2), f_c(x_3)))] = \max_{y \in \bar{\mathcal{Y}} = \mathcal{Y}} \min[S((f_c(x_1), y)), S((y, f_c(x_3)))] = S^2((f_c(x_1), f_c(x_3))) = \mathrm{span}(S^2, f_c, \bar{X})((x_1, x_3)),$$

i.e., the lemma is proven.

Lemma C.7. For a $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given a set of samples $\emptyset \neq \bar{X} \subseteq \mathcal{X}$ and an FSR S on $\mathcal{Y}$, let $\mathrm{span}(S, f_c, \bar{X})$ be the fuzzy binary relation on $\bar{X}$ described in Definition 4.10. If $\forall y \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = y$, then $\mathrm{traclo}(\mathrm{span}(S, f_c, \bar{X})) = \mathrm{span}(\mathrm{traclo}(S), f_c, \bar{X})$.

Proof. Because S is an FSR on $\mathcal{Y}$, S satisfies reflexivity and symmetry. According to Lemma C.5, $\mathrm{span}(S, f_c, \bar{X})$ satisfies reflexivity and symmetry, i.e., $\mathrm{span}(S, f_c, \bar{X})$ is an FSR on $\bar{X}$. According to Theorem 4.1, $\mathrm{traclo}(S)$ can be written in the form of $S^{2^k}$, $k \geq 0$, and $\mathrm{traclo}(\mathrm{span}(S, f_c, \bar{X}))$ can be written in the form of $\mathrm{span}(S, f_c, \bar{X})^{2^l}$, $l \geq 0$. Let $m = \max(k, l)$; then $\mathrm{traclo}(S) = S^{2^m}$ and $\mathrm{traclo}(\mathrm{span}(S, f_c, \bar{X})) = \mathrm{span}(S, f_c, \bar{X})^{2^m}$. Therefore, to prove $\mathrm{traclo}(\mathrm{span}(S, f_c, \bar{X})) = \mathrm{span}(\mathrm{traclo}(S), f_c, \bar{X})$, we need to prove

$$\mathrm{span}(S, f_c, \bar{X})^{2^m} = \mathrm{span}(S^{2^m}, f_c, \bar{X}). \quad (46)$$

Because $\forall y \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = y$, according to Lemma C.6,

$$\mathrm{span}(S, f_c, \bar{X})^2 = \mathrm{span}(S^2, f_c, \bar{X}) \Rightarrow \mathrm{span}(S, f_c, \bar{X})^4 = \mathrm{span}(S^4, f_c, \bar{X}) \Rightarrow \cdots \Rightarrow \mathrm{span}(S, f_c, \bar{X})^{2^m} = \mathrm{span}(S^{2^m}, f_c, \bar{X}),$$

i.e., formula (46) is true. So the lemma is proven.

Lemma C.8. For a $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given a set of samples $\emptyset \neq \bar{X} \subseteq \mathcal{X}$ and an ER R on $\mathcal{Y}$, let $O = \mathcal{Y}/R$ be the partition of $\mathcal{Y}$ derived by R. Let $\mathrm{span}(R, f_c, \bar{X})$ be the binary relation on $\bar{X}$ described in Definition 4.10; according to Lemma C.5, $\mathrm{span}(R, f_c, \bar{X})$ is an ER on $\bar{X}$. Let $P = \bar{X}/\mathrm{span}(R, f_c, \bar{X})$ be the partition of $\bar{X}$ derived by $\mathrm{span}(R, f_c, \bar{X})$. If $\forall y \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = y$, then the following propositions are true. (1) $\forall O' \in O$, $\{x \mid x \in \bar{X}, f_c(x) \in O'\} \in P$. (2) $\forall P' \in P$, $\{f_c(x) \mid x \in P'\} \in O$. (3) $|O| = |P|$.

Proof. For proposition (1). $\forall O' \in O$, obviously, $\emptyset \neq O' \subseteq \mathcal{Y}$. $\forall y' \in O'$, according to Definitions A.7 and A.8,

$$O' = [y']_R = \{y \mid y \in \mathcal{Y}, R((y', y)) = 1\}. \quad (47)$$

Because $\forall y \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = y$, without losing generality, assume that $x' \in \bar{X}$ and $f_c(x') = y'$. And then,

$$[x']_{\mathrm{span}(R, f_c, \bar{X})} = \{x \mid x \in \bar{X}, \mathrm{span}(R, f_c, \bar{X})((x', x)) = 1\} = \{x \mid x \in \bar{X}, R((f_c(x'), f_c(x))) = 1\} = \{x \mid x \in \bar{X}, R((y', f_c(x))) = 1\} = \{x \mid x \in \bar{X}, f_c(x) \in [y']_R = O'\}.$$

According to Definition A.8, $[x']_{\mathrm{span}(R, f_c, \bar{X})} \in P$, i.e., proposition (1) is proven.

For proposition (2). $\forall P' \in P$, obviously, $\emptyset \neq P' \subseteq \bar{X}$. $\forall x' \in P'$, according to Definitions A.7 and A.8,

$$P' = [x']_{\mathrm{span}(R, f_c, \bar{X})} = \{x \mid x \in \bar{X}, \mathrm{span}(R, f_c, \bar{X})((x', x)) = 1\}. \quad (48)$$

Let $y' = f_c(x')$; obviously, $y' \in \mathcal{Y}$. Then

$$[y']_R = \{y \mid y \in \mathcal{Y}, R((y', y)) = 1\} = \{y \mid y \in \mathcal{Y}, R((f_c(x'), y)) = 1\} = \{f_c(x) \mid x \in \bar{X}, R((f_c(x'), f_c(x))) = 1\} = \{f_c(x) \mid x \in \bar{X}, \mathrm{span}(R, f_c, \bar{X})((x', x)) = 1\} = \{f_c(x) \mid x \in \bar{X}, x \in [x']_{\mathrm{span}(R, f_c, \bar{X})} = P'\}.$$

According to Definition A.8, $[y']_R \in O$, i.e., proposition (2) is proven.

For proposition (3). Propositions (1) and (2) actually establish a one-to-one mapping between the equivalence classes in O and the equivalence classes in P; therefore $|O| = |P|$. In summary, propositions (1)-(3) are proven. So the lemma is proven.

Lemma C.9. For a $(\mathcal{X}, \mathcal{Y}, f_c)$-classification problem, given a set of samples $\emptyset \neq \bar{X} \subseteq \mathcal{X}$ and two FSRs R, S on $\mathcal{Y}$:
if $Q(\mathcal{Y}, R) = Q(\mathcal{Y}, S)$ and $\forall k \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = k$, then $Q(\bar{X}, \mathrm{span}(R, f_c, \bar{X})) = Q(\bar{X}, \mathrm{span}(S, f_c, \bar{X}))$.

Proof. To prove $Q(\bar{X}, \mathrm{span}(R, f_c, \bar{X})) = Q(\bar{X}, \mathrm{span}(S, f_c, \bar{X}))$, according to Theorem 4.5, we need to prove $Q(\bar{X}, \mathrm{traclo}(\mathrm{span}(R, f_c, \bar{X}))) = Q(\bar{X}, \mathrm{traclo}(\mathrm{span}(S, f_c, \bar{X})))$. Because $\forall k \in \mathcal{Y}, \exists x \in \bar{X}$ such that $f_c(x) = k$, according to Lemma C.7, we need to prove

$$Q(\bar{X}, \mathrm{span}(\mathrm{traclo}(R), f_c, \bar{X})) = Q(\bar{X}, \mathrm{span}(\mathrm{traclo}(S), f_c, \bar{X})). \quad (49)$$

Because R and S are both FSRs on $\mathcal{Y}$, according to Theorem 4.1, $\mathrm{traclo}(R)$ and $\mathrm{traclo}(S)$ are both FERs on $\mathcal{Y}$. According to Lemma C.5, $\mathrm{span}(\mathrm{traclo}(R), f_c, \bar{X})$ and $\mathrm{span}(\mathrm{traclo}(S), f_c, \bar{X})$ are both FERs on $\bar{X}$. According to Theorem A.18, $\forall \mu \in [0, 1]$, $\mathrm{span}(\mathrm{traclo}(R), f_c, \bar{X})_{[\mu]}$ and $\mathrm{span}(\mathrm{traclo}(S), f_c, \bar{X})_{[\mu]}$ are both ERs on $\bar{X}$. According to Definition 4.3, to prove formula (49), we need to prove $\{\bar{X}/\mathrm{span}(\mathrm{traclo}(R), f_c, \bar{X})_{[\lambda]} \mid \lambda \in [0, 1]\} = \{\bar{X}/\mathrm{span}(\mathrm{traclo}(S), f_c, \bar{X})_{[\omega]} \mid \omega \in [0, 1]\}$. According to Definition A.8, we need to prove $\{\mathrm{span}(\mathrm{traclo}(R), f_c, \bar{X})_{[\lambda]} \mid \lambda \in [0, 1]\} = \{\mathrm{span}(\mathrm{traclo}(S), f_c, \bar{X})_{[\omega]} \mid \omega \in [0, 1]\}$. According to Lemma C.5, we need to prove $\{\mathrm{span}(\mathrm{traclo}(R)_{[\lambda]}, f_c, \bar{X}) \mid \lambda \in [0, 1]\} = \{\mathrm{span}(\mathrm{traclo}(S)_{[\omega]}, f_c, \bar{X}) \mid \omega \in [0, 1]\}$. According to Definition 4.10, we need to prove

$$\{\mathrm{traclo}(R)_{[\lambda]} \mid \lambda \in [0, 1]\} = \{\mathrm{traclo}(S)_{[\omega]} \mid \omega \in [0, 1]\}. \quad (50)$$

Next, we prove formula (50). Because $Q(\mathcal{Y}, R) = Q(\mathcal{Y}, S)$, according to Theorem 4.5, $Q(\mathcal{Y}, \mathrm{traclo}(R)) = Q(\mathcal{Y}, \mathrm{traclo}(S))$. Because R and S are both FSRs on $\mathcal{Y}$, according to Theorem 4.1, $\mathrm{traclo}(R)$ and $\mathrm{traclo}(S)$ are both FERs on $\mathcal{Y}$. Based on this, according to Definition 4.3, $\{\mathcal{Y}/\mathrm{traclo}(R)_{[\lambda]} \mid \lambda \in [0, 1]\} = \{\mathcal{Y}/\mathrm{traclo}(S)_{[\omega]} \mid \omega \in [0, 1]\}$. Because $\mathrm{traclo}(R)$ and $\mathrm{traclo}(S)$ are both FERs on $\mathcal{Y}$, according to Theorem A.18, $\forall \mu \in [0, 1]$, $\mathrm{traclo}(R)_{[\mu]}$ and $\mathrm{traclo}(S)_{[\mu]}$ are both ERs on $\mathcal{Y}$. According to Definition A.8, $\{\mathrm{traclo}(R)_{[\lambda]} \mid \lambda \in [0, 1]\} = \{\mathrm{traclo}(S)_{[\omega]} \mid \omega \in [0, 1]\}$, i.e., formula (50) is proven. So formula (49) is proven. So the lemma is proven.

(2) Based on these conclusions, we prove Theorem 4.11.

Proof. Let $\mathrm{traclo}(T)$ be the transitive-closure of T (see Definition A.19). Let

$$U = \{\mathrm{traclo}(T)_{ij} \mid i, j \in \mathcal{Y}\}, \quad (51a)$$
$$V = \{\mathrm{span}(\mathrm{traclo}(T), f_c, X)_{ij} \mid x_i, x_j \in X\}. \quad (51b)$$

Because $\forall y \in \mathcal{Y}, \exists x \in X$ such that $f_c(x) = y$,

$$V = \{\mathrm{span}(\mathrm{traclo}(T), f_c, X)_{ij} \mid x_i, x_j \in X\} = \{\mathrm{traclo}(T)_{f_c(x_i) f_c(x_j)} \mid x_i, x_j \in X\} = \{\mathrm{traclo}(T)_{y_i y_j} \mid y_i, y_j \in \mathcal{Y}\} = U. \quad (52)$$

Because T is an FSR on $\mathcal{Y}$, according to Theorem 4.1, $\mathrm{traclo}(T)$ is an FER on $\mathcal{Y}$. According to Theorem A.18, $\forall \lambda \in [0, 1]$, $\mathrm{traclo}(T)_{[\lambda]}$ is an ER on $\mathcal{Y}$. And then, according to Definition A.19,

$$\forall \lambda \in [0, 1], \quad \mathrm{traclo}(\mathrm{traclo}(T)_{[\lambda]}) = \mathrm{traclo}(T)_{[\lambda]}. \quad (53)$$

Meanwhile, because $\mathrm{traclo}(T)$ is an FER on $\mathcal{Y}$, according to Lemma C.5, $\mathrm{span}(\mathrm{traclo}(T), f_c, X)$ is an FER on X. According to Theorem A.18, $\forall \lambda \in [0, 1]$, $\mathrm{span}(\mathrm{traclo}(T), f_c, X)_{[\lambda]}$ is an ER on X. And then, according to Definition A.19,

$$\forall \lambda \in [0, 1], \quad \mathrm{traclo}(\mathrm{span}(\mathrm{traclo}(T), f_c, X)_{[\lambda]}) = \mathrm{span}(\mathrm{traclo}(T), f_c, X)_{[\lambda]}. \quad (54)$$

Given V, $Q(\mathcal{Y}, T)$ can be written as follows:

$$Q(\mathcal{Y}, T) = Q(\mathcal{Y}, \mathrm{traclo}(T)) = \{\mathcal{Y}/\mathrm{traclo}(\mathrm{traclo}(T)_{[\lambda]}) \mid \lambda \in [0, 1]\} = \{\mathcal{Y}/\mathrm{traclo}(T)_{[\lambda]} \mid \lambda \in [0, 1]\} = \{\mathcal{Y}/\mathrm{traclo}(T)_{[\lambda]} \mid \lambda \in U\} = \{\mathcal{Y}/\mathrm{traclo}(T)_{[\lambda]} \mid \lambda \in V\}. \quad (55)$$

Given V, $Q(X, \mathrm{span}(T, f_c, X))$ can be written as follows:

$$Q(X, \mathrm{span}(T, f_c, X)) = Q(X, \mathrm{traclo}(\mathrm{span}(T, f_c, X))) = Q(X, \mathrm{span}(\mathrm{traclo}(T), f_c, X)) = \{X/\mathrm{traclo}(\mathrm{span}(\mathrm{traclo}(T), f_c, X)_{[\lambda]}) \mid \lambda \in [0, 1]\} = \{X/\mathrm{span}(\mathrm{traclo}(T), f_c, X)_{[\lambda]} \mid \lambda \in [0, 1]\} = \{X/\mathrm{span}(\mathrm{traclo}(T), f_c, X)_{[\lambda]} \mid \lambda \in V\} = \{X/\mathrm{span}(\mathrm{traclo}(T)_{[\lambda]}, f_c, X) \mid \lambda \in V\}. \quad (56)$$

For proposition (1).
In formula (55), because $\mathrm{traclo}(T)$ is an FER on $\mathcal{Y}$, according to Theorem 4.7, $|Q(\mathcal{Y}, T)| = |V|$. In formula (56), because $\mathrm{span}(\mathrm{traclo}(T), f_c, X)$ is an FER on X, according to Theorem 4.7, $|Q(X, \mathrm{span}(T, f_c, X))| = |V|$. So proposition (1) is proven.

For proposition (2). Without losing generality, we arrange the elements in the set V in descending order to obtain $\lambda_1 > \lambda_2 > \cdots > \lambda_{|V|}$. According to Definition A.17, $\forall i = 1, 2, \cdots, |V|-1$, $\mathrm{traclo}(T)_{[\lambda_i]} \subseteq \mathrm{traclo}(T)_{[\lambda_{i+1}]}$ and $\mathrm{span}(\mathrm{traclo}(T)_{[\lambda_i]}, f_c, X) \subseteq \mathrm{span}(\mathrm{traclo}(T)_{[\lambda_{i+1}]}, f_c, X)$. And then, according to Theorem 4.6, $\mathcal{Y}/\mathrm{traclo}(T)_{[\lambda_i]} \preceq \mathcal{Y}/\mathrm{traclo}(T)_{[\lambda_{i+1}]}$ and $X/\mathrm{span}(\mathrm{traclo}(T)_{[\lambda_i]}, f_c, X) \preceq X/\mathrm{span}(\mathrm{traclo}(T)_{[\lambda_{i+1}]}, f_c, X)$. Based on this, according to Definitions 4.3 and 4.9,

$$(Q(\mathcal{Y}, T), \preceq) = \{ O_1 = \mathcal{Y}/\mathrm{traclo}(T)_{[\lambda_1]} \preceq O_2 = \mathcal{Y}/\mathrm{traclo}(T)_{[\lambda_2]} \preceq \cdots \preceq O_{|V|} = \mathcal{Y}/\mathrm{traclo}(T)_{[\lambda_{|V|}]} \}, \quad (57)$$
$$(Q(X, \mathrm{span}(T, f_c, X)), \preceq) = \{ P_1 = X/\mathrm{span}(\mathrm{traclo}(T)_{[\lambda_1]}, f_c, X) \preceq P_2 = X/\mathrm{span}(\mathrm{traclo}(T)_{[\lambda_2]}, f_c, X) \preceq \cdots \preceq P_{|V|} = X/\mathrm{span}(\mathrm{traclo}(T)_{[\lambda_{|V|}]}, f_c, X) \}. \quad (58)$$

In formulas (57) and (58), because T is an FSR on $\mathcal{Y}$, according to Theorem 4.1, $\mathrm{traclo}(T)$ is an FER on $\mathcal{Y}$. And then, according to Theorem A.18, $\forall i = 1, 2, \cdots, |V|$, $\mathrm{traclo}(T)_{[\lambda_i]}$ is an ER on $\mathcal{Y}$. Meanwhile, $\forall y \in \mathcal{Y}, \exists x \in X$ such that $f_c(x) = y$. Based on this, $\forall i = 1, 2, \cdots, |V|$, according to Lemma C.8, $O_i$ and $P_i$ satisfy the three conditions in proposition (2). So proposition (2) is proven. In summary, propositions (1) and (2) are proven. So the theorem is proven.

C.2.9. PROOF OF THEOREM 4.12

Proof. For the sake of discussion, let $\bar{T} = \mathrm{span}(T, f_c, X)$. According to Definitions 4.3 and 4.9, to prove $\forall i = 1, 2, \cdots, |Q(X, \bar{T})|$, $\exists j \in \{1, 2, \cdots, |Q(X, S)|\}$ such that $P_i = Q_j$, we need to prove $\forall P \in Q(X, \bar{T}), \exists Q \in Q(X, S)$ such that $P = Q$, i.e.,

$$Q(X, \bar{T}) \subseteq Q(X, S). \quad (59)$$

Next, we prove formula (59). For the sake of discussion, we copy formula (7) as follows:

$$V = \{t_{ij} \mid i, j \in \mathcal{Y}\}, \quad U = V \cup \{0, \alpha, \beta\}, \quad u = \mathrm{sortvec}(U) = (u_1, u_2, \cdots, u_{|U|})^T.$$

According to Definition 4.10, let $\bar{V} = \{\bar{t}_{ij} \mid \bar{t}_{ij} = \mathrm{span}(T, f_c, X)_{ij} = t_{y_i y_j}, \ i, j = 1, \cdots, n\}$; then $\bar{V} \subseteq V$. And then, according to Theorem 4.4, $Q(X, \bar{T}) = \{X/\mathrm{traclo}(\bar{T}_{[\lambda]}) \mid \lambda \in \bar{V}\}$. Therefore, to prove formula (59), we need to prove $\forall \lambda \in \bar{V}, \exists \omega \in [0, 1]$ such that $X/\mathrm{traclo}(\bar{T}_{[\lambda]}) = X/\mathrm{traclo}(S_{[\omega]})$. According to Definition A.8, it suffices to prove $\mathrm{traclo}(\bar{T}_{[\lambda]}) = \mathrm{traclo}(S_{[\omega]})$; according to Theorem 4.1, it suffices to prove $\bar{T}_{[\lambda]} = S_{[\omega]}$; according to Definition A.17, it suffices to prove $\forall \lambda \in \bar{V}, \exists \omega \in [0, 1]$ such that $\forall i, j = 1, 2, \cdots, n$, $\mathbb{I}[\bar{t}_{ij} \geq \lambda] = \mathbb{I}[s_{ij} \geq \omega]$. According to Definition 4.10, $\forall i, j = 1, 2, \cdots, n$, $\bar{t}_{ij} = t_{y_i y_j}$. Therefore, we need to prove

$$\forall \lambda \in \bar{V}, \exists \omega \in [0, 1], \text{ such that } \forall i, j = 1, 2, \cdots, n, \quad \mathbb{I}[t_{y_i y_j} \geq \lambda] = \mathbb{I}[s_{ij} \geq \omega]. \quad (60)$$

Next, we prove formula (60). $\forall \lambda \in \bar{V} \subseteq V$, according to formula (35e), $2 \leq \mathrm{sort}(\lambda; U) \leq |U|$; therefore $1 \leq \mathrm{sort}(\lambda; U) - 1 \leq |U| - 1$. Let $\omega = u_{\mathrm{sort}(\lambda; U)-1}$; according to formulas (35d) and (35a), $0 = u_1 \leq \omega < \lambda \leq u_{|U|} = 1$, i.e., $\omega \in [0, 1)$. Because $L_1 = 0$, $\forall i, j = 1, 2, \cdots, n, i \neq j$, $L_{HA}(s_{ij}, y_i, y_j; \alpha, \beta, T) = 0$, i.e.⁴,

$$s_{ij} \in (u_{\mathrm{sort}(t_{y_i y_j}; U)-1}, \ u_{\mathrm{sort}(t_{y_i y_j}; U)}). \quad (61)$$

⁴ According to (8), $s_{ij}$ should fall into the closed interval. In practice, we can introduce a small constant $\epsilon > 0$ into formula (8) such that $s_{ij}$ falls in the open interval.

Next, we prove that $\forall i, j = 1, 2, \cdots, n$, $\mathbb{I}[t_{y_i y_j} \geq \lambda] = \mathbb{I}[s_{ij} \geq \omega]$. (a) If $i = j$: because T is an FSR on $\mathcal{Y}$ and S is an FSR on X, $\mathbb{I}[t_{y_i y_i} \geq \lambda] = \mathbb{I}[1 \geq \lambda] = 1 = \mathbb{I}[1 \geq \omega] = \mathbb{I}[s_{ii} \geq \omega]$. (b) If $i \neq j$ and $\mathbb{I}[t_{y_i y_j} \geq \lambda] = 1$, i.e., $t_{y_i y_j} \geq \lambda$: according to formula (7), $\mathrm{sort}(t_{y_i y_j}; U) \geq \mathrm{sort}(\lambda; U)$, and then $\mathrm{sort}(t_{y_i y_j}; U) - 1 \geq \mathrm{sort}(\lambda; U) - 1$.
And then, according to formula (61), $s_{ij} > u_{\mathrm{sort}(t_{y_i y_j}; U)-1} \geq u_{\mathrm{sort}(\lambda; U)-1} = \omega$, i.e., $\mathbb{I}[s_{ij} \geq \omega] = 1$. (c) If $i \neq j$ and $\mathbb{I}[t_{y_i y_j} \geq \lambda] = 0$, i.e., $t_{y_i y_j} < \lambda$: according to formula (7), $\mathrm{sort}(t_{y_i y_j}; U) < \mathrm{sort}(\lambda; U)$, and therefore $\mathrm{sort}(t_{y_i y_j}; U) \leq \mathrm{sort}(\lambda; U) - 1$. And then, according to formula (61), $s_{ij} < u_{\mathrm{sort}(t_{y_i y_j}; U)} \leq u_{\mathrm{sort}(\lambda; U)-1} = \omega$, i.e., $\mathbb{I}[s_{ij} \geq \omega] = 0$. Combining (a)-(c), formula (60) is true. Therefore, formula (59) is true. So the theorem is proven.

C.3. Proofs of Theorems in Appendix B

C.3.1. PROOF OF THEOREM B.2

(1) To prove Theorem B.2, we introduce the following derivation process.

Remark C.10. Given the following convex quadratic programming problem:

$$\min_r J_1(r) = \sum_{i=1}^{m} (r_i - s_i)^2 \quad \mathrm{s.t.} \quad r_i = r_{i+1}, \ i = 1, 2, \cdots, m-1; \quad a < r_i < b, \ i = 1, 2, \cdots, m. \quad (62)$$

$\forall i = 1, 2, \cdots, m$, let $t = r_i$. Meanwhile, let $\bar{s} = \frac{\sum_{i=1}^{m} s_i}{m}$. And then,

$$J_1(r) = \sum_{i=1}^{m} (r_i - s_i)^2 = \sum_{i=1}^{m} (r_i^2 - 2 r_i s_i + s_i^2) = m t^2 - 2 \Big(\sum_{i=1}^{m} s_i\Big) t + \sum_{i=1}^{m} s_i^2 = m t^2 - 2 m \bar{s} t + \sum_{i=1}^{m} s_i^2 = m (t - \bar{s})^2 - m \bar{s}^2 + \sum_{i=1}^{m} s_i^2.$$

Let $J_2(t) = (t - \bar{s})^2$ and $J_3(s) = \sum_{i=1}^{m} s_i^2 - m \bar{s}^2$; then $J_1(r) = m J_2(t) + J_3(s)$. Substituting them into formula (62), we obtain the following chain of equivalent optimization problems:

$$\min_r J_1(r) \ \mathrm{s.t.} \ r_i = r_{i+1}, \ a < r_i < b \quad \Leftrightarrow \quad \min_t m J_2(t) + J_3(s) \ \mathrm{s.t.} \ a < t < b \quad \Leftrightarrow \quad \min_t m J_2(t) \ \mathrm{s.t.} \ a < t < b.$$

Obviously, the last optimization problem in the above chain is a convex quadratic programming problem. Let $t^*$ be its optimal solution. Then $r^* \in \mathbb{R}^m$ with $r^*_i = t^*$, $\forall i = 1, 2, \cdots, m$, is the optimal solution of formula (62). In other words, once all coordinates are tied together, the best common value is the mean $\bar{s}$, restricted to the feasible interval.
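As a quick numerical sanity check of Remark C.10 (ours, assuming SciPy is available): once all coordinates are tied, the minimizer collapses to the mean $\bar{s}$ clipped to the feasible interval.

import numpy as np
from scipy.optimize import minimize_scalar

s = np.array([0.12, 0.30, 0.25, 0.40])
a, b = 0.0, 0.2                       # box constraint a < t < b
s_bar = s.mean()                      # 0.2675, outside the box
J1 = lambda t: np.sum((t - s) ** 2)   # J1(r) with all r_i = t
num = minimize_scalar(J1, bounds=(a, b), method='bounded').x
eps = 1e-9                            # open interval handled with epsilon
closed = np.clip(s_bar, a + eps, b - eps)
print(num, closed)                    # both approach the boundary value 0.2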
(2) Based on the above derivation process, we prove Theorem B.2.

Proof. For proposition (1). According to formula (29), $S^{(D,K)}$ satisfies reflexivity and symmetry. Because $t^*$ is the optimal solution of formula (27), $\forall i = 1, 2, \cdots, |V^K|$, $0 < t^*_i < \alpha$. According to formula (29), $\forall i, j \in \mathcal{Y}, i \neq j$, $0 < s^{(D,K)}_{ij} < \alpha$. Therefore, $S^{(D,K)}$ is an $\alpha$-FSR on $\mathcal{Y}$. So proposition (1) is proven.

For proposition (2). Because $S^K$ and $S^D$ both satisfy reflexivity and symmetry, we can eliminate the third and fourth constraints and ignore the elements on the main diagonal and in the lower-triangle positions of the matrix in the objective function of formula (26). Therefore, we obtain the following equivalent optimization problem:

$$\min_{S^{(D,K)}} \sum_{i \in \mathcal{Y}} \sum_{j \in \mathcal{Y}, j > i} \big(s^{(D,K)}_{ij} - s^D_{ij}\big)^2 \quad \mathrm{s.t.} \quad s^{(D,K)}_{ij} < s^{(D,K)}_{kl}, \ \forall i < j, k < l, \ s^K_{ij} < s^K_{kl}; \quad s^{(D,K)}_{ij} = s^{(D,K)}_{kl}, \ \forall i < j, k < l, \ s^K_{ij} = s^K_{kl}; \quad 0 < s^{(D,K)}_{ij} < \alpha, \ \forall i, j \in \mathcal{Y}, i < j. \quad (63)$$

In formula (63), $S^{(D,K)}$ satisfies the first two constraints iff the sorting of the upper-triangular elements of $S^{(D,K)}$ is consistent with the sorting of the upper-triangular elements of $S^K$. Let $V^K = \{s^K_{ij} \mid i, j \in \mathcal{Y}, i \neq j\}$. Considering the situation where the upper-triangular elements of $S^K$ have duplicates, let $I_i = \{(k, l) \mid k, l \in \mathcal{Y}, k < l, \ \mathrm{sort}(s^K_{kl}; V^K) = i\}$. Substituting $V^K$ and $I_i$ into formula (63), we obtain the following equivalent optimization problem:

$$\min_{S^{(D,K)}} \sum_{i=1}^{|V^K|} \sum_{(k,l) \in I_i} \big(s^{(D,K)}_{kl} - s^D_{kl}\big)^2 \quad \mathrm{s.t.} \quad s^{(D,K)}_{pq} < s^{(D,K)}_{rs}, \ \forall i, j = 1, \cdots, |V^K|, i < j, (p, q) \in I_i, (r, s) \in I_j; \quad s^{(D,K)}_{pq} = s^{(D,K)}_{rs}, \ \forall i = 1, \cdots, |V^K|, (p, q), (r, s) \in I_i, (p, q) \neq (r, s); \quad 0 < s^{(D,K)}_{ij} < \alpha, \ \forall i, j \in \mathcal{Y}, i < j. \quad (64)$$

$\forall i = 1, 2, \cdots, |V^K|$, let $\bar{s}_i = \frac{\sum_{(p,q) \in I_i} s^D_{pq}}{|I_i|}$. According to Remark C.10, we can eliminate the equality constraints in formula (64) and obtain the following equivalent optimization problem:

$$\min_{t \in \mathbb{R}^{|V^K|}} J_1(t) = \sum_{i=1}^{|V^K|} |I_i| (t_i - \bar{s}_i)^2 \quad \mathrm{s.t.} \quad t_i < t_{i+1}, \ i = 1, \cdots, |V^K|-1; \quad 0 < t_i < \alpha, \ i = 1, \cdots, |V^K|. \quad (65)$$

Substituting $\bar{s}$ given in formula (28) and $D$ into formula (65), we obtain $J_1(t) = t^T D t - 2 \bar{s}^T D t + \bar{s}^T D \bar{s}$. By removing the constant term $\bar{s}^T D \bar{s}$ and letting $J_2(t) = t^T D t - 2 \bar{s}^T D t$, we obtain the following optimization problem:

$$\min_{t \in \mathbb{R}^{|V^K|}} J_2(t) = t^T D t - 2 \bar{s}^T D t \quad \mathrm{s.t.} \quad t_i < t_{i+1}, \ i = 1, \cdots, |V^K|-1; \quad 0 < t_i < \alpha, \ i = 1, \cdots, |V^K|. \quad (66)$$

In formula (66), according to formula (28), D is a symmetric positive definite matrix; therefore $J_2(t)$ is a convex function w.r.t. t. Meanwhile, only linear inequality constraints are involved in the constraint conditions of formula (66). Therefore, formula (66) is a standard convex quadratic programming problem. Let $t^*$ be the optimal solution of formula (66). And then, let $S^{(D,K)} \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$, $\forall i, j \in \mathcal{Y}$,

$$s^{(D,K)}_{ij} = \begin{cases} 1, & i = j \\ t^*_{\mathrm{sort}(s^K_{ij}; V^K)}, & i \neq j \end{cases}.$$

Obviously, $S^{(D,K)}$ satisfies all the constraint conditions of formula (26). Meanwhile, according to Remark C.10, $S^{(D,K)}$ is the optimal solution of formula (26). So proposition (2) is proven. In summary, propositions (1) and (2) are proven. So the theorem is proven.

D. Details of Experiments

D.1. Details of Interpretability Analysis Experiments

(1) Task. To show the working mechanism of the proposed method, we conduct experiments on the handwritten digit classification task. The class space is $\mathcal{Y} = \{0, 1, \cdots, 9\}$, i.e., the 10 digits familiar to human beings. Therefore, it is possible to evaluate the experimental results against common sense.

(2) Data. We adopt the handwritten digit data set MNIST (LeCun et al., 1998), consisting of 60,000 training samples and 10,000 test samples. Each sample is a 28 x 28 pixel grayscale image containing a handwritten digit. For the convenience of presentation, 10 samples were randomly selected for each class to form the training data set.

(3) Class Knowledge. In this experiment, we recruit 10 volunteers. Each volunteer is asked to construct a 10 x 10 rating matrix that measures the similarity between different digits. The score range is $\{0, 1, 2, \cdots, 10\}$. The constructed rating matrices are averaged and normalized, denoted as $S^K$. The human knowledge system here is the understanding of the visual features of digit characters in the minds of the 10 volunteers, and the class knowledge K is the rating matrix constructed by them.

(4) Settings. The feature extraction network $h : \mathbb{R}^{28 \times 28} \to \mathbb{R}^{10}_+$ adopts a 5-layer convolutional neural network. The FSR network $g : \mathbb{R}^{10}_+ \times \mathbb{R}^{10}_+ \to [0, 1]$ adopts cosine similarity. We adopt case 2 in formula (6) to obtain the final FSR T. The fuzziness parameters are $\alpha = 0.7$, $\beta = 0.9$. No regularization term is used in the experiment. We adopt Adam (Kingma & Ba, 2015) to solve formulas (1) and (9), where the learning rate is set to $10^{-3}$. We implement the code based on PyTorch⁵ and conduct all experiments on an NVIDIA A100-PCIE-40GB GPU.

⁵ https://pytorch.org/

D.2. Details of Generalization Analysis Experiments

To verify the effectiveness of the proposed method, we conduct experiments on 6 public data sets, comparing with 6 different classifiers.

(1) Task. The APY and ImageNet1K data sets contain 32 and 1000 classes, respectively, involving classes as different as animals, furniture, and vehicles. The differences among the classes are large, so these are coarse-grained classification tasks. The AWA1 and AWA2 data sets share the same class space, which contains 50 species of animals, such as horses, whales, tigers, etc. There are certain differences between the classes; thus, they are medium-granularity classification tasks.
The FLO and CUB data sets contain 102 different species of flowers and 200 different species of birds, respectively. There are small differences between the classes; thus, they are fine-grained classification tasks.

(2) Data. For the ImageNet1K data set, we adopt the common data partition, which contains 1,281,167 training samples and 50,000 test samples. We record the test accuracy of all methods. The remaining 5 data sets are evaluated using 5-fold cross-validation for each method, and the mean accuracy is recorded. As is well known, the way of extracting features from images seriously affects the performance of image classification.
The focus of this paper is how to integrate knowledge to improve the classification performance rather than designing a better method of extracting features. In order to eliminate the influence of the feature extraction process, the 2048-dimensional features extracted by the literature (Xian et al., 2019) are used in all experiments for all data sets. The detailed information of the data sets is summarized in Table 2.

(3) Class Knowledge. In this experiment, we adopt two types of knowledge, denoted as K1 and K2. K1 represents the class description vectors. For the ImageNet1K data set, the class description vectors are obtained by GloVe (Pennington et al., 2014). For the remaining 5 data sets, the class description vectors are designed by domain experts; the detailed descriptions can be found in the literature (Xian et al., 2019). We adopt formula (20) in Appendix B.1 to calculate the class similarity relation $S^K$. K2 represents the knowledge graph; WordNet (Miller, 1995) is adopted in this experiment. Algorithms 1, 2, and 3 (see Appendix B.1) are called successively to get the class FSR $S^K$. The class spaces of the FLO and CUB data sets are sets of different flowers and different birds, respectively. This makes it difficult to match them with WordNet and take advantage of the knowledge provided by WordNet.

(4) Comparison Methods and Settings. In this experiment, we adopt 6 different classifiers as comparison methods. K-nearest neighbor (KNN), decision tree (DT), support vector machine (SVM), and naive Bayes (NB) are classical, non-deep-neural-network methods. The cross-entropy classifier (CEC) and FLM are methods based on deep neural networks. Because the input features are extracted by a pre-trained deep neural network, we adopt 3 fully-connected layers as the backbone for CEC and FLM. In the proposed HC-HFLM, the two types of knowledge K1 and K2 are integrated into FLM, respectively, denoted as CK1-HFLM and CK2-HFLM. For fair comparison, the network architecture is the same as that of FLM. For each method, different parameter settings are adopted and the best experimental results are recorded. For the first four methods, we adopt the implementations of the Scikit-learn library⁶. The remaining methods are implemented based on PyTorch. We conduct all experiments on an NVIDIA A100-PCIE-40GB GPU. More details are listed in Table 4.

⁶ https://scikit-learn.org/stable/index.html

Table 4. The comparison methods and their settings

- K-Nearest Neighbor (KNN): We adopt Euclidean distance. The number of nearest neighbors is selected within {1, 3, 5, 7, 9, 11}.
- Decision Tree (DT): We adopt two criteria, i.e., Entropy and Gini, to select the split feature. We adopt two strategies, i.e., global search and random split, to determine the split threshold.
- Support Vector Machine (SVM): We adopt one-vs-rest to deal with the multi-class classification problem. We adopt the Gaussian kernel function with default kernel parameters. The trade-off parameters are selected within {0.01, 0.1, 1, 10, 100}.
- Naive Bayes (NB): We adopt four different distribution assumptions to estimate the probability density, i.e., Gaussian, Multinomial, Complement, and Bernoulli.
- Cross-Entropy Classifier (CEC): The feature extraction network is designed as $h(x; \Theta) = \mathrm{ReLU}(\mathrm{ReLU}(x W_1 + b_1) W_2 + b_2) W_3 + b_3$, where $\Theta = \{W_1 \in \mathbb{R}^{2048 \times 1024}, b_1 \in \mathbb{R}^{1024}, W_2 \in \mathbb{R}^{1024 \times 512}, b_2 \in \mathbb{R}^{512}, W_3 \in \mathbb{R}^{512 \times |\mathcal{Y}|}, b_3 \in \mathbb{R}^{|\mathcal{Y}|}\}$ is the set of learnable parameters. We adopt the cross-entropy loss between $\mathrm{Softmax}(h(x))$ and the true labels to train the network. We adopt Adam (Kingma & Ba, 2015) as the optimizer. The batch size is set to 256. The number of epochs is set to 200. The initial learning rate is set to 1e-3 and decays by a factor of 0.5 every 20 epochs.
- Fuzzy Learning Machine (FLM): The overall network is $f((x_1, x_2); \Theta) = \frac{\varphi(x_1; \Theta)^T \varphi(x_2; \Theta)}{\|\varphi(x_1; \Theta)\|_2 \|\varphi(x_2; \Theta)\|_2}$, where $\varphi(\cdot; \Theta) = \mathrm{Sigmoid}(h(\cdot; \Theta))$ and h is the same as in CEC. We adopt Adam (Kingma & Ba, 2015) as the optimizer. The batch size is set to 1000. The number of epochs is set to 200. The initial learning rate is set to 1e-3 and decays by a factor of 0.5 every 50 epochs. The fuzziness parameters are $\alpha = 0.2$, $\beta = 0.8$. No regularization term is used in the experiment.
- HC-HFLM with K1 (CK1-HFLM): The overall network is the same as FLM. The pre-trained FLM is used as the initial point. We adopt Adam (Kingma & Ba, 2015) as the optimizer. The batch size is set to 1000. The number of epochs is set to 50. On the ImageNet1K data set, the initial learning rate is set to 1e-9; on the remaining data sets, it is set to 1e-3. The learning rate decays by a factor of 0.5 every 30 epochs. For the class description vectors, we adopt formula (20) in Appendix B.1 to calculate the FSR $S^K$. We adopt case 3 in formula (6) to get the final FSR T. On the ImageNet1K data set, $S^K$ is coarsened into 50 levels; on the remaining data sets, $S^K$ is coarsened into 100 levels.
- HC-HFLM with K2 (CK2-HFLM): Different from CK1-HFLM, Algorithms 1, 2, and 3 are called successively to get the FSR $S^K$ (see Appendix B.1). The number of levels is determined by the knowledge graph (see Table 2). Based on this, we adopt case 2 in formula (6) to get the final FSR T.
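For concreteness, here is a minimal PyTorch sketch of the 3-layer backbone h and the pairwise FSR network f described in Table 4; the class names and the constructor defaults are ours.

import torch
import torch.nn as nn

class Backbone(nn.Module):
    # h(x) = ReLU(ReLU(x W1 + b1) W2 + b2) W3 + b3, shared by CEC and FLM
    def __init__(self, n_classes, d_in=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, n_classes))

    def forward(self, x):
        return self.net(x)

class FLMPairwise(nn.Module):
    # f((x1, x2); Theta) = cosine similarity of phi = Sigmoid(h(.))
    def __init__(self, n_classes, d_in=2048):
        super().__init__()
        self.h = Backbone(n_classes, d_in)

    def forward(self, x1, x2):
        phi1 = torch.sigmoid(self.h(x1))
        phi2 = torch.sigmoid(self.h(x2))
        return nn.functional.cosine_similarity(phi1, phi2, dim=-1)

For CEC, Backbone is trained with nn.CrossEntropyLoss on h(x); for FLM and HC-HFLM, FLMPairwise produces the sample-pair FSR values that the losses in formulas (1) and (9) operate on.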