# Augmenting Transfer Learning with Semantic Reasoning

Freddy Lécué¹,², Jiaoyan Chen³, Jeff Z. Pan⁴,⁵ and Huajun Chen⁶,⁷
¹CortAIx Thales, Montreal, Canada
²Inria, Sophia Antipolis, France
³Department of Computer Science, University of Oxford, UK
⁴Department of Computer Science, The University of Aberdeen, UK
⁵Edinburgh Research Centre, Huawei, UK
⁶College of Computer Science, Zhejiang University, China
⁷ZJU-Alibaba Joint Lab on Knowledge Engine, China

**Abstract.** Transfer learning aims at building robust prediction models by transferring knowledge gained from one problem to another. In the Semantic Web, learning tasks are enhanced with semantic representations. We exploit their semantics to augment transfer learning by dealing with *when to transfer* with semantic measurements and *what to transfer* with semantic embeddings. We further present a general framework that integrates the above measurements and embeddings with existing transfer learning algorithms for higher performance. It has been demonstrated to be robust in two real-world applications: bus delay forecasting and air quality forecasting.

## 1 Introduction

Transfer learning [Pan and Yang, 2010] aims at solving the problem of lacking training data by utilizing data from other related learning domains, each of which is referred to as a pair of dataset and prediction task. Transfer learning plays a critical role in real-world applications of ML, as (labelled) data is usually not large enough to train accurate and robust models. Most approaches focus on similarity in raw data distribution, with techniques such as dynamic weighting of instances [Dai et al., 2007] and sharing of model parameters [Benavides-Prado et al., 2017] (cf. Related Work). Despite a large spectrum of techniques [Weiss et al., 2016] in transfer learning, it remains challenging to assess a priori which domain and dataset to elaborate from [Dai et al., 2009].
To deal with such challenges, [Choi et al., 2016] integrated expert feedback as a semantic representation of domain similarity for knowledge transfer, while [Lee et al., 2017] evaluated graph-based representations of source and target domains. Both studies encode semantics but are limited in expressivity, which restricts the interpretability of domains and inhibits a good understanding of transferability. There are also efforts on transfer learning based on Markov Logic Networks (MLN), using first-order [Mihalkova et al., 2007; Mihalkova and Mooney, 2009] or second-order [Davis and Domingos, 2009; Van Haaren et al., 2015] rules as declarative prediction models. However, these efforts still cannot answer questions like: What ensures a positive domain transfer? Would learning a model from road traffic congestion in London be the best for predicting congestion in Paris? Or would an air quality model transfer better?

In this paper, we propose to encode the semantics of learning tasks and domains with OWL ontologies, providing a robust foundation to study transferability between source and target learning domains. From knowledge materialization [Nickel et al., 2016], feature selection [Vicient et al., 2013], predictive reasoning [Lécué and Pan, 2015], stream learning [Chen et al., 2017] to transfer learning explanation [Chen et al., 2018], all are examples of inference tasks where the semantics of data representation is exploited to derive a priori knowledge from pre-established statements in ML tasks. We introduce a framework that augments transfer learning with semantics and its reasoning capability, as shown in Figure 1. It deals with (i) *when to transfer*, by suitable transferability measurements (i.e., variability of semantic learning tasks and consistent transferability knowledge), and (ii) *what to transfer*, by embedding the semantics of learning domains and tasks with transferability, consistency and variability vectors.
In addition to exposing the semantics that drives transfer, a transfer boosting algorithm is developed to integrate the embeddings with existing transfer learning approaches. Our approach achieves high performance for multiple transfer learning tasks in air quality and bus delay forecasting.

*Figure 1: Ontology-based Transfer Learning Augmentation. (The figure relates the source and target domains of a transfer learning task to the "when to transfer" measurements — variability of semantic learning tasks and consistent transferability knowledge — and the "what to transfer" embeddings — transferability, consistency and variability vectors — feeding the semantic transfer boosting algorithm.)*

## 2 Background

Our work uses OWL ontologies underpinned by the Description Logic (DL) EL++ [Baader et al., 2005; Bechhofer et al., 2004] to model the semantics of learning domains and tasks.

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

### 2.1 Description Logics EL++ and Ontology

A signature Σ, noted (N_C, N_R, N_I), consists of 3 disjoint sets of (i) atomic concepts N_C, (ii) atomic roles N_R, and (iii) individuals N_I. Given a signature, the top concept ⊤, the bottom concept ⊥, an atomic concept A, an individual a and an atomic role r, EL++ concept expressions C and D can be composed with the following constructs:

⊤ | ⊥ | A | C ⊓ D | ∃r.C | {a}

A DL ontology ⟨T, A⟩ is composed of a TBox T and an ABox A. T is a set of concept and role axioms. EL++ supports General Concept Inclusion axioms (GCIs, e.g., C ⊑ D) and Role Inclusion axioms (RIs, e.g., r ⊑ s). A is a set of concept assertion axioms, e.g., C(a), role assertion axioms, e.g., r(a, b), and individual in/equality axioms, e.g., a = b, a ≠ b. Given an input ontology T ∪ A, we consider the closure of atomic ABox entailments (or simply entailment closure, denoted G(T ∪ A)) as {g | T ∪ A ⊨ g}, where g represents an atomic concept assertion A(b) or an atomic role assertion r(a, b), involving only named concepts, named roles and named individuals. Entailment reasoning in EL++ is PTime-complete.
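As a toy illustration of the entailment closure G(T ∪ A), the following sketch forward-chains atomic concept inclusions over concept assertions. It is a minimal, invented example (the concept and individual names are not from the paper's ontology), and it deliberately omits the rest of EL++ (conjunction, existential restrictions, nominals), which a real reasoner would handle:

```python
# Minimal sketch of an ABox entailment closure G(T ∪ A): only atomic
# GCIs (C ⊑ D) and concept assertions C(a) are handled; full EL++
# reasoning is out of scope for this illustration.
def entailment_closure(tbox, abox):
    """tbox: set of (sub, sup) concept-inclusion pairs;
    abox: set of (concept, individual) assertions.
    Returns the closure of atomic concept assertions."""
    closure = set(abox)
    changed = True
    while changed:  # forward chaining until fixpoint
        changed = False
        for sub, sup in tbox:
            for concept, ind in list(closure):
                if concept == sub and (sup, ind) not in closure:
                    closure.add((sup, ind))
                    changed = True
    return closure

# Toy TBox: Motorway ⊑ Road, Road ⊑ Way; toy ABox: Motorway(r0).
tbox = {("Motorway", "Road"), ("Road", "Way")}
abox = {("Motorway", "r0")}
print(entailment_closure(tbox, abox))
# the closure contains the derived assertions Road(r0) and Way(r0)
```

The fixpoint loop mirrors why entailment reasoning here stays polynomial: each pass can only add assertions from a bounded set of concept–individual pairs.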
**Example 1. (TBox and ABox Concept Assertion Axioms)** Figure 2 presents (i) a TBox T where Road (axiom 1) denotes the concept of ways which are in a continent, and (ii) concept assertions (axioms 8–9) with individuals r0 and r1 being roads.

*Figure 2: Sample of an Ontology's TBox T and ABox A.*

### 2.2 Learning Domain and Task

To model the learning domain with an ontology, we use Learning Sample Ontologies and Target Entailments, as in [Chen et al., 2018]. A learning domain consists of an LSO set (i.e., dataset) and a target entailment set (i.e., prediction task).

**Definition 1. (Learning Sample Ontology (LSO))** A learning sample ontology O = ⟨T, A, S⟩ is an ontology ⟨T, A⟩ annotated by property-value pairs S.

The annotation S acts as key dimensions to uniquely identify an input sample of ML methods. When the context is clear, we also use LSO to refer to its ontology ⟨T, A⟩.

**Example 2. (An LSO in the Context of Ireland Traffic)** Assume an LSO is annotated by property-value pairs S := {topic: Road, road: CWay, country: UK}. Its TBox T includes static axioms like (1); its ABox A includes facts, e.g., hasAvgSpeed(r0, Low), that are observed on CWay in the UK.

**Definition 2. (Learning Domain and Target Entailment)** A learning domain D = ⟨O, G_Y⟩ consists of a set of LSOs O that share the same TBox T, and target entailments G_Y, each of whose truth in an LSO is to be predicted. Its entailment closure, denoted G(O), is defined as ⋃_{O ∈ O} G(O).

Definition 3 revisits supervised learning within a domain. In a training LSO, a target entailment is true if it is entailed by the LSO, and false otherwise. In a testing LSO, the truth of a target entailment is to be predicted instead of being inferred.

**Definition 3. (Semantic Learning Task)** Given a learning domain D = ⟨O, G_Y⟩, whose LSOs O are divided into two disjoint sets O′ and O″, a semantic learning task T within D is defined as ⟨D, O′, O″, f(·)⟩, i.e., the task of identifying a function f(·) from O′ and G_Y to predict the truth of G_Y in each LSO of O″.
Here, O′ is called the training LSO set, while O″ is called the testing LSO set.

**Example 3. (Semantic Learning Task)** Given a domain composed of LSOs annotated by {topic: Road, country: UK} and target entailments Cleared(r0) and Disrupted(r0), where the LSOs are divided into a training set O′ and a testing set O″ according to the type of roads involved, the objective is to identify a function from O′ to predict the condition of road r0, namely the truth of Cleared(r0) and Disrupted(r0), in each LSO of O″.

### 2.3 Transfer Learning Across Domains

Definition 4 revisits transfer learning, where D_S and D_T are called the source domain and the target domain, and their entailment closures are denoted G_S and G_T.

**Definition 4. (Transfer Learning)** Given two learning domains D_S = ⟨O_S, G_{Y_S}⟩ and D_T = ⟨O_T, G_{Y_T}⟩, where the LSOs of D_T are divided into two disjoint sets O′_T and O″_T, transfer learning from D_S to D_T is the task of learning a prediction function f_{T|S}(·) from O_S, G_{Y_S}, O′_T and G_{Y_T} to predict the truth of G_{Y_T} in each LSO of O″_T.

**Example 4. (Transfer Learning)** Assume D_T is the domain in Example 3 and D_S is a domain with LSOs annotated by {topic: Road, country: IE}; an example of transfer learning is to identify a function using all the LSOs of Dublin traffic and the training LSOs of London traffic (O′_T) for predicting the traffic condition of road r0 in each testing LSO of London traffic (O″_T).

We demonstrate how ontology-based descriptions can drive transfer learning from one domain to another. To this end, similarities between domains are first characterized. We adopt the variability of ABox entailments [Lécué, 2015] in Definition 5, where (10) reflects variant knowledge between two domains while (11) denotes invariant knowledge.

**Definition 5.**
**(Entailment-based Domain Variability)** Given a source learning domain D_S and a target learning domain D_T, let G = G_S ∪ G_T. The variability from D_S to D_T, denoted ∆(O_T, O_S), is given by the ABox entailments:

G_var^[S],[T] = {g ∈ G | g ∉ G_T ∨ g ∉ G_S}   (10)
G_inv^[S],[T] = {g ∈ G | g ∈ G_T ∧ g ∈ G_S}   (11)

**Example 5. (Entailment-based Domain Variability)** Let the ontologies of Figure 3, which capture the contexts of IE and UK, be the ontologies of D_S and D_T respectively. Table 1 illustrates some variabilities between D_S and D_T through ABox entailments. For instance, r1 being a disrupted road in D_S is new (variant) w.r.t. the knowledge in D_T and axioms (1), (9) and (12–15).

*Figure 3: [Top] Source Domain Ontologies O_S in the Context of IE Traffic; [Bottom] Target Domain Ontologies O_T in the Context of UK Traffic.*

| Ontology Variability ∆(D_S, D_T) | ABox entailments |
|---|---|
| variant | Road(r3), Cleared(r1) |
| invariant | Disrupted(r0) |

*Table 1: Examples for Entailment-based Domain Variability.*

## 3 Transferability

We present (i) the variability of semantic learning tasks and (ii) semantic transferability, as a basis for qualifying and quantifying transfer learning (i.e., *when to transfer*), together with (iii) the indicators (i.e., *what to transfer*) driving transferability. These are pivotal properties, as any change in the domains, their transfer function and consistency drastically impacts the quality of the derived models [Long et al., 2015; Chen et al., 2018].

### 3.1 Variability of Semantic Learning Tasks

Definition 6 extends entailment-based ontology variability (Definition 5) to capture learning task variability, where (·)^[Y_S],[Y_T] denotes using the target entailments in (10) and (11).

**Definition 6. (Variability of Semantic Learning Tasks)** Let T_S and T_T be the semantic learning tasks of a source learning domain D_S and a target learning domain D_T. The variability of semantic learning tasks ∆(T_S, T_T) is defined by (22), where |·| refers to the cardinality of a set.
∆(T_S, T_T) = ( |G_var^[S],[T]| / (|G_var^[S],[T]| + |G_inv^[S],[T]|) , |G_var^[Y_S],[Y_T]| / (|G_var^[Y_S],[Y_T]| + |G_inv^[Y_S],[Y_T]|) )   (22)

The variability of semantic learning tasks (22), also written (∆(T_S, T_T)|_O, ∆(T_S, T_T)|_Y), takes values in [0, 1] and captures the variability of the source and target domain LSOs as well as the variability of the target entailments. The higher the values, the stronger the variability. The computation of (22) is in the worst case polynomial time in EL++ w.r.t. the sizes of O_S, O_T, Y_S, Y_T: its evaluation requires (i) ABox entailment and (ii) the basic set-theoretic operations of Definition 5, both in polynomial time [Baader et al., 2005].

**Example 6. (Variability of Semantic Learning Tasks)** The variability of learning tasks between T_S and T_T in Example 4 is (2/3, 0), as the numbers of variant and invariant ABox entailments are respectively 6 and 3, and Y_S = Y_T, i.e., moderate variability of the domains, none for the target variables.

### 3.2 Semantic Transferability — When to Transfer?

We define semantic transferability from a source to a target semantic learning task as the existence of knowledge, captured as ABox entailments [Pan and Thomas, 2007], that stems from the source and has positive effects on the predictive quality of the prediction function of the target semantic learning task.

**Definition 7. (Semantic ε-Transferability)** Let T_S, T_T be source and target semantic learning tasks with entailment closures G_S, G_T. Semantic ε-transferability T_S ⇝_ε T_T occurs from T_S to T_T iff there exists S̃ ⊆ O_S such that:

m(f_{T|S̃}(·)) − m(f_T(·)) > ε   (23)
G_S̃ ⊈ G_T   (24)

where f_{T|S̃}(·) is the predictive function f_T(·) w.r.t. O_T ∪ S̃, G_S̃ is the ABox entailment closure of S̃, and m(·) measures predictive quality. S̃ is knowledge from O_S to be used for outperforming the predictive quality of f_T(·) by a factor ε ∈ (0, 1] (23) while being new with respect to the ABox entailments in G_T (24).

**Example 7. (Semantic ε-Transferability)** Let T_S, T_T be the semantic learning tasks of D_S, D_T in Example 4, let S̃ be the ABox entailment closure of (12–15) in O_S, and let m(f_{T|S̃}(·)) > m(f_T(·)).
Semantic ε-transferability occurs from T_S to T_T as (i) an ε > 0 satisfying condition (23) exists, and (ii) (24) holds w.r.t. S̃, cf. Table 1. Thus, the knowledge S̃ from the IE traffic context (D_S) ensures transferability from D_S to D_T for traffic prediction in the UK.

ABox entailments S̃ satisfying Definition 7 are denoted transferable knowledge, while those contradicting (23), i.e., m(f_{T|S̃}(·)) − m(f_T(·)) ≤ ε, are non-transferable knowledge, as they deteriorate the predictive quality of the target function f_T(·).

**Example 8. (Transferable Knowledge)** Consider the entailments in S̃: (i) Disrupted(r4), derived from (13) and (19–21); (ii) Cleared(r0), derived from (8), (12) and (17–18). As part of the knowledge S̃ positively impacting the quality of the prediction task, they are also separately ε-transferable knowledge with maximal ε of .1 and .07 respectively (computation details omitted).

### 3.3 Consistent Transferable Knowledge

Transferring knowledge across domains can lead to inconsistency. Definition 8 captures knowledge ensuring transferability while maintaining consistency in the target domain.

**Definition 8. (Consistent Transferable Knowledge)** Let S̃ be ABox entailments ensuring T_S ⇝_ε T_T. S̃ is consistent transferable knowledge from T_S to T_T iff S̃ ∪ O_T ⊭ ⊥.

ABox entailments S̃ satisfying S̃ ∪ O_T ⊨ ⊥ are called inconsistent transferable knowledge. They are interesting ABox entailments, as they expose knowledge contradicting the target domain while maintaining transferability. Evaluating whether S̃ is consistent transferable knowledge is in the worst case polynomial time in EL++ w.r.t. the sizes of S̃ and O_T.

**Example 9. ((In)Consistent Transferable Knowledge)** Disrupted(r4) in S̃ of Example 8 is consistent transferable knowledge in T_T, as {Disrupted(r4)} ∪ O_T ⊭ ⊥. On the contrary, Cleared(r0) and Disrupted(r0) in S̃, derived from (16–18), are inconsistent with (7). Thus, Cleared(r0) in O_S is inconsistent transferable knowledge in T_T.
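The Section 3 measurements can be sketched on toy entailment sets. In this minimal illustration the entailments are plain strings and the consistency check of Definition 8 is stubbed by a hand-written set of disjointness pairs; a real system would obtain the closures G_S, G_T and the clash check S̃ ∪ O_T ⊨ ⊥ from an EL++ reasoner:

```python
# Sketch of the Section 3 measurements on invented toy closures.
def variability(g_s, g_t):
    """Definition 5 / domain component of Eq. (22): variant and invariant
    ABox entailments of G = G_S ∪ G_T, and the ratio |var| / (|var| + |inv|)."""
    invariant = g_s & g_t
    variant = (g_s | g_t) - invariant  # entailments missing on one side
    total = len(variant) + len(invariant)
    return variant, invariant, (len(variant) / total if total else 0.0)

def consistent_transferable(candidates, g_t, disjoint_pairs):
    """Definition 8 sketch: keep candidate entailments whose addition to the
    target closure does not clash with a (stubbed) disjointness pair."""
    return {g for g in candidates
            if not any((g == a and b in g_t) or (g == b and a in g_t)
                       for a, b in disjoint_pairs)}

# Toy closures mirroring Example 6's counts: 6 variant, 3 invariant -> 2/3.
g_s = {"Disrupted(r1)", "Road(r3)", "Cleared(r2)", "inv1", "inv2", "inv3"}
g_t = {"Cleared(r1)", "Road(r4)", "Disrupted(r5)", "inv1", "inv2", "inv3"}
var, inv, ratio = variability(g_s, g_t)
print(round(ratio, 3))  # 0.667

# Cleared(r1) holds in the target, so transferring Disrupted(r1) clashes.
disjoint = {("Cleared(r1)", "Disrupted(r1)")}
print(consistent_transferable({"Disrupted(r1)", "Road(r3)"}, g_t, disjoint))
# {'Road(r3)'} — Disrupted(r1) plays the role of inconsistent transferable knowledge
```

The ε-transferability test (23) is not sketched, as it requires training f_T(·) with and without S̃ and comparing the quality metric m(·).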
## 4 Semantic Transfer Learning

We tackle the problem of transfer learning by computing semantic embeddings (i.e., *how to transfer*) for knowledge transfer, and by determining a strategy that exploits the semantics of the learning tasks (Section 3) in Algorithm 1.

### 4.1 Semantic Embeddings — How to Transfer?

The semantics of learning tasks exposes three levels of knowledge which are crucial for transfer learning: variability, transferability and consistency. They are encoded as embeddings through Definitions 9, 10 and 11.

**Definition 9. (Transferability Vector)** Let G = {g_1, ..., g_m} be all distinct ABox entailments in O_S ∪ O_T. A transferability vector from T_S to T_T, denoted t(G), is a vector of dimension m such that for all j ∈ [1, m]: t_j := ε_j if g_j is ε_j-transferable knowledge with maximal ε_j (i.e., there is no ε′_j with ε_j < ε′_j such that g_j is ε′_j-transferable knowledge), and t_j := 0 otherwise.

A transferability vector (Definition 9) adapts the concept of a feature vector [Bishop, 2006] in machine learning to represent the qualitative transferability from source to target of all ABox entailments. Each dimension captures the best transferability of a particular ABox entailment.

**Example 10. (Transferability Vector)** Suppose G := {Disrupted(r4), Cleared(r0)}. The transferability vector t(G) is (.1, .07), cf. the ε-transferability in Example 8.

A consistency vector (Definition 10) is computed from all entailments by evaluating their (in)consistency, either 1 or 0, when transferred into the target semantic learning task. Feature vectors are bound to raw data only, while transferability and consistency vectors, with larger dimensions, embed the transferability and consistency of the data and its inferred assertions. They ensure a larger, more contextual coverage.

**Definition 10. (Consistency Vector)** Let G = {g_1, ..., g_m} be all distinct ABox entailments in O_S ∪ O_T.
A consistency vector from T_S to T_T, denoted c(G), is a vector of dimension m such that for all j ∈ [1, m]: c_j = 1 if {g_j} ∪ O_T ⊭ ⊥, and c_j = 0 otherwise.

The variability vector (Definition 11) is used as an indicator of the semantic variability between the two learning tasks. Its entries are values in [0, 1] with an emphasis on the domain ontologies and/or the label space, depending on its parameterization (α, β). We characterize any variability weight above 1/2 as an inter-domain transfer learning task, and below 1/2 as intra-domain.

**Definition 11. (Variability Vector)** Let G = {g_1, ..., g_m} be the ABox entailments in O_S ∪ O_T. A variability vector v(G, α, β) from T_S to T_T is a vector of dimension m with α, β ∈ [0, 1] such that for all j ∈ [1, m]:

v_j = α · ∆(T_S, T_T)|_O + β · ∆(T_S, T_T)|_Y   (25)

**Example 11. (Variability Vector)** Applying (25) to the variability of semantic learning tasks between T_S and T_T, (2/3, 0) in Example 6, with α = β = 1/2 results in v(G, α, β) = 1/3 in each dimension, which represents moderate variability.

### 4.2 Boosting for Semantic Transfer Learning

Algorithm 1 presents an extension of the transfer learning method TrAdaBoost [Dai et al., 2007] that integrates the semantic embeddings. It aims at learning a predictive function f_{T|S}(·) (line 20) using T_S, O_S, O′_T for T_T. The semantic embeddings of all entailments in G_S ∪ G_T are computed first (lines 7–8); they encode the transferability, consistency and variability effects from the source to the target domain. Then, their importance/weights w are iteratively adjusted (line 9) depending on the evaluation of f^t (lines 13–14), comparing the estimated prediction f^t(e_i) against the real values Y_T(g_i). The base model (lines 11–12), which can be derived from any weak learner, e.g., Logistic Regression, is built on top of all entailments of the source and target tasks. However, entailments from the source might be wrongly predicted due to the task variability ∆(T_S, T_T) (Definition 6, line 8).
Thus, we follow the parameterization of γ and γ_t of [Dai et al., 2007] and decrease the weights of such entailments to reduce their effects (lines 17–19). In the next iteration, the misclassified source entailments, which are dissimilar to the target ones w.r.t. the semantic embeddings, will affect the learning process less than in the current iteration. Finally, StAdaB returns a binary hypothesis (line 20); multi-class classification can be easily applied.

**Algorithm 1:** StAdaB(⟨D_S, T_S⟩, ⟨D_T, T_T⟩, O′_T, G, L, N, α, β)

     1  Input: (i) source/target learning domains and tasks ⟨D_S, T_S⟩, ⟨D_T, T_T⟩,
        (ii) a training LSO set O′_T of the target learning domain,
        (iii) the distinct ABox entailments G = {g_1, ..., g_m} of O_S ∪ O′_T,
        (iv) a base learning algorithm L, (v) max. iterations N, (vi) α, β ∈ [0, 1].
     2  Result: f_{T|S}: a predictive function built from D_S, T_S, O′_T, G_{Y_T} for T_T.
     4  % Initialization of weights for the transferability, consistency
     5  % and variability vectors of all m ABox entailments in G.
     6  Initialize w¹ := (w¹_1, ..., w¹_{3m});
     7  % Computation of the semantic embeddings of all g_i ∈ G.
     8  e_i ← (t(g_i), c(g_i), v(G, α, β)), i ∈ {1, ..., m};
     9  foreach t = 1, 2, ..., N do          % weight computation iteration
    10      p^t ← w^t / Σ_{i=1}^{3m} w^t_i;  % probability distribution of w^t
    11      % Predictive function f^t over O_S ∪ O′_T.
    12      (f^t : e_i ↦ Y_T(e_i)) ← L(e, p^t, Y_T);
    13      % Error computation of f^t on T_T, O′_T.
    14      ψ^t ← Σ_{i | g_i ∈ G_T} w^t_i |f^t(e_i) − Y_T(g_i)| / Σ_{i | g_i ∈ G_T} w^t_i;
    15      % Weights for reducing errors on T_T over iterations.
    16      γ_t ← ψ_t / (1 − ψ_t);   γ ← 1 / (1 + √(2 ln |G_S| / N));
    17      % Weight update of source and target entailments in G,
    18      % using γ_t, γ and the weights w^t_i from the previous iteration:
    19      w^{t+1}_i ← w^t_i · γ_t^{−|f^t(e_i) − Y_T(e_i)|}  if g_i ∈ G_T,
                        w^t_i · γ^{|f^t(e_i) − Y_T(e_i)|}     otherwise;
    20  return the hypothesis ensemble:
            f_{T|S}(e) = 1 if Π_{t=⌈N/2⌉}^{N} γ_t^{−f^t(e)} ≥ Π_{t=⌈N/2⌉}^{N} γ_t^{−1/2}, and 0 otherwise.

A brute-force approach would consist in generating an exponential number of models from any combination of entailments of the source and target. StAdaB reduces this complexity by only evaluating the atomic impact and (approximately) computing the optimal combination. As a side effect, StAdaB exposes the entailments of the source which drive transfer learning (cf. the final weight assignment of the embeddings).

## 5 Experimental Results

StAdaB is evaluated on two intra-domain transfer learning cases: (i) air quality forecasting from Beijing to Hangzhou (IBH) and (ii) traffic condition prediction from London to Dublin (ILD), and one inter-domain case: (iii) from traffic condition prediction in London to air quality forecasting in Beijing (ILB). Accuracy with cross validation is reported. The three tasks have respective variability values v(G, α, β) of .3, .4 and .7; α and β are set to .5.

In IBH¹, air quality knowledge in Beijing (source) is transferred to Hangzhou (target) for forecasting the air quality index, ranging from Good (value 5), Moderate (4), Unhealthy (3), Very Unhealthy (2), Hazardous (1) to Emergent (0). The observations include air pollutants (e.g., PM2.5), meteorology elements (e.g., wind speed) and weather conditions from 12 stations. The semantics of the observations is based on a DL ALEH(D) ontology, including 48 concepts, 15 roles and 598 axioms. 1,065,600 RDF triples are generated on a daily basis. 18 (resp. 5) months of observations are used for training (resp. testing). Even though the ontologies are from the same domain, the proportions of similar concepts and roles are respectively .81 (i.e., 81% of the concepts are similar) and .74; for instance, there is no hazardous air quality concept in Hangzhou.
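Before turning to the remaining cases, the loop of Algorithm 1 can be illustrated with a minimal, self-contained sketch. Everything here is invented for illustration: a one-dimensional weighted decision stump stands in for the weak learner L (the experiments use Logistic Regression), and the 3-tuples stand in for the embeddings e_i = (t(g_i), c(g_i), v(G, α, β)); only the TrAdaBoost-style reweighting and ensemble rule follow the listing:

```python
import math

def stump(embeddings, weights, labels):
    """Weighted decision stump on the first (transferability) dimension:
    pick the threshold/orientation minimising weighted 0/1 error."""
    best = None
    for thr in sorted({e[0] for e in embeddings}):
        for sign in (False, True):
            pred = [int((e[0] >= thr) == sign) for e in embeddings]
            err = sum(w for w, p, y in zip(weights, pred, labels) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda e: int((e[0] >= thr) == sign)

def stadab_sketch(source, target, n_iter=10):
    """source/target: lists of (embedding, label) pairs, labels in {0, 1}.
    Source weights shrink on error (factor gamma < 1), target weights grow
    (factor gamma_t^-1), as in lines 16-19 of Algorithm 1."""
    data, n_s = source + target, len(source)
    w = [1.0] * len(data)
    gamma = 1.0 / (1.0 + math.sqrt(2.0 * math.log(max(n_s, 2)) / n_iter))
    hyps = []
    for _ in range(n_iter):
        total = sum(w)
        p = [wi / total for wi in w]                       # line 10
        f = stump([e for e, _ in data], p, [y for _, y in data])
        tgt_w = sum(w[n_s:])                               # error on target only
        err = sum(w[n_s + i] for i, (e, y) in enumerate(target) if f(e) != y)
        psi = min(max(err / tgt_w, 1e-9), 0.499)           # line 14, clamped
        gamma_t = psi / (1.0 - psi)                        # line 16
        hyps.append((f, gamma_t))
        for i, (e, y) in enumerate(data):                  # line 19
            miss = abs(f(e) - y)
            w[i] *= gamma ** miss if i < n_s else gamma_t ** (-miss)
    half = hyps[len(hyps) // 2:]                           # line 20: last N/2 rounds
    def hypothesis(e):
        lhs = sum(-math.log(g) * f(e) for f, g in half)
        rhs = sum(-0.5 * math.log(g) for _, g in half)
        return int(lhs >= rhs)
    return hypothesis

# Toy run: high transferability (first dimension) correlates with label 1.
source = [((0.85, 1, 0.33), 1), ((0.15, 0, 0.33), 0)]
target = [((0.9, 1, 0.33), 1), ((0.8, 1, 0.33), 1),
          ((0.2, 0, 0.33), 0), ((0.1, 0, 0.33), 0)]
h = stadab_sketch(source, target, n_iter=6)
print(h((0.95, 1, 0.33)), h((0.05, 0, 0.33)))  # 1 0
```

The log-domain comparison in `hypothesis` is equivalent to the product rule of line 20, since Π γ_t^{−f^t(e)} ≥ Π γ_t^{−1/2} iff Σ −ln(γ_t)·f^t(e) ≥ Σ −ln(γ_t)/2.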
In ILD, bus delay knowledge in London (source) is transferred to Dublin (target) for predicting traffic conditions classified as Free (value 4), Low (3), Moderate (2), Heavy (1) and Stopped (0). Source and target domain data include bus location, delay, congestion status and weather conditions. We enrich the data using a DL EL++ domain ontology (55 concepts, 19 roles, 25,456 axioms). 178,700,000 RDF triples are generated on a daily basis. 24 (resp. 8) months of observations are used for training (resp. testing). The concept and role similarities between the two ontologies are respectively .73 and .77.

In ILB, bus delay knowledge in London (source) is transferred to a very different domain: Beijing (target), for forecasting the air quality index. The data and ontologies of IBH and ILD are considered. Both domains share some common and some conflicting knowledge, so inconsistency might occur. For instance, both domains have the concepts of City and of weather such as Wind, but conflict on their importance and impact on the targeted variable, i.e., bus delay in London and air quality in Beijing. The concept and role similarities between the two ontologies are respectively .23 and .17.

### 5.1 Semantic Impact

Table 2 reports the impact of considering semantics (cf. Semantic vs. Basic) and (in)consistency (cf. Consistency/Inconsistency) in the semantic embeddings on Random Forest (RF), Stochastic Gradient Descent (SGD) and AdaBoost (AB). Basic models are models with no semantics attached. Plain models perform modelling and prediction in the target domain only, i.e., no transfer learning, while TL refers to transferring entailments from the source. As expected, semantics positively boosts the accuracy of transfer learning for the intra-domain cases (IBH and ILD), with an average improvement of 13.07% across models. More surprisingly, it even outperforms in the inter-domain case (ILB), with an improvement of 20.03%.

¹Air quality data: https://bit.ly/2BUxKsi. See more about the application and data in [Chen et al., 2015].
Inconsistency has been shown to drive below-baseline accuracy. On the opposite, results are much better when considering consistency, for the intra-domain cases (+63.55%) and the inter-domain case (+187.89%).

| Case | Embedding | RF Plain | RF TL | SGD Plain | SGD TL | AB Plain | AB TL |
|---|---|---|---|---|---|---|---|
| IBH | Basic | .61 | .61 | .59 | .62 | .59 | .63 |
| IBH | Consistency | .65 | .74 | .62 | .69 | .64 | .73 |
| IBH | Inconsistency | .56 | .64 | .52 | .60 | .49 | .63 |
| IBH | Cons. / Incons. | +16.07% | | +19.23% | | +30.61% | |
| IBH | Semantic / Basic | +13.93% | | +8.18% | | +12.17% | |
| ILD | Basic | .68 | .71 | .57 | .62 | .63 | .69 |
| ILD | Consistency | .75 | .78 | .65 | .71 | .75 | .82 |
| ILD | Inconsistency | .44 | .52 | .26 | .49 | .24 | .46 |
| ILD | Cons. / Incons. | +60.22% | | +102.86% | | +152.35% | |
| ILD | Semantic / Basic | +10.07% | | +14.70% | | +19.42% | |
| ILB | Basic | .62 | .65 | .60 | .66 | .61 | .68 |
| ILB | Consistency | .74 | .79 | .69 | .78 | .73 | .85 |
| ILB | Inconsistency | .23 | .45 | .29 | .42 | .18 | .34 |
| ILB | Cons. / Incons. | +153.96% | | +166.25% | | +243.46% | |
| ILB | Semantic / Basic | +20.44% | | +17.33% | | +22.33% | |

*Table 2: Forecasting Accuracy / Improvement over State-of-the-art Models (noted Basic) with Consistency / Inconsistency (consistency ratio .8) based Knowledge Transfer. The Cons./Incons. and Semantic/Basic rows report per-model improvements.*

Figure 4 reports the impact of consistency and inconsistency on transfer learning by analysing how the ratio of consistent transferable knowledge in [0, 1] drives accuracy. Accuracy is reported for the methods of Table 2 on intra-domain (average of IBH and ILD) and inter-domain (ILB) cases. Maximum (resp. minimum) accuracy is reached with a ratio in [.9, .7) (resp. [.3, .1)). The more consistent transferable knowledge, the more transfer, for ratios in [.9, .1). Interestingly, having only consistent (resp. inconsistent) transferable knowledge does not ensure the best (resp. worst) accuracy. This is partially due to under- (resp. over-) populating the target task with conflicting knowledge, ending up with limited transferability.

### 5.2 Comparison with Baselines and Discussion

We compare StAdaB (L = Logistic Regression, N = 800) with Transfer AdaBoost (TrAB) [Dai et al., 2007], Transfer Component Analysis (TCA) [Pan et al., 2011], TrSVM [Benavides-Prado et al., 2017] and SemTr [Lv et al., 2012] (cf. details in Section 6).
We considered the intra-domain cases IBH and ILD, and the inter-domain cases ILB and a variant of ILB restricted to the level of semantic expressivity covered by SemTr.

*Figure 4: Forecasting Accuracy vs. Semantic Consistency. (Accuracy of the RF, SGD and AB models, Plain and TL, for the intra- and inter-domain cases, against the ratio of consistent transferable knowledge bucketed as [1, .9), [.9, .7), [.7, .5), [.5, .3), [.3, .1), [.1, 0].)*

The results show that transfer learning has limitations in the Beijing–Hangzhou context, cf. Figure 5(a). Although our approach outperforms the other techniques (from 10.29% to 50%), accuracy does not exceed 74%. The latter is due to the context, which is limited by (i) the semantic expressivity and (ii) the data availability in Hangzhou. The results also show that TrSVM and TCA reach similar results (average difference of 9.1%) in all cases. However, our approach and TrAB tend to maximise accuracy, especially in the inter-domain cases of Figures 5(c) and 5(d), as both favour heterogeneous domains by design. Interestingly, the semantic context of the reduced ILB variant in Figure 5(d) (i) does not favour SemTr much (+7.46% vs. ILB), (ii) has no impact on StAdaB compared to ILB, and, more surprisingly, (iii) does benefit TrAB (+9.15% vs. ILB). This shows that the expressivity of semantics is crucial for our approach to benefit from (in)consistency in transfer.

*Figure 5: Baseline Comparison of Forecasting Accuracy. (a) Intra-domain IBH; (b) intra-domain ILD; (c) inter-domain ILB; (d) inter-domain ILB variant.*

Adding semantics to domains for transfer learning has clearly shown a positive impact on accuracy, especially in the context of inter-domain transfer. This demonstrates the robustness of models supporting semantics when common/conflicting knowledge is shared.
The expressivity of semantics has also shown positive impacts, especially when (in)consistency can be derived from the domain logics, although some state-of-the-art approaches benefit from taxonomy-like knowledge structures. Our approach also demonstrates that the more semantic axioms, the more robust the model and hence the higher the accuracy, cf. Figure 5(a) vs. 5(b). Data size and axiom numbers are critical, as they drive and control the semantics of the domains and of the transfer, which improves accuracy but not scalability (not reported in the paper). Scalability is worse with more expressive DLs, due to consistency checks, and with limited impact on accuracy. Enough training data in the source domain is required: indeed, logic reasoning cannot help if important data or features are not mapped to the ontology. This is crucial for the training and validation of semantics in transfer learning. Our approach is as robust as other transfer learning approaches; it only differs in valuing transferability at the semantic level.

## 6 Related Work

We briefly divide the related work into instance transfer, model transfer and semantics transfer. Instance transfer selectively reuses source domain samples with weights [Dai et al., 2007]; [Tan et al., 2017] select data points from intermediate domains to obtain a smooth transfer between largely distant domains. Model transfer reuses model parameters, like features, in the target domain. For example, [Pan et al., 2011] introduced transfer component analysis for domain adaptation; [Benavides-Prado et al., 2017] selectively shares the hypothesis components learnt by Support Vector Machines. These methods, however, usually ignore data semantics. Semantics transfer incorporates external knowledge to boost the above two groups, by using semantic nets [Lv et al., 2012] or knowledge graph-structured data [Lee et al., 2017] to derive similarity in data and features, with no reasoning applied.
There are also efforts on Markov Logic Network (MLN) based transfer learning, using first-order [Mihalkova et al., 2007; Mihalkova and Mooney, 2009] or second-order [Davis and Domingos, 2009; Van Haaren et al., 2015] rules as declarative prediction models. However, these approaches do not address the problem of when it is feasible to transfer. Our approach uses OWL reasoning to select transferable samples (addressing *when to transfer*), and then enriches the samples with embedded transferability semantics. It can support different machine learning models (and not just rules).

## 7 Conclusion

We addressed the problem of transfer learning in expressive semantics settings, by exploiting semantic variability, transferability and consistency to deal with *when to transfer* and *what to transfer* for existing instance-based transfer learning methods. Our approach has been shown to be robust for both intra- and inter-domain transfer learning tasks from real-world applications in Dublin, London, Beijing and Hangzhou. As future work, we will investigate the limits and explanations of transferability with more expressive semantics (e.g., based on approximate reasoning) [Pan et al., 2016; Du et al., 2019].

### Acknowledgments

This work is partially funded by NSFC 91846204.

## References

[Baader et al., 2005] Franz Baader, Sebastian Brandt, and Carsten Lutz. Pushing the EL envelope. In IJCAI, pages 364–369, 2005.

[Bechhofer et al., 2004] Sean Bechhofer, Frank van Harmelen, Jim Hendler, Ian Horrocks, Deborah L. McGuinness, Peter F. Patel-Schneider, Lynn Andrea Stein, et al. OWL Web Ontology Language reference. W3C Recommendation, 10(02), 2004.

[Benavides-Prado et al., 2017] Diana Benavides-Prado, Yun Sing Koh, and Patricia Riddle. AccGenSVM: Selectively transferring from previous hypotheses. In IJCAI, pages 1440–1446, 2017.

[Bishop, 2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[Chen et al., 2015] Jiaoyan Chen, Huajun Chen, Daning Hu, Jeff Z. Pan, and Yalin Zhou. Smog disaster forecasting using social web data and physical sensor data. In 2015 IEEE International Conference on Big Data (Big Data), pages 991–998. IEEE, 2015.
[Chen et al., 2017] Jiaoyan Chen, Freddy Lécué, Jeff Z. Pan, and Huajun Chen. Learning from ontology streams with semantic concept drift. In IJCAI, pages 957–963, 2017.
[Chen et al., 2018] Jiaoyan Chen, Freddy Lécué, Jeff Z. Pan, Ian Horrocks, and Huajun Chen. Knowledge-based transfer learning explanation. In KR, pages 349–358, 2018.
[Choi et al., 2016] Jonghyun Choi, Sung Ju Hwang, Leonid Sigal, and Larry S. Davis. Knowledge transfer with interactive learning of semantic relationships. In AAAI, pages 1505–1511, 2016.
[Dai et al., 2007] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, pages 193–200. ACM, 2007.
[Dai et al., 2009] Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated learning: Transfer learning across different feature spaces. In Advances in Neural Information Processing Systems, pages 353–360, 2009.
[Davis and Domingos, 2009] Jesse Davis and Pedro Domingos. Deep transfer via second-order Markov logic. In ICML, pages 217–224, 2009.
[Du et al., 2019] Jianfeng Du, Jeff Z. Pan, Sylvia Wang, Yuming Shen, Kunxun Qi, and Yu Deng. Validation of growing knowledge graphs by abductive text evidences. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), 2019.
[Lécué and Pan, 2015] Freddy Lécué and Jeff Z. Pan. Consistent knowledge discovery from evolving ontologies. In AAAI, pages 189–195, 2015.
[Lécué, 2015] Freddy Lécué. Scalable maintenance of knowledge discovery in an ontology stream. In IJCAI, pages 1457–1463, 2015.
[Lee et al., 2017] Jaekoo Lee, Hyunjae Kim, Jongsun Lee, and Sungroh Yoon.
Transfer learning for deep learning on graph-structured data. In AAAI, pages 2154–2160, 2017.
[Long et al., 2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 97–105, 2015.
[Lv et al., 2012] Wenlong Lv, Weiran Xu, and Jun Guo. Transfer learning in classification based on semantic analysis. In 2nd International Conference on Computer Science and Network Technology (ICCSNT), pages 1336–1339. IEEE, 2012.
[Mihalkova and Mooney, 2009] Lilyana Mihalkova and Raymond J. Mooney. Transfer learning from minimal target data by mapping across relational domains, 2009.
[Mihalkova et al., 2007] Lilyana Mihalkova, Tuyen Huynh, and Raymond J. Mooney. Mapping and revising Markov logic networks for transfer learning. In AAAI, pages 608–614, 2007.
[Nickel et al., 2016] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
[Pan and Thomas, 2007] Jeff Z. Pan and Edward Thomas. Approximating OWL-DL ontologies. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI-07), pages 1434–1439, 2007.
[Pan and Yang, 2010] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[Pan et al., 2011] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
[Pan et al., 2016] Jeff Z. Pan, Yuan Ren, and Yuting Zhao. Tractable approximate deduction for OWL. Artificial Intelligence, pages 95–155, 2016.
[Tan et al., 2017] Ben Tan, Yu Zhang, Sinno Jialin Pan, and Qiang Yang. Distant domain transfer learning. In AAAI, pages 2604–2610, 2017.
[Van Haaren et al., 2015] Jan Van Haaren, Andrey Kolobov, and Jesse Davis. TODTLER: Two-order-deep transfer learning. In AAAI, pages 3007–3015, 2015.
[Vicient et al., 2013] Carlos Vicient, David Sánchez, and Antonio Moreno. An automatic approach for ontology-based feature extraction from heterogeneous textual resources. Engineering Applications of Artificial Intelligence, 26(3):1092–1106, 2013.
[Weiss et al., 2016] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.