# predicting_subpopulation_specific_viral_evolution__422a5cdb.pdf Published in Transactions on Machine Learning Research (03/2025) Predicting sub-population specific viral evolution Wenxian Shi wxsh@mit.edu Department of Computer Science Massachusetts Institute of Technology Menghua Wu rmwu@mit.edu Department of Computer Science Massachusetts Institute of Technology Regina Barzilay regina@csail.mit.edu Department of Computer Science Massachusetts Institute of Technology Reviewed on Open Review: https: // openreview. net/ forum? id= Mae23i Eq PS Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making sub-population specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different sub-populations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-Co V-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis. The code is available at https://github.com/wxsh1213/vaxseer/tree/main/transmission. 1 Introduction Modeling virus evolution is important for vaccine development and disease monitoring. Specifically, the goal is to predict the change in the distribution of viral variants over time. Current approaches aggregate data across all locations of virus collection, resulting in models that represent monolithic, global distributions of each virus. However, this global landscape is composed of local sub-communities of variants with distinct but interdependent distributions. For example, in January 2021, the most common strain of SARS-Co V2 was different in Europe (B.1.1.7) and Asia (B.1.1.214), but two months later, the former had spread rapidly across Asia (Fig. 1). Faithfully modeling these dynamics is essential for understanding how the viral landscape evolves locally and designing vaccines effective for each community, especially those with less access to public health resources and are poorly represented by the data. In this paper, we aim to predict the likelihood of a viral variant occurring at a particular time in specific sub-populations, such as the geographical location where the viral sample was collected. Viral variants are represented by the amino-acid sequence of the primary antigenic protein responsible for infection. A Published in Transactions on Machine Learning Research (03/2025) Figure 1: SARS-Co V-2 clade distributions differ by geographical location and change interdependently over time. M K A I I A Transformer Transmission rate 饾拺饾挄(饾挋饾煍|饾挋饾煄:饾煋; 饾拪) |饾憠|: # of amino acids Sub-population: 饾拪. Date: 饾挄 饾憵: # of sub-populations 饾懆饾挋饾煍= 饾懓饾挋饾煄:饾煋 饾懆饾挋饾煍= 饾懆饾挋饾煄:饾煋 饾懆饾挋饾煍= 饾懖饾挋饾煄:饾煋 Figure 2: The transmission model: an auto-regressive, time-resolved generative model, parametrized in terms of transmission rate matrices. straightforward approach for modeling location-specific evolution is to separately analyze data for each location, assuming independent evolution. Inherently, this approach leads to data fragmentation, exacerbating the data sparsity problem. Moreover, this approach fails to capture transitions between sub-communities that shape local distributions, especially for highly transmissible diseases. Traditional epidemiology literature has studied transmission dynamics extensively (Sattenspiel & Dietz, 1995; Goel & Sharma, 2020; Balcan et al., 2010; 2009; Colizza et al., 2006), but the hand-crafted rules they employ fall short of capturing the actual complexity of these interactions. We propose a novel approach to model evolving protein distributions for sub-populations by explicitly capturing the transmissions between them. Specifically, the change in the occurrence of a protein sequence over time across all sub-populations is defined by an ordinary differential equation (ODE), parametrized by a transmission rate matrix related to that sequence. Solving this ODE yields the time-resolved probability of the given protein in different sub-populations, which is determined by the transmission rate matrix and boundary conditions. We leverage language models (Radford et al., 2019) to model the transmission rate matrix and boundary conditions. A practical challenge is that the number of transmission rates scales quadratically with the number of sub-populations, rendering eigen-calculations inefficient for fine-grained sub-populations (e.g. countries). To address this issue, we also propose a hierarchical variation of our model, which leverages the inherent structures that relate sub-populations, e.g. countries can be grouped by their geographical proximity. Specifically, we re-parameterize the transmission rate matrix in terms of intragroup and inter-group transmissions. Both versions of our method are advantageous for sub-populations with limited data, which can benefit from knowledge transferred via the transmission rates with other subpopulations. Moreover, the association defined by the transmission rates can be learned from the occurrences of variants in the dataset, eliminating the need for hand-crafted knowledge. We evaluate our approach on two viruses, influenza A/H3N2 and SARS-Co V-2, across multiple years, in over thirty countries and all six continents. The evaluation shows that our model outperforms state-of-the-art baselines in predicting future distributions of viral proteins. For SARS-Co V-2, the top 500 protein sequences generated by our model can cover 93.4% of the circulating sequences reported in various countries. Moreover, the transmission rates learned from the data are well aligned with the transmission pathways of viral variants discovered independently by phylogenetic analysis. Finally, our experiments focus primarily on geographical location due to data abundance, but our framework can be easily applied to various types of sub-populations, including individuals of varying ages and vaccination histories, as well as their combinations. Let x = (x1, ..., xl) V l denote the amino-acid sequence of the viral protein responsible for infections. For example, we model the Hemagglutinin (HA) protein for influenza and the Receptor Binding Domain (RBD) Published in Transactions on Machine Learning Research (03/2025) of Spike protein for SARS-Co V-2. V is the set of amino acids, and l is the length of the protein sequence. Our goal is to model pt(x; i), the likelihood of protein sequence x isolated at time t in sub-population i S. S is the set of m sub-populations, i.e., |S| = m. Throughout this paper, we use location, defined as the geographical region where the viral sample was collected, interchangeably with sub-population. 2.1 Transmission model The transmission model parameterizes pt(x; i) through the transmission rate of x between geographical locations. The model outputs Nt(x), the un-normalized probability (occurrence) of x at time t in m locations: Nt(x) = [nt(x; 1), ..., nt(x; m)]. We assume that the derivative of occurrence over time is a linear transformation of Nt(x), following Kermack & Mc Kendrick (1927); Obermeyer et al. (2022): dt = A(x; 胃)Nt(x), (1) in which A(x; 胃) Rm m + is called the transmission rate matrix of protein x. Intuitively, [A(x; 胃)]ij measures the number of people in location i infected by one person from location j during dt. In this work, the transmission rate matrix is output by a neural network parameterized by 胃 that takes x as input. For concision, we drop 胃 in the following equations wherever clear. The ordinary differential equation in Eq. 1 has a closed-form solution, j=1 uij(x) exp(位j(x)t) cj(x); cj(x) = i=1 U(x) 1 ji n0(x; i), (2) in which 位j(x) is the j-th eigenvalue of matrix A(x), and uij(x) is the i-th dimension of the j-th eigenvector of A(x). cj(x) is determined by the boundary conditions N0(x). U 1 is the inverse of eigenvector matrix U. The derivation can be found in Appendix A.1. Intuitively, 位j(x) describes the fitness of protein x in a latent location j. Larger 位j(x) indicates that strain x will reproduce faster in the latent location j. The eigenvectors combine latent locations with different growth rates 位j and determine the distributions of x in real locations. To calculate the probability of the sequence x in location i at time t, we need to normalize the occurrence nt(x; i), which is intractable for massive space of sequences. Inspired by autoregressive language models, instead of modeling the occurrence of all sequences, we model the occurrence of the s-th amino acid xs given the prefix amino acids x