Abstract Text: Predicting which viral strains will become predominant in human populations is challenging because of the complex relationship between the structure and composition of viruses and factors such as binding affinity and immunogenicity. Yet a comprehensive understanding and anticipation of viral evolution is crucial for effective vaccine design and therapeutic interventions. Traditional methods in virology have been limited by considering sequence or structural data in isolation. To address this limitation, we introduce ViSENet (Viral Sequence Evolution Network), a model that learns joint representations from protein structure and sequence data, together with key properties such as binding affinity and immune escape, to build a more holistic model of viral evolution. ViSENet employs a deep transformer-based autoencoder to learn chronologically organized embeddings of viral spike protein sequences, from which the sequences can be reconstructed. This is complemented by a geometric scattering autoencoder that captures the tertiary structure of relevant spike protein domains from AlphaFold-predicted structures. By combining these two representations in a joint latent space, our model surpasses the predictive performance of previous unimodal methods. Furthermore, by using a neural ordinary differential equation (ODE) to model trajectories through the joint latent space, our model can project viral evolutionary dynamics several weeks into the future. We demonstrate these forecasting capabilities through accurate predictions of COVID-19 and influenza evolutionary patterns, showcasing the potential of our approach to guide future vaccine and therapeutic development.
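The abstract does not specify ViSENet's vector-field architecture or ODE solver. The following is a minimal sketch of how a neural ODE can extrapolate a latent embedding forward in time. It assumes a one-hidden-layer MLP vector field with placeholder (untrained) weights and a fixed-step fourth-order Runge-Kutta (RK4) integrator. The names `mlp_vector_field` and `rk4_integrate`, the latent dimension, and the weekly time grid are hypothetical. Training the vector field and decoding forecast latent points back into sequences are omitted.

```python
import numpy as np

def mlp_vector_field(params):
    """Hypothetical learned dynamics: f(z) = W2 @ tanh(W1 @ z + b1) + b2."""
    W1, b1, W2, b2 = params
    def f(z):
        return W2 @ np.tanh(W1 @ z + b1) + b2
    return f

def rk4_integrate(f, z0, t_grid):
    """Integrate dz/dt = f(z) with classical fixed-step RK4 over t_grid.

    Returns an array of shape (len(t_grid), dim) whose first row is z0.
    """
    z = np.asarray(z0, dtype=float)
    traj = [z.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = t1 - t0
        k1 = f(z)
        k2 = f(z + 0.5 * h * k1)
        k3 = f(z + 0.5 * h * k2)
        k4 = f(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(z.copy())
    return np.stack(traj)

# Placeholder setup: random small weights stand in for a trained vector field,
# and a random vector stands in for the most recent joint latent embedding.
rng = np.random.default_rng(0)
dim, hidden = 8, 32
params = (0.1 * rng.standard_normal((hidden, dim)), np.zeros(hidden),
          0.1 * rng.standard_normal((dim, hidden)), np.zeros(dim))
f = mlp_vector_field(params)
z_last = rng.standard_normal(dim)

# Forecast six weeks ahead in 0.1-week steps, then keep one point per week.
t_weeks = np.linspace(0.0, 6.0, 61)
traj = rk4_integrate(f, z_last, t_weeks)
forecast = traj[::10]
print(forecast.shape)  # (7, 8)
```

In a trained model, the vector field's weights would be fitted so that integrated trajectories match the chronologically ordered embeddings. Adaptive-step solvers such as `scipy.integrate.solve_ivp` are a common alternative to fixed-step RK4.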