The vast majority of deep learning methods are built to automate diagnostic tasks. In clinical practice, however, a more advanced question is how to predict the course of a disease. Current methods for this problem are complicated and often require domain knowledge, making them difficult for practitioners to use. In this paper, we formulate the prognosis prediction task as a one-to-many sequence prediction problem. Inspired by a clinical decision-making process with two agents, a radiologist and a general practitioner, we propose a generic end-to-end transformer-based framework to estimate disease prognosis from images and auxiliary data. We validate the developed method and demonstrate its effectiveness on synthetic data and on the task of predicting the development of structural osteoarthritic changes in knee joints.