An integrated approach across visual and textual data is proposed that uses a neural network to both determine and justify a medical diagnosis. As deep learning techniques improve, interest in applying them to medical applications grows. To enable a transition to machine-learning-aided workflows in a medical context, such algorithms must help justify their outcomes so that human clinicians can judge their validity. In this work, deep learning methods are used to map a frontal X-ray image to a continuous textual representation, which is decoded into a diagnosis and an associated textual justification that helps a clinician evaluate the outcome. Additional explanatory evidence for the diagnosis is provided by generating a realistic X-ray belonging to the nearest alternative diagnosis. In a clinical expert opinion study on a subset of the X-ray data set from the Indiana University hospital network, we demonstrate that our justification mechanism significantly outperforms existing methods based on saliency maps. Despite being trained on multiple tasks with multiple loss functions, our method achieves excellent diagnosis accuracy and captioning quality compared to current state-of-the-art single-task methods.