Mental health conditions such as depression and post-traumatic stress disorder (PTSD) greatly affect an individual's general well-being, which makes early detection and precise diagnosis important for prompt clinical intervention. This paper presents a multimodal deep learning system for the automated classification of PTSD and depression. Using textual and audio data from clinical interview datasets, the method fuses features extracted from both modalities by combining LSTM (Long Short-Term Memory) and BiLSTM (Bidirectional Long Short-Term Memory) architectures. Text features capture the semantic and grammatical components of speech, while audio features capture vocal traits including rhythm, tone, and pitch. Combining the modalities improves the model's capacity to identify subtle patterns associated with mental health conditions. On the test datasets, the proposed method achieves classification accuracies of 92% for depression and 93% for PTSD, outperforming traditional unimodal approaches and demonstrating its accuracy and robustness.
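As a rough illustration of the fusion idea described above, the following minimal sketch encodes a text sequence with a BiLSTM and an audio sequence with an LSTM, concatenates the two encodings, and applies a logistic output layer. It is not the paper's implementation. The assignment of BiLSTM to text and LSTM to audio, the fusion by concatenation, all dimensions, the random untrained weights, and the helper names (`init_lstm`, `run_lstm`, `run_bilstm`, `build_model`, `classify`) are assumptions made for illustration only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random, untrained weights for the four gates (input, forget, cell, output)."""
    rows, cols = 4 * hidden_dim, input_dim + hidden_dim
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W": weights, "b": [0.0] * rows, "hidden_dim": hidden_dim}


def run_lstm(params, sequence):
    """Run an LSTM over a sequence of vectors; return the final hidden state."""
    n = params["hidden_dim"]
    h, c = [0.0] * n, [0.0] * n
    for x in sequence:
        xh = list(x) + h  # current input concatenated with previous hidden state
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(params["W"], params["b"])]
        i = [sigmoid(v) for v in z[0:n]]            # input gate
        f = [sigmoid(v) for v in z[n:2 * n]]        # forget gate
        g = [math.tanh(v) for v in z[2 * n:3 * n]]  # candidate cell state
        o = [sigmoid(v) for v in z[3 * n:4 * n]]    # output gate
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def run_bilstm(fwd, bwd, sequence):
    """Concatenate the final states of a forward pass and a reversed pass."""
    return run_lstm(fwd, sequence) + run_lstm(bwd, list(reversed(sequence)))


def build_model(text_dim, audio_dim, hidden_dim, seed=0):
    """Hypothetical two-branch model; weights are random, not trained."""
    rng = random.Random(seed)
    return {
        "text_fwd": init_lstm(text_dim, hidden_dim, rng),
        "text_bwd": init_lstm(text_dim, hidden_dim, rng),
        "audio": init_lstm(audio_dim, hidden_dim, rng),
        "out_w": [rng.uniform(-0.1, 0.1) for _ in range(3 * hidden_dim)],
        "out_b": 0.0,
    }


def classify(text_seq, audio_seq, model):
    """Return (probability, fused feature vector) for one interview sample."""
    text_vec = run_bilstm(model["text_fwd"], model["text_bwd"], text_seq)
    audio_vec = run_lstm(model["audio"], audio_seq)
    fused = text_vec + audio_vec  # feature-level fusion by concatenation
    logit = sum(w * v for w, v in zip(model["out_w"], fused)) + model["out_b"]
    return sigmoid(logit), fused


# Synthetic stand-ins for word embeddings (6 tokens x 8 dims)
# and acoustic frames (10 frames x 5 dims); real features would differ.
data_rng = random.Random(1)
model = build_model(text_dim=8, audio_dim=5, hidden_dim=4)
text_seq = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
audio_seq = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(10)]
prob, fused = classify(text_seq, audio_seq, model)
print(len(fused), round(prob, 3))
```

Because the weights are untrained, the printed probability carries no diagnostic meaning. The sketch only shows how the two modality encodings can be joined into a single feature vector before classification. In practice the same structure would typically be built and trained with a deep learning framework.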