With recent developments in smart technologies, there has been a growing focus on the use of artificial intelligence and machine learning for affective computing to further enhance the user experience through emotion recognition. Typically, machine learning models used for affective computing are trained using manually extracted features from biological signals. Such features may not generalize well for large datasets and may be sub-optimal in capturing the information from the raw input data. One approach to address this issue is to use fully supervised deep learning methods to learn latent representations of the biosignals. However, this method requires human supervision to label the data, which may be unavailable or difficult to obtain. In this work we propose an unsupervised framework reduce the reliance on human supervision. The proposed framework utilizes two stacked convolutional autoencoders to learn latent representations from wearable electrocardiogram (ECG) and electrodermal activity (EDA) signals. These representations are utilized within a random forest model for binary arousal classification. This approach reduces human supervision and enables the aggregation of datasets allowing for higher generalizability. To validate this framework, an aggregated dataset comprised of the AMIGOS, ASCERTAIN, CLEAS, and MAHNOB-HCI datasets is created. The results of our proposed method are compared with using convolutional neural networks, as well as methods that employ manual extraction of hand-crafted features. The methodology used for fusing the two modalities is also investigated. Lastly, we show that our method outperforms current state-of-the-art results that have performed arousal detection on the same datasets using ECG and EDA biosignals. The results show the wide-spread applicability for stacked convolutional autoencoders to be used with machine learning for affective computing.