In this work we propose a novel Convolutional Neural Network (CNN) architecture for matching pairs of image patches acquired by different sensors. Our approach utilizes two CNN sub-networks: the first is a Siamese CNN, and the second consists of dual non-weight-sharing CNNs. This allows simultaneous joint and disjoint processing of the input pair of multimodal image patches. Training convergence and test accuracy are improved by introducing auxiliary losses and a corresponding hard negative mining scheme. The proposed approach is experimentally shown to compare favorably with contemporary state-of-the-art schemes when applied to multiple datasets of multimodal images. The code implementing the proposed scheme is publicly available.
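As a rough illustration of the described two-branch design (a minimal sketch, not the authors' released implementation), the following PyTorch snippet combines a weight-sharing (Siamese) branch with a non-weight-sharing branch and attaches per-branch auxiliary classification heads alongside a fused main head. All layer sizes, head names, and the auxiliary loss weight are assumptions made for illustration only.

```python
# Minimal sketch of a hybrid Siamese / non-weight-sharing patch matcher.
# Layer sizes, head names, and the 0.3 auxiliary-loss weight are assumed.
import torch
import torch.nn as nn

def conv_trunk():
    # Small convolutional feature extractor for a single grayscale patch.
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),  # -> 64*4*4 features per patch
    )

class HybridMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.siamese = conv_trunk()    # shared weights: applied to both patches (joint processing)
        self.branch_a = conv_trunk()   # modality-specific trunks (disjoint processing)
        self.branch_b = conv_trunk()
        feat = 64 * 4 * 4
        self.aux_head_siamese = nn.Linear(2 * feat, 2)   # auxiliary match/non-match head
        self.aux_head_disjoint = nn.Linear(2 * feat, 2)  # auxiliary head for the disjoint branch
        self.main_head = nn.Linear(4 * feat, 2)          # fused decision head

    def forward(self, patch_a, patch_b):
        s = torch.cat([self.siamese(patch_a), self.siamese(patch_b)], dim=1)
        d = torch.cat([self.branch_a(patch_a), self.branch_b(patch_b)], dim=1)
        return (self.main_head(torch.cat([s, d], dim=1)),
                self.aux_head_siamese(s),
                self.aux_head_disjoint(d))

# Assumed training step: main loss plus weighted auxiliary losses on both branches.
model = HybridMatcher()
criterion = nn.CrossEntropyLoss()
a, b = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
labels = torch.randint(0, 2, (8,))
main, aux_s, aux_d = model(a, b)
loss = criterion(main, labels) + 0.3 * (criterion(aux_s, labels) + criterion(aux_d, labels))
loss.backward()
```

In this sketch the auxiliary heads provide additional gradient signal to each branch during training, which is the role the abstract attributes to the auxiliary losses; the hard negative mining step is omitted.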