Recent works using artificial neural networks based on distributed word representation greatly boost performance on various natural language processing tasks, especially the answer selection problem. Nevertheless, most of the previous works used deep learning methods (like LSTM-RNN, CNN, etc.) only to capture semantic representation of each sentence separately, without considering the interdependence between each other. In this paper, we propose a novel end-to-end learning framework which constitutes deep convolutional neural network based on multi-modal similarity metric learning (M$^2$S-Net) on pairwise tokens. The proposed model demonstrates its performance by surpassing previous state-of-the-art systems on the answer selection benchmark, i.e., TREC-QA dataset, in both MAP and MRR metrics.