Human behavior understanding is arguably one of the most important mid-level components in artificial intelligence. To make efficient use of data, multi-task learning has been studied in diverse computer vision tasks, including human behavior understanding. However, multi-task learning relies on task-specific datasets, and constructing such datasets can be cumbersome: it requires large amounts of data, substantial labeling effort, and careful statistical consideration. In this paper, we leverage existing single-task datasets for human action classification and for captioning to learn human behavior efficiently. Because each dataset carries its own heterogeneous annotations, traditional multi-task learning is not effective in this scenario. To this end, we propose a novel alternating directional optimization method to learn efficiently from the heterogeneous data. We demonstrate the effectiveness of our model and show performance improvements on both classification and sentence retrieval tasks compared to models trained on each single-task dataset alone.