Consider a set of n images of a scene with dynamic objects, captured with a static or a handheld camera, whose temporal order is unknown. There are n! possible orders in which these images could have been captured. In this work, we tackle the problem of temporally sequencing an unordered set of images of a dynamic scene captured with a handheld camera. We propose a convolutional block that captures spatial information through 2D convolution kernels and captures temporal information from the differences among the feature maps extracted from the input images. We evaluate the proposed approach on a dataset extracted from UCF101, a standard action recognition dataset, and show that it outperforms state-of-the-art methods by a significant margin. We also show that the network generalizes well: trained on the dataset extracted from UCF101, it is evaluated on a dataset extracted from DAVIS, a dataset meant for video object segmentation.
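
To give a sense of how quickly the n! candidate orderings grow with the number of images, the following is a minimal Python sketch that counts and enumerates them for small n. The function names `num_orderings` and `all_orderings` are hypothetical and introduced here only for illustration. Brute-force enumeration is shown solely to convey the size of the search space; it is not the proposed method.

```python
import math
from itertools import permutations


def num_orderings(n: int) -> int:
    """Number of possible temporal orderings of n distinct images (n!)."""
    return math.factorial(n)


def all_orderings(n: int) -> list:
    """Enumerate every candidate ordering of image indices 0..n-1.

    Illustrative only: this is feasible just for very small n.
    """
    return list(permutations(range(n)))


if __name__ == "__main__":
    # Print the size of the search space for a few set sizes.
    for n in (3, 5, 10):
        print(n, num_orderings(n))
    # Each ordering is a tuple of image indices, e.g. (0, 1, 2).
    print(all_orderings(3))
```

For n = 3, 5, and 10 the counts are 6, 120, and 3628800 respectively, so exhaustive search over orderings becomes impractical even for modest set sizes.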