Video object detection (VID) has been vigorously studied for years but almost all literature adopts a static accuracy-based evaluation, i.e., mean average precision (mAP). From a temporal perspective, the importance of recall continuity and localization stability is equal to that of accuracy, but the mAP is insufficient to reflect detectors' performance across time. In this paper, non-reference assessments are proposed for continuity and stability based on tubelets from multi-object tracking (MOT). These temporal evaluations can serve as supplements to static mAP. Further, we develop tubelet refinement for improving detectors' performance on temporal continuity and stability through short tubelet suppression, fragment filling, and history-present fusion. In addition, we propose a small-overlap suppression to extend VID methods to single object tracking (SOT) task. The VID-based SOT does not need MOT or traditional SOT model. A unified VID-MOT-SOT framework is then formed. Extensive experiments are conducted on ImageNet VID dataset, where the superiority of our proposed approaches are validated and verified. Codes will be publicly available.