Abstract: In this work, we address the challenge of multi-class anomaly classification in Video Capsule Endoscopy (VCE) [1] with a variety of deep learning models, ranging from custom CNNs to advanced transformer architectures. The goal is to correctly classify diverse gastrointestinal disorders, which is critical for improving diagnostic efficiency in clinical settings. We started with a custom CNN and improved performance with ResNet [7] for better feature extraction, followed by a Vision Transformer (ViT) [2] to capture global dependencies. The Multiscale Vision Transformer (MViT) [6] improved hierarchical feature extraction, while the Dual Attention Vision Transformer (DaViT) [4] delivered state-of-the-art results by combining spatial and channel attention mechanisms. This methodology enabled us to improve model accuracy across a wide range of metrics, substantially surpassing earlier methods.
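As a rough illustration of the backbone-swapping strategy described in this abstract, the sketch below fine-tunes an ImageNet-pretrained ResNet for multi-class classification. The number of anomaly classes, the optimizer settings, and the training loop are assumptions, not details taken from the paper; the same head-replacement pattern would apply when substituting a ViT, MViT, or DaViT backbone.

```python
# Minimal sketch: fine-tuning a ResNet backbone for multi-class VCE anomaly
# classification. NUM_CLASSES and the hyperparameters are assumptions, not
# values reported in the paper.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical number of anomaly categories

# Load an ImageNet-pretrained ResNet-50 and replace its classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step over a batch of capsule-endoscopy frames."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```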
Abstract: In this paper, we address the problem of temporal action localization in a large-scale dataset of untrimmed cricket videos. Our action of interest is a cricket stroke played by a batsman, which is usually covered by cameras placed in the stands of the cricket ground at both ends of the pitch. After applying a sequence of preprocessing steps, the dataset contains ~73 million frames across 1110 videos at a constant frame rate and resolution. The localization method is a generalized one: it applies a trained random forest model for CUT detection (using summed grayscale histogram difference features) and two linear SVM camera models (CAM1 and CAM2) for first-frame detection, trained on HOG features of CAM1 and CAM2 video shots. CAM1 and CAM2 shots are assumed to be part of the cricket stroke. At the predicted boundary positions, HOG features of the first frames are computed, and a simple algorithm combines the positively predicted camera shots. To keep the process as generic as possible, we do not use any domain-specific knowledge, such as tracking or specific shape and motion features. We provide a detailed analysis of our methodology, along with the metrics used to evaluate the individual models and the final predicted segments. We achieved a weighted mean TIoU of 0.5097 over a small sample of the test set.
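The sketch below illustrates the kind of summed grayscale histogram difference feature that could feed the CUT-detection random forest described in this abstract; the bin count, the normalization, and the function name are assumptions rather than details from the paper.

```python
# Minimal sketch of a summed grayscale-histogram-difference feature for
# shot-boundary (CUT) detection. Bin count and normalization are assumptions,
# not taken from the paper.
import cv2
import numpy as np

def hist_diff_features(video_path: str, bins: int = 64) -> np.ndarray:
    """Return one absolute histogram-difference value per consecutive frame pair."""
    cap = cv2.VideoCapture(video_path)
    diffs, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
        hist /= hist.sum()  # normalize so the feature is resolution-independent
        if prev_hist is not None:
            diffs.append(np.abs(hist - prev_hist).sum())
        prev_hist = hist
    cap.release()
    return np.asarray(diffs)
```

Large values in the returned array would mark candidate CUT positions, which a trained classifier can then accept or reject.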