Human activity recognition in videos is a challenging problem that has drawn a lot of interest, particularly when the goal requires the analysis of a large video database. AOLME project provides a collaborative learning environment for middle school students to explore mathematics, computer science, and engineering by processing digital images and videos. As part of this project, around 2200 hours of video data was collected for analysis. Because of the size of the dataset, it is hard to analyze all the videos of the dataset manually. Thus, there is a huge need for reliable computer-based methods that can detect activities of interest. My thesis is focused on the development of accurate methods for detecting and tracking objects in long videos. All the models are validated on videos from 7 different sessions, ranging from 45 minutes to 90 minutes. The keyboard detector achieved a very high average precision (AP) of 92% at 0.5 intersection over union (IoU). Furthermore, a combined system of the detector with a fast tracker KCF (159fps) was developed so that the algorithm runs significantly faster without sacrificing accuracy. For a video of 23 minutes having resolution 858X480 @ 30 fps, the detection alone runs at 4.7Xthe real-time, and the combined algorithm runs at 21Xthe real-time for an average IoU of 0.84 and 0.82, respectively. The hand detector achieved average precision (AP) of 72% at 0.5 IoU. The detection results were improved to 81% using optimal data augmentation parameters. The hand detector runs at 4.7Xthe real-time with AP of 81% at 0.5 IoU. The hand detection method was integrated with projections and clustering for accurate proposal generation. This approach reduced the number of false-positive hand detections by 80%. The overall hand detection system runs at 4Xthe real-time, capturing all the activity regions of the current collaborative group.