In recent years, unmanned aerial vehicles (UAVs) have played an increasingly crucial role in supporting disaster emergency response efforts by analyzing aerial images. While current deep-learning models focus on improving accuracy, they often overlook the limited computing resources of UAVs. This study recognizes the imperative for real-time data processing in disaster response scenarios and introduces a lightweight and efficient approach for aerial video understanding. Our methodology identifies redundant portions within the video through policy networks and eliminates this excess information using frame compression techniques. Additionally, we introduced the concept of a `station point,' which leverages future information in the sequential policy network, thereby enhancing accuracy. To validate our method, we employed the wildfire FLAME dataset. Compared to the baseline, our approach reduces computation costs by more than 13 times while boosting accuracy by 3$\%$. Moreover, our method can intelligently select salient frames from the video, refining the dataset. This feature enables sophisticated models to be effectively trained on a smaller dataset, significantly reducing the time spent during the training process.