We consider the reconstruction problem of video compressive sensing (VCS) under the deep unfolding/rolling structure. Yet, we aim to build a flexible and concise model using minimum stages. Different from existing deep unfolding networks used for inverse problems, where more stages are used for higher performance but without flexibility to different masks and scales, hereby we show that a 2-stage deep unfolding network can lead to the state-of-the-art (SOTA) results (with a 1.7dB gain in PSNR over the single stage model, RevSCI) in VCS. The proposed method possesses the properties of adaptation to new masks and ready to scale to large data without any additional training thanks to the advantages of deep unfolding. Furthermore, we extend the proposed model for color VCS to perform joint reconstruction and demosaicing. Experimental results demonstrate that our 2-stage model has also achieved SOTA on color VCS reconstruction, leading to a >2.3dB gain in PSNR over the previous SOTA algorithm based on plug-and-play framework, meanwhile speeds up the reconstruction by >17 times. In addition, we have found that our network is also flexible to the mask modulation and scale size for color VCS reconstruction so that a single trained network can be applied to different hardware systems. The code and models will be released to the public.