Image datasets have been growing steadily in size, which undermines the feasibility and efficiency of large-scale 3D reconstruction methods. In this paper, a novel approach for scaling Multi-View Stereo (MVS) algorithms up to arbitrarily large collections of images is proposed. Specifically, the problem of reconstructing the 3D model of an entire city is targeted, starting from a set of videos acquired by a moving vehicle equipped with several high-resolution cameras. First, the presented method exploits the approximately uniform distribution of poses and geometry to build a set of overlapping clusters. Then, an Integer Linear Programming (ILP) problem is formulated for each cluster to select an optimal subset of views that guarantees both visibility and matchability. Finally, a local point cloud is computed independently for each cluster, and the resulting point clouds are merged. Since clustering is independent of pairwise visibility information, the proposed algorithm runs faster than existing methods and allows for massive parallelization. Extensive tests on urban data are discussed to demonstrate the effectiveness and scalability of this approach.
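To make the per-cluster view selection step concrete, the following minimal sketch solves a toy version of it. The abstract does not state the actual ILP, so everything here is an assumption: the function name `select_views`, the `visibility` input, and the constraint that every point must be seen by at least `min_views` selected views are hypothetical, matchability is not modeled, and exhaustive enumeration stands in for a real ILP solver (it is exact but only practical for a handful of views).

```python
from itertools import combinations


def select_views(visibility, min_views=2):
    """Return a smallest set of views such that every point is seen by at
    least `min_views` selected views, or None if no such set exists.

    visibility: dict mapping a view id to the set of point ids it sees
    (a hypothetical input format, not taken from the paper).
    """
    views = sorted(visibility)
    points = set().union(*visibility.values())
    # Enumerate subsets by increasing size, so the first feasible subset
    # found has minimum cardinality (the assumed ILP objective).
    for size in range(len(views) + 1):
        for subset in combinations(views, size):
            if all(
                sum(p in visibility[v] for v in subset) >= min_views
                for p in points
            ):
                return set(subset)
    return None


# Toy cluster: four views observing three points.
cluster = {
    "v0": {1, 2},
    "v1": {1, 2, 3},
    "v2": {2, 3},
    "v3": {3},
}
selected = select_views(cluster, min_views=2)
print(sorted(selected))
```

Because each cluster is processed without reference to the others, a call such as `select_views` could in principle be dispatched once per cluster in parallel, which is the property the abstract relies on for scalability.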