Identifying and segmenting moving objects from a moving monocular camera is difficult when there is unknown camera motion, different types of object motions and complex scene structures. To tackle these challenges, we take advantage of two popular branches of monocular motion segmentation approaches: point trajectory based and optical flow based methods, by synergistically fusing these two highly complementary motion cues at object level. By doing this, we are able to model various complex object motions in different scene structures at once, which has not been achieved by existing methods. We first obtain object-specific point trajectories and optical flow mask for each common object in the video, by leveraging the recent foundational models in object recognition, segmentation and tracking. We then construct two robust affinity matrices representing the pairwise object motion affinities throughout the whole video using epipolar geometry and the motion information provided by optical flow. Finally, co-regularized multi-view spectral clustering is used to fuse the two affinity matrices and obtain the final clustering. Our method shows state-of-the-art performance on the KT3DMoSeg dataset, which contains complex motions and scene structures. Being able to identify moving objects allows us to remove them for map building when using visual SLAM or SFM.