We address the localization of robots in a multi-MAV system where external infrastructure like GPS or motion capture system may not be available. We introduce a vision plus IMU system for localization that uses relative distance and bearing measurements. Our approach lends itself to implementation on platforms with several constraints on size, weight, and payload (SWaP). Particularly, our framework fuses the odometry with anonymous, visual-based robot-to-robot detection to estimate all robot poses in one common frame, addressing three main challenges: 1) initial configuration of the robot team is unknown, 2) data association between detection and robot targets is unknown, and 3) vision-based detection yields false negatives, false positives, inaccurate, noisy bearing and distance measurements of other robots. Our approach extends the Coupled Probabilistic Data Association Filter (CPDAF) to cope with nonlinear measurements. We demonstrate the superior performance of our approach over a simple VIO-based method in a simulation using measurement models obtained from real data. We also show how on-board sensing, estimation and control can be used for formation flight.