We develop a real-time state estimation system to recover the pose and contact formation of an object relative to its environment. We focus on the application of inserting an object picked by a suction cup into a tight space, an enabling technology for robotic packaging. We propose a framework that fuses force and visual sensing for improved accuracy and robustness. Visual sensing is versatile and non-intrusive, but suffers from occlusions and limited accuracy, especially for tasks involving contact. Tactile sensing is local, but provides accuracy and robustness to occlusions. The proposed fusion algorithm is based on iSAM, an online optimization technique, which we use to incorporate kinematic measurements from the robot, the contact geometry of the object and the container, and visual tracking. We generalize previous results in planar settings to a 3D task with more complex contact interactions. A key challenge in using force sensing is that we do not observe contact point locations directly. We propose a data-driven method to infer the contact formation, which the state estimator then uses in real time. We demonstrate and evaluate the algorithm in a setup instrumented to provide ground truth.