Human stereo vision uses occlusions as a prominent cue, sometimes the only cue, to localize object boundaries and recover depth relationships between the surfaces adjacent to these boundaries. However, many modern computer vision systems treat occlusions as a secondary cue or ignore them as outliers, leading to imprecise boundaries, especially when matching cues are weak. In this work, we introduce a layered approach to stereo that explicitly incorporates occlusions. Unlike previous layer-based methods, our model is cooperative: it involves local computations among units that have overlapping receptive fields at multiple scales, with sparse lateral and vertical connections between these units. Focusing on bi-layer scenes, we demonstrate our model's ability to localize boundaries between figure and ground in a wide variety of cases, including images from the Middlebury and Falling Things datasets, as well as perceptual stimuli that lack matching cues and have yet to be well explained by previous computational stereo systems. Our model suggests new directions for creating cooperative stereo systems that incorporate occlusion cues in a human-like manner.