While a lot of work is dedicated to self-supervised learning, most of it deals with 2D images of natural scenes and objects. In this paper, we focus on \textit{volumetric} images obtained by X-ray computed tomography (CT). We describe two pretext training tasks designed to take into account the specific properties of volumetric data. We propose two ways to transfer a trained network to the downstream task of object localization without any manual markup. Despite its simplicity, the proposed method proves applicable to practical tasks of object localization and data reduction.