In this work, we investigate the performance of a joint sensing and communication (JSC) network consisting of multiple base stations (BSs) that cooperate through a fusion center (FC) to exchange information about the sensed environment while concurrently maintaining communication links with a set of user equipments (UEs). Each BS operates as a monostatic radar, enabling a comprehensive scan of the monitored area and generating range-angle maps that convey information about the positions of heterogeneous objects. The acquired maps are subsequently fused at the FC. Then, a convolutional neural network (CNN) is employed to infer the category of each target, e.g., pedestrian or vehicle, and this information is exploited by an adaptive clustering algorithm to group detections originating from the same target more effectively. Finally, two multi-target tracking algorithms, the probability hypothesis density (PHD) filter and the multi-Bernoulli mixture (MBM) filter, are applied to estimate the target states. Numerical results demonstrate that the proposed framework provides remarkable sensing performance, achieving an optimal sub-pattern assignment (OSPA) error below 60 cm, while maintaining communication services to the UEs with a capacity reduction on the order of 10% to 20%. The impact of the number of BSs engaged in sensing is also examined, and we show that, in the considered case study, three BSs are sufficient to keep the localization error below 1 m.
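
For reference, the OSPA metric used to quantify the sensing accuracy can be written in its standard form of order $p$ with cutoff $c$, assuming here a Euclidean base distance: given an estimated target set $\hat{\mathcal{X}} = \{\hat{\mathbf{x}}_1, \dots, \hat{\mathbf{x}}_m\}$ and a true target set $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ with $m \le n$,
\[
\bar{d}_p^{(c)}\big(\hat{\mathcal{X}}, \mathcal{X}\big) = \left( \frac{1}{n} \left( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} \min\big(c, \|\hat{\mathbf{x}}_i - \mathbf{x}_{\pi(i)}\|\big)^{p} + c^{p}\,(n-m) \right) \right)^{1/p},
\]
where $\Pi_n$ denotes the set of permutations of $\{1, \dots, n\}$; the case $m > n$ is handled by swapping the two sets. The OSPA value thus accounts jointly for localization errors and cardinality (missed/false target) errors.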