We present a neural network architecture for jointly learning user locations and environment mapping up to isometry, in an unsupervised way, from channel state information (CSI) values with no location information. The model is based on an encoder-decoder architecture. The encoder network maps CSI values to the user location. The decoder network models the physics of propagation by parametrizing the environment using virtual anchors. It aims at reconstructing, from the encoder output and virtual anchor location, the set of time of flights (ToFs) that are extracted from CSI using super-resolution methods. The neural network task is set prediction and is accordingly trained end-to-end. The proposed model learns an interpretable latent, i.e., user location, by just enforcing a physics-based decoder. It is shown that the proposed model achieves sub-meter accuracy on synthetic ray tracing based datasets with single anchor SISO setup while recovering the environment map up to 4cm median error in a 2D environment and 15cm in a 3D environment