We investigate the geometry of the empirical risk minimization problem for $k$-layer neural networks. We provide examples showing that, for the classical activation functions $\sigma(x)= 1/\bigl(1 + \exp(-x)\bigr)$ and $\sigma(x)=\tanh(x)$, there exists a positive-measure subset of target functions that have no best approximation by neural networks with a fixed number of layers. In addition, we study in detail the properties of shallow networks, classifying the cases in which a best $k$-layer neural network approximation always exists or fails to exist for the ReLU activation $\sigma(x)=\max(0,x)$. We also determine the dimensions of shallow ReLU-activated networks.
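
As a brief sketch of the setup (the notation $\nu_\sigma$, $\Theta$, and the norm are assumptions for illustration, not taken from the source), the non-existence statement can be read as the infimum
\[
  \inf_{\theta \in \Theta} \bigl\| f - \nu_\sigma(\theta) \bigr\|,
  \qquad
  \nu_\sigma(\theta) \;=\; A_k \circ \sigma \circ A_{k-1} \circ \cdots \circ \sigma \circ A_1 ,
\]
not being attained, where $\theta \in \Theta$ collects the affine maps $A_1,\dots,A_k$ of a $k$-layer network with activation $\sigma$ and $f$ is the target function; the claim is that this failure occurs on a positive-measure set of targets $f$.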