Can we distinguish between two wireless transmitters sending exactly the same message, using the same protocol? The opportunity for doing so arises due to subtle nonlinear variations across transmitters, even those made by the same manufacturer. Since these effects are difficult to model explicitly, we investigate learning device fingerprints using complex-valued deep neural networks (DNNs) that take as input the complex baseband signal at the receiver. Such fingerprints should be robust to ID spoofing, and to distribution shifts across days and locations due to clock drift and variations in the wireless channel. In this paper, we point out that, unless proactively discouraged from doing so, DNNs learn these strong confounding features rather than the subtle nonlinear characteristics that are the basis for stable signatures. Thus, a network trained on data collected during one day performs poorly on a different day, and networks allowed access to post-preamble information rely on easily-spoofed ID fields. We propose and evaluate strategies, based on augmentation and estimation, to promote generalization across realizations of these confounding factors, using data from WiFi and ADS-B protocols. We conclude that, while DNN training has the advantage of not requiring explicit signal models, significant modeling insights are required to focus the learning on the effects we wish to capture.