We identify and solve a hidden-layer model that is analytically tractable at any finite width and whose limits exhibit both the kernel phase and the feature-learning phase. We analyze the phase diagram of this model in all possible limits of common hyperparameters, including the width, the layer-wise learning rates, the scale of the output, and the scale of the initialization. We apply our results to analyze how and when feature learning happens in both infinite- and finite-width models. Three prototype mechanisms of feature learning are identified: (1) learning by alignment, (2) learning by disalignment, and (3) learning by rescaling. In sharp contrast, none of these mechanisms is present when the model is in the kernel regime. This discovery explains why a large initialization scale often leads to worse performance. Lastly, we empirically demonstrate that the phenomena we identify in this analytical model also appear in nonlinear networks trained on real tasks.
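To make the kernel-versus-feature-learning dichotomy concrete, the following is a minimal numerical sketch, not the paper's exact model or parameterization: a one-hidden-layer linear network f(x) = v^T W x trained by full-batch gradient descent on a linear teacher. The width, the initialization scale `sigma`, the learning-rate scaling `0.5 / (width * sigma**2)`, and the zero initialization of the output weights are all illustrative assumptions chosen for stability, not quantities from the paper. The diagnostics track how much the hidden-layer weights move from initialization and how strongly their top singular direction aligns with the teacher; a small initialization scale should show large movement and strong alignment (feature learning by alignment and rescaling), while a large one should leave the features essentially frozen (kernel regime).

```python
import numpy as np

rng = np.random.default_rng(0)

def train(sigma, width=512, d=20, n=100, steps=2000):
    """Train f(x) = v^T W x by gradient descent; return feature-learning diagnostics."""
    beta = rng.normal(size=d)
    beta /= np.linalg.norm(beta)          # unit teacher direction (illustrative task)
    X = rng.normal(size=(n, d))
    y = X @ beta                          # linear regression targets

    W = sigma * rng.normal(size=(width, d)) / np.sqrt(d)  # hidden-layer weights
    v = np.zeros(width)                   # zero output weights => f = 0 at init
    W0 = W.copy()

    lr = 0.5 / (width * sigma**2)         # heuristic scaling to keep GD stable for all sigma

    for _ in range(steps):
        h = X @ W.T                       # hidden representations, shape (n, width)
        err = h @ v - y                   # residuals of f(x) = v^T W x
        grad_v = h.T @ err / n
        grad_W = np.outer(v, X.T @ err / n)
        v -= lr * grad_v
        W -= lr * grad_W

    # Relative movement of the features from their initialization.
    movement = np.linalg.norm(W - W0) / np.linalg.norm(W0)
    # Alignment of the top right singular vector of W with the teacher direction.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    alignment = abs(Vt[0] @ beta)
    return movement, alignment

for sigma in (0.1, 1.0, 3.0):
    m, a = train(sigma)
    print(f"sigma={sigma:4.1f}  relative W movement={m:.3f}  top-SV alignment={a:.3f}")
```

In this sketch, small `sigma` forces the weights to grow and reorganize toward the teacher direction (rescaling plus alignment), whereas large `sigma` lets the random initial features interpolate the data with only a tiny perturbation, mimicking the lazy or kernel limit.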